Aditya Vora's profile - activity

2016-07-13 00:04:42 -0600 commented question Does opencv gpu functions scale up depending upon the number of cores on the gpu?

Hi, I have posted the nvprof output of my code. Can you suggest where the bottleneck is? Almost 92% of the time is spent in cudaMallocPitch. Do you have any idea how I can overcome this bottleneck?

2016-07-13 00:01:45 -0600 received badge  Editor (source)
2016-07-12 12:57:46 -0600 asked a question Does opencv gpu functions scale up depending upon the number of cores on the gpu?

I am new to CUDA and am having a problem running the GPU version of Farneback optical flow. I have been using the GPU implementation of Farneback optical flow provided in OpenCV for an action-recognition application on videos. I ran it on a sample video, and it took around 12 seconds to compute the optical flow on an NVIDIA GeForce GPU with 96 cores.

However, when I run the same code on a more advanced GPU (Titan X), which has around 3072 cores, it takes almost the same time to compute the optical flow as it did on the NVIDIA GeForce GPU. Is it possible that my code has some flaw, or that the function itself uses a limited number of cores regardless of the GPU? Can I control the number of threads a function launches on the GPU, so that I can set it manually to speed up the code on a more advanced GPU?

The CUDA file for Farneback optical flow has lines such as dim3 block(128), which set the block size for a kernel launch. Would my program run faster if I increased the block size from 128 to 256? I tried it, but I get an error like OpenCV Error: Gpu API call (invalid device function) in call, file /home/aditya-vision/opencv-2.4.9/modules/gpu/include/opencv2/gpu/device/detail/transform_detail.hpp. Kindly help me with your suggestions.

I am also posting the nvprof output, which I used to look for the bottlenecks in the code. Kindly give your suggestions based on it.

==10935== NVPROF is profiling process 10935, command: ./farneback_flow resizeVideo.avi 
It took 14 second(s). 
==10935== Profiling application: ./farneback_flow resizeVideo.avi
==10935== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
62.38%  543.05ms      9870  55.020us  23.265us  263.63us  cv::gpu::device::optflow_farneback::boxFilter5(int, int, cv::gpu::PtrStep<float>, int, float, cv::gpu::PtrStep<float>)
17.11%  148.94ms      9870  15.090us  5.9840us  58.978us  cv::gpu::device::optflow_farneback::updateMatrices(int, int, cv::gpu::PtrStep<float>, cv::gpu::PtrStep<float>,cv::gpu::PtrStep<float>, cv::gpu::PtrStep<float>, cv::gpu::PtrStep<float>)
5.12%  44.550ms      1974  22.568us  15.328us  41.090us  void cv::gpu::device::optflow_farneback::gaussianBlur<cv::gpu::device::BrdReflect101<float>>(int, int, cv::gpu::PtrStep<float>, int, float, cv::gpu::PtrStep)
5.07%  44.137ms      9870  4.4710us  2.7200us  16.225us  cv::gpu::device::optflow_farneback::updateFlow(int, int, cv::gpu::PtrStep<float>, cv::gpu::PtrStep<float>, cv::gpu::PtrStep<float>)
3.72%  32.371ms       658  49.196us  48.898us  53.090us  [CUDA memcpy DtoH]
3.05%  26.517ms      1974  13.433us  5.9200us  31.265us  void cv::gpu::device::optflow_farneback::polynomialExpansion<int=5>(int, int, cv::gpu::PtrStep<float ...