2016-07-13 00:04:42 -0600 | commented question | Does opencv gpu functions scale up depending upon the number of cores on the gpu? Hi, I have posted the nvprof output of my code. Could you suggest where the bottleneck is? Almost 92% of the time is spent in cudaMallocPitch. How can I overcome this bottleneck? |
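If the cudaMallocPitch time really comes from buffers being allocated again for every frame (an assumption; the nvprof output alone does not prove it, and the first CUDA call in a process also absorbs context start-up cost), the usual remedy is to allocate once and reuse. The sketch below is plain C++ with a hypothetical `Buffer` type that only counts allocations. It is not OpenCV code. It mimics the `create()` convention of reallocating only when the requested size changes.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a device buffer: it only counts allocations.
struct Buffer {
    std::size_t rows = 0, cols = 0;
    std::vector<float> data;
    static int allocations;

    // Reallocate only when the requested shape differs from the current one.
    void create(std::size_t r, std::size_t c) {
        if (r == rows && c == cols) return;  // same shape: keep existing storage
        rows = r;
        cols = c;
        data.assign(r * c, 0.0f);
        ++allocations;
    }
};
int Buffer::allocations = 0;

// Pattern A: a fresh buffer inside the loop, so every frame pays for an allocation.
int allocate_per_frame(int frames, std::size_t rows, std::size_t cols) {
    Buffer::allocations = 0;
    for (int i = 0; i < frames; ++i) {
        Buffer flow;  // destroyed at the end of each iteration
        flow.create(rows, cols);
    }
    return Buffer::allocations;
}

// Pattern B: one buffer kept alive across frames, so only the first frame allocates.
int allocate_once(int frames, std::size_t rows, std::size_t cols) {
    Buffer::allocations = 0;
    Buffer flow;  // lives for the whole loop
    for (int i = 0; i < frames; ++i) {
        flow.create(rows, cols);
    }
    return Buffer::allocations;
}
```

For 100 frames of 480 x 640, `allocate_per_frame` performs 100 allocations and `allocate_once` performs 1. In real code the same idea would mean declaring the GPU matrices and the optical flow object outside the frame loop, so that their internal buffers survive from one frame to the next.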
2016-07-13 00:01:45 -0600 | received badge | ● Editor (source) |
2016-07-12 12:57:46 -0600 | asked a question | Does opencv gpu functions scale up depending upon the number of cores on the gpu? I am new to CUDA and am having a problem running the GPU version of Farneback optical flow. I have been using the GPU version of Farneback optical flow provided in OpenCV for an action recognition application on videos. On a sample video, it took around 12 seconds to compute the optical flow on an NVIDIA GeForce GPU with 96 cores. When I run the same code on a more advanced GPU (Titan X), which has around 3072 cores, it takes almost the same time, and I don't know why. Is it possible that my code has flaws, or that the function itself uses a limited number of cores regardless of the GPU? Can I control the number of threads a function launches on the GPU, so that I can set it manually to speed up the code on a more advanced GPU? The CUDA file for Farneback optical flow contains lines such as dim3 block(128), which set the block size. Would my program run faster if I increased the block size from 128 to 256? I tried it, but I get an error: OpenCV Error: Gpu API call (invalid device function) in call, file /home/aditya-vision/opencv-2.4.9/modules/gpu/include/opencv2/gpu/device/detail/transform_detail.hpp. Kindly help me with your suggestions. I am also posting the nvprof output, which I used to find the bottlenecks in the code. Kindly give your suggestions based on that. |
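On the block size question: in CUDA, a value like dim3 block(128) sets the number of threads per block, not a cap on how many cores are used. The number of blocks is normally derived from the data size by ceiling division, so the total thread count follows the image, whatever the GPU. The sketch below is plain C++ that only does this arithmetic. The helper names `div_up` and `launched_threads` and the 640-element row are illustrative assumptions, not taken from the OpenCV source.

```cpp
// Ceiling division: the usual way to size a 1-D CUDA grid.
constexpr int div_up(int total, int per_block) {
    return (total + per_block - 1) / per_block;
}

// Threads actually launched for `elements` work items at a given block size.
// Threads beyond `elements` in the last block are idle padding.
constexpr int launched_threads(int elements, int block_size) {
    return div_up(elements, block_size) * block_size;
}
```

For a 640-element row, a block size of 128 gives `div_up(640, 128) == 5` blocks and 640 threads, while 256 gives `div_up(640, 256) == 3` blocks and 768 threads, of which 128 do nothing. Doubling the block size therefore does not double the useful work in flight. It only regroups roughly the same threads into fewer blocks.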