Aditya Vora's profile - activity


2016-07-13 00:04:42 -0600	commented question	Does opencv gpu functions scale up depending upon the number of cores on the gpu? Hi, I have posted the nvprof output of my code. Could you suggest where the bottleneck is? Almost 92% of the time is spent in cudaMallocPitch. How can I overcome this bottleneck?
2016-07-13 00:01:45 -0600	received badge	● Editor (source)
2016-07-12 12:57:46 -0600	asked a question	Does opencv gpu functions scale up depending upon the number of cores on the gpu? I am new to CUDA and am having a problem running the GPU version of Farneback optical flow. I have been using the GPU implementation of Farneback optical flow provided in OpenCV for an action recognition application on videos. Running it on a sample video took around 12 seconds on an NVIDIA GeForce GPU with 96 cores. When I run the same code on a more advanced GPU (Titan X, with around 3072 cores), it takes almost the same time to compute the optical flow, and I don't know why.

Is my code flawed, or does the function itself use a limited number of cores regardless of the GPU? Can I control the number of threads the function launches, so that I can set it manually and speed up the code on a more advanced GPU? The CUDA file for Farneback optical flow contains lines such as dim3 block(128), which set the block size. Would my program run faster if I increased the block size from 128 to 256? I tried it, but I get an error: OpenCV Error: Gpu API call (invalid device function) in call, file /home/aditya-vision/opencv-2.4.9/modules/gpu/include/opencv2/gpu/device/detail/transform_detail.hpp.

Any suggestions would be appreciated. I am also posting the nvprof output that I used to look for bottlenecks in the code.

==10935== NVPROF is profiling process 10935, command: ./farneback_flow resizeVideo.avi
It took 14 second(s).
==10935== Profiling application: ./farneback_flow resizeVideo.avi
==10935== Profiling result:
Time(%)  Time      Calls  Avg       Min       Max       Name
62.38%   543.05ms  9870   55.020us  23.265us  263.63us  cv::gpu::device::optflow_farneback::boxFilter5(int, int, cv::gpu::PtrStep<float>, int, float, cv::gpu::PtrStep<float>)
17.11%   148.94ms  9870   15.090us  5.9840us  58.978us  cv::gpu::device::optflow_farneback::updateMatrices(int, int, cv::gpu::PtrStep<float>, cv::gpu::PtrStep<float>, cv::gpu::PtrStep<float>, cv::gpu::PtrStep<float>, cv::gpu::PtrStep<float>)
5.12%    44.550ms  1974   22.568us  15.328us  41.090us  void cv::gpu::device::optflow_farneback::gaussianBlur<cv::gpu::device::BrdReflect101<float>>(int, int, cv::gpu::PtrStep<float>, int, float, cv::gpu::PtrStep)
5.07%    44.137ms  9870   4.4710us  2.7200us  16.225us  cv::gpu::device::optflow_farneback::updateFlow(int, int, cv::gpu::PtrStep<float>, cv::gpu::PtrStep<float>, cv::gpu::PtrStep<float>)
3.72%    32.371ms  658    49.196us  48.898us  53.090us  [CUDA memcpy DtoH]
3.05%    26.517ms  1974   13.433us  5.9200us  31.265us  void cv::gpu::device::optflow_farneback::polynomialExpansion<int=5>(int, int, cv::gpu::PtrStep<float ...
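
One way to read the profile above is to compare the time the GPU actually spent on the listed activities with the wall-clock time the program reported. The short Python sketch below does that arithmetic using only the rows visible in the nvprof output. It assumes that "It took 14 second(s)." is the wall-clock time of the whole run (under the profiler); the function and variable names are made up for illustration, and because the listing is truncated the sum is only a lower bound on GPU time.

```python
# GPU activity rows copied from the nvprof output above: (short name, total time in ms).
# The listing is truncated, so this is a lower bound on the total GPU time.
GPU_ACTIVITY_MS = [
    ("boxFilter5", 543.05),
    ("updateMatrices", 148.94),
    ("gaussianBlur", 44.550),
    ("updateFlow", 44.137),
    ("[CUDA memcpy DtoH]", 32.371),
    ("polynomialExpansion", 26.517),
]

# Assumption: the "It took 14 second(s)." line is the wall-clock time of the
# whole run, including video decoding, memory allocation and host-side work.
WALL_CLOCK_S = 14.0


def gpu_busy_ms(rows):
    """Sum the total time of all listed GPU activities, in milliseconds."""
    return sum(ms for _, ms in rows)


def gpu_share(rows, wall_clock_s):
    """Fraction of the wall-clock time covered by the listed GPU activities."""
    return gpu_busy_ms(rows) / (wall_clock_s * 1000.0)


if __name__ == "__main__":
    busy = gpu_busy_ms(GPU_ACTIVITY_MS)
    share = gpu_share(GPU_ACTIVITY_MS, WALL_CLOCK_S)
    print(f"Listed GPU activity: {busy:.1f} ms")
    print(f"Share of wall-clock time: {share:.1%}")
```

Under these assumptions the listed kernels and copies add up to well under one second of a 14-second run, so a GPU with more cores could only shrink that small slice. The rest of the time is spent outside the kernels, which would be consistent with the cudaMallocPitch figure mentioned in the follow-up comment, and changing dim3 block(128) would not be expected to affect it.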