
How does OpenCV with CUDA work?

asked 2019-11-19 08:44:23 -0500

JordiGC


I am new to using OpenCV with CUDA and I still do not understand how it works.

I am using a Jetson Nano. In several tests on 1-2 images, the cv::cuda version ran 2-3 times slower than the CPU version. I am processing Full HD images, and the functions I am using are createGoodFeaturesToTrackDetector and SparsePyrLKOpticalFlow.

Will the performance improve when processing more than 1-2 images, for example a Full HD video? Should I time every call to these functions, or the overall run after the video finishes?

Does the performance of these OpenCV functions decrease if code running on the CPU consumes their output?

Is there a page where I can find more examples?

Thank you very much in advance.



To better understand your aim, could you provide a code example, please?

murkoc (2019-11-20 03:24:46 -0500)

1 answer


answered 2019-11-20 03:37:08 -0500

updated 2019-11-20 03:40:55 -0500

I don't have the same hardware or implementation, but on my hardware (CPU i7-8700, Mobile GPU RTX 2080) the CUDA implementation (here) of the OpenCV CPU example was 50% faster on the small-resolution video. I would expect the gain to be greater on larger video, but this will depend on how the algorithm is implemented. If you are timing the same function calls with high-resolution timers on the Jetson and the CUDA implementation is 2-3 times slower, then I would guess that the best you can do with any implementation is to match the CPU's performance on the GPU.
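As a rough illustration of per-call timing with a high-resolution timer, here is a minimal sketch using only the C++ standard library. The `time_calls` helper and the `TimingStats` struct are hypothetical names made up for this example, not part of OpenCV. The sketch keeps the first call separate from the measured calls, on the assumption that the first CUDA call in a process may include one-time setup cost that would distort a 1-2 image test.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Per-call timing summary, in milliseconds.
struct TimingStats {
    double first_ms = 0.0;  // very first call (may include one-time setup cost)
    double mean_ms = 0.0;   // mean over the measured calls only
    double min_ms = 0.0;    // fastest measured call
    double max_ms = 0.0;    // slowest measured call
    std::size_t runs = 0;   // number of measured calls
};

// Hypothetical helper (not an OpenCV API): calls `fn` warmup + runs times,
// timing each call separately. Only calls after the warm-up are included in
// mean/min/max. `fn` must block until its work is really finished; otherwise
// only the launch of asynchronous work is timed.
inline TimingStats time_calls(const std::function<void()>& fn,
                              std::size_t warmup, std::size_t runs) {
    assert(fn && "fn must be a valid callable");
    using clock = std::chrono::steady_clock;  // monotonic clock

    TimingStats stats;
    std::vector<double> samples;
    samples.reserve(runs);

    for (std::size_t i = 0; i < warmup + runs; ++i) {
        const auto t0 = clock::now();
        fn();
        const auto t1 = clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (i == 0) stats.first_ms = ms;           // keep the first call apart
        if (i >= warmup) samples.push_back(ms);    // measured calls only
    }

    if (!samples.empty()) {
        stats.runs = samples.size();
        stats.mean_ms = std::accumulate(samples.begin(), samples.end(), 0.0) /
                        static_cast<double>(samples.size());
        stats.min_ms = *std::min_element(samples.begin(), samples.end());
        stats.max_ms = *std::max_element(samples.begin(), samples.end());
    }
    return stats;
}
```

To compare the two versions, the same helper would wrap the CPU call and the cv::cuda call in turn. If the GPU work is submitted on a `cv::cuda::Stream`, the callable should also wait on that stream (for example with `waitForCompletion()`) before returning, so that the measurement covers the computation and not just the launch.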

Regarding your other questions:

  1. If you are timing only the execution of SparsePyrLKOpticalFlow.calc(), then processing a video won't make any difference. If you are timing the entire pipeline and using CUDA streams, then it could make a difference, depending on how SparsePyrLKOpticalFlow.calc() is implemented. For example, if the algorithm is iterative and each iteration depends on the previous one, there will most likely be fixed device synchronization points that stall execution on every iteration, even if you use CUDA streams.

  2. If the calculations inside SparsePyrLKOpticalFlow.calc() can be placed in separate streams without any forced synchronization (as described above), then host (CPU) and device (GPU) computation can be overlapped, which avoids performance degradation when the host relies on the output from the device.

  3. If the above regarding CUDA streams is confusing, then "Accelerating OpenCV with CUDA streams in Python" may be useful. Although the example is in Python, the concepts, apart from the pre-allocation of arrays, remain the same.



