
CUDA and OpenCV performance

asked 2018-07-11 03:04:41 -0600 by darioc85, updated 2018-07-11 03:10:26 -0600

I have a fairly big project with several image processing parts implemented with OpenCV 3. In general, I notice that the CPU code is faster than the parts programmed with cv::cuda functions. For example, consider these two portions of code:

    cv::GaussianBlur( image, image, cv::Size(3,3), 0,0);


    cv::cuda::GpuMat cuda_image;
    cuda_image.upload(image);  // host -> device copy
    cv::Ptr<cv::cuda::Filter> filter = cv::cuda::createGaussianFilter(cuda_image.type(),
         cuda_image.type(), cv::Size(3,3), 0, 0);
    filter->apply(cuda_image, cuda_image);
    image = cv::Mat(cuda_image);  // device -> host copy

it turns out that the second one is much slower. Please note that I ran this portion in a long loop and averaged the time, ignoring the first iteration (which is even slower).

I understand that in this particular case the transfer overhead could exceed the actual computation time (the image in this example is 1280x720), but the same happens for every cv::cuda function I use, even ones like solvePnPRansac that do not process images directly. Note that I tested on a workstation with a Quadro M1200 (and a powerful CPU), but the same occurs on a Tegra TX2, where CPU capabilities are very limited. So far, the biggest improvement has come from using OpenMP (mostly on the workstation, of course).

Given these results, I have some doubts about how OpenCV compiled with CUDA support actually works. Could it be that, since I compiled OpenCV with CUDA support, calling methods like cv::GaussianBlur automatically uses the GPU, more efficiently than the second portion of code I posted? If so, what are the guidelines and best practices? Or should I treat this as purely a transfer-overhead problem?

Thank you




CUDA will work fine if you can keep your data on the GPU as long as possible.

If you only have a single operation, the cost of uploading and downloading will outweigh any gains.

I have a quite big project with several image processing parts

so, what are your "other" operations? You'd want to upload your image once, then use CUDA processing all the way down, and only download it again at the end of the pipeline.

berak ( 2018-07-11 03:36:28 -0600 )

Yes, basically the difference between GPU and CPU.

holger ( 2018-07-11 03:40:25 -0600 )

I understand, it is as I suspected. Unfortunately, many of my other operations are not pure image processing. Even within the image processing part, I sometimes cannot use cv::cuda because no equivalent implementation exists (cv::findContours, cv::text::ERFilter, cv::text::erGrouping and so on), which means I would have to download and upload the image several times. My remaining doubt is what happens if I compile OpenCV with CUDA support but do not call cv::cuda methods. Are those functions always executed purely on the CPU, 100% of the time, or can OpenCV sometimes send the work to the GPU automatically, depending on the internal implementation of the function in question?

darioc85 ( 2018-07-11 03:56:25 -0600 )

I hope it will fall back to using OpenCL (which also includes GPU support) in this case - but maybe @berak or someone else with better knowledge of the source can comment on this?

holger ( 2018-07-11 04:43:09 -0600 )

Yes, it would be great to get that information. An automatic fallback to an OpenCL or CUDA implementation in that case would explain everything.

darioc85 ( 2018-07-12 02:23:27 -0600 )

1 answer


answered 2018-07-15 04:44:53 -0600

updated 2018-07-15 05:08:19 -0600

As berak pointed out, you need to keep your data on the GPU. Performing some operations on the CPU and some on the GPU, without asynchronous operations and a well-thought-through processing pipeline, will probably be slower even with a top-of-the-range GPU.

If your GPU is the M1200, you have only 80.2 GB/s of memory bandwidth and 1399 GFLOPS, which is pretty slow for a modern GPU. A mid-range desktop card like the GTX 1060 has over twice the bandwidth and nearly three times the processing power.

Check out this performance comparison to get an idea of which operations benefit most from GPU acceleration, and note that it excludes the costly transfers between host and device. According to that spreadsheet, with a better card (a 1060) and without the transfer overhead, you might see a 2x speedup for the Gaussian filter operation.



Seen: 5,248 times
