cv::cuda morphology much slower than cv::morphologyEx

asked 2020-04-03 06:57:29 -0500

robin32

updated 2020-04-04 11:40:48 -0500


I am processing the left and right images from a stereo-camera video that I recorded. The processing runs in a loop, so the same operations are applied to every frame. The program runs on an Nvidia Jetson Nano, and to speed it up I want to use CUDA to run the operations on the GPU. The image size is 2208x1242 with 4 channels.

To run the morphological operations on the GPU, I used the following code:

morph_filter_open = cv::cuda::createMorphologyFilter(cv::MORPH_OPEN, img_type, open_kernel);
morph_filter_close = cv::cuda::createMorphologyFilter(cv::MORPH_CLOSE, img_type, close_kernel);


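// Morphological opening on the GPU (erosion followed by dilation).
void Morphology::open(cv::cuda::GpuMat img, cv::cuda::GpuMat out){
    morph_filter_open->apply(img, out);
}

// Morphological closing on the GPU (dilation followed by erosion).
void Morphology::close(cv::cuda::GpuMat img, cv::cuda::GpuMat out){
    morph_filter_close->apply(img, out);
}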

The kernel is a standard cv::Mat, and img_type is just an int with value 0. The functions are called like this:
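For context on the value 0: OpenCV packs bit depth and channel count into a single integer type code, and 0 corresponds to CV_8UC1, while an 8-bit 4-channel image is CV_8UC4 (24). The post does not say whether the frames are converted to a single channel before filtering, so this is only something to check, not a diagnosis. The sketch below reproduces the arithmetic of the CV_MAKETYPE macro with the standard library only; `makeType` and `channelsOf` are hypothetical helper names, not OpenCV functions.

```cpp
#include <cassert>

// Depth codes as defined in OpenCV's core headers (CV_8U == 0, CV_32F == 5).
constexpr int kDepth8U = 0;
constexpr int kDepth32F = 5;
constexpr int kChannelShift = 3;  // mirrors OpenCV's CV_CN_SHIFT

// Hypothetical helper mirroring CV_MAKETYPE(depth, cn):
// the channel count minus one is stored above the three depth bits.
int makeType(int depth, int channels) {
    assert(channels >= 1);
    return depth + ((channels - 1) << kChannelShift);
}

// Hypothetical inverse: number of channels encoded in a small type code.
int channelsOf(int type) {
    return (type >> kChannelShift) + 1;
}
```

With these definitions, makeType(kDepth8U, 1) gives 0 and makeType(kDepth8U, 4) gives 24, so a filter created with type 0 expects single-channel 8-bit input.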


start = std::chrono::high_resolution_clock::now();
morphology.open(img_left_gpu, img_left_gpu);
morphology.open(img_right_gpu, img_right_gpu);
morphology.close(img_left_gpu, img_left_gpu);
morphology.close(img_right_gpu, img_right_gpu);
finish = std::chrono::high_resolution_clock::now();

Opening and closing on the GPU takes about 1.5s for both images, whereas the same operation with cv::morphologyEx on the CPU only takes about 0.07s.

As you can see, I upload the images to the GPU before starting the timer, so my understanding is that the copy operation, although it may also take relatively long, cannot be the problem here. Or am I wrong?
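One thing a single-shot measurement like this cannot separate is one-time cost from steady-state cost: the first call to a GPU routine may include setup work that later calls do not repeat. A minimal sketch of a timing helper that reports the two separately is shown here. It uses only the standard library; `TimingResult` and `timeFirstVsSteady` are hypothetical names, and the operation to measure is passed in as a callable.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

struct TimingResult {
    double first_call_ms;   // first invocation, including any one-time setup
    double steady_mean_ms;  // mean over the subsequent invocations
};

// Hypothetical helper: runs `op` once on its own, then `repeats` more times,
// and reports the first call and the later mean separately.
TimingResult timeFirstVsSteady(const std::function<void()>& op, int repeats) {
    assert(repeats > 0);
    using clock = std::chrono::steady_clock;
    using ms = std::chrono::duration<double, std::milli>;

    const auto t0 = clock::now();
    op();  // first call, timed in isolation
    const auto t1 = clock::now();

    for (int i = 0; i < repeats; ++i) {
        op();  // steady-state calls
    }
    const auto t2 = clock::now();

    return { ms(t1 - t0).count(), ms(t2 - t1).count() / repeats };
}
```

The callable could wrap the two morphology.close calls from the snippet above in a lambda. If the work were submitted on a non-default cv::cuda::Stream, the lambda would also need to wait for the stream to finish, otherwise only the launch time would be measured.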

Thank you for your help!


I made another experiment with the "morphology.cpp" sample from opencv-4.1.0/samples/gpu, which I modified a bit. Running the open and close functions implemented with CUDA in that file takes 0.92s, running an open/close operation with morphologyEx and the same structuring element takes about 0.008s.

I also ran the operation 1000 times (only the two calls of the "apply" function, not the construction of the pointer etc.), which takes about 57s for the CUDA version and 7s for the CPU version. So the copy of the matrices seems to consume a fair amount of time, and with many operations the difference becomes smaller, but the CPU version is still much faster.
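To make the comparison concrete, the figures above can be converted to per-iteration costs. The sketch below is plain arithmetic on the numbers reported in this post, assuming the 1000-iteration run used the same modified sample as the 0.92s measurement; `perIterationMs` and `oneTimeOverheadMs` are hypothetical helper names.

```cpp
#include <cassert>

// Mean cost of one iteration in milliseconds.
double perIterationMs(double total_ms, int iterations) {
    assert(iterations > 0);
    return total_ms / iterations;
}

// Rough estimate of one-time overhead: the single-run time minus the
// steady-state per-iteration time.
double oneTimeOverheadMs(double single_run_ms, double steady_ms) {
    return single_run_ms - steady_ms;
}
```

With the reported totals, perIterationMs(57000.0, 1000) is 57 ms for CUDA and perIterationMs(7000.0, 1000) is 7 ms for the CPU, a factor of roughly 8. Under the stated assumption, oneTimeOverheadMs(920.0, 57.0) is 863 ms, which would mean most of the single-run GPU time is a fixed cost and not the filtering itself.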

Do you have a clue about what the problem is?
Or can someone confirm my observations?



there are programs by Intel, NVIDIA, ... that help you analyze performance of code that runs on their respective devices. try your code on different systems. maybe that "jetson nano" isn't all that fast. check whether you need to upload the kernel too.

crackwitz ( 2020-04-04 13:29:33 -0500 )

Thanks for your suggestions. Unfortunately, I do not have another system with an Nvidia graphics card that I could test the performance on. The GPU of the Jetson Nano is not the fastest, but it should still outperform its CPU by far. I also compared cv::cvtColor with cuda::cvtColor, and the CUDA version is roughly 5x faster (which is not that much, but still), so the problem really seems to be in the morphology function. The GPU load is at about 98% during the kernel execution, which is also good. Maybe there is a problem with the thread indices in the CUDA kernel, e.g. every thread working on the same data?

I also tried using a GpuMat as the kernel, but the function seems to expect an instance of cv::Mat, so cuda::GpuMat does not work.

robin32 ( 2020-04-05 05:49:21 -0500 )

I have now also tried Gaussian blur, and the CUDA version is faster for a sufficiently large kernel. The problem really only seems to occur for the morphology function.

robin32 ( 2020-04-05 06:44:33 -0500 )