cv::cuda morphology much slower than cv::morphologyEx

Hi,

I am processing the two images of a stereo camera, read from a recorded video, in a loop, so the same operations run on every frame. The program runs on an Nvidia Jetson Nano, and to speed it up I want to use CUDA to run the operations on the GPU. The image size is 2208x1242 with 4 channels.

To run the morphological operations on the GPU, I used the following code:

morph_filter_open = cv::cuda::createMorphologyFilter(cv::MORPH_OPEN, img_type, open_kernel);
morph_filter_close = cv::cuda::createMorphologyFilter(cv::MORPH_CLOSE, img_type, close_kernel);

and

// Opening (erosion followed by dilation) on the GPU.
void Morphology::open(cv::cuda::GpuMat img, cv::cuda::GpuMat out){
    morph_filter_open->apply(img, out);
}

// Closing (dilation followed by erosion) on the GPU.
void Morphology::close(cv::cuda::GpuMat img, cv::cuda::GpuMat out){
    morph_filter_close->apply(img, out);
}

The kernel is a standard cv::Mat, and img_type is just an int with value 0 (which is the value of CV_8UC1). The functions are called like this:

img_left_gpu.upload(img_left);
img_right_gpu.upload(img_right);

start = std::chrono::high_resolution_clock::now();
morphology.open(img_left_gpu, img_left_gpu);
morphology.open(img_right_gpu, img_right_gpu);
morphology.close(img_left_gpu, img_left_gpu);
morphology.close(img_right_gpu, img_right_gpu);
finish = std::chrono::high_resolution_clock::now();

Opening and closing on the GPU takes about 1.5s for both images, whereas the same operation with cv::morphologyEx on the CPU only takes about 0.07s.

As you can see, I upload the images to the GPU before starting the timer, so my understanding is that the copy operation, although it may also take relatively long, cannot be the problem here. Or am I wrong?

Thank you for your help!

EDIT:

I ran another experiment with the "morphology.cpp" sample from opencv-4.1.0/samples/gpu, which I modified slightly. Running the open and close functions implemented with CUDA in that file takes 0.92s, whereas an open/close operation with morphologyEx and the same structuring element takes about 0.008s.

I also ran the operation 1000 times (only the two calls of the "apply" function, not the construction of the pointer etc.), which takes about 57s for the CUDA version and 7s for the CPU version, i.e. roughly 57 ms versus 7 ms per iteration. So the copy of the matrices seems to consume a fair amount of time, and with many operations the difference becomes smaller, but the CPU version is still much faster.
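
A repeated measurement of this kind can be sketched as follows. It is a minimal sketch, not the exact code behind the numbers above: `average_ms` is a hypothetical helper (not part of OpenCV), and the untimed warm-up runs are an assumption added so that possible one-time costs of the first GPU calls are not counted.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>

// Hypothetical helper (not part of OpenCV): calls `op` a few times untimed,
// then returns the mean wall-clock time per call in milliseconds.
double average_ms(const std::function<void()>& op,
                  std::size_t warmup_runs, std::size_t timed_runs) {
    // Untimed warm-up calls, so one-time setup work is excluded.
    for (std::size_t i = 0; i < warmup_runs; ++i) op();

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < timed_runs; ++i) op();
    const auto finish = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = finish - start;
    // Avoid dividing by zero when no timed runs were requested.
    return timed_runs == 0
               ? 0.0
               : elapsed.count() / static_cast<double>(timed_runs);
}
```

The callable would wrap only the two `apply` calls, for example `average_ms([&] { morph_filter_open->apply(src, dst); morph_filter_close->apply(dst, dst); }, 10, 1000)`, where `src` and `dst` are placeholder names for already uploaded `cv::cuda::GpuMat` objects. If a non-default `cv::cuda::Stream` is passed to `apply`, the callable would also have to wait for the stream to finish, otherwise only the launch time is measured.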

Do you have a clue about what the problem is?
Or can someone confirm my observations?