I'm trying to perform some image dilation using OpenCV and CUDA. I make two calls to filter->apply(...) back to back, each with a different filter object, a different Mat, and a different stream. They do get executed in different streams, as the attached nvvp profiling info shows, but they run sequentially instead of in parallel. For some reason, this seems to be caused by the CPU waiting on the stream (cudaStreamSynchronize).
Why would OpenCV do that? I'm not waiting on the stream explicitly anywhere. What else could be wrong?
Here's the actual code:
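#include <cuda_runtime.h>          // cudaDeviceSynchronize
#include <opencv2/core.hpp>
#include <opencv2/cudafilters.hpp> // cv::cuda::createMorphologyFilter
#include <opencv2/imgcodecs.hpp>   // cv::imread
#include <opencv2/imgproc.hpp>     // cv::getStructuringElement

// Load both images as single-channel float and upload them to the GPU.
cv::Mat hIm1, hIm2;
cv::imread("/path/im1.png", cv::IMREAD_GRAYSCALE).convertTo(hIm1, CV_32FC1);
cv::imread("/path/im2.png", cv::IMREAD_GRAYSCALE).convertTo(hIm2, CV_32FC1);
cv::cuda::GpuMat dIm1(hIm1);
cv::cuda::GpuMat dIm2(hIm2);
// One stream per image.
cv::cuda::Stream stream1, stream2;
// A separate dilation filter per image, each with a 41x41 elliptical structuring element.
const cv::Mat strel1 = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(41, 41));
cv::Ptr<cv::cuda::Filter> filter1 =
    cv::cuda::createMorphologyFilter(cv::MORPH_DILATE, dIm1.type(), strel1);
const cv::Mat strel2 = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(41, 41));
cv::Ptr<cv::cuda::Filter> filter2 =
    cv::cuda::createMorphologyFilter(cv::MORPH_DILATE, dIm2.type(), strel2);
// Make sure the uploads are finished, then issue both dilations back to back,
// each on its own stream, and wait for all GPU work to complete.
cudaDeviceSynchronize();
filter1->apply(dIm1, dIm1, stream1);
filter2->apply(dIm2, dIm2, stream2);
cudaDeviceSynchronize();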
The images are 512×512. I also tried smaller ones (down to 64×64), but the two calls still ran sequentially.
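To make "sequentially instead of in parallel" concrete, here is a minimal sketch of how the overlap between the two streams could be measured from the profiler timeline. The Interval struct, its field names, and overlapMs are hypothetical helpers of mine and not part of OpenCV or CUDA; the start and end times would have to be read off the nvvp timeline by hand.

```cpp
#include <algorithm>
#include <cassert>

// One kernel execution on the profiler timeline, in milliseconds.
// Hypothetical type: the values are assumed to be copied manually from nvvp.
struct Interval {
    double startMs;
    double endMs;
};

// Returns how long (in ms) the two executions overlap in time.
// A result of 0 means one finished before the other started,
// i.e. the two streams effectively ran one after the other.
double overlapMs(const Interval& a, const Interval& b) {
    assert(a.startMs <= a.endMs && b.startMs <= b.endMs);
    const double start = std::max(a.startMs, b.startMs);
    const double end = std::min(a.endMs, b.endMs);
    return std::max(0.0, end - start);
}
```

In my case the result would be 0 for the two dilation kernels, since the second one only starts after the first has finished.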