Why are large kernels on CUDA slower than on CPU?

asked 2018-11-26 06:24:32 -0500

es

updated 2018-11-27 09:01:13 -0500

I have modern *buntu with stock drivers.

My hardware is an i7-7700 and an Nvidia 1080 OC.

I tried OpenCV 3.4.3 and 4.0.0.

I am creating my filter like this:

cv::Mat k = KernelMaker::muKernel3(p, a); // 100x100
// source and destination types are both CV_32F; anchor (-1, -1) means the kernel center
filter = cv::cuda::createLinearFilter(
    CV_32F, CV_32F, k, cv::Point(-1, -1), cv::BORDER_REFLECT);

and later using it like this:

cv::cuda::GpuMat gbw; //20 megapixels
cv::cuda::GpuMat stmp(gbw.size(), CV_32FC1, cv::Scalar(0));
filter->apply(gbw, stmp, stream);

But it's many times slower than the CPU implementation with filter2D.

If I run it with a small 3x3 dummy kernel, the GPU implementation gets much faster. So it's not the memory copy or anything like that, I think.
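To put rough numbers on that observation, here is a minimal sketch of how the arithmetic scales with kernel size. It assumes the GPU filter does a direct (non-FFT) convolution, which nothing in this post confirms, and it rounds the image to exactly 20,000,000 pixels. `directConvMacs` and the constants are hypothetical names used only for illustration.

```cpp
#include <cstdint>

// Multiply-adds needed by a direct 2D convolution:
// one per kernel tap for every output pixel.
// ASSUMPTION: direct convolution; an FFT-based filter would scale differently.
constexpr std::uint64_t directConvMacs(std::uint64_t pixels,
                                       std::uint64_t kernelWidth,
                                       std::uint64_t kernelHeight) {
    return pixels * kernelWidth * kernelHeight;
}

// ASSUMPTION: "20 megapixels" rounded to exactly 20,000,000 pixels.
constexpr std::uint64_t kPixels = 20000000ULL;

constexpr std::uint64_t kMacsSmall = directConvMacs(kPixels, 3, 3);     // 3x3 dummy kernel
constexpr std::uint64_t kMacsLarge = directConvMacs(kPixels, 100, 100); // 100x100 kernel

// Checked at compile time, so the figures quoted below cannot drift from the code.
static_assert(kMacsSmall == 180000000ULL, "3x3: 1.8e8 multiply-adds");
static_assert(kMacsLarge == 200000000000ULL, "100x100: 2e11 multiply-adds");
static_assert(kMacsLarge / kMacsSmall == 1111, "roughly 1100x more arithmetic");
```

Under those assumptions the 100x100 kernel needs about 2e11 multiply-adds per image versus 1.8e8 for the 3x3 one, roughly 1100 times more work for the same amount of data transferred. That would fit the slowdown coming from the filtering itself rather than from memory copies.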

Can I get my large kernel to run faster?


Comments

Maybe use Nsight to profile the program and see which part it spends the most time in.

blues (2018-11-27 20:03:28 -0500)