Why are large kernels on CUDA slower than on CPU?
I have modern *buntu with stock drivers.
My hardware is an i7-7700 and an Nvidia 1080 OC.
I tried OpenCV 3.4.3 and 4.0.0.
I am creating my filter like this:
cv::Mat k = KernelMaker::muKernel3(p, a); // 100x100
filter = cv::cuda::createLinearFilter(
    CV_32F, CV_32F, k, cv::Point(-1, -1), cv::BORDER_REFLECT);
and later using it like this:
cv::cuda::GpuMat gbw; // 20 megapixels
cv::cuda::GpuMat stmp(gbw.size(), CV_32FC1, cv::Scalar(0));
filter->apply(gbw, stmp, stream);
But it's many times slower than the CPU implementation with filter2D.
If I run it with a small 3x3 dummy kernel, the GPU implementation gets much faster, so I don't think it's the memory copy or anything like that.
Can I get my large kernel to run faster?
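A rough way to see why the kernel size matters so much: the sketch below counts multiply-adds for a direct 2D filter, using the sizes from the post (20 megapixels, 100x100 vs 3x3). It assumes the CUDA linear filter evaluates the convolution directly, tap by tap, which the post does not confirm. The CPU cv::filter2D, by contrast, is documented to switch to a DFT-based algorithm for sufficiently large kernels (about 11x11 or larger). The function and constant names are made up for illustration.

```cpp
#include <cstdint>

// Multiply-adds for a direct (non-separable, non-FFT) 2D filter:
// every output pixel visits every kernel tap exactly once.
// ASSUMPTION: this models the GPU filter; the post does not confirm it.
constexpr std::uint64_t directFilterMacs(std::uint64_t pixels,
                                         std::uint64_t kernelW,
                                         std::uint64_t kernelH) {
    return pixels * kernelW * kernelH;
}

// Sizes taken from the post (hypothetical constant names).
constexpr std::uint64_t kPixels = 20000000ULL;  // 20-megapixel image
constexpr std::uint64_t kLargeMacs = directFilterMacs(kPixels, 100, 100);
constexpr std::uint64_t kSmallMacs = directFilterMacs(kPixels, 3, 3);

// 100x100 needs 2e11 multiply-adds, 3x3 needs 1.8e8: over 1000x more work.
static_assert(kLargeMacs == 200000000000ULL, "100x100 kernel cost");
static_assert(kSmallMacs == 180000000ULL, "3x3 kernel cost");
```

Under that assumption the large kernel costs roughly 1100 times more arithmetic per frame than the 3x3 one, which would be consistent with the small kernel running fast while the large one does not.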
Maybe use Nsight to profile the program and see which part it spends the most time in.
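Before reaching for a full profiler, a coarse wall-clock timer around each stage can already narrow things down. This is only a minimal sketch using the standard library; elapsedMs is a hypothetical helper name, not part of OpenCV. Since work on a cv::cuda::Stream is asynchronous, the timed block has to wait for the stream to finish (for example with stream.waitForCompletion()), otherwise only the launch overhead is measured.

```cpp
#include <chrono>
#include <utility>

// Runs `work` once and returns the elapsed wall-clock time in milliseconds.
// Hypothetical helper for coarse timing; a profiler gives far more detail.
template <typename Work>
double elapsedMs(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Work>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Possible usage with the names from the question (not compiled here):
//   double ms = elapsedMs([&] {
//       filter->apply(gbw, stmp, stream);
//       stream.waitForCompletion();  // wait for the asynchronous GPU work
//   });
```

Timing the upload, the filter call and the download separately, for both the 3x3 and the 100x100 kernel, shows whether the filter itself is the stage that grows.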