CUDA Convolve vs. filter2D in OpenCV 3.1.0

asked 2016-12-29 12:22:23 -0500

federocchi

updated 2016-12-29 12:29:00 -0500

Hello, I'm using OpenCV 3.1.0 with CUDA on an Intel Xeon 5110 @ 1.60 GHz x2 CPU + Nvidia Quadro 600 + 4 GB RAM, with Qt on Fedora 23, and I'm concerned about convolution speed. My test code shows that filter2D convolution of an image with a 3x3 kernel is much faster than CUDA convolve as long as the image is not too big (the threshold is around 1280x1024). Surprisingly, it is also always faster than separated convolution (first with a 3x1 kernel, then with a 1x3 kernel); from theory I expected the separated version to take about 2/3 of the processing time (3+3 rather than 3x3 multiplications per pixel). Moreover, the output image from CUDA convolve is smaller than the original one; from the documentation I expected the same size.

Is there anything wrong in what I'm doing? Any suggestions to speed up convolution for images around 640x480? The test code I used is below:

cv::cuda::GpuMat temp2; // ---- a B/W image; its size differs between tests

//-----fill up the temp2 image

....

//---------------------------

Mat dst_x; Mat dst_x1; Mat dst_x2; Mat tmp_2; cv::cuda::GpuMat fx;

Mat kernel_x = (Mat_<double>(3,3) << 2, 0, -2, 4, 0, -4, 2, 0, -2);

Mat kernel_x1 = (Mat_<double>(3,1) << 2, 4, 2); //----separate x convolution

Mat kernel_x2 = (Mat_<double>(1,3) << 1, 0, -1);

temp2.download(tmp_2);

int64 t1 = getTickCount();

cv::filter2D(tmp_2, dst_x1, -1,kernel_x1);

cv::filter2D(dst_x1, dst_x2, -1,kernel_x2);

int64 t2 = getTickCount();

std::cout << "Time passed in ms: " << ((t2 - t1) * 1000. / getTickFrequency()) << std::endl;

//int64 t1 = getTickCount();

cv::filter2D(tmp_2, dst_x, -1,kernel_x);

//int64 t2 = getTickCount();

//std::cout << "Time passed in ms: " << ((t2 - t1) * 1000. / getTickFrequency()) << std::endl;

//----CUDA convolution---------

kernel_x.convertTo(kernel_x,CV_32FC1);

//int64 t1 = getTickCount();

Ptr<cuda::Convolution> convolver = cuda::createConvolution(Size(3, 3));

convolver->convolve(temp2, kernel_x, fx);

//int64 t2 = getTickCount();

//std::cout << "Time passed in ms: " << ((t2 - t1) * 1000. / getTickFrequency()) << std::endl;

//----END CUDA convolution---------
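As a sanity check on the "3+3 rather than 3x3" expectation, here is a minimal sketch in plain C++ (no OpenCV needed). It assumes the 3x1 and 1x3 kernels are meant to reproduce the 3x3 kernel as an outer product. The names Kernel3x3, outerProduct, fullKernelMultiplies and separableMultiplies are made up for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// A 3x3 kernel stored row by row (hypothetical helper type).
using Kernel3x3 = std::array<std::array<double, 3>, 3>;

// Outer product of a column vector and a row vector:
// entry (r, c) of the result is col[r] * row[c].
Kernel3x3 outerProduct(const std::array<double, 3>& col,
                       const std::array<double, 3>& row)
{
    Kernel3x3 k{};
    for (std::size_t r = 0; r < 3; ++r)
        for (std::size_t c = 0; c < 3; ++c)
            k[r][c] = col[r] * row[c];
    return k;
}

// Multiplications per output pixel for one pass with a full k x k kernel.
int fullKernelMultiplies(int k)
{
    assert(k >= 1);
    return k * k;
}

// Multiplications per output pixel for a k x 1 pass followed by a 1 x k pass.
int separableMultiplies(int k)
{
    assert(k >= 1);
    return 2 * k;
}
```

With col = (2, 4, 2) and row = (1, 0, -1), outerProduct gives the same entries as kernel_x, and for k = 3 the counts are 9 versus 6 multiplications per pixel, which is where the 2/3 figure comes from. This only counts arithmetic; the two-pass version also writes and re-reads an intermediate image, which the count ignores. OpenCV also provides cv::sepFilter2D for separable kernels, which may be worth timing too.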

I can sum up the results as follows:

Image size (30,40) (rows,cols)

Time passed in ms: 0.083827; filter2D convolution with kernel size (3,3); output image same size

Time passed in ms: 0.044761; filter2D separated convolution with kernel size (1,3) and (3,1); output image same size

Time passed in ms: 5.95849; CUDA convolve convolution with kernel size (3,3); output image size (28,38)

Image size (118,158)

Time passed in ms: 0.204968; filter2D convolution with kernel size (3,3); output image same size

Time passed in ms: 0.27658; filter2D separated convolution with kernel size (1,3) and (3,1); output image same size

Time passed in ms: 7.03869; CUDA convolve convolution with kernel size (3,3); output image size (116,156)

Image size (469,629)

Time passed in ms: 2.51682; filter2D convolution with kernel size (3,3); output image same size

Time passed in ms: 5.72645; filter2D separated convolution with kernel size (1,3) and (3,1); output image same size

Time passed in ms: 9.31991; CUDA convolve convolution with kernel size (3 ... (more)
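The CUDA output sizes reported above (28,38) and (116,156) are what you would get if the kernel is only evaluated where it fits entirely inside the image, with no border padding. Here is a small sketch of that arithmetic, assuming this "valid" behaviour. ImageSize and validOutputSize are hypothetical names, not OpenCV API.

```cpp
#include <cassert>

// Rows and columns of an image or kernel (hypothetical helper type).
struct ImageSize {
    int rows;
    int cols;
};

// Output size when the kernel is placed only at positions where it lies
// fully inside the image, i.e. without any border extrapolation.
ImageSize validOutputSize(ImageSize image, ImageSize kernel)
{
    assert(kernel.rows >= 1 && kernel.cols >= 1);
    assert(image.rows >= kernel.rows && image.cols >= kernel.cols);
    return { image.rows - kernel.rows + 1, image.cols - kernel.cols + 1 };
}
```

For a (30,40) image and a (3,3) kernel this gives (28,38), and for (118,158) it gives (116,156). Under the same assumption, padding the input by one pixel on each side before convolving with a 3x3 kernel would bring the result back to the original size.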


Comments

Always do a bunch of iterations. Around each of the things you're timing, put a for(int iter = 0; iter<1000; ++iter)

That way any initialization gets spread out over several tests.

Tetragramm (2016-12-29 12:32:10 -0500)
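The advice in the comment above could be sketched roughly like this. It is only an illustration: averageMillis is a made-up helper, and it uses std::chrono instead of getTickCount so that it compiles without OpenCV.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Calls `work` `iterations` times and returns the average wall-clock time
// per call in milliseconds, so one-off initialization cost is spread over
// all iterations instead of dominating a single measurement.
double averageMillis(const std::function<void()>& work, int iterations)
{
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int iter = 0; iter < iterations; ++iter)
        work();
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

In the test code from the question, the callable would wrap the operation being timed, for example averageMillis([&] { cv::filter2D(tmp_2, dst_x, -1, kernel_x); }, 1000), and likewise for the separated and CUDA versions.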