cuda::dft speed issues (too slow)

asked 2015-05-31 05:29:28 -0500

huygens

updated 2015-05-31 12:36:37 -0500

I am trying to do some image Fourier transforms (FFT) in OpenCV 3.0 RC1. To speed up the process, I decided to use the cuda module in OpenCV. However, the results are disappointing.

To test the speed, I ran a DFT on a 512x512 random complex matrix on the CPU and on the GPU. On my computer, the CPU takes 2.1 milliseconds (ms), while the GPU takes 1.5 ms. I understand that copying data from host memory to video memory is time-consuming, so the data transfer time was excluded from the test results.

Since MATLAB also supports CUDA acceleration, I ran a similar test in MATLAB 2014b. The GPU version of FFT in MATLAB was surprisingly faster: the CPU takes 5 ms, while the GPU takes only 0.007 ms.

So the question is: if both OpenCV and MATLAB use the same CUDA DFT function (I assume), why is OpenCV so much slower?

OpenCV code I used is here:

#include <opencv2/core/core.hpp>
#include <opencv2/core/utility.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/highgui/highgui.hpp>

// CUDA structures and methods
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudaarithm.hpp>

#include <iostream>

using namespace cv;
using namespace std;


int main(int argc, char ** argv)
{
    // create a random complex image that is to be FFTed
    Mat complexImg = Mat(512, 512, CV_32FC2);
    randu(complexImg, Scalar::all(0), Scalar::all(255)); 

    Mat imgFFT;

    // DFT speed test on CPU
    double t = getTickCount();
    int NN = 100; //iteration number
    for (int i = 0; i < NN; i++)
    {
        dft(complexImg, imgFFT, DFT_COMPLEX_OUTPUT);
    }
    t = 1000 * ((double)getTickCount() - t) / getTickFrequency() / NN;
    cout << "CPU TIME: " << t << " ms" << endl;

    // DFT speed test on GPU
    cuda::GpuMat imageG, imgFFTG;
    imageG.upload(complexImg);
    cuda::dft(imageG, imgFFTG, imageG.size());  
    t = getTickCount();
    for (int i = 0; i < NN; i++)
    {
        cuda::dft(imageG, imgFFTG, imageG.size());
    }
    t = 1000 * ((double)getTickCount() - t) / getTickFrequency() / NN;
    cout << "GPU TIME: " << t << " ms" << endl;

    return 0;
}

MATLAB code I used is here:

M = double(rand(512,512,2));
N = zeros(size(M));

NN = 100; % iteration number
% CPU speed test
tic;
for i = 1:NN
    N = fft2(M);
end
elapsedTime = toc/NN;
disp(elapsedTime);

A = gpuArray(M);
B = fft2(A);

% GPU speed test
tic;
for i = 1:NN
    B = fft2(A);
end
elapsedTime = toc/NN;
disp(elapsedTime);

Comments

First of all I never used OpenCV with CUDA support.

In cuda::dft(complexImg, imgFFTG, imageG.size());

should it be

cuda::dft(imageG, imgFFTG, imageG.size());

to use the cuda::GpuMat imageG instead of Mat complexImg ?

Eduardo ( 2015-05-31 10:29:42 -0500 )

Thanks Eduardo! I don't know why I made that mistake. After fixing it, the GPU dft time changed to 1.6 ms, which is still slow compared to MATLAB.

huygens ( 2015-05-31 12:38:15 -0500 )

1 answer

answered 2015-05-31 17:21:06 -0500

Eduardo

Here are some results after I decided to test OpenCV with CUDA. I built OpenCV-3.0.0-rc1 with CUDA 7.0 on 64-bit Windows 7 with VS2010.

I also tested plain CUDA code (cuFFT) to compare it with the OpenCV CUDA module.

512x512
CPU TIME: ~3.74 ms
GPU TIME: ~1.88 ms ; Speed-Up: x1.99
Cuda time: ~1.96 ms ; Speed-Up: x1.91

1024x1024
CPU TIME:  ~12.22 ms
GPU TIME:  ~2.53 ms ; Speed-Up: x4.83
Cuda time: ~1.33 ms ; Speed-Up: x9.19

2048x2048
CPU TIME:  ~58.55 ms
GPU TIME:  ~4.79 ms ; Speed-Up: x12.22
Cuda time: ~1.19 ms ; Speed-Up: x49.2

8192x8192
CPU TIME:  ~1025.3 ms
GPU TIME:  ~66.63 ms ; Speed-Up: x15.39
Cuda time: ~1.29 ms ; Speed-Up: x794.81

It seems that the speed-up with OpenCV on the GPU also depends on the image size. The plain CUDA code behaves differently: when I monitor my GPU (with GPU-Z), I see the memory usage increase but not the GPU load (whereas the load does increase with the OpenCV CUDA module). There is probably something wrong with my code.
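
One way to sanity-check a timing like the 8192x8192 plain CUDA result is to convert it into an implied throughput. The sketch below uses the common benchmarking convention of counting 5·N·log2(N) floating-point operations for a complex FFT of N points, and assumes as a lower bound that the input is read once and the output written once. The function names are hypothetical, and the results are rough estimates, not measurements.

```cpp
#include <cmath>

// Nominal operation count for a complex FFT of n points
// (benchmarking convention: 5 * n * log2(n)).
double fft_nominal_flops(double n_points)
{
    return 5.0 * n_points * std::log2(n_points);
}

// Throughput in GFLOP/s implied by transforming a width x height
// complex image in time_ms milliseconds.
double implied_gflops(int width, int height, double time_ms)
{
    const double n = static_cast<double>(width) * height;
    return fft_nominal_flops(n) / (time_ms * 1e-3) / 1e9;
}

// Lower bound on memory traffic in GB/s, assuming single-precision
// complex data (8 bytes per element) read once and written once.
double min_traffic_gb_per_s(int width, int height, double time_ms)
{
    const double bytes = 2.0 * 8.0 * static_cast<double>(width) * height;
    return bytes / (time_ms * 1e-3) / 1e9;
}
```

With the numbers above, `implied_gflops(8192, 8192, 1.29)` is roughly 6,760 GFLOP/s and `min_traffic_gb_per_s(8192, 8192, 1.29)` is roughly 830 GB/s, while the OpenCV figure of 66.63 ms corresponds to roughly 131 GFLOP/s. The first pair looks too high to be a completed transform, which would be consistent with the timer stopping before the GPU had finished.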

Maybe you could try to increase the image size and compare the results between Cuda OpenCV and Cuda Matlab to see if the difference is still huge.

The strange thing is that the MATLAB CUDA time is so low compared to my CUDA result. Are you sure you didn't forget to convert the elapsed time? tic and toc should return the time in seconds.

Finally, the link to the source code for cv::cuda::dft() function and the code I used to test the plain Cuda code (using cuFFT example):

#include <cuda_runtime.h>
#include <cufft.h>

#include <iostream>
#include <cstdlib>  // atoi, rand, srand, malloc, free
#include <time.h>   // time

#include <opencv2/opencv.hpp>


int main(int argc, char ** argv)
{
    int NX = 2560;
    int NY = 2560;
    int NN = 1000;

    if(argc == 4)
    {
        NX = atoi(argv[1]);
        NY = atoi(argv[2]);
        NN = atoi(argv[3]);
    }

    std::cout << "NX=" << NX << " ; NY=" << NY << " ; NN=" << NN << std::endl;

    cufftHandle plan;
    cufftComplex *data, *res;
    cudaMalloc((void**)&data, sizeof(cufftComplex)*NX*NY);
    cudaMalloc((void**)&res, sizeof(cufftComplex)*NX*NY);

    /* Try to do the same thing than cv::randu() */
    cufftComplex* host_data;
    host_data = (cufftComplex *) malloc(sizeof(cufftComplex)*NX*NY);

    srand(time(NULL));
    for(int i = 0; i < NX*NY; i++)
    {
        host_data[i] = make_cuComplex(rand() % 256, rand() % 256);
        //host_data[i].x = rand() % 256;
        //host_data[i].y = rand() % 256;
    }

    /* Copy host -> device: cudaMemcpy takes the destination first, then the source. */
    cudaMemcpy(data, host_data, sizeof(cufftComplex)*NX*NY, cudaMemcpyHostToDevice);

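    /* Warm up: the first plan creation and execution include one-time setup costs. */
    cufftPlan2d(&plan, NX, NY, CUFFT_C2C);

    /* Transform the signal in place. */
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);

    /* cuFFT calls can return before the GPU has finished; wait, then release the warm-up plan. */
    cudaDeviceSynchronize();
    cufftDestroy(plan);

    double t = cv::getTickCount();

    for (int i = 0; i < NN; i++)
    {
        /* Create a 2D FFT plan (re-created every iteration, so its cost is part of the timing). */
        cufftPlan2d(&plan, NX, NY, CUFFT_C2C);

        /* Transform out of place, from data to res. */
        cufftExecC2C(plan, data, res, CUFFT_FORWARD);

        /* Wait for the transform to finish, then destroy the plan so plans do not accumulate. */
        cudaDeviceSynchronize();
        cufftDestroy(plan);
    }

    t = 1000 * ((double)cv::getTickCount() - t) / cv::getTickFrequency() / NN;
    std::cout << "Cuda time=" << t << " ms" << std::endl;

    /* Release the device and host buffers. */
    cudaFree(data);
    cudaFree(res);
    free(host_data);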

    return 0;
}
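
The general timing pattern for asynchronous GPU work can be sketched independently of cuFFT: warm up once, start the clock, run the iterations, and wait for all pending work before stopping the clock. The helper below is an illustration in standard C++ only; `average_ms`, `work`, and `sync` are hypothetical names. With CUDA, `sync` would typically be a small wrapper around `cudaDeviceSynchronize()`.

```cpp
#include <chrono>
#include <functional>

// Average wall-clock time per call of `work`, in milliseconds.
// `work` runs (or enqueues) one iteration; `sync` must block until
// everything enqueued so far has actually finished.
double average_ms(const std::function<void()>& work,
                  const std::function<void()>& sync,
                  int iterations)
{
    work();  // warm-up run, excluded from the measurement
    sync();

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
    {
        work();
    }
    sync();  // wait for pending work before reading the clock
    const auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::milli>(stop - start).count() / iterations;
}
```

Passing an empty `sync` reproduces the situation where only the launch overhead is measured, which makes it easy to compare the two numbers on the same machine.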

Comments

I retested MATLAB and found that it uses some trick: if I apply the FFT to the same matrix many times, MATLAB is clever enough to do it only once. So the time I got is completely wrong.

I rewrote the MATLAB code and tested it again; I think the speed is comparable to OpenCV.

huygens ( 2015-05-31 19:44:45 -0500 )

It looks like something like JIT (just-in-time compilation) is involved. It would be great if you could add the updated results.

Eduardo ( 2015-06-01 12:53:09 -0500 )
