cuda::dft speed issues (too slow)

huygens

I am trying to do some image Fourier transforms (FFT) in OpenCV 3.0 RC1. To speed up the process, I decided to use the cuda module in OpenCV. However, the results are disappointing.

To test the speed, I ran a DFT on a 512x512 random complex matrix on the CPU and on the GPU. On my computer, the CPU takes 2.1 milliseconds (ms) per transform, while the GPU takes 1.5 ms. I understand that copying data from host memory to video memory is time-consuming, so the transfer time was excluded from the test results.

Since MATLAB also supports CUDA acceleration, I ran a similar test in MATLAB 2014b. The GPU version of the FFT in MATLAB was surprisingly faster: the CPU takes 5 ms, while the GPU takes only 0.007 ms.

So the question is, if both OpenCV and MATLAB are using the same cuda dft function (I assume), why is OpenCV so much slower?

OpenCV code I used is here:

#include <opencv2/core/core.hpp>
#include <opencv2/core/utility.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/highgui/highgui.hpp>

// CUDA structures and methods
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudaarithm.hpp>

#include <iostream>

using namespace cv;
using namespace std;


int main(int argc, char ** argv)
{
    // create a random complex image that is to be FFTed
    Mat complexImg = Mat(512, 512, CV_32FC2);
    randu(complexImg, Scalar::all(0), Scalar::all(255)); 

    Mat imgFFT;

    // DFT speed test on CPU
    double t = getTickCount();
    int NN = 100; //iteration number
    for (int i = 0; i < NN; i++)
    {
        dft(complexImg, imgFFT, DFT_COMPLEX_OUTPUT);
    }
    t = 1000 * ((double)getTickCount() - t) / getTickFrequency() / NN;
    cout << "CPU TIME: " << t << " ms" << endl;

    // DFT speed test on GPU
    cuda::GpuMat imageG, imgFFTG;
    imageG.upload(complexImg);
    cuda::dft(imageG, imgFFTG, imageG.size()); // warm-up call, not timed
    t = getTickCount();
    for (int i = 0; i < NN; i++)
    {
        cuda::dft(imageG, imgFFTG, imageG.size());
    }
    t = 1000 * ((double)getTickCount() - t) / getTickFrequency() / NN;
    cout << "GPU TIME: " << t << " ms" << endl;

    return 0;
}

MATLAB code I used is here:

M = double(rand(512,512,2));
N = zeros(size(M));

NN = 100; % iteration number
% CPU speed test
tic;
for i = 1:NN
    N = fft2(M);
end
elapsedTime = toc/NN;
disp(elapsedTime);

A = gpuArray(M);
B = fft2(A);

% GPU speed test
tic;
for i = 1:NN
    B = fft2(A);
end
elapsedTime = toc/NN;
disp(elapsedTime);

Comments

First of all, I have never used OpenCV with CUDA support.

In cuda::dft(complexImg, imgFFTG, imageG.size());

should it be

cuda::dft(imageG, imgFFTG, imageG.size());

to use the cuda::GpuMat imageG instead of the Mat complexImg?

Eduardo (May 31 '15)

Thanks Eduardo! I don't know why I made that mistake. After fixing it, the GPU dft time decreased to 1.6 ms, but that is still slow compared to MATLAB.

huygens (May 31 '15)

512x512
CPU TIME: ~3.74 ms
GPU TIME: ~1.88 ms ; Speed-Up: x1.99
Cuda time: ~1.96 ms ; Speed-Up: x1.91

1024x1024
CPU TIME: ~12.22 ms
GPU TIME: ~2.53 ms ; Speed-Up: x4.83
Cuda time: ~1.33 ms ; Speed-Up: x9.19

2048x2048
CPU TIME: ~58.55 ms
GPU TIME: ~4.79 ms ; Speed-Up: x12.22
Cuda time: ~1.19 ms ; Speed-Up: x49.2

8192x8192
CPU TIME: ~1025.3 ms
GPU TIME: ~66.63 ms ; Speed-Up: x15.39
Cuda time: ~1.29 ms ; Speed-Up: x794.81

#include <cuda_runtime.h>
#include <cufft.h>
#include <cstdlib>
#include <iostream>
#include <time.h>
#include <opencv2/opencv.hpp>

int main(int argc, char ** argv)
{
    int NX = 2560;
    int NY = 2560;
    int NN = 1000;

    if (argc == 4)
    {
        NX = atoi(argv[1]);
        NY = atoi(argv[2]);
        NN = atoi(argv[3]);
    }
    std::cout << "NX=" << NX << " ; NY=" << NY << " ; NN=" << NN << std::endl;

    cufftHandle plan;
    cufftComplex *data, *res;
    cudaMalloc((void**)&data, sizeof(cufftComplex)*NX*NY);
    cudaMalloc((void**)&res, sizeof(cufftComplex)*NX*NY);

    /* Try to do the same thing as cv::randu() */
    cufftComplex* host_data;
    host_data = (cufftComplex *) malloc(sizeof(cufftComplex)*NX*NY);
    srand(time(NULL));
    for (int i = 0; i < NX*NY; i++)
    {
        host_data[i] = make_cuComplex(rand() % 256, rand() % 256);
        //host_data[i].x = rand() % 256;
        //host_data[i].y = rand() % 256;
    }
    /* Copy host -> device: cudaMemcpy takes the destination pointer first. */
    cudaMemcpy(data, host_data, sizeof(cufftComplex)*NX*NY, cudaMemcpyHostToDevice);

    /* Warm up: create the 2D FFT plan once. */
    cufftPlan2d(&plan, NX, NY, CUFFT_C2C);
    /* Transform the first signal in place. */
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);
    cudaDeviceSynchronize();

    double t = cv::getTickCount();
    for (int i = 0; i < NN; i++)
    {
        /* Reuse the plan created above; creating a new plan on every
           iteration would leak handles and add planning cost to the timing. */
        cufftExecC2C(plan, data, res, CUFFT_FORWARD);
    }
    /* cuFFT execution is asynchronous with respect to the host,
       so wait for the GPU before stopping the timer. */
    cudaDeviceSynchronize();
    t = 1000 * ((double)cv::getTickCount() - t) / cv::getTickFrequency() / NN;
    std::cout << "Cuda time=" << t << " ms" << std::endl;

    /* Destroy the cuFFT plan and release the buffers. */
    cufftDestroy(plan);
    cudaFree(data);
    cudaFree(res);
    free(host_data);

    return 0;
}

Comments

I retested MATLAB and found that it uses some trick: if I apply the FFT to the same matrix many times, MATLAB is clever enough to do it just once. So the time I got is completely wrong.

I rewrote the MATLAB code and tested it again; I think the speed is now comparable to OpenCV.

huygens (Jun 1 '15)

It looks like something like JIT (just-in-time compilation) is involved. It would be great if you could add the updated results.

Eduardo (Jun 1 '15)
