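Here are some results from testing OpenCV with CUDA. I built OpenCV 3.0.0-rc1 with CUDA 7.0 on Windows 7 64-bit with VS2010.

I also tested plain CUDA code (cuFFT) to compare against the OpenCV CUDA module. In the results below, "GPU TIME" is the OpenCV CUDA module and "Cuda time" is the plain CUDA code.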

512x512
CPU TIME: ~3.74 ms
GPU TIME: ~1.88 ms ; Speed-Up: x1.99
Cuda time: ~1.96 ms ; Speed-Up: x1.91

1024x1024
CPU TIME:  ~12.22 ms
GPU TIME:  ~2.53 ms ; Speed-Up: x4.83
Cuda time: ~1.33 ms ; Speed-Up: x9.19

2048x2048
CPU TIME:  ~58.55 ms
GPU TIME:  ~4.79 ms ; Speed-Up: x12.22
Cuda time: ~1.19 ms ; Speed-Up: x49.2

8192x8192
CPU TIME:  ~1025.3 ms
GPU TIME:  ~66.63 ms ; Speed-Up: x15.39
Cuda time: ~1.29 ms ; Speed-Up: x794.81
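
The speed-up figures above are simply the CPU time divided by the corresponding GPU time. A minimal sketch of that arithmetic, where the helper name `speedup` is hypothetical and not part of the original test code:

```cpp
#include <cassert>

// Speed-up of a measured time relative to the CPU baseline.
// Both arguments are in milliseconds.
double speedup(double cpu_ms, double other_ms)
{
    assert(other_ms > 0.0);  // guard against division by zero
    return cpu_ms / other_ms;
}
```

For example, `speedup(3.74, 1.88)` reproduces the 512x512 figure for the OpenCV CUDA module, and `speedup(1025.3, 1.29)` reproduces the 8192x8192 figure for the plain CUDA code.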

It seems that the speed-up with the OpenCV CUDA module also depends on the image size. The plain CUDA code behaves differently: when I monitor my GPU (with GPU-Z), the memory usage increases but the GPU load does not (whereas it does increase with the OpenCV CUDA module). There must be something wrong with my code.

Maybe you could try increasing the image size and comparing the results between OpenCV CUDA and MATLAB CUDA to see whether the difference is still huge.

The strange thing is that the MATLAB CUDA time is so low compared to my CUDA result. Are you sure you didn't forget to convert the elapsed time you reported? tic and toc return the time in seconds.

Finally, here is the link to the source code of the cv::cuda::dft() function, and the code I used to test plain CUDA (based on the cuFFT example):
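
For reference, the conversion in question is a total time in seconds turned into milliseconds per iteration, which is the same scaling the C++ test program applies to its own tick count. A minimal sketch, where `ms_per_iteration` is a hypothetical helper name:

```cpp
#include <cassert>

// Convert a total elapsed time in seconds (the unit tic/toc reports)
// into milliseconds per iteration.
double ms_per_iteration(double elapsed_s, int iterations)
{
    assert(iterations > 0);
    return 1000.0 * elapsed_s / iterations;
}
```

Skipping the factor of 1000 makes a result look a thousand times faster than it is, so it is worth ruling out before comparing numbers.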

#include <cuda_runtime.h>
#include <cufft.h>

#include <cstdlib>   // atoi, malloc, free, rand, srand
#include <ctime>     // time
#include <iostream>

#include <opencv2/opencv.hpp>


int main(int argc, char ** argv)
{
    int NX = 2560;
    int NY = 2560;
    int NN = 1000;

    if(argc == 4)
    {
        NX = atoi(argv[1]);
        NY = atoi(argv[2]);
        NN = atoi(argv[3]);
    }

    std::cout << "NX=" << NX << " ; NY=" << NY << " ; NN=" << NN << std::endl;

    cufftHandle plan;
    cufftComplex *data, *res;
    cudaMalloc((void**)&data, sizeof(cufftComplex)*NX*NY);
    cudaMalloc((void**)&res, sizeof(cufftComplex)*NX*NY);

    /* Fill the host buffer with random values, similar to what cv::randu() does. */
    cufftComplex* host_data;
    host_data = (cufftComplex *) malloc(sizeof(cufftComplex)*NX*NY);

    srand(time(NULL));
    for(int i = 0; i < NX*NY; i++)
    {
        host_data[i] = make_cuComplex(rand() % 256, rand() % 256);
        //host_data[i].x = rand() % 256;
        //host_data[i].y = rand() % 256;
    }

    /* Copy the host buffer to the device (destination first, then source). */
    cudaMemcpy(data, host_data, sizeof(cufftComplex)*NX*NY, cudaMemcpyHostToDevice);

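    /* Warm-up run, excluded from the timing. */
    /* Create a 2D FFT plan. */
    cufftPlan2d(&plan, NX, NY, CUFFT_C2C);

    /* Transform the signal in place. */
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);

    /* Wait for the warm-up to finish, then release its plan. */
    cudaDeviceSynchronize();
    cufftDestroy(plan);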

    double t = cv::getTickCount();

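    for (int i = 0; i < NN; i++)
    {
        /* Create a 2D FFT plan; plan creation is part of the timed loop. */
        cufftPlan2d(&plan, NX, NY, CUFFT_C2C);

        /* Transform the signal out of place, from data to res. */
        cufftExecC2C(plan, data, res, CUFFT_FORWARD);

        /* Release the plan so its work area does not pile up in GPU memory. */
        cufftDestroy(plan);
    }

    /* cuFFT execution is asynchronous with respect to the host:
       wait for the GPU before stopping the clock. */
    cudaDeviceSynchronize();

    t = 1000 * ((double)cv::getTickCount() - t) / cv::getTickFrequency() / NN;
    std::cout << "Cuda time=" << t << " ms" << std::endl;

    /* Release device and host buffers. */
    cudaFree(data);
    cudaFree(res);
    free(host_data);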
    return 0;
}
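
One plausible reason the plain CUDA time stays near 1.3 ms regardless of image size is that cuFFT execution calls return to the host before the GPU work has necessarily finished, so a host timer stopped right after the loop may mostly measure launch overhead. The sketch below is a generic timing helper in plain C++ that calls a synchronization hook before stopping the clock. The names `time_per_iteration_ms`, `work` and `sync` are hypothetical; for the program above, `sync` would be assumed to wrap `cudaDeviceSynchronize()`.

```cpp
#include <cassert>
#include <chrono>

// Average wall-clock time per iteration, in milliseconds.
// `work` is called `iterations` times; `sync` is called once before the
// clock stops, so asynchronously queued work is included in the measurement.
template <typename Work, typename Sync>
double time_per_iteration_ms(Work work, Sync sync, int iterations)
{
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        work();
    sync();  // assumption: for GPU work, a wrapper around cudaDeviceSynchronize()
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

If the time measured this way grows with the image size while the unsynchronized one does not, the missing synchronization is the likely culprit.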