parallelise CUDA with multiple GPUs

asked 2018-08-05 07:23:17 -0500 by Rado

I have a routine that is perfect for parallelisation. I use cv::parallel_for_ just fine for the cv::Mat version of the code.
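To illustrate the pattern, here is a minimal standard-C++ sketch of the same contiguous range-splitting that cv::parallel_for_ applies to its cv::Range argument, using std::thread instead (the per-rotation image work is elided; runRotations is just an illustrative name):

```cpp
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

// Splits [0, nRotations) into contiguous sub-ranges, one per worker
// thread, and returns how many indices were processed in total.
int runRotations(int nRotations) {
    const unsigned nThreads = std::max(1u, std::thread::hardware_concurrency());
    std::atomic<int> processed{0};

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nThreads; ++t) {
        const int start = int(t * unsigned(nRotations) / nThreads);
        const int end   = int((t + 1) * unsigned(nRotations) / nThreads);
        workers.emplace_back([start, end, &processed] {
            for (int i = start; i < end; ++i) {
                // ... per-rotation work (warp, measure, etc.) goes here ...
                processed.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();
    return processed.load();
}
```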

It also works fine for cv::cuda::GpuMat (with the associated cv::cuda:: routines replacing the standard cv:: versions). But what I would really like to do is take advantage of multiple GPUs. I've tried adding a cv::cuda::setDevice call at the beginning of each loop iteration, and this seems to function OK. However, it looks like it causes the entire code to grind to a halt, as if it's being throttled by a single CPU.

Here's a snippet:

cv::Mat originalMatImage; // already loaded
const int nDevices = cv::cuda::getCudaEnabledDeviceCount();
const int nRotations = 720; // e.g.; much larger than the number of CPU cores

cv::parallel_for_(cv::Range(0, nRotations), [&](const cv::Range& range){
    for (auto i = range.start; i < range.end; i++) {

        cv::cuda::setDevice(i % nDevices);
        cv::cuda::Stream stream;
        cv::cuda::GpuMat originalCUDAimage;
        originalCUDAimage.upload(originalMatImage, stream);

        const double rotationAngle = double(180 * i) / double(nRotations);
        cv::cuda::GpuMat warpedCUDAimage(nRotations, nRotations, originalCUDAimage.type(), cv::Scalar(0.0));
        cv::Mat rotationMatrix = cv::getRotationMatrix2D(cv::Point2d(nRotations / 2.0, nRotations / 2.0), rotationAngle, 1.0);

        cv::cuda::warpAffine(originalCUDAimage, warpedCUDAimage, rotationMatrix, cv::Size(nRotations, nRotations), cv::INTER_CUBIC, cv::BORDER_CONSTANT, 0.0, stream);

        // do something interesting with the warpedCUDAimage
    }
});

What's the appropriate way to get this type of routine humming along, taking advantage of multiple CPUs and GPUs?
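For reference, the kind of structure I'm imagining is one long-lived worker thread per GPU, with the device bound once per thread and rotations dealt out round-robin. A minimal standard-C++ sketch of that split (the CUDA calls are reduced to comments, and splitAcrossDevices with its return value is purely for illustration):

```cpp
#include <thread>
#include <vector>

// One worker thread per device; each binds its GPU once, then takes
// every nDevices-th rotation index. Returns, per device, the list of
// indices that device handled. No locking is needed: each thread
// writes only to its own handled[dev] vector.
std::vector<std::vector<int>> splitAcrossDevices(int nRotations, int nDevices) {
    std::vector<std::vector<int>> handled(nDevices);
    std::vector<std::thread> workers;
    for (int dev = 0; dev < nDevices; ++dev) {
        workers.emplace_back([dev, nDevices, nRotations, &handled] {
            // In the real code: cv::cuda::setDevice(dev);  // once per thread
            for (int i = dev; i < nRotations; i += nDevices) {
                // upload / warpAffine / download on this device's stream
                handled[dev].push_back(i);
            }
        });
    }
    for (auto& w : workers) w.join();
    return handled;
}
```

This keeps the device context stable for each thread's whole lifetime instead of rebinding it on whichever pool thread parallel_for_ happens to pick.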
