parallelise CUDA with multiple GPUs

asked 2018-08-05 07:23:17 -0500

Rado

I have a routine that is perfect for parallelisation. I use cv::parallel_for_ just fine for the cv::Mat version of the code.

It also works fine for cv::cuda::GpuMat (with the associated cv::cuda:: routines replacing the standard cv:: versions). What I would really like, though, is to take advantage of multiple GPUs. I've tried adding a cv::cuda::setDevice call at the start of each loop iteration, and this seems to function OK. But it brings the entire code grinding to a halt, as if it were being throttled by a single CPU.

Here's a snippet:

cv::Mat originalMatImage; // already loaded
const int nDevices = cv::cuda::getCudaEnabledDeviceCount();
const int nRotations = /* ... */; // a number, much larger than the number of CPU cores

cv::parallel_for_(cv::Range(0, nRotations), [&](const cv::Range& range){
    for (auto i = range.start; i < range.end; i++) {

        cv::cuda::setDevice(i % nDevices);
        cv::cuda::Stream stream;
        cv::cuda::GpuMat originalCUDAimage;
        originalCUDAimage.upload(originalMatImage, stream);

        const double rotationAngle = double(180 * i) / double(nRotations);
        cv::cuda::GpuMat warpedCUDAimage(nRotations, nRotations, originalCUDAimage.type(), 0.0);
        cv::Mat rotationMatrix(cv::getRotationMatrix2D(cv::Point2d(nRotations / 2.0, nRotations / 2.0), rotationAngle, 1.0));

        cv::cuda::warpAffine(originalCUDAimage, warpedCUDAimage, rotationMatrix, cv::Size(nRotations, nRotations), cv::INTER_CUBIC, cv::BORDER_CONSTANT, 0.0, stream);

        stream.waitForCompletion(); // ensure the asynchronous upload and warp have finished

        // do something interesting with the warpedCUDAimage
    }
});
What's the appropriate way to get this type of routine humming along, taking advantage of multiple CPUs and GPUs?
