I have a routine that is perfect for parallelisation. cv::parallel_for_ works just fine for the cv::Mat version of the code, and it also works fine for cv::cuda::GpuMat (with the associated cv::cuda:: routines replacing the standard cv:: versions). But what I would really like to do is take advantage of multiple GPUs. I've tried adding a cv::cuda::setDevice call at the beginning of each loop iteration, and functionally this seems to be OK. However, it brings the entire code to a crawl, as if everything is being throttled by a single CPU.
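For context, the plain cv::Mat version is essentially the following (a trimmed-down sketch using the same originalMatImage and nRotations as in the snippet below; the real loop body does more than just the warp):

cv::parallel_for_(cv::Range(0, nRotations), [&](const cv::Range& range) {
    for (int i = range.start; i < range.end; i++) {
        const double rotationAngle = double(180 * i) / double(nRotations);
        cv::Mat rotationMatrix = cv::getRotationMatrix2D(
            cv::Point2f(nRotations / 2.0f, nRotations / 2.0f), rotationAngle, 1.0);
        cv::Mat warpedImage;
        cv::warpAffine(originalMatImage, warpedImage, rotationMatrix,
                       cv::Size(nRotations, nRotations), cv::INTER_CUBIC,
                       cv::BORDER_CONSTANT, cv::Scalar());
        // do something interesting with warpedImage
    }
});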
Here's a snippet of the multi-GPU attempt:
cv::Mat originalMatImage; // already loaded
const int nDevices = cv::cuda::getCudaEnabledDeviceCount();
const int nRotations; // a number, much larger than the number of CPU cores

cv::parallel_for_(cv::Range(0, nRotations), [&](const cv::Range& range) {
    for (int i = range.start; i < range.end; i++) {
        // bind this worker thread's CUDA calls to one of the available devices
        cv::cuda::setDevice(i % nDevices);
        cv::cuda::Stream stream;

        // upload the source image to the selected device
        cv::cuda::GpuMat originalCUDAimage;
        originalCUDAimage.upload(originalMatImage, stream);

        // rotate on the GPU
        const double rotationAngle = double(180 * i) / double(nRotations);
        cv::Mat rotationMatrix = cv::getRotationMatrix2D(
            cv::Point2f(nRotations / 2.0f, nRotations / 2.0f), rotationAngle, 1.0);
        cv::cuda::GpuMat warpedCUDAimage(nRotations, nRotations, originalCUDAimage.type(), cv::Scalar::all(0));
        cv::cuda::warpAffine(originalCUDAimage, warpedCUDAimage, rotationMatrix,
                             cv::Size(nRotations, nRotations), cv::INTER_CUBIC,
                             cv::BORDER_CONSTANT, cv::Scalar(), stream);
        stream.waitForCompletion();

        // do something interesting with the warpedCUDAimage
    }
});
What's the appropriate way to get this type of routine humming along, taking advantage of multiple CPUs and GPUs?