Best practice for CUDA streams: how to get the OpenCV GPU module to work asynchronously?

asked 2015-03-31 10:05:49 -0500

Wolf

updated 2015-04-02 10:18:44 -0500

I am using CUDA via the OpenCV GPU module. It works well, but I suspect there is still room for performance improvement in my application.

I use multiple threads and multiple CUDA streams, i.e. gpu::Stream objects. All calls (upload, download, and processing) are enqueued asynchronously on the streams. Once a thread has finished enqueuing the operations for its image, it sleeps and waits for the stream to synchronize. Afterwards the processing of that image is done.

However, I recently noticed that cudaMalloc and cudaFree are synchronous calls, i.e. they wait for all streams (of all threads?) to synchronize before the action is performed. In my case I create an empty GpuMat and a stream when processing of an image starts, and then start the upload (enqueueUpload) and the processing on that stream. When processing of the image is done, the GpuMat goes out of scope, i.e. its device memory is released. So cudaMalloc and cudaFree are called for every image. I guess this makes my entire program behave mostly synchronously?

What would be the best practice here for processing images in an asynchronous pipeline? Allocate a number of GpuMat images at startup and then only use those images for copying? That would be okay, but it does not seem so nice, because then I would again have to manage which images are in use and which are not, a chore I was happy to be rid of thanks to the reference counting of cv::Mat and cv::gpu::GpuMat. Are there better methods?
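To make the preallocation idea concrete, here is a minimal sketch of a fixed-size pool that still gives reference-counted handles. It is plain C++11 and does not touch OpenCV or CUDA: `Buffer` is a hypothetical stand-in for cv::gpu::GpuMat, and `BufferPool`, `acquire`, and `freeCount` are names invented for this illustration, not part of any library. The assumption is that all device allocations happen once in `create`, so no cudaMalloc or cudaFree would occur while the pipeline is running.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical stand-in for cv::gpu::GpuMat. In real code this constructor
// would be the only place where device memory is allocated.
struct Buffer {
    explicit Buffer(std::size_t bytes) : data(bytes) {}
    std::vector<unsigned char> data;
};

// Fixed-size pool: every buffer is allocated once in create(). acquire()
// returns a shared_ptr whose deleter puts the buffer back into the pool
// instead of freeing it, so callers keep ref-counting semantics.
class BufferPool : public std::enable_shared_from_this<BufferPool> {
public:
    static std::shared_ptr<BufferPool> create(std::size_t count, std::size_t bytes) {
        assert(count > 0);
        std::shared_ptr<BufferPool> pool(new BufferPool);
        pool->free_.reserve(count);  // release() will never need to reallocate
        for (std::size_t i = 0; i < count; ++i) {
            pool->free_.push_back(std::unique_ptr<Buffer>(new Buffer(bytes)));
        }
        return pool;
    }

    // Blocks the calling thread until a buffer is available.
    std::shared_ptr<Buffer> acquire() {
        std::unique_ptr<Buffer> buf;
        {
            std::unique_lock<std::mutex> lock(mutex_);
            available_.wait(lock, [this] { return !free_.empty(); });
            buf = std::move(free_.back());
            free_.pop_back();
        }
        // The deleter keeps the pool alive for as long as any handle exists.
        std::shared_ptr<BufferPool> self = shared_from_this();
        return std::shared_ptr<Buffer>(
            buf.release(), [self](Buffer* b) { self->release(b); });
    }

    // Number of buffers currently not handed out.
    std::size_t freeCount() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return free_.size();
    }

private:
    BufferPool() = default;

    // Called by the shared_ptr deleter when the last handle goes away.
    void release(Buffer* b) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            free_.push_back(std::unique_ptr<Buffer>(b));
        }
        available_.notify_one();
    }

    mutable std::mutex mutex_;
    std::condition_variable available_;
    std::vector<std::unique_ptr<Buffer>> free_;
};
```

A worker thread would call `pool->acquire()` at the start of each image and simply let the handle go out of scope at the end. In a real GPU pipeline the handle should only be dropped after the stream working on that buffer has been synchronized, otherwise another thread could reuse memory that is still being written.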

Edit:

Getting the OpenCV GPU module to work really asynchronously appears to be an issue even beyond what I already mentioned. It would be nice if the API were clearer about that. (As the CUDA API itself should be!)

Example:

When calling cv::gpu::warpPerspective with properly preallocated cv::gpu::GpuMat matrices and a proper cv::gpu::Stream, I think one would expect asynchronous behaviour. However, internally it calls cudaMemcpyToSymbol (and there are more functions that do this), which is a synchronous call. So even when it is called with preallocated matrices and a proper stream, the result is an at least partly synchronous call. Moreover, it also causes all other currently active streams on the GPU to synchronize.

Are there any ideas/plans how to deal with that?


Comments

+1 would like to know.

abhiguru (2015-06-12 05:20:56 -0500)

@Wolf: Could you please share the solution to the above problem? I am also working on a similar problem where I have multi-threading. One thread captures the image and two other threads process the image using their own streams. But the computation time has not been reduced. Please share the approach you followed to solve the above situation.

user1234 (2015-07-14 07:52:13 -0500)