I am using CUDA via the OpenCV GPU module. It works well, but I am not sure whether there is still some performance improvement possible in my application.
I use multiple threads and multiple CUDA streams, i.e. gpu::Stream objects. All calls (upload, download and processing) are issued asynchronously on the streams. Once a thread has finished enqueuing the operations for its image, it sleeps and waits for the stream to synchronize. After that, the processing of the image is complete.
However, I recently noticed that cudaMalloc and cudaFree are synchronous calls, i.e. they wait for all streams (of all threads?) to synchronize before they proceed. In my case I create an empty GpuMat and a stream when the processing of an image starts, and then enqueue the upload (enqueueUpload) and the processing on that stream. When the processing of the image is done, the GpuMat goes out of scope, i.e. its device memory is released. So cudaMalloc and cudaFree are called once per image. I suppose this causes my entire program to behave mostly synchronously?
What would be the best practice here for processing images in an asynchronous pipeline? Allocate a number of GpuMat images at startup and then only copy into those? That would work, but it does not seem very elegant, because then one would again have to track which images are in use and which are not, a burden I was happy to be rid of thanks to the ref counting of cv::Mat and cv::GpuMat. Are there better methods?