No effect from using cuda::Stream?

asked 2017-03-05 23:51:25 -0500

Brandon212

updated 2017-03-05 23:56:38 -0500

I was just experimenting with using Streams with cuda functions to see what kind of performance impact they have, but it seems like very often the function takes just as long to launch with and without using the stream parameter. I was expecting the call with the stream to launch an asynchronous operation which would essentially take 0 time, but it seems to be the full time of the operation itself.

For example, with a GpuMat uploaded, I tried:


cv::cuda::Stream myStream;, myStream);

and I've seen no time difference whatsoever.

Is there something I'm missing in how to use these?



Same problem for me. It feels like Stream has no effect or is not implemented. Edit: I found the problem. For Stream to have an effect, you need to make sure every destination GpuMat already has memory allocated to it...

NadavB ( 2017-08-03 03:30:49 -0500 )

Could you elaborate on allocating memory to the GpuMat? I have experienced the same issue: when I timed an asynchronous call, it took as long as a synchronous call.

gerardWalsh ( 2018-07-13 08:48:02 -0500 )

3 answers


answered 2020-02-05 04:02:50 -0500

pwuertz

updated 2020-02-05 04:13:51 -0500

I think the real issue is that CUDA needs pinned (page-locked) host memory to do asynchronous transfers to the GPU. If your memory for cpuImage does not qualify as such, the transfer is performed synchronously.

Maybe like this:


Confirmed. By simply using pinned host memory the upload/download + stream methods work asynchronously as expected. I used numba for allocating a pinned numpy array:

data_cpu = numba.cuda.pinned_array(shape=(2*8192, 2*8192), dtype=np.float32)

answered 2020-02-05 04:16:42 -0500

updated 2020-02-05 12:48:13 -0500

In OpenCV, streams are effective for asynchronous data transfer if, as suggested by @pwuertz, you pin the memory first. This can be achieved with either cv::cuda::HostMem or cv::cuda::registerPageLocked.

They are also effective for overlapping host and device computation (if the OpenCV function doesn't have its own fixed host/device sync points for intermediate calculations), not just for multithreaded computation.

See Accelerating OpenCV with CUDA streams in Python for an overview of how they can be used to optimize a single threaded toy problem.


answered 2017-03-06 18:12:43 -0500

Tetragramm

One instance is not enough to do anything. Streams are for multi-threaded programs.



I get that part, but what I was really asking about is that I assumed adding the stream would make the function call non-blocking, so a time measurement around it would show essentially zero time for launch (analogous to launching a CUDA kernel normally, or even a standard C++ thread). If the function call waits for the operation to complete even with a stream included, then it wouldn't be possible to have two different streams doing two different things, because you have to be able to launch them both in a non-blocking fashion.

Brandon212 ( 2017-03-07 15:45:12 -0500 )

Hmm. Are you sure? Try doing one that doesn't involve memory transfers, and explicitly compare the same code with and without the stream.waitForCompletion(). Also, make sure to run the code once before the timing loop starts so the appropriate memory is allocated.

Tetragramm ( 2017-03-07 17:17:56 -0500 )
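
The comparison suggested in the comment above (time the launch separately from the wait, after a warm-up run) can be sketched without a GPU. This is only an illustration of the measurement pattern, not OpenCV code: `time_launch_and_total` and `fake_work` are hypothetical names, and a thread pool stands in for the stream, with `submit()` playing the role of an asynchronous enqueue and `result()` the role of `waitForCompletion()`.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def time_launch_and_total(launch, wait):
    """Return (launch_seconds, total_seconds) for one operation.

    launch: callable that starts the work and returns a handle.
    wait:   callable that takes the handle and blocks until the work is done.
    """
    start = time.perf_counter()
    handle = launch()              # returns almost immediately if truly asynchronous
    launched = time.perf_counter()
    wait(handle)                   # blocks until the work has finished
    finished = time.perf_counter()
    return launched - start, finished - start


def fake_work(seconds=0.2):
    # Stand-in for a GPU operation (assumption: it only sleeps).
    time.sleep(seconds)


with ThreadPoolExecutor(max_workers=1) as pool:
    # Warm-up run so one-time setup cost is not included in the timing.
    pool.submit(fake_work, 0.0).result()

    # Asynchronous case: submit() returns before the work has finished.
    async_launch, async_total = time_launch_and_total(
        lambda: pool.submit(fake_work),
        lambda future: future.result(),
    )

# Blocking case: the "launch" itself does all the work, so there is nothing to wait for.
sync_launch, sync_total = time_launch_and_total(
    lambda: fake_work(),
    lambda _handle: None,
)

print(f"async: launch {async_launch:.3f}s, total {async_total:.3f}s")
print(f"sync:  launch {sync_launch:.3f}s, total {sync_total:.3f}s")
```

With OpenCV, `launch` would wrap the call that takes the stream argument and `wait` would call `stream.waitForCompletion()`. If the measured launch time is close to the total time, the call blocked despite the stream, which is the symptom described in the question.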
