No effect from using cuda::Stream?

CUDA
stream

asked 2017-03-05 23:51:25 -0600

Brandon212

updated 2017-03-05 23:56:38 -0600

I was experimenting with passing a Stream to CUDA functions to see what kind of performance impact it has, but very often the call takes just as long to return with the stream parameter as without it. I expected the call with the stream to launch an asynchronous operation and return in essentially zero time, but it seems to take the full time of the operation itself.

For example, with a GpuMat uploaded I tried

gpuImage.download(cpuImage);

and

cv::cuda::Stream myStream;
gpuImage.download(cpuImage, myStream);

and I've seen no time difference whatsoever.

Is there something I'm missing in how to use these?

Comments

Same problem for me. It feels like Stream has no effect or is not implemented. Edit: I found the problem. For Stream to have an effect, you need to make sure all the destination GpuMat objects have memory allocated to them...

NadavB ( 2017-08-03 03:30:49 -0600 )

Could you elaborate on allocating memory to the GpuMat? I have experienced the same issue: when I time an asynchronous call, it takes as long as a synchronous call.

gerardWalsh ( 2018-07-13 08:48:02 -0600 )

1

answered 2020-02-05 04:02:50 -0600

pwuertz

updated 2020-02-05 04:13:51 -0600

I think the real issue is that CUDA needs pinned (page-locked) host memory to do asynchronous transfers between host and GPU. If the memory behind cpuImage does not qualify as such, the transfer is performed synchronously.

Maybe like this: https://answers.opencv.org/question/1...

Edit:

Confirmed. By simply using pinned host memory the upload/download + stream methods work asynchronously as expected. I used numba for allocating a pinned numpy array:

data_cpu = numba.cuda.pinned_array(shape=(2*8192, 2*8192), dtype=np.float32)

1

answered 2020-02-05 04:16:42 -0600

cudawarped
http://jamesbowley.co.uk

updated 2020-02-05 12:48:13 -0600

In OpenCV, streams are effective for asynchronous data transfer if, as suggested by @pwuertz, you pin the host memory first. This can be achieved with either cv::cuda::HostMem or cv::cuda::registerPageLocked.

They are also effective for overlapping host and device computation (provided the OpenCV function doesn't have its own fixed host/device sync points for intermediate calculations), not just for multithreaded computation.

See "Accelerating OpenCV with CUDA streams in Python" for an overview of how they can be used to optimize a single-threaded toy problem.

0

answered 2017-03-06 18:12:43 -0600

Tetragramm


One instance is not enough to do anything. Streams are for multi-threaded programs.

https://devblogs.nvidia.com/parallelforall/gpu-pro-tip-cuda-7-streams-simplify-concurrency/

Comments

I get that part, but what I was really asking about is this: I assumed that adding the stream would make the function call non-blocking, so a time measurement around it would show essentially zero time for the launch (analogous to launching a CUDA kernel normally, or even a standard C++ thread). If the function call waits for the operation to complete even with a stream included, then it wouldn't be possible to have two different streams doing two different things, because you have to be able to launch them both in a non-blocking fashion.

Brandon212 ( 2017-03-07 15:45:12 -0600 )

Hmm. Are you sure? Try doing one that doesn't involve memory transfers, and explicitly compare the same code with and without the stream.waitForCompletion(). Also, make sure to run the code once before the timing loop starts so the appropriate memory is allocated.

Tetragramm ( 2017-03-07 17:17:56 -0600 )
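
The comparison suggested in this comment can be sketched as a small timing helper. This is plain C++ with no OpenCV dependency; `AsyncTiming` and `timeAsync` are hypothetical names introduced here for illustration. The helper runs the operation once as a warm-up, then measures separately how long the enqueue call takes to return and how long it takes until the work has completed.

```cpp
#include <chrono>

// Hypothetical result type: both values are in milliseconds.
struct AsyncTiming {
    double enqueue_ms;  // time until the enqueue call returned to the host
    double total_ms;    // time until the wait call returned (work finished)
};

// Hypothetical helper: `enqueue` starts the operation, `wait` blocks until
// it has completed. Both are called twice: one warm-up run, one timed run.
template <typename Enqueue, typename Wait>
AsyncTiming timeAsync(Enqueue&& enqueue, Wait&& wait) {
    using Clock = std::chrono::steady_clock;
    using Ms = std::chrono::duration<double, std::milli>;

    // Warm-up so one-time costs (initialization, buffer allocation)
    // are not included in the measurement.
    enqueue();
    wait();

    const auto t0 = Clock::now();
    enqueue();  // returns quickly if the call is truly asynchronous
    const auto t1 = Clock::now();
    wait();     // blocks until the queued work is done
    const auto t2 = Clock::now();

    return {Ms(t1 - t0).count(), Ms(t2 - t0).count()};
}
```

For the case in the question, `enqueue` would be a lambda such as `[&]{ gpuImage.download(cpuImage, myStream); }` and `wait` would be `[&]{ myStream.waitForCompletion(); }`. If `enqueue_ms` comes out close to `total_ms`, the call is effectively blocking; if it is much smaller, the operation is running asynchronously.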
