I have been experimenting with streams in OpenCV's CUDA functions to see what performance impact they have, but very often a call takes just as long to return with the stream parameter as without it. I expected the call with a stream to start an asynchronous operation and return almost immediately, but instead it seems to block for the full duration of the operation itself.
For example, with an image already uploaded to the GPU, I tried
gpuImage.download(cpuImage);
and
cv::cuda::Stream myStream;
gpuImage.download(cpuImage, myStream);
and I measured no time difference whatsoever between the two calls.
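For reference, the comparison can be timed roughly as in the sketch below. It is a minimal sketch only: elapsedMs is a hypothetical helper written for this post, not part of OpenCV, and the OpenCV calls appear only in comments because they need a CUDA-enabled build to run.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper (not part of OpenCV): runs `work` once and returns
// the wall-clock time the call took, in milliseconds.
template <typename Work>
double elapsedMs(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Work>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Intended use with the calls from this post (assumes OpenCV built with CUDA,
// and that gpuImage, cpuImage and myStream are set up as shown above):
//   double blockingMs = elapsedMs([&] { gpuImage.download(cpuImage); });
//   double enqueueMs  = elapsedMs([&] { gpuImage.download(cpuImage, myStream); });
//   double waitMs     = elapsedMs([&] { myStream.waitForCompletion(); });
```

Timing the streamed call and myStream.waitForCompletion() separately should show whether the time is spent in the call itself or only in the wait. In my case the streamed call alone appears to take the full time.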
Is there something I'm missing about how to use streams?