Asynchronous performance lacking

asked 2018-07-13 10:19:22 -0500

gerardWalsh

updated 2018-07-13 10:21:14 -0500

When using CUDA to try to implement a pipelined execution model, timing an asynchronous function call gives the same measured time as the synchronous version of the call:

cv::cuda::Stream stream1;
cv::cuda::GpuMat img1, descR, KeyP;
// Time only the call that enqueues the work on stream1 (no synchronization inside the timed region).
t2 = clock();
gpuOrb->detectAndComputeAsync(img1, cv::noArray(), KeyP, descR, false, stream1);
r3 = (clock() - t2) / (double)CLOCKS_PER_SEC;
std::cout << "Asynchronous call : " << r3 << std::endl;

Could anyone explain why this is the case? In theory, the time measured for the asynchronous call should be insignificant compared to the synchronous call, because the device is only synchronized after the timed region has ended.
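To make the expectation in the question concrete, here is a minimal sketch in standard C++ only. It is an analogy, not OpenCV or CUDA code: `std::async` stands in for an asynchronous call, `pending.wait()` stands in for synchronizing the stream, and the names `AsyncTiming` and `timeAsyncWork` are hypothetical. It uses `std::chrono::steady_clock` because it measures elapsed wall time, whereas `clock()` on POSIX systems reports processor time used by the process.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

// Hypothetical result type: both durations are in milliseconds.
struct AsyncTiming {
    double launch_ms;    // time until the launching call returned
    double complete_ms;  // time until the background work finished
};

// Stand-in for an asynchronous GPU call: the work runs on another thread,
// so the caller should get control back long before the work is done.
AsyncTiming timeAsyncWork(std::chrono::milliseconds work_duration) {
    assert(work_duration.count() >= 0);
    using Clock = std::chrono::steady_clock;  // monotonic wall-clock timer

    const auto start = Clock::now();
    std::future<void> pending = std::async(std::launch::async, [work_duration] {
        std::this_thread::sleep_for(work_duration);  // simulated device work
    });
    const auto launched = Clock::now();  // end of the "launch only" region

    pending.wait();  // plays the role of waiting on the stream
    const auto completed = Clock::now();

    const std::chrono::duration<double, std::milli> launch = launched - start;
    const std::chrono::duration<double, std::milli> total = completed - start;
    return {launch.count(), total.count()};
}
```

Calling `timeAsyncWork(std::chrono::milliseconds(50))` and printing both fields shows the pattern the question expects from a truly asynchronous call: a small launch time and a completion time of at least the simulated work. If a real call shows nearly equal values for both, one possible reading is that the call blocks internally before returning.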


1 answer


answered 2018-07-15 04:30:15 -0500

updated 2018-07-16 10:52:44 -0500


Does the timing change if you include a synchronization call before and after

gpuOrb->detectAndComputeAsync(img1, cv::noArray(), KeyP, descR, false, stream1);

for either the version with or without streams?



I just noticed that your synchronization call comes after you have finished timing and not before. This would definitely explain the strange results.

cudawarped (2018-07-16 10:55:06 -0500)

Surely this is the point? We are trying to measure how long the function call itself takes (launching the kernel on the device and passing the data to the device), so it should be very small.

gerardWalsh (2018-07-16 15:17:35 -0500)

Of course you are correct: for this experiment you do not want the synchronization call inside the timed region.

Which OS are you using, what is the resolution of the clock() timer, and what times are you getting in both cases, with and without streams?

I had a quick look at the source for the CUDA kernels which are called (lines 159, 409 and 440) and, unless I am mistaken, the function you are calling should use streams correctly, provided the stream object you created is not null.

cudawarped (2018-07-17 03:45:31 -0500)
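One way to answer the timer-resolution question empirically is to spin until the timer value changes and record the smallest step seen. The sketch below uses only the C++ standard library; the function names `observedClockStepSeconds` and `observedSteadyStepSeconds` and the sample count are hypothetical choices for illustration. The observed step is an upper bound on the true resolution, since it also includes the cost of the calls themselves.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <ctime>

// Smallest observed increment of std::clock(), in seconds.
// Busy-waits, so the process keeps consuming CPU time and clock() advances.
double observedClockStepSeconds(int samples = 5) {
    assert(samples > 0);
    double smallest = 1e9;
    for (int i = 0; i < samples; ++i) {
        const std::clock_t first = std::clock();
        std::clock_t next = first;
        while (next == first) next = std::clock();  // wait for the next tick
        smallest = std::min(smallest,
                            static_cast<double>(next - first) / CLOCKS_PER_SEC);
    }
    return smallest;
}

// Same measurement for std::chrono::steady_clock (elapsed wall time).
double observedSteadyStepSeconds(int samples = 5) {
    assert(samples > 0);
    using Clock = std::chrono::steady_clock;
    double smallest = 1e9;
    for (int i = 0; i < samples; ++i) {
        const auto first = Clock::now();
        auto next = first;
        while (next == first) next = Clock::now();  // wait for the next tick
        smallest = std::min(smallest,
                            std::chrono::duration<double>(next - first).count());
    }
    return smallest;
}
```

Printing both values on the target machine gives a rough lower limit on the durations each timer can distinguish; a launch time below that limit cannot be measured reliably with a single call and would need to be averaged over many iterations.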

I am using Ubuntu 16.04, and the resolution of clock is in ms.

gerardWalsh (2018-08-08 09:14:32 -0500)

Would there be a way to keep the kernel's code in the GPU memory and only pass the new information on each iteration of a loop?

gerardWalsh (2018-08-23 08:16:18 -0500)
