
Asynchronous performance lacking

asked 2018-07-13 10:19:22 -0600 by gerardWalsh

updated 2018-07-13 10:21:14 -0600

When using CUDA to try to implement a pipelined execution model, timing an asynchronous function call gives the same measured time as the synchronous call:

cv::cuda::Stream stream1;
cv::cuda::GpuMat img1, descR, KeyP;
img1.upload(im_gray);                   // blocking upload to the device
clock_t t2 = clock();
gpuOrb->detectAndComputeAsync(img1, cv::noArray(), KeyP, descR, false, stream1);
double r3 = (clock() - t2) / (double)CLOCKS_PER_SEC;
std::cout << "Asynchronous call : " << r3 << std::endl;
stream1.waitForCompletion();

Could anyone explain why this is the case? In theory the time measured for the asynchronous call should be insignificant compared to the synchronous call, since the device is only synchronized after the call has been timed.
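For reference, one way to isolate just the enqueue cost is to wait for the upload before starting the timer, so that nothing is still pending when the asynchronous call is issued. A sketch along those lines, reusing the names from the snippet above:

cv::cuda::Stream stream1;
cv::cuda::GpuMat img1, descR, KeyP;

img1.upload(im_gray, stream1);   // enqueue the upload on the same stream
stream1.waitForCompletion();     // make sure the upload has finished before timing

clock_t t2 = clock();
gpuOrb->detectAndComputeAsync(img1, cv::noArray(), KeyP, descR, false, stream1);
double launch = (clock() - t2) / (double)CLOCKS_PER_SEC;   // host-side enqueue time only

std::cout << "Asynchronous launch : " << launch << " s" << std::endl;
stream1.waitForCompletion();     // GPU work finishes here, outside the timed region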


1 answer


answered 2018-07-15 04:30:15 -0600

updated 2018-07-16 10:52:44 -0600

Hi,

Does the timing change if you include

cudaDeviceSynchronize();

after

img1.upload(im_gray);

and after

gpuOrb->detectAndComputeAsync(img1, cv::noArray(), KeyP, descR, false, stream1);

for both the version with streams and the version without?
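Something along these lines (just a sketch of the placement; cudaDeviceSynchronize() comes from cuda_runtime.h, which the OpenCV headers do not include for you):

#include <cuda_runtime.h>

img1.upload(im_gray);
cudaDeviceSynchronize();   // ensure the upload has actually finished

clock_t t2 = clock();
gpuOrb->detectAndComputeAsync(img1, cv::noArray(), KeyP, descR, false, stream1);
cudaDeviceSynchronize();   // wait for the detector before stopping the timer
double r3 = (clock() - t2) / (double)CLOCKS_PER_SEC;
std::cout << "Time with device synchronization : " << r3 << std::endl;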


Comments

I just noticed that your call to

stream1.waitForCompletion();

comes after you have stopped timing, not before. That would explain the strange results.
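To time the complete operation, the wait would have to sit inside the timed region, something like this sketch:

clock_t t2 = clock();
gpuOrb->detectAndComputeAsync(img1, cv::noArray(), KeyP, descR, false, stream1);
stream1.waitForCompletion();   // block until the GPU has finished before reading the clock
double r3 = (clock() - t2) / (double)CLOCKS_PER_SEC;
std::cout << "Total asynchronous time : " << r3 << std::endl;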

cudawarped (2018-07-16 10:55:06 -0600)

Surely this is the point? We are trying to measure how long the function call itself takes (initiating the kernel on the device and passing the data to it), and as such it should be very small?

gerardWalsh (2018-07-16 15:17:35 -0600)

Of course you are correct, for this experiment you do not want the synchronization call.

Which OS are you using, what is the resolution of the clock() timer and what are the times you are getting in both cases, with and without streams?

I had a quick look at the source for the CUDA kernels that are called (lines 159, 409 and 440). Unless I am mistaken, the function you are calling should use streams correctly, unless for some reason the stream object you created is null.
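One way to check that the stream you created is not the default (null) stream is to look at the underlying cudaStream_t; a sketch, using the stream accessor header:

#include <opencv2/core/cuda_stream_accessor.hpp>

cudaStream_t raw = cv::cuda::StreamAccessor::getStream(stream1);
std::cout << "Underlying cudaStream_t: " << raw
          << (raw == 0 ? " (default stream)" : " (dedicated stream)") << std::endl;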

cudawarped (2018-07-17 03:45:31 -0600)

I am using Ubuntu 16.04, and the resolution of clock() is in milliseconds.
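For sub-millisecond resolution, a wall-clock timer such as std::chrono::steady_clock could be used instead of clock(); a sketch:

#include <chrono>

auto start = std::chrono::steady_clock::now();
gpuOrb->detectAndComputeAsync(img1, cv::noArray(), KeyP, descR, false, stream1);
auto stop = std::chrono::steady_clock::now();
double launch_us = std::chrono::duration<double, std::micro>(stop - start).count();
std::cout << "Asynchronous call : " << launch_us << " us" << std::endl;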

gerardWalsh (2018-08-08 09:14:32 -0600)

Would there be a way to keep the kernel code in GPU memory and only transfer the new data on each iteration of a loop?
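Something along these lines is what I have in mind, where the detector, stream and GpuMat buffers are created once and only the new frame data is uploaded each iteration (grabFrame() is just a placeholder for whatever produces the next grayscale image):

cv::Ptr<cv::cuda::ORB> gpuOrb = cv::cuda::ORB::create();
cv::cuda::Stream stream1;
cv::cuda::GpuMat img1, descR, KeyP;

for (;;)
{
    cv::Mat im_gray = grabFrame();   // placeholder: next grayscale frame
    img1.upload(im_gray, stream1);   // reuses the existing device allocation when sizes match
    gpuOrb->detectAndComputeAsync(img1, cv::noArray(), KeyP, descR, false, stream1);
    stream1.waitForCompletion();
    // ... download and use KeyP / descR here ...
}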

gerardWalsh (2018-08-23 08:16:18 -0600)
