
Asynchronous performance lacking

asked 2018-07-13 10:19:22 -0600 by gerardWalsh

updated 2018-07-13 10:21:14 -0600

When using CUDA to try to implement a pipelined execution model, timing an asynchronous function call gives the same measured time as the synchronous call:

cv::cuda::Stream stream1;
cv::cuda::GpuMat img1, descR, KeyP;
img1.upload(im_gray);                   // blocking upload to the device
clock_t t2 = clock();
gpuOrb->detectAndComputeAsync(img1, cv::noArray(), KeyP, descR, false, stream1);
double r3 = (clock() - t2) / (double)CLOCKS_PER_SEC;
std::cout << "Asynchronous call : " << r3 << std::endl;
stream1.waitForCompletion();

Could anyone explain why this is the case? In theory the time measured for the asynchronous call should be insignificant compared to the synchronous call, since the device is only synchronized after the call has been timed.
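For reference, one way to isolate just the enqueue cost is to wait for the upload before starting the timer, so that nothing is still pending when the asynchronous call is issued. A sketch along those lines, reusing the names from the snippet above:

cv::cuda::Stream stream1;
cv::cuda::GpuMat img1, descR, KeyP;

img1.upload(im_gray, stream1);   // enqueue the upload on the same stream
stream1.waitForCompletion();     // make sure the upload has finished before timing

clock_t t2 = clock();
gpuOrb->detectAndComputeAsync(img1, cv::noArray(), KeyP, descR, false, stream1);
double launch = (clock() - t2) / (double)CLOCKS_PER_SEC;   // host-side enqueue time only

std::cout << "Asynchronous launch : " << launch << " s" << std::endl;
stream1.waitForCompletion();     // GPU work finishes here, outside the timed region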


1 answer


answered 2018-07-15 04:30:15 -0600

updated 2018-07-16 10:52:44 -0600

Hi,

Does the timing change if you include

cudaDeviceSynchronize();

after

img1.upload(im_gray);

and after

gpuOrb->detectAndComputeAsync(img1, cv::noArray(), KeyP, descR, false, stream1);

for both the version with streams and the version without?
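Something along these lines (just a sketch of the placement; cudaDeviceSynchronize() comes from cuda_runtime.h, which the OpenCV headers do not include for you):

#include <cuda_runtime.h>

img1.upload(im_gray);
cudaDeviceSynchronize();   // ensure the upload has actually finished

clock_t t2 = clock();
gpuOrb->detectAndComputeAsync(img1, cv::noArray(), KeyP, descR, false, stream1);
cudaDeviceSynchronize();   // wait for the detector before stopping the timer
double r3 = (clock() - t2) / (double)CLOCKS_PER_SEC;
std::cout << "Time with device synchronization : " << r3 << std::endl;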


Comments

I just noticed that your call to

stream1.waitForCompletion();

comes after you have stopped timing, not before. That would explain the strange results.
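To time the complete operation, the wait would have to sit inside the timed region, something like this sketch:

clock_t t2 = clock();
gpuOrb->detectAndComputeAsync(img1, cv::noArray(), KeyP, descR, false, stream1);
stream1.waitForCompletion();   // block until the GPU has finished before reading the clock
double r3 = (clock() - t2) / (double)CLOCKS_PER_SEC;
std::cout << "Total asynchronous time : " << r3 << std::endl;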

cudawarped (2018-07-16 10:55:06 -0600)

Surely this is the point? We are trying to measure how long the function call itself takes (initiating the kernel on the device and passing the data to it), and as such it should be very small?

gerardWalsh (2018-07-16 15:17:35 -0600)

Of course you are correct, for this experiment you do not want the synchronization call.

Which OS are you using, what is the resolution of the clock() timer and what are the times you are getting in both cases, with and without streams?

I had a quick look at the source for the CUDA kernels that are called (lines 159, 409 and 440). Unless I am mistaken, the function you are calling should use streams correctly, unless for some reason the stream object you created is null.
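One way to check that the stream you created is not the default (null) stream is to look at the underlying cudaStream_t; a sketch, using the stream accessor header:

#include <opencv2/core/cuda_stream_accessor.hpp>

cudaStream_t raw = cv::cuda::StreamAccessor::getStream(stream1);
std::cout << "Underlying cudaStream_t: " << raw
          << (raw == 0 ? " (default stream)" : " (dedicated stream)") << std::endl;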

cudawarped (2018-07-17 03:45:31 -0600)

I am using Ubuntu 16.04, and the resolution of clock() is in milliseconds.
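For sub-millisecond resolution, a wall-clock timer such as std::chrono::steady_clock could be used instead of clock(); a sketch:

#include <chrono>

auto start = std::chrono::steady_clock::now();
gpuOrb->detectAndComputeAsync(img1, cv::noArray(), KeyP, descR, false, stream1);
auto stop = std::chrono::steady_clock::now();
double launch_us = std::chrono::duration<double, std::micro>(stop - start).count();
std::cout << "Asynchronous call : " << launch_us << " us" << std::endl;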

gerardWalsh (2018-08-08 09:14:32 -0600)

Would there be a way to keep the kernel code in GPU memory and only transfer the new data on each iteration of a loop?
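Something along these lines is what I have in mind, where the detector, stream and GpuMat buffers are created once and only the new frame data is uploaded each iteration (grabFrame() is just a placeholder for whatever produces the next grayscale image):

cv::Ptr<cv::cuda::ORB> gpuOrb = cv::cuda::ORB::create();
cv::cuda::Stream stream1;
cv::cuda::GpuMat img1, descR, KeyP;

for (;;)
{
    cv::Mat im_gray = grabFrame();   // placeholder: next grayscale frame
    img1.upload(im_gray, stream1);   // reuses the existing device allocation when sizes match
    gpuOrb->detectAndComputeAsync(img1, cv::noArray(), KeyP, descR, false, stream1);
    stream1.waitForCompletion();
    // ... download and use KeyP / descR here ...
}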

gerardWalsh (2018-08-23 08:16:18 -0600)
