Asynchronous performance lacking
I am using CUDA to try to implement a pipelined execution model. When I time an asynchronous function call, the measured time is the same as for the synchronous version of the call:
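cv::cuda::Stream stream1;
cv::cuda::GpuMat img1, descR, KeyP;
img1.upload(im_gray);  // blocking upload of the grayscale input image
// Time only the asynchronous call itself, before synchronizing.
clock_t t2 = clock();
gpuOrb->detectAndComputeAsync(img1, cv::noArray(), KeyP, descR, false, stream1);
double r3 = (clock() - t2) / (double)CLOCKS_PER_SEC;
std::cout << "Asynchronous call : " << r3 << std::endl;
stream1.waitForCompletion();  // block until the work queued on stream1 finishes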
Could anyone explain why this is the case? In theory, the time measured for the asynchronous call should be insignificant compared to the synchronous call, since the device is only synchronized after the measurement has been taken.
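To make the expected behaviour concrete, here is a minimal sketch of the measurement pattern, under stated assumptions. It does not use OpenCV or CUDA at all: `std::async` stands in for the stream, a sleeping worker thread stands in for the GPU work, and `Timings` and `measureAsync` are hypothetical names made up for this illustration. It also uses `std::chrono::steady_clock` rather than `clock()`, because on POSIX systems `clock()` reports processor time rather than elapsed wall time.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

// Wall-clock timings, in seconds, for one asynchronous launch.
struct Timings {
    double launch;  // time spent inside the call that starts the work
    double total;   // time until the work has actually finished
};

// Hypothetical stand-in for a GPU operation: sleeps for `work` on a worker thread.
// std::async plays the role of the stream here; it is not the OpenCV CUDA API.
Timings measureAsync(std::chrono::milliseconds work) {
    assert(work.count() >= 0);  // a negative duration makes no sense here

    using Clock = std::chrono::steady_clock;  // monotonic wall-clock timer
    const auto start = Clock::now();

    // Start the work on another thread; the call returns without waiting for it.
    std::future<void> done = std::async(std::launch::async, [work] {
        std::this_thread::sleep_for(work);
    });
    const auto launched = Clock::now();

    done.wait();  // plays the role of stream1.waitForCompletion()
    const auto finished = Clock::now();

    return {std::chrono::duration<double>(launched - start).count(),
            std::chrono::duration<double>(finished - start).count()};
}
```

With a truly asynchronous call, `measureAsync(std::chrono::milliseconds(200))` should report a `launch` time far smaller than `total`. If the same pattern applied to `detectAndComputeAsync` shows `launch` roughly equal to `total`, that would suggest the call itself is blocking somewhere before it returns.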