Is there an OpenCV-related question still to come?
Regarding your purely CUDA question:
- Have you set the environment variable CUDA_LAUNCH_BLOCKING=1? I ask because kernel launches are asynchronous, so you are timing the launch, not the kernel execution. If that variable is not set, take the end time after the call to cudaDeviceSynchronize() so that you measure the kernel execution and not just the time the runtime API takes to launch the kernel.
- Your timers have a resolution of 10-16 ms and your results are between 2 and 30 ms, so the measurements are of the same order as the timer resolution and cannot be trusted. You need to use high-resolution timers or, better still, CUDA events.
- Before proceeding, you can confirm the execution time of the kernels by running your application under nvprof on the command line, or in the nvvp GUI (C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\vXX.x\libnvvp\nvvp.exe), and examining the output.
- It is possible that your kernel is not operating as expected. The formatting of the posted code is off, so I cannot confirm whether it is correct.
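To make the two timing points above concrete, here is a minimal sketch of both approaches: CUDA events, and a host timer that is stopped only after cudaDeviceSynchronize(). The kernel `scale`, the problem size, and the launch configuration are hypothetical stand-ins, not your code, so substitute your own kernel and arguments. I am also assuming a warm-up launch is wanted so that one-time initialization is not included in the measurement.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; replace with the kernel you want to time.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;      // assumed problem size
    const int threads = 256;    // assumed block size
    const int blocks = (n + threads - 1) / threads;

    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    // Warm-up launch so one-time initialization cost is not measured.
    scale<<<blocks, threads>>>(d_data, n, 2.0f);
    cudaDeviceSynchronize();

    // Option 1: CUDA events, recorded in the same stream as the kernel.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    scale<<<blocks, threads>>>(d_data, n, 2.0f);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the stop event has completed
    float event_ms = 0.0f;
    cudaEventElapsedTime(&event_ms, start, stop);

    // Option 2: high-resolution host timer. The launch returns immediately,
    // so synchronize BEFORE reading the end time.
    auto t0 = std::chrono::steady_clock::now();
    scale<<<blocks, threads>>>(d_data, n, 2.0f);
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();
    double host_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();

    // Report any launch or execution error before trusting the timings.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("CUDA events: %.3f ms\n", event_ms);
    printf("Host timer : %.3f ms\n", host_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

The host-timer figure includes launch overhead as well as execution, so expect it to be somewhat larger than the event figure. If you move the cudaDeviceSynchronize() call after the second clock read, the host timer reports only the launch time, which is the situation described in the first point.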
Additionally, you have not mentioned which GPU and CPU you are using. Once the above points are addressed, you may still see poor performance if the GPU is slow.