Keep in mind that when comparing CPU and GPU implementations there are several bottlenecks to consider, and they can make the CPU version faster than the GPU version.

The OpenCV Canny operator is one of those functions that has been heavily optimized using TBB, SSE, AVX, ... As a result, the edge detector runs faster than real time at any normal resolution. This ties in directly with the fact that your CPU has 8 logical cores to process on, which gives you a huge speed benefit at CPU level.
There is always a bottleneck in pushing data to and from the GPU, and you are currently including that transfer in your timing. Since you push a single image to the GPU, process it, and then fetch the result back, each time, this is a big bottleneck. One of the best solutions is to create a batch of data, push it to GPU memory in one go, and retrieve all results at once.
Finally, OpenCV initializes the GPU on the first actual GPU function call, so the first iteration includes that one-time setup cost. You should thus only time from iteration 2 until N, and average over that, to get a correct GPU timing.
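As a rough illustration of that last point, here is a minimal sketch of a timing helper that discards the first call(s) before averaging. The name `average_time` and its parameters are hypothetical and not part of OpenCV; `fn` stands for whatever you want to measure, for example your upload + Canny + download step.

```python
import time


def average_time(fn, iterations=10, warmup=1):
    """Call fn() `iterations` times and return the mean duration in seconds,
    ignoring the first `warmup` calls (hypothetical helper, not an OpenCV API)."""
    if iterations <= warmup:
        raise ValueError("iterations must be greater than warmup")

    durations = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()  # the workload being measured
        durations.append(time.perf_counter() - start)

    # Drop the warm-up calls so one-time initialization is not averaged in.
    timed = durations[warmup:]
    return sum(timed) / len(timed)
```

With `warmup=1` this averages over iterations 2 until N, as described above. Timing the upload, the processing, and the download in separate calls to such a helper would also show how much of the total is transfer overhead.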