Keep in mind that when comparing CPU and GPU implementations, there are several bottlenecks to consider, and they can make processing faster on the CPU than on the GPU.
- OpenCV's Canny operator is one of those functions that has been heavily optimized using TBB, SSE, AVX, ... As a result, the edge detector runs faster than real time at any normal resolution. On top of that, your CPU has 8 logical cores to process on, which gives you a huge speed benefit at CPU level.
- There is always a bottleneck in pushing data to and from the GPU, which you are now timing as well. Since you push each image to the GPU separately, process it, and then get the data back, this is a big bottleneck. One of the best solutions is to create a batch of data, push that to GPU memory, and retrieve all results at once.
- Finally, OpenCV performs a GPU initialization at the beginning, on the first actual GPU function call. You should thus only time iterations 2 through N and average over those to get a correct GPU timing.
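
To illustrate the last point, here is a minimal sketch of a timing loop that keeps the first iteration separate and averages only the remaining ones. It uses nothing but the Python standard library. The names `time_excluding_warmup` and `FakeGpuCall` are made up for this example, and `FakeGpuCall` only simulates a one-time initialization cost with `time.sleep`; in your own benchmark you would pass a function that wraps your actual upload, Canny, and download steps instead.

```python
import time
from statistics import mean


def time_excluding_warmup(fn, iterations=10, warmup=1):
    """Call fn() `iterations` times.

    Returns (warmup_durations, steady_mean), both in seconds, where
    steady_mean averages only the iterations after the warm-up runs.
    """
    if iterations <= warmup:
        raise ValueError("iterations must be greater than warmup")
    durations = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations[:warmup], mean(durations[warmup:])


class FakeGpuCall:
    """Hypothetical stand-in for a GPU function: the first call pays a
    one-time initialization cost, later calls only pay the per-call cost."""

    def __init__(self, init_cost=0.05, call_cost=0.002):
        self.init_cost = init_cost
        self.call_cost = call_cost
        self.initialized = False

    def __call__(self):
        if not self.initialized:
            time.sleep(self.init_cost)  # simulated one-time initialization
            self.initialized = True
        time.sleep(self.call_cost)  # simulated per-image work


if __name__ == "__main__":
    warmup_times, steady = time_excluding_warmup(FakeGpuCall(), iterations=10)
    print(f"first call (includes init): {warmup_times[0] * 1000:.2f} ms")
    print(f"mean of remaining calls:    {steady * 1000:.2f} ms")
```

Reporting the first call separately rather than discarding it shows how large the one-time cost is compared to the steady-state cost per image.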