Performance of a GPU implementation on NVIDIA Jetson TK1
Hello,
I recently implemented a simple algorithm on the GPU that operates on the pixels of an image. It performs operations such as additions, subtractions, multiplications, divisions, max, split, merge, etc. I am running the experiments on an NVIDIA Jetson TK1 and comparing the performance of the same algorithm on the GPU and on the CPU (the CPU version is an adaptation of the GPU code). The image is RGB, 720x576.
I found that the algorithm is much slower on the GPU than on the CPU.
Most of the time is spent on memory allocation. For example, this operation
cv::gpu::GpuMat img_gpu(host_img.size(), CV_8UC3);
img_gpu.upload(host_img);
takes 2 seconds on average. If I skip the explicit allocation and let upload allocate the memory (I assume it does), i.e.
cv::gpu::GpuMat img_gpu;
img_gpu.upload(host_img);
it takes (roughly) the same time.
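For reference, here is a minimal sketch of how such a measurement can be split into the first call and the average of the following calls, so that any one-time setup cost shows up separately from the steady-state cost. It is not my actual benchmark code: `time_first_and_rest`, `CallTimings` and `FakeUpload` are hypothetical names, and `FakeUpload` is only a CPU-side stand-in (a plain memory copy of a 720x576 RGB buffer) so that the snippet compiles without OpenCV.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Timings in milliseconds: the first call, and the mean of all later calls.
struct CallTimings {
    double first_ms;
    double mean_rest_ms;
};

// Calls `op` `runs` times (clamped to at least 2) and reports the first call
// separately, so one-time setup cost is not averaged into the other calls.
template <typename Op>
CallTimings time_first_and_rest(Op op, int runs) {
    using clock = std::chrono::steady_clock;
    using ms = std::chrono::duration<double, std::milli>;
    if (runs < 2) runs = 2;

    const auto t0 = clock::now();
    op();  // first call
    const double first = ms(clock::now() - t0).count();

    const auto t1 = clock::now();
    for (int i = 1; i < runs; ++i) op();  // remaining runs - 1 calls
    const double rest = ms(clock::now() - t1).count() / (runs - 1);

    return CallTimings{first, rest};
}

// Hypothetical stand-in for the upload (assumption: NOT the OpenCV call).
// It copies a 720x576, 3-channel, 8-bit buffer into a destination buffer.
struct FakeUpload {
    std::vector<std::uint8_t> src;
    std::vector<std::uint8_t> dst;
    FakeUpload() : src(720u * 576u * 3u, 7), dst() {}
    void operator()() {
        dst.resize(src.size());  // allocates on the first call only
        std::memcpy(dst.data(), src.data(), src.size());
    }
};
```

To time the real upload, the callable would wrap the upload itself, for example `[&]{ img_gpu.upload(host_img); }`, and the two numbers in `CallTimings` can then be compared.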
Other operations are also faster on the CPU, such as split, convertTo and max. The code is pretty simple: it is just a sequence of operations, one after the other, applied to the RGB channels independently.
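To give an idea of the kind of per-channel work involved, here is a rough CPU-side sketch in plain C++. It is only an illustration and not the actual code: `Planes`, `split_to_float` and `channel_max` are hypothetical names, the interleaved R, G, B channel order is an assumption, and the real code uses the OpenCV operations rather than hand-written loops.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Three separate float planes, one per channel.
struct Planes {
    std::vector<float> r, g, b;
};

// Splits an interleaved 8-bit image (assumed order: R, G, B) into three
// float planes, i.e. the equivalent of a split followed by a conversion.
Planes split_to_float(const std::vector<std::uint8_t>& rgb) {
    const std::size_t n = rgb.size() / 3;  // number of pixels
    Planes p{std::vector<float>(n), std::vector<float>(n), std::vector<float>(n)};
    for (std::size_t i = 0; i < n; ++i) {
        p.r[i] = static_cast<float>(rgb[3 * i + 0]);
        p.g[i] = static_cast<float>(rgb[3 * i + 1]);
        p.b[i] = static_cast<float>(rgb[3 * i + 2]);
    }
    return p;
}

// Per-pixel maximum over the three planes.
std::vector<float> channel_max(const Planes& p) {
    std::vector<float> out(p.r.size());
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = std::max(p.r[i], std::max(p.g[i], p.b[i]));
    }
    return out;
}
```

Each output pixel depends only on the same pixel of the input, so there is no dependency between neighbouring pixels in these steps.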
My questions are: Am I doing something wrong? Am I missing something? Is the cache of the GPU too small to handle a 720x576 image?
Thanks in advance,
Durden