Poor OpenCL performance

Hi, I am trying to run the detectMultiScale function on the GPU using the OpenCL module. It is supposed to run faster, but it does not; in fact, it is 3-4 times slower than the CPU implementation. I have tested it on both an Intel HD Graphics 4000 and an NVIDIA GT650M and got the same result. Has anyone run into the same problem, and is there a solution?

It would be nice to know which CPU you are comparing those GPUs against. The GPUs you mention aren't really powerful ones.

Hello, I have run detectMultiScale with CUDA on a Dell T7600 with 2 CPUs (4 cores each, 1.8 GHz) and a Quadro 4000, as well as the TBB CPU version of that function, but my results are not the same as yours:

+ The TBB (4.2 update 2) CPU version only utilized about 60% of CPU resources and ran at 14 fps.
+ The CUDA (5.5) version used just 1 CPU core and reached 24 fps.

I used OpenCV 3.0.0-dev built with VS 2012 update 4 on 32-bit Windows 7, with CUDA 5.5. I think you should build OpenCV with CUDA yourself to get better results.
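
When comparing fps numbers like the ones above, it may help to time the CPU and GPU paths the same way. Below is a minimal sketch of a timing harness. It assumes the first few GPU calls may include one-off setup costs (for example kernel compilation or context creation), so it leaves them out of the measurement. The names benchmark, detect and fake_detect are made up for illustration and are not part of OpenCV; detect stands for any function that wraps your own detectMultiScale call.

```python
import statistics
import time


def benchmark(detect, frame, warmup=3, runs=20):
    """Time detect(frame), skipping warm-up calls; return (median seconds, fps)."""
    # Warm-up calls are not timed: the first GPU calls may include one-off
    # setup costs such as kernel compilation or context creation (assumption).
    for _ in range(warmup):
        detect(frame)

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        detect(frame)
        timings.append(time.perf_counter() - start)

    # The median is less sensitive to occasional slow calls than the mean.
    median = statistics.median(timings)
    fps = 1.0 / median if median > 0 else float("inf")
    return median, fps


if __name__ == "__main__":
    # Hypothetical stand-in workload so the sketch runs without OpenCV;
    # replace it with a function that calls your CPU or GPU detector.
    def fake_detect(frame):
        return sum(x * x for x in frame)

    median, fps = benchmark(fake_detect, list(range(10000)))
    print(f"median {median * 1000:.3f} ms per call, {fps:.1f} fps")
```

Running the same harness once with the CPU detector and once with the GPU detector, on the same frames, should give numbers that are easier to compare than a single timed call.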

