Canny doesn't run faster when using OpenMP
I want to test the benefits of building OpenCV with -D WITH_OPENMP=ON. I have the following test code:
// CLOCK_MONOTONIC is not affected by changes to the system clock
clock_gettime(CLOCK_MONOTONIC, &requestStart);
for (int i = 0; i < 100; i++) {  // exactly 100 runs, matching the division by 100 below
    Canny(Szene, temp, 20, 40, 3);
}
clock_gettime(CLOCK_MONOTONIC, &requestEnd);
// divide as double so the nanosecond part is not lost to integer division
accum = (requestEnd.tv_sec - requestStart.tv_sec)
      + (requestEnd.tv_nsec - requestStart.tv_nsec) / (double)BILLION;
cout << " ________________________________________________________________" << endl;
cout << " Canny-Filter: " << accum / 100 << " sec." << endl;
cout << " ________________________________________________________________" << endl;
When I build the OpenCV library without the flag, the Canny filter needs 0.0121046 sec.; with the flag it needs 0.0122894 sec. So there is no benefit from using OpenMP (I also tested the median and Gaussian filters). Am I doing something wrong when enabling OpenMP? Do I just need to set the flag while building, or do I need something else?
Test system: Raspberry Pi 2, 4 cores
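A minimal sketch of the same measurement using std::chrono instead of clock_gettime, assuming C++11 or later. It times exactly `runs` calls after one untimed warm-up call, uses the monotonic steady_clock, and keeps fractional seconds in a double. That avoids two common pitfalls of hand-rolled timing loops: a loop bound that does not match the divisor (`i <= 100` runs 101 times) and integer division of the nanosecond difference. The helper name meanSeconds is hypothetical.

```cpp
#include <chrono>
#include <functional>

// Calls `work` once untimed (warm-up), then exactly `runs` timed times,
// and returns the mean wall-clock time per call in seconds.
double meanSeconds(const std::function<void()>& work, int runs) {
    work();  // warm-up: excludes one-time costs such as the first output allocation
    const auto start = std::chrono::steady_clock::now();  // monotonic clock
    for (int i = 0; i < runs; ++i) {  // exactly `runs` iterations, same as the divisor
        work();
    }
    const auto end = std::chrono::steady_clock::now();
    // duration<double> keeps fractional seconds, so nothing is truncated
    const std::chrono::duration<double> total = end - start;
    return total.count() / runs;
}
```

Usage, assuming the question's Szene and temp images: `double sec = meanSeconds([&] { cv::Canny(Szene, temp, 20, 40, 3); }, 100);`. The warm-up call matters because the first Canny call also allocates the output image.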
Check the CMake output (or cv::getBuildInformation()). There should be a "Parallel framework:" entry.
cv::getBuildInformation() gives me: Parallel framework: OpenMP
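If you want to check this from code rather than by reading the CMake log, a small helper can pull the entry out of the text. This sketch assumes the entry sits on one line as `Parallel framework: <name>`, which is how it appears in the output quoted here; the function name parallelFramework is hypothetical.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: returns the value after "Parallel framework:" in a
// build-information text, or "" if the entry is missing or empty.
std::string parallelFramework(const std::string& buildInfo) {
    const std::string key = "Parallel framework:";
    const char* blanks = " \t\r";  // whitespace to trim around the value
    std::istringstream lines(buildInfo);
    std::string line;
    while (std::getline(lines, line)) {
        const auto pos = line.find(key);
        if (pos == std::string::npos) continue;
        const std::string value = line.substr(pos + key.size());
        const auto first = value.find_first_not_of(blanks);
        if (first == std::string::npos) return "";
        const auto last = value.find_last_not_of(blanks);
        return value.substr(first, last - first + 1);
    }
    return "";
}
```

Pass it the string returned by cv::getBuildInformation() and print the result; for the build described above it would report OpenMP.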
What OpenCV version do you use?
In OpenCV 2.4 Canny isn't parallelized at all. Even in OpenCV 3.0 and 3.1 Canny is only parallelized with TBB and not with OpenCV's parallel framework (including OpenMP).
Another thing: if you use cv::UMat, OpenCL is preferred over the CPU implementation. Next, I don't know whether IPP works on ARM processors. If it does and your OpenCV version is 3.2, IPP will always be preferred when using apertureSize=3 and L2gradient=false. In this case you can set L2gradient=true to get multithreading support with OpenMP.
^^ IMHO, that would have been a perfect answer!
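The L2gradient flag also changes what Canny computes, not only which code path runs. Per the OpenCV documentation, false uses the L1 norm |dI/dx| + |dI/dy| and true uses the L2 norm sqrt((dI/dx)^2 + (dI/dy)^2) for the gradient magnitude. This sketch only evaluates both norms for one pixel; gx and gy are assumed Sobel responses and the function name is hypothetical.

```cpp
#include <cmath>
#include <cstdlib>

// Gradient magnitude as selected by Canny's L2gradient flag (per the OpenCV docs):
//   L2gradient == false -> L1 norm |gx| + |gy|        (default, cheaper)
//   L2gradient == true  -> L2 norm sqrt(gx^2 + gy^2)  (more accurate)
double gradientMagnitude(int gx, int gy, bool L2gradient) {
    if (L2gradient) {
        return std::sqrt(static_cast<double>(gx) * gx + static_cast<double>(gy) * gy);
    }
    return std::abs(gx) + std::abs(gy);
}
```

Because the L2 norm never exceeds the L1 norm for the same derivatives, the same thresholds (20 and 40 in the question) can produce a somewhat different edge map, so timings with the two settings compare slightly different outputs.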
Thanks for the answer, matman | I'm using OpenCV 3.2. | I don't use OpenCL's UMat. | I set L2gradient=true, but there is no speed improvement between the OpenMP version and the "normal" build. | I tried the same task with TBB: I built the library with -DWITH_TBB=ON and -DBUILD_TBB=ON, but the timing analysis is the same (no difference for Gaussian, Canny, or median). After that I built the library for an Intel i7 with TBB support, but on that system these filters get no benefit either.
cv::medianBlur isn't parallelized at all. In OpenCV 3.3, cv::GaussianBlur is only parallelized with IPP (but disabled?). I don't know how it is done in 3.2. It should be possible to split the image into stripes and process each stripe independently in its own thread.
For cv::Canny your results are odd. There should at least be a performance difference when setting a different number of threads. Can you make a debug build and step into the functions to check whether multithreading works correctly? And please check what cv::getNumberOfThreads() returns. Try setting the number of threads to the number of physical CPU cores (not logical ones).
When you use Windows and don't set a multithreading library, I think the Concurrency framework is used. Please check this in your "normal" build, too.
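The stripe idea can be sketched with plain std::thread, independent of OpenCV. Assumptions: a hypothetical row-major Gray type, a 3x3 box filter standing in for GaussianBlur or medianBlur, and C++17 for std::clamp. Each thread reads the shared source and writes only its own rows of the destination, so no locking is needed. This is an illustration of the pattern, not how OpenCV implements these filters; inside OpenCV the same idea is usually expressed with cv::parallel_for_ over a cv::Range of rows, as in the parallel_for_(Range(0, src.rows), ...) call from the Canny source.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical minimal grayscale image: row-major bytes.
struct Gray {
    int rows = 0, cols = 0;
    std::vector<std::uint8_t> px;
    std::uint8_t at(int r, int c) const { return px[static_cast<std::size_t>(r) * cols + c]; }
};

// 3x3 mean filter for output rows [r0, r1); borders are replicated (clamped).
// Reads only src and writes only its own rows of dst, so stripes are independent.
static void boxBlurRows(const Gray& src, Gray& dst, int r0, int r1) {
    for (int r = r0; r < r1; ++r) {
        for (int c = 0; c < src.cols; ++c) {
            int sum = 0;
            for (int dr = -1; dr <= 1; ++dr) {
                for (int dc = -1; dc <= 1; ++dc) {
                    const int rr = std::clamp(r + dr, 0, src.rows - 1);
                    const int cc = std::clamp(c + dc, 0, src.cols - 1);
                    sum += src.at(rr, cc);
                }
            }
            dst.px[static_cast<std::size_t>(r) * src.cols + c] = static_cast<std::uint8_t>(sum / 9);
        }
    }
}

// Splits the rows into at most `nthreads` contiguous stripes, one std::thread each.
Gray boxBlurParallel(const Gray& src, int nthreads) {
    Gray dst{src.rows, src.cols, std::vector<std::uint8_t>(src.px.size())};
    nthreads = std::max(1, std::min(nthreads, src.rows));
    const int step = (src.rows + nthreads - 1) / nthreads;  // ceil(rows / nthreads)
    std::vector<std::thread> workers;
    for (int r0 = 0; r0 < src.rows; r0 += step) {
        const int r1 = std::min(r0 + step, src.rows);
        workers.emplace_back(boxBlurRows, std::cref(src), std::ref(dst), r0, r1);
    }
    for (auto& t : workers) t.join();  // wait for every stripe
    return dst;
}
```

Comparing boxBlurParallel(img, 1) with boxBlurParallel(img, 4) is a quick check that the split does not change the result. For small images, starting the threads can easily cost more than the filtering itself.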
cv::getNumberOfThreads() returns 4 (I'm running a Linux VM with 4 cores). | When debugging, the program steps into (with TBB on): parallel_for_(Range(0, src.rows), parallelCanny(src, map, low, high, aperture_size, L2gradient, &borderPeaksParallel), numOfThreads); Do you get a change in Canny runtime when you build your library with TBB support? | If I run my normal build, pthreads is active: Canny needs 0.00478609 sec, but when I set the thread count down to 1 it needs 0.00427457 sec. So it's faster when not parallelized?
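To see whether more threads help at all, it can be easier to time the same call at several thread counts within one run instead of comparing separate builds. A minimal sketch, assuming you pass in both the function that sets the thread count and the work to time; the name sweepThreads is hypothetical.

```cpp
#include <chrono>
#include <functional>
#include <utility>
#include <vector>

// For each thread count n: call setThreads(n), do one untimed warm-up call,
// then time `runs` calls of work() and record (n, mean seconds per call).
std::vector<std::pair<int, double>> sweepThreads(const std::vector<int>& counts,
                                                 const std::function<void(int)>& setThreads,
                                                 const std::function<void()>& work,
                                                 int runs) {
    std::vector<std::pair<int, double>> results;
    for (int n : counts) {
        setThreads(n);
        work();  // warm-up after changing the thread count
        const auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < runs; ++i) work();
        const std::chrono::duration<double> total = std::chrono::steady_clock::now() - start;
        results.emplace_back(n, total.count() / runs);
    }
    return results;
}
```

With OpenCV you could pass `[](int n) { cv::setNumThreads(n); }` as setThreads and `[&] { cv::Canny(Szene, temp, 20, 40, 3); }` as work, with counts such as {1, 2, 4}; OpenCV's documented accessor cv::getNumThreads() reports the setting actually in effect.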