
Canny doesn't run faster when using OpenMP

asked 2017-09-13 05:45:51 -0500

TADRIAN

updated 2017-09-13 06:33:42 -0500

I want to test the benefit of building OpenCV with -D WITH_OPENMP=ON. I have the following test code:

clock_gettime(CLOCK_REALTIME, &requestStart);
for (int i = 0; i < 100; i++) {   // 100 iterations, matching the division by 100 below
    Canny(Szene, temp, 20, 40, 3);
}
clock_gettime(CLOCK_REALTIME, &requestEnd);
// Cast so the nanosecond part is not lost to integer division.
accum = (requestEnd.tv_sec - requestStart.tv_sec) + (requestEnd.tv_nsec - requestStart.tv_nsec) / (double)BILLION;
cout << "  ________________________________________________________________" << endl;
cout << "  Canny-Filter: " << accum / 100 << " sec." << endl;
cout << "  ________________________________________________________________" << endl;

When I build the OpenCV library without the flag, the Canny filter takes 0.0121046 sec.; with the flag it takes 0.0122894 sec. So there is no benefit from OpenMP (I also tested the median and Gaussian filters). Am I doing something wrong when enabling OpenMP? Do I just need to set the flag while building, or do I need something else?
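A timing loop like the one above is easy to get slightly wrong (iteration count, clock choice, first-call overhead). Below is a minimal sketch of a reusable helper, assuming a C++11 compiler; the name `averageSeconds` and the `warmup` parameter are hypothetical and not part of OpenCV. It uses a monotonic clock and runs a few untimed calls first.

```cpp
#include <chrono>
#include <cstddef>

// Average wall-clock seconds per call of `work`, measured with a monotonic clock.
// `warmup` untimed calls run first so that one-time setup cost (for example
// starting a thread pool) is not charged to the measurement.
template <typename Fn>
double averageSeconds(Fn&& work, std::size_t iterations, std::size_t warmup = 1) {
    for (std::size_t i = 0; i < warmup; ++i) work();

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) work();  // exactly `iterations` timed calls
    const auto end = std::chrono::steady_clock::now();

    const std::chrono::duration<double> elapsed = end - start;
    return iterations == 0 ? 0.0 : elapsed.count() / static_cast<double>(iterations);
}
```

With OpenCV available, the call under test could be wrapped as `averageSeconds([&] { Canny(Szene, temp, 20, 40, 3); }, 100)`, and repeated after `cv::setNumThreads(n)` for each thread count of interest.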

Test-System: Raspberry PI 2, 4 Cores



Check the CMake output (or cv::getBuildInformation()). There should be a "Parallel framework:" entry.

berak ( 2017-09-13 06:00:32 -0500 )

cv::getBuildInformation() gives me: Parallel framework: OpenMP

TADRIAN ( 2017-09-13 06:33:26 -0500 )

What OpenCV version do you use?

In OpenCV 2.4 Canny isn't parallelized at all. Even in OpenCV 3.0 and 3.1 Canny is only parallelized with TBB and not with OpenCV's parallel framework (including OpenMP).

Another point: if you use cv::UMat, the OpenCL implementation is preferred over the CPU implementation.

Also, I don't know whether IPP works on ARM processors. If it does, and your OpenCV version is 3.2, IPP will always be preferred when apertureSize=3 and L2gradient=false. In that case you can set L2gradient=true to get multithreading support with OpenMP.

matman ( 2017-09-13 11:10:11 -0500 )

^^ imho, that would have been a perfect answer!

berak ( 2017-09-13 11:12:52 -0500 )

Thanks for the answer, matman. I'm using OpenCV 3.2 and I don't use OpenCL's UMat. I set L2gradient=true, but there is no speed improvement between the OpenMP version and the "normal" build. I tried the same task with TBB: I built the library with -DWITH_TBB=ON and -DBUILD_TBB=ON, but the timing analysis is the same (no difference for Gaussian, Canny, or median). After that I built the library for an Intel i7 with TBB support, but on that system these filters get no benefit either.

TADRIAN ( 2017-09-14 01:00:43 -0500 )

cv::medianBlur isn't parallelized at all. In OpenCV 3.3 cv::GaussianBlur is only parallelized with IPP (but disabled?). I don't know how it is done in 3.2. It should be possible to split the image into stripes and process each stripe independently in its own thread.
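The stripe idea can be sketched without OpenCV. This is only an illustration under stated assumptions: `stripeRanges` and `blurRows` are hypothetical helpers, and the vertical 3-tap mean stands in for a real filter. The point is that each stripe writes only its own rows while reading neighbouring rows from the shared source, so stripes can be processed independently and give the same result as one pass over the whole image.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical helper: split [0, rows) into at most `numStripes` contiguous,
// non-overlapping row ranges [begin, end) of nearly equal height.
std::vector<std::pair<int, int>> stripeRanges(int rows, int numStripes) {
    std::vector<std::pair<int, int>> ranges;
    if (rows <= 0) return ranges;
    numStripes = std::max(1, std::min(numStripes, rows));
    for (int s = 0; s < numStripes; ++s) {
        ranges.emplace_back(rows * s / numStripes, rows * (s + 1) / numStripes);
    }
    return ranges;
}

// Stand-in for a real filter: vertical 3-tap mean over rows [rowBegin, rowEnd)
// of a row-major image. Reads may reach one row outside the stripe (clamped at
// the image border); writes stay inside the stripe.
void blurRows(const std::vector<float>& src, std::vector<float>& dst,
              int rows, int cols, int rowBegin, int rowEnd) {
    for (int y = rowBegin; y < rowEnd; ++y) {
        const int up = std::max(y - 1, 0);
        const int down = std::min(y + 1, rows - 1);
        for (int x = 0; x < cols; ++x) {
            dst[y * cols + x] =
                (src[up * cols + x] + src[y * cols + x] + src[down * cols + x]) / 3.0f;
        }
    }
}
```

Each range returned by `stripeRanges` could then be handed to its own thread, or to the body of OpenCV's `parallel_for_`, which receives a `Range` in the same spirit.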

For cv::Canny your results are odd. There should at least be a performance difference when you set different numbers of threads. Can you make a debug build and step into the functions to check whether multithreading works correctly? Please also check what cv::getNumThreads() returns. Try setting the number of threads to the number of your physical CPU cores (not logical ones).

When you use Windows and you don't set a multithreading library I think the Concurrency framework is used. Please check this in your "normal build", too.

matman ( 2017-09-14 11:38:37 -0500 )

cv::getNumThreads() returns 4 (I'm running a Linux VM with 4 cores). With TBB on, the debugger steps into: parallel_for_(Range(0, src.rows), parallelCanny(src, map, low, high, aperture_size, L2gradient, &borderPeaksParallel), numOfThreads); Do you see a change in Canny's runtime when you build your library with TBB support? In my "normal" build pthreads is active. There Canny takes 0.00478609 sec., but when I set the number of threads to 1 it takes 0.00427457 sec. So it is faster when not parallelized?

TADRIAN ( 2017-09-15 04:05:55 -0500 )

1 answer


answered 2017-09-15 10:57:57 -0500

matman

updated 2017-09-15 11:27:46 -0500

I made a quick test (no loops, just a few manual iterations) of Canny and GaussianBlur inside my image processing library at work. My system is an i7-7700 with 4 cores and 8 threads, running Windows with Visual Studio 2015. I used a random 4 MP grayscale image. The results for Canny are:


  • 1 thread: 13.5ms
  • 2 threads: 9ms
  • 4 threads: 6.5ms
  • 8 threads: 6.5ms

And for GaussianBlur:

  • 1 thread: 4ms
  • 2 threads: 3ms
  • 4 threads: 2ms
  • 8 threads: 1ms

For parallel GaussianBlur I used this implementation:

#include <algorithm>
#include <opencv2/core.hpp>
#include <opencv2/core/ocl.hpp>
#include <opencv2/imgproc.hpp>

using namespace cv;

// Blurs one horizontal stripe of the image per invocation.
class ParallelGaussianBlurImpl_ : public ParallelLoopBody
{
public:
    ParallelGaussianBlurImpl_(const Mat &_src, Mat &_dst, Size _kSize, double _sigmaX, double _sigmaY, int _borderType) :
        src(_src), dst(_dst), kSize(_kSize), sigmaX(_sigmaX), sigmaY(_sigmaY), borderType(_borderType)
    {}

    ParallelGaussianBlurImpl_& operator=(const ParallelGaussianBlurImpl_&) { return *this; }

    // Without BORDER_ISOLATED, rows next to the stripe are read from the parent image.
    inline void operator()(const Range &r) const {
        cv::GaussianBlur(src.rowRange(r.start, r.end), dst.rowRange(r.start, r.end), kSize, sigmaX, sigmaY, borderType);
    }

private:
    const Mat &src;
    Mat &dst;
    Size kSize;
    double sigmaX, sigmaY;
    int borderType;
};

void parallelGaussianBlur(InputArray _src, OutputArray _dst, Size kSize, double sigmaX, double sigmaY, int borderType) {
    const int numThreads = ocl::useOpenCL() ? 1 : std::max(1, std::min(getNumThreads(), getNumberOfCPUs()));

    // Fall back to the plain call for a single thread or for isolated borders.
    if (numThreads == 1 || (borderType & BORDER_ISOLATED)) {
        cv::GaussianBlur(_src, _dst, kSize, sigmaX, sigmaY, borderType);
    } else {
        _dst.create(_src.size(), _src.type());
        Mat src = _src.getMat(), dst = _dst.getMat();
        parallel_for_(Range(0, _src.rows()), ParallelGaussianBlurImpl_(src, dst, kSize, sigmaX, sigmaY, borderType), numThreads);
    }
}

It seems that Canny does not scale well with hyperthreading, but GaussianBlur does. It's possible that other processes distorted the test a bit, but the trend is clear. Did you test with setNumThreads() or just with different builds?
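To put numbers on the scaling, speedup and parallel efficiency can be computed from the timings listed above. A small sketch follows; the function names `speedup` and `efficiency` are hypothetical. For example, Canny goes from 13.5 ms to 6.5 ms at 4 threads, a speedup of roughly 2.1x, while GaussianBlur goes from 4 ms to 1 ms at 8 threads, a speedup of 4x.

```cpp
// Speedup of a multi-threaded run relative to the single-thread time.
double speedup(double singleThreadMs, double multiThreadMs) {
    return singleThreadMs / multiThreadMs;
}

// Parallel efficiency: speedup divided by thread count (1.0 means perfect scaling).
double efficiency(double singleThreadMs, double multiThreadMs, int threads) {
    return speedup(singleThreadMs, multiThreadMs) / static_cast<double>(threads);
}
```

With the figures from this answer, `efficiency(13.5, 6.5, 4)` is about 0.52 for Canny and `efficiency(4.0, 1.0, 8)` is 0.5 for GaussianBlur. Canny's time stays at 6.5 ms with 8 threads, so its efficiency halves there, which matches the observation about hyperthreading.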

EDIT: One more thing that comes to mind: you said that you are working in a VM. Check that the number of cores assigned to your VM is greater than 1. Otherwise I have no more ideas.



Set the Cores in the VM to 4, cv::getNumberOfCPUs() returns 4

I tried your parallel GaussianBlur class on a 1620 x 1080 picture with TBB on (Parallel framework: TBB (ver 4.4 interface 9003)) and got the following results:

setNumThreads = 1: 0.0166213 sec.

setNumThreads = 2: 0.0123465 sec.

setNumThreads = 4: 0.0164041 sec.

setNumThreads = 8: 0.0133491 sec.

Do these results make sense? Why is it faster with 2 or 8 threads than with 4? Thanks for your help.

TADRIAN ( 2017-09-18 01:31:55 -0500 )

No, that makes no sense in my opinion. Are the output images of GaussianBlur the same? Performance can stall if something else is the bottleneck, for example memory. Have you tried running the test on your host system?
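When single measurements jump around like this, one option is to time each configuration several times and compare medians, which are less sensitive to an occasional slow run caused by other processes or the VM. A minimal sketch, assuming the per-run times have already been collected; `medianSeconds` is a hypothetical helper name:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Median of a set of timing samples (in seconds). Takes the vector by value
// so the caller's data is left unsorted. Returns 0.0 for an empty set.
double medianSeconds(std::vector<double> samples) {
    if (samples.empty()) return 0.0;
    std::sort(samples.begin(), samples.end());
    const std::size_t mid = samples.size() / 2;
    return samples.size() % 2 == 1 ? samples[mid]
                                   : 0.5 * (samples[mid - 1] + samples[mid]);
}
```

If the medians for 2, 4, and 8 threads still do not follow a consistent trend, noise is less likely to be the explanation and a bottleneck such as memory becomes more plausible.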

matman ( 2017-09-18 12:55:15 -0500 )