
TBB affecting HoughCircles?

asked 2016-09-01 03:46:56 -0600

bradfield

updated 2016-09-02 07:52:09 -0600

I recently built OpenCV with CMake using the option WITH_TBB=ON on a Raspberry Pi 3.

The code I wanted to accelerate is basically a circle detection with HoughCircles. Unfortunately, the CPU usage is the same as it was before enabling TBB, somewhere below 30%.

Why is it that way? I supposed HoughCircles is highly parallelizable, according to this


Taking a look at the source code mentioned by matman (see here), at line 1058 there are two for loops which, as far as I can see, just fill up the accumulator. How can they be parallelized with parallel_for_, and would that actually help speed up the detection?



I think the for loop on line 1058 is independent across rows, so you could put it into a class which inherits from ParallelLoopBody.

For an example, look here starting from line 639. The function is called at line 820. I think this should be a relatively simple implementation.

matman ( 2016-09-02 14:03:31 -0600 )

My German is not perfect, but maybe we could discuss the implementation some other way? At the moment I am still struggling to understand how parallel_for_ works at all. I have seen a few examples, but given my weak C++ skills, the emphasis is on 'relatively' simple.

bradfield ( 2016-09-02 17:29:20 -0600 )

Can you point me to resources that explain the basics needed for the implementation? I haven't looked through the parallel_for_ wrapper yet, although I have already seen several examples.

bradfield ( 2016-09-03 02:38:37 -0600 )

1 answer


answered 2016-09-01 11:44:22 -0600

matman

updated 2016-10-09 05:19:47 -0600

Actually, HoughCircles is not multithreaded, as you can see in the sources (link). You are free to make a pull request.

For multithreading you should use OpenCV's parallel_for_ wrapper, so that the code works with all supported multithreading libraries. As far as I can see, parallelizing HoughCircles should be possible with the existing code; perhaps parts of the algorithm are also vectorizable using SIMD intrinsics.
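To give a rough feel for what such a wrapper does with a range, here is a minimal sketch that uses only the C++ standard library, not OpenCV. The names splitRange and runStriped are hypothetical, and the one-thread-per-stripe scheduling is an assumption made for illustration; how OpenCV actually distributes the work depends on the backend it was built with (TBB, OpenMP, and so on).

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper (not an OpenCV function): split [start, end) into
// at most `stripes` contiguous sub-ranges of nearly equal size.
std::vector<std::pair<int, int>> splitRange(int start, int end, int stripes) {
    assert(stripes > 0 && start <= end);
    std::vector<std::pair<int, int>> parts;
    const int total = end - start;
    const int count = std::min(stripes, total);
    int begin = start;
    for (int s = 0; s < count; ++s) {
        // Spread the remainder over the first (total % count) stripes.
        const int len = total / count + (s < total % count ? 1 : 0);
        parts.emplace_back(begin, begin + len);
        begin += len;
    }
    return parts;
}

// Hypothetical stand-in for a parallel-for wrapper: calls body(begin, end)
// once per stripe, each call on its own std::thread, then waits for all.
void runStriped(int start, int end, int stripes,
                const std::function<void(int, int)>& body) {
    std::vector<std::thread> workers;
    for (const auto& part : splitRange(start, end, stripes)) {
        workers.emplace_back(body, part.first, part.second);
    }
    for (auto& w : workers) {
        w.join();
    }
}
```

Each call of the body only sees its own sub-range, which is the role boundaries.start and boundaries.end play in a ParallelLoopBody. As long as every stripe writes only to its own rows, no locking is needed.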


I think there is no good documentation, so I will try to explain it with a plain example. I hope this is correct so far. Keep in mind that writing to the same memory location from several threads is problematic.

class parallelClass : public cv::ParallelLoopBody {
public:
    // Constructor: pass in the variables from outside
    // and those that carry results back to the outside
    parallelClass(const cv::Mat &_in, cv::Mat &_out) :
    in(_in), out(_out)  // bind the outside variables to the class members
    {
        // Do something which affects all threads and class members
        out.create(in.size(), in.type());
        temp.create(in.size(), in.type());
    }

    // For completeness
    ~parallelClass() {}
    parallelClass& operator=(const parallelClass&) {return *this;}

    // This is an overloaded () operator which executes the calculation for every thread
    // The range is split by parallel_for_ automatically
    void operator()(const cv::Range &boundaries) const
    {
        // Do the loop which you want to parallelize.
        // boundaries.start and boundaries.end are set by parallel_for_ for every thread
        for(int i = boundaries.start; i < boundaries.end; ++i) {

            // Do loop stuff example (makes no sense)
            for(int j = 0; j < in.cols; ++j) {
                temp.at<char>(i, j) = in.at<char>(i, j) - 1;
                out.at<char>(i, j) = temp.at<char>(i, j) + 1;
            }
        }
    }

private:
    // Declare class members (shared by all threads)
    const cv::Mat &in;
    cv::Mat &out;
    mutable cv::Mat temp;  // mutable so the const operator() can write to it
};

The class is used like this:

cv::Mat src, dst;
src = cv::imread("Path/to/image.bmp", cv::IMREAD_GRAYSCALE);
int numberOfThreads = 4;

// Serial equivalent of what parallel_class_obj does:
cv::Mat temp;
dst.create(src.size(), src.type());
temp.create(src.size(), src.type());

for(int i = 0; i < src.rows; ++i) {
    for(int j = 0; j < src.cols; ++j) {
        temp.at<char>(i, j) = src.at<char>(i, j) - 1;
        dst.at<char>(i, j) = temp.at<char>(i, j) + 1;
    }
}

// Creates object and calls the constructor of parallelClass
parallelClass parallel_class_obj(src, dst);

// Executes the calculation; the third argument (nstripes) tells parallel_for_
// how many stripes to split the range into
cv::parallel_for_(cv::Range(0, src.rows), parallel_class_obj, numberOfThreads);
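The warning about writing to the same memory location matters most for something like a Hough accumulator, where different stripes may vote for the same cell. One common way around this, sketched below with plain std::thread instead of OpenCV, is to give each stripe a private accumulator and merge them afterwards. This is an illustration under simplifying assumptions, not the approach taken in HoughCircles: accumulateVotes is a hypothetical name, and each "vote" is reduced to the index of the cell it increments.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical example: count votes per accumulator cell using `stripes`
// threads. Each thread writes only to its own private accumulator, so no
// two threads ever touch the same memory location.
std::vector<int> accumulateVotes(const std::vector<int>& votes,
                                 int cells, int stripes) {
    assert(cells > 0 && stripes > 0);
    std::vector<std::vector<int>> partial(stripes, std::vector<int>(cells, 0));
    std::vector<std::thread> workers;
    const std::size_t n = votes.size();

    for (int s = 0; s < stripes; ++s) {
        // Contiguous slice of the votes handled by stripe s.
        const std::size_t begin = n * s / stripes;
        const std::size_t end = n * (s + 1) / stripes;
        workers.emplace_back([&votes, &partial, s, begin, end, cells] {
            for (std::size_t i = begin; i < end; ++i) {
                const int cell = votes[i];
                if (cell >= 0 && cell < cells) {
                    ++partial[s][cell];  // private to stripe s
                }
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }

    // Merge the private accumulators serially after all threads finished.
    std::vector<int> acc(cells, 0);
    for (const auto& p : partial) {
        for (int c = 0; c < cells; ++c) {
            acc[c] += p[c];
        }
    }
    return acc;
}
```

The price is one extra accumulator per stripe plus the merge step, so whether this pays off depends on how large the accumulator is compared to the voting work.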


I made a pull request for parallelizing HoughCircles here. Please test it and it would be great to get some feedback about performance and issues. The complete fork is here.

