OpenCV parallel_for does not use multiple processors

I just saw that the new OpenCV 2.4.3 added a universal parallel_for. So, following this example: http://answers.opencv.org/question/3730/how-to-use-parallel_for/

I tried to implement it myself. I got it all functioning with my code, but when I timed it against a similar loop written serially with a regular "for" statement, the parallel version was only insignificantly faster, and often a tiny bit slower!

I thought maybe this had something to do with my pushing into vectors or something (I'm a pretty big noob to parallel processing), so I set up a test loop that just runs through a big range of numbers, and it still doesn't run any faster.

Check it out:

code:

    class Parallel_Test : public cv::ParallelLoopBody
    {
    private:
        double* const mypointer;

    public:
        Parallel_Test(double* pointer)
            : mypointer(pointer)
        {
        }

        // This overload needs to be here, otherwise the class is considered abstract.
        void operator()(const Range& range) const
        {
            // qDebug() << "This should never be called";
        }

        void operator()(const cv::BlockedRange& range) const
        {
            for (int x = range.begin(); x < range.end(); ++x) {
                mypointer[x] = x;
            }
        }
    };

    // TODO: loop over pixels in parallel
    double t = (double)getTickCount();

    // Test parallel looping at all
    double data1[1000000];

    cv::parallel_for(BlockedRange(0, 1000000), Parallel_Test(data1));

    t = ((double)getTickCount() - t) / getTickFrequency();
    qDebug() << "Parallel TEST time " << t << endl;

    t = (double)getTickCount();

    for (int i = 0; i < 1000000; i++) {
        data1[i] = i;
    }
    t = ((double)getTickCount() - t) / getTickFrequency();
    qDebug() << "SERIAL Scan time " << t << endl;

Here's the output:

Parallel TEST time 0.00415479

SERIAL Scan time 0.00204597

That example was just a test case. My actual loop that I hope to parallelize normally takes about 1.5 seconds (I'm doing ICP registration over millions of 3D points), and parallel_for does not improve that at all. What's even more telling is that only one processor is ever used at a time. Even if launching the threads were inefficient, the work should at least be spread across multiple cores. This leads me to believe that something is wrong.
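
For comparison, here is a minimal sketch of the same fill loop split by hand across std::thread workers, independent of OpenCV. It assumes a C++11 compiler, and fill_parallel is a hypothetical helper name, not part of OpenCV. If this version does load several cores while the parallel_for version does not, the problem probably lies in how the OpenCV call is dispatched rather than in the machine.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper (not an OpenCV API): split [0, data.size()) into
// contiguous chunks and fill each chunk on its own std::thread.
void fill_parallel(std::vector<double>& data, unsigned num_threads)
{
    // std::thread::hardware_concurrency() may return 0, so guard against it.
    if (num_threads == 0) num_threads = 1;

    const std::size_t n = data.size();
    // Chunk size rounded up so that all n elements are covered.
    const std::size_t chunk = (n + num_threads - 1) / num_threads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;  // nothing left to hand out

        // Each worker writes only to its own [begin, end) slice,
        // so no locking is needed.
        workers.emplace_back([&data, begin, end] {
            for (std::size_t x = begin; x < end; ++x) {
                data[x] = static_cast<double>(x);
            }
        });
    }

    // Wait for every worker before returning.
    for (std::thread& w : workers) {
        w.join();
    }
}
```

Calling fill_parallel(data, std::thread::hardware_concurrency()) on a std::vector<double> with 1000000 elements, timed the same way as the two loops above, gives a baseline to compare against. For a loop this small, thread start-up cost may still dominate, so a heavier loop body is a fairer test.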
