@maythe4thbewithu already gave a pretty good answer to the question, but there are still a few missing points that I would like to address.

1) The image is not necessarily continuous in memory (for example, if a ROI is used). Thus a single 'for' loop that runs over width*height values may give wrong results. As a result, two 'for' loops are needed, only one of which is parallel_for_. There are many ways to implement them, and most of those implementations are pretty inefficient, so knowing how to do it properly is important.

2) Dereferencing a pointer can be faster than accessing elements by index. Thus if you need really good performance, you should use *p and ++p instead of p[i] and ++i.

3) And last but definitely not least: use lookup tables. You have an image of unsigned char, which means 256 possible inputs to your function. Calculate them once and store the results in a lookup table. Then you will be able to assign a value to each pixel in your image from the lookup table without performing any calculations. It may give you a speedup of more than an order of magnitude if your function requires heavy computations (cos, for example).
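To make the three points concrete without depending on OpenCV, here is a minimal sketch in plain C++. ByteView, applyLut and invertRoiDemo are hypothetical names made up for this illustration (they are not OpenCV types or functions), and the inverting table is just a stand-in for whatever function you need. The 'step' field plays the role of the row stride: for a ROI it is larger than the number of columns, which is why a single loop over width*height would touch the wrong bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a single-channel 8-bit image view (not an OpenCV type).
// 'step' is the distance in bytes between the starts of two consecutive rows;
// for a ROI it is larger than 'cols', so the rows are not contiguous in memory.
struct ByteView {
    unsigned char* data;
    int rows;
    int cols;
    std::size_t step;
};

// Apply a 256-entry lookup table to the rows [rowBegin, rowEnd) of the view.
// The outer loop walks rows (this is the loop one would hand to a parallel runner),
// the inner loop walks one row with a plain pointer.
void applyLut(const ByteView& view, const unsigned char (&lut)[256],
              int rowBegin, int rowEnd)
{
    assert(0 <= rowBegin && rowBegin <= rowEnd && rowEnd <= view.rows);
    for (int j = rowBegin; j < rowEnd; ++j) {
        unsigned char* current = view.data + static_cast<std::size_t>(j) * view.step;
        unsigned char* const last = current + view.cols;
        for (; current != last; ++current)
            *current = lut[*current];
    }
}

// Demo: a 4x6 buffer filled with 10, with a 2x3 ROI starting at row 1, column 2.
// Only the ROI pixels are replaced by 255 - value; everything else stays 10.
std::vector<unsigned char> invertRoiDemo()
{
    const int fullRows = 4, fullCols = 6;
    std::vector<unsigned char> buffer(fullRows * fullCols, 10);

    unsigned char lut[256];
    for (int i = 0; i < 256; ++i)
        lut[i] = static_cast<unsigned char>(255 - i);

    ByteView roi = { buffer.data() + 1 * fullCols + 2, 2, 3,
                     static_cast<std::size_t>(fullCols) };
    applyLut(roi, lut, 0, roi.rows);
    return buffer;
}
```

Calling invertRoiDemo() changes only the six ROI pixels; the bytes between the end of one ROI row and the start of the next are skipped because each row start is computed from the stride rather than from the ROI width.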

Getting back to the example used by maythe4thbewithu, faster code should look like this:

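class Parallel_Cos : public ParallelLoopBody
{
public:
    // Precompute the function once for all 256 possible uchar inputs.
    Parallel_Cos(Mat &imgg) : img(imgg)
    {
        // Note: cos() returns values in [-1, 1], so this cast yields only 0 or 1.
        // It is kept from the original example as a stand-in for a heavy function.
        for (int i = 0; i < 256; i++)
            lookupTable[i] = (uchar)cos((float)i);
    }

    // Each call processes the rows [r.start, r.end).
    void operator()(const Range &r) const
    {
        for (int j = r.start; j < r.end; ++j)
        {
            // ptr(j) returns the start of row j, which is correct even for a ROI.
            unsigned char* current = const_cast<unsigned char*>(img.ptr(j));
            // Assumes a single-channel 8-bit image (CV_8UC1).
            unsigned char* last = current + img.cols;
            for (; current != last; ++current)
                *current = lookupTable[*current];
        }
    }

private:
    Mat img;  // shares pixel data with the caller's Mat (no deep copy)
    unsigned char lookupTable[256];
};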

// note that the range is the number of rows (the height of the image)
parallel_for_(Range(0, old2.rows), Parallel_Cos(old2));
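One caveat about the table itself: cos returns values in [-1, 1], so casting the result straight to uchar leaves only 0 or 1 in the table. If you actually want to see the cosine in the output image, one possible approach (an assumption on my part, not something from the original question) is to rescale it to the 0..255 range while building the table. buildScaledCosLut is a hypothetical helper name used only for this sketch:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: fill a 256-entry table with cos(i) rescaled
// from [-1, 1] to [0, 255] and rounded to the nearest integer.
void buildScaledCosLut(unsigned char (&lut)[256])
{
    for (int i = 0; i < 256; ++i) {
        const double c = std::cos(static_cast<double>(i));   // in [-1, 1]
        const double scaled = (c + 1.0) * 0.5 * 255.0;        // in [0, 255]
        assert(scaled >= 0.0 && scaled <= 255.0);
        lut[i] = static_cast<unsigned char>(scaled + 0.5);    // round to nearest
    }
}
```

The table is still computed only once, so the per-pixel loop stays exactly the same; only the constructor would change.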