Best way to apply a function to each element of Mat.

asked 2013-10-08 09:28:35 -0600

Kholofelo

updated 2013-10-08 09:30:12 -0600


I need to apply a function to each element of a matrix (in a Mat object). For example, I need to calculate the hyperbolic tangent (tanh) of each value in the Mat.

I know that I can access each element of a Mat M with M.at<T>(i,j), so I can implement the algorithm (just pseudocode) as follows:

for( i = ... )//through each row in Mat M
    for(j = ...)//through each column
        M.at<T>(i,j) = tanh( M.at<T>(i,j) );

This works, but it is rather slow, and I'm working on a real-time system that has to perform the same operation over and over.

Is there perhaps some way to apply the same function (perhaps by passing a function pointer) to every element of a Mat at once? That would be quite helpful.

Thanks in advance :)




For a quick solution, see . I agree that providing a template-based opencv foreach function is a good long-term solution.

rwong ( 2013-10-08 10:43:23 -0600 )

I agree too, but other languages do not support templates. If OpenCV did provide a template-based foreach function, it would have a high chance of becoming a C++-only API.

stereomatching ( 2013-10-08 20:34:23 -0600 )

4 answers


answered 2013-10-08 22:59:52 -0600

maythe4thbewithu

updated 2013-10-10 04:44:46 -0600

As suggested by Guanta below and

I compared three approaches: a plain loop, TBB, and OpenCV's parallel_for_.

(Tested with single-channel uchar Mats from 1920x1080 to 19200x10800 on a 2.3 GHz Core i5 MBP.)

The built-in OpenCV ParallelLoopBody wins!

Here is the code:

#include <iostream>
#include <cmath>
#include <tbb/tbb.h>                                    // for tbb
#include <opencv2/highgui/highgui.hpp>    // ParallelLoopBody is included (core.hpp)

using namespace std ;
using namespace tbb ;
using namespace cv ;

// this class is for tbb, delete it if you don't need it
class parallel_pixel
{
    uchar *p ;
public:
    parallel_pixel(uchar *ptr ) : p(ptr) { }

    void operator() ( const blocked_range<int>& r ) const
    {
        for ( int i = r.begin(); i != r.end(); i++ ) {
            p[i] = (uchar)cos( p[i] )  ;    // I just use cos()
        }
    }
} ;

// this class is for OpenCV ParallelLoopBody
class Parallel_pixel_opencv : public ParallelLoopBody
{
    uchar *p ;
public:
    Parallel_pixel_opencv(uchar* ptr ) : p(ptr) {}

    virtual void operator()( const Range &r ) const
    {
        for ( int i = r.start; i != r.end; ++i)
            p[i] = (uchar)cos( p[i] )  ;
    }
} ;

int main()
{
    int width = 1920 *3;
    int height = 1080 *3;

    // If nElements is too small, tbb takes longer, because tbb has to start up and copy the body
    int nElements = width*height ;     // only for single channel

    Mat src( Size(width,height) , CV_8UC1 ) ;       // for one_by_one run
    Mat old ;                                       // clone for tbb
    Mat old2 ;                                     // clone for ParallelLoopBody

    // just put some initial value
    int v = 0 ;
    for( int w = 0 ; w < src.rows ; ++w )
        for( int h = 0 ; h < src.cols ; ++h ) {
            src.at<uchar>(w,h) = saturate_cast<uchar>(v) ;
            v++ ;
        }
    // initial end
    old = src.clone() ;    // save a copy
    old2 = src.clone() ;

    // --------- normal way ----------- 
    // a freshly created or cloned Mat is continuous, so one flat pointer is safe here
    uchar* p1 = src.data ;    // p1 for normal way

    // normal way : one_by_one iteration
    // timing start
    for( int i = 0 ; i < nElements ; ++i )
        p1[i] = (uchar)cos( p1[i] ) ;
    // timing stop

    // --------- TBB way -----------
    task_scheduler_init init ;    // start tbb
    uchar* p2 = old.data ;    // p2 for tbb way

    // timing tbb start, 
    // grain size = 800 gave the best performance in tests on my computer

    parallel_for(blocked_range<int>(0, nElements, 800), parallel_pixel(p2) ) ;

    // timing tbb stop

    // --------- opencv way ----------
    uchar* p3 = old2.data ;

    // timing ParallelLoopBody start

    parallel_for_( Range(0,nElements) , Parallel_pixel_opencv(p3)) ;

    // timing ParallelLoopBody stop

    // checking if normal way has the same result as tbb way and opencv way
    for( int i = 0 ; i < nElements ; ++i ) {
        if( p1[i] != p2[i] ) {
            cout << i << " tbb answer does not match" <<  endl;
        }
        if( p1[i] != p3[i] )  {
            cout << i << " opencv answer does not match" <<  endl;
        }
    }

    return 0;
}

The results are:

normal time: 754.778 ms

TBB time: 223.938 ms

opencv time: 200.656 ms

normal/tbb = 3.37048 (sorry, in my last post I reported 2.7 because my CPU was busy with something else)

normal/opencv = 3.76155



I always hoped OpenCV would provide something like parallel_for. C++14 does have proposals about parallel programming; I hope C++14 brings us good news.

stereomatching ( 2013-10-08 23:51:37 -0600 )

OpenCV does have parallel_for_. It supports TBB, OpenMP, etc. The syntax is very similar to TBB's (but you don't need to initialize it).

Guanta ( 2013-10-09 05:45:08 -0600 )

@Guanta: This is cool. It looks like it is called "ParallelLoopBody"? I can't find it in the index pages; did they forget to add it? But it requires inheritance, which is a little clumsy. TBB can use a lambda for trivial jobs, so I like TBB's API more.

stereomatching ( 2013-10-09 16:00:50 -0600 )

Yes, it is not well documented, and some functionality of e.g. TBB is not available; however, you can use it if you want to be independent of the backend you use. List of supported backends: . Usage:

Guanta ( 2013-10-10 02:58:27 -0600 )

This is a good answer, but note that the buffer that holds the image may not be continuous in memory. It is safer to apply parallel_for (or parallel_for_) to each row of the image separately, or at least to check whether old2.isContinuous() is true.

Michael Burdinov ( 2013-10-10 06:56:04 -0600 )

Thanks @Guanta for the suggestion. I wasn't aware of parallel_for_ . Also thanks @maythe4thbewithu for the testing :) made my life easier.

Kholofelo ( 2013-10-10 17:52:59 -0600 )

answered 2013-10-15 09:09:34 -0600

Michael Burdinov

updated 2013-10-17 01:42:12 -0600

@maythe4thbewithu already gave a pretty good answer to the question, but there are still a few missing points that I would like to address.

1) The image is not necessarily continuous in memory (for example, if a ROI is used). Thus a single 'for' loop that runs over width*height values may give wrong results. As a result, two 'for' loops are needed, only one of which is a parallel_for_. There are many ways to implement them, and most of those implementations are pretty inefficient, so knowing how to do it properly is important.

2) Incrementing and dereferencing a pointer can be faster than indexing into it. So if you need really good performance, use *p and ++p instead of p[i] and ++i (an optimizing compiler may generate the same code for both, so measure).

3) And last but definitely not least: use lookup tables. You have an image of unsigned char, which means only 256 possible inputs to your function. Calculate them once and store the results in a lookup table. Then you can assign a value to each pixel in your image from the lookup table without performing any calculations. This may give you a speedup of more than an order of magnitude if your function requires heavy computation (cos, for example).

Getting back to the example used by maythe4thbewithu, faster code should look like this:

class Parallel_Cos : public ParallelLoopBody
{
public:
    Parallel_Cos(Mat &imgg) : img(imgg)
    {
        // compute the function once for each of the 256 possible inputs
        for(int i=0; i<256; i++)
            lookupTable[i] = (uchar)cos( (float) i );
    }

    void operator() (const Range &r) const
    {
        // each index in the range is one image row
        for(int j=r.start; j<r.end; ++j)
        {
            unsigned char* current = const_cast<unsigned char*>(img.ptr(j));
            unsigned char* last = current + img.cols;    // single-channel image
            for (; current != last; ++current)
                *current = lookupTable[*current];
        }
    }

private:
    Mat img;
    unsigned char lookupTable[256];
};

// notice that the range is the height of the image
parallel_for_( Range(0,old2.rows) , Parallel_Cos(old2)) ;
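
The lookup-table idea and the row-by-row pointer walk can also be tried without OpenCV. The following is only a minimal sketch under stated assumptions: build_table and apply_table are hypothetical helper names, and the rows/cols/stride parameters stand in for a single-channel 8-bit Mat (stride is the distance in bytes between the starts of two rows, so padded or ROI-like buffers are handled).

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper: evaluate f once for each of the 256 possible
// 8-bit inputs and store the results.
template <typename Func>
std::array<unsigned char, 256> build_table(Func f)
{
    std::array<unsigned char, 256> table{};
    for (int v = 0; v < 256; ++v)
        table[v] = f(static_cast<unsigned char>(v));
    return table;
}

// Hypothetical helper: replace every pixel by its table entry.
// Rows are visited one at a time, so padding between rows is never touched.
void apply_table(unsigned char* data, int rows, int cols, std::size_t stride,
                 const std::array<unsigned char, 256>& table)
{
    assert(stride >= static_cast<std::size_t>(cols));
    for (int r = 0; r < rows; ++r) {
        unsigned char* current = data + static_cast<std::size_t>(r) * stride;
        unsigned char* last = current + cols;
        for (; current != last; ++current)
            *current = table[*current];
    }
}
```

For the cos() example in this answer, the table could be built with a lambda that returns static_cast<unsigned char>(std::cos(static_cast<float>(v))) (this needs <cmath>). The per-row loop is the part that parallel_for_ would split across threads.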

answered 2013-10-08 11:58:22 -0600

Vladislav Vinogradov

The fastest way to access all Mat elements is:

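// mat is a single-channel Mat with element type T
Size contSize = mat.size();
if (mat.isContinuous())
{
    // no padding between rows: treat the whole image as one long row
    contSize.width *= contSize.height;
    contSize.height = 1;
}
for (int i = 0; i < contSize.height; ++i)
{
    T* ptr = mat.ptr<T>(i);
    for (int j = 0; j < contSize.width; ++j)
        ptr[j] = ...;
}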
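
To make this traversal pattern concrete without depending on OpenCV, here is a minimal sketch on a plain float buffer. The function name apply_tanh and the rows/cols/stride parameters are hypothetical stand-ins for a single-channel CV_32F Mat (stride plays the role of the row step, expressed in elements); with a real Mat you would use mat.ptr<float>(i) and mat.isContinuous() instead.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for a single-channel float Mat: `stride` is the
// distance in elements between the starts of two consecutive rows.
void apply_tanh(float* data, int rows, int cols, std::size_t stride)
{
    assert(stride >= static_cast<std::size_t>(cols));

    // No padding between rows: treat the buffer as one long row,
    // mirroring the isContinuous() shortcut.
    if (stride == static_cast<std::size_t>(cols)) {
        cols *= rows;
        rows = 1;
    }

    for (int i = 0; i < rows; ++i) {
        float* ptr = data + static_cast<std::size_t>(i) * stride;
        for (int j = 0; j < cols; ++j)
            ptr[j] = std::tanh(ptr[j]);
    }
}
```

When the buffer has padding (stride larger than cols), the outer loop still runs once per row, so the padding elements are left unchanged.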

answered 2013-10-08 20:45:22 -0600

stereomatching

updated 2013-10-08 20:51:24 -0600

I can offer you a generic for_each loop; you can design your own generic algorithms to suit your needs.

/**
 * @brief apply an STL-like for_each algorithm on every channel value
 * @param T : the type of the channel (e.g. uchar, float, double and so on)
 * @param func : unary function that accepts an element in the range as argument
 * @return :
 *  return func
 */
template<typename T, typename UnaryFunc, typename Mat>
inline UnaryFunc for_each_channels(Mat &&input, UnaryFunc func)
{
    int rows = input.rows;
    int cols = input.cols * input.channels();    // channel values per row

    if(input.isContinuous()){
        // no padding between rows: process everything as one long row
        cols = static_cast<int>(input.total()) * input.channels();
        rows = 1;
    }

    for(int row = 0; row != rows; ++row){
        auto input_ptr = input.template ptr<T>(row);
        for(int col = 0; col != cols; ++col){
            func(input_ptr[col]);
        }
    }

    return func;
}

You can then do your transform like this:

for_each_channels<T>(input, [](T &data){ data = std::tanh(data); });

With the help of C++11 lambdas (C++14 will make them even better), STL-like algorithms become much easier to use.
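
The question asked about passing a function pointer; a template parameter like UnaryFunc accepts a plain function, a lambda, or a stateful functor alike. The following is a minimal sketch that does not use OpenCV: for_each_element, tanh_in_place and MaxTracker are hypothetical names, and the rows/cols/stride parameters stand in for a single-channel Mat.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>

// Hypothetical OpenCV-free analogue of a generic per-element loop:
// calls func on every element of a strided buffer and returns func.
template <typename T, typename UnaryFunc>
UnaryFunc for_each_element(T* data, int rows, int cols, std::size_t stride,
                           UnaryFunc func)
{
    assert(stride >= static_cast<std::size_t>(cols));
    for (int row = 0; row != rows; ++row) {
        T* ptr = data + static_cast<std::size_t>(row) * stride;
        for (int col = 0; col != cols; ++col)
            func(ptr[col]);
    }
    return func;
}

// A plain function can be passed by name (it decays to a function pointer).
inline void tanh_in_place(double& v) { v = std::tanh(v); }

// A stateful functor: the returned copy carries the accumulated state,
// just as with std::for_each.
struct MaxTracker {
    double max_value = -std::numeric_limits<double>::infinity();
    void operator()(const double& v) { if (v > max_value) max_value = v; }
};
```

For example, for_each_element(m, rows, cols, stride, tanh_in_place) applies tanh in place, while MaxTracker t = for_each_element(m, rows, cols, stride, MaxTracker()) returns the functor so that t.max_value can be read afterwards.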

There is still room to speed up the algorithm, for example by vectorizing it (I don't know how to do that yet) or by multithreading. I could develop the latter with std::thread, but I don't know how to build a thread pool, and without a decent thread pool the parallel for_each algorithm is not ready yet. Although other languages do not support templates, I think this kind of generic algorithm could speed up development of the OpenCV source and make life easier for C++ programmers.

I use && to make sure it accepts both cv::Mat& and cv::Mat const&. This link explains the principle of the algorithm in detail (only the principle): generic algorithm



Thanks for your answer. However, it seems parallel_for_ is the best, as suggested by Guanta and tested by maythe4thbewithu.

Kholofelo ( 2013-10-10 17:50:25 -0600 )
