# Best way to apply a function to each element of Mat.

Hi,

I need to apply a function to each element of a matrix (in a Mat object). For example, I need to calculate the hyperbolic tangent (tanh) of each value in the Mat.

I know that I can access each element of a Mat M with M.at<T>(i,j), so I can implement the algorithm as follows (just pseudocode):

    for (int i = 0; i < M.rows; ++i)        // each row of Mat M
        for (int j = 0; j < M.cols; ++j)    // each column
            M.at<T>(i, j) = tanh(M.at<T>(i, j));


This works, but it is rather slow, and I'm working on a real-time system that has to repeat the same operation over and over.

Is there perhaps some way to apply the same function (perhaps by passing a function pointer) to every element of a Mat at once? That would be quite helpful.

For a quick solution, see http://answers.opencv.org/question/830/hyperbolic-tangent-on-image/ . I agree that providing a template-based opencv foreach function is a good long-term solution.

(2013-10-08 10:43:23 -0600)

I agree too, but other languages do not support templates. If OpenCV provided a template-based foreach function, it would very likely become a C++-only API.

(2013-10-08 20:34:23 -0600)


As suggested by Guanta below and in http://answers.opencv.org/question/3730/how-to-use-parallel_for/ , I compared three approaches: a plain loop, TBB, and OpenCV's parallel_for_.

(Tested with 1920x1080 to 19200x10800 single-channel uchar Mats on a 2.3 GHz Core i5 MBP.)

The built-in OpenCV ParallelLoopBody wins!

Here is the code:

    #include <iostream>
    #include <cmath>
    #include <tbb/tbb.h>                      // for TBB
    #include <opencv2/highgui/highgui.hpp>    // ParallelLoopBody comes with it (core.hpp)

    using namespace std;
    using namespace tbb;
    using namespace cv;

    // Functor for TBB; delete it if you don't need it
    class parallel_pixel
    {
    private:
        uchar *p;
    public:
        parallel_pixel(uchar *ptr) : p(ptr) { }

        void operator()(const blocked_range<int>& r) const
        {
            for (int i = r.begin(); i != r.end(); i++) {
                p[i] = (uchar)cos(p[i]);    // cos() is just a stand-in workload
            }
        }
    };

    // Functor for OpenCV's parallel_for_
    class Parallel_pixel_opencv : public ParallelLoopBody
    {
    private:
        uchar *p;
    public:
        Parallel_pixel_opencv(uchar* ptr) : p(ptr) {}

        virtual void operator()(const Range &r) const
        {
            for (int i = r.start; i != r.end; ++i)
            {
                p[i] = (uchar)cos(p[i]);
            }
        }
    };

    int main()
    {
        int width = 1920 * 3;
        int height = 1080 * 3;

        // If nElements is too small, TBB takes longer than the plain loop
        // because of its startup and copy overhead
        int nElements = width * height;    // only valid for a single channel

        Mat src(Size(width, height), CV_8UC1);    // for the one-by-one run
        Mat old;                                  // clone for TBB
        Mat old2;                                 // clone for ParallelLoopBody

        // just put in some initial values
        int v = 0;
        for (int row = 0; row < src.rows; ++row)
        {
            for (int col = 0; col < src.cols; ++col)
            {
                src.at<uchar>(row, col) = saturate_cast<uchar>(v);
                v++;
            }
        }
        // initialization end
        old = src.clone();    // save a copy
        old2 = src.clone();

        // --------- normal way -----------
        uchar* p1 = src.data;    // p1 for the normal way

        // normal way: one-by-one iteration
        // timing start
        for (int i = 0; i < nElements; ++i)
        {
            p1[i] = (uchar)cos(p1[i]);
        }
        // timing stop

        // --------- TBB way -----------
        task_scheduler_init init;    // start TBB
        uchar* p2 = old.data;        // p2 for the TBB way

        // timing tbb start
        // a grain size of 800 gave the best performance on my computer

        parallel_for(blocked_range<int>(0, nElements, 800), parallel_pixel(p2));

        // timing tbb stop

        // --------- opencv way ----------
        uchar* p3 = old2.data;

        // timing ParallelLoopBody start

        parallel_for_(Range(0, nElements), Parallel_pixel_opencv(p3));

        // timing ParallelLoopBody stop

        // check that the normal way gives the same result as the parallel ways
        for (int i = 0; i < nElements; ++i) {
            if (p1[i] != p2[i]) {
                cout << i << " tbb answer not match" << endl;
            }
            if (p1[i] != p3[i]) {
                cout << i << " opencv answer not match" << endl;
            }
        }

        return 0;
    }


The result is :

normal time: 754.778 ms

TBB time: 223.938 ms

opencv time: 200.656 ms

normal/tbb = 3.37048 (sorry, my previous post reported 2.7 because my CPU was busy with something else)

normal/opencv = 3.76155
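
The listing only marks where timing starts and stops. One way to fill in those markers is `std::chrono::steady_clock`. This is a minimal sketch, not the code that produced the numbers above: the helper names `time_ms` and `cos_in_place` are hypothetical, and a `std::vector` stands in for the Mat buffer so the example needs no OpenCV.

```cpp
#include <chrono>
#include <cmath>
#include <vector>

// Hypothetical helper (not from the original post): runs `work` once and
// returns the elapsed wall-clock time in milliseconds.
template <typename Work>
double time_ms(Work work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Stand-in workload mirroring the "normal way" loop: cos() of each byte,
// truncated back to a byte.
void cos_in_place(std::vector<unsigned char>& pixels)
{
    for (unsigned char& px : pixels)
        px = static_cast<unsigned char>(std::cos(static_cast<double>(px)));
}
```

A call such as `double ms = time_ms([&] { cos_in_place(pixels); });` wraps one variant; repeating it for each variant on its own copy of the data gives comparable numbers. Single runs are noisy, so averaging several runs is usually safer.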


I keep hoping that OpenCV will provide something like parallel_for. There are C++14 proposals about parallel programming, so hopefully C++14 brings good news.

(2013-10-08 23:51:37 -0600)

OpenCV does have parallel_for_. It supports TBB, OpenMP, etc. The syntax is very similar to TBB's (but you don't need to initialize it).

(2013-10-09 05:45:08 -0600)

@Guanta: This is cool. It looks like it is called "ParallelLoopBody"? I can't find it in the index pages; did they forget to add it? It does require inheritance, though, which is a little clumsy. TBB can use a lambda for trivial jobs, so I like TBB's API more.

(2013-10-09 16:00:50 -0600)

Yes, it is not well documented, and some functionality of e.g. TBB is not available. However, you can use it if you want to be independent of the backend. List of supported backends: http://answers.opencv.org/question/9095/parallel-computing-in-opencv-244/ . Usage: http://answers.opencv.org/question/3730/how-to-use-parallel_for/

(2013-10-10 02:58:27 -0600)

This is a good answer, but note that the buffer that holds the image may not be continuous in memory. It is safer to apply parallel_for (or parallel_for_) to each row of the image separately, or at least to check that old2.isContinuous() is true.

(2013-10-10 06:56:04 -0600)

Thanks @Guanta for the suggestion. I wasn't aware of parallel_for_. Also thanks @maythe4thbewithu for the testing :) It made my life easier.

(2013-10-10 17:52:59 -0600)

@maythe4thbewithu already gave a pretty good answer to the question, but there are still a few missing points that I would like to address.

1) The image is not necessarily continuous in memory (for example, if an ROI is used). A single 'for' loop that runs over width*height values may therefore give wrong results. As a result, two 'for' loops are needed, only one of which is a parallel_for_. There are many ways to implement them, and most of those implementations are pretty inefficient, so knowing how to do it properly is important.

2) Dereferencing a pointer can be faster than accessing it by index. So if you need really good performance, you should use *p and ++p instead of p[i] and ++i.

3) Last but definitely not least: use lookup tables. You have an image of unsigned char, which means 256 possible inputs to your function. Calculate them once and store the results in a lookup table. You can then assign each pixel its value from the lookup table without performing any calculation. This may give a speedup of more than an order of magnitude if your function requires heavy computation (cos, for example).

Getting back to the example used by maythe4thbewithu, faster code should look like this:

    class Parallel_Cos: public ParallelLoopBody
    {
    public:
        Parallel_Cos(Mat &imgg) : img(imgg)
        {
            // precompute the result for every possible uchar input
            for(int i=0; i<256; i++)
                lookupTable[i] = (uchar)cos( (float) i );
        }

        void operator() (const Range &r) const
        {
            // r is a range of rows; each row is addressed separately
            for(int j=r.start; j<r.end; ++j)
            {
                unsigned char* current = const_cast<unsigned char*>(img.ptr(j));
                unsigned char* last = current + img.cols;
                for (; current != last; ++current)
                    *current = lookupTable[*current];
            }
        }

    private:
        Mat img;
        unsigned char lookupTable[256];
    };

    // notice that the range is the height of the image
    parallel_for_( Range(0,old2.rows) , Parallel_Cos(old2)) ;
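
The row-by-row lookup idea can also be tried without OpenCV. The following is only a sketch under stated assumptions: `make_cos_table`, `apply_table`, and the `stride` parameter are hypothetical names, and `stride` stands in for the byte distance between row starts that a non-continuous single-channel 8-bit Mat would have. It is serial; the outer row loop is the part a parallel_for_ body would split up.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

// Builds the 256-entry table once: every possible byte value is passed
// through cos() and truncated back to a byte.
std::array<unsigned char, 256> make_cos_table()
{
    std::array<unsigned char, 256> table{};
    for (int i = 0; i < 256; ++i)
        table[i] = static_cast<unsigned char>(std::cos(static_cast<float>(i)));
    return table;
}

// Applies the table to a rows x cols image whose rows start `stride` bytes
// apart (assumption: stride >= cols). Padding bytes between rows are left
// untouched, which is why the rows are walked one at a time.
void apply_table(unsigned char* data, int rows, int cols, std::size_t stride,
                 const std::array<unsigned char, 256>& table)
{
    for (int r = 0; r < rows; ++r) {
        unsigned char* current = data + r * stride;
        unsigned char* const last = current + cols;
        for (; current != last; ++current)
            *current = table[*current];
    }
}
```

For example, a buffer of 2 rows and 3 columns stored with `stride` 4 keeps its fourth byte in each row unchanged after `apply_table(buffer, 2, 3, 4, make_cos_table())`.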


I can offer you a generic for_each loop; you can design your own generic algorithms according to your needs.

    /**
     * @brief apply an STL-like for_each algorithm to every channel value
     *
     * @tparam T : the type of a channel (e.g. uchar, float, double)
     * @param input : the matrix to traverse
     * @param func : unary function that accepts an element in the range as argument
     *
     * @return func
     */
    template<typename T, typename UnaryFunc, typename Mat>
    inline UnaryFunc for_each_channels(Mat &&input, UnaryFunc func)
    {
        int rows = input.rows;
        // one entry per channel value in a row (the original used input.cols,
        // which skips values for multi-channel, non-continuous input)
        int cols = input.cols * input.channels();

        if(input.isContinuous()){
            // no padding between rows: treat the whole buffer as one long row
            cols = static_cast<int>(input.total() * input.channels());
            rows = 1;
        }

        for(int row = 0; row != rows; ++row){
            auto input_ptr = input.template ptr<T>(row);
            for(int col = 0; col != cols; ++col){
                func(input_ptr[col]);
            }
        }

        return func;
    }


You could do your transform like this:

    for_each_channels<T>(input, [](T &data){ data = std::tanh(data); });


With the help of C++11 lambdas (C++14 will be even better), STL-like algorithms become much easier to use.

There is still room to speed up the algorithm, for example by vectorizing (I don't know how to do that yet) or by multithreading. I could develop the latter with std::thread, but I don't know how to build a thread pool, and without a decent thread pool a parallel for_each algorithm is not ready yet. Although other languages do not support templates, I think this kind of generic algorithm could speed up development of the OpenCV source and make life easier for C++ programmers.

I use && so that the function accepts both cv::Mat& and cv::Mat const&. This link explains the principle of the algorithm in detail (the principle only): generic algorithm
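
Why `&&` accepts both kinds of argument can be shown in isolation. In this sketch `Image` is a hypothetical stand-in for cv::Mat and `is_const_argument` is an illustrative name, neither is part of the answer's code. With a deduced template parameter, `M&&` is a forwarding reference, so the constness of the argument is carried in `M`.

```cpp
#include <type_traits>

// Hypothetical stand-in for cv::Mat, only to keep the sketch self-contained.
struct Image { int rows; int cols; };

// Passing a non-const lvalue deduces M = Image&, passing a const lvalue
// deduces M = const Image&, and passing a temporary deduces M = Image.
// The function reports whether the argument it received was const.
template <typename M>
constexpr bool is_const_argument(M&&)
{
    return std::is_const<typename std::remove_reference<M>::type>::value;
}
```

Here `is_const_argument(a)` is false for `Image a{}` and true for `const Image b{}`. A likely consequence for for_each_channels is that a const matrix yields pointers to const elements, so the lambda would then have to take `T const&` instead of `T&`.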


Thanks for your answer. However, it seems parallel_for_ is the best, as suggested by Guanta and tested by maythe4thbewithu.

(2013-10-10 17:50:25 -0600)

The fastest way to access all Mat elements is:

    Size contSize = mat.size();
    if (mat.isContinuous())
    {
        // no padding between rows: treat the whole buffer as one long row
        contSize.width *= contSize.height;
        contSize.height = 1;
    }
    for (int i = 0; i < contSize.height; ++i)
    {
        T* ptr = mat.ptr<T>(i);
        for (int j = 0; j < contSize.width; ++j)
        {
            ptr[j] = ...;
        }
    }
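
Applied to the tanh case from the question, the same loop structure might look like the sketch below. It avoids OpenCV on purpose: `tanh_in_place` and `stride` are hypothetical names, and `stride` is assumed to be the distance between row starts counted in float elements, so `stride == cols` plays the role of `isContinuous()` returning true.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Applies tanh in place to a rows x cols float image whose rows start
// `stride` elements apart (assumption: stride >= cols).
void tanh_in_place(float* data, int rows, int cols, std::size_t stride)
{
    assert(stride >= static_cast<std::size_t>(cols));
    if (stride == static_cast<std::size_t>(cols)) {
        // continuous buffer: collapse to one long row, as in the answer
        cols *= rows;
        rows = 1;
    }
    for (int i = 0; i < rows; ++i) {
        float* ptr = data + i * stride;
        for (int j = 0; j < cols; ++j)
            ptr[j] = std::tanh(ptr[j]);
    }
}
```

With a padded buffer (for example 2 rows, 2 columns, `stride` 3) the padding elements are skipped; with `stride` equal to `cols` the whole buffer is processed in a single inner loop.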
