# OpenCV optimisation - multiplying a set of matrices by a set of scalars and summing.

I have an optimisation problem and I'm wondering what the best way to approach it is.

At present I have a set of matrices (mats) that I need to scale by a set of values held in a vector and then sum together. I have written the following code, but it seems to be painfully slow (far more so than I would have expected).

    cv::Mat sum = cv::Mat::zeros( mats[0].rows, mats[0].cols, cvType );
    for( int m = 0; m < mats.size(); m++ )
    {
        const Type val = rowVec.at< Type >( m );
        sum += val * mats[m];
    }


Can anyone suggest a faster way of doing the above loop?

Edit:

I wrote a little function to try and aid my performance:

    template< typename Type >
    void ScaleMatAndSum( cv::Mat& scale, const cv::Mat& mats, const Type val )
    {
        for( int r = 0; r < mats.rows; r++ )
        {
            for( int c = 0; c < mats.cols; c++ )
            {
                scale.at< Type >( r, c ) += mats.at< Type >( r, c ) * val;
            }
        }
    }


When multi-threaded, this is about 8 times faster than the original code. Can anyone explain what is going on?


Try using the cv::scaleAdd function, which computes dst = alpha*src1 + src2 in a single call. It has SIMD optimizations built in.

    cv::Mat sum = cv::Mat::zeros( mats[0].rows, mats[0].cols, cvType );
    for( int m = 0; m < mats.size(); m++ )
    {
        const Type val = rowVec.at< Type >( m );
        cv::scaleAdd(mats[m], val, sum, sum);
    }


Ok, ran the benchmarks.

Original method: 6.17536 s

That's a 65% speedup, whereas the best of the other answer was a 45% speedup.


The following are my guesses as a non-expert.

I think that in the first case, the product val * mats[m] produces a temporary cv::Mat of the full matrix size, whereas mats.at< Type >( r, c ) * val needs only a temporary of type Type.
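A minimal sketch of the distinction this guess draws, in plain C++ with no OpenCV involved. `Buffer`, `accumulate_with_temporary` and `accumulate_fused` are hypothetical names used only for illustration, and whether OpenCV really allocates a temporary for `val * mats[m]` is the assumption being illustrated, not something this code proves.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a single-channel matrix: a flat buffer of floats.
using Buffer = std::vector<float>;

// Models `sum += val * mat` under the assumption that the product is first
// materialised: one full-size allocation and two passes over the data.
void accumulate_with_temporary(Buffer& sum, const Buffer& mat, float val)
{
    assert(sum.size() == mat.size());
    Buffer scaled(mat.size());                    // full-size temporary
    for (std::size_t i = 0; i < mat.size(); ++i)
        scaled[i] = val * mat[i];                 // pass 1: scale
    for (std::size_t i = 0; i < sum.size(); ++i)
        sum[i] += scaled[i];                      // pass 2: accumulate
}

// Models the element-wise loop: a single pass, with only a scalar
// intermediate per element and no extra buffer.
void accumulate_fused(Buffer& sum, const Buffer& mat, float val)
{
    assert(sum.size() == mat.size());
    for (std::size_t i = 0; i < mat.size(); ++i)
        sum[i] += val * mat[i];
}
```

Both functions give the same result; the fused form simply touches memory less, which is also the shape of the operation dst = alpha*src1 + src2 that cv::scaleAdd performs.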

What library did you use for multithreading your second code? Is it 8 times faster when you compare the second code multi-threaded vs your first code multi-threaded, or vs your first code single-threaded? Also, what size are your matrices? Some optimization methods are effective only on large data.

Here are some comparisons and attempts to improve the performance:

Results on my computer (50 iterations, 100 matrices of size 1000x800):

Original method: sum1=[5.0812e+009, 0, 0, 0] ; t1=8.87578 s
Matrix method: sum2=[5.0812e+009, 0, 0, 0] ; t2=17.1907 s
Pointer access: sum3=[5.0812e+009, 0, 0, 0] ; t3=5.40258 s
Pointer access + unroll loop: sum4=[5.0812e+009, 0, 0, 0] ; t4=5.24404 s
Pointer access + unroll loop + parallel_for_: sum5=[5.0812e+009, 0, 0, 0] ; t5=4.79474 s


The best improvement seems to come from switching from multiplying the whole matrix by a scalar to iterating over the elements and multiplying each one directly (17.1907 s vs 8.87578 s).

Switching to pointer access also gives a reasonable improvement (8.87578 s vs 5.40258 s).
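To put the timings quoted above on a common scale, here is a minimal sketch that converts a before/after pair into the percentage of run time saved. `percent_time_saved` is a hypothetical helper, and reading "improvement" as time saved relative to the slower run is an assumption.

```cpp
#include <cassert>

// Percentage of run time saved when going from `before` seconds to
// `after` seconds. Assumes `before` is a positive duration.
double percent_time_saved(double before, double after)
{
    assert(before > 0.0);
    return 100.0 * (before - after) / before;
}
```

With the reported figures this gives roughly 48% for the matrix method vs the element-wise loop (17.1907 s to 8.87578 s), 39% for the element-wise loop vs pointer access (8.87578 s to 5.40258 s), and 46% for the element-wise loop vs the parallel_for_ version (8.87578 s to 4.79474 s).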

The code I used for benchmarking:

    #include <opencv2/opencv.hpp>

    //Original method
    template< typename Type >
    void ScaleMatAndSum( cv::Mat& scale, const cv::Mat& mats, const Type val )
    {
        for( int r = 0; r < mats.rows; r++ )
        {
            for( int c = 0; c < mats.cols; c++ )
            {
                scale.at< Type >( r, c ) += mats.at< Type >( r, c ) * val;
            }
        }
    }

    //Pointer access
    template< typename Type >
    void ScaleMatAndSum_ptr( cv::Mat& scale, const cv::Mat& mats, const Type val )
    {
        Type *ptr_scale_rows;
        const Type *ptr_mat_rows;
        for( int r = 0; r < mats.rows; r++ )
        {
            ptr_scale_rows = scale.ptr<Type>(r);
            ptr_mat_rows = mats.ptr<Type>(r);

            for( int c = 0; c < mats.cols; c++ )
            {
                ptr_scale_rows[c] += ptr_mat_rows[c] * val;
            }
        }
    }

    //Pointer access + unroll loop
    template< typename Type >
    void ScaleMatAndSum_ptr_unroll_loop( cv::Mat& scale, const cv::Mat& mats, const Type val )
    {
        Type *ptr_scale_rows;
        const Type *ptr_mat_rows;
        for( int r = 0; r < mats.rows; r++ )
        {
            ptr_scale_rows = scale.ptr<Type>(r);
            ptr_mat_rows = mats.ptr<Type>(r);

            //Main loop: 4 elements per iteration
            int c = 0;
            for( ; c + 3 < mats.cols; c += 4 )
            {
                ptr_scale_rows[c] += ptr_mat_rows[c] * val;
                ptr_scale_rows[c+1] += ptr_mat_rows[c+1] * val;
                ptr_scale_rows[c+2] += ptr_mat_rows[c+2] * val;
                ptr_scale_rows[c+3] += ptr_mat_rows[c+3] * val;
            }

            //Remaining 0-3 elements when cols is not a multiple of 4
            //(avoids reading and writing past the end of the row)
            for( ; c < mats.cols; c++ )
            {
                ptr_scale_rows[c] += ptr_mat_rows[c] * val;
            }
        }
    }

    //Pointer access + unroll loop + ParallelLoopBody
    template <class Type>
    class Parallel_ScaleAndSum: public cv::ParallelLoopBody
    {
    private:
        cv::Mat m_mat;
        Type m_mul;
        cv::Mat *m_result;

    public:
        Parallel_ScaleAndSum(cv::Mat *result, const cv::Mat &mat, const Type &mul)
            : m_mat ...

## Stats

Asked: 2016-04-01 05:50:16 -0500

Seen: 582 times

Last updated: Apr 01 '16