The following are my guesses as a non-expert.

I think that in the first case, the product `val * mats[m]` will produce a temporary `Mat` of the appropriate size, whereas `mats.at< Type >( r, c ) * val` needs only a temporary `Type` variable.
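To make that guess concrete, here is a minimal sketch that uses `std::vector<float>` as a stand-in for `cv::Mat`. This is an assumption for illustration only, not how OpenCV implements `operator*`, and `scale_with_temporary` and `scale_in_place` are hypothetical names. The first version builds a full-size temporary on every call, the second needs only a scalar per element.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for a matrix: a flat buffer of floats (not cv::Mat).
using Buffer = std::vector<float>;

// Mimics `sum += val * mat`: a full-size temporary is built first,
// then added to the accumulator in a second pass.
void scale_with_temporary(Buffer& sum, const Buffer& mat, float val)
{
    Buffer tmp(mat.size());                      // extra allocation on every call
    for (std::size_t i = 0; i < mat.size(); ++i)
        tmp[i] = mat[i] * val;
    for (std::size_t i = 0; i < mat.size(); ++i)
        sum[i] += tmp[i];
}

// Mimics the element-wise loop: one pass, only a scalar temporary.
void scale_in_place(Buffer& sum, const Buffer& mat, float val)
{
    for (std::size_t i = 0; i < mat.size(); ++i)
        sum[i] += mat[i] * val;
}
```

Both functions give the same result; the difference is the extra allocation and the second pass over memory in the first one.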

Which library did you use to multithread your second version? Is it 8 times faster compared with your first version multi-threaded, or with your first version single-threaded? Also, what is the size of your matrices? Some optimizations only pay off on large data.

Here are some comparisons and attempts to improve the performance:

The results on my computer (50 iterations, 100 matrices of size 1000x800):

```
Original method: sum1=[5.0812e+009, 0, 0, 0] ; t1=8.87578 s
Matrix method: sum2=[5.0812e+009, 0, 0, 0] ; t2=17.1907 s
Pointer access: sum3=[5.0812e+009, 0, 0, 0] ; t3=5.40258 s
Pointer access + unroll loop: sum4=[5.0812e+009, 0, 0, 0] ; t4=5.24404 s
Pointer access + unroll loop + parallel_for_: sum5=[5.0812e+009, 0, 0, 0] ; t5=4.79474 s
```

The best improvement seems to come from switching from multiplying the matrix by a scalar to iterating over the elements and performing the multiplication directly (17.1907 s vs 8.87578 s).

Switching to pointer access also gives a reasonable improvement (8.87578 s vs 5.40258 s).
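One plausible reason for the gain from pointer access, sketched with a hypothetical `Grid` struct rather than `cv::Mat`: `at(r, c)` recomputes the row offset on every access, while the pointer version computes it once per row. An optimizing compiler may close part of this gap, so treat this as an illustration of the idea, not as a measurement.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical row-major matrix, used only for illustration (not cv::Mat).
struct Grid {
    int rows, cols;
    std::vector<float> data;

    Grid(int r, int c, float v = 0.0f)
        : rows(r), cols(c), data(static_cast<std::size_t>(r) * c, v) {}

    // Element access: the row offset is recomputed on every call.
    float& at(int r, int c) { return data[static_cast<std::size_t>(r) * cols + c]; }
    const float& at(int r, int c) const { return data[static_cast<std::size_t>(r) * cols + c]; }

    // Row access: the offset is computed once, then the caller indexes by column.
    float* ptr(int r) { return data.data() + static_cast<std::size_t>(r) * cols; }
    const float* ptr(int r) const { return data.data() + static_cast<std::size_t>(r) * cols; }
};

// Element-wise accumulation through at().
void scale_and_sum_at(Grid& sum, const Grid& mat, float val)
{
    for (int r = 0; r < mat.rows; ++r)
        for (int c = 0; c < mat.cols; ++c)
            sum.at(r, c) += mat.at(r, c) * val;
}

// Same accumulation with the row pointers hoisted out of the inner loop.
void scale_and_sum_ptr(Grid& sum, const Grid& mat, float val)
{
    for (int r = 0; r < mat.rows; ++r) {
        float* s = sum.ptr(r);
        const float* m = mat.ptr(r);
        for (int c = 0; c < mat.cols; ++c)
            s[c] += m[c] * val;
    }
}
```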

The code I used for benchmarking:

```
#include <opencv2/opencv.hpp>
//Original method
template< typename Type >
void ScaleMatAndSum( cv::Mat& scale, const cv::Mat& mats, const Type val )
{
    for( int r = 0; r < mats.rows; r++ )
    {
        for( int c = 0; c < mats.cols; c++ )
        {
            scale.at< Type >( r, c ) += mats.at< Type >( r, c ) * val;
        }
    }
}
//Pointer access
template< typename Type >
void ScaleMatAndSum_ptr( cv::Mat& scale, const cv::Mat& mats, const Type val )
{
    Type *ptr_scale_rows;
    const Type *ptr_mat_rows;
    for( int r = 0; r < mats.rows; r++ )
    {
        ptr_scale_rows = scale.ptr<Type>(r);
        ptr_mat_rows = mats.ptr<Type>(r);
        for( int c = 0; c < mats.cols; c++ )
        {
            ptr_scale_rows[c] += ptr_mat_rows[c] * val;
        }
    }
}
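//Pointer access + unroll loop
template< typename Type >
void ScaleMatAndSum_ptr_unroll_loop( cv::Mat& scale, const cv::Mat& mats, const Type val )
{
    Type *ptr_scale_rows;
    const Type *ptr_mat_rows;
    // Largest multiple of 4 that fits in a row; the remainder is handled separately.
    const int cols4 = mats.cols - mats.cols % 4;
    for( int r = 0; r < mats.rows; r++ )
    {
        ptr_scale_rows = scale.ptr<Type>(r);
        ptr_mat_rows = mats.ptr<Type>(r);
        int c = 0;
        for( ; c < cols4; c += 4 )
        {
            ptr_scale_rows[c] += ptr_mat_rows[c] * val;
            ptr_scale_rows[c+1] += ptr_mat_rows[c+1] * val;
            ptr_scale_rows[c+2] += ptr_mat_rows[c+2] * val;
            ptr_scale_rows[c+3] += ptr_mat_rows[c+3] * val;
        }
        // Remaining 0 to 3 columns, so a width that is not a multiple of 4 stays in bounds.
        for( ; c < mats.cols; c++ )
        {
            ptr_scale_rows[c] += ptr_mat_rows[c] * val;
        }
    }
}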
//Pointer access + unroll loop + ParallelLoopBody
template <class Type>
class Parallel_ScaleAndSum: public cv::ParallelLoopBody
{
private:
    cv::Mat m_mat;
    Type m_mul;
    cv::Mat *m_result;
public:
    Parallel_ScaleAndSum(cv::Mat *result, const cv::Mat &mat, const Type &mul)
        : m_mat ...
```
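The truncated class above derives from `cv::ParallelLoopBody`, whose `operator()(const cv::Range&)` receives a sub-range of the work. The sketch below shows only that decomposition idea in plain C++: a hypothetical `RowRangeBody` functor processes the rows in `[start, end)`, and a hypothetical `run_in_chunks` helper calls it chunk by chunk, sequentially. It is not the class from the listing and it does not use threads; with OpenCV, `cv::parallel_for_` would hand out the ranges instead.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical functor: accumulates mat * val into sum for rows [start, end).
// Both buffers are row-major with `cols` elements per row.
struct RowRangeBody {
    std::vector<float>& sum;
    const std::vector<float>& mat;
    int cols;
    float val;

    void operator()(int start, int end) const
    {
        for (int r = start; r < end; ++r) {
            float* s = sum.data() + static_cast<std::size_t>(r) * cols;
            const float* m = mat.data() + static_cast<std::size_t>(r) * cols;
            for (int c = 0; c < cols; ++c)
                s[c] += m[c] * val;
        }
    }
};

// Sequential stand-in for a parallel scheduler: splits [0, rows) into chunks
// of at most `chunk` rows and runs the body on each one in turn.
void run_in_chunks(const RowRangeBody& body, int rows, int chunk)
{
    for (int start = 0; start < rows; start += chunk)
        body(start, std::min(start + chunk, rows));
}
```

Because each chunk writes to a disjoint set of rows, the chunks could run concurrently without locking, which is what makes this kind of loop a good fit for a row-range body.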
