Matrix multiplication without memory allocation

asked 2017-01-07 07:07:47 -0600

acajic

updated 2017-01-07 07:10:00 -0600

Is it possible to speed up the overloaded matrix multiplication operator (*) in OpenCV by passing a preallocated cv::Mat instance with the correct dimensions, into which the result is written?

Something like the existing function:

CV_EXPORTS_W void gemm(InputArray src1, InputArray src2, double alpha,
                       InputArray src3, double beta, OutputArray dst, int flags = 0);

only simpler. I would like to have something like this:

CV_EXPORTS_W void matmul(InputArray src1, InputArray src2, OutputArray dst);

My concern is performance. Is it possible that

res = m1 * m2;

is as fast as the hypothetical function:

matmul(m1, m2, res)

?


1 answer

3

answered 2017-01-07 09:32:28 -0600

berak

the short answer is: you should not worry at all about this.

all of these functions end up calling gemm() one way or another, and the only "overhead" would be the allocation cost of the return value, which is negligible compared to the cost of a full matrix multiplication.

what you should care about is building the opencv libs with all available optimizations (TBB, IPP, opencl, BLAS, and such), as the small example below shows (times in seconds for 500 multiplications of two 500x500 float matrices):

Mat  *             191.494
Mat  gemm no alloc 193.061
Mat  gemm    alloc 190.75
UMat gemm no alloc 63.9974  * opencv3
UMat gemm    alloc 65.2547  * opencv3

and here's the code:

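#include <opencv2/core.hpp>
#include <iostream>
using namespace cv;
using namespace std;

int main()
{
    // input matrices, filled with random values instead of uninitialised memory
    Mat A(500,500,CV_32F);
    Mat B(500,500,CV_32F);
    randu(A, Scalar::all(0), Scalar::all(1));
    randu(B, Scalar::all(0), Scalar::all(1));

    // operator*: a new result Mat is created in every iteration
    int64 t0 = getTickCount();
    for (int i=0; i<500; i++) {
        Mat C = A * B;
    }
    int64 t1 = getTickCount();
    cerr << "Mat  *             " << (t1-t0)/getTickFrequency() << endl;

    // "no alloc": the result is not preallocated, gemm() allocates it each time
    for (int i=0; i<500; i++) {
        Mat C;
        gemm(A,B,1,noArray(),0,C);
    }
    int64 t2 = getTickCount();
    cerr << "Mat  gemm no alloc " << (t2-t1)/getTickFrequency() << endl;

    // "alloc": the result is allocated once and reused by gemm()
    Mat C(500,500,CV_32F); // preallocated
    for (int i=0; i<500; i++) {
        gemm(A,B,1,noArray(),0,C);
    }
    int64 t3 = getTickCount();
    cerr << "Mat  gemm    alloc " << (t3-t2)/getTickFrequency() << endl;

    // the same two cases with UMat (opencv3), using the same input data
    UMat D(500,500,CV_32F);
    UMat E(500,500,CV_32F);
    A.copyTo(D);
    B.copyTo(E);
    int64 t4 = getTickCount();
    for (int i=0; i<500; i++) {
        UMat F;
        gemm(D,E,1,noArray(),0,F);
    }
    int64 t5 = getTickCount();
    cerr << "UMat gemm no alloc " << (t5-t4)/getTickFrequency() << endl;

    UMat F(500,500,CV_32F); // preallocated
    for (int i=0; i<500; i++) {
        gemm(D,E,1,noArray(),0,F);
    }
    int64 t6 = getTickCount();
    cerr << "UMat gemm    alloc " << (t6-t5)/getTickFrequency() << endl;
    return 0;
}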

Comments

2

With my configuration (i7-5820) the results are:

Mat  *             1.83911
Mat  gemm no alloc 1.53328
Mat  gemm    alloc 1.45606
UMat gemm no alloc 0.586018
UMat gemm    alloc 0.0264339

Parallel framework:            Concurrency
  Other third-party libraries:
    Use IPP:                     9.0.1 [9.0.1]
         at:                     G:/Lib/opencv/static2015/3rdparty/ippicv/ippicv_win
    Use IPP Async:               NO
    Use Lapack:                  YES (C:/Program Files (x86)/IntelSWTools/compilers_and_libraries/windows/mkl/lib/intel64/mkl_core.lib C:/Program Files (x86)/IntelSWTools/compilers_and_libraries/windows/mkl/lib/intel64/mkl_intel_lp64.lib C:/Program Files (x86)/IntelSWTools/compilers_and_libraries/windows/mkl/lib/intel64/mkl_sequential
LBerger ( 2017-01-07 09:49:44 -0600 )

Maybe the difference comes from this:

OpenCV now can use vendor-provided OpenVX and LAPACK/BLAS (including Intel MKL, Apple’s Accelerate, OpenBLAS and Atlas) for acceleration

or from a mistake made when you copied the source file...

LBerger ( 2017-01-07 09:52:54 -0600 )

@LBerger, imho your measurement might be far more exact than mine (taken on a single-cpu laptop w/o any optimization other than opencl)

berak ( 2017-01-07 10:02:08 -0600 )

my processor benchmark is here and yours is here :-O ?

LBerger ( 2017-01-07 10:11:38 -0600 )

it's actually an N2480, but yea, other end of the spectrum ;(

berak ( 2017-01-07 10:24:01 -0600 )
1

Ideally, the timing should not include the printing as well. It should not matter much here, though, since you perform multiple iterations.

Eduardo ( 2017-01-07 12:30:10 -0600 )
