Matrix multiplication without memory allocation

asked 2017-01-07 07:07:47 -0600

acajic
updated 2017-01-07 07:10:00 -0600

Is it possible to speed up the overloaded matrix multiplication operator (*) in OpenCV by passing a preallocated cv::Mat instance with the correct dimensions as the destination the result is written into?

Something like the existing function:

CV_EXPORTS_W void gemm(InputArray src1, InputArray src2, double alpha,
                       InputArray src3, double beta, OutputArray dst, int flags = 0);

only simpler. I would like to have something like this:

CV_EXPORTS_W void matmul(InputArray src1, InputArray src2, OutputArray dst);

My concern is performance. Is it possible that

res = m1 * m2;

is as fast as a call to the hypothetical function:

matmul(m1, m2, res);

answered 2017-01-07 09:32:28 -0600

berak
The short answer is: you should not worry about this at all.

All of these functions end up calling gemm() one way or another, and the only "overhead" is the allocation cost of the return value, which is negligible compared to the cost of a full matrix multiplication.
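
To put numbers on that: for two n x n matrices the output has n^2 elements to allocate, while a straightforward product needs n^3 multiply-adds. For n = 500 that is 250,000 elements against 125,000,000 multiply-adds. The following is a plain C++ sketch of that argument, not OpenCV code. The name matmulInto and the row-major std::vector layout are assumptions made for illustration, and the naive triple loop only stands in for the optimized gemm().

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive row-major n x n product written into caller-provided storage.
// Returns the number of multiply-adds performed (n^3), for comparison
// with the n^2 elements an output allocation would have to provide.
// dst must be a different vector than a and b.
inline std::size_t matmulInto(const std::vector<float>& a,
                              const std::vector<float>& b,
                              std::vector<float>& dst,
                              std::size_t n)
{
    assert(a.size() == n * n && b.size() == n * n && dst.size() == n * n);
    std::size_t ops = 0;
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
                ++ops;
            }
            dst[i * n + j] = sum; // overwrites the existing buffer, no allocation
        }
    }
    return ops;
}
```

The work grows by a factor of n relative to the size of the output, which is why preallocating the result makes little difference for matrices of this size.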

What you should care about is building the OpenCV libs with all available optimizations (TBB, IPP, OpenCL, BLAS, and such), as the small example below shows:

Mat  *             191.494
Mat  gemm no alloc 193.061
Mat  gemm    alloc 190.75
UMat gemm no alloc 63.9974  * opencv3
UMat gemm    alloc 65.2547  * opencv3

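And here's the code ("no alloc" means the output matrix is created inside the loop, "alloc" means it is preallocated once and reused; times are in seconds for 500 multiplications):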

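#include <opencv2/core.hpp>
#include <iostream>
using namespace cv;
using namespace std;

int main()
{
    // 500x500 float inputs, filled with random values so the timings
    // do not depend on whatever happened to be in uninitialized memory
    Mat A(500,500,CV_32F);
    Mat B(500,500,CV_32F);
    randu(A, Scalar::all(0), Scalar::all(1));
    randu(B, Scalar::all(0), Scalar::all(1));

    // overloaded operator*: the expression creates the result matrix
    int64 t0 = getTickCount();
    for (int i=0; i<500; i++) {
        Mat C = A * B;
    }
    int64 t1 = getTickCount();
    cerr << "Mat  *             " << (t1-t0)/getTickFrequency() << endl;

    // gemm() into an empty Mat: the output is allocated in every iteration
    for (int i=0; i<500; i++) {
        Mat C;
        gemm(A,B,1,noArray(),0,C);
    }
    int64 t2 = getTickCount();
    cerr << "Mat  gemm no alloc " << (t2-t1)/getTickFrequency() << endl;

    // gemm() into a Mat that is allocated once and reused
    Mat C(500,500,CV_32F); // preallocated
    for (int i=0; i<500; i++) {
        gemm(A,B,1,noArray(),0,C);
    }
    int64 t3 = getTickCount();
    cerr << "Mat  gemm    alloc " << (t3-t2)/getTickFrequency() << endl;

    // the same two gemm() variants with UMat (requires OpenCV 3)
    UMat D(500,500,CV_32F);
    UMat E(500,500,CV_32F);
    A.copyTo(D); // same input data as in the Mat runs
    B.copyTo(E);
    int64 t4 = getTickCount();
    for (int i=0; i<500; i++) {
        UMat F;
        gemm(D,E,1,noArray(),0,F);
    }
    int64 t5 = getTickCount();
    cerr << "UMat gemm no alloc " << (t5-t4)/getTickFrequency() << endl;

    UMat F(500,500,CV_32F); // preallocated
    for (int i=0; i<500; i++) {
        gemm(D,E,1,noArray(),0,F);
    }
    int64 t6 = getTickCount();
    cerr << "UMat gemm    alloc " << (t6-t5)/getTickFrequency() << endl;

    return 0;
}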

Comments

With my configuration (i7-5820) it is:

Mat  *             1.83911
Mat  gemm no alloc 1.53328
Mat  gemm    alloc 1.45606
UMat gemm no alloc 0.586018
UMat gemm    alloc 0.0264339

Parallel framework:            Concurrency
  Other third-party libraries:
    Use IPP:                     9.0.1 [9.0.1]
         at:                     G:/Lib/opencv/static2015/3rdparty/ippicv/ippicv_win
    Use IPP Async:               NO
    Use Lapack:                  YES (C:/Program Files (x86)/IntelSWTools/compilers_and_libraries/windows/mkl/lib/intel64/mkl_core.lib C:/Program Files (x86)/IntelSWTools/compilers_and_libraries/windows/mkl/lib/intel64/mkl_intel_lp64.lib C:/Program Files (x86)/IntelSWTools/compilers_and_libraries/windows/mkl/lib/intel64/mkl_sequential

LBerger ( 2017-01-07 09:49:44 -0600 )

Maybe the difference comes from this:

OpenCV now can use vendor-provided OpenVX and LAPACK/BLAS (including Intel MKL, Apple’s Accelerate, OpenBLAS and Atlas) for acceleration

or from a mistake made when copying the source file...

LBerger ( 2017-01-07 09:52:54 -0600 )

@LBerger, imho your measurement might be far more accurate than mine (taken on a single-CPU laptop without any optimization other than OpenCL)

berak ( 2017-01-07 10:02:08 -0600 )

my processor benchmark is here and yours is here :-O ?

LBerger ( 2017-01-07 10:11:38 -0600 )

it's actually an N2480, but yeah, other end of the spectrum ;(

berak ( 2017-01-07 10:24:01 -0600 )

it could be worse :)

LBerger ( 2017-01-07 10:38:39 -0600 )

Ideally, you would not want the printing to be included in the timed region, which happens here where the end timestamp of one loop is reused as the start of the next. It should not matter much, though, as you perform multiple iterations.

Eduardo ( 2017-01-07 12:30:10 -0600 )
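
One way to keep the printing out of the measurement is to take both timestamps directly around the loop and print only afterwards. A minimal sketch, using std::chrono rather than getTickCount(); the helper name timeSeconds is made up for this example:

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs `body` `iterations` times and returns the
// elapsed wall-clock time in seconds. Nothing is printed inside the
// timed region; the caller prints the returned value afterwards.
template <typename Body>
double timeSeconds(int iterations, Body&& body)
{
    assert(iterations >= 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        body();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

With the benchmark from the answer, each case would then become something like double s = timeSeconds(500, [&] { gemm(A,B,1,noArray(),0,C); }); followed by the cerr line.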
