Revision history [back]

Is this code correct to allocate two aligned cv::Mat?

Disclamer: I am a SIMD newbie, but I have a good knowledge of OpenCV.

Let's suppose I have basic OpenCV code.

cv::Mat1f a = cv::Mat(n, n);
cv::Mat1f b = cv::Mat(n, n);
cv::Mat1f x;
//fill a and b somehow
x = a+b;

Now, let's suppose I want to use the code above on AVX2 or even AVX-512 machine. Having the data 32-aligned could be a great benefit for performances. The data above should not be aligned. I'm almost certain of it because in the optimization report files generated by the Intel Compiler it's said that the data is unaligned, but I could be wrong.

So what if I allocate 32-aligned pointers and use them for the operations above? Something like:

float *apt = __mm_alloc(sizeof(float)*m*n, 32);
float *bpt = __mm_alloc(sizeof(float)*m*n, 32);
float *xpt = __mm_alloc(sizeof(float)*m*n, 32);
cv::Mat1f a = cv::Mat(n, m, apt);
cv::Mat1f b = cv::Mat(m, n, bpt);
cv::Mat1f x = cv::Mat(n, n, xpt);
x = a+b;

I see three problems that concerns me in the code above:

Matrix multiplications in cv::Mat returns a new cv::Mat object which, unfortunately, would be not aligned.
Am I reinventing the wheel? I don't know if OpenCV includes already this kind of optimizations, even though I didn't find nothing useful on Google.
Reading a little bit on Intel Intrinsics, when we use __mm_malloc, we should use __mm_free. However, cv::Mat are "kinda" of smart pointers, so they're allocated and de-allocated behind-the-hood, where probably some free or delete function is called.

To overcome problem 1, in very critical sections of code where data alignment could improve the performance, I could get the pointer back and do the sum operation using pointers. Something like:

//do the code above
float *aptr = a.ptr<float>(0);
float *bptr = b.ptr<float>(0);
float *xptr = c.ptr<float>(0);
for(int i=0; i<n ; i++)
  for(int j=0; j<m; j++)
    xptr[i*m+j] = a[i*m+j] + b[i*m+j];

But again, I'm afraid that I'm reinventing the wheel. Notice that this is a toy example to make it MCV, my actual code is much more complicated than this and compiler optimizations may not be obvious.

I compile my code with icpc and the following flags:

INTEL_OPT=-O3 -ipo -simd -xCORE-AVX2 -parallel -qopenmp -fargument-noalias -ansi-alias -no-prec-div -fp-model fast=2 -fma -align -finline-functions
INTEL_PROFILE=-g -qopt-report=5 -Bdynamic -shared-intel -debug inline-debug-info -qopenmp-link dynamic -parallel-source-info=2 -ldl