# Different results for matrix multiplication with gemm() for Mat vs UMat

I am trying to multiply very large matrices using the gemm() function. With Mat variables it takes a long time, so I switched to UMat. But I get very different results from UMat for the same operation, and some of the values are NaN. Here is a minimal sample I ran afterwards:

```
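#include <iostream>
#include <opencv2/core.hpp>

int main(int argc, char** argv){
    // m1 is 5x1 and m2 is 1x5; with both transposed the product is 1x1.
    cv::Mat m1 = cv::Mat::ones(5, 1, CV_32FC1);
    cv::Mat m2 = cv::Mat::zeros(1, 5, CV_32FC1);
    cv::Mat output;
    cv::gemm(m1, m2, 1.0, cv::noArray(), 0.0, output, cv::GEMM_1_T + cv::GEMM_2_T);
    std::cout << output;
    return 0;
}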
```

Output: [0]

```
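#include <iostream>
#include <opencv2/core.hpp>

int main(int argc, char** argv){
    // Same operation as above, but on UMat so that OpenCL may be used.
    cv::UMat m1 = cv::UMat::ones(5, 1, CV_32FC1);
    cv::UMat m2 = cv::UMat::zeros(1, 5, CV_32FC1);
    cv::UMat output;
    cv::gemm(m1, m2, 1.0, cv::noArray(), 0.0, output, cv::GEMM_1_T + cv::GEMM_2_T);
    std::cout << output;
    return 0;
}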
```

Output: [5.4256896e+35]

Can someone please tell me why the same operation gives different values, and how I can correct it? I cannot simply use Mat, since I want to use the GPU to reduce the time taken.
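As a sanity check on which of the two results is right, here is a minimal plain C++ sketch of what gemm() is documented to compute when beta is 0 and no third matrix is passed: alpha * op(m1) * op(m2), where op is an optional transpose. `Matrix` and `referenceGemm` are hypothetical names used only for illustration; they are not OpenCV API, and the loop makes no attempt to mirror OpenCV's internals.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type (not OpenCV): a dense row-major float matrix.
struct Matrix {
    std::size_t rows, cols;
    std::vector<float> data;
    Matrix(std::size_t r, std::size_t c, float fill = 0.0f)
        : rows(r), cols(c), data(r * c, fill) {}
    float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
    float& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
};

// Hypothetical reference: returns alpha * op(a) * op(b), where op(x) is x
// itself or its transpose depending on transA / transB.
Matrix referenceGemm(const Matrix& a, const Matrix& b, float alpha,
                     bool transA, bool transB) {
    const std::size_t m  = transA ? a.cols : a.rows;  // rows of op(a)
    const std::size_t k  = transA ? a.rows : a.cols;  // cols of op(a)
    const std::size_t kb = transB ? b.cols : b.rows;  // rows of op(b)
    const std::size_t n  = transB ? b.rows : b.cols;  // cols of op(b)
    assert(k == kb && "inner dimensions must agree");

    Matrix out(m, n);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double acc = 0.0;  // accumulate in double to limit rounding error
            for (std::size_t p = 0; p < k; ++p) {
                const float av = transA ? a.at(p, i) : a.at(i, p);
                const float bv = transB ? b.at(j, p) : b.at(p, j);
                acc += static_cast<double>(av) * bv;
            }
            out.at(i, j) = alpha * static_cast<float>(acc);
        }
    }
    return out;
}
```

Calling `referenceGemm(Matrix(5, 1, 1.0f), Matrix(1, 5, 0.0f), 1.0f, true, true)` gives a 1x1 matrix holding 0, which agrees with the Mat output above, so the UMat value looks like the incorrect one.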

Edit: I narrowed the problem down. It seems to occur only when one of the matrices being multiplied is a single-row or single-column matrix.

```
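#include <iostream>
#include <opencv2/core.hpp>
using namespace cv;
using namespace std;

int main(int argc, char** argv){
    // Row vector (1x5) times column vector (5x1), no transpose flags.
    UMat m1(1, 5, CV_32FC1);
    UMat m2(5, 1, CV_32FC1);
    randu(m1, Scalar::all(0), Scalar::all(1));
    randu(m2, Scalar::all(0), Scalar::all(1));
    UMat output;
    gemm(m1, m2, 1.0, noArray(), 0.0, output);
    cout << m1 << endl;
    cout << m2 << endl;
    cout << output << endl;
    return 0;
}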
```

Output:

```
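#include <iostream>
#include <opencv2/core.hpp>
using namespace cv;
using namespace std;

int main(int argc, char** argv){
    // Two 2x5 matrices; the second is transposed, so the product is 2x2.
    UMat m1(2, 5, CV_32FC1);
    UMat m2(2, 5, CV_32FC1);
    randu(m1, Scalar::all(0), Scalar::all(1));
    randu(m2, Scalar::all(0), Scalar::all(1));
    UMat output;
    gemm(m1, m2, 1.0, noArray(), 0.0, output, GEMM_2_T);
    cout << m1 << endl;
    cout << m2 << endl;
    cout << output << endl;
    return 0;
}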
```

Output:

Please remove the screenshot and post the program and results as text.

Changed. Also, the problem only seems to occur when I have to transpose a matrix.

I get no problem with the result using your code. My OpenCV version is 3.4.1-dev, the platform is Windows 10 64-bit with MSVC 2017, and the graphics card is: [ INFO:0] Preparing OpenCL cache configuration for context: NVIDIA_Corporation--GeForce_GTX_970--390_77.

What is your version and platform? (You can insert the output of getBuildInformation() in your post: cout << getBuildInformation() << endl;)

BUT there is an exception thrown : [ INFO:0] Preparing OpenCL cache configuration for context: NVIDIA_Corporation--GeForce_GTX_970--390_77 OpenCL error CL_INVALID_WORK_GROUP_SIZE (-54) during call: clEnqueueNDRangeKernel('transpose', dims=2, globalsize=1x8x1, localsize=32x8x1) sync=false
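One plausible reading of that error, assuming the classic OpenCL 1.x rule applies: when a local size is passed to clEnqueueNDRangeKernel, each global dimension has to be a whole multiple of the matching local dimension, and here 1 is not a multiple of 32. The sketch below only illustrates that arithmetic in plain C++. `isUniformLaunch` and `roundUpGlobal` are hypothetical helper names, not OpenCL or OpenCV functions, and I have not checked how OpenCV's transpose kernel actually picks its sizes.

```cpp
#include <array>
#include <cstddef>

using Dims = std::array<std::size_t, 3>;

// Hypothetical helper: true when every global dimension is a whole multiple
// of the corresponding local (work-group) dimension.
bool isUniformLaunch(const Dims& global, const Dims& local) {
    for (std::size_t i = 0; i < global.size(); ++i) {
        if (local[i] == 0 || global[i] % local[i] != 0) {
            return false;
        }
    }
    return true;
}

// Hypothetical helper: rounds each global dimension up to the next multiple
// of the local dimension. A kernel launched this way must bounds-check its
// indices, because the padded range covers more work-items than data.
Dims roundUpGlobal(const Dims& global, const Dims& local) {
    Dims padded = global;
    for (std::size_t i = 0; i < global.size(); ++i) {
        if (local[i] != 0) {
            padded[i] = ((global[i] + local[i] - 1) / local[i]) * local[i];
        }
    }
    return padded;
}
```

For the sizes in the log, `isUniformLaunch({1, 8, 1}, {32, 8, 1})` is false, and `roundUpGlobal` would pad the global size to 32x8x1. This explains the error code at most; whether it is also the cause of the wrong UMat values on the AMD card is not established here.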

My OpenCV version is 3.4.0 on Windows 8.1 64-bit, and the graphics card is: [ INFO 0: ] Preparing OpenCL cache configuration for context: 32-bit--Advanced_Micro_Devices__Inc_--Hainan--2348_4

Also, it turns out that transpose is not the problem; gemm() gives the wrong result for row/column matrices.

You can update to 3.4.1, but I don't think it will solve the problem. Can you update the GPU driver? On NVIDIA the OpenCL compiler is nvopencl.dll. Can you update the equivalent compiler (the AMD one, of course)?