I am trying to perform a large matrix multiplication using the gemm() function. With Mat variables it takes a long time, so I switched to UMat. However, I get very different results with UMat for the same operation, and some of the values are NaN.
Here is the sample that I ran afterwards:

```
#include <iostream>
#include <opencv2/core.hpp>

int main(int argc, char** argv) {
    cv::Mat m1 = cv::Mat::ones(5, 1, CV_32FC1);
    cv::Mat m2 = cv::Mat::zeros(1, 5, CV_32FC1);
    cv::Mat output;
    // m1^T (1x5) * m2^T (5x1) gives a 1x1 result
    cv::gemm(m1, m2, 1.0, cv::noArray(), 0.0, output, cv::GEMM_1_T + cv::GEMM_2_T);
    std::cout << output;
    return 0;
}
```

Output: [0]

```
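#include <iostream>
#include <opencv2/core.hpp>

int main(int argc, char** argv) {
    cv::UMat m1 = cv::UMat::ones(5, 1, CV_32FC1);
    cv::UMat m2 = cv::UMat::zeros(1, 5, CV_32FC1);
    cv::UMat output;
    // Same operation as above, but on UMat: m1^T (1x5) * m2^T (5x1)
    cv::gemm(m1, m2, 1.0, cv::noArray(), 0.0, output, cv::GEMM_1_T + cv::GEMM_2_T);
    std::cout << output;
    return 0;
}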
```

Output: [5.4256896e+35]
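
For reference, the expected value can be worked out without OpenCV at all. The following is only a minimal sketch in plain C++: `refGemm` is a hypothetical helper, not an OpenCV function, and it assumes row-major float buffers and a beta of 0 (so the third input matrix is ignored, as in the calls above).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference helper (not part of OpenCV): computes
// alpha * op(A) * op(B) for row-major float matrices, where op(X) is
// either X or its transpose.
std::vector<float> refGemm(const std::vector<float>& a, int aRows, int aCols, bool transA,
                           const std::vector<float>& b, int bRows, int bCols, bool transB,
                           float alpha = 1.0f)
{
    // The buffers must match the declared shapes.
    assert(a.size() == static_cast<std::size_t>(aRows) * aCols);
    assert(b.size() == static_cast<std::size_t>(bRows) * bCols);

    // Shapes after applying the optional transposes.
    const int m  = transA ? aCols : aRows;  // rows of op(A)
    const int k  = transA ? aRows : aCols;  // cols of op(A)
    const int k2 = transB ? bCols : bRows;  // rows of op(B)
    const int n  = transB ? bRows : bCols;  // cols of op(B)
    if (k != k2) return {};                 // inner dimensions do not match

    std::vector<float> out(static_cast<std::size_t>(m) * n, 0.0f);
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int p = 0; p < k; ++p) {
                // Element (i, p) of op(A) and (p, j) of op(B), read from row-major storage.
                const float av = transA ? a[p * aCols + i] : a[i * aCols + p];
                const float bv = transB ? b[j * bCols + p] : b[p * bCols + j];
                sum += av * bv;
            }
            out[i * n + j] = alpha * sum;
        }
    }
    return out;
}
```

With five ones for `m1` and five zeros for `m2`, `refGemm(a, 5, 1, true, b, 1, 5, true)` returns a single element equal to 0, which agrees with the Mat output and not with the UMat one.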

Can someone please tell me why the same operation gives different values and how I can correct it? I cannot simply use Mat, since I want to use the GPU to reduce the time taken.
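
One way to quantify the mismatch, as a rough sketch only: copy the Mat and UMat results into plain float buffers and compare them element by element. `maxAbsDiff` is a hypothetical helper, not an OpenCV function, and treating any NaN as an infinite difference is my own assumption so that a corrupted result is never mistaken for a small error.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: largest absolute difference between two result buffers.
// Returns infinity if either buffer contains a NaN.
float maxAbsDiff(const std::vector<float>& expected, const std::vector<float>& actual)
{
    assert(expected.size() == actual.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < expected.size(); ++i) {
        if (std::isnan(expected[i]) || std::isnan(actual[i]))
            return std::numeric_limits<float>::infinity();
        worst = std::max(worst, std::fabs(expected[i] - actual[i]));
    }
    return worst;
}
```

For the sample above, comparing the Mat value 0 against the UMat value 5.4256896e+35 gives a difference of the same magnitude as the UMat value itself, far beyond anything floating-point rounding could explain.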

Edit: I narrowed down the problem, and it seems to occur only when either of the matrices being multiplied is a single-row or single-column matrix.

```
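#include <iostream>
#include <opencv2/core.hpp>

int main(int argc, char** argv) {
    cv::UMat m1(1, 5, CV_32FC1);
    cv::UMat m2(5, 1, CV_32FC1);
    cv::randu(m1, cv::Scalar::all(0), cv::Scalar::all(1));
    cv::randu(m2, cv::Scalar::all(0), cv::Scalar::all(1));
    cv::UMat output;
    // Single-row times single-column: (1x5) * (5x1) gives a 1x1 result
    cv::gemm(m1, m2, 1.0, cv::noArray(), 0.0, output);
    std::cout << m1 << std::endl;
    std::cout << m2 << std::endl;
    std::cout << output << std::endl;
    return 0;
}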
```

Output:

```
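#include <iostream>
#include <opencv2/core.hpp>

int main(int argc, char** argv) {
    cv::UMat m1(2, 5, CV_32FC1);
    cv::UMat m2(2, 5, CV_32FC1);
    cv::randu(m1, cv::Scalar::all(0), cv::Scalar::all(1));
    cv::randu(m2, cv::Scalar::all(0), cv::Scalar::all(1));
    cv::UMat output;
    // No single-row or single-column operand: (2x5) * m2^T (5x2) gives a 2x2 result
    cv::gemm(m1, m2, 1.0, cv::noArray(), 0.0, output, cv::GEMM_2_T);
    std::cout << m1 << std::endl;
    std::cout << m2 << std::endl;
    std::cout << output << std::endl;
    return 0;
}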
```

Output: