Sequence of calls in the cv::gemm(...) function
I recently dug into the OpenCV sources because of the low performance of the opencv_dnn module. Searching around, I arrived at the cv::gemm(...) function. As far as I can see, the function can be divided into two parts. The first, optional part calls an optimized version of the gemm routine, either from the ocl module or, if clAMDBLAS is defined, from clAMDBLAS. The second part performs some optional transpositions and finally calls gemm32/64f(...), which (following the chain of calls) ends up in the "manual", non-optimized gemmImpl(...) function!

From the source code, the two parts appear to be independent, so when cv::gemm is called both of them are executed, and performance drops dramatically. If I comment out the second part, I get a 15x speed-up, BUT also different results on the same data. That means the second part does something very important, but I cannot find out what exactly.

So, is this a bug? What exactly does cv::gemm(...) do?
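
For reference when comparing the outputs of the two code paths: the OpenCV documentation describes cv::gemm as computing dst = alpha*op(src1)*op(src2) + beta*op(src3), where each op is an optional transpose selected by the GEMM_1_T, GEMM_2_T and GEMM_3_T flags. Below is a minimal sketch of that formula in plain C++, independent of OpenCV. The names Mat2D and referenceGemm are hypothetical helpers made up for this illustration and are not part of the library. The sketch is not how gemmImpl(...) is written; it is only a naive baseline for checking results.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal row-major matrix type (not an OpenCV class).
struct Mat2D {
    std::size_t rows, cols;
    std::vector<double> data;
    Mat2D(std::size_t r, std::size_t c, double v = 0.0)
        : rows(r), cols(c), data(r * c, v) {}
    double& at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    double at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Element (i, j) of op(M), where op is either identity or transpose.
static double opAt(const Mat2D& M, bool trans, std::size_t i, std::size_t j) {
    return trans ? M.at(j, i) : M.at(i, j);
}

// Naive reference: D = alpha * op(A) * op(B) + beta * op(C).
// C may be null, in which case the beta term is skipped.
// The three bool parameters play the role of GEMM_1_T, GEMM_2_T, GEMM_3_T.
Mat2D referenceGemm(const Mat2D& A, const Mat2D& B, double alpha,
                    const Mat2D* C, double beta,
                    bool transA = false, bool transB = false,
                    bool transC = false) {
    const std::size_t m = transA ? A.cols : A.rows;       // rows of op(A)
    const std::size_t k = transA ? A.rows : A.cols;       // cols of op(A)
    const std::size_t kB = transB ? B.cols : B.rows;      // rows of op(B)
    const std::size_t n = transB ? B.rows : B.cols;       // cols of op(B)
    assert(k == kB && "inner dimensions must match");

    Mat2D D(m, n, 0.0);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double acc = 0.0;
            for (std::size_t p = 0; p < k; ++p)
                acc += opAt(A, transA, i, p) * opAt(B, transB, p, j);
            double value = alpha * acc;
            if (C != nullptr)
                value += beta * opAt(*C, transC, i, j);
            D.at(i, j) = value;
        }
    }
    return D;
}
```

Running the same inputs through a baseline like this and through each part of cv::gemm separately should show which part produces the expected result. Small floating-point differences are normal, because optimized routines may sum the terms in a different order.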