GpuMat's step is always in bytes, so you should access diffsqr_matrix elements in this way:
float* diffsqr_row = (float*)((char*)diffsqr_matrix + threadIdx.x * diffsqr_step);
diffsqr_row[blockIdx.x*cols + threadIdx.y] = (float) diffsqr;
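I also recommend swapping the roles of threadIdx.x and threadIdx.y (threadIdx.y for the row, threadIdx.x for the column). Consecutive threads in a warp differ in threadIdx.x, so they will then write to consecutive addresses within a row, which gives you coalesced memory access.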
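
To make the byte-step addressing concrete, here is a minimal host-side sketch that emulates a padded (pitched) matrix in plain C++. The names row_ptr and fill_pitched are hypothetical helpers made up for this illustration, not part of OpenCV or CUDA, and the 32-byte step in the usage note is an arbitrary assumption. The pointer arithmetic in row_ptr is the same cast-to-char, add-bytes, cast-back pattern as in the line above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: pointer to the first element of row `row`.
// `step_bytes` is the distance between consecutive rows in BYTES,
// which is how GpuMat reports its step.
inline float* row_ptr(void* base, std::size_t step_bytes, int row) {
    return reinterpret_cast<float*>(static_cast<char*>(base) + row * step_bytes);
}

// Hypothetical demo: fill a rows x cols float matrix whose rows are padded
// to `step_bytes`, and return the raw storage so the layout can be inspected.
// Padding elements keep the sentinel value -1.
std::vector<float> fill_pitched(int rows, int cols, std::size_t step_bytes) {
    assert(step_bytes >= cols * sizeof(float));   // a row must fit in one step
    assert(step_bytes % sizeof(float) == 0);      // keeps each row float-aligned
    std::vector<float> storage(rows * step_bytes / sizeof(float), -1.0f);
    for (int r = 0; r < rows; ++r) {
        float* row = row_ptr(storage.data(), step_bytes, r);
        for (int c = 0; c < cols; ++c) {
            row[c] = static_cast<float>(r * 100 + c);  // element (r, c)
        }
    }
    return storage;
}
```

For example, fill_pitched(3, 4, 32) stores 4 floats per row followed by 4 floats of padding, so element (1, 2) lives at flat index 8 + 2 rather than 4 + 2. Indexing with cols instead of the byte step would land in the wrong place as soon as the rows are padded.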