I've struggled through this StereoBM code too, and while I did not come to complete enlightenment, I came to an understanding. The data structure dynamically allocated and here indexed by bufSize0, is a cache of intermediate results that allows optimizing the algorithm for a much faster calculation.
The basic StereoBM algorithm is relatively easy to comprehend, but the algorithm scales as a product of dimensions, roughly O(cols * rows * disparity_range * windowsize * windowsize). You could write a simple, straightforward algorithm to do this, and it would be correct, but it would be slow.
But, if you can use a memory buffer (appropriately sized and dimensioned based on above parameters) to cache already calculated results, then you can save a lot of time by not recalculating block match scores exhaustively.
The first optimization is to realize that many of the addressed block matches at a given column, row, disparity offset, window offset (dx, dy), are exactly the same as a previously calculated block match score result . So, cache those intermediate results and reuse them without recalculating them.
The second optimization is to realize that if you've simply moved one row or column beyond a previously calculated block, since the block match score is a sum of absolute differences, you can start with the prior similar block match result, subtract the score result of the the row or column you are leaving behind, and add the score result from the row or column you are moving toward.
The details of the indexed data structure are complicated, and its addressing is complicated, but it is performing the above sorts of optimizations to store and reuse intermediate results, to vastly reduce the amount of time to exhaustively recalculate every block match score pixel by pixel.