I think the real issue is that CUDA needs pinned (page-locked) host memory to do asynchronous transfers to the GPU. If the memory backing cpuImage is not pinned, the transfer is performed synchronously.

Maybe like this: https://answers.opencv.org/question/168354/how-to-assign-cvmat-to-point-to-the-page-locked-memory-pinned-memory/

Edit:

Confirmed. Simply using pinned host memory makes the upload/download methods work asynchronously with a stream, as expected. I used numba to allocate a pinned numpy array:

data_cpu = numba.cuda.pinned_array(shape=(2*8192, 2*8192), dtype=np.float32)
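
For illustration, here is a minimal sketch of the same idea using numba's own stream API rather than OpenCV's upload/download (so it is not the exact code I ran). The function name async_roundtrip and the array sizes are made up for the example, and it simply returns None when numba or a CUDA device is not available:

```python
try:
    import numpy as np
    from numba import cuda
    HAVE_CUDA = cuda.is_available()
except Exception:  # numpy/numba missing, or no usable CUDA driver
    cuda = None
    HAVE_CUDA = False


def async_roundtrip(shape=(1024, 1024)):
    """Copy a pinned host array to the GPU and back on one stream.

    Hypothetical helper for illustration. Returns the downloaded array,
    or None when no CUDA device is usable.
    """
    if not HAVE_CUDA:
        return None

    # Page-locked host buffers, so the copies below can run asynchronously.
    src = cuda.pinned_array(shape, dtype=np.float32)
    dst = cuda.pinned_array(shape, dtype=np.float32)
    src[:] = 1.0

    stream = cuda.stream()
    d_arr = cuda.to_device(src, stream=stream)  # enqueue host -> device copy
    d_arr.copy_to_host(dst, stream=stream)      # enqueue device -> host copy
    # Host code could do unrelated work here while the copies are in flight.
    stream.synchronize()                        # wait before reading dst
    return dst
```

If src and dst were ordinary numpy arrays (e.g. from np.empty), the same calls would still work, but the copies would be expected to fall back to synchronous behaviour, which is what the original question ran into.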