I think the real issue is that CUDA needs pinned (page-locked) host memory to do asynchronous transfers to the GPU. If the memory backing cpuImage is ordinary pageable memory, the transfer is performed synchronously, even when you pass a stream.
This thread may show one way to set that up: https://answers.opencv.org/question/168354/how-to-assign-cvmat-to-point-to-the-page-locked-memory-pinned-memory/
Edit:
Confirmed. Simply switching to pinned host memory makes the upload/download methods run asynchronously on the stream, as expected. I used Numba to allocate a pinned NumPy array:
data_cpu = numba.cuda.pinned_array(shape=(2*8192, 2*8192), dtype=np.float32)
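
For anyone who wants to try this in isolation, here is a minimal sketch of the same idea. Note that it uses Numba's own stream and copy calls (cuda.to_device / copy_to_host) rather than the OpenCV upload/download methods, so it only illustrates the pinned-memory part. The names alloc_host, roundtrip and HAVE_CUDA are made up for this example, and the CPU-only fallback is just there so the snippet also runs on a machine without a GPU.

```python
import numpy as np

try:
    from numba import cuda
    HAVE_CUDA = cuda.is_available()
except ImportError:  # Numba not installed
    cuda = None
    HAVE_CUDA = False


def alloc_host(shape, dtype=np.float32):
    """Hypothetical helper: pinned buffer if a GPU is usable, else a pageable array."""
    if HAVE_CUDA:
        return cuda.pinned_array(shape=shape, dtype=dtype)
    return np.empty(shape, dtype=dtype)


def roundtrip(src, dst):
    """Hypothetical helper: copy src -> GPU -> dst on a single stream."""
    if not HAVE_CUDA:
        dst[...] = src  # CPU-only fallback so the sketch still runs
        return
    stream = cuda.stream()
    # Both copies are queued on the stream; they can only be truly
    # asynchronous when src and dst are page-locked.
    d_arr = cuda.to_device(src, stream=stream)
    d_arr.copy_to_host(dst, stream=stream)
    # Independent CPU work could overlap with the transfers here.
    stream.synchronize()  # wait before reading dst


src = alloc_host((1024, 1024))
dst = alloc_host((1024, 1024))
src[...] = 1.5
roundtrip(src, dst)
```

To see the difference yourself, you could swap alloc_host for a plain np.empty and compare timelines in a profiler; I would expect the pageable version to block on each copy, as described above.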