# warpAffine using memory already allocated on a CUDA device

Hi folks, I'm trying to build the following pipeline in C++ using OpenCV 3.4.7:

feature extraction -> transform and crop 32x32 patches around features -> use the patches in LibTorch.

Feature extraction is offloaded to a CUDA implementation of SIFT, since in my scenario I can't sacrifice execution time. Once that is done, the data is copied from GPU memory back to host memory and I compute the transformation matrices Ms, one for every keypoint. I then crop a patch around each keypoint by calling warpAffine with its matrix, the interpolation flags WARP_INVERSE_MAP + INTER_LINEAR + WARP_FILL_OUTLIERS, and the border mode BORDER_REPLICATE (so I cannot use the NPP primitive). After a lot of time (well, not a lot, but at 400 ms still too much for my scenario) I get the patches and can feed them to LibTorch.
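To make the per-keypoint step concrete, here is a minimal CPU sketch in plain C++ of what a warpAffine call with WARP_INVERSE_MAP, INTER_LINEAR and BORDER_REPLICATE amounts to for one patch. It is only an illustration, not OpenCV's implementation: `Image`, `Affine`, `patchToImage`, `sampleBilinear` and `extractPatch` are hypothetical names, and building the matrix from a keypoint center, scale and angle is an assumption, since the actual construction of the Ms matrices is not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical minimal grayscale image (not an OpenCV type), row-major.
struct Image {
    int width;
    int height;
    std::vector<float> data;

    // Replicate border: clamp out-of-range coordinates to the nearest edge pixel.
    float at(int x, int y) const {
        x = std::min(std::max(x, 0), width - 1);
        y = std::min(std::max(y, 0), height - 1);
        return data[static_cast<std::size_t>(y) * width + x];
    }
};

// 2x3 affine matrix stored row-major: [m0 m1 m2; m3 m4 m5].
struct Affine { float m[6]; };

// Assumed construction: maps patch coordinates to image coordinates
// (the "inverse map" direction), centering the patch on (cx, cy).
Affine patchToImage(float cx, float cy, float scale, float angleRad, int patchSize) {
    const float c = scale * std::cos(angleRad);
    const float s = scale * std::sin(angleRad);
    const float h = 0.5f * static_cast<float>(patchSize - 1);  // patch center
    Affine M = {{ c, -s, cx - c * h + s * h,
                  s,  c, cy - s * h - c * h }};
    return M;
}

// Bilinear interpolation at a fractional source position.
float sampleBilinear(const Image& img, float x, float y) {
    const int x0 = static_cast<int>(std::floor(x));
    const int y0 = static_cast<int>(std::floor(y));
    const float fx = x - static_cast<float>(x0);
    const float fy = y - static_cast<float>(y0);
    const float top = (1.0f - fx) * img.at(x0, y0)     + fx * img.at(x0 + 1, y0);
    const float bot = (1.0f - fx) * img.at(x0, y0 + 1) + fx * img.at(x0 + 1, y0 + 1);
    return (1.0f - fy) * top + fy * bot;
}

// For every destination pixel, apply M to find the source position and sample it.
std::vector<float> extractPatch(const Image& src, const Affine& M, int patchSize) {
    assert(src.width > 0 && src.height > 0 && patchSize > 0);
    std::vector<float> patch(static_cast<std::size_t>(patchSize) * patchSize);
    for (int dy = 0; dy < patchSize; ++dy) {
        for (int dx = 0; dx < patchSize; ++dx) {
            const float sx = M.m[0] * dx + M.m[1] * dy + M.m[2];
            const float sy = M.m[3] * dx + M.m[4] * dy + M.m[5];
            patch[static_cast<std::size_t>(dy) * patchSize + dx] = sampleBilinear(src, sx, sy);
        }
    }
    return patch;
}
```

Calling `extractPatch(img, patchToImage(cx, cy, scale, angle, 32), 32)` once per keypoint reproduces the structure of the CPU loop described above.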

Currently, computing the Ms matrices and calling warpAffine is done on the CPU in a loop. I could parallelize it with a parallel_for_ loop, but dealing with the indices is not that easy. I would like to avoid that much I/O and do all the hard work on the GPU, since I run the SIFT extraction there and all the data is already stored in GPU memory. Moreover, LibTorch can run on a CUDA device, so doing the whole job on the GPU avoids all the upload/download calls. I wrote a CUDA kernel that computes the Ms matrices for every keypoint. I haven't tested it yet, but I'm going to, and it is feasible. Then I have to extract the patches: I could use cv::cuda::warpAffine to offload the computation to the graphics card and use the data already stored there. However, I checked the code and it doesn't seem that cv::cuda::warpAffine expects the Ms matrix to be already stored on the GPU; rather, a creation and allocation is performed inside the call.
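The index handling for a batched version can be sketched as follows, again in plain C++ so it stays runnable without a GPU. Each flat index is decomposed into (keypoint, row, column), which is the same arithmetic a one-thread-per-output-pixel CUDA kernel would do with `blockIdx.x * blockDim.x + threadIdx.x`. The names `readClamped`, `warpOnePixel` and `warpPatchesBatched` are hypothetical, and the packed layout of six floats per keypoint for the Ms matrices is an assumption, not something OpenCV defines.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Clamped read (replicate border) from a row-major grayscale image.
float readClamped(const float* img, int w, int h, int x, int y) {
    x = std::min(std::max(x, 0), w - 1);
    y = std::min(std::max(y, 0), h - 1);
    return img[static_cast<std::size_t>(y) * w + x];
}

// Work of one logical thread. In a CUDA kernel, idx would be the global
// thread index and this body would be the kernel body.
void warpOnePixel(std::size_t idx, const float* img, int w, int h,
                  const float* Ms, int patchSize, float* out) {
    const std::size_t perPatch = static_cast<std::size_t>(patchSize) * patchSize;
    const std::size_t k = idx / perPatch;               // which keypoint
    const int r  = static_cast<int>(idx % perPatch);    // pixel inside the patch
    const int dy = r / patchSize;
    const int dx = r % patchSize;

    const float* M = Ms + 6 * k;                        // assumed packing: 6 floats per keypoint
    const float sx = M[0] * dx + M[1] * dy + M[2];      // inverse map: patch -> image
    const float sy = M[3] * dx + M[4] * dy + M[5];

    const int x0 = static_cast<int>(std::floor(sx));
    const int y0 = static_cast<int>(std::floor(sy));
    const float fx = sx - static_cast<float>(x0);
    const float fy = sy - static_cast<float>(y0);
    const float top = (1.0f - fx) * readClamped(img, w, h, x0, y0)
                    + fx * readClamped(img, w, h, x0 + 1, y0);
    const float bot = (1.0f - fx) * readClamped(img, w, h, x0, y0 + 1)
                    + fx * readClamped(img, w, h, x0 + 1, y0 + 1);
    out[idx] = (1.0f - fy) * top + fy * bot;
}

// Host-side stand-in for the kernel launch: one iteration per output pixel.
// The result is a contiguous N x patchSize x patchSize buffer.
std::vector<float> warpPatchesBatched(const std::vector<float>& img, int w, int h,
                                      const std::vector<float>& Ms, int patchSize) {
    assert(w > 0 && h > 0 && patchSize > 0);
    assert(img.size() == static_cast<std::size_t>(w) * h);
    assert(Ms.size() % 6 == 0);
    const std::size_t n = Ms.size() / 6;
    const std::size_t total = n * static_cast<std::size_t>(patchSize) * patchSize;
    std::vector<float> out(total);
    for (std::size_t idx = 0; idx < total; ++idx) {
        warpOnePixel(idx, img.data(), w, h, Ms.data(), patchSize, out.data());
    }
    return out;
}
```

Because every output pixel depends only on its own index, the loop has no ordering constraints, and the contiguous N x 32 x 32 output is the kind of buffer that could be wrapped as a tensor (for example with `torch::from_blob`) without a further copy. Whether this matches the performance of OpenCV's own kernel is untested.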

The question is the following:

Is there a way to call the OpenCV CUDA warpAffine kernel directly, with the parameters I mentioned, on multiple keypoints? Since the final patches are just 32x32, I don't think there will be many problems after all. I've looked at the .cu file and seen that there are two dispatchers, WarpDispatcherStream and WarpDispatcherNonStream, as well as the kernel warp, but I cannot find anything regarding the declarations of B<work_type>, BorderReader and Filter. I found something in the cudev interface, but I don't know how to put everything together. Calling the proper kernel directly should do the trick for my use case. That way I should be able to put the images in shared memory and work directly on the keypoints, avoiding function calls and any other overhead.

EDIT: I figured out what BorderReader and Filter do and where they are defined. I still don't get why ...
