warpAffine using memory already allocated on CUDA devices

Hi folks, I'm trying to implement the following pipeline in C++ using OpenCV 3.4.7:

feature extraction -> transform and crop 32x32 patches around features -> use the patches in LibTorch.

The feature extraction is delegated to a CUDA SIFT implementation, since in my scenario I can't sacrifice execution time. Once it is done, the data is downloaded from GPU memory to host memory and I compute a transformation matrix M for every keypoint. Then I crop the patches around each keypoint by calling warpAffine with its matrix M, the interpolation flags WARP_INVERSE_MAP + INTER_LINEAR + WARP_FILL_OUTLIERS, and BORDER_REPLICATE as the border mode (so I cannot use the NPP primitive). After too much time for my scenario (around 400 ms) I get the patches and can feed them to LibTorch.
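
For reference, the current CPU path looks roughly like this (a sketch, not my exact code; makeM is a hypothetical stand-in for my per-keypoint matrix computation):

    #include <opencv2/imgproc.hpp>
    #include <cmath>
    #include <vector>

    // Hypothetical helper: builds the 2x3 patch->image map from a keypoint
    // (with WARP_INVERSE_MAP the matrix maps destination pixels to source pixels).
    static cv::Mat makeM(const cv::KeyPoint& kp)
    {
        const float s = kp.size / 32.f;                   // image pixels per patch pixel
        const float a = kp.angle * (float)CV_PI / 180.f;  // keypoint orientation
        cv::Mat M(2, 3, CV_32F);
        M.at<float>(0, 0) =  s * std::cos(a); M.at<float>(0, 1) = -s * std::sin(a);
        M.at<float>(1, 0) =  s * std::sin(a); M.at<float>(1, 1) =  s * std::cos(a);
        // place the patch center (16,16) on the keypoint location
        M.at<float>(0, 2) = kp.pt.x - 16.f * (M.at<float>(0, 0) + M.at<float>(0, 1));
        M.at<float>(1, 2) = kp.pt.y - 16.f * (M.at<float>(1, 0) + M.at<float>(1, 1));
        return M;
    }

    // Current CPU path: one warpAffine call per keypoint.
    std::vector<cv::Mat> extractPatches(const cv::Mat& img,
                                        const std::vector<cv::KeyPoint>& kps)
    {
        std::vector<cv::Mat> patches(kps.size());
        for (size_t i = 0; i < kps.size(); ++i)
            cv::warpAffine(img, patches[i], makeM(kps[i]), cv::Size(32, 32),
                           cv::WARP_INVERSE_MAP | cv::INTER_LINEAR | cv::WARP_FILL_OUTLIERS,
                           cv::BORDER_REPLICATE);
        return patches;
    }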

Currently, the computation of the M matrices and the warpAffine calls run on the CPU in a loop that I could parallelize with parallel_for_, but dealing with the indices is not easy. I would like to avoid all that I/O and do the hard work on the GPU, since I already run the SIFT extraction there and all the data is stored in GPU memory. Moreover, LibTorch can run on a CUDA device, so doing everything on the GPU would avoid the upload/download calls. I wrote a CUDA kernel that computes the M matrix for every keypoint; I haven't tested it yet, but it is feasible. Then I have to extract the patches: I could use cv::cuda::warpAffine to push the computation to the graphics card and reuse the data already stored there. However, I checked the code and it doesn't seem that cv::cuda::warpAffine expects the M matrix to already be on the GPU; rather, it takes a host matrix and performs its own creation and allocation.
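
This is roughly the per-keypoint GPU path I have in mind (a sketch; note the M matrices still have to be passed as host cv::Mat objects, which is exactly the transfer I want to avoid):

    #include <opencv2/core/cuda.hpp>
    #include <opencv2/cudawarping.hpp>
    #include <vector>

    // Sketch: per-keypoint cv::cuda::warpAffine into one row-stacked GpuMat.
    // Ms are host-side 2x3 CV_32F matrices (the API takes M from the host).
    void extractPatchesGpu(const cv::cuda::GpuMat& img,
                           const std::vector<cv::Mat>& Ms,
                           cv::cuda::GpuMat& patches,      // (32*N) x 32 output
                           cv::cuda::Stream& stream)
    {
        patches.create(32 * (int)Ms.size(), 32, img.type());
        for (int i = 0; i < (int)Ms.size(); ++i)
        {
            cv::cuda::GpuMat patch = patches.rowRange(32 * i, 32 * (i + 1));
            // WARP_FILL_OUTLIERS appears to be ignored on the CUDA path,
            // so only the inverse-map and interpolation flags are passed.
            cv::cuda::warpAffine(img, patch, Ms[i], cv::Size(32, 32),
                                 cv::WARP_INVERSE_MAP | cv::INTER_LINEAR,
                                 cv::BORDER_REPLICATE, cv::Scalar(), stream);
        }
        stream.waitForCompletion();
    }

With thousands of keypoints the per-call launch overhead of these tiny 32x32 warps dominates, which is why I want a single batched kernel instead.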

The question is the following:

Is there a way to call the warpAffine OpenCV CUDA kernel directly, with the parameters I listed, on multiple keypoints at once? Since the final patches are just 32x32, I don't think there would be many problems after all. I've looked at the .cu file and seen that there are two dispatchers, WarpDispatcherStream and WarpDispatcherNonStream, plus the warp kernel itself, but I cannot find where B<work_type>, BorderReader and Filter are declared. I found something in the cudev interface, but I don't know how to put everything together. Calling the proper kernel directly should do the trick for my use case: that way I should be able to put the images in shared memory and work directly on the keypoints, avoiding the function calls and any other overhead.
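
From digging so far, I believe the building blocks live in opencv2/core/cuda/border_interpolate.hpp (BrdReplicate, BorderReader) and opencv2/core/cuda/filters.hpp (LinearFilter). Assuming I read them correctly, a batched kernel could compose them like this (untested sketch in a .cu file compiled by nvcc; warpPatches and the device-side Ms layout are my own invention):

    #include <opencv2/core/cuda_types.hpp>
    #include <opencv2/core/cuda/border_interpolate.hpp>
    #include <opencv2/core/cuda/filters.hpp>

    using namespace cv::cuda;
    using namespace cv::cuda::device;

    // Untested sketch: one block per keypoint, one thread per patch pixel.
    // Ms holds N row-major 2x3 inverse maps already resident in device memory;
    // patches is a row-stacked (32*N) x 32 single-channel float output.
    __global__ void warpPatches(PtrStepSzf src, const float* Ms, PtrStepf patches)
    {
        const int x = threadIdx.x;            // patch column, 0..31
        const int y = threadIdx.y;            // patch row, 0..31
        const float* M = Ms + 6 * blockIdx.x; // this keypoint's 2x3 inverse map

        // Same composition the library's dispatcher builds: border policy,
        // then a bounds-safe reader, then a bilinear filter on top of it.
        BrdReplicate<float> brd(src.rows, src.cols);
        BorderReader< PtrStep<float>, BrdReplicate<float> > reader(src, brd);
        LinearFilter< BorderReader< PtrStep<float>, BrdReplicate<float> > > filter(reader);

        const float xc = M[0] * x + M[1] * y + M[2];  // inverse map: patch -> source
        const float yc = M[3] * x + M[4] * y + M[5];

        patches.ptr(32 * blockIdx.x + y)[x] = filter(yc, xc);
    }

    // Launch: warpPatches<<<numKeypoints, dim3(32, 32)>>>(src, d_Ms, patches);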

Thank you in advance.

Edit: I would like to add that since my borderMode is BORDER_REPLICATE, I cannot use the NPP primitive.

EDIT: I figured out what BorderReader and Filter do and where they are defined. I still don't get why B<work_type> in the dispatcher is constructed with 3 arguments, as in:

    B<work_type> brd(src.rows, src.cols, VecTraits<work_type>::make(borderValue));
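
If I read border_interpolate.hpp correctly, every Brd* type defines a templated three-argument constructor alongside the two-argument one, with the extra value parameter simply ignored by the non-constant borders; that lets the dispatcher construct any B<work_type> with the same three-argument call, while only BrdConstant actually uses the border value. Paraphrasing the header:

    // Paraphrased from opencv2/core/cuda/border_interpolate.hpp:
    template <typename D> struct BrdReplicate
    {
        explicit __host__ __device__ __forceinline__ BrdReplicate(int height, int width)
            : last_row(height - 1), last_col(width - 1) {}

        // Extra constructor so the dispatcher can pass a border value uniformly;
        // the value (U) is ignored for replicate borders.
        template <typename U>
        __host__ __device__ __forceinline__ BrdReplicate(int height, int width, U)
            : last_row(height - 1), last_col(width - 1) {}

        // ... clamping index helpers and at() omitted ...
        int last_row, last_col;
    };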