
Run SURF_CUDA against multiple images in parallel

asked 2019-03-10 17:25:31 -0500

jianweike

updated 2019-03-14 21:44:02 -0500

Hi all,

In my application, I need to compute SURF features for multiple images. I am wondering whether it is possible to run the SURF function on different images in parallel in CUDA. For example, can SURF take multiple images as input, or is there an async option for SURF()?

Thanks all in advance.


2 answers


answered 2019-03-11 10:31:51 -0500

HYPEREGO

updated 2019-03-11 10:54:50 -0500

First of all, I suggest getting a brief introduction to how CUDA works and how its memory is managed, so that you can optimize better (what a CUDA kernel is and what the block size is). You can find some articles even on the website.

Two actions are always required: copying the images into GPU memory (upload) and downloading the images from GPU memory. Keep in mind that CUDA works with kernels and blocks, and you have to set the dimensions and manage the memory as well as you can to achieve maximum performance. So what I can tell you is: yes, it is certainly possible, but how many images you can process at a time depends on your graphics card's specs. Here you can find an example of SIFT/SURF CUDA usage in OpenCV. I don't know how it is implemented, so maybe the best thing is to just try it, see whether it does the job, and then think about optimizing it.

Don't forget that you need OpenCV with CUDA enabled; for this feature you have to compile the package from source.



Thank you for your helpful comment. I am quite sure how to set up the execution configuration when calling the SURF function. Meanwhile, I was also wondering if there are any asynchronous versions of SURF. Any comments are appreciated. Thank you in advance.

jianweike ( 2019-03-11 12:27:17 -0500 )

What do you mean by an "asynchronous version" of SURF? OpenCV already provides both a CUDA version and a CPU version of it. Pay attention to the constructor: the default values for the GPU version may be different ;)

HYPEREGO ( 2019-03-12 09:08:41 -0500 )

Thank you. By "asynchronous", I mean a CUDA implementation of SURF that accepts streams, so that SURF for different images could be computed concurrently. But as pointed out by "cudawarped", the CUDA implementation of SURF doesn't seem to accept streams.

jianweike ( 2019-03-12 21:04:49 -0500 )

As mentioned before, it depends on which graphics card you are running the algorithm on. With some (usually high-end GPUs) you can run in parallel. The problem is that even if applications can run in parallel, kernels in CUDA are serialized. In THIS you can find some simple information for understanding how CUDA works. (I suggest taking a look at post #4.)

You might also think about using both versions (CPU and GPU) at the same time, if you don't have a timing constraint and just want to find and store the keypoints and descriptors of the images somewhere.

If you find this reply useful, please mark as solution.

HYPEREGO ( 2019-03-13 04:54:06 -0500 )
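
A minimal sketch of the "use both versions at the same time" suggestion above, assuming there is no ordering or timing constraint. The two extractor functions are hypothetical stand-ins, not OpenCV calls; in a real program they would wrap the CPU SURF and the SURF_CUDA objects. Both workers pull from one shared queue, so whichever implementation is faster simply ends up taking more images.

```python
import queue
import threading

def run_workers(images, extractors):
    """Feed images from one shared queue to several extractor callables."""
    todo = queue.Queue()
    for item in enumerate(images):
        todo.put(item)
    results = [None] * len(images)

    def worker(extract):
        # Each worker keeps taking images until the queue is empty.
        while True:
            try:
                idx, img = todo.get_nowait()
            except queue.Empty:
                return
            results[idx] = extract(img)

    threads = [threading.Thread(target=worker, args=(fn,)) for fn in extractors]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Hypothetical stand-ins for the CPU and GPU SURF wrappers (assumption:
# each takes one image and returns its keypoints/descriptors).
def surf_cpu(img):
    return ("cpu", len(img))

def surf_gpu(img):
    return ("gpu", len(img))

results = run_workers(["img0", "img1", "img2"], [surf_cpu, surf_gpu])
```

Because the constructor defaults of the two versions may differ, as noted earlier in this thread, it is probably worth setting the parameters explicitly on both so that the stored results are comparable.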

By the way, what you can try is concatenating the images. Suppose you have images of size 800*600: you can create a bigger one (like a container) for, let's say, 4 images, so a 1600*1200 image, and then run the algorithm on it. Extracting the keypoint and descriptor positions will be a little bit tricky (not that much), but maybe this can do the job. To test it, do as follows. Choose 4 images and run SURF_CUDA on each one; store the keypoints/descriptors or, for an easier check, just display the images with the keypoints (there are functions to draw them, and in this forum you can find how to extract the position of features in pixels). Then concatenate the original images as described above, re-run the algorithm on this bigger image, and visualize the result again.

HYPEREGO ( 2019-03-13 06:22:46 -0500 )
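
The bookkeeping for the concatenation idea above could look roughly like this. It is a sketch in pure Python, not OpenCV code: images are plain lists of pixel rows, keypoints are assumed to be (x, y) tuples in mosaic coordinates, and the function names and the `margin` parameter are made up for illustration. The `margin` filter is an assumption on my part: a keypoint near a seam is described using pixels from the neighbouring image, so its descriptor may not match what the image alone would give.

```python
def make_mosaic(images, cols):
    """Tile equally sized images row-major into one larger image."""
    tile_h, tile_w = len(images[0]), len(images[0][0])
    rows = -(-len(images) // cols)  # ceiling division
    mosaic = [[0] * (cols * tile_w) for _ in range(rows * tile_h)]
    for idx, img in enumerate(images):
        oy, ox = (idx // cols) * tile_h, (idx % cols) * tile_w
        for y, row in enumerate(img):
            mosaic[oy + y][ox:ox + tile_w] = row
    return mosaic

def split_keypoints(keypoints, tile_w, tile_h, cols, margin=0):
    """Group mosaic keypoints by source image, in tile-local coordinates.

    Keypoints closer than `margin` pixels to a tile edge are dropped,
    since their support region may overlap a neighbouring image.
    """
    per_image = {}
    for x, y in keypoints:
        col, row = int(x // tile_w), int(y // tile_h)
        lx, ly = x - col * tile_w, y - row * tile_h
        if min(lx, ly, tile_w - lx, tile_h - ly) < margin:
            continue
        per_image.setdefault(row * cols + col, []).append((lx, ly))
    return per_image

# Four 800*600 images in a 2x2 grid give a 1600*1200 mosaic; a keypoint
# at (900, 700) in the mosaic belongs to image 3 at local (100, 100).
found = split_keypoints([(900.0, 700.0)], tile_w=800, tile_h=600, cols=2)
```

Comparing the per-image keypoints from the mosaic with those from the four separate runs, as suggested above, would show how much the seams actually change the result.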

Concatenating images is an interesting idea. I will give it a try and post the result so that everyone knows. What I was essentially asking was whether I can assign different SURF() calls to different GPU streams so that each stream runs SURF on its own image concurrently, but as someone pointed out in this thread, that option does not seem to be possible. Yes, I can always optimize the implementation so that the function actually computes all the images "in parallel". Of course, whether that actually achieves better performance depends on resources such as the number of SMs, shared memory, registers, etc. However, I don't think merely changing the execution configuration would make the computation parallel, as one kernel call (SURF) only takes one image as input. Kernel calls within one stream are executed serially on the GPU.

jianweike ( 2019-03-14 21:33:30 -0500 )

FYI, all computation is done in one application, so no resource would be shared with other applications in my case. Thanks again for your reply.

jianweike ( 2019-03-14 21:36:48 -0500 )

If you try concatenating the images, let us know the result; it should work, btw :)

HYPEREGO ( 2019-03-15 05:18:57 -0500 )

answered 2019-03-12 18:34:43 -0500

If by "in parallel" you mean at the same time, without joining them together in some way, I would have to say no. CUDA is data parallel, not task parallel.

Regarding async, it doesn't look like the CUDA implementation accepts streams, so no.



Thanks for the information. It seems like it. I was not able to find a SURF CUDA implementation that accepts streams. Are you aware of any way of "joining them together" other than concatenating the images?

jianweike ( 2019-03-12 20:59:28 -0500 )

Concatenating is probably your only option, but this will only hide the cost of data transfers and inactive SMs. As CUDA is data parallel, even with streams you will probably not be running more than one image at once (an image will probably require more blocks to process than there are SMs on your GPU). You would just be running more efficiently and hiding latency: SMs that would be inactive during the last part of one image may be used by the next image, but this is usually only possible with simple algorithms that use a single kernel. Additionally, without streams you cannot overlap data transfers with processing, which will probably cost you more than using the CPU implementation. Also, the algorithm will already be optimized for the best number of threads per block; changing it could reduce performance.

cudawarped ( 2019-03-15 12:25:25 -0500 )
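
To put rough numbers on the "more blocks than SMs" point above, here is a back-of-the-envelope sketch. Every figure in it is an assumption chosen for illustration: one thread per pixel, 16*16 thread blocks, a GPU with 20 SMs, and 8 resident blocks per SM. The real SURF_CUDA kernels use their own launch configurations, which I have not checked.

```python
import math

def blocks_needed(width, height, block_w=16, block_h=16):
    """Thread blocks required to cover an image with one thread per pixel."""
    return math.ceil(width / block_w) * math.ceil(height / block_h)

def waves(num_blocks, num_sms, blocks_per_sm=1):
    """How many rounds of resident blocks the GPU needs to run them all."""
    return math.ceil(num_blocks / (num_sms * blocks_per_sm))

# Illustrative values only (assumed GPU: 20 SMs, 8 resident blocks per SM).
n = blocks_needed(800, 600)                      # 50 * 38 = 1900 blocks
print(n, waves(n, num_sms=20, blocks_per_sm=8))  # prints "1900 12"
```

Under these assumptions a single 800*600 image already fills the GPU about twelve times over, so a second image could only make use of the partly idle final wave.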


