First, there is no advantage that I know of to processing multiple frames at the same time over efficiently processing them one after another. To do so you would either have to process very small images on a large GPU, which is generally not an option, or alter the block and grid sizes of each CUDA algorithm that you use. At best you might see a marginal speed-up from this, and you would have to tweak your code to recalculate the ideal block and grid sizes every time you change the image size. It is much better to use streams and let the hardware schedule the operations as efficiently as it can.

Given that, I would say you have two options, and both should solve the problem above:

  1. Use multiple streams in a single thread to overlap host and device computation and memory transfers with host/device computation (see the first sketch after this list). See Accelerating OpenCV with CUDA streams in Python for an example of how this can be done. This will be more complex but may be the most efficient depending on your problem setup. It does have one caveat: some of the CUDA routines have sync points hard-coded into them, because the result from the previous round of GPU computation needs to be processed on the CPU before the next round of GPU computation can take place. If this is the case for the function you are calling, approach 2 may be your only option.
  2. Use a separate stream in each thread and use the exact same processing pipeline you already have (see the second sketch below). This should be the easiest to implement, but may not be suitable if you don't want to, or can't, take advantage of multiple threads.
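
As a rough illustration of approach 1, here is a minimal sketch (not your actual pipeline) using two streams in a single thread so the work queued for one frame can overlap with the next. The video path and the cvtColor step are placeholder assumptions; note that host-to-device transfers only overlap properly when the host memory is page-locked (cv2.cuda_HostMem), which is skipped here for brevity:

    import cv2

    cap = cv2.VideoCapture("video.mp4")          # placeholder source

    streams = [cv2.cuda_Stream(), cv2.cuda_Stream()]
    gpu_src = [cv2.cuda_GpuMat(), cv2.cuda_GpuMat()]
    gpu_dst = [cv2.cuda_GpuMat(), cv2.cuda_GpuMat()]

    i = 0
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        s = streams[i % 2]
        # Queue upload + processing on this stream without blocking the host.
        # (For truly asynchronous uploads the frame should be in pinned memory.)
        gpu_src[i % 2].upload(frame, s)
        cv2.cuda.cvtColor(gpu_src[i % 2], cv2.COLOR_BGR2GRAY,
                          dst=gpu_dst[i % 2], stream=s)
        # Only synchronise when the previous frame's result is actually needed,
        # so the work queued on the two streams can overlap.
        if i > 0:
            prev = (i - 1) % 2
            streams[prev].waitForCompletion()
            result = gpu_dst[prev].download()
            # ... use result ...
        i += 1

    cap.release()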
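And a minimal sketch of approach 2, again only illustrative: each worker thread owns its own stream and runs the same pipeline unchanged. The queue layout, thread count and frame source are placeholder assumptions:

    import queue
    import threading
    import cv2

    frame_queue = queue.Queue(maxsize=8)
    result_queue = queue.Queue()

    def worker():
        stream = cv2.cuda_Stream()          # one stream per thread
        gpu_frame = cv2.cuda_GpuMat()
        while True:
            item = frame_queue.get()
            if item is None:                # sentinel: shut this worker down
                break
            idx, frame = item
            gpu_frame.upload(frame, stream)
            gpu_gray = cv2.cuda.cvtColor(gpu_frame, cv2.COLOR_BGR2GRAY, stream=stream)
            stream.waitForCompletion()
            result_queue.put((idx, gpu_gray.download()))

    threads = [threading.Thread(target=worker) for _ in range(2)]
    for t in threads:
        t.start()

    cap = cv2.VideoCapture("video.mp4")      # placeholder source
    idx = 0
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        frame_queue.put((idx, frame))
        idx += 1

    for _ in threads:
        frame_queue.put(None)
    for t in threads:
        t.join()
    cap.release()

Because each thread keeps to its own stream and its own GpuMats, the threads never share device state, and the hardware is free to interleave the streams' work in whatever order is most efficient.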