
Parallelizing GPU processing of multiple images

asked 2020-07-30 10:32:48 -0600

rgov

updated 2020-07-30 10:34:25 -0600

For each frame of a video, I apply some transformations and then write the frame out to an image file. I am using OpenCV's CUDA API for this, so it looks something like this, in a loop:

# read frame from video
_, frame = cap.read()

# upload frame to GPU
frame = cv2.cuda_GpuMat(frame)

# create a CUDA stream
stream = cv2.cuda_Stream()

# do things to the frame
# ...

# download the frame to CPU memory
frame = frame.download(stream)

# wait for the stream to complete (CPU memory available)
stream.waitForCompletion()

# save frame out to disk
# ...

Since I send a single frame to the GPU and then wait for its completion at the end of the loop, I can only process one frame at a time.

What I would like to do is send multiple frames (in multiple streams) to the GPU to be processed at the same time, then save them to disk as the work gets finished.

What is the best way to do this?



The best way to do this is using OpenGL 4.3's compute shaders, along with C++.

sjhalayka ( 2020-07-30 11:00:33 -0600 )

Can you link to an example please? Where would OpenGL be used, would you still be using CUDA to access the GPU?

cudawarped ( 2020-07-31 03:05:41 -0600 )

Controlling the GPU through OpenGL does not use CUDA. CUDA was designed before compute shaders were part of the OpenGL standard.

Keep in mind that you can only bind so many textures at once. This inherent limitation is platform-agnostic; it happens on both CUDA and OpenGL. For instance, an Intel GPU would only bind 8 textures at a time, whereas it was 64 on an AMD Vega.

For simple compute shader code, see:

sjhalayka ( 2020-07-31 14:49:57 -0600 )

... that said, I believe that you can use CUDA and OpenGL in the same app.

sjhalayka ( 2020-07-31 15:53:23 -0600 )

Thanks for the link. I am still unsure what the advantage of OpenGL would be over CUDA in general, unless you are interacting with a graphics pipeline. My understanding is that it offers a less mature interface to GPU computation than CUDA, with the advantage that it will run on AMD and integrated GPUs. I guess it is also closer to the metal than OpenCL, so if the implementation is good it may be faster on those as well. Since this user already has an NVIDIA GPU, the routines they need have existing OpenCV CUDA implementations, and the functionality they requested is built in, I think writing everything from scratch in OpenGL would be the wrong way to go. Furthermore, without experience of writing OpenGL, I suspect their implementations would be slower than the existing CUDA ones.

cudawarped ( 2020-08-01 05:33:45 -0600 )

1 answer


answered 2020-07-31 03:01:45 -0600

updated 2020-08-01 05:37:42 -0600

First, there is no advantage that I know of to processing multiple frames at the same time over efficiently processing them one after the other. To do so you would either have to process very small images on a large GPU, which is generally not an option, or alter the block and grid sizes of each CUDA algorithm that you use. At best you might see a marginal speed-up from this, but you would have to re-tune your code, recalculating the ideal block and grid sizes, every time you change the image size. It is much better to use streams and let the hardware schedule the operations as efficiently as it can.

Given that, I would say you have two options:

  1. Use multiple streams in a single thread to overlap host and device computation and memory transfers with host/device computation. See Accelerating OpenCV with CUDA streams in Python for an example of how this can be done. This will be more complex, but it may be the most efficient, depending on your problem setup. It does have one caveat: some of the CUDA routines have sync points hard-coded into them, because the result of the previous round of GPU computation needs to be processed on the CPU before the next round of GPU computation can take place. If this is the case for the functions you are calling, approach 2 may be your only option.
  2. Use a separate stream in each thread with the exact same processing pipeline you already have. This should be the easiest to implement, but it may not be suitable if you don't want to, or can't, take advantage of multiple threads.
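Option 2 can be sketched with Python's `concurrent.futures`. This is only an illustration of the threading pattern: the placeholder transform stands in for the GPU pipeline, since the exact operations depend on your setup. In the real code, each worker would create its own `cv2.cuda_Stream()`, upload the frame, run the transforms, and download the result before returning it.

```python
from concurrent.futures import ThreadPoolExecutor

def process_frame(item):
    idx, frame = item
    # In the real pipeline this worker would own a cv2.cuda_Stream(),
    # upload `frame`, run the CUDA transforms, then download and return
    # the result. A placeholder CPU transform stands in for that here.
    return idx, frame * 2

# Stand-in for decoded video frames
frames = enumerate([10, 20, 30, 40])

with ThreadPoolExecutor(max_workers=4) as pool:
    # map() preserves input order, so each result keeps its frame index
    results = list(pool.map(process_frame, frames))

# Frames can now be written to disk in order as the workers finish
print(results)  # [(0, 20), (1, 40), (2, 60), (3, 80)]
```

Keeping the frame index alongside each result is what lets you write the output files in the original frame order even though the workers complete in an arbitrary order.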




Seen: 2,885 times

Last updated: Aug 01 '20