
Optimized async GPU streams usage in Video post-treatment

asked 2013-07-01 06:36:53 -0600

neoirto


I'd like to know if the following CUDA pseudocode is feasible:

  • 1 dispatcher CPU thread, that will:

    - initialize the CUDA streams, say 12 different streams for example, where each stream may run the same GPU code;

    - manage I/O frames on this CPU thread with VideoCapture/VideoWriter;

    - for each frame:

      - feed 1 free stream of the 12 CUDA streams asynchronously, to get the best usage of the total transfer bandwidth (is async DMA transfer possible on every graphics card, or only >= compute capability x.x?), with an optimized struct for each data transfer;

      - release the CPU until a callback arrives from the GPU: is that possible?

      - receive the resulting data asynchronously from any of the 12 GPU streams: so wake up the CPU thread, which will handle the VideoWriter and send a new frame to that free GPU stream... etc.?

What low cost Nvidia card would you advise for the best results ?

I understood the current GPU class is "compute capability 1.3", but will it be 2.0 or higher in the near future?

Tx for your answers ;)


1 answer


answered 2013-07-01 23:04:56 -0600

Yes, it is possible. Most likely you will want to parameterize the stream count (rather than hard-coding 12) to make efficient use of block transfers, based on GPU specs including available memory, video format and frame dimensions, frame rate, and GPU algorithm execution performance.

Here is nVidia's somewhat dated whitepaper link from their video decode/encode kit, which may outperform VideoCapture/VideoWriter:

More important is their Codec SDK:

My assumption is that OpenCV/CUDA will be used within the frame-based algorithm, so I'll leave that to you... but the nVidia Codec SDK will get you in and out of the GPU without CPU-based video writes.

Also, here is a link to IntuVision's publication and commercial products that scale to make use of GPU resources for video analytics:

--Gotta give credit where credit is due, and IntuVision does it well!

Hope this helps, JasonInVegas



Thanks for your answer JasonInVegas !

So what I understood is:

  1. Video decoding:

    • NVCUVID (i.e. the NVIDIA Video Codec SDK) can be used to decode video files, but it depends on video format and hardware compatibility;
    • the other solution is to use ffmpeg from OpenCV, which works great, but is a bit slow on the CPU of course...

  2. OpenCV + CUDA treatment:

neoirto ( 2013-07-02 05:55:23 -0600 )

What function could be used to determine whether the hardware is compatible with multiple streams, and with how many (n)?

  • Do you confirm the CPU can be released after each call to a new stream?

    • Is that related to the use of stream.enqueueHostCallback( receive_data_from_GPU, Data_to_GPU )?
    • And after that call, is the CPU woken by the receive_data_from_GPU( cv::gpu::Stream&, int status, void* Data_from_GPU ) function? The difficulty for me at this stage is that I own an ATI card only: should I choose a >= 2.0 compute capability card, or is 1.3 enough?

  • Another point: is it possible to declare a memory space on the GPU that would be directly accessible by any of the n streams for a "second level" of the algorithm?

neoirto ( 2013-07-02 05:56:45 -0600 )


  • the NVIDIA VIDEO CODEC SDK offers several solutions to encode on the GPU, compatible with Windows and Linux, but not Mac at the moment?

  • the other solution is to use ffmpeg from OpenCV, which works great, but is a bit slow on the CPU of course... But if all my treatment is on the GPU, it could be acceptable; I'll have to test that.

Tx again !

neoirto ( 2013-07-02 05:57:10 -0600 )

Regarding item #2: my opinion is "yes". Based on the comments in the referenced OpenCV forum post, it appears compute level 2.0 is required in order to achieve the stream-based parallelism.

So, which nVidia cards support compute level 2.0? (Or restated: what is the least expensive Fermi architecture board?) Summary page:

Everything on this page will support at least compute level 2.0 (most are 2.1 and above)

These range from $150 USD to way too much... a Quadro K600 is a $175 USD board that supports compute level 3.0.

JasonInVegas ( 2013-07-02 22:11:03 -0600 )

Now, about parameterizing the stream buffers based on video specs.

I'm not exactly confident which GPU memory structure(s) to use, but the GPU memory allocation is based on video frame width x height x color depth, with each frame packed into an image struct much like a pixel frame buffer object (FBO) in shaders. If you extract the frames on the CPU, you pack them, then pass them into the allocated GPU FBO-like memory objects. Ideally the frame extraction time and the GPU execution time balance, so the CPU neither bottlenecks nor rests. The allocated GPU memory must hold both the input frames and the output frame results.

My original comment was: "Why allocate 12 streams, if 6 frames 24-bit color uncompressed, HD video packed into FBOs on the GPU will consume the GPU memory?"

JasonInVegas ( 2013-07-02 22:20:09 -0600 )

I believe there's a common practice of recycling the GPU algorithm's input memory objects by filling them with the output results, so you don't have to allocate double the memory... this won't work if the GPU is performing the video frame decode, naturally, because it needs that memory for the next frame.

Last item: I believe the CUDA performance tools include a call (even example code) to test for GPU compute capability levels, so that check is out-of-the-box functionality on the CUDA side.

Whew...hope this helps (and doesn't confuse things more!)

Good luck!

P.S. please reply with the use-case you're constructing!

JasonInVegas ( 2013-07-02 22:24:51 -0600 )

Seen: 1,651 times

Last updated: Jul 01 '13