Optimized use of async GPU streams in video post-processing


I'd like to know whether the following CUDA pseudocode is feasible:

  • 1 dispatcher CPU thread that will:

    -- initialize the CUDA streams, say 12 different streams for example; each stream may run the same GPU code

    -- handle frame I/O on this CPU thread with VideoCapture/VideoWriter,

    -- for each frame:

    --- feed one free stream out of the 12 CUDA streams asynchronously, to make the best use of the total transfer bandwidth (is async DMA transfer possible on every graphics card, or only from compute capability x.x upward?), with an optimized struct for each data transfer
    --- release the CPU until the GPU triggers a callback: is that possible?

    --- receive the resulting data asynchronously from any of the 12 GPU streams: this wakes up the CPU thread, which hands the frame to the VideoWriter and sends a new frame to the stream that just became free, and so on?
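
To make the idea concrete, here is a rough sketch of what I have in mind. It is only an assumption of how this could look, not tested on real video: the `invert` kernel, the `Slot` struct and `process_frames` are made-up names, `memset` stands in for VideoCapture, a simple check stands in for VideoWriter, and I use 4 streams instead of 12. Instead of a true GPU callback, the CPU thread sleeps on an event created with `cudaEventBlockingSync`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

#define CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(e_)); \
    exit(1); } } while (0)

// Placeholder for the real post-processing kernel: inverts every byte.
__global__ void invert(unsigned char* px, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) px[i] = 255 - px[i];
}

// One slot per stream (hypothetical helper struct).
struct Slot {
    cudaStream_t stream;
    cudaEvent_t done;      // recorded after the device-to-host copy
    unsigned char* host;   // page-locked buffer, required for async copies
    unsigned char* dev;
    int frame;             // index of the frame in flight, -1 if idle
};

// Returns how many frames came back with the expected content.
int process_frames(int num_frames, int num_streams = 4,
                   int frame_bytes = 640 * 480 * 3) {
    std::vector<Slot> slots(num_streams);
    for (auto& s : slots) {
        CHECK(cudaStreamCreate(&s.stream));
        // Blocking sync: the CPU thread sleeps instead of spinning.
        CHECK(cudaEventCreateWithFlags(&s.done,
              cudaEventBlockingSync | cudaEventDisableTiming));
        CHECK(cudaHostAlloc((void**)&s.host, frame_bytes, cudaHostAllocDefault));
        CHECK(cudaMalloc((void**)&s.dev, frame_bytes));
        s.frame = -1;
    }

    int ok = 0;
    // Wait for a slot's result and "write" it (stand-in for VideoWriter).
    auto collect = [&](Slot& s) {
        if (s.frame < 0) return;
        CHECK(cudaEventSynchronize(s.done));
        unsigned char expected = 255 - (unsigned char)(s.frame & 0xFF);
        if (s.host[0] == expected && s.host[frame_bytes - 1] == expected) ++ok;
        s.frame = -1;
    };

    const int threads = 256;
    const int blocks = (frame_bytes + threads - 1) / threads;
    for (int f = 0; f < num_frames; ++f) {
        Slot& s = slots[f % num_streams];       // round-robin keeps frame order
        collect(s);                             // drain the slot before reuse
        memset(s.host, f & 0xFF, frame_bytes);  // stand-in for VideoCapture
        CHECK(cudaMemcpyAsync(s.dev, s.host, frame_bytes,
                              cudaMemcpyHostToDevice, s.stream));
        invert<<<blocks, threads, 0, s.stream>>>(s.dev, frame_bytes);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(s.host, s.dev, frame_bytes,
                              cudaMemcpyDeviceToHost, s.stream));
        CHECK(cudaEventRecord(s.done, s.stream));
        s.frame = f;
    }

    for (auto& s : slots) {
        collect(s);
        CHECK(cudaFree(s.dev));
        CHECK(cudaFreeHost(s.host));
        CHECK(cudaEventDestroy(s.done));
        CHECK(cudaStreamDestroy(s.stream));
    }
    return ok;
}

int main() {
    int ok = process_frames(16);
    printf("%d frames processed correctly\n", ok);
    return 0;
}
```

I picked round-robin over "any free stream" here because the VideoWriter needs the frames back in their original order; whether this really overlaps transfers and kernels on a given card is exactly what I am unsure about.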

What low-cost Nvidia card would you advise for the best results?

I understand the current GPU class is "compute capability 1.3", but will it be 2.0 or higher in the near future?
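
For comparing cards, I assume a small query like the one below would report the compute capability and the number of async copy engines of each installed device. The function name `report_devices` is mine; `asyncEngineCount` is a field of `cudaDeviceProp` in the CUDA runtime, but I have not checked how far back it is available.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints one line per CUDA device and returns the number of devices found.
int report_devices() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return 0;
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp p;
        if (cudaGetDeviceProperties(&p, d) != cudaSuccess) continue;
        // asyncEngineCount > 0 means copies can overlap kernel execution.
        printf("%s: compute capability %d.%d, async copy engines: %d\n",
               p.name, p.major, p.minor, p.asyncEngineCount);
    }
    return count;
}

int main() {
    if (report_devices() == 0) printf("No CUDA device found\n");
    return 0;
}
```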

Thanks for your answers ;)