MASSIVELY slow transfer between GPU and host memory

asked 2018-07-17 13:45:49 -0600

I'm doing some simple benchmarking to compare the cost of transferring data from host to GPU and back. Here's a paraphrase of the snippet that's acting up:

    Mat lImage( 720, 1280, CV_8UC3, Scalar( 100, 250, 30 ) );
    UMat lUImage;
    UMat lUDestImage; /* Destination of the colour conversion */
    lUImage = lImage.getUMat( ACCESS_READ ); /* This is fast */
    // lImage.copyTo( lUImage ); /* This is SLOW */
    cvtColor( lUImage, lUDestImage, COLOR_BGR2YCrCb );
    lNumGpuCopyConverts++; /* Benchmark counter of completed conversions */
    // lImage = lUImage.getMat( ACCESS_READ ); /* This is SLOW */
    // lUImage.copyTo( lImage ); /* This is SLOW */

When I say massively slow, I mean it literally takes over a minute to do the copy from GPU to CPU. This is an AMD FirePro card. For perspective, with a 720p image, cvtColor runs 170k-180k times per second. With just the one-way copy, it drops to 22k conversions per second. If I also copy back to the CPU, I don't even get one conversion per second.

I tested this on a variety of other machines, laptops, etc., and there the two-way copy seems slow-ish but not terrible, so I'm assuming there is either something weird about this card or some misconfiguration on my machine. Does anyone have any ideas about what I could check?
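One thing that could be checked is how long each stage takes on its own, rather than only the combined rate. The following is a minimal sketch of a timing helper, not part of the original benchmark. The name `averageMillis` is hypothetical, and the helper uses only the C++ standard library. The commented usage assumes an OpenCV build with the OpenCL backend, where calls on a `UMat` may be queued asynchronously, so `cv::ocl::finish()` is called to wait for queued work before the clock stops.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Runs `stage` `iterations` times and returns the average wall-clock time
// per call in milliseconds. `averageMillis` is a hypothetical helper name.
double averageMillis(int iterations, const std::function<void()>& stage) {
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        stage();
    }
    const std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / iterations;
}

// Possible usage with the snippet above (an assumption, not compiled here;
// requires <opencv2/core/ocl.hpp> for cv::ocl::finish()):
//   double up   = averageMillis(100, [&] { lImage.copyTo(lUImage); cv::ocl::finish(); });
//   double conv = averageMillis(100, [&] { cvtColor(lUImage, lUDestImage, COLOR_BGR2YCrCb);
//                                          cv::ocl::finish(); });
//   double down = averageMillis(100, [&] { lUDestImage.copyTo(lImage); });
```

If the conversion alone turns out to be far slower once it is forced to finish, the 170k-180k figure may partly reflect work that was queued but not yet completed.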

cheers,

Chris


Comments

I made the same observations in OpenGL using an RX550 vs. an Intel iGPU. The texture upload on the RX550 is SLOWER than on the iGPU.

Terreko ( 2018-07-17 23:33:39 -0600 )

Interesting. You never found any solution or way to make it better? Does anyone have recommended GPUs that have better performance in this area?

Chris Warkentin ( 2018-07-20 07:29:47 -0600 )

OpenGL has a feature called PBO (pixel buffer object) that lets you upload textures asynchronously; see http://www.songho.ca/opengl/gl_pbo.html. I wasn't able to test it since the framework I am using does not support PBOs. I have no idea how to get it done in OpenCV, since the only built-in option I know of for asynchronous upload/download is cv::cuda::Stream (cv::gpu::Stream in OpenCV 2.4), and that one is CUDA-based. I would try to pipeline your code and upload bigger chunks.

Terreko ( 2018-07-20 09:49:37 -0600 )

Thanks. I'll look into it. This is my first foray into GPU processing so I'm still just feeling out the landscape.

Chris Warkentin ( 2018-07-20 10:37:07 -0600 )

So this is interesting. Running the demo code at http://www.songho.ca/opengl/gl_pbo.html with the different PBO modes, I seem to get acceptable data transfer rates in all cases. That is, even without using PBO it still runs quite well.

Is it possible there is some kind of bug in the OpenCV library that is causing transfer between a UMat and a Mat to be very slow?

Chris Warkentin ( 2018-07-20 14:12:13 -0600 )

I'm sorry, but that is beyond my knowledge. All I can say is that you want to pipeline your code so that upload, download, and processing are done asynchronously. Additionally, maybe not a real solution but an easy workaround for you: if texture upload in OpenGL works fine, there is the function cv::ogl::mapGLBuffer, which lets you wrap OpenGL buffer memory in a UMat. I believe this call could have near-zero overhead.
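The pipelining idea can be sketched with the C++ standard library alone. This is a minimal illustration under stated assumptions, not OpenCV code: `Frame`, `Stage`, and `runPipelined` are hypothetical names, and the three stages are plain callables standing in for the host-to-GPU copy, the conversion, and the GPU-to-host copy. While frame i is processed, frame i+1 is uploaded and frame i-1 is downloaded on background threads.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <utility>
#include <vector>

// Stand-in for an image buffer; a real program would use Mat/UMat here.
using Frame = std::vector<unsigned char>;
// Hypothetical stage type: takes a frame and returns the transformed frame.
using Stage = std::function<Frame(const Frame&)>;

// Runs upload -> process -> download over all inputs, overlapping the upload
// of frame i+1 and the download of frame i-1 with the processing of frame i.
// Results are returned in input order.
std::vector<Frame> runPipelined(const std::vector<Frame>& inputs,
                                const Stage& upload,
                                const Stage& process,
                                const Stage& download) {
    assert(upload && process && download);  // All stages must be callable.
    std::vector<Frame> outputs;
    outputs.reserve(inputs.size());
    if (inputs.empty()) {
        return outputs;
    }

    // Start uploading the first frame in the background.
    std::future<Frame> uploading = std::async(
        std::launch::async, [&upload, &inputs] { return upload(inputs[0]); });
    std::future<Frame> downloading;  // Not valid until a frame has been processed.

    for (std::size_t i = 0; i < inputs.size(); ++i) {
        Frame onDevice = uploading.get();  // Wait for frame i to arrive.
        if (i + 1 < inputs.size()) {
            // Upload frame i+1 while frame i is being processed.
            uploading = std::async(std::launch::async, [&upload, &inputs, i] {
                return upload(inputs[i + 1]);
            });
        }

        Frame processed = process(onDevice);

        if (downloading.valid()) {
            outputs.push_back(downloading.get());  // Collect frame i-1.
        }
        // Download frame i in the background; the lambda owns its copy.
        downloading = std::async(
            std::launch::async,
            [&download, frame = std::move(processed)] { return download(frame); });
    }
    outputs.push_back(downloading.get());  // Collect the last frame.
    return outputs;
}
```

Each stage runs at most once at a time, but different stages run concurrently, so the callables must be safe to use that way. Whether real GPU transfers actually overlap with computation depends on the driver and hardware, which this sketch does not model. Some toolchains need thread support enabled at link time (for example `-pthread`).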

Terreko ( 2018-07-21 02:21:39 -0600 )

However, for a real solution you want to take a look at https://www.cs.cmu.edu/afs/cs/academi... From what I know, mapping is faster than read/write, but you can try both. There should be a way to use native OpenCL calls to manage your transfer and wrap the OpenCL object inside a UMat, but I have no idea how.

Terreko ( 2018-07-21 02:27:21 -0600 )