GPU Cuda initialization much slower with opencv libraries

asked 2014-08-25 17:26:42 -0600

Dagmor

updated 2014-08-26 10:18:31 -0600

Hello all,

Prereqs for posting, my environment: Linux x86_64, OpenCV 2.4.6.1, CUDA 5.0, Tesla Kepler K20c GPU

I've got a simple C++ application to benchmark cuda performance. It makes and times the following calls once each in order:

cudaSetDevice(0);
cudaMalloc(&someMemory, sizeof(float)*1024*1024);
cudaFree(someMemory);
cudaDeviceReset();

With just the CUDA libraries linked, each call takes tens of milliseconds, except for the malloc, which takes about 0.25 seconds. Fine...no biggie, it's all part of GPU startup costs.

Here's the weird part: if I add libopencv_gpu.so and libopencv_core.so to the linker list (-lopencv_gpu -lopencv_core), without changing any code whatsoever, those timings go through the roof. The cudaSetDevice call takes ~2.5 seconds, and the malloc takes ~5 seconds. Calls after that seem to be just as fast as before, but a ~7.5 second startup cost is ridiculous considering it's only ~0.5 seconds without the OpenCV libraries.

Another oddity: taking out libopencv_gpu and leaving just the core library still has an effect. The set device call still takes ~2.5 seconds, and the malloc takes ~0.7 seconds. What gives?

This affects more than my benchmark app, and it is repeatable. Does anyone have any insight on how opencv is destroying my startup performance? I tried setting CUDA_DEVCODE_CACHE to /tmp/devcode, thinking it was PTX compilations, but nothing was made in the directory - am I using it wrong?

Any help would be great. Thanks!

Comments

As far as I have always understood it, OpenCV front-loads its GPU setup: you pay a higher startup cost for each library you include and link, and in return there is no further overhead later in the program. Put simply, you know the first run will take around 10 seconds, but after that it can process functions at amazing speeds.

StevenPuttemans ( 2014-08-26 07:44:21 -0600 )

Thanks for the reply Steven. Unfortunately, I don't have the luxury of that startup lag being acceptable. According to the opencv documentation, it could be doing the JIT PTX compilation, and that CUDA_DEVCODE_CACHE should be used to cache the PTX code for future use, but that feature does not seem to be working. Has anyone ever even tried this? Google fails me (or maybe I fail Google).

Dagmor ( 2014-08-26 09:52:40 -0600 )

No idea about that...

StevenPuttemans ( 2014-08-26 11:39:29 -0600 )

Another thought, the documentation makes me believe that the cuda code is precompiled by default for compute capabilities 1.1 and 1.3, and that perhaps if I add CUDA_ARCH_BIN=3.5 to the CMake defines it'll precompile the cuda kernels for my K20c. I'm trying it out and will report back if it helps.

Dagmor ( 2014-08-26 12:05:35 -0600 )

Reporting back, setting CUDA_ARCH_BIN dropped about 2 seconds off the initialization time, but I'm still looking at around a ~6 second startup lag. If opencv isn't compiling PTX now, what on earth is it doing?

Dagmor ( 2014-08-27 12:37:11 -0600 )

Another update - opencv 2.4.9 is slightly slower by about 0.1 seconds, so that didn't help.

Dagmor ( 2014-08-28 09:14:24 -0600 )

Another update - compiling with static libs shaved off another 3 seconds! I'm getting there...

Dagmor ( 2014-08-28 12:48:58 -0600 )

answered 2014-08-28 16:17:38 -0600

Dagmor

Problem solved!

For anyone wanting to know how to speed up opencv initialization, here ya go:

1. Compile binary CUDA kernels for your compute capability (mine was 3.5) (~20% of the extra time)
2. Compile the library statically instead of dynamically (~40% of the extra time)
3. Remove all GUI dependencies (~40% of the extra time)

These three changes took my start time from ~7.5 seconds down to ~0.7 seconds (almost the same as it is without OpenCV at all). Here are the CMake flags I changed to do the above:

CUDA_ARCH_BIN=3.5
CUDA_ARCH_PTX=
BUILD_SHARED_LIBS=off
CMAKE_CXX_FLAGS=-fPIC
WITH_QT=off
WITH_VTK=off
WITH_GTK=off
WITH_OPENGL=off

Hope this helps someone out in the future - there certainly is little information out there about this.

Comments

Pretty nice! Do accept your own answer as a solution :) And indeed, removing visualisation and GUI interfaces removes a lot of initialisation work!

StevenPuttemans ( 2014-08-29 02:13:14 -0600 )

Thanks Steve, it says I need >50 points to mark my own answer as a solution, so I'll have to come back later and do it. Yeah the GUI stuff was the final straw that gave back all the performance I needed, woohoo!

Dagmor ( 2014-08-29 08:54:16 -0600 )

Thanks to your efforts you can accept it now!

StevenPuttemans ( 2014-08-29 09:50:12 -0600 )
