
Why OpenCV building is so slow with CUDA?

asked 2012-12-12 02:49:46 -0500

rics

When I completely rebuild OpenCV with the CUDA option enabled, it takes hours. Some compilation units take long minutes each to compile.

My configuration:

  • Core i7 laptop
  • Nvidia Geforce 630M
  • Windows 7 32 bit
  • OpenCV 2.4.3
  • CUDA 5.0
  • MS Visual C++ 2010 Express

I had expected it to be much faster given my fast CPU. During compilation only 1 out of 8 CPU cores is working; the others are almost idle.

What is going on inside the CUDA compilation?

When I tried to stop the build, it kept going for several minutes before it really stopped. Compiling a hello-world program with nvcc took a normal amount of time. Similarly, compiling the Nvidia GPU Computing Toolkit's activity trace example with MinGW took only a few seconds.

So is it normal that GPU compilation of OpenCV takes so long?


4 answers


answered 2012-12-12 04:57:55 -0500

Vladislav Vinogradov

updated 2012-12-12 05:01:06 -0500

The reasons are the following:

  • Slow compiler
  • Necessity to compile the same code many times for all GPU architectures
  • A lot of templates instantiations in the module to support all possible types, flags, border extrapolation modes, interpolations, kernel sizes, etc.

Compiling for only one architecture is about 6x faster. If you don't need to compile for all architectures, clear CUDA_ARCH_PTX in CMake, and set CUDA_ARCH_BIN accordingly ("3.0" for Kepler, "2.0" for Fermi, etc.). You can find your GPU's compute capability on NVIDIA's CUDA GPUs page.
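As a concrete sketch of the suggestion above (paths and the "3.0" value are placeholders, not taken from the original answer; substitute your own source directory and your GPU's compute capability):

```shell
# Configure OpenCV to build CUDA code for a single GPU architecture only.
# Run from an empty build directory; "../opencv" is an example source path.
cmake \
  -DWITH_CUDA=ON \
  -DCUDA_ARCH_BIN="3.0" \
  -DCUDA_ARCH_PTX="" \
  ../opencv
```

Clearing CUDA_ARCH_PTX skips the extra PTX (forward-compatibility) pass, and restricting CUDA_ARCH_BIN avoids recompiling every kernel once per architecture.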




OMG why is this information not available anywhere else?

awesomenesspanda (2016-08-24 05:26:52 -0500)

Running on a 2.3 GHz Xeon, and the CUDA compiles are killing me. This really saves my employer tons of billable hours. :)

elchan (2016-09-22 09:58:20 -0500)

How do I clear CUDA_ARCH_PTX and set CUDA_ARCH_BIN? Do I just pass -DCUDA_ARCH_PTX='' and -DCUDA_ARCH_BIN=30?

Lawb (2018-02-11 03:14:23 -0500)

For me, compiling the CUDA examples/tests took the longest time.

holger (2020-08-12 04:31:51 -0500)

answered 2012-12-14 13:45:37 -0500

solvingPuzzles

I get pretty long compilation times for OpenCV even on an Intel Sandy Bridge server with 64 GB of RAM. My speculation is that it's a combination of:

  • Lots of C++ templates (see this thread for some insight into why C++ templates can be slow to build).
  • Lots of CUDA kernels -- remember, the kernels themselves are not polymorphic (unless you use some fairly advanced tricks), so OpenCV routines often need one kernel per data type (CV_8UC1, CV_32FC3, etc.). This adds up quickly.
  • As Vladislav said, building for several architectures (Compute 1.0, 1.1, 1.2, 2.0, etc.) increases build time, but you can avoid this by selecting just your architecture in the CUDA_ARCH_BIN flag.
  • I think there is also some code generation going on at compile time. I don't remember the details, but I recall seeing a number of printouts about code generation during the OpenCV GPU compilation.

You may have already tried this, but building in multithreaded mode (e.g. the flag -j8 for 8 threads, -j16 for 16 threads; pick your favorite number) can help. I've noticed that builds sometimes fail in multithreaded mode, but that may just be coincidence. Anyway, it's worth a try.
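As a concrete example of the multithreaded-build suggestion (a sketch assuming a Makefile-based build; the thread count is illustrative, not from the original answer):

```shell
# Parallel builds with a Makefile generator: pass -j to make directly.
make -j8

# Or generator-agnostic (CMake 3.12+), which also covers Visual Studio
# and Ninja builds from the same command line:
cmake --build . --parallel 8
```

Roughly matching the thread count to your physical core count is a reasonable starting point.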


answered 2012-12-14 07:02:25 -0500

ubehagelig

updated 2012-12-14 07:03:53 -0500

A couple of somewhat related comments:

I believe TBB (Threading Building Blocks) would utilise all cores, but maybe that applies at run time rather than at compile time. I haven't fiddled with TBB myself yet.

Another thing: make sure the charger is plugged in. My i7 laptop runs at 1.0 GHz on battery, but at the full 2.4 GHz on the charger.



The charger was plugged in so that could not be the cause.

rics (2012-12-21 07:03:04 -0500)

answered 2020-08-11 10:37:43 -0500

elliotwoods

updated 2020-08-11 10:43:20 -0500

I'm also noticing this (several hours to build with CUDA).

I understand that it's building many versions of each kernel for different template types and different compute targets.

However, what is confusing is that this part of the build is not multi-threaded, unlike the rest of the build. I presume either that this is intentional because NVCC doesn't like being run in parallel (it seems to actually do something with the GPU device whilst building), or that the CMake system doesn't generate the correct project settings to run the NVCC invocations in parallel (e.g. the CUDA step is driven by a long batch script that simply executes commands one after another).

Testing here with VS2019, CMake, OpenCV 4.4.0.

Edit: And it seems I was somewhat correct. The CUDA files are all built using <CustomBuild> tags in the opencv_world project, which Visual Studio does not run in parallel (since this feature is often used for actions that must run in a fixed sequence). If you're not building as opencv_world, you might be able to get some parallel compilation here and save some time, since projects can be compiled in parallel with one another and I presume the build events don't lock out build events from other projects being compiled.
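One workaround consistent with this observation (a sketch under the assumption that the bottleneck is Visual Studio's serialized <CustomBuild> steps, not NVCC itself): generate a Ninja build instead of Visual Studio project files, since Ninja schedules each nvcc invocation like any other build rule and runs them in parallel.

```shell
# From a VS "Native Tools" command prompt, in an empty build directory;
# "../opencv" is an example source path.
cmake -G Ninja -DWITH_CUDA=ON ../opencv
ninja
```

This matches the setup cudawarped reports in the comment below this answer.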



There are a few periods where the CPU usage goes down; however, I don't think this is a major cause for concern. I just compiled OpenCV 4.4.0 with all CUDA modules, including the DNN backend, using CUDA 11.0, Ninja and VS2019 for a single binary arch, and it took only 25m16s on my laptop (admittedly it does have a desktop i7-8700). I then removed all the CUDA modules and it took 15m3s. I don't think 10m for the CUDA modules is excessive, and it's definitely a lot less than several hours.

cudawarped (2020-08-12 12:59:53 -0500)

Thanks cudawarped. Just to check: did you compile for only one CUDA architecture, or for all of them? Compiling here (admittedly on a 16-core Ryzen), the compile time for the CUDA part (all CUDA architectures) was significantly longer than the rest of the build.

elliotwoods (2020-12-12 03:43:38 -0500)




Asked: 2012-12-12 02:49:46 -0500

Seen: 17,292 times

Last updated: Aug 11 '20