
TBB parallel_for vs std::thread

asked 2014-08-20 05:47:43 -0600 by stfn

updated 2016-05-31 04:49:11 -0600


I'm starting with parallel processing in OpenCV and wonder why I should use parallel_for (from TBB) instead of just using multiple std::threads. As I understand the parallel_for functionality, you have to create a class extending cv::ParallelLoopBody with a method signature of void operator()(const cv::Range& range), which is where the processing happens. But you cannot pass any arguments to this function, nor can you parametrize your parallel function in any other way. All you have is the range of your thread and the arguments you passed to the cv::ParallelLoopBody instance, which are the same for each thread. So you have to sort out your arguments via that range, e.g. passing a vector of images to the cv::ParallelLoopBody instance and then using the range to extract the one you need. You have to do so for every single parameter that is thread-dependent.

So what's the benefit compared to plain threads? With boost or C++11 I can bind any arbitrary function with (almost) arbitrary parameters, without creating a new class for each task to be parallelized. For this purpose I wrote a very primitive thread pool manager (.hpp, .cpp). Anything wrong with that?

cheers, stfn

P.S. I'm not a threading expert. I know there are memory access concerns when the functions I'm threading use the same memory for writing. Reading is not the problem, but when two functions write simultaneously, e.g. to the same Mat, what happens besides the data probably being corrupted by race conditions? Is caching triggered, forcing the data to be up to date before writing? More generally: what do I need to take care of in terms of performance and data safety? Are those pitfalls already taken care of in TBB, and is that why it is used in OpenCV?

EDIT: I ended up using tbb::task_group for parallelization and load balancing. Works like a charm.




you can pass additional data e.g. to the constructor of the class.

berak ( 2014-08-20 06:31:19 -0600 )

yes. I mentioned that. I also mentioned that this data will be the same for each thread, which is the problem :)

stfn ( 2014-08-20 09:06:20 -0600 )

oh, i see. sorry, i misread it then.

berak ( 2014-08-20 09:29:02 -0600 )

I don't know much about std::thread, but parallel_for_ (<- please use this function, not parallel_for) is meant as a wrapper and supports TBB / OpenMP and some more backends, so it gives more flexibility regarding the threading library underneath. However, you are right, imho it should also support C++11 threads; maybe it will in the future.

Guanta ( 2014-08-20 10:20:10 -0600 )

Note that you can use tbb::parallel_for, which is much simpler than the OpenCV implementation, as it doesn't require any wrapper class. You can use it to parallelize simple image filters line by line.

See my example code here:

kbarni ( 2016-05-31 07:49:25 -0600 )

1 answer


answered 2014-08-28 21:58:04 -0600 by rwong

updated 2014-08-28 22:10:52 -0600

Your code captures some, but not all, of the important elements involved in a task parallelism framework.

A proper framework is more aptly called a "parallel task queue execution system", rather than the older concept known as a "thread pool".

Some things to check:

  1. Use thread-safe data structures everywhere inside the framework.
  2. Accept new tasks while the framework is running (don't require that all tasks be added during initialization).
  3. Reuse threads instead of killing them (relevant on platforms where thread creation/destruction is expensive).
  4. Avoid activating more threads than there are processors (physical or virtual); more active threads than CPUs means the CPUs have to switch between tasks, which adds overhead.
  5. For multi-socket CPU systems only: avoid migrating tasks from one socket to another, unless one takes care of a number of issues (details omitted).
  6. Provide a high-performance multi-threaded malloc. On some platforms the library-provided malloc has a critical section that becomes a bottleneck under a heavily multi-threaded workload with concurrent memory allocations and releases.
  7. When a worker finishes, pop and execute the next task, if the queue is non-empty, without putting the thread to sleep (relevant on platforms where thread sleep/wake is inefficient).
  8. Wake threads efficiently when new data comes in (on Windows, this is done with the "I/O completion port" feature).
  9. Hand off efficiently between two threads: if thread A sets a signal and immediately goes to sleep while thread B is the only one waiting on that signal and begins executing, then thread B should basically pick up the CPU slice that thread A was using. This is an OS feature, not something that can be mimicked by library software alone.

As you can see, as long as you are only concerned with Linux, it is not necessary to over-design a parallel task queue execution engine. However, as soon as you cross the chasm to Windows, all of the concerns above apply, and the engine design becomes vastly different.

OpenCV does not design its own engine. Instead, it delegates to whatever engine is available on the platform, such as TBB, PPL or OpenMP. These big-vendor engines have been optimized for every platform they are designed to run on.

With regard to the thread-safety inside OpenCV:

Basically, you are on your own. Multithreading bugs have been found and fixed in OpenCV, but new bugs continue to be found. If you suspect a bug, you can open a bug report, or better yet submit a test case along with a pull request to the GitHub repository.

When accessing an OpenCV matrix from multiple threads, methods such as Mat::ptr basically let you access its data as if you were in a C program. You deal with the raw pointers, you read your C++ platform's documentation about the thread-safety of compiler-generated code, and you write your code to be thread-safe.

There is no help or magic involved. You decide on, and perform, any locking that is deemed necessary.



Thanks, that was very helpful. But some questions are still open. 1st: Is there a nice way to parametrize the parallel_for differently for each thread, apart from the solution mentioned in the question? And 2nd: is write access to Mat objects (or any memory write) thread-safe when using parallel_for?

stfn ( 2014-08-29 05:02:43 -0600 )

Seen: 3,749 times