Fastest method to process an image

asked 2014-02-21 05:24:18 -0600


Hi,

I would like to know: what do you find to be the best (fastest) way to process an image?

I did some performance tests on a 60 megapixel image and a pixel-by-pixel time-intensive operation (dest=sqrt(src/255.0)*255). I'm using Linux and a 4-core Core i7 processor.

I implemented this in three ways: 1. MatIterator; 2. two nested for loops (with pointers); 3. a parallel_for_ over the rows with a for loop inside each row.

To my surprise, MatIterator was 3 times faster than the other two (570ms), while there was no difference between the other two methods (1700-1750ms).

When I checked the processor usage, it seemed that only one core was used while the program was running.

My questions are:

Is there a better way to process images?
Is it normal that there is no speed gain with parallel_for_?
Do I need to set some flags or use some libraries when compiling with g++ to get parallel code?

Thanks for any explanations!

If you want, I can post my code, too.


answered 2014-02-21 06:45:31 -0600

tuannhtn


You didn't mention which version of OpenCV you use, so I assume it is a precompiled release build (e.g. OpenCV 2.4.8). That build was not compiled with TBB, so parallel_for_ gives no speed improvement. That is why, in your test, MatIterator is faster than parallel_for_.

As for the pointer version (2.): OpenCV's sqrt function is implemented with SSE2 instructions, which process 4 float (or 2 double) values per instruction, so it is naturally faster than a simple pointer implementation that handles only 1 float at a time. If you use SSE with pointers and two for loops, the speed will be the same.

I have tested parallel_for_ with a TBB-enabled build of OpenCV, and it is usually about nCores times faster than a plain MatIterator (where nCores is the number of CPU cores on your system). So I suggest you build OpenCV with TBB yourself and use parallel_for_; for more on that, see http://answers.opencv.org/question/22115/best-way-to-apply-a-function-to-each-element-of/. Hope this helps.
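To make the row-splitting idea concrete, here is a minimal sketch of the operation from the question, dest = sqrt(src/255.0)*255, applied in bands of rows. It does not use OpenCV: std::thread stands in for cv::parallel_for_, and the image is assumed to be 8-bit, single-channel and stored row by row in a flat buffer (the question does not state the pixel type). The names process_rows and process_image_parallel are made up for this illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Applies dest = sqrt(src / 255.0) * 255 to rows [row_begin, row_end).
// Assumption: 8-bit, single-channel image in a flat row-major buffer.
// This plays the role of the loop body that a parallel_for_ functor would run.
void process_rows(const std::vector<std::uint8_t>& src,
                  std::vector<std::uint8_t>& dst,
                  std::size_t width, std::size_t row_begin, std::size_t row_end) {
    for (std::size_t y = row_begin; y < row_end; ++y) {
        const std::uint8_t* in = src.data() + y * width;  // start of row y
        std::uint8_t* out = dst.data() + y * width;
        for (std::size_t x = 0; x < width; ++x) {
            const double v = std::sqrt(in[x] / 255.0) * 255.0;
            out[x] = static_cast<std::uint8_t>(std::lround(v));
        }
    }
}

// Splits the rows into contiguous bands and processes each band on its own
// thread. A stand-in for cv::parallel_for_, not OpenCV's implementation.
void process_image_parallel(const std::vector<std::uint8_t>& src,
                            std::vector<std::uint8_t>& dst,
                            std::size_t width, std::size_t height,
                            unsigned num_threads) {
    num_threads = std::max(1u, num_threads);
    const std::size_t band = (height + num_threads - 1) / num_threads;  // ceiling
    std::vector<std::thread> workers;
    for (std::size_t begin = 0; begin < height; begin += band) {
        const std::size_t end = std::min(height, begin + band);
        workers.emplace_back(process_rows, std::cref(src), std::ref(dst),
                             width, begin, end);
    }
    for (auto& w : workers) w.join();  // wait for every band to finish
}
```

Each thread writes to a disjoint set of rows, so no locking is needed. Calling process_rows(src, dst, width, 0, height) gives the single-threaded result for comparison.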


Comments

Thanks for the answer! The examples in the post you linked are very well written.

I used the latest OpenCV compiled by myself, but I didn't enable TBB. I recompiled and reinstalled everything using TBB, but I'm still doing something wrong. In my test, the execution time of the TBB and OpenCV parallel_for_ version grew from 1700ms to 2700ms.

To make this reproducible: I installed libtbb2 and libtbb-dev (v4.1). I cloned the OpenCV GitHub repo and compiled it using these instructions: http://answers.opencv.org/question/10/how-to-build-opencv-with-tbb-support/ Then I ran maythe4th's example from here: http://answers.opencv.org/question/22115/best-way-to-apply-a-function-to-each-element-of/ My results: Normal: 680ms, TBB: 700ms, OpenCV: 700ms. So no speed gain. Do you know why?

kbarni ( 2014-02-21 11:04:25 -0600 )

I suspect something is wrong with your configuration for compiling OpenCV with TBB. Can you check whether your program uses more than one core when running that test? On my laptop with a dual-core 2.4 GHz CPU, the results I got are even better than yours: Normal: 773 ms, TBB: 378 ms, OpenCV parallel_for_: 378 ms. And on my workstation with 2 CPUs (each with 4 cores at 1.8 GHz), the results are more impressive: Normal: 675 ms, TBB: 78 ms, OpenCV parallel_for_: 78 ms. The linked post says that OpenCV parallel_for_ is faster than TBB, but I found they run at the same speed; I prefer TBB since its syntax is simpler. Also note that TBB 4.2 Update 2 is a little faster and more stable than TBB 4.2 Update 3 (released very recently); I do not know why.

tuannhtn ( 2014-02-21 11:49:33 -0600 )

Strange... At first (in normal mode) only one core is running, then all 4 cores are used for the parallel operations. Still, I don't get better performance...

kbarni ( 2014-02-21 12:08:37 -0600 )

Which compiler did you use? And did you turn on any optimization options (e.g. -Ofast -std=c++11)?

tuannhtn ( 2014-02-21 12:12:54 -0600 )

Oh, it was my own mistake from the beginning. I was measuring time with the clock() function, which measures the CPU time used by the whole process (summed over all threads, so it does not drop when the work is spread across cores). By changing it to clock_gettime(), I finally get the correct values:

Normal: 634ms, TBB: 175ms, OpenCV: 186ms.

Thanks for your help!

kbarni ( 2014-02-24 03:00:09 -0600 )
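The timing pitfall described in the comment above can be reproduced with a small sketch like the following. It runs the same CPU-bound work on several threads and reports both std::clock() (CPU time of the process) and elapsed wall-clock time. As an assumption for portability, std::chrono::steady_clock stands in for clock_gettime(); busy_work, Timing and time_workload are hypothetical names, and the square-root loop is only a placeholder for the real per-pixel work.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <ctime>
#include <thread>
#include <vector>

// Result of timing one workload with both clocks, in milliseconds.
struct Timing {
    double cpu_ms;   // std::clock(): CPU time consumed by the whole process
    double wall_ms;  // std::chrono::steady_clock: elapsed real time
};

// Hypothetical stand-in for the per-pixel work: burns CPU on square roots.
static double busy_work(long iterations) {
    double acc = 0.0;
    for (long i = 0; i < iterations; ++i) {
        acc += std::sqrt(static_cast<double>(i % 256) / 255.0) * 255.0;
    }
    return acc;
}

// Runs busy_work on num_threads threads and measures it with both clocks.
Timing time_workload(std::size_t num_threads, long iterations_per_thread) {
    std::vector<double> sinks(num_threads, 0.0);  // keeps the results alive
    const std::clock_t cpu_start = std::clock();
    const auto wall_start = std::chrono::steady_clock::now();

    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&sinks, t, iterations_per_thread] {
            sinks[t] = busy_work(iterations_per_thread);
        });
    }
    for (auto& w : workers) w.join();

    const auto wall_end = std::chrono::steady_clock::now();
    const std::clock_t cpu_end = std::clock();

    Timing result;
    result.cpu_ms = 1000.0 * static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    result.wall_ms =
        std::chrono::duration<double, std::milli>(wall_end - wall_start).count();
    return result;
}
```

On a multi-core Linux machine, cpu_ms from time_workload(4, n) would be expected to come out close to the sum over all threads, while wall_ms shrinks as cores are added. That would explain why the parallel versions looked no faster when measured with clock().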

Ah, since C++11 is now an excellent and widely supported standard, I suggest you use high_resolution_clock from the C++11 chrono library; it offers simple syntax with native code.

tuannhtn ( 2014-02-24 03:14:14 -0600 )
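A minimal sketch of the chrono-based timing suggested above could look like this. The helper name elapsed_ms is made up for illustration; it times any callable once and returns milliseconds.

```cpp
#include <chrono>
#include <utility>

// Runs `work` once and returns the elapsed time in milliseconds,
// measured with std::chrono::high_resolution_clock.
template <typename Callable>
double elapsed_ms(Callable&& work) {
    using clock = std::chrono::high_resolution_clock;
    const auto start = clock::now();
    std::forward<Callable>(work)();  // the code under test
    const auto stop = clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

It could be used as double ms = elapsed_ms([&] { /* process the image here */ });. One caveat: the standard allows high_resolution_clock to be an alias for system_clock, which can be adjusted while the program runs, so std::chrono::steady_clock may be the safer choice for measuring intervals.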

