# Revision history [back]

Haar features are inherently slow - they make extensive use of floating point operations, which are a bit slow on mobile devices.

A quick solution would be to turn to lbb cascades - all you need is a few lines changed in your code. The performance gain is significant, and the loss in accuracy is minimal. Look for lbpcascades/lbpcascade_frontalface.xml.

If you want to dig deeper into optimzations, here is a generic optimization tip list (cross-posted from SO) Please note that face detection, being one of the most requested features of OpenCV, is already quite optimized, so advancing it further may mean deep knowledge.

A. Profile your app. Do it first on your computer, since it is much easier. Use visual studio profiler, and see what functions take the most. Optimize them. Never ever optimize because you think is slow, but because you measure it. Start with the slowest function, optimize it as much as possible, then take the second slower.

B. First, focus on algorithms. A faster algorithm can improve performance with orders of magnitude (100x). A C++ trick will give you maybe 2x performance boost.

Classical techniques:

• Resize you video frames to be smaller. many times, you can extract the information from a 200x300px image, instead of a 1024x768. The area of the first one is 10 times smaller.

• Use simpler operations instead of complicated ones. Use integers instead of floats. And never use double in a matrix or a for loop that executes thousands of times.

• Do as little calculation as possible. Can you track an object only in a specific area of the image, instead of processing it all for all the frames? Can you make a rough/approximate detection on a very small image and then refine it on a ROI in the full frame?

C. In for loops, it may make sense to use C style instead of C++. A pointer to data matrix or a float array is much faster than mat.at<i, j=""> or std::vector<>. But change only if it's needed. Usually, a lot of processing (90%) is done in some double for loop. Focus on it. It doesn't make sense to replace vector<> all over the place, ad make your code look like spaghetti.

D. Some OpenCV functions convert data to double, process it, then convert back to the input format. Beware of them, they kill performance on mobile devices. Examples: warping, scaling, type conversions. Also, color space conversions are known to be lazy. Prefer grayscale obtained directly from native YUV.

E. ARM processors have NEON. Learn and use it. It is powerfull!

A small example:

float* a, *b, *c;
// init a and b to 1000001 elements
for(int i=0;i<1000001;i++)
c[i] = a[i]*b[i];


can be rewritten as follows. It's more verbose, but trust me it's faster.

float* a, *b, *c;
// init a and b to 1000001 elements
float32x4_t _a, _b, _c;
int i;
for(i=0;i<1000001;i+=4)
{
a_ = vld1q_f32( &a[i] ); // load 4 floats from a in a NEON register
b_ = vld1q_f32( &b[i] );
c_ = vmulq_f32(a_, b_); // perform 4 float multiplies in parrallel
vst1q_f32( &c[i], c_); // store the four results in c
}
// the vector size is not always multiple of 4 or 8 or 16.
// Process the remaining elements
for(;i<1000001;i++)
c[i] = a[i]*b[i];


Purists say you must write in assembler, but for the regular programmer guy that's a bit daunting. I found good results writing with intrinsics, like in the above example.

Also check this blog post and the following posts about NEON.

And, last but not least, I should mention that I had very good success converting the SSE optimizations (this is the NEON counterpart in x86-64 processors) in OpenCV to NEON, like here. This is the image filtering code for uchar matrices (the regular image format). You should't blindly convert instructions one by one, because there are better ways to do it, but take it as an example to start with.

Haar features are inherently slow - they make extensive use of floating point operations, which are a bit slow on mobile devices.

A quick solution would be to turn to lbb LBP cascades - all you need is a few lines changed in your code. The performance gain is significant, and the loss in accuracy is minimal. Look for lbpcascades/lbpcascade_frontalface.xml.

If you want to dig deeper into optimzations, here is a generic optimization tip list (cross-posted from SO) Please note that face detection, being one of the most requested features of OpenCV, is already quite optimized, so advancing it further may mean deep knowledge.

A. Profile your app. Do it first on your computer, since it is much easier. Use visual studio profiler, and see what functions take the most. Optimize them. Never ever optimize because you think is slow, but because you measure it. Start with the slowest function, optimize it as much as possible, then take the second slower.

B. First, focus on algorithms. A faster algorithm can improve performance with orders of magnitude (100x). A C++ trick will give you maybe 2x performance boost.

Classical techniques:

• Resize you video frames to be smaller. many times, you can extract the information from a 200x300px image, instead of a 1024x768. The area of the first one is 10 times smaller.

• Use simpler operations instead of complicated ones. Use integers instead of floats. And never use double in a matrix or a for loop that executes thousands of times.

• Do as little calculation as possible. Can you track an object only in a specific area of the image, instead of processing it all for all the frames? Can you make a rough/approximate detection on a very small image and then refine it on a ROI in the full frame?

C. In for loops, it may make sense to use C style instead of C++. A pointer to data matrix or a float array is much faster than mat.at<i, j=""> or std::vector<>. But change only if it's needed. Usually, a lot of processing (90%) is done in some double for loop. Focus on it. It doesn't make sense to replace vector<> all over the place, ad make your code look like spaghetti.

D. Some OpenCV functions convert data to double, process it, then convert back to the input format. Beware of them, they kill performance on mobile devices. Examples: warping, scaling, type conversions. Also, color space conversions are known to be lazy. Prefer grayscale obtained directly from native YUV.

E. ARM processors have NEON. Learn and use it. It is powerfull!

A small example:

float* a, *b, *c;
// init a and b to 1000001 elements
for(int i=0;i<1000001;i++)
c[i] = a[i]*b[i];


can be rewritten as follows. It's more verbose, but trust me it's faster.

float* a, *b, *c;
// init a and b to 1000001 elements
float32x4_t _a, _b, _c;
int i;
for(i=0;i<1000001;i+=4)
{
a_ = vld1q_f32( &a[i] ); // load 4 floats from a in a NEON register
b_ = vld1q_f32( &b[i] );
c_ = vmulq_f32(a_, b_); // perform 4 float multiplies in parrallel
vst1q_f32( &c[i], c_); // store the four results in c
}
// the vector size is not always multiple of 4 or 8 or 16.
// Process the remaining elements
for(;i<1000001;i++)
c[i] = a[i]*b[i];


Purists say you must write in assembler, but for the regular programmer guy that's a bit daunting. I found good results writing with intrinsics, like in the above example.

Also check this blog post and the following posts about NEON.

And, last but not least, I should mention that I had very good success converting the SSE optimizations (this is the NEON counterpart in x86-64 processors) in OpenCV to NEON, like here. This is the image filtering code for uchar matrices (the regular image format). You should't blindly convert instructions one by one, because there are better ways to do it, but take it as an example to start with.

 3 No.3 Revision Kirill Kornyakov 2742 ●13 ●24 ●52

Haar features are inherently slow - they make extensive use of floating point operations, which are a bit slow on mobile devices.

A quick solution would be to turn to LBP cascades - all you need is a few lines changed in your code. The performance gain is significant, and the loss in accuracy is minimal. Look for lbpcascades/lbpcascade_frontalface.xml.

If you want to dig deeper into optimzations, here is a generic optimization tip list (cross-posted from SO) Please note that face detection, being one of the most requested features of OpenCV, is already quite optimized, so advancing it further may mean deep knowledge.

A. Profile your app. Do it first on your computer, since it is much easier. Use visual studio profiler, and see what functions take the most. Optimize them. Never ever optimize because you think is slow, but because you measure it. Start with the slowest function, optimize it as much as possible, then take the second slower.

B. First, focus on algorithms. A faster algorithm can improve performance with orders of magnitude (100x). A C++ trick will give you maybe 2x performance boost.

Classical techniques:

• Resize you video frames to be smaller. many times, you can extract the information from a 200x300px image, instead of a 1024x768. The area of the first one is 10 times smaller.

• Use simpler operations instead of complicated ones. Use integers instead of floats. And never use double in a matrix or a for loop that executes thousands of times.

• Do as little calculation as possible. Can you track an object only in a specific area of the image, instead of processing it all for all the frames? Can you make a rough/approximate detection on a very small image and then refine it on a ROI in the full frame?

C. In for loops, it may make sense to use C style instead of C++. A pointer to data matrix or a float array is much faster than mat.at<i, j=""> or std::vector<>. But change only if it's needed. Usually, a lot of processing (90%) is done in some double for loop. Focus on it. It doesn't make sense to replace vector<> all over the place, ad make your code look like spaghetti.

D. Some OpenCV functions convert data to double, process it, then convert back to the input format. Beware of them, they kill performance on mobile devices. Examples: warping, scaling, type conversions. Also, color space conversions are known to be lazy. Prefer grayscale obtained directly from native YUV.

E. ARM processors have NEON. Learn and use it. It is powerfull!

A small example:

float* a, *b, *c;
// init a and b to 1000001 elements
for(int i=0;i<1000001;i++)
c[i] = a[i]*b[i];


can be rewritten as follows. It's more verbose, but trust me it's faster.

float* a, *b, *c;
// init a and b to 1000001 elements
float32x4_t _a, _b, _c;
int i;
for(i=0;i<1000001;i+=4)
{
a_ = vld1q_f32( &a[i] ); // load 4 floats from a in a NEON register
b_ = vld1q_f32( &b[i] );
c_ = vmulq_f32(a_, b_); // perform 4 float multiplies in parrallel
vst1q_f32( &c[i], c_); // store the four results in c
}
// the vector size is not always multiple of 4 or 8 or 16.
// Process the remaining elements
for(;i<1000001;i++)
c[i] = a[i]*b[i];


Purists say you must write in assembler, but for the regular programmer guy that's a bit daunting. I found good results writing with intrinsics, like in the above example.

Also check this blog post and the following posts about NEON.

And, last but not least, I should mention that I had very good success converting the SSE optimizations (this is the NEON counterpart in x86-64 processors) in OpenCV to NEON, like here. This is the image filtering code for uchar matrices (the regular image format). You should't blindly convert instructions one by one, because there are better ways to do it, but take it as an example to start with.