
SIMD optimizations get no performance gains on ARM (NEON)

Hi all, I recently compiled OpenCV (commit 9ec3d76b21e7f9b15b8ffccfafe254b6113d0a75, a few commits after 4.2.0) on ARM and x86 with SIMD optimization ON and OFF and compared the performance. Strangely, on ARM there is no significant improvement when NEON is enabled, and some functions even get slower, while on x86 OpenCV benefits a lot from the SSE, SSE2, AVX... instruction sets. I read some of the source code and saw that most of the ARM optimization is implemented with NEON intrinsics, so does this mean that a hardware gap between ARM and x86 makes the difference? Or is there something I missed? Below are the environment and build options I tested with:

# hardware: 16U32G on both ARM and x86

# cmake options on ARM (enable / disable NEON):
cmake -D CMAKE_BUILD_TYPE=Release -D CMAKE_INSTALL_PREFIX=/usr/local -DENABLE_NEON=ON ../
cmake -D CMAKE_BUILD_TYPE=Release -D CMAKE_INSTALL_PREFIX=/usr/local -DENABLE_NEON=OFF ../

# cmake options on x86 (disable / enable the CPU_BASELINE optimizations, see https://github.com/opencv/opencv/wiki/CPU-optimizations-build-options/):
cmake -D CMAKE_BUILD_TYPE=Release -D CMAKE_INSTALL_PREFIX=/usr/local -D CPU_BASELINE= -D CPU_DISPATCH= ../
cmake -D CMAKE_BUILD_TYPE=Release -D CMAKE_INSTALL_PREFIX=/usr/local ../
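
A minimal sketch for sanity-checking the two ARM builds, assuming an AArch64 target compiled with GCC or Clang (neither is stated above). On AArch64 these compilers normally define `__ARM_NEON` without any extra flags, in which case a build configured with ENABLE_NEON=OFF may still be compiled with NEON. The helper name `neon_status` is hypothetical and not part of OpenCV.

```shell
#!/bin/sh
# neon_status: hypothetical helper, not part of OpenCV.
# Reads a compiler macro dump on stdin and reports whether the NEON
# feature macro (__ARM_NEON, or the older __ARM_NEON__) is defined.
neon_status() {
    if grep -Eq '^#define __ARM_NEON(__)? '; then
        echo "NEON macro defined"
    else
        echo "NEON macro not defined"
    fi
}

# Usage (assumes GCC or Clang is installed as `cc`; run it on the ARM machine):
#   cc -dM -E - </dev/null | neon_status
```

If this reports the macro as defined with no extra flags, the ON and OFF builds may differ less than the cmake options suggest. The "CPU/HW features" lines in the CMake configuration summary show what each build actually uses as its baseline.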

Then I ran the tests following the instructions at https://github.com/opencv/opencv/wiki/HowToUsePerfTests

Since the whole test suite takes a long time, I only tested the 'blur' and 'resize' operations. The results show that the NEON-optimized build doesn't improve much, and some functions even show a big performance drop.
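
For reference, a restricted run like this can be sketched as below. This is an assumption about how such a run could look, not the exact commands used for the numbers above. `run_perf` is a hypothetical wrapper, while `--gtest_filter` and `--gtest_output` are standard Google Test flags accepted by the OpenCV perf binaries. The binary path and filter pattern in the usage comment are assumptions; note that Google Test filters are case-sensitive.

```shell
#!/bin/sh
# run_perf: hypothetical wrapper, not part of OpenCV.
#   $1  perf test binary (e.g. ./bin/opencv_perf_imgproc in the build tree)
#   $2  Google Test filter, case-sensitive (e.g. '*blur*:*resize*')
#   $3  path of the XML report to write
run_perf() {
    "$1" --gtest_filter="$2" --gtest_output="xml:$3"
}

# Usage, once in each build tree (paths and patterns are assumptions):
#   run_perf ./bin/opencv_perf_imgproc '*blur*:*resize*' imgproc_neon_on.xml
#   run_perf ./bin/opencv_perf_imgproc '*blur*:*resize*' imgproc_neon_off.xml
```

The two XML reports can then be compared test by test, for example with the summary.py script in OpenCV's modules/ts/misc directory.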