SIMD optimizations get no performance gains on ARM (NEON)

asked 2020-01-05 19:55:32 -0500

Hi all, I recently compiled OpenCV (commit 9ec3d76b21e7f9b15b8ffccfafe254b6113d0a75, a few commits after 4.2.0) on ARM and x86 with SIMD optimization ON and OFF, and compared the performance. Strangely, on ARM, enabling NEON gives no significant performance improvement, and in some cases performance even drops, while on x86 the SSE, SSE2, AVX... instruction sets benefit OpenCV a lot. I read some of the source code and saw that most ARM optimizations are implemented with NEON intrinsics. Does this mean that a hardware gap between ARM and x86 makes the difference, or is there something I missed? Below are the environment and build options I tested with:

# hardware: 16U32G on both ARM and x86

# cmake options on ARM (enable / disable NEON):
cmake -D CMAKE_BUILD_TYPE=Release -D CMAKE_INSTALL_PREFIX=/usr/local -DENABLE_NEON=ON ../
cmake -D CMAKE_BUILD_TYPE=Release -D CMAKE_INSTALL_PREFIX=/usr/local -DENABLE_NEON=OFF ../

# cmake options on x86 (disable / enable the CPU_BASELINE option, see https://github.com/opencv/opencv/wiki/CPU-optimizations-build-options/):
cmake -D CMAKE_BUILD_TYPE=Release -D CMAKE_INSTALL_PREFIX=/usr/local -D CPU_BASELINE= -D CPU_DISPATCH= ../
cmake -D CMAKE_BUILD_TYPE=Release -D CMAKE_INSTALL_PREFIX=/usr/local ../

Then I ran the tests following these instructions: https://github.com/opencv/opencv/wiki...

Since the whole test suite takes a long time, I only tested the 'blur' and 'resize' operations. The results show that the NEON-optimized build doesn't improve much, and some functions even show a large performance drop.
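For a quick cross-check outside the full perf suite, a small timing script could be run against each build. This is only a rough sketch, not the official perf tests: it assumes the Python bindings (cv2) and NumPy are available for the build under test, and the image size, kernel size, and the helper name median_ms are made up for illustration.

```python
import statistics
import time


def median_ms(fn, repeats=50, warmup=5):
    """Median wall-clock time of fn() in milliseconds, after a few warm-up calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def main():
    try:
        import cv2
        import numpy as np
    except ImportError:
        # Assumption: the build under test ships Python bindings; if not, skip.
        print("cv2/numpy not importable; nothing measured")
        return
    # Hypothetical input: a random 1920x1080 8-bit, 3-channel image.
    rng = np.random.default_rng(0)
    img = rng.integers(0, 256, size=(1080, 1920, 3), dtype=np.uint8)
    blur_ms = median_ms(lambda: cv2.blur(img, (5, 5)))
    resize_ms = median_ms(
        lambda: cv2.resize(img, (960, 540), interpolation=cv2.INTER_LINEAR)
    )
    print("blur 5x5 (ms):", blur_ms)
    print("resize to 960x540 (ms):", resize_ms)


if __name__ == "__main__":
    main()
```

Running the same script against the NEON=ON and NEON=OFF builds on the same machine gives two pairs of numbers to compare. The median is used so that a few slow outlier runs do not dominate the result.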


Comments

Please post the ARM hardware info and the perf test output for ARM with and without optimization. Also post the output of cv::getBuildInformation().

Eduardo (2020-01-06 06:39:19 -0500)
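One way to collect the requested build information is a short script like the one below. It is a minimal sketch that assumes the Python bindings (cv2) were built for each configuration; the helper cpu_features_section is a made-up name that pulls the "CPU/HW features" block out of the text returned by getBuildInformation(), and it assumes that block ends at the first blank line.

```python
def cpu_features_section(build_info):
    """Return the stripped lines of the 'CPU/HW features' block of the build info text."""
    section = []
    inside = False
    for line in build_info.splitlines():
        if line.strip().startswith("CPU/HW features"):
            inside = True
            continue
        if inside:
            if not line.strip():
                break  # assumed: a blank line ends the block
            section.append(line.strip())
    return section


def main():
    try:
        import cv2
    except ImportError:
        # Assumption: Python bindings exist for the build being inspected.
        print("cv2 not importable; use cv::getBuildInformation() from C++ instead")
        return
    for line in cpu_features_section(cv2.getBuildInformation()):
        print(line)
    # Runtime check of whether this build reports NEON support on this CPU.
    print("NEON reported at runtime:", cv2.checkHardwareSupport(cv2.CPU_NEON))


if __name__ == "__main__":
    main()
```

Comparing the baseline and dispatch lines from the ENABLE_NEON=ON and ENABLE_NEON=OFF builds should show whether the two builds actually differ in their enabled SIMD features.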