Some ARM NEON architectures do not have a native floating-point division instruction for vector data. Instead, the operation must be composed from a sequence of native instructions that together implement an iterative reciprocal estimate algorithm (most likely a Newton-Raphson refinement).
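For reference, the sequence I have in mind looks roughly like the sketch below, written with NEON intrinsics: VRECPE supplies the initial estimate of the reciprocal and each VRECPS plus multiply performs one Newton-Raphson refinement step. Two refinement steps is the commonly quoted count for near-full single-precision accuracy; I have not verified the accuracy or timing myself.

```cpp
#include <arm_neon.h>

// Divide four floats a/b using the reciprocal-estimate sequence available on
// NEON variants that lack a vector divide instruction.
static inline float32x4_t div_f32x4(float32x4_t a, float32x4_t b)
{
    float32x4_t recip = vrecpeq_f32(b);                // rough estimate of 1/b
    recip = vmulq_f32(vrecpsq_f32(b, recip), recip);   // first Newton-Raphson step
    recip = vmulq_f32(vrecpsq_f32(b, recip), recip);   // second Newton-Raphson step
    return vmulq_f32(a, recip);                        // a * (1/b)
}
```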
C++ compilers targeting ARM NEON should automatically generate such instruction sequences for scalar floating-point source code, or else defer to a standard math library call. However, if the library code explicitly loops over each element and performs its own non-trivial approximation, it seems unlikely that a C++ compiler (even with auto-vectorization enabled) would override that hard-coded logic.
It appears that unless the library provides ARM NEON-specific code for elementwise matrix floating-point division, it will fall back to scalar C++ code, resulting in one math library function call per matrix element.
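In other words, if the library does fall back to scalar code, I would expect to have to hand-roll something like the loop below (a hypothetical sketch; function names are mine and tail handling is omitted) just to get the per-element divisions onto the NEON unit:

```cpp
#include <arm_neon.h>
#include <cstddef>

// Hypothetical hand-vectorized elementwise division of two float buffers,
// processing four elements per iteration with the div_f32x4 helper shown above.
// Assumes n is a multiple of 4; a real version would handle the remainder scalarly.
void divide_buffers_neon(const float* a, const float* b, float* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);   // load 4 numerators
        float32x4_t vb = vld1q_f32(b + i);   // load 4 denominators
        vst1q_f32(out + i, div_f32x4(va, vb));
    }
}
```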
I see that OpenCV contains a very nifty vector-of-four elementwise division routine, but I doubt it could beat the natively implemented instructions.
Has anyone performed a benchmark on mobile ARM NEON processors to evaluate the performance of the native NEON vector reciprocal estimate operations?