Thanks for all your help. Here are my results:
I managed to successfully rebuild OpenCV from master with NEON and VFPV3 enabled, on top of a nightly build of VC4CL. I also have TensorFlow 1.8.0 installed from the tensorflow-on-arm project here: https://github.com/lhelontra/tensorflow-on-arm/releases. I tried to enable TBB, but OpenCV seems to fall back to pthreads.
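To double-check which of these actually made it into the build, the summary string from `cv2.getBuildInformation()` can be scanned. A minimal sketch; the feature labels below are assumptions about how the build summary names them, and the real call would pass in the string from `cv2`:

```python
import re

# Scan OpenCV's build summary (the string returned by
# cv2.getBuildInformation()) for the optimizations we care about.
# The feature labels here are assumptions about how the summary spells them.
def enabled_features(build_info, features=("NEON", "VFPV3", "TBB", "PTHREADS")):
    return [f for f in features if re.search(f, build_info, re.IGNORECASE)]

# In a real session: print(enabled_features(cv2.getBuildInformation()))
sample = "CPU/HW features: NEON VFPV3\nParallel framework: pthreads"
print(enabled_features(sample))
```

If "TBB" never shows up in the output, the pthreads fallback would explain what I'm seeing.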
I can now load the uint8 model with the provided prototxt, and detection works, by and large.
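For reference, this is roughly how I'm driving it (file names are placeholders, and I'm assuming the usual SSD-style `(1, 1, N, 7)` detection output with rows of `[_, class_id, confidence, x1, y1, x2, y2]` in normalized coordinates):

```python
import numpy as np

# Sketch of the detection loop; model file names are placeholders.
# net = cv2.dnn.readNetFromTensorflow("frozen_inference_graph.pb", "graph.pbtxt")
# blob = cv2.dnn.blobFromImage(frame, size=(80, 80), swapRB=True)
# net.setInput(blob)
# out = net.forward()  # assumed shape (1, 1, N, 7), SSD-style rows

def to_boxes(out, w, h, conf_threshold=0.5):
    """Convert a (1,1,N,7) SSD detection blob to pixel-space boxes."""
    boxes = []
    for _, class_id, conf, x1, y1, x2, y2 in out[0, 0]:
        if conf >= conf_threshold:
            boxes.append((int(class_id), float(conf),
                          int(x1 * w), int(y1 * h), int(x2 * w), int(y2 * h)))
    return boxes
```

The scaling by `w`/`h` at the end is where a mismatch between blob size and frame size could shift boxes, which might be related to the offset I see when downscaling.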
- The detection boundaries appear slightly off when I downscale the input. This seems specific to this model, as it doesn't happen with either of the Caffe models.
- Speed is essentially the same as the Caffe model's. 80x80 is again the sweet spot, where it reaches 10 FPS; anything larger drops to 3-4 FPS.
- The speed of the Caffe model is also roughly the same as it was with 3.4.1
- When I run it as the pi user, OpenCL is only used for the camera capture (V4L), and it complains about the lack of root privileges. If I run it as root, it does seem to allocate a chunk of GPU memory, but speed is no faster. Annoyingly, the process hangs on exit and does not relinquish the GPU until reboot.
- It seems to use all 4 CPU cores evenly in all cases, but only at about 60% utilization, which leads me to think there is some other bottleneck at play.
- I also noticed that detection times at 128x128 are fairly jittery compared to 80x80 (200-350 ms versus 80-100 ms). As I understand it, a forward pass through a fixed network should take pretty much constant time, right?
- I played around with setNumThreads to see whether the performance jump is due to threading overhead. Running on a single core is in fact only about half as fast, so the parallelization does carry a large overhead. However, the 80x80 sweet spot is the same on a single core.
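To put numbers on the jitter and the threading comparison, I'd time repeated forward passes and look at the spread. A small sketch; the callable stands in for `lambda: net.forward()`, and `cv2.setNumThreads(1)` beforehand is the real OpenCV call to pin it to one core:

```python
import time
import statistics

# Time a callable (e.g. lambda: net.forward()) over several runs and
# summarize in milliseconds; the min/median/max spread exposes jitter.
# Call cv2.setNumThreads(1) first to repeat the single-core experiment.
def time_runs(fn, runs=20):
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000.0)  # ms
    return {"min": min(samples),
            "median": statistics.median(samples),
            "max": max(samples)}
```

A wide min-to-max spread at 128x128 but not at 80x80 would point at memory pressure or scheduling rather than the network itself.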
Maybe I'll try the Raspi Zero next for a really minimalistic system...