Multithreading not effective for stereo vision with small images
I am building a stereo camera device that should also do the processing (= stereo matching) on board. My requirement is high FPS, ideally 60 FPS, but the resolution can be as small as QVGA (320x240) or even QQVGA (160x120). Such small resolutions should be fast to process. I would like to use a quad-core ARM Cortex-A9 (namely the i.MX6 Quad by Freescale) for the stereo computation, because I have an OEM who also supplies compatible camera modules.
My intention is to use StereoBM. I tried it on a Raspberry Pi 2 (quad-core ARM Cortex-A7) and noticed that the lower the resolution, the worse the CPU scales. I can control the number of threads StereoBM uses (by setting the nstripes variable and/or calling setNumThreads) and thus the number of CPU cores used; the number of threads translates directly into CPU usage, which can be seen in the CPU utilization %.
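To make this concrete, here is a minimal sketch of my setup (OpenCV 3.x API; the file names and the numDisparities/blockSize values are placeholders, and nstripes lives inside OpenCV's StereoBM source, so that part is not shown here):

```cpp
// Minimal sketch (OpenCV 3.x API). File names and the StereoBM
// parameters (numDisparities, blockSize) are placeholders; left and
// right must be rectified 8-bit grayscale images of equal size.
#include <opencv2/core.hpp>
#include <opencv2/core/utility.hpp>   // cv::setNumThreads
#include <opencv2/calib3d.hpp>        // cv::StereoBM
#include <opencv2/imgcodecs.hpp>      // cv::imread

int main()
{
    cv::Mat left  = cv::imread("left.png",  cv::IMREAD_GRAYSCALE);
    cv::Mat right = cv::imread("right.png", cv::IMREAD_GRAYSCALE);

    cv::setNumThreads(4); // cap OpenCV's internal worker pool

    cv::Ptr<cv::StereoBM> bm = cv::StereoBM::create(64, 21);

    cv::Mat disparity;                  // CV_16S fixed-point output
    bm->compute(left, right, disparity);
    return 0;
}
```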
However, I have reached a point where further parallelisation (read: using more cores) does not help, or is even counterproductive. Here are my results for StereoBM on the RPi2 (timing loop sketched after the list):
QVGA (320x240):
- 2 threads (OpenCV default) = ~50ms
- 4 threads forced = ~50ms
QQVGA (160x120):
- 1 thread (OpenCV default) = ~20ms
- 4 threads forced = ~20ms
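For reference, the numbers above come from a loop roughly like this (a sketch, continuing from the snippet above with left, right and bm already initialised; the run count is arbitrary):

```cpp
// Timing sketch: averages compute() over `runs` iterations for each
// thread count, using OpenCV's tick counter. Needs <cstdio>.
for (int threads = 1; threads <= 4; ++threads) {
    cv::setNumThreads(threads);

    cv::Mat disparity;
    const int runs = 100;               // arbitrary
    int64 t0 = cv::getTickCount();
    for (int i = 0; i < runs; ++i)
        bm->compute(left, right, disparity);
    double ms = (cv::getTickCount() - t0) * 1000.0
                / cv::getTickFrequency() / runs;
    std::printf("%d thread(s): %.1f ms/frame\n", threads, ms);
}
```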
I know 20 ms at QQVGA corresponds to 50 FPS, but I would like QVGA to be that fast too. Is there a way to further speed up the computation on an ARM CPU? Is it possible to use multiple threads/cores more effectively for small images? I suspect the parallelisation overhead is too large relative to the small amount of data, so it does not pay off; in my tests, bigger images benefit more from multithreading.
The ultimate solution would most likely be to rewrite StereoBM for an FPGA, but I want to avoid that.
Note: I don't really know what type of threads is used. I compiled OpenCV with TBB support, but multithreading worked even without TBB support (I suspect OpenMP may have been used instead).
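One way to verify which backend a build actually uses is to print the build information; it should include a "Parallel framework" line:

```cpp
// Print OpenCV's build configuration; the "Parallel framework" line
// shows which threading backend (TBB, OpenMP, pthreads, ...) is used.
#include <iostream>
#include <opencv2/core/utility.hpp>

int main()
{
    std::cout << cv::getBuildInformation() << std::endl;
    return 0;
}
```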