Strategy to build asynchronous subpixel registration analysis
Hi, I am analysing a set of images for subpixel image shifts. I have code which essentially loops through:
loop(){
- read binary image, send it to GpuMat/cuda
//next 2 points are based on dft, mulSpectrums, magnitude (all cuda "Streamable")
- convolve with smoothing/gradient kernels (cuda)
- cross-correlate (phase-correlate) with base image (cuda)
// next are locating maximum of correlation pattern with subpixel precision
- find maxLoc (cuda, but value sent to Point.x/Point.y on CPU)
- copy maxLoc 3x3 neighbours into Mat (CPU)
- subpixel registration by quadratic fit of 3x3 maxima neighbours (CPU)
- resulting (x,y) pixel shifts are placed in shift maps (CPU) }
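For illustration, the quadratic-fit step could look roughly like the sketch below. It is not the code from this post: it assumes a least-squares fit of f(x, y) = a + bx*x + by*y + cxx*x^2 + cxy*x*y + cyy*y^2 to the 3x3 neighbourhood, and the names `SubpixelOffset` and `fitQuadraticPeak` are made up for the example. The patch is assumed to be float data with the maximum at its centre.

```cpp
#include <cassert>
#include <cmath>

// Sub-pixel offset of the peak relative to the centre of a 3x3 patch.
struct SubpixelOffset {
    double dx;
    double dy;
    bool valid;  // false if the fit has no maximum inside the patch
};

// patch points at the top-left element of the 3x3 neighbourhood,
// stride is the number of elements between consecutive rows.
SubpixelOffset fitQuadraticPeak(const float* patch, int stride)
{
    assert(patch != nullptr && stride >= 3);

    // Moments of the patch on the grid x, y in {-1, 0, 1}.
    double s = 0, sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (int row = 0; row < 3; ++row) {
        for (int col = 0; col < 3; ++col) {
            const double x = col - 1;
            const double y = row - 1;
            const double f = patch[row * stride + col];
            s += f;
            sx += x * f;
            sy += y * f;
            sxx += x * x * f;
            syy += y * y * f;
            sxy += x * y * f;
        }
    }

    // Closed-form least-squares coefficients for the 3x3 grid.
    const double bx = sx / 6.0;
    const double by = sy / 6.0;
    const double cxx = sxx / 2.0 - s / 3.0;
    const double cyy = syy / 2.0 - s / 3.0;
    const double cxy = sxy / 4.0;

    // A maximum needs a negative-definite Hessian [2cxx cxy; cxy 2cyy].
    const double det = 4.0 * cxx * cyy - cxy * cxy;
    if (!(det > 0.0) || !(cxx < 0.0)) {
        return {0.0, 0.0, false};
    }

    // Zero of the gradient: 2cxx*x + cxy*y = -bx, cxy*x + 2cyy*y = -by.
    const double dx = (cxy * by - 2.0 * cyy * bx) / det;
    const double dy = (cxy * bx - 2.0 * cxx * by) / det;

    // A peak further than one pixel from the centre is treated as a failed fit.
    const bool inside = std::fabs(dx) <= 1.0 && std::fabs(dy) <= 1.0;
    return {dx, dy, inside};
}
```

The returned offset would be added to the integer maxLoc position to get the subpixel shift. Since the function only touches nine values, it is cheap on the CPU compared with the per-image GPU work.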
All this is computed ~65000 times and takes about 8 minutes (256x256 base images, 16-bit B&W). The CUDA card is not even warming up (nvidia-smi shows 6% GPU-Util).
Any suggestions on how to parallelize this (the faster the better)?
(also thanks to L.Berger who got me this far)
Have you tried it without CUDA to get a reference timing?
You assume CUDA is faster than the CPU for FFT, but that is not always true (see this post for TAPI). Your image is 256x256, which is too small to give a large improvement over the CPU. Transfer time between host memory and the GPU is not free.
I will try. My analysis has 5x dft + 2x mulSpectrums + 1x magnitude per image at 512x512 (zero-padded because of features close to the image border), and these run without any CPU code in between, so I thought CUDA might be better here. It may be faster on the CPU but not as parallel (at least I think so).
I have read this document about CUDA. Have you tried something like p. 24-27?
I tried the linear case with a single Stream, but that is roughly the same speed as without:
with stream: 7.4745 ms, without: 7.57892 ms
(averaged over 1000 readouts)
This is compared with the same functions called without myStream. I also put the code in a separate function, so I need to check whether that influences the speed.
6.64 ms if I use the code inside int main rather than in a separate function
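As a cross-check, the per-image times above agree with the total reported in the question: 8 minutes over ~65000 images is 480 s / 65000, or about 7.4 ms per image. When timing stream-based code, note that asynchronous calls return before the GPU has finished, so the clock should only be stopped after synchronising. A minimal timing helper, assuming a hypothetical name `meanMilliseconds` and a caller-supplied `sync` callback (with OpenCV this would typically wrap `cv::cuda::Stream::waitForCompletion()`), might look like this:

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Mean wall-clock time per call of `work`, in milliseconds, over `runs` calls.
// `sync` must block until all queued asynchronous work has finished; for
// purely synchronous work the default no-op is enough.
double meanMilliseconds(int runs,
                        const std::function<void()>& work,
                        const std::function<void()>& sync = [] {})
{
    assert(runs > 0);
    using Clock = std::chrono::steady_clock;

    // Warm-up call so first-use allocations and initialisation
    // are not counted in the average.
    work();
    sync();

    const Clock::time_point start = Clock::now();
    for (int i = 0; i < runs; ++i) {
        work();
    }
    // Stop the clock only after the queued work has really completed.
    sync();
    const std::chrono::duration<double, std::milli> elapsed = Clock::now() - start;
    return elapsed.count() / runs;
}
```

Timing the in-main version and the separate-function version with the same helper makes the two numbers directly comparable. One possible cause of a gap between them, stated here only as a guess, is buffers being allocated inside the function on every call rather than once outside the loop.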