Strategy to build asynchronous subpixel registration analysis

asked 2015-09-09 13:13:44 -0600

anatrong0

updated 2015-09-09 13:27:53 -0600

Hi, I am analysing a set of images for subpixel image shifts. I have code which essentially loops through:


  • read binary image, send it to GpuMat/cuda

//next 2 points are based on dft, mulSpectrums, magnitude (all cuda "Streamable")

  • convolve with smoothing/gradient kernels (cuda)
  • cross-correlate (phase-correlate) with base image (cuda)

// next are locating maximum of correlation pattern with subpixel precision

  • find maxLoc (cuda, but value sent to Point.x/Point.y on CPU)
  • copy maxLoc 3x3 neighbours into Mat (CPU)
  • subpixel registration by quadratic fit of 3x3 maxima neighbours (CPU)
  • resulting (x,y) pixel shifts are placed in shift maps (CPU)
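
The quadratic-fit step in the list above can be sketched as follows. This is only a minimal illustration, not the code used in the question: it assumes a separable fit (one parabola through the central row and one through the central column of the 3x3 neighbourhood) rather than a full 2D quadratic surface, and the names `SubpixelOffset`, `parabolaVertex` and `refinePeak` are hypothetical.

```cpp
#include <array>
#include <cmath>

// Fractional offset of the true peak relative to the integer maxLoc.
struct SubpixelOffset { double dx; double dy; };

// Vertex position of the parabola through (-1, left), (0, center), (+1, right).
double parabolaVertex(double left, double center, double right) {
    const double curvature = left - 2.0 * center + right;
    if (std::abs(curvature) < 1e-12) return 0.0;  // flat profile: no refinement
    return 0.5 * (left - right) / curvature;
}

// n[row][col] holds the 3x3 correlation values around the integer maximum n[1][1].
SubpixelOffset refinePeak(const std::array<std::array<double, 3>, 3>& n) {
    return { parabolaVertex(n[1][0], n[1][1], n[1][2]),    // along x (columns)
             parabolaVertex(n[0][1], n[1][1], n[2][1]) };  // along y (rows)
}
```

The subpixel shift would then be `maxLoc.x + dx` and `maxLoc.y + dy`. Because only nine values are needed per image, this part is cheap on the CPU; the cost to watch is the synchronisation needed to get maxLoc back from the GPU.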

All this is computed ~65000 times and takes about 8 minutes (256x256 base 16-bit B&W images). The CUDA card is not even heating up (nvidia-smi shows 6% GPU-Util).

Any suggestions on how to parallelize this (the faster the better)?

(also thanks to L.Berger who got me this far)



  1. Have you tried it without CUDA, to have a reference?

  2. You think CUDA is faster than the CPU for FFT, but that's not always true (see this post for TAPI). Your image is 256x256; it is too small to get a large improvement relative to the CPU. Transfer time between host memory and the GPU is not free.

LBerger ( 2015-09-10 00:51:20 -0600 )

I will try. My analysis has 5x dft + 2x mulSpectrums + 1x magnitude per image, each 512x512 (zero-padded because of features close to the image border), and these are done without any CPU code in between, so I thought CUDA might be better here. It may be faster on the CPU, but not as parallel (at least I think).

anatrong0 ( 2015-09-10 07:31:31 -0600 )

I have read this document about CUDA. Have you tried something like p. 24-27?

LBerger ( 2015-09-10 09:40:21 -0600 )

Tried the linear case with a single Stream, but that is roughly the same speed as without:

with stream: 7.4745 ms w/o: 7.57892 ms

(averaged over 1000 readouts)

for (int i = 0; i < 1000; i++) {
    g_disk.upload(readDisk(fs, tmp0, disk), myStream);
    cudaDFT2D(g_disk, G_disk, G_gauss_dx, G_gauss_dy, g_disk_dx, g_disk_dy, G_magnitude, myStream);
}

This was compared with the same functions without myStream. I also moved the code into a separate function, so I need to check whether this influences the speed.
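
A note on why one stream changes little: operations queued on a single stream run in order, so the upload and the DFT chain cannot overlap each other, and the disk read on the host is still serial. One possible way to get overlap is to read the next image while the current one is being processed. The sketch below shows that idea in plain C++ with `std::async`. It is an assumption-laden illustration, not the thread's code: `readFrame` and `processFrame` are hypothetical stand-ins for `readDisk` and for the upload plus `cudaDFT2D` step. Whether it helps depends on how the time splits between reading and GPU work, which has not been measured here.

```cpp
#include <cstdint>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical stand-in for readDisk(): produces one 256x256 16-bit frame.
std::vector<std::uint16_t> readFrame(int index) {
    return std::vector<std::uint16_t>(256 * 256, static_cast<std::uint16_t>(index));
}

// Hypothetical stand-in for the GPU work (upload + DFT chain); here it sums pixels.
std::uint64_t processFrame(const std::vector<std::uint16_t>& frame) {
    return std::accumulate(frame.begin(), frame.end(), std::uint64_t{0});
}

// Process frames [0, count) while the next frame is read on another thread.
std::vector<std::uint64_t> runPipelined(int count) {
    std::vector<std::uint64_t> results;
    if (count <= 0) return results;
    results.reserve(count);
    auto pending = std::async(std::launch::async, readFrame, 0);
    for (int i = 0; i < count; ++i) {
        std::vector<std::uint16_t> current = pending.get();  // wait for frame i
        if (i + 1 < count)                                   // start reading frame i+1
            pending = std::async(std::launch::async, readFrame, i + 1);
        results.push_back(processFrame(current));            // overlaps with the read
    }
    return results;
}
```

The same double-buffering pattern could be applied on the GPU side with two streams and two sets of GpuMat buffers, so that the upload of image i+1 is queued while image i is still being transformed.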

anatrong0 ( 2015-09-10 09:57:15 -0600 )

6.64 ms if I use the code inside int main and not a separate function

anatrong0 ( 2015-09-10 10:05:20 -0600 )