strategy to build asynchronous subpixel registration analysis

asked 2015-09-09 13:13:44 -0600

anatrong0

updated 2015-09-09 13:27:53 -0600

Hi, I am analysing a set of images for subpixel image shifts. I have code which essentially loops through:

loop(){

  • read binary image, send it to GpuMat/cuda

//next 2 points are based on dft, mulSpectrums, magnitude (all cuda "Streamable")

  • convolve with smoothing/gradient kernels (cuda)
  • cross-correlate (phase-correlate) with base image (cuda)

// next are locating maximum of correlation pattern with subpixel precision

  • find maxLoc (cuda, but value sent to Point.x/Point.y on CPU)
  • copy maxLoc 3x3 neighbours into Mat (CPU)
  • subpixel registration by quadratic fit of 3x3 maxima neighbours (CPU)
  • resulting (x,y) pixel shifts are placed in shift maps (CPU) }
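
For illustration, the CPU-side quadratic fit could look something like the sketch below. It assumes a separable fit (one 1D parabola through the row and one through the column that cross the maximum), which may differ from the full 2D fit used in the actual code. The names `Patch3x3`, `SubpixelPeak`, `parabolaOffset` and `refinePeak` are hypothetical and use plain C++ rather than OpenCV types.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical container for the 3x3 neighbourhood, indexed n[row][col];
// n[1][1] is the integer maximum found by the maxLoc step.
using Patch3x3 = std::array<std::array<double, 3>, 3>;

// Refined peak position in pixel coordinates.
struct SubpixelPeak {
    double x;
    double y;
};

// Vertex offset of the parabola through (-1, left), (0, centre), (+1, right).
// Returns 0 when the samples have no curvature (nothing to fit).
inline double parabolaOffset(double left, double centre, double right) {
    const double denom = left - 2.0 * centre + right;
    if (std::fabs(denom) < 1e-12) return 0.0;
    return 0.5 * (left - right) / denom;
}

// Assumption: separable fit, x from the middle row, y from the middle column.
inline SubpixelPeak refinePeak(const Patch3x3& n, int peakX, int peakY) {
    // The centre must really be the maximum along both axes.
    assert(n[1][1] >= n[1][0] && n[1][1] >= n[1][2]);
    assert(n[1][1] >= n[0][1] && n[1][1] >= n[2][1]);
    SubpixelPeak p;
    p.x = peakX + parabolaOffset(n[1][0], n[1][1], n[1][2]);
    p.y = peakY + parabolaOffset(n[0][1], n[1][1], n[2][1]);
    return p;
}
```

With a true maximum at the centre, each offset stays within half a pixel of the integer location, so the result can be written straight into the shift maps.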

All this is computed ~65000 times and takes about 8 minutes in total (256x256 base images, 16-bit B&W). The CUDA card is not even heating up (nvidia-smi shows 6% GPU-Util).
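
As a rough per-image figure implied by those totals (assuming the 8 minutes cover all ~65000 iterations and nothing else):

```latex
t_{\text{image}} \approx \frac{8 \times 60\ \text{s}}{65000} \approx 7.4\ \text{ms}
```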

Any suggestions on how to parallelize this (the faster the better)?

(also thanks to L.Berger who got me this far)


Comments

  1. Have you tried it without CUDA, to get a reference timing?

  2. You think CUDA is faster than the CPU for FFT, but that is not always true (see this post for TAPI). Your image is 256x256, which is too small to give a large improvement over the CPU. Transfer time between host memory and the GPU is not free.

LBerger ( 2015-09-10 00:51:20 -0600 )

I will try. My analysis has 5x dft + 2x mulSpectrums + 1x magnitude per image at 512x512 (zero-padded because of features close to the image border), and these run without any CPU code in between, so I thought CUDA might be better here. It may be faster on the CPU but not as parallel (at least I think).

anatrong0 ( 2015-09-10 07:31:31 -0600 )

I have read this document about CUDA. Have you tried something like pp. 24-27?

LBerger ( 2015-09-10 09:40:21 -0600 )

I tried the linear case with a single Stream, but that is roughly the same speed as without one:

with stream: 7.4745 ms, w/o: 7.57892 ms

(averaged over 1000 readouts)

// Time 1000 iterations of upload -> DFT filtering -> download on one stream.
for (int i = 0; i < 1000; i++) {
    g_disk.upload(readDisk(fs, tmp0, disk), myStream);
    cudaDFT2D(g_disk, G_disk, G_gauss_dx, G_gauss_dy, g_disk_dx, g_disk_dy, G_magnitude, myStream);
    g_disk_dx.download(disk, myStream);
}

This was compared against the same functions without myStream. I also moved the code into a separate function, so I need to check whether that affects the speed.
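
One way to reason about the near-identical timings: operations queued on a single CUDA stream still execute one after another, so a lone stream cannot overlap upload, compute and download with each other. The sketch below is only a back-of-the-envelope model, not a measurement. The stage durations are hypothetical placeholders, and the pipelined figure is an idealised best case that assumes several streams, page-locked host buffers and hardware able to copy in both directions while kernels run.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-image stage durations in milliseconds (not measured).
struct StageTimes {
    double upload;
    double compute;
    double download;
};

// Single stream: every stage of every image runs back to back.
inline double serialTimeMs(const StageTimes& t, int images) {
    assert(images >= 0);
    return images * (t.upload + t.compute + t.download);
}

// Idealised overlap: after the first image fills the pipeline, one image
// completes every max(stage) milliseconds. This is a best-case bound.
inline double pipelinedTimeMs(const StageTimes& t, int images) {
    assert(images >= 0);
    if (images == 0) return 0.0;
    const double slowest = std::max({t.upload, t.compute, t.download});
    return (t.upload + t.compute + t.download) + (images - 1) * slowest;
}
```

For example, with made-up stage times of 1 ms upload, 5 ms compute and 1 ms download, 1000 images take 7000 ms serially and about 5002 ms in the ideal pipelined case, so the gain is capped by the slowest stage.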

anatrong0 ( 2015-09-10 09:57:15 -0600 )

6.64 ms if I put the code directly inside int main rather than in a separate function.

anatrong0 ( 2015-09-10 10:05:20 -0600 )