ORB_GPU not as good as ORB(CPU)

Hi all,

I have some code working decently with the CPU implementation of ORB (from modules/features2d); now I am experimenting with ORB_GPU, hoping to have a two-way implementation: users with CUDA get faster performance, and users without it still get good quality.

The problem is that the keypoints/descriptors returned by ORB_GPU are not yielding a sufficient number of correct matches on exactly the same data where the CPU version succeeds. I understand that synchronization issues etc. may cause the GPU results to differ from the CPU results, but I would hope that the quality of the results would be comparable. Any tips?

In particular, is there any way I can wrangle the ORB_GPU interface to compute descriptors for keypoints found by the CPU ORB? Maybe then I could isolate the problem to either keypoint extraction or descriptor computation.

Dirty details: I just downloaded CUDA 5.0 and built OpenCV 2.4.4 myself with VS2008 (previously I was using the prebuilt OpenCV 2.4.4, but I guess that was built with HAVE_CUDA=0). For my development testing I have obtained a rather old, low-end card (Quadro FX 4800, compute capability 1.3). I am using ORB/ORB_GPU only for keypoints/descriptors; I have written my own (CPU-based) matching code which runs after the kp/dsc are extracted. I am using the same computer/compiler/data/etc. for testing each way, just recompiling with/without the GPU code path. Here's a snippet:

   vector<cv::KeyPoint> fkps, rkps;
   cv::Mat fdescs, rdescs;

   int ngpus = cv::gpu::getCudaEnabledDeviceCount();
   if (ngpus > 0) { // compile this way for GPU
   //if (0) {       // compile this way to force CPU
     cv::gpu::GpuMat gpumat(fmat);   // upload first image to the GPU
     cv::gpu::ORB_GPU orb(1000);     // ask for up to 1000 keypoints, same as the CPU path
     cv::gpu::GpuMat gpudsc;
     cv::gpu::GpuMat fullmask(gpumat.size(), CV_8U, 0xFF);  // all-255 mask

     orb(gpumat, fullmask, fkps, gpudsc);  // detect keypoints and compute descriptors, first image

     gpudsc.download(fdescs);        // copy descriptors back to the CPU
     orb.release();
     gpudsc.release();
     gpumat.release();
     fullmask.release();
     gpumat.upload(rmat);            // reuse the buffers for the second image
     fullmask.create(gpumat.size(), CV_8U);
     fullmask.setTo(0xFF);

     orb(gpumat, fullmask, rkps, gpudsc);  // detect and describe, second image

     gpudsc.download(rdescs);
     gpudsc.release();
     gpumat.release();
     fullmask.release();
   } else {
     cv::ORB orb(1000); // DEF 500 features
     orb(fmat, cv::Mat(), fkps, fdescs);
     orb(rmat, cv::Mat(), rkps, rdescs);
   }
   // now go through fdescs/rdescs and find matches
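
(For completeness, here is a minimal sanity check along these lines that can confirm the build actually targets this card; cv::gpu::DeviceInfo is from the 2.4 gpu module, and the exact printout is just illustrative.)

    #include <iostream>
    #include <opencv2/gpu/gpu.hpp>

    // Print what card device 0 is and whether this OpenCV build contains
    // GPU code (cubin/PTX) that the card can actually run.
    void checkGpuDevice()
    {
        cv::gpu::DeviceInfo dev(0);
        std::cout << "Device: " << dev.name()
                  << " (compute " << dev.majorVersion() << "." << dev.minorVersion() << ")\n"
                  << "Compatible with this OpenCV build: "
                  << (dev.isCompatible() ? "yes" : "no") << std::endl;
    }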

I am getting some results, so I must have compiled/linked/etc. OK, but the results are significantly worse with the GPU. In particular, with my baseline "easy" unit-test case my matcher detects 34 matches (out of the 1000 kp/dsc per image), while the GPU path yields 5-9 (not deterministic, which is OK if I can get it to be reliably good). After this matching I run RANSAC to find a maximal subset that fits well to a homography; the CPU kp/dsc wind up with 20 correct matches, but the GPU path never yields better than 4 (i.e. a trivial homography fit, and not all of them correct matches).
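
(My matcher is home-grown, but for reference the stock equivalent of this matching + RANSAC step would look roughly like the sketch below: BFMatcher with Hamming distance plus cv::findHomography are standard OpenCV 2.4 calls; the function name and the 3-pixel reprojection threshold are just illustrative.)

    #include <vector>
    #include <opencv2/features2d/features2d.hpp>
    #include <opencv2/calib3d/calib3d.hpp>

    // Brute-force Hamming matching on the binary ORB descriptors, then a RANSAC
    // homography fit; returns the number of geometrically consistent matches.
    int countInlierMatches(const std::vector<cv::KeyPoint>& fkps, const cv::Mat& fdescs,
                           const std::vector<cv::KeyPoint>& rkps, const cv::Mat& rdescs)
    {
        cv::BFMatcher matcher(cv::NORM_HAMMING);
        std::vector<cv::DMatch> matches;
        matcher.match(fdescs, rdescs, matches);

        std::vector<cv::Point2f> fpts, rpts;
        for (size_t i = 0; i < matches.size(); ++i) {
            fpts.push_back(fkps[matches[i].queryIdx].pt);
            rpts.push_back(rkps[matches[i].trainIdx].pt);
        }
        if (fpts.size() < 4)
            return 0;  // need at least 4 correspondences for a homography

        std::vector<uchar> inlierMask;
        cv::findHomography(fpts, rpts, CV_RANSAC, 3.0, inlierMask);
        return cv::countNonZero(inlierMask);
    }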

Any feedback would be appreciated!

UPDATE: I noticed that the CPU implementation offers an optional useProvidedKeypoints parameter (default false), so I modified my code to ignore the ORB_GPU descriptors and let ORB (CPU) compute descriptors for the GPU-generated keypoints, and the results are much better (29 matches; by eyeball maybe 15 of them look correct, but according to RANSAC only 8 are correct). So it seems the difference is mainly in the GPU descriptor computation. The GPU keypoint locations also seem to be a little less accurate, causing RANSAC to throw out more of them?
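
(Concretely, the change amounts to something like the following, continuing from the GPU branch of the snippet above with fmat and the GPU-detected fkps; the variable names are just illustrative.)

    // Keep the keypoints found by ORB_GPU (fkps), but recompute their
    // descriptors on the CPU by passing useProvidedKeypoints=true.
    cv::ORB cpuOrb(1000);
    cv::Mat fdescsCpu;
    cpuOrb(fmat, cv::Mat(), fkps, fdescsCpu, true /*useProvidedKeypoints*/);
    // Note: I believe this call may filter the provided keypoints (e.g. ones too
    // close to the image border), so fkps can come back with fewer entries than went in.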

UPDATE2: I updated my code to hold onto BOTH the GPU- and CPU-computed descriptors and compare them. I am finding that, for the same keypoints, the CPU- and GPU-computed 32x8-bit descriptors differ by min/avg/max 23/59.6/115 bits. That seems like an awfully big difference for 256-bit vectors. Could I somehow be misparameterizing ORB_GPU? As far as I can see, the default parameterizations of ORB and ORB_GPU are the same; all I am setting is the desired number of keypoints (1000) for both.
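
(For reference, the per-keypoint comparison is essentially the helper below; my real code also accumulates the min/avg/max over all keypoints, and the name is just illustrative.)

    #include <opencv2/core/core.hpp>

    // Count how many of the 256 descriptor bits differ between the CPU- and
    // GPU-computed descriptors for one keypoint. Assumes both Mats are CV_8U
    // with 32 columns and that row i refers to the same keypoint in both.
    int hammingBits(const cv::Mat& cpuDesc, const cv::Mat& gpuDesc, int row)
    {
        int bits = 0;
        for (int j = 0; j < cpuDesc.cols; ++j) {
            unsigned char x = cpuDesc.at<unsigned char>(row, j) ^ gpuDesc.at<unsigned char>(row, j);
            while (x) { bits += x & 1; x >>= 1; }  // popcount of the XOR
        }
        return bits;
    }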

I see there are performance tests that exercise ORB and ORB_GPU on the same pixel data, but the results are discarded. Wouldn't it be a good idea to also have a unit test that verifies ORB/ORB_GPU generate sufficiently similar keypoints and compute sufficiently close descriptor vectors for the same keypoints?
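
(Rough idea of what such a test might assert, reusing the bit-difference helper above and the trick of recomputing CPU descriptors at the GPU keypoints so the rows line up; the 16-bit tolerance is just a placeholder, not an established value.)

    // cpuDescsAtGpuKps / gpuDescs: descriptors for the same keypoints, rows aligned.
    const int maxBitDiff = 16;  // placeholder tolerance
    for (int i = 0; i < gpuDescs.rows; ++i)
        CV_Assert(hammingBits(cpuDescsAtGpuKps, gpuDescs, i) <= maxBitDiff);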