Why does my dense descriptor have bad performance?
I'm trying to use dense SURF and SIFT descriptors to improve the precision of my VLAD code. I'm testing my approach on the Oxford Buildings dataset.
It has been shown that dense descriptors can improve this kind of application.
Since the total number of descriptors is huge (we are talking about tens of millions of descriptors), in order not to run out of memory (or have k-means take ages), we need to randomly sample dense descriptors from each image. The author of the linked paper above suggested 50 descriptors per image (which on Oxford roughly means 253050 descriptors for k-means training).
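For completeness, the per-image sampling step is nothing fancy; here is a minimal sketch of it (SampleRows is just an illustrative helper name, not my exact code):

#include <opencv2/core.hpp>
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Keep only n randomly chosen descriptor rows (e.g. n = 50 per image).
cv::Mat1f SampleRows(const cv::Mat1f &descriptors, int n, std::mt19937 &rng){
    if (descriptors.rows <= n)
        return descriptors.clone();
    std::vector<int> idx(descriptors.rows);
    std::iota(idx.begin(), idx.end(), 0);      // 0, 1, ..., rows-1
    std::shuffle(idx.begin(), idx.end(), rng); // random permutation of row indices
    cv::Mat1f sampled(n, descriptors.cols);
    for (int r = 0; r < n; ++r)
        descriptors.row(idx[r]).copyTo(sampled.row(r));
    return sampled;
}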
This is my code for extracting dense SURF descriptors:
void DSURFOpenCV::ComputeDescriptors(const cv::Mat &img, cv::Mat1f &descriptors){
    descriptors.release();
    int startSize = step < 8 ? 8 : step;   // smallest keypoint size (at least 8 px)
    std::vector<cv::KeyPoint> kps;
    // 5 scales: startSize, 2*startSize, ..., 5*startSize
    for (int z = startSize; z <= startSize * 5; z += startSize)
        // regular grid with stride `step`, skipping a border of `step` pixels
        for (int i = step; i < img.rows - step; i += step)
            for (int j = step; j < img.cols - step; j += step)
                kps.push_back(cv::KeyPoint(float(j), float(i), float(z)));
    surf->compute(img, kps, descriptors);
}
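For context, I call it roughly like this (DSURFOpenCV is my own wrapper, so the constructor below is hypothetical; assume it stores the grid stride in step and a cv::xfeatures2d::SURF instance in surf):

#include <opencv2/imgcodecs.hpp>
#include <opencv2/xfeatures2d.hpp>

cv::Mat img = cv::imread("some_oxford_image.jpg", cv::IMREAD_GRAYSCALE);
DSURFOpenCV dsurf(8);   // hypothetical ctor: step = 8,
                        // surf = cv::xfeatures2d::SURF::create()
cv::Mat1f descriptors;
dsurf.ComputeDescriptors(img, descriptors);
// one 64-dimensional row per grid point and scale (128 with extended SURF)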
Here step is the number of pixels between keypoints (at least 8 px), and I extract keypoints at 5 different scales, starting from the initial value of step (i.e. startSize) up to 5 times its original value, so with step=8 the keypoint sizes are 8, 16, 24, 32 and 40. These are the same values as in the linked paper above, except that I use 5 scales (Section 5, "Implementation details"):
We extract SIFT [29] descriptors at 4 scales corresponding to region widths of 16, 24, 32 and 40 pixels. The descriptors are extracted on a regular densely sampled grid with a stride of 2 pixels.
I do something similar for dense SIFT with VLFeat:
void DSIFTVLFeat::ComputeDescriptors(const cv::Mat &img, cv::Mat1f &descriptors){
    descriptors.release();
    // convert the image to a continuous float matrix in [0,1], as VLFeat expects
    cv::Mat imgFloat;
    img.convertTo(imgFloat, CV_32F, 1.0/255.0);
    if (!imgFloat.isContinuous())
        throw std::runtime_error("imgFloat is not continuous");
    // one dense pass per bin size (i.e. per scale)
    for (int i = binSize; i <= maxBinSize; i += 2){
        // note: vl_dsift_new_basic takes (imWidth, imHeight, step, binSize),
        // hence img.cols first
        VlDsiftFilter *dsift = vl_dsift_new_basic(img.cols, img.rows, step, i);
        vl_dsift_process(dsift, imgFloat.ptr<float>());
        // wrap VLFeat's buffer (no copy), then append a copy to `descriptors`
        cv::Mat scaleDescs(vl_dsift_get_keypoint_num(dsift), 128, CV_32F,
                           (void*) vl_dsift_get_descriptors(dsift));
        descriptors.push_back(scaleDescs);
        vl_dsift_delete(dsift);
    }
}
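The sampled descriptors from all images are then stacked into one matrix and clustered to build the VLAD vocabulary; a sketch of that step with cv::kmeans (the value of K and the termination criteria below are placeholders, not my actual settings):

#include <opencv2/core.hpp>

cv::Mat1f training;   // ~50 sampled rows per image, stacked over the dataset
// ... per image: ComputeDescriptors(img, d); training.push_back(SampleRows(d, 50, rng)); ...
cv::Mat labels, centers;
cv::kmeans(training, /*K=*/256, labels,
           cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT, 100, 1e-4),
           /*attempts=*/3, cv::KMEANS_PP_CENTERS, centers);
// centers (K x 128 for SIFT, K x 64 for SURF) is the VLAD vocabulary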
However, all of these methods decreased the Mean Average Precision (the performance metric for this dataset) on Oxford. Why does this happen, and how could I improve it? Any C++ implementation would be useful.
"at 4 scales corresponding to region widths of 16, 24, 32 and 40 pixels" -- wouldn't that require scaling the image ? all you do now, is increase the distance between keypoints, in the end, you have a somewhat "irregular" keypoint grid, but all on the same scale.
@berak mmmmh, by "increase the distance between keypoints" do you mean the step value, i.e. the number of pixels between one keypoint and another? Because that's not the case. In DSURFOpenCV, step (the number of pixels between keypoints) is constant. I thought that in order to implement the "different scales" part I should change the keypoint diameter, which is the third argument of the KeyPoint constructor (represented by z here). Isn't that correct?