Why does my dense descriptor have bad performance?

asked 2017-04-11 10:57:05 -0600

lovaj

updated 2017-04-11 11:00:39 -0600

I'm trying to use dense SURF and SIFT descriptors to improve the precision of my VLAD code. I'm testing my approach on the Oxford Buildings dataset.

It has been shown that dense descriptors can improve this kind of application.

Since the total number of descriptors is huge (we are talking about tens of millions of descriptors), in order not to run out of memory (or have k-means take ages), we need to randomly sample dense descriptors from each image. The author of the linked paper above suggested 50 descriptors per image to me (which on Oxford roughly means 253,050 descriptors for k-means training).
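For reference, here is a minimal sketch of how one might pick a fixed number of random descriptor rows per image before k-means training; it is not part of my pipeline, and names like SampleDescriptors, perImage and rng are made up:

#include <opencv2/core.hpp>
#include <algorithm>
#include <numeric>
#include <random>

cv::Mat1f SampleDescriptors(const cv::Mat1f &descriptors, int perImage, std::mt19937 &rng){
    // random permutation of the row indices
    std::vector<int> idx(descriptors.rows);
    std::iota(idx.begin(), idx.end(), 0);
    std::shuffle(idx.begin(), idx.end(), rng);
    // keep at most perImage rows (e.g. 50)
    cv::Mat1f sampled;
    int n = std::min(perImage, descriptors.rows);
    for(int k=0; k<n; ++k)
        sampled.push_back(descriptors.row(idx[k]));
    return sampled;
}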

This is my code for extracting dense SURF descriptors:

void DSURFOpenCV::ComputeDescriptors(const cv::Mat &img, cv::Mat1f &descriptors){
    descriptors.release();
    // keypoint size starts at the grid step, but never below 8 px
    int startSize = step < 8 ? 8 : step;
    std::vector<cv::KeyPoint> kps;
    // 5 "scales": keypoint diameters startSize, 2*startSize, ..., 5*startSize
    for(int z = startSize; z <= startSize*5; z += startSize)
        // regular grid with a stride of `step` pixels, keeping a border of `step`
        for (int i = step; i < img.rows - step; i += step)      // rows -> y
            for (int j = step; j < img.cols - step; j += step)  // cols -> x
                kps.push_back(cv::KeyPoint(float(j), float(i), float(z)));
    surf->compute(img, kps, descriptors);
}
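For illustration, a hypothetical call site (the DSURFOpenCV constructor shown here is assumed to take the grid step; adjust to the real class):

#include <opencv2/imgcodecs.hpp>
#include <opencv2/xfeatures2d.hpp>

cv::Mat img = cv::imread("some_image.jpg", cv::IMREAD_GRAYSCALE);
DSURFOpenCV dsurf(8);            // step = 8 px, as in the post
cv::Mat1f descriptors;
dsurf.ComputeDescriptors(img, descriptors);
// each row of descriptors is one 64-dimensional SURF descriptor
// (128-dimensional if SURF's extended flag is set)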

Here step is the number of pixels between keypoints (at least 8 px), and I extract keypoints at 5 different scales, starting from the initial value of step (i.e. startSize) up to 5 times its original value. With step=8 we have the same values as the linked paper above, except that I use 5 scales (Section 5, "Implementation details"):

We extract SIFT [29] descriptors at 4 scales corresponding to region widths of 16, 24, 32 and 40 pixels. The descriptors are extracted on a regular densely sampled grid with a stride of 2 pixels.
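For comparison, a grid that matches the quoted parameters literally (region widths 16, 24, 32 and 40 px, stride 2 px) would look like this sketch; PaperGrid is just an illustrative name:

#include <opencv2/core.hpp>
#include <vector>

std::vector<cv::KeyPoint> PaperGrid(const cv::Mat &img){
    std::vector<cv::KeyPoint> kps;
    for(int size : {16, 24, 32, 40})                           // 4 region widths
        for(int i = size/2; i < img.rows - size/2; i += 2)     // stride 2 in y
            for(int j = size/2; j < img.cols - size/2; j += 2) // stride 2 in x
                kps.push_back(cv::KeyPoint(float(j), float(i), float(size)));
    return kps;
}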

I do something similar for dense SIFT using VLFeat:

void DSIFTVLFeat::ComputeDescriptors(const cv::Mat &img, cv::Mat1f &descriptors){
    descriptors.release();

    // convert the cv::Mat image to a contiguous float buffer for VLFeat
    cv::Mat imgFloat;
    img.convertTo(imgFloat, CV_32F, 1.0/255.0);
    if(!imgFloat.isContinuous())
        throw std::runtime_error("imgFloat is not continuous");

    // one dense-SIFT pass per bin size (binSize, binSize+2, ..., maxBinSize)
    for(int i=binSize; i<=maxBinSize; i+=2){
        // vl_dsift_new_basic expects (width, height), i.e. (cols, rows)
        VlDsiftFilter *dsift = vl_dsift_new_basic(img.cols, img.rows, step, i);
        vl_dsift_process(dsift, imgFloat.ptr<float>());
        // wrap VLFeat's internal buffer; push_back copies the data,
        // so it is safe to delete the filter afterwards
        cv::Mat scaleDescs(vl_dsift_get_keypoint_num(dsift), 128, CV_32F, (void*) vl_dsift_get_descriptors(dsift));
        descriptors.push_back(scaleDescs);
        vl_dsift_delete(dsift);
    }
}

However, all these methods decreased the Mean Average Precision on Oxford (the performance metric for this dataset). Why does this happen, and how could I improve it? Any C++ implementation would be useful.


Comments

"at 4 scales corresponding to region widths of 16, 24, 32 and 40 pixels" -- wouldn't that require scaling the image ? all you do now, is increase the distance between keypoints, in the end, you have a somewhat "irregular" keypoint grid, but all on the same scale.

berak ( 2017-04-12 02:45:35 -0600 )
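To make the suggestion above concrete, here is a minimal sketch of dense extraction over a true image pyramid, so that descriptors really come from different scales; DenseSURFPyramid, levels and scaleFactor are made-up names, not anything from the question:

#include <opencv2/imgproc.hpp>
#include <opencv2/xfeatures2d.hpp>
#include <vector>

void DenseSURFPyramid(const cv::Mat &img, int step, int levels, double scaleFactor, cv::Mat1f &descriptors){
    cv::Ptr<cv::xfeatures2d::SURF> surf = cv::xfeatures2d::SURF::create();
    cv::Mat level = img.clone();
    descriptors.release();
    for(int l = 0; l < levels; ++l){
        // same regular grid at every level, but on a genuinely rescaled image
        std::vector<cv::KeyPoint> kps;
        for(int i = step; i < level.rows - step; i += step)
            for(int j = step; j < level.cols - step; j += step)
                kps.push_back(cv::KeyPoint(float(j), float(i), float(step)));
        cv::Mat1f levelDescs;
        surf->compute(level, kps, levelDescs);
        descriptors.push_back(levelDescs);
        // downscale for the next pyramid level
        cv::resize(level, level, cv::Size(), 1.0/scaleFactor, 1.0/scaleFactor);
    }
}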

@berak mmmh, by "increase the distance between keypoints" do you mean the step value, i.e. the number of pixels between one keypoint and another? Because that's not the case: in DSURFOpenCV, step (the number of pixels between keypoints) is constant. I thought that in order to implement the "different scales" I should change the keypoint diameter, which is the third argument of the KeyPoint constructor (represented by z here). Isn't that correct?

lovaj ( 2017-04-12 03:57:30 -0600 )