Revision history [back]

Image Retrieval through fisher vectors: why my implementation works SO BAD?

I'm trying to implement a Content Based Image Retrieval system for small image-dataset. By now, I'm using 1k images (40 categories) from Caltech101.

This is the system workflow:

For each image img compute the SIFT descriptors (using cv::SIFT, default paramaters) and save the descriptor matrix in std::vector<cv::Mat> descriptors
Compute a Gaussian Mixture Model using VLfeat implementation VlGMM and the previously computed descriptors, using the k-means as base algorithm (again, using VLFeat implementation).
For each img, compute the correspondent fisher vector using GMM obtained before, one for each dataset image.
Given the query q, compute SIFT descriptors and fisher vectors (using the same GMM of before).
Compute the Euclidean distance between q's fisher vector and each img fisher vector from the dataset.
Return the top k images, according to the distances obtained from 5.

This is the code from point 2 to point 3 and 5, which are the most important ones:

vl_size totalElements = totalKeypoints * dimension;
float *data = new float[totalElements];
size_t counter = 0;
//save into data all descriptors matrices (tested, it works)
for(size_t i=0; i<descriptors.size(); i++){
    std::memcpy(data+counter,descriptors[i].data,descriptors[i].total() * sizeof(float));
    counter += descriptors[i].total();
}
VlKMeans * kmeans = vl_kmeans_new (VL_TYPE_FLOAT, VlDistanceL2) ;
vl_kmeans_set_algorithm (kmeans, VlKMeansElkan) ;
vl_kmeans_set_initialization(kmeans, VlKMeansPlusPlus);
vl_kmeans_set_min_energy_variation(kmeans,0.0001);
vl_kmeans_set_num_repetitions(kmeans,3);
VlGMM* gmm = vl_gmm_new(VL_TYPE_FLOAT, dimension, k) ;
vl_gmm_set_initialization (gmm,VlGMMKMeans);
vl_gmm_set_kmeans_init_object(gmm, kmeans);
vl_gmm_cluster (gmm, data, totalKeypoints);
delete []data;
int encodingSize = 2 * dimension * k;
std::vector<cv::Mat> codes(n); //n is the dataset size
for(size_t i=0;i<descriptors.size();i++){
    float *enc = (float*)vl_malloc(sizeof(float) * encodingSize);
    vl_fisher_encode(enc, VL_TYPE_FLOAT,
            vl_gmm_get_means(gmm), dimension, k,
            vl_gmm_get_covariances(gmm),
            vl_gmm_get_priors(gmm),
            descriptors[i].data,1,
            VL_FISHER_FLAG_IMPROVED
            );
    codes[i] = cv::Mat(1, encodingSize, CV_32FC1, enc);
}
...
//here we compute query's fisher code in the same way
...
for(int i=0;i<n;i++)
    distances[i] = norm(queryCode,codes[i],cv::NORM_L2);//from OpenCV

I hope that this can be considered as MCV code. If you want to see some other section, please let me know it.

The performance are simply HORRIBLE: even if we use an image from the dataset itself, the most similar image is the image itself (but even in that case the distance is 0.0201524!), while all the others images are totally uncorellated!

I'm a bit frustrated, I don't know at all how could I improve this. Fisher vector is already an advanced solution, why does it work SO bad?

Please, notice that I would avoid to use even more complicate solutions (like deep neural networks), I'm not a machine learning guys and image processing/ML are already quite new fields for me. And most of all good results have already been obtained with the approach above!

Where am I doing this wrong?