# Image Retrieval through fisher vectors: why my implementation works SO BAD?

I'm trying to implement a Content Based Image Retrieval system for small image-dataset. By now, I'm using 1k images (40 categories) from Caltech101.

This is the system workflow:

1. For each image img compute the SIFT descriptors (using cv::SIFT, default paramaters) and save the descriptor matrix in std::vector<cv::Mat> descriptors
2. Compute a Gaussian Mixture Model using VLfeat implementation VlGMM and the previously computed descriptors, using the k-means as base algorithm (again, using VLFeat implementation).
3. For each img, compute the correspondent fisher vector using GMM obtained before, one for each dataset image.
4. Given the query q, compute SIFT descriptors and fisher vectors (using the same GMM of before).
5. Compute the Euclidean distance between q's fisher vector and each img fisher vector from the dataset.
6. Return the top k images, according to the distances obtained from 5.

This is the code from point 2 to point 3 and 5, which are the most important ones:

vl_size totalElements = totalKeypoints * dimension;
float *data = new float[totalElements];
size_t counter = 0;
//save into data all descriptors matrices (tested, it works)
for(size_t i=0; i<descriptors.size(); i++){
std::memcpy(data+counter,descriptors[i].data,descriptors[i].total() * sizeof(float));
counter += descriptors[i].total();
}
VlKMeans * kmeans = vl_kmeans_new (VL_TYPE_FLOAT, VlDistanceL2) ;
vl_kmeans_set_algorithm (kmeans, VlKMeansElkan) ;
vl_kmeans_set_initialization(kmeans, VlKMeansPlusPlus);
vl_kmeans_set_min_energy_variation(kmeans,0.0001);
vl_kmeans_set_num_repetitions(kmeans,3);
VlGMM* gmm = vl_gmm_new(VL_TYPE_FLOAT, dimension, k) ; //k=256, dimension= 128
vl_gmm_set_initialization (gmm,VlGMMKMeans);
vl_gmm_set_kmeans_init_object(gmm, kmeans);
vl_gmm_cluster (gmm, data, totalKeypoints);
delete []data;
int encodingSize = 2 * dimension * k;
std::vector<cv::Mat> codes(n); //n is the dataset size
for(size_t i=0;i<descriptors.size();i++){
float *enc = (float*)vl_malloc(sizeof(float) * encodingSize);
vl_fisher_encode(enc, VL_TYPE_FLOAT,
vl_gmm_get_means(gmm), dimension, k,
vl_gmm_get_covariances(gmm),
vl_gmm_get_priors(gmm),
descriptors[i].data,1,
VL_FISHER_FLAG_IMPROVED
);
codes[i] = cv::Mat(1, encodingSize, CV_32FC1, enc);
}
...
//here we compute query's fisher code in the same way
...
for(int i=0;i<n;i++)
distances[i] = norm(queryCode,codes[i],cv::NORM_L2);//from OpenCV


I hope that this can be considered as MCV code. If you want to see some other section, please let me know it.

The performance are simply HORRIBLE: even if we use an image from the dataset itself, the most similar image is the image itself (but even in that case the distance is 0.0201524!), while all the others images are totally uncorellated!

I don't know if this is normal, but it take 263s for gmm clustering and 0.91s for creating 1k fisher vectors.

I'm a bit frustrated, I don't know at all how could I improve this. Fisher vector is already an advanced solution, why does it work SO bad?

Trying also the cosine distances with the following code:

static double cosdistance(const cv::Mat &testFeature, const cv::Mat &trainFeature)
{
double a = trainFeature.dot(testFeature);
double b = trainFeature.dot(trainFeature);
double c = testFeature.dot(testFeature);
return -a / sqrt(b*c);
}
...
for(int i=0;i<n;i++)
distances[i] = cosdistance(queryCode,codes[i]);


Didn't give better results ...

edit retag close merge delete

Sort by » oldest newest most voted

Currently you are encoding only 1 descriptor, change descriptors[i].data,1, to descriptors[i].data,descriptors[i].rows.

more

But this should be ok, since producing the encoding the fisher vectors one by one through the for that you can see. Processing one fisher vector per time should be the same of processing them all together.

( 2016-07-21 21:04:38 -0500 )edit

? No, it's totally different. For each image you want one global descriptor from all local descriptors. Currently you are encoding only one single (the first) descriptor of your local descriptors.

( 2016-07-22 09:20:14 -0500 )edit

Oh I think that I understood what you mean. Supposing we have detected k keypoints, that means we have k SIFT descriptors vectors of 128 dimensions. We want to encode all the k descriptors vectors into one single Fisher Vector, while I'm encoding one vector. Is that right?

( 2016-07-23 00:38:20 -0500 )edit

That seemed the error! Now the performances are much better, but I'll try to compute a precision-recall curve to be sure :) If I find some other problem, I hope that's note a problem if I contact you again :)

( 2016-07-23 01:56:25 -0500 )edit

Sure, no problem and yes to your previous comment, I think you got the idea now. Please mark the answer as accepted if it works.

( 2016-07-23 06:37:01 -0500 )edit

Official site

GitHub

Wiki

Documentation