# Can we use Bag of Visual Words to compute similarity between images directly?

I'm implementing a Content Based Image Retrieval application (CBIR).

I've read about the Bag of Features model and it's considered an intermediate-step algorithm in some application. For example, the histograms generated could be used for SVM classification.

Since the produced vectors are affected by the curse of dimensionality, it can become expensive to compute some kind if distance between a query histogram and all the dataset histograms. For this reason, techniques like LSH have been implemented for datasets with hundreds of thousands (or millions) of images.

However, my system is based on (let's say) 50k images, so compute directly the distance should not be so prohibitive.

I have two questions:

1. Is this approach reasonable? Recap of it: classic BoF approach and then compute the distance between each dataset histogram and the query histogram. The smaller one is returned as the most similar image.
2. Which distance should I use? I've heard that chi-squared distance is a popular choice. What about euclidean, cosine similarity etc? Which one is the best in such a context?
edit retag close merge delete

i don't quite understand, what you mean by "directly". does it mean without bow clustering, or using the image pixels instead of derived features like SIFT ?

( 2016-07-20 01:14:47 -0500 )edit

I tried to explain it in the question, I think that I was not clear enough, I'm sorry for that. What I mean is: can we use the histograms generated by BoF to compute the similarity between two images through some distance metric (which one is actually question 2.)? For example: we have 50k images. We apply the BoF algorithm. Now we have 50K histograms vectors in k dimensions (where k is the k-means parameter). My question is: given a query image q, can I compare q's histogram with all the other 50k histograms in order to find the most similar image in the dataset? If yes, which distance metric to use (point 2.)? I'm sorry if I was not clear even this time.

( 2016-07-20 01:54:52 -0500 )edit
1

ah, sorry, i wasn't quite awake yet ;)

so, - more like using a "manual" nearest neighbour search (e.g. compareHist(a,b,HISTCMP_CHISQR); in a loop) instead of applying an SVM or KNearest ?

i don't think, there's a canonical answer to this, you have to try (and no stone must be left unturned..)

( 2016-07-20 02:04:09 -0500 )edit

SVM? I never used it (I'm not really a machine learning guy :) ) but I know that's for classification problem, not CBIRs (I think they're quite different domains). Just as curiosity! Btw, I'll try to implement a linear scan implementation on histograms and come back with results! Most important: what distance should I use to compute the similarity between histograms?

( 2016-07-20 02:11:50 -0500 )edit

ah, right, SVM / KNearest was wrong guess, it's not classification. (but maybe a flann::Index would do)

just try - but i guess, CORREL, INTERSECT and KL_DIV will be pretty bad, and CHISQR and HELLINGER quite ok.

( 2016-07-20 02:19:11 -0500 )edit

Sort by » oldest newest most voted

there's a couple of things you could try:

1 nearest neighbour search with some distance function:

Mat train; // like 50k rows
Mat test; // a single row feature

double minD=DBL_MAX;
int bestId=0;
for (size_t i=0; i<train.size(); i++)
{
double d = compareHist(test, train.row(i), HISTCMP_CHISQR); // also try HELLINGER !
if (d<minD)
{
minD=d;
bestId=i;
}
}


also, cosine distance instead of compareHist:

static double cosdistance(const cv::Mat &testFeature, const cv::Mat &trainFeature)
{
double a = trainFeature.dot(testFeature);
double b = trainFeature.dot(trainFeature);
double c = testFeature.dot(testFeature);
return -a / sqrt(b*c);
}


then, you could try a flann::Index:

    // train it:
Ptr<cv::flann::Index> index =
makePtr<cv::flann::Index>(train, cv::flann::LinearIndexParams(), cvflann::FLANN_DIST_L2);
// predict:
int K=5;
cv::flann::SearchParams params;
cv::Mat dists;
cv::Mat indices;
index->knnSearch(test, indices, dists, K, params);

more

Thanks for your answer. In these slides, at the number 23 it's suggested to do the "normalization with L1/L2 norm". Why that should be helpful? In addition, which should be the size k of each codebook? In the same slide they are suggested 1000-4000 dimensions, but they seems too many to me. What do you think?

( 2016-07-20 09:18:25 -0500 )edit

One last thing: you cited the flann::index for implementing a fast approximate retrieval using the L2 distance, right? But I thought that only the CHI2 or HELLINGER ones were good, not the L2 one! :)

( 2016-07-20 09:20:06 -0500 )edit
1
• flann - oh, that's just some minimal example to get started. there is lsh as well, as other distance types
• normalization - you could apply normalize(mat,mat); to each row in your train & test data as a prep. step. this would make the distribution of your data independant from the magnitude of the histogram.
• k - imho this depends on the length of your features. it's a projection from that spacee to k dimensions. if you make it too small, you loose precision, too big will add noise and take time.
( 2016-07-20 10:11:46 -0500 )edit

@berak would you please give a look to this question please?

( 2016-07-20 11:25:13 -0500 )edit
2

Sidenotes: a) if you want to get better results: use activation features extracted from a CNN. b) There are experiments were just averaging these activation features gives a good global descriptor c) try other encoding methods, e.g. VLAD is very simple and effective (I really should finally do a pull request for OpenCV...)

( 2016-07-20 16:24:53 -0500 )edit

Thanks for your comment! a) By CNN you mean Convolutional Neural Networks, right? As I said, I'm not the machine learning guys, I don't know nothing about neural networks unfortunately. b) interesting, but I'm afraid that these activation features are not easy to obtain (especially for me :D), for now I look for something simpler c) YES! I heard a lot about VLAD and Fisher Vectors, but the only implementation that I know is from VLFeat, which C api is bad documented! Can you please share with me in any way your VLAD implementation for OpenCV?

( 2016-07-20 23:18:55 -0500 )edit

@Guanta Just as info: which distance should I use between VLAD/Fisher vectors? Euclidean?

( 2016-07-21 00:30:33 -0500 )edit

Since VLAD/Fisher vectors are typically already l2 normalized a Cosine distance is sufficient.

( 2016-07-21 08:40:54 -0500 )edit
• Vlfeat is actually quite good and imho not that bad documented, and esp. VLAD is not that difficult to compute and very similar to normal vector quantization, pseudo-code:

vlad = zeros(n_clusters, n_dimension) for i in 1:n_descr: vlad[idx[i]] += descrriptor[i] - centers[idx[i]]

where idx is a vector containing the indices from each descriptor to nearest k-means center. After that you only need to reshape and normalize the vlad descriptor.

( 2016-07-21 08:46:06 -0500 )edit

Thanks for your useful answers. Anyway, according to this answer (and the linked papers), Euclidean distance seems the most popular choice.

( 2016-07-21 11:55:02 -0500 )edit

Official site

GitHub

Wiki

Documentation

## Stats

Asked: 2016-07-20 00:01:44 -0500

Seen: 494 times

Last updated: Jul 20 '16