# BoW: number of words in the vocabulary for a unique-images database?

I'm using BoW for an image matching application in which a query image is matched against thousands of images (could be millions too). I won't implement the SVM part, because I don't need classification. I only need to build a codebook for a fast inverted-index implementation to search my images by BoW image descriptors.

First question:

The BoW vocabulary will consist of N visual words (clusters), as specified by the clusterCount parameter when creating a BOWKMeansTrainer instance.

For an image matching application where there will be thousands of unique images, and the collection will keep growing (with occasional mass deletions), what should clusterCount be? If for 100k images I specify a mere 1000 clusters, I'm sure I'll get a lot of false matches in the ranks of my results.

Second question: Also, what do you think the size of the vocabulary will be (in bytes) for 100k images? I figured that if one cluster takes about 0.5 kB, then even if I create a vocabulary from 1 million images with clusterCount = 1 million, the vocabulary size should not be more than ~490 MB, which is reasonable enough for processing query images on a simple machine with 8 GB RAM.

How I calculated cluster size: I made a vocabulary of 8 words from 16 images and then dumped the vocabulary to disk (numpy.save). The size on disk was S. Assuming each cluster has the same size, dividing S by clusterCount always gave me 512 bytes, for different sets of images, different clusterCounts, and different image sizes. I used a simple toy implementation of @berak's, which can be found here

Thus, for 1 million clusters from 1 million images, my calculation above should hold. Actually, for 1 million images I'm quite sure that descriptors from many images would fall into the same clusters, so the size could be even smaller.
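That size check can be reproduced with a numpy-only sketch. The 8×128 vocabulary below is random data standing in for real float32 cluster centers; note that numpy.save adds a small .npy header on top of the raw 512 bytes per cluster:

```python
import os
import tempfile

import numpy as np

# Stand-in vocabulary: 8 cluster centers of 128-dim float32 values,
# the same shape BOWKMeansTrainer.cluster() would return for SIFT.
vocabulary = np.random.rand(8, 128).astype(np.float32)

path = os.path.join(tempfile.mkdtemp(), "vocab.npy")
np.save(path, vocabulary)

bytes_per_cluster = vocabulary.nbytes // vocabulary.shape[0]
size_on_disk = os.path.getsize(path)

print(bytes_per_cluster)  # 128 floats * 4 bytes = 512
```

For a large clusterCount the fixed .npy header is negligible, which is why dividing size-on-disk by clusterCount comes out at 512 bytes.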


the vocabulary size is simply: nclusters * sizeof(feature).

if you use SIFT features, each descriptor contains 128 floats, so each feature vector is 512 bytes on disk (or in memory) [exactly what you found]. (SURF is 64 floats by default, 128 in extended mode, so 256 or 512 bytes.)

what i don't understand in your question is:

• there's a cbir tag there. this would assume that you are training for a fixed (?) set of classes (like: car, house, cat, spoon). is it so? if so, that would be a far more relevant number than the overall image count.

• even if you don't use an SVM, you still have to do something with the resulting BoW vectors. what is it? and how would that not be classification?


@berak, I don't want classification. I need to find a match of an incoming query image from a database of unique images. Edit: Also, thanks for the size explanation @berak. That was very clear. Edit 2: To answer your second question, I'd try to make an inverted index table from the bow vocabulary.



The number of clusters does not need to be related to the number of images you use to train the k-means algorithm (= BOWKMeansTrainer).

So assuming you have 1000 images and each image has maybe 2000 descriptors (it can of course be much more), you then randomly choose, let's say, 500,000 of all those descriptors and train k-means with 10,000 clusters. The number of clusters here is not really related to the number of images; it's more important how diverse your dataset, and thus your features, are. There exist plenty of publications trying to find out whether more clusters always give better accuracy (typically they do, but often the gains saturate). However, the more clusters you choose, the larger your BoW descriptor grows, which can make a fast search more difficult.
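A toy version of that procedure, with random data in place of real local descriptors and a plain Lloyd's k-means standing in for BOWKMeansTrainer (all the numbers are scaled way down for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for local descriptors pooled from many images
# (real SIFT/SURF descriptors would come from a feature detector).
descriptors = rng.random((2000, 128)).astype(np.float32)

# Randomly subsample the pool before clustering, as described above.
sample = descriptors[rng.choice(len(descriptors), size=500, replace=False)]

def kmeans(data, k, iters=10, seed=0):
    """Plain Lloyd's k-means; BOWKMeansTrainer does the same job."""
    r = np.random.default_rng(seed)
    centers = data[r.choice(len(data), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign every descriptor to its nearest center
        d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its members
        for j in range(k):
            members = data[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

vocabulary = kmeans(sample, k=16)
print(vocabulary.shape)  # (16, 128): clusterCount x descriptor dimension
```

The returned rows are the visual words; clusterCount is chosen independently of how many images contributed descriptors.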

What puzzles me: did you maybe misunderstand or misuse the term BOWKMeansTrainer here? How do you classify the image later? BOWKMeansTrainer doesn't have anything to do with k-nearest-neighbor classification; k-means clustering is not the same thing as finding the k nearest neighbors. For nearest-neighbor search I'd just build a kd-tree or use the great FLANN library.

1 mio clusters = 1 mio x sizeof(feature) = 1 mio x 128 floats x 4 bytes (if SIFT) = 512 mio bytes, which is about 488 MiB.
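As a sanity check on the arithmetic (each SIFT cluster center is 128 float32 values, i.e. 512 bytes, as established above):

```python
n_clusters = 1_000_000
dims = 128           # SIFT descriptor dimensionality
bytes_per_value = 4  # float32

total_bytes = n_clusters * dims * bytes_per_value
print(total_bytes)          # 512000000
print(total_bytes / 2**20)  # ~488 MiB
```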


@Guanta, I don't think I misunderstood BOWKMeansTrainer. It clusters training descriptors into words for the vocabulary, given how many clusters you want. Also, repeating again: I do not want to classify images. I want to leverage the ease of use and fast indexing of the inverted-index approach in BoW: find out which clusters the query image falls into most, find those clusters' associated database images, and then find out which image has the highest frequency of occurrence in the lookup. Maybe this will make it clearer: http://www.cs.utexas.edu/~grauman/courses/fall2009/slides/lecture16_bow.pdf (go to slide 25 in it, labelled "inverted index"). Edit: Thanks for answering though, @Guanta.
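That voting lookup can be sketched in plain Python; the image names and visual-word ids below are made up for illustration:

```python
from collections import Counter, defaultdict

# Hypothetical mapping produced offline: for each database image,
# the set of visual words (cluster ids) its descriptors fell into.
image_words = {
    "img_a": {0, 3, 7},
    "img_b": {1, 3},
    "img_c": {0, 1, 3, 7},
}

# Inverted index: visual word -> images containing it.
inverted = defaultdict(list)
for image, words in image_words.items():
    for w in words:
        inverted[w].append(image)

def query(words):
    """Vote for every image that shares a visual word with the query,
    ranked by how many words it shares."""
    votes = Counter()
    for w in words:
        votes.update(inverted.get(w, []))
    return votes.most_common()

print(query({0, 3, 7}))  # img_a and img_c share all three words
```

Only the clusters hit by the query are touched, which is what makes the lookup fast even for a huge database.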


Also, maybe I missed something: my main motive is to have a BoW inverted index replace my current FLANN-based image index, because the FLANN matcher I currently use to match a query image against many images is not savable or updatable (one cannot simply add another image's descriptors and use it for matching without retraining, and deleting a particular set of descriptors is not possible). And I want to save it to disk, not regenerate it from saved descriptor data every time my application starts.


Okay, it's clearer now what you'd like to do. The BoW inverted index is typically built from the same clusters you used to create the BoW descriptors, since you typically use the index file for weighting the BoW descriptors. Of course you could also cluster the local descriptors differently and thus use a different number of clusters, but why? I would go with the simple way.

For updating: I wouldn't retrain on the image database; I'd just update the inverted index file. If your training images were diverse enough, you are good. If you retrained, you'd also need to recreate the inverted index file, which I guess is pretty time consuming.


I don't understand this part: "Of course you could also cluster the local descriptors differently and thus use a different number of clusters, but why?" What local descriptors? The ones I extract from the image with SURF/ORB? And differently how? Actually, I just don't understand any of it. Sorry.

And this: "I would go with the simple way." Are you referring to the FLANN-based image index I mentioned above, or something else?

And then: "For updating: I wouldn't retrain the image database and just update the inverted index file." You mean I just add the SURF descriptors of a new image added to my server to my BoW trainer object, and then, somehow without retraining it to cluster the new set of descriptors into the existing clusters, I should update my inverted index? Seems crazy.


For the last part: how would I even update the vocabulary to include the new set of descriptors from the newly added image without retraining the BOWKMeansTrainer object to get a new vocab? Only then can I get the BoW descriptor of the new image with BOWImgDescriptorExtractor and use the result to update the inverted index. Please correct me if I'm missing something in the flow.


"what local descriptors" -> yes, whatever you have chosen: SURF/ORB/etc.; these are local descriptors. The BoW descriptor of an image is called the global descriptor. You need to cluster the local descriptors to create the vocabulary (aka dictionary). This vocabulary is then used to a) create the BoW descriptor for each image (every image, it doesn't need to be in the training set), and b) typically also to create the inverted index file used to weight the BoW descriptors. However, if you don't want to use it for weighting, you could in theory train it differently, even with a different number of clusters. But as I wrote, I wouldn't do that; just use the same clusters as before, i.e. only one clustering is involved.
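Step a) above, turning an image's local descriptors into a global BoW descriptor against a fixed vocabulary, can be sketched with numpy alone (random data stands in for real SURF/ORB descriptors; BOWImgDescriptorExtractor computes essentially this histogram):

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed vocabulary of 8 visual words (cluster centers), 128-dim.
vocabulary = rng.random((8, 128)).astype(np.float32)

# Local descriptors of one image (a detector like SURF would produce these).
local = rng.random((50, 128)).astype(np.float32)

# Assign each local descriptor to its nearest visual word...
dists = np.linalg.norm(local[:, None, :] - vocabulary[None, :, :], axis=2)
words = dists.argmin(axis=1)

# ...and build the normalized histogram: the global BoW descriptor.
hist = np.bincount(words, minlength=len(vocabulary)).astype(np.float32)
bow_descriptor = hist / hist.sum()

print(bow_descriptor.shape)  # (8,), one bin per visual word
```

Note that the image whose descriptor is computed never needs to have been part of the clustering.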


"you mean I just add the SURF descriptors" -> no retraining (only occasionally, if FLANN decides to do so). You compute your BoW descriptor as before, i.e. create a histogram of all SURF descriptors over their nearest clusters. By updating I meant only updating the inverted index file: for each cluster where you raise the bin count, you also save the image index (this is how you create the inverted index file), so you just need to update it with the new info. This is possible, e.g. in FLANN with the addPoints() method; however, you are right in the sense that the FLANN index may need to be retrained (depending on the parameters you used with addPoints()).
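A sketch of that update flow, with a plain dict standing in for the inverted index file and made-up data (no FLANN involved; the vocabulary stays fixed throughout):

```python
from collections import defaultdict

import numpy as np

rng = np.random.default_rng(2)

# Fixed vocabulary of 8 visual words, 16-dim for brevity; never retrained.
vocabulary = rng.random((8, 16)).astype(np.float32)

# Inverted index: visual word -> image ids whose descriptors hit it.
inverted = defaultdict(list)

def add_image(image_id, local_descriptors):
    """Index a new image without touching the vocabulary: assign its
    local descriptors to their nearest existing words, then record the
    image id under each word that got at least one hit."""
    d = np.linalg.norm(
        local_descriptors[:, None, :] - vocabulary[None, :, :], axis=2
    )
    hit = set(d.argmin(axis=1).tolist())
    for w in hit:
        inverted[w].append(image_id)
    return hit

hit_words = add_image("new_img", rng.random((30, 16)).astype(np.float32))
print(all("new_img" in inverted[w] for w in hit_words))  # True
```

The new image becomes searchable immediately; only if the descriptor distribution drifts far from the training data would re-clustering the vocabulary pay off.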


For your last comment: I think you are still mixing some parts up: you don't need to retrain the BOWKMeansTrainer() (= vocabulary); this is needed only once (as long as your new images are not too different from the images used to train it)!


I think you are still mixing some parts up: you don't need to retrain the BOWKMeansTrainer() (= vocabulary); this is needed only once (as long as your new images are not too different from the images used to train it)!

That's the thing, dude! All my images will always be so different that at least I, as a person, wouldn't say that any two of them belong to the same class or could be classified as "similar".

So, for every image added to my server, I will need to: 1) update the vocab with the descriptors of the new image, and 2) get the nearest clusters for the new image and add them to the inverted index.

Continuing..


Now, without 1, 2 is not possible. Or, put differently: how does it make sense to match a query image with a matcher that isn't even trained on the target image's descriptors (because we didn't add the new image's descriptors to the BoW trainer and then call cluster on it)?

