BoW: number of words in the vocabulary for a database of unique images?
I'm using BoW for an image matching application in which a query image is matched against thousands of images (possibly millions). I won't implement the SVM part, because I don't need classification. I only need to build a codebook for a fast inverted-index implementation, so that I can search my images by their BoW image descriptors.
First question
The BoW vocabulary will consist of N visual words (clusters), as specified by the clusterCount param when creating the BOWKMeansTrainer instance.
For an image matching application with thousands of unique images, where the collection keeps growing (and is occasionally cut down in bulk), what should clusterCount be? If I specify a mere 1000 clusters for 100k images, I'm sure I'll get a lot of false matches in my ranked results.
Second question: how large do you think the vocabulary will be (in bytes) for 100k images? I figured that if one cluster takes about 0.5 kB, then a vocabulary built from 1 million images with clusterCount = 1 million should be no more than ~490 MB, which is reasonable for processing query images on a simple machine with 8 GB RAM.
How I calculated the cluster size: I made a vocabulary of 8 words from 16 images, and then dumped the vocabulary to disk (numpy.save). The size on disk was S. Assuming each cluster is the same size, dividing S by clusterCount always gave me 512 bytes, for different sets of images, different clusterCount values and different image sizes. I used @berak's simple toy implementation, which can be found here
Thus, for 1 million clusters from 1 million images, my calculations above should hold. Actually, for 1 million images, I'm very sure that most of the descriptors from many images would fall together, so the size could be smaller.
the vocabulary size is simply:
nclusters * sizeof(feature).
if you use SIFT features, or SURF in its extended 128-dimensional mode (non-extended SURF has 64), each descriptor contains 128 floats of 4 bytes, so each feature vector is 512 bytes on disk (or in mem). [exactly what you found]
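to make the arithmetic concrete, here is a minimal sketch of that formula. it assumes float32 values stored as a dense nclusters x dim matrix; the function name vocabulary_bytes and its parameters are made up for illustration and are not part of OpenCV.

```python
# Sketch: estimate the size of a BoW vocabulary.
# Assumption: descriptors are float32 (4 bytes per value) and the vocabulary
# is stored as a dense (n_clusters x descriptor_dim) matrix.

FLOAT32_BYTES = 4


def vocabulary_bytes(n_clusters, descriptor_dim=128, bytes_per_value=FLOAT32_BYTES):
    """Return nclusters * sizeof(feature) in bytes."""
    return n_clusters * descriptor_dim * bytes_per_value


per_word = vocabulary_bytes(1)          # one 128-float visual word
toy = vocabulary_bytes(8)               # the 8-word toy vocabulary
large = vocabulary_bytes(1_000_000)     # clusterCount = 1 million

print(per_word)                   # 512
print(toy)                        # 4096
print(round(large / 2**20, 1))    # 488.3 (MiB)
```

note that the size depends only on the cluster count and the descriptor dimension, not on how many images were used for training. a file written with numpy.save also carries a small fixed header on top of the raw data.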
what i don't understand in your question is:
there's a cbir tag there. this would imply that you are training for a fixed (?) set of classes (like: car, house, cat, spoon). is that so? if so, the number of classes would be far more relevant than the overall image count.
even if you don't use an svm, you still have to do something with the resulting bow-vectors. what is it? and how would that not be classification?
@berak, I don't want classification. I need to find the match for an incoming query image in a database of unique images. Edit: Also, thanks for the size explanation @berak. That was very clear. Edit 2: To answer your second question, I'd try to build an inverted index table from the BoW vocabulary.
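To illustrate what I mean by an inverted index, here is a rough sketch under some assumptions: each image has already been reduced to a sparse BoW histogram (visual word id -> count), and candidates are ranked by histogram intersection. The names build_inverted_index and query, the toy data and the scoring choice are all hypothetical; a real system would likely use tf-idf weighting and normalized histograms.

```python
from collections import defaultdict


def build_inverted_index(bow_vectors):
    """Map each visual word id to the images that contain it.

    bow_vectors: dict of image_id -> {word_id: count} (sparse BoW histograms).
    """
    index = defaultdict(list)
    for image_id, hist in bow_vectors.items():
        for word_id, count in hist.items():
            if count > 0:
                index[word_id].append((image_id, count))
    return index


def query(index, query_hist, top_k=5):
    """Rank database images by histogram intersection with the query."""
    scores = defaultdict(float)
    for word_id, q_count in query_hist.items():
        # Only images sharing this visual word are touched.
        for image_id, count in index.get(word_id, []):
            scores[image_id] += min(q_count, count)
    # Highest score first; ties broken by image id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Toy database: three images over a small vocabulary (made-up data).
database = {
    "img_a": {0: 3, 5: 1},
    "img_b": {5: 2, 7: 4},
    "img_c": {1: 2},
}
index = build_inverted_index(database)
results = query(index, {5: 2, 7: 1})
print(results)  # [('img_b', 3.0), ('img_a', 1.0)]
```

The point of the structure is that a query only visits the posting lists of the words it actually contains, so images sharing no visual word with the query (img_c here) are never scored.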