I'm using BoW for an image matching application, in which a query image will basically be matched against thousands of images (could be millions, too). I won't implement the SVM part, because I don't need classification. I only need to build a codebook for a fast inverted-index implementation, so I can search my images by their BoW image descriptors.
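For context, this is roughly the kind of inverted index I have in mind once every image is reduced to a set of visual word ids (a minimal sketch; the word-id extraction itself, e.g. via BOWImgDescriptorExtractor, is outside this snippet):

```python
from collections import defaultdict

# word id -> set of image ids containing that visual word
inverted_index = defaultdict(set)

def index_image(image_id, word_ids):
    """Register one database image under every visual word it contains."""
    for w in word_ids:
        inverted_index[w].add(image_id)

def candidate_matches(query_word_ids):
    """Images sharing at least one visual word with the query.
    Actual ranking (e.g. tf-idf scoring) would happen afterwards."""
    candidates = set()
    for w in query_word_ids:
        candidates |= inverted_index[w]
    return candidates
```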
First question:
The BoW vocabulary will consist of N visual words (clusters), as specified by the clusterCount param when creating the BOWKMeansTrainer instance.
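To make that concrete, this is roughly how I'm building the vocabulary (a minimal sketch, assuming SIFT descriptors; the image paths here are just placeholders):

```python
import cv2
import numpy as np

clusterCount = 1000                       # number of visual words (N clusters)
sift = cv2.SIFT_create()
bow_trainer = cv2.BOWKMeansTrainer(clusterCount)

image_paths = ["img_001.jpg", "img_002.jpg"]   # placeholder database images

for path in image_paths:
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, descriptors = sift.detectAndCompute(img, None)
    if descriptors is not None:
        bow_trainer.add(np.float32(descriptors))  # trainer expects float32

vocabulary = bow_trainer.cluster()        # shape: (clusterCount, 128) float32
np.save("vocabulary.npy", vocabulary)
```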
For an image matching application with thousands of unique images, where the collection will keep growing (and sometimes shrink in bulk), what should clusterCount be? If for 100k images I specify a mere 1000 clusters, I'm sure I'll get a lot of false matches in the ranks of my results.
Second question: what do you think the size of the vocabulary will be (in bytes) for 100k images? I figured that if one cluster takes about 0.5 kB, then even if I create a vocabulary from 1 million images with clusterCount = 1 million, the vocabulary should be no more than ~490 MB, which is reasonable enough for processing query images on a simple machine with 8 GB RAM.
How I calculated the cluster size: I made a vocabulary of 8 words from 16 images and then dumped the vocabulary to disk (numpy.save). The size on disk was S. Assuming each cluster takes the same space, dividing S by clusterCount always gave me 512 bytes. I used a simple toy implementation of @berak's, which can be found here:
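That 512 bytes also matches a quick sanity check, assuming each visual word is a cluster centre in SIFT descriptor space (128 float32 values):

```python
import numpy as np

bytes_per_word = 128 * np.float32().nbytes     # 128 * 4 = 512 bytes per cluster
clusterCount = 1_000_000
vocab_bytes = clusterCount * bytes_per_word    # 512,000,000 bytes
print(vocab_bytes / 2**20)                     # ~488 MiB, i.e. the ~490 MB above
```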
Thus, for 1 million clusters from 1 million images, my calculation above should hold. Actually, for 1 million images, I'm fairly sure that descriptors from many images would fall into the same clusters, so the number of words actually needed, and hence the size, could even be smaller.