
Question about Bag of Words, detectors, and such

asked 2014-02-28 21:31:00 -0500

KonradZuse

updated 2014-03-01 02:39:21 -0500

berak

hello all,

I am basically looking to do a bag-of-words type setup where I compare a picture taken with items in the database.

An example of what I am doing is taking a picture of a bookshelf and identifying the books on it. So 1 picture could contain 50 "vocabulary" items.

Basically I am curious about which "keypoint detectors", "feature descriptors", and "matchers" I will need.

It seems there are so many choices and I don't know which would be better for what.

I would like to use something other than SURF or SIFT because I hear you need a license and to get that requires a good bit of money.

I have heard good things about FREAK, BRISK, and ORB, but those are only descriptors right? I would still need a keypoint detector and matcher? ( I thought I also heard that some descriptors are also detectors or...?)

I think that one of the more important things would be scale invariance as the picture I have might not be the size of the picture I'm taking within the bookshelf.

I don't think that rotation is that big a deal.

I'm not sure what else I should ask about these but if anyone has any input to help me on my path I would greatly appreciate it...

As for BoW itself I hear you basically have your vocab of keypoints, then you compare them to the keypoints in the image, and then do a histogram compare?

I also believe I heard something about training classifiers? Why exactly would we need to do that? To identify the items within the whole picture? like a bottle compared to a box?

I think that's all, thanks again to anyone who can help,



1 answer


answered 2014-03-01 08:14:06 -0500

Guanta

Here are some answers to the rumors you 'heard':

  • Yes, you need to be careful when using SIFT and SURF commercially.
  • About detectors, descriptors and matchers:

  1. There exist keypoint detectors which detect locations in the image that contain high entropy or meet certain criteria (like corners, blobs). Often they also encode scale and rotation information, which makes a feature invariant to those transformations (if it can make use of it at all). A list of possible keypoint-detectors can be found at
  2. Then there exist feature (or descriptor) extractors which typically use the keypoint locations to build a descriptor. There exist two categories of features, binary ones (FREAK, BRISK, ORB, BRIEF) and non-binary ones (SIFT, SURF, MSER). Binary features are typically faster to compute and offer a more compact representation. They are often used in robotics for a fast match. In the case of image retrieval the most common choice is SIFT. A list of supported Features can be found here: Typically you can combine any possible keypoint detector and feature extractor of OpenCV (apart from the SIFT detector). However, not each combination makes sense or works as expected. Natural combinations or combinations suggested by the authors of the descriptors are afaik SIFT-SIFT, ORB-ORB, SURF-SURF, BRISK-BRISK, BRISK-FREAK, MSER-MSER, FAST-BRIEF.

  3. For the matchers you can choose between the brute-force matcher and the FLANN-based matcher. The latter is preferable if you have a very large database of features. It creates an index of all features which is then queried. The brute-force matcher, in contrast, just goes over all features in question and picks the best match. Which one you should use depends on the dataset and the task (typically BFMatcher is fine). Furthermore, if you use a binary feature you should compare descriptors with the Hamming norm, not with the L2 norm.
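To make the last point of item 3 concrete, here is a minimal sketch of brute-force matching under the Hamming norm, written in plain Python rather than with OpenCV. The function names `hamming` and `brute_force_match` and the toy two-byte descriptors are made up for illustration; real binary descriptors such as ORB or BRISK are longer, and in OpenCV the equivalent would be a `BFMatcher` created with `NORM_HAMMING`.

```python
def hamming(a: bytes, b: bytes) -> int:
    """Count the differing bits between two equal-length binary descriptors."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def brute_force_match(query, train):
    """For each query descriptor, find the nearest train descriptor.

    Returns a list of (query_index, train_index, distance) tuples.
    """
    matches = []
    for qi, q in enumerate(query):
        # Go over every train descriptor and keep the one with the smallest distance.
        ti, dist = min(
            ((i, hamming(q, t)) for i, t in enumerate(train)),
            key=lambda m: m[1],
        )
        matches.append((qi, ti, dist))
    return matches


# Toy 16-bit descriptors (hypothetical data, two bytes each).
query = [bytes([0b11110000, 0b00001111]), bytes([0b10101010, 0b01010101])]
train = [bytes([0b10101010, 0b01010111]), bytes([0b11110000, 0b00001110]), bytes([0, 0])]
matches = brute_force_match(query, train)
print(matches)
```

Each query descriptor is compared against every train descriptor, which is exactly why the brute-force approach gets slow for very large databases and an index-based matcher becomes attractive.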

Good luck with your project!



Thank you for this, this cleared up a lot of uncertainty.

So it seems that most of the feature descriptors have a keypoint detector as well.

I have seen that thread in the past. Since you are the poster, I guess I can ask about what confused me in that post: when I mentioned classifiers it had to do with #4, but it seems the first method is the preferred one.

I figured that if you trained on certain objects it would also help with detecting them more easily (in my case most objects will be common, symmetrical shapes). I also figure that if I wanted to draw a border around an object to show a detection I would need them as well.

Now it seems you recommend the GIST-Descriptor? I will look into it, but I'm curious your take on it, so it just gets the ...(more)

KonradZuse ( 2014-03-02 00:12:35 -0500 )

I ran out of space, but I also wanted to know about the BoW vocab size. It seemed like 2000 vocabulary items was where accuracy started to drop (from some paper I saw, I will try to find it), so I'm curious how badly it's affected as the vocabulary grows.

If we have a histogram compare we should be able to at least give a few options, and I would assume that even if we have 100,000 items we still would get the "correct" answer, we might just have a bunch of others that also fit within the spectrum, right?

I'm not sure how many vocab items I will have, but eventually it could be 100,000, I might be able to categorize and such, but my total will definitely be a huge number.

KonradZuse ( 2014-03-02 00:36:35 -0500 )

BoW is used to form a global image descriptor from local ones, or in other words: local descriptors are encoded to a global one. Since GIST is already a global one you don't need this.

There are different opinions on the vocabulary size, some papers state that with growing size also the accuracy increases, in others it drops at some point.

I don't know what you exactly meant with 'spectrum'. In an image retrieval case, you are comparing the BoW-Descriptor from one image with all others and take the best one. If it's a recognition task you classify the BoW-Descriptor and get the class of the image.

Guanta ( 2014-03-02 07:13:07 -0500 )
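The retrieval case described in the comment above (compare the query's BoW descriptor with all others and take the best) can be sketched roughly as follows. This is only an illustration under assumptions: the L1 distance on normalized histograms is one possible comparison among several, and the names `rank_database`, `book_a` etc. and the histogram values are invented. Returning the top few candidates instead of only the best one covers the "give a few options" idea from the question.

```python
def normalize(hist):
    """Scale a histogram so its bins sum to 1 (images may have different keypoint counts)."""
    total = sum(hist)
    return [v / total for v in hist] if total else list(hist)


def l1_distance(h1, h2):
    """Sum of absolute bin differences; 0 means identical histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def rank_database(query_hist, database, top_k=3):
    """Return the top_k (name, distance) pairs, closest database entry first."""
    q = normalize(query_hist)
    scored = [(l1_distance(q, normalize(h)), name) for name, h in database.items()]
    scored.sort()
    return [(name, dist) for dist, name in scored[:top_k]]


# Hypothetical BoW histograms over a 4-word vocabulary.
database = {
    "book_a": [8, 1, 1, 0],
    "book_b": [1, 7, 2, 0],
    "book_c": [2, 2, 2, 4],
}
ranking = rank_database([7, 2, 1, 0], database, top_k=2)
print(ranking)
```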

Your last sentence suggests that you haven't fully understood the concept (or I am misinterpreting it). You don't build a vocabulary for each individual image. Instead you build one single vocabulary from a set of training images (i.e. by clustering, let's say, 100k local features from several images), which is used afterwards to encode the local descriptors of each individual test image --> frequency histogram of nearest cluster-center = BoW descriptor.

Guanta ( 2014-03-02 07:17:59 -0500 )
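The pipeline from the comment above (cluster training features into one vocabulary, then encode each test image as a histogram of nearest cluster centers) might look like this in a minimal, pure-Python sketch. The tiny k-means here, the 2-D toy features, and the names `kmeans` and `bow_histogram` are assumptions for illustration only; a real setup would cluster many high-dimensional descriptors, for example with OpenCV's BoW trainer classes.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centers, point):
    """Index of the cluster center closest to point."""
    return min(range(len(centers)), key=lambda i: sq_dist(centers[i], point))


def kmeans(features, k, iterations=20, seed=0):
    """Very small k-means: returns k cluster centers (the visual vocabulary)."""
    rng = random.Random(seed)
    centers = rng.sample(features, k)  # k must be chosen in advance
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for f in features:
            clusters[nearest(centers, f)].append(f)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers


def bow_histogram(vocabulary, descriptors):
    """Count how many descriptors fall to each visual word (nearest center)."""
    hist = [0] * len(vocabulary)
    for d in descriptors:
        hist[nearest(vocabulary, d)] += 1
    return hist


# Toy training features pooled from several images (hypothetical 2-D descriptors).
training = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
vocabulary = kmeans(training, k=2)

# Local descriptors of one test image, encoded with the single shared vocabulary.
hist = bow_histogram(vocabulary, [[0.2, 0.1], [9.8, 10.5], [10.2, 9.9]])
print(vocabulary, hist)
```

Note that the vocabulary is built once from the pooled training features; every test image is then encoded against that same vocabulary, so all BoW descriptors have the same length and can be compared directly.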

Huh, a global descriptor, crazy... So it's a giant descriptor that contains all of the information about its parts (the individual descriptions of each)? Is this what the "Cluster" function would do?

It seems GIST is what I would be looking at then, you seem to recommend it as well.

Glad to hear that there are varying opinions on size.

I meant spectrum with histograms. I thought that when we get the descriptor we find the images with the closest resemblance, then do a histogram compare and see which ones are the closest resemblance(since there could be more than one that it closely resem

I am going to need to recognize items, but then find out what it is. Like I mentioned a bottle, I could have multiple companies, coke, sprite, fanta, but I need to identify that bottle.

KonradZuse ( 2014-03-02 20:06:27 -0500 )

Sorry, the last sentence was something I just made up in case there was an issue with a giant vocab.

I understand that we use multiple images to create this vocabulary. I also want to make sure we can constantly update this vocabulary? I guess it shouldn't matter when we create a new cluster with new features since the old ones will still be there.

I also find it interesting there aren't many matchers, just a few brute force methods I saw and the FLANN. I'm assuming I will still be using a matcher with the GIST-Descriptor?

Thanks again for the help.

EDIT: It's a bit difficult to find GIST-Descriptors. I found a link here to a C++ implementation

KonradZuse ( 2014-03-02 20:11:04 -0500 )

GIST is a global image descriptor, I don't know how high dimensional it is.

No, this is not what the cluster-function would do, it just clusters the training-features and the means are used to encode the test-features.

Updating the vocabulary is typically not that easy! You can't just add new clusters. For k-means the number of clusters has to be given in advance. But you usually don't need to update it, only if your training-features are very different from your test-features.

Well, you need some form of matcher to get the nearest cluster centers to encode your feature vectors.

Guanta ( 2014-03-06 05:10:51 -0500 )

@Guanta, I'd search a query image in a database of unique images. For those unique images, I'd construct a BoW vocabulary with N clusters, where N would either 1) be as large as the final number of images I will have in the database, or 2) change dynamically and correspond to the number of images in my database. Does it make sense to create a codebook from BoW visual words and use that to match query images with only one unique image in the database? Also, if you can, please answer my question here:

bad_keypoints ( 2014-10-17 09:26:27 -0500 )

1) This is not necessary, see my answer to your other question. 2) Every change of clusters would require a retraining -> costs time -> typically not needed if the training set is already diverse enough.

Yes you can train another codebook to use it for matching, however typically you wouldn't use k-means but just a kd-tree (see e.g. the good flann library). However, if the number of classes grows you'd need to retrain that as well (the same would happen with other real classifiers).

Guanta ( 2014-10-17 10:41:35 -0500 )
