Bag of (visual) Words - Locate Object in Image

asked 2014-03-12 18:50:12 -0500 by B-Man, updated 2014-03-12 23:56:53 -0500

Hi everybody,

I was dealing with the Bag of (visual) Words (BOW) approach in OpenCV recently. I found some code on the Internet where you can predict, by means of a trained Support Vector Machine (SVM), whether a certain object is present in the image or not.

My problem now is that while this BOW approach recognizes the object fairly accurately, it doesn't tell you where the object is located in the image.

My goal is to put a bounding box around the detected object. Is there any way I can use the BOW approach and still locate that object?

Any help, suggestion or idea would be helpful! Thanks!


1 answer


answered 2014-03-13 04:21:38 -0500 by Guanta

Yes, you can use BoW: just apply the same pipeline to image tiles, i.e. divide your image into (typically overlapping) windows (a.k.a. blocks) of varying size and analyse each block individually. Note, however, that this is typically slow, so for the detection of objects other techniques often come into play:

a) Speed up the sliding-window approach, e.g. by an efficient subwindow search technique. (The original work was proposed by Lampert et al.: "Efficient subwindow search: a branch and bound framework for object localization.")

b) Fast rejection of non-object windows using a cascade classifier. Since this is already part of OpenCV, you should try that first. See how to train it.



Hi Guanta! Thank you very much for your answer! The code I found for testing purposes was from this website: (see "feature extraction.cpp" in the rar file).

My problem now is: you said I should partition each image into small rectangles/windows and analyse each block, i.e. do a sliding-window approach. But I'm confused about what I should search for in each block. In the code (see link) the SVM predicts the class of an object by examining the histogram of the visual words of the whole image (here: "bowDescriptor2"). But a histogram contains no location info...

Do you know of any sample code where I can see how analysis is done on these subwindows? That would be great! Thanks!

P.S.: The cascade classifier gives me too many false positives...

B-Man (2014-03-13 23:05:25 -0500)

A histogram doesn't contain the location information, but you have that information from the block, since it came from a specific location. You need to run the complete prediction pipeline for each block instead of for the whole image, i.e. extract the features from the block instead of the image, then do the vector quantization (i.e. build the BoW descriptor) and then predict the class for this BoW descriptor. You do this for each block. Then you can create a map of the SVM predictions for each object location (i.e. block location). From there on you can proceed further (depending on the results and block size, you could for example apply a kind of maximum suppression to the output results).
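A minimal plain-C++ sketch of that last step: assuming the per-block SVM responses have already been collected into a 2-D score map (one entry per block position), keep only blocks whose score passes a threshold and is a local maximum among its 8 neighbours. The function name and threshold are made up for illustration; in the real pipeline each score would be the svm.predict() response of one block:

```cpp
#include <vector>

// Keep only local maxima of a block-score map: a block survives when its
// score passes `threshold` and no 8-neighbour has a strictly higher score.
std::vector<std::vector<bool> > localMaxima(
        const std::vector<std::vector<float> >& scores, float threshold)
{
    const int rows = static_cast<int>(scores.size());
    const int cols = rows > 0 ? static_cast<int>(scores[0].size()) : 0;
    std::vector<std::vector<bool> > keep(rows, std::vector<bool>(cols, false));
    for (int y = 0; y < rows; ++y) {
        for (int x = 0; x < cols; ++x) {
            if (scores[y][x] < threshold)
                continue;  // weak response: reject immediately
            bool isMax = true;
            for (int dy = -1; dy <= 1 && isMax; ++dy) {
                for (int dx = -1; dx <= 1 && isMax; ++dx) {
                    const int ny = y + dy, nx = x + dx;
                    if ((dy != 0 || dx != 0) && ny >= 0 && ny < rows &&
                            nx >= 0 && nx < cols && scores[ny][nx] > scores[y][x])
                        isMax = false;  // a neighbour beats this block
                }
            }
            keep[y][x] = isMax;
        }
    }
    return keep;
}
```

The surviving positions (times the block stride) give you the candidate object locations for the bounding box.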

Guanta (2014-03-14 04:00:32 -0500)

Regarding this really awkward code (what a mess!): you need to slide over the image just before keypoint detection and BoW extraction, i.e.:

int blocksize = 100; // change that or vary it with another for loop

for (int y = 0; y < img2.rows - blocksize; y++) {
    for (int x = 0; x < img2.cols - blocksize; x++) {
        detector.detect(img2(cv::Range(y, y + blocksize), cv::Range(x, x + blocksize)), keypoint2);
        // same for bowDE.compute() and svm.predict() - hope you got the idea now
    }
}
Guanta (2014-03-14 04:08:01 -0500)

Hi Guanta! Thanks, I got the idea now! I was wondering whether it is possible to speed things up by just examining the detected keypoints in the image and predicting, for each keypoint (the area around it), which class it might belong to, e.g. (I don't know if this is correct):

detector.detect(img2, keypoint2);

// Just examine the area around detected keypoint (e.g. 20x20 area)

for (int i = 0; i < (int)keypoint2.size(); i++) {
    Rect ROI = Rect(keypoint2[i].pt.x - 10, keypoint2[i].pt.y - 10, 20, 20);
    bowDE.compute(img2(ROI), keypoint2, bowDescriptor2);
    float response = svm.predict(bowDescriptor2);
}


Afterwards you would build a bounding box only around the keypoints that have been predicted to belong to the target class...
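That bounding-box step can be sketched in plain C++, assuming you have already collected each keypoint centre and its predicted class label side by side (the struct and function names here are made up; in the OpenCV code the inputs would be keypoint2[i].pt and the per-keypoint svm.predict() responses):

```cpp
#include <vector>
#include <utility>
#include <algorithm>
#include <climits>
#include <cstddef>

struct Box { int x, y, width, height; };  // axis-aligned, (0,0,0,0) = empty

// Tightest bounding box around the keypoint centres whose predicted
// label equals `targetClass`.
Box boundingBox(const std::vector<std::pair<int, int> >& pts,
                const std::vector<int>& labels, int targetClass)
{
    int minX = INT_MAX, minY = INT_MAX, maxX = INT_MIN, maxY = INT_MIN;
    bool found = false;
    for (std::size_t i = 0; i < pts.size() && i < labels.size(); ++i) {
        if (labels[i] != targetClass)
            continue;  // keypoint belongs to another class
        found = true;
        minX = std::min(minX, pts[i].first);
        maxX = std::max(maxX, pts[i].first);
        minY = std::min(minY, pts[i].second);
        maxY = std::max(maxY, pts[i].second);
    }
    if (!found) { Box none = {0, 0, 0, 0}; return none; }
    Box b = { minX, minY, maxX - minX + 1, maxY - minY + 1 };
    return b;
}
```

One caveat of this simple min/max box: a single misclassified keypoint far from the object inflates the box, so some outlier filtering may still be needed.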

Anyway, you helped me really out with your support! Thanks again for that! Cheers!

B-Man (2014-03-15 20:52:39 -0500)

Glad I could help! Your idea of taking the ROI around a keypoint is absolutely valid. However, there are two reasons why your code from above wouldn't work:

  1. The vocabulary was trained with local features, not with ROIs, so you would need to change that somehow; the BoW extractor needs the same kind of input features.

  2. The BoW extractor of OpenCV needs to have keypoints and corresponding descriptors (i.e. your line bowDE.compute(img2(ROI), keypoint2, bowDescriptor2) wouldn't work). You could maybe generate new keypoints around the found ones and encode and classify the corresponding features.
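The "new keypoints around the found ones" idea could look like the sketch below. It is plain C++ that only produces the grid of sample positions (the ROI radius and grid step are arbitrary choices, and the function name is made up); each position would then be turned into a fixed-size cv::KeyPoint before calling bowDE.compute() on them:

```cpp
#include <vector>
#include <utility>
#include <algorithm>

// Dense grid of (x, y) sample positions inside a square ROI centred on a
// found keypoint at (cx, cy), clamped to the image borders. Each position
// would become a cv::KeyPoint of fixed size, so that bowDE.compute()
// receives the same kind of local features the vocabulary was trained on.
std::vector<std::pair<int, int> > denseGrid(int cx, int cy, int radius,
                                            int step, int imgWidth, int imgHeight)
{
    std::vector<std::pair<int, int> > grid;
    const int x0 = std::max(0, cx - radius);
    const int x1 = std::min(imgWidth - 1, cx + radius);
    const int y0 = std::max(0, cy - radius);
    const int y1 = std::min(imgHeight - 1, cy + radius);
    for (int y = y0; y <= y1; y += step)
        for (int x = x0; x <= x1; x += step)
            grid.push_back(std::make_pair(x, y));  // one sample per grid node
    return grid;
}
```

This is essentially what OpenCV's 'Dense' detector does over a whole image, restricted here to a ROI around an already-found keypoint.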

Good luck!

Guanta (2014-03-16 06:04:58 -0500)

Hi Guanta! Thanks for your input!

I'm a little confused by your first point. I thought that for each detected keypoint (which is just one pixel) an area around the keypoint is considered to build the descriptor/feature vector (e.g. 16x16 for SIFT). That's why I chose a ROI of 20x20 at first...

Regarding your second point, would this be a valid approach to just detect the keypoints?

detector.detect(img2, keypoint2);

vector<KeyPoint> keypoint3;

for (int i = 0; i < (int)keypoint2.size(); i++) {
    Rect ROI = Rect(keypoint2[i].pt.x - 10, keypoint2[i].pt.y - 10, 20, 20);
    detector.detect(img2(ROI), keypoint3);
    bowDE.compute(img2(ROI), keypoint3, bowDescriptor2);
    float response = svm.predict(bowDescriptor2);
}



B-Man (2014-03-17 18:52:13 -0500)

Keypoints don't just encode a point and a size; in most cases they also carry rotation and scale information, which is what makes the descriptor invariant against these transformations! Anyway, I wanted to point something else out: you trained the vocabulary e.g. with SURF features, but then you called bowDE with an image ROI, not with SURF features, which wouldn't work.

About your last code: it would only work if you chose a different detector, like 'Dense', for your second detect. Otherwise the same keypoints as before would be detected and no information would be gained.

Guanta (2014-03-18 04:37:28 -0500)

Ok, thanks again for your input and help!

You clarified lots of things for me :-)


B-Man (2014-03-18 18:06:46 -0500)


Seen: 1,896 times

Last updated: Mar 13 '14