
Bag of features with dense SIFT and SVM - understanding and implementation

asked 2018-11-28 14:04:29 -0500

karollo

updated 2018-11-29 11:29:15 -0500

My aim is to detect some underwater objects, a badminton racket among others. I have over 160 images of this racket lying underwater. I created binary masks for the racket (the object I want to detect), and from those racket masks I derived masks for the underwater scenery (rocks, leaves, etc., objects I don't want to detect). Now I want to use BOF with dense SIFT. What I intend to do:

  1. Create a visual dictionary: compute dense SIFT on each image, first applying the racket mask and then the background mask (so on each image I calculate SIFT twice: once for the object I want to detect, the racket, and once for all other underwater objects).
  2. With the dictionary built, I have to calculate my SVM training data. So once again, for every image I calculate SIFT applying my object mask (label 1) and applying the background mask (label 0), and in each case I calculate the frequency (histogram) of visual words from the dictionary.
  3. Object recognition: this part is tricky for me. My trained SVM knows the frequencies of dictionary visual words for the racket (label 1) and the background (label 0). Now I have an image I want to test my SVM on: a racket lying underwater among some rocks and other things. When I put that data into my SVM, it will see "both frequencies of visual words", because the image contains my racket and background as well. How can I prevent that? My idea is to split the image I want to classify into several (10-50) regions, calculate dense SIFT on each region, and then run the SVM prediction per region.
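For readers following step 2, here is a minimal sketch of how a region's descriptors could be turned into a visual-word histogram. It is not the poster's code: the function names, the tiny 2-D "descriptors", and the 3-word vocabulary are made up for illustration (real SIFT descriptors have 128 dimensions, and the vocabulary would come from clustering).

```python
import math

def nearest_word(descriptor, vocabulary):
    """Return the index of the vocabulary entry closest to `descriptor`."""
    best_index, best_dist = 0, math.inf
    for index, word in enumerate(vocabulary):
        # Squared Euclidean distance is enough for picking the nearest word.
        dist = sum((d - w) ** 2 for d, w in zip(descriptor, word))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def bow_histogram(descriptors, vocabulary):
    """Return the L1-normalised frequency of visual words for one region."""
    counts = [0.0] * len(vocabulary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, vocabulary)] += 1.0
    total = sum(counts)
    # An empty region yields an all-zero histogram instead of dividing by zero.
    return [c / total for c in counts] if total else counts

# Hypothetical toy data: 3 visual words, 4 descriptors from one region.
vocabulary = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]]
region_descriptors = [[1.0, 0.5], [9.0, 11.0], [0.2, 0.1], [1.0, 9.0]]
hist = bow_histogram(region_descriptors, vocabulary)
```

One such histogram per region (together with its 0/1 label) would form one training row for the SVM.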

Am I right, or have I misunderstood something about the BOF method? If I am wrong, how can I achieve my goal? Once again, I have 160 sets of images at my disposal (original frame, mask of the racket, mask of the background). Below is an example of one image set:

My racket: [image]

Racket mask: [image]

Mask for the background: [image]

I tried detecting it with SIFT descriptors, but there was a lot of noise in the output image:

[image]

Then I tried BOW: I created a dictionary (dense SIFT on whole images, 5 px size), then divided the input images into 100 regions. If a region was situated on the racket mask, I calculated its dense SIFT against my vocabulary (input data to the SVM, label 1), and if a region was situated on the background mask, I did the same (calculate dense SIFT on the region, measure the frequency of visual words, and label it 0).

When testing my BOW, I divided the test image into 100 regions and did the same thing as in training (dense SIFT, compared against the dictionary). Here is my miserable result:

[image]

You can see the shape of the racket, but there are many errors.

Any idea how I can improve my algorithm? If I wasn't clear ...



what is a "racket" ? (you're not playing tennis underwater, or are you ?) can you add an example image ?

berak ( 2018-11-28 23:49:56 -0500 )

I have updated my question, please have a look :)

karollo ( 2018-11-29 11:29:38 -0500 )

1 answer


answered 2018-11-30 02:42:25 -0500

berak

updated 2018-11-30 02:46:19 -0500

  1. to build a good BOW dictionary, you'd need ~10000 clusters (built from a large number of images), and you only have 160 images (that's a laugh in ML). your masks are also somewhat useless here (too artificial). VLAD descriptors would need far fewer clusters, but none of it seems feasible, given how little data you have.

  2. again, when testing later on real-world data, you can't "mask" anything, so the approach with the masks is useless (without a previous segmentation step).

  3. "My idea is to segment image I want to classify on several (10-50)" -- no, you can't chop it up into 5x5 pixel tiles. rather use a sliding window approach, with a window of the minimum size of the expected racket (or even a pyramid scheme, to adapt to size).
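To make the sliding window and pyramid idea from point 3 concrete, here is a rough sketch that only enumerates window positions (no SIFT, no SVM). The 96x54 window comes from the thread; the 960x540 frame size, the function names, and the pyramid scale factor are assumptions for illustration.

```python
def sliding_windows(image_w, image_h, win_w, win_h, step_x, step_y):
    """Yield (x, y, w, h) for every window that fits fully inside the image."""
    for y in range(0, image_h - win_h + 1, step_y):
        for x in range(0, image_w - win_w + 1, step_x):
            yield (x, y, win_w, win_h)

def pyramid_sizes(image_w, image_h, win_w, win_h, scale=1.5):
    """Yield shrinking image sizes until the window no longer fits."""
    w, h = image_w, image_h
    while w >= win_w and h >= win_h:
        yield (w, h)
        w, h = int(w / scale), int(h / scale)

# Assumed frame size (not stated in the thread); window size from the thread.
frame_w, frame_h = 960, 540
win_w, win_h = 96, 54

# Non-overlapping grid: the step equals the window size.
grid = list(sliding_windows(frame_w, frame_h, win_w, win_h, win_w, win_h))

# Dense sliding window: a small step, so neighbouring windows overlap heavily.
dense = list(sliding_windows(frame_w, frame_h, win_w, win_h, 8, 8))

# Pyramid: the same fixed window is slid over each downscaled frame,
# so larger rackets are covered at the coarser levels.
levels = list(pyramid_sizes(frame_w, frame_h, win_w, win_h, scale=2.0))
```

A fixed-size window run over every pyramid level can then be classified with the same histogram-plus-SVM pipeline at each position; detections from coarser levels are mapped back by multiplying coordinates by the accumulated scale.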

in the end -- you tried a lot and even got quite far (cool !!), but i think this is the wrong bus.

rather consider retraining a one-shot detector with your images and annotated bounding boxes. this can even be done with as little data as you have!
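Since binary racket masks already exist, bounding-box annotations could plausibly be derived from them rather than drawn by hand. The following is only a sketch under that assumption: `mask_to_bbox` is a hypothetical helper, the mask is a plain nested list of 0/1 values, and the output format a particular detector expects is not specified in this thread.

```python
def mask_to_bbox(mask):
    """Return (x_min, y_min, x_max, y_max), inclusive, of all set pixels.

    `mask` is a list of rows containing 0/1 values. Returns None if the
    mask is empty (no object in the frame).
    """
    xs, ys = [], []
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))

# Hypothetical 5x4 mask with a small blob.
toy_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
bbox = mask_to_bbox(toy_mask)
```

The resulting box would still need converting to whatever annotation format the chosen detector's training tools require.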



Thank you for your reply.
1. I have 160 images of the racket. I also have about 400 images of tires and 300 images of some metal cylinders (but for now I am training with only one scene, the racket).
2. You mentioned a sliding window: isn't that what I am doing right now? When training my SVM I create 96x54 pixel rectangles, perform dense SIFT on them (with 5 px descriptor size), calculate the visual word frequencies against my dictionary, and then move to the next rectangle. You said the rectangle size has to be the minimum size of the expected racket: why is that? Can't I just make some small rectangles, train my SVM on the "pieces" of the racket, and then test it with the same sliding rectangle?
3. You said the approach is useless without segmentation? But I am segmenting with my rectangles, aren't I?

karollo ( 2018-11-30 02:59:08 -0500 )

And I don't think my masks are useless, because when I slide my rectangle over the image, I have to know whether I am on a piece of the racket or not. Am I totally mistaken?

karollo ( 2018-11-30 03:01:15 -0500 )
  • "In training my SVM I create 96x54 pixel rectangles, perform Dense sift on them( with 5px descriptor size)" -- i might have misread it then, but the last image looked more like a 5x5 grid

  • "i divided test Image into 100 regions" -- that's the part i mean. that's a grid, not a sliding window (which would only move a small distance, like 1 or 2 pixels, and overlap with the previous positions)

  • how do you plan to use the masks while predicting?

berak ( 2018-11-30 03:13:31 -0500 )

I don't. They are for training, to know whether my rectangle is on the racket (label 1), on the background mask (label 0), or on neither (then it's not taken into account). Is it really not going to work this way? ;p My rectangles do overlap, but only by half their size: the initial rectangle is at 0,0 (96x54), then it's at 48,27. To me that's just the same as your sliding idea, but with smaller images. Isn't it? I mean, I can make them overlap more, is that what you mean?

karollo ( 2018-11-30 03:31:07 -0500 )
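The training-time labelling described in the last comment (1 on the racket, 0 on the background, skip otherwise) could look roughly like the sketch below. The thread does not say how "on the mask" is decided, so the 50% coverage threshold `min_cover` is an assumption, as are the function names and the tiny toy masks.

```python
def coverage(mask, x, y, w, h):
    """Fraction of the w-by-h window at (x, y) where the binary mask is set."""
    covered = sum(1 for row in mask[y:y + h] for value in row[x:x + w] if value)
    return covered / float(w * h)

def label_window(racket_mask, background_mask, x, y, w, h, min_cover=0.5):
    """Return 1 (racket), 0 (background), or None (window is skipped)."""
    if coverage(racket_mask, x, y, w, h) >= min_cover:
        return 1
    if coverage(background_mask, x, y, w, h) >= min_cover:
        return 0
    return None

# Toy 8x4 frame: racket in columns 0-2, background in columns 5-7,
# columns 3-4 belong to neither mask.
W, H = 8, 4
racket_mask = [[1 if c <= 2 else 0 for c in range(W)] for _ in range(H)]
background_mask = [[1 if c >= 5 else 0 for c in range(W)] for _ in range(H)]

# 4x4 windows with a half-window horizontal step, as in the comment above.
win_w, win_h = 4, 4
step_x = win_w // 2
labels = [
    label_window(racket_mask, background_mask, x, 0, win_w, win_h)
    for x in range(0, W - win_w + 1, step_x)
]
```

Only the masks' role in training is shown here; at prediction time no mask is available, so every window position has to be classified.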
