Bag of features with dense SIFT and SVM - understanding and implementation

My aim is to detect an underwater object, a badminton racket, among other things. I have over 160 images of this racket lying underwater. I have created binary masks for the racket (the object I want to detect) and, based on those racket masks, derived masks for the underwater scenery (rocks, leaves, etc., objects I don't want to detect). Now I want to use BoF with dense SIFT. What I intend to do:

  1. Create a visual dictionary: compute dense SIFT on every image, applying first the racket mask and then the background mask (so on each image I compute SIFT twice, once for the object I want to detect (the racket) and once for all the other underwater objects); a rough sketch of how I picture this step follows the list.
  2. With the dictionary in hand, compute my SVM training data: once again, for every image I compute dense SIFT under the object mask (labelled 1) and under the background mask (labelled 0), and from each I build the frequency histogram of visual words over the dictionary.
  3. Object recognition: this part is tricky for me. My trained SVM knows the visual-word frequencies for the racket (label 1) and for the background (label 0). Now I have an image I want to test my SVM on: a racket lying underwater among some rocks and other things. If I feed the whole image to the SVM, the histogram mixes both kinds of visual words, because the image contains the racket as well as the background, so the classifier effectively sees both things. How can I prevent that? My idea is to segment the image I want to classify into several (10-50) regions, compute dense SIFT on each region, and run the SVM prediction per region.

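To make step 1 concrete, here is a minimal sketch of how I picture the dictionary building in OpenCV's Python bindings (assuming OpenCV ≥ 4.4 so that SIFT lives in the main module; the 5 px grid step, the vocabulary size of 200 and the `training_triplets` list of file paths are placeholder assumptions on my side):

```python
import cv2
import numpy as np

def dense_keypoints(image, step=5):
    # OpenCV no longer ships a dense detector, so build a regular grid of keypoints
    h, w = image.shape[:2]
    return [cv2.KeyPoint(float(x), float(y), float(step))
            for y in range(step, h - step, step)
            for x in range(step, w - step, step)]

sift = cv2.SIFT_create()
bow_trainer = cv2.BOWKMeansTrainer(200)              # vocabulary size is a guess

# training_triplets: hypothetical list of (frame, racket mask, background mask) paths
for img_path, racket_mask_path, bg_mask_path in training_triplets:
    img = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
    for mask_path in (racket_mask_path, bg_mask_path):
        mask = cv2.imread(mask_path, cv2.IMREAD_GRAYSCALE)
        # keep only the grid points that fall inside the current binary mask
        kps = [kp for kp in dense_keypoints(img)
               if mask[int(kp.pt[1]), int(kp.pt[0])] > 0]
        if not kps:
            continue
        _, desc = sift.compute(img, kps)
        if desc is not None:
            bow_trainer.add(desc)

vocabulary = bow_trainer.cluster()                    # k-means over all collected descriptors
```
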
Am I right, or have I misunderstood something about this BoF method? If I am wrong, how can I achieve my goal? Once again, I have at my disposal 160 sets of images (original frame, mask of the racket, mask of the background). Below is an example from my image set:

My racket: [image]

Racket mask: [image]

Mask for the background: [image]

I first tried detecting it with plain SIFT descriptor matching, but there was a lot of noise in the output image:

SIFT result: [image]
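
For context, the kind of SIFT matching I mean is the usual ratio-test matching against a racket template, something along these lines (the file names and the 0.75 ratio are placeholders, and again I assume OpenCV ≥ 4.4 where `SIFT_create` is in the main module):

```python
import cv2

sift = cv2.SIFT_create()
bf = cv2.BFMatcher(cv2.NORM_L2)

template = cv2.imread("racket_template.png", cv2.IMREAD_GRAYSCALE)   # example paths
scene = cv2.imread("underwater_frame.png", cv2.IMREAD_GRAYSCALE)

kp1, des1 = sift.detectAndCompute(template, None)
kp2, des2 = sift.detectAndCompute(scene, None)

# Lowe's ratio test to throw away ambiguous matches
matches = bf.knnMatch(des1, des2, k=2)
good = [pair[0] for pair in matches
        if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance]

result = cv2.drawMatches(template, kp1, scene, kp2, good, None)
```
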

Then I tried to use BoW: I created the dictionary (dense SIFT over whole images, 5 px step), then divided the input images into 100 regions. If a region lay on the racket mask, I computed its dense SIFT against my vocabulary and used the resulting visual-word histogram as SVM input with label 1; if a region lay on the background mask, I did the same (compute dense SIFT on the region, measure the frequency of visual words) and labelled it 0.
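
In case it helps, this is roughly how I build the per-region training samples and the SVM, continuing the dictionary sketch above (the 10x10 grid, the "any overlap with the mask" rule and the RBF kernel are again placeholder choices on my side, not settled parameters):

```python
import cv2
import numpy as np
# reuses `sift`, `dense_keypoints`, `vocabulary` and `training_triplets` from the sketch above

bow_extractor = cv2.BOWImgDescriptorExtractor(sift, cv2.FlannBasedMatcher())
bow_extractor.setVocabulary(vocabulary)

def region_samples(img, mask, label, grid=10):
    # split the image into grid x grid regions; every region that overlaps the mask
    # yields one visual-word histogram with the given label
    h, w = img.shape[:2]
    rh, rw = h // grid, w // grid
    samples, labels = [], []
    for ry in range(grid):
        for rx in range(grid):
            roi = img[ry * rh:(ry + 1) * rh, rx * rw:(rx + 1) * rw]
            roi_mask = mask[ry * rh:(ry + 1) * rh, rx * rw:(rx + 1) * rw]
            if cv2.countNonZero(roi_mask) == 0:
                continue
            kps = dense_keypoints(roi)
            hist = bow_extractor.compute(roi, kps) if kps else None
            if hist is not None:
                samples.append(hist[0])
                labels.append(label)
    return samples, labels

train_data, train_labels = [], []
for img_path, racket_mask_path, bg_mask_path in training_triplets:
    img = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
    for mask_path, label in ((racket_mask_path, 1), (bg_mask_path, 0)):
        mask = cv2.imread(mask_path, cv2.IMREAD_GRAYSCALE)
        s, l = region_samples(img, mask, label)
        train_data += s
        train_labels += l

svm = cv2.ml.SVM_create()
svm.setKernel(cv2.ml.SVM_RBF)                         # kernel choice is a guess
svm.train(np.float32(train_data), cv2.ml.ROW_SAMPLE, np.int32(train_labels))
```
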

When testing my BoW, I divided the test image into 100 regions and did the same thing as during training (dense SIFT, matched against the dictionary). Here is my miserable result:

BoW result: [image]

You can see the shape of the racket, but there are many errors.
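
For reference, my test-time loop is essentially the same per-region procedure, reusing the hypothetical `bow_extractor`, `svm` and `dense_keypoints` from the sketches above, and giving one 0/1 prediction per region that I then paint back over the image:

```python
import cv2
import numpy as np
# reuses `bow_extractor`, `svm` and `dense_keypoints` from the sketches above

test_img = cv2.imread("test_frame.png", cv2.IMREAD_GRAYSCALE)    # example path
grid = 10
h, w = test_img.shape[:2]
rh, rw = h // grid, w // grid
label_map = np.zeros((grid, grid), np.uint8)

for ry in range(grid):
    for rx in range(grid):
        roi = test_img[ry * rh:(ry + 1) * rh, rx * rw:(rx + 1) * rw]
        kps = dense_keypoints(roi)
        hist = bow_extractor.compute(roi, kps) if kps else None
        if hist is None:
            continue
        _, pred = svm.predict(np.float32(hist))
        label_map[ry, rx] = int(pred[0, 0])           # 1 = racket region, 0 = background
```
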

Any idea how I can improve my algorithm? If anything is unclear, let me know and I will try to explain it better.