What options are there for Dimensionality Reduction of HoG Descriptors

answered 2016-08-02 03:44:52 -0600

Guanta

You can use PCA to reduce the dimensionality of your descriptors. Note that it won't reduce the number of your descriptors.
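
As a rough illustration of what PCA does here, the following is a minimal NumPy sketch rather than a specific OpenCV call. The random matrix is only a stand-in for real descriptors, and the names `fit_pca` and `apply_pca`, the 36-D input and the 8 output components are assumptions made for the example. Note that the number of rows (descriptors) stays the same; only the number of columns shrinks.

```python
import numpy as np

def fit_pca(descriptors, n_components):
    """Learn a PCA basis from an (N, D) descriptor matrix."""
    mean = descriptors.mean(axis=0)
    centered = descriptors - mean
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(descriptors, mean, components):
    """Project (N, D) descriptors down to (N, n_components)."""
    return (descriptors - mean) @ components.T

rng = np.random.default_rng(0)
descriptors = rng.random((500, 36))  # stand-in for 500 local descriptors
mean, components = fit_pca(descriptors, n_components=8)
reduced = apply_pca(descriptors, mean, components)
print(descriptors.shape, "->", reduced.shape)
```

The basis should be learned once on training descriptors and then reused unchanged for test descriptors.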

For training k-means: you don't need to feed all your descriptors to k-means for BoW tasks. A typical strategy is to train on around 100k descriptors drawn at random from about 1k representative images (i.e. images covering all classes), so you only need to take 100 random samples per selected image.
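
The sampling step could look roughly like this. It is a sketch under assumptions: `sample_for_kmeans` is a hypothetical helper, the per-image descriptors are random stand-ins, and the example uses 20 images instead of 1k to stay small.

```python
import numpy as np

def sample_for_kmeans(per_image_descriptors, per_image=100, seed=0):
    """Draw up to `per_image` random rows from each image's (N_i, D) matrix."""
    rng = np.random.default_rng(seed)
    picked = []
    for desc in per_image_descriptors:
        if len(desc) == 0:
            continue  # image produced no descriptors
        k = min(per_image, len(desc))
        idx = rng.choice(len(desc), size=k, replace=False)
        picked.append(desc[idx])
    return np.vstack(picked)

# Stand-in data: 20 images, each with between 150 and 399 descriptors of 36-D.
rng = np.random.default_rng(1)
images = [rng.random((int(n), 36)) for n in rng.integers(150, 400, size=20)]
training_set = sample_for_kmeans(images, per_image=100)
print(training_set.shape)
```

Because the function only iterates over its input, you can pass a generator that loads one image at a time, so the full descriptor set never has to sit in memory.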


Comments

Hi Guanta,

Thanks for responding. My source images are video frames, so wouldn't they have a lot of similar descriptors, since only slight changes occur between frame x and frame x+1? If you sample the descriptors, isn't there a possibility that the 100 samples per image would not be representative of the image detail and so could not be used to discriminate between images of different classes?

mobmsc ( 2016-08-02 06:44:07 -0600 )

For the actual BoW descriptor you need to take all descriptor samples of your image, but for k-means it is enough to take a representative selection from all classes. E.g. if you want to detect whether there is a pizza in your video stream, you need training images with pizza and without. From all of these you extract a random 100k descriptors, which you feed to k-means, and you train k-means with e.g. 1000 clusters. In the actual encoding step you take all descriptors of each training image, so that you get a good BoW descriptor which you can then use to train a classifier.

Guanta ( 2016-08-02 13:33:19 -0600 )

Thanks for the small light at the end of the tunnel regarding k-means. Any advice for dealing with the descriptors prior to BoW? Using Python, I'm running into memory issues due to their number and size.

mobmsc ( 2016-08-02 15:19:41 -0600 )

At which step do you run into memory problems? For k-means you only need a selection, as already pointed out. When you process each frame individually you may get a lot of descriptors, but they are not very high dimensional. In the case of SIFT you might get 2000 SIFT descriptors, each of which is 128D. Having trained a visual vocabulary with 1000 clusters, you end up with a 1000-dimensional BoW descriptor for this image frame. You train your classifier, e.g. a linear SVM, with all BoW descriptors of your training set. At test time you only need to evaluate your SVM on the BoW descriptor, which again is not very RAM-intensive. So, which step needs so much RAM?
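
To make the sizes concrete, here is a minimal NumPy sketch of the encoding step with the numbers from above (2000 descriptors of 128D, 1000 visual words). The vocabulary is a random stand-in rather than real k-means centres, `bow_histogram` is a hypothetical helper, and hard nearest-word assignment is assumed.

```python
import numpy as np

def bow_histogram(descriptors, vocabulary):
    """Encode (N, D) local descriptors as a normalized histogram over K words."""
    # Squared Euclidean distances via ||a||^2 - 2ab + ||b||^2, shape (N, K).
    d2 = (
        (descriptors ** 2).sum(axis=1)[:, None]
        - 2.0 * descriptors @ vocabulary.T
        + (vocabulary ** 2).sum(axis=1)[None, :]
    )
    words = d2.argmin(axis=1)  # nearest visual word per descriptor
    hist = np.bincount(words, minlength=len(vocabulary)).astype(np.float32)
    return hist / len(descriptors)

rng = np.random.default_rng(0)
descriptors = rng.random((2000, 128), dtype=np.float32)  # one frame's SIFT stand-in
vocabulary = rng.random((1000, 128), dtype=np.float32)   # stand-in for k-means centres
bow = bow_histogram(descriptors, vocabulary)
print(descriptors.nbytes, "bytes of local descriptors ->", bow.nbytes, "bytes of BoW")
```

Once a frame is encoded, its local descriptors can be discarded; only the 1000-D vector has to be kept for the classifier.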

Guanta ( 2016-08-03 03:35:13 -0600 )

Could I ask what you mean by 128D? When I run HoG I get a flattened array of shape (251328,) for each frame/image, and at 25 frames per second it soon adds up if the video sequence is long. My understanding (possibly incorrect) is that you use k-means to define the clusters based on the descriptors, then each cluster label becomes a "word" in the corpus/bag of words, and each image is then annotated with the cluster labels present in that image. You then train a Random Forest or other suitable classifier on the annotations. Is that correct?
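
For a sense of scale, a back-of-the-envelope calculation of how quickly those flattened arrays add up. This assumes every frame's array is kept in memory and that values are stored as float64 (8 bytes) or float32 (4 bytes); the actual dtype depends on the HoG implementation.

```python
FEATURES_PER_FRAME = 251328  # flattened HoG length quoted above
FPS = 25

def bytes_per_minute(bytes_per_value):
    """Memory needed to keep one minute of per-frame HoG arrays."""
    return FEATURES_PER_FRAME * bytes_per_value * FPS * 60

for name, size in (("float64", 8), ("float32", 4)):
    gb = bytes_per_minute(size) / 1e9
    print(f"{name}: {gb:.2f} GB per minute of video")
# float64: 3.02 GB per minute of video
# float32: 1.51 GB per minute of video
```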

mobmsc ( 2016-08-03 05:56:04 -0600 )

Each SIFT descriptor is 128-dimensional. A HOG descriptor is a 36-dimensional (or 31-dimensional, depending on the implementation) descriptor. If you use HOG for fixed-size images then you actually don't need BoW, since you already have a global descriptor. BoW is meant to generate a global descriptor from many local ones. So you have to decide now. Either go the BoW way, in which case you have to cluster and encode an (N,36) matrix (where N is the number of HoG descriptors) to get a global descriptor. Or use the flattened HoG descriptor as your global descriptor, which you can feed directly to your classifier.
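
The two options differ only in how the same flattened array is viewed. A small sketch, assuming the flattened vector stores the block descriptors one after another; `as_local_descriptors` is a hypothetical helper, and the block length is a parameter you have to take from your own HoG settings.

```python
import numpy as np

def as_local_descriptors(flat_hog, block_dim):
    """View a flattened HoG vector as an (N, block_dim) matrix of local descriptors."""
    if flat_hog.size % block_dim != 0:
        raise ValueError(
            f"length {flat_hog.size} is not a multiple of block_dim={block_dim}"
        )
    return flat_hog.reshape(-1, block_dim)

flat = np.arange(10 * 36, dtype=np.float32)  # stand-in for one frame's flattened HoG

# Option 1 (BoW): treat the frame as N local descriptors to cluster and encode.
local = as_local_descriptors(flat, block_dim=36)
print(local.shape)

# Option 2 (global): feed the flattened vector to the classifier as it is.
global_descriptor = flat
print(global_descriptor.shape)
```

The length 251328 mentioned earlier is not a multiple of 36 (it is, for example, a multiple of 32), so the block length for that particular HoG configuration has to be checked against the parameters actually used.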

Guanta ( 2016-08-03 07:33:01 -0600 )

What options would there be to scale the BoW (could I run in parallel and combine the output)? Also, wouldn't I need to aggregate the descriptor output of the frames if it was the action/sequence I was trying to train on, not specific image elements? Taking your pizza example from earlier: instead of trying to identify a pizza, the aim would be to identify someone making a pizza as opposed to cooking a steak.

mobmsc ( 2016-08-03 17:03:47 -0600 )

I don't know much about action recognition. Two ideas, of which I prefer the second:

BoW of BoW descriptors: if you reduce the dimensionality of the high-dimensional BoW descriptors with PCA, you could aggregate them again via another BoW on top.
Another idea would be to compute a single BoW descriptor over several frames. Since you don't need to have all descriptors at once to create the BoW descriptor, this is actually no problem: you just need to sum up all BoW descriptors of a scene. (Take care to revert the normalization step of BOWImgDescriptorExtractor::compute(), i.e. multiply the img-descriptor (= BoW descriptor) by the number of local descriptors again before summing, and only normalize again, by the total number of local descriptors, right before classification.)
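
A minimal NumPy sketch of the second idea. It assumes each per-frame BoW descriptor was normalized by dividing the word counts by that frame's number of local descriptors, and that you recorded that number per frame. `combine_bow` is a hypothetical helper, not an OpenCV function.

```python
import numpy as np

def combine_bow(frame_bows, frame_counts):
    """Merge per-frame normalized BoW descriptors into one scene descriptor."""
    bows = np.asarray(frame_bows, dtype=np.float64)      # shape (F, K)
    counts = np.asarray(frame_counts, dtype=np.float64)  # shape (F,)
    # Undo the per-frame normalization to recover raw word counts, then sum.
    word_counts = (bows * counts[:, None]).sum(axis=0)
    # Normalize once, by the total number of local descriptors in the scene.
    return word_counts / counts.sum()

# Toy example with a 3-word vocabulary and two frames.
frame_bows = [[0.5, 0.5, 0.0],     # frame 1: 2 local descriptors
              [0.0, 0.25, 0.75]]   # frame 2: 4 local descriptors
frame_counts = [2, 4]
scene = combine_bow(frame_bows, frame_counts)
print(scene)
```

Since only the running sum of word counts and the running total of descriptors are needed, frames can be processed independently (also in parallel) and merged afterwards.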

Guanta ( 2016-08-04 06:59:37 -0600 )
