
What options are there for Dimensionality Reduction of HoG Descriptors

asked 2016-08-01 16:46:26 -0500

mobmsc

I have a very large number of HoG descriptors for 960x540 images. Are there any recommendations for reducing the dimensionality of these descriptors so that the dataset becomes more manageable for BoW and k-means? And what might the trade-off against predictive accuracy look like?


1 answer


answered 2016-08-02 03:44:52 -0500

Guanta

You can use PCA to reduce the dimensionality of your descriptors. Note that it won't reduce the number of your descriptors.
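
As a rough illustration of the PCA suggestion, here is a minimal NumPy-only sketch. The descriptor matrix is random stand-in data, and the choice of 36 input dimensions, 16 output dimensions, and the helper names `fit_pca` and `apply_pca` are assumptions made for the example, not anything from the thread.

```python
import numpy as np

def fit_pca(descriptors, n_components):
    """Fit PCA on an (N, D) array; return the mean and the top components."""
    mean = descriptors.mean(axis=0)
    centered = descriptors - mean
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(descriptors, mean, components):
    """Project (N, D) descriptors onto the fitted components -> (N, n_components)."""
    return (descriptors - mean) @ components.T

# Stand-in data: 500 local descriptors of 36 dimensions each (hypothetical sizes).
rng = np.random.default_rng(0)
descriptors = rng.random((500, 36)).astype(np.float32)

mean, components = fit_pca(descriptors, n_components=16)
reduced = apply_pca(descriptors, mean, components)
print(descriptors.shape, "->", reduced.shape)
```

The number of rows is unchanged, which matches the note above: PCA shrinks each descriptor, not the descriptor count. The same `mean` and `components` would have to be reused for every frame at training and test time.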

For training k-means: you don't need to feed all your descriptors to k-means for BoW tasks. A typical strategy is to use around 100k descriptors, randomly taken from about 1k representative images (i.e. images covering all classes), so you only need around 100 random samples per selected image.
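
The sampling strategy could be sketched as follows. This is only an illustration: the helper `sample_descriptors` is hypothetical, and the per-image descriptor arrays are random stand-ins for whatever your feature extractor returns.

```python
import numpy as np

def sample_descriptors(per_image_descriptors, per_image=100, seed=0):
    """Draw up to `per_image` random rows from each image's (N_i, D) array."""
    rng = np.random.default_rng(seed)
    picked = []
    for desc in per_image_descriptors:
        k = min(per_image, len(desc))  # small images contribute what they have
        idx = rng.choice(len(desc), size=k, replace=False)
        picked.append(desc[idx])
    return np.vstack(picked)

# Stand-in data: 10 images with 2000 descriptors of 36 dimensions each.
rng = np.random.default_rng(1)
images = [rng.random((2000, 36)).astype(np.float32) for _ in range(10)]

kmeans_input = sample_descriptors(images, per_image=100)
print(kmeans_input.shape)
```

With 1k images instead of 10, the same call would yield roughly the 100k rows mentioned above. For video, it probably makes sense to pick the images from well-separated frames so that near-duplicate frames do not dominate the sample.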



Hi Guanta,

Thanks for responding. My source images are video frames, so wouldn't they have a lot of similar descriptors, since only slight changes occur between frame x and frame x+1? And if you sample the descriptors, isn't there a possibility that the 100 samples per image would not be representative of the image detail, and so could not be used to discriminate between images of different classes?

mobmsc (2016-08-02 06:44:07 -0500)

For the actual BoW descriptor you need to take all descriptor samples of your image, but for k-means it is enough to take a representative selection from all classes. E.g. if you want to detect whether or not there is a pizza in your video stream, you need training images with pizza and without. From all of these you randomly sample 100k descriptors, which you feed to k-means, and then you train k-means with e.g. 1000 clusters. In the actual encoding step you need to take all descriptors from each training image so that you get a good BoW descriptor, which you can then use to train a classifier.

Guanta (2016-08-02 13:33:19 -0500)
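
To make the two stages concrete (vocabulary from a sample, encoding with all descriptors of a frame), here is a small NumPy-only sketch. The toy `kmeans` below is a plain Lloyd's iteration written for the example; in practice you would more likely use something like `cv2.kmeans` or scikit-learn's `MiniBatchKMeans`. The sizes (50 clusters, 36-D descriptors) are assumptions chosen to keep the example fast, not the 1000 clusters suggested above.

```python
import numpy as np

def assign(data, centers):
    """Index of the nearest center for every row of `data`."""
    d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(data, k, n_iter=20, seed=0):
    """Toy Lloyd's k-means; returns a (k, D) array of cluster centers."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(data, centers)
        for j in range(k):
            members = data[labels == j]
            if len(members):  # keep the old center if a cluster goes empty
                centers[j] = members.mean(axis=0)
    return centers

def bow_encode(descriptors, centers):
    """Normalized histogram of visual-word counts for one image."""
    labels = assign(descriptors, centers)
    hist = np.bincount(labels, minlength=len(centers)).astype(np.float32)
    return hist / hist.sum()

rng = np.random.default_rng(0)
sampled = rng.random((2000, 36)).astype(np.float32)   # sample used for k-means only
vocabulary = kmeans(sampled, k=50)

frame_descriptors = rng.random((300, 36)).astype(np.float32)  # ALL descriptors of one frame
bow = bow_encode(frame_descriptors, vocabulary)
print(bow.shape, float(bow.sum()))
```

Each frame ends up as a single vector whose length equals the vocabulary size, regardless of how many local descriptors the frame produced.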

Thanks for the small light at the end of the tunnel regarding k-means. Any advice for dealing with the descriptors prior to BoW? I'm using Python and running into memory issues due to the number and size of the descriptors.

mobmsc (2016-08-02 15:19:41 -0500)

At which step do you run into memory problems? For k-means you only need a selection, as already pointed out. When you process each frame individually you may get a lot of descriptors, but they are not very high dimensional. In the case of SIFT you might get 2000 SIFT descriptors, each of which is 128D. Having trained a visual vocabulary with 1000 clusters, you end up with a 1000-dimensional BoW descriptor for this image frame. You train your classifier, e.g. a linear SVM, with all BoW descriptors of your training set. At test time you only need to evaluate your SVM with the BoW descriptor, which again is not very RAM-intensive. So, for which step do you need so much RAM?

Guanta (2016-08-03 03:35:13 -0500)
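
A quick back-of-the-envelope check of the sizes discussed in this thread may help. The dtypes are assumptions (float32 for SIFT and BoW vectors, float64 for a flattened HoG array); the shapes and the 25 fps rate are the figures quoted in the comments, and `nbytes` is just a helper defined here.

```python
import numpy as np

def nbytes(shape, dtype):
    """Bytes needed to store an array of the given shape and dtype."""
    return int(np.prod(shape)) * np.dtype(dtype).itemsize

sift_frame = nbytes((2000, 128), np.float32)   # 2000 SIFT descriptors, 128-D each
bow_frame = nbytes((1000,), np.float32)        # one 1000-D BoW descriptor
hog_frame = nbytes((251328,), np.float64)      # one flattened HoG vector (assumed float64)
hog_per_second = 25 * hog_frame                # 25 frames per second

for name, size in [("SIFT per frame", sift_frame),
                   ("BoW per frame", bow_frame),
                   ("flattened HoG per frame", hog_frame),
                   ("flattened HoG per second of video", hog_per_second)]:
    print(f"{name}: {size / 1e6:.3f} MB")
```

Under these assumptions a BoW vector is a few kilobytes per frame, while keeping every flattened HoG vector in RAM costs about 50 MB per second of video, which is why processing frame by frame and storing only the encoded result tends to be the more manageable route.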

Could I ask what you mean by 128D? When I run HoG I get a flattened array of shape (251328,) for each frame/image, and at 25 frames per second that soon adds up if the video sequence is long. My understanding (possibly incorrect) is that you use k-means to define the clusters based on the descriptors, then each cluster label becomes a "word" in the corpus/bag of words, and each image is then annotated with the cluster labels present in that image. You then use Random Forest or another suitable classifier to train on the annotations. Is that correct?

mobmsc (2016-08-03 05:56:04 -0500)

Each SIFT descriptor is 128-dimensional. A HoG descriptor is a 36-dimensional (or 31-dimensional, depending on the implementation) descriptor. If you use HoG on fixed-size images then you actually don't need BoW, since you already have a global descriptor. BoW is meant to generate a global descriptor from many local ones. So you have to decide now: either go the BoW way, in which case you have to cluster and encode an (N, 36) matrix (where N is the number of HoG descriptors) to get a global descriptor, or use the flattened HoG descriptor as your global descriptor, which you can feed directly to your classifier.

Guanta (2016-08-03 07:33:01 -0500)
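
If you go the BoW way, the flattened vector first has to be split back into local descriptors. A minimal sketch, assuming the implementation stores each block's values contiguously in the flattened output (worth checking for your library); `to_local_descriptors` and `block_dim` are names made up for this example. Note that 251328 is not a multiple of 36 but is a multiple of 32, which would be consistent with, for instance, 8 orientation bins and 2x2 cells per block. That is only a guess about the settings used.

```python
import numpy as np

def to_local_descriptors(flat_hog, block_dim):
    """Reshape a flattened HoG vector into an (N, block_dim) matrix of blocks."""
    if flat_hog.size % block_dim != 0:
        raise ValueError(
            f"length {flat_hog.size} is not a multiple of block_dim={block_dim}"
        )
    return flat_hog.reshape(-1, block_dim)

# Stand-in for one frame's flattened HoG output, using the length from the thread.
flat_hog = np.zeros(251328, dtype=np.float64)

blocks = to_local_descriptors(flat_hog, block_dim=32)  # block_dim=32 is an assumption
print(blocks.shape)
```

The resulting rows are what you would sample for k-means and later encode into a BoW descriptor; the alternative is to skip this step and use `flat_hog` directly as the global descriptor.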

What options would there be to scale the BoW (could I run it in parallel and combine the output)? Also, wouldn't I need to aggregate the descriptor output of the frames if it was the action/sequence I was trying to train on, rather than specific image elements? Taking your pizza example from earlier: instead of trying to identify a pizza, the aim would be to identify someone making a pizza as opposed to cooking a steak.

mobmsc (2016-08-03 17:03:47 -0500)

I don't know much about action recognition. Two ideas, where I prefer the last one:

  • BoW of BoW descriptors: if you reduce the dimensionality of the high-dimensional BoW descriptors with PCA, you could aggregate them again via another BoW on top.

  • Another idea would be to compute a single BoW descriptor for several frames. Since you don't need to have all local descriptors at once to create the BoW descriptor, this is actually no problem: you just need to sum up all BoW descriptors of a scene. (Take care to revert the normalization step of BOWImgDescriptorExtractor::compute(), i.e. multiply each image descriptor (= BoW descriptor) by its number of local descriptors before summing, and only before classification normalize again by the total number of local descriptors.)

Guanta (2016-08-04 06:59:37 -0500)
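
The second idea (one BoW descriptor per scene) could look roughly like this. It assumes each per-frame BoW descriptor was normalized by dividing by that frame's number of local descriptors, as described above; `aggregate_bow` is a hypothetical helper and the numbers are toy values.

```python
import numpy as np

def aggregate_bow(frame_bows, frame_counts):
    """Combine per-frame normalized BoW vectors into one scene-level vector.

    frame_bows:   (F, K) array, each row normalized by its frame's descriptor count
    frame_counts: (F,) number of local descriptors in each frame
    """
    frame_bows = np.asarray(frame_bows, dtype=np.float64)
    frame_counts = np.asarray(frame_counts, dtype=np.float64)
    raw_counts = frame_bows * frame_counts[:, None]       # undo per-frame normalization
    return raw_counts.sum(axis=0) / frame_counts.sum()    # renormalize by the total

# Two frames, vocabulary of 3 words; frame 1 had 2 descriptors, frame 2 had 4.
frame_bows = [[0.5, 0.5, 0.0],
              [0.0, 0.25, 0.75]]
frame_counts = [2, 4]

scene_bow = aggregate_bow(frame_bows, frame_counts)
print(scene_bow)
```

Because only the running sum of raw counts and the running total of descriptors are needed, frames can be processed one at a time (or in parallel batches whose sums are added afterwards) without holding all local descriptors in memory.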


Seen: 1,147 times

Last updated: Aug 02 '16