Overfitting when training an SVM for gender classification
Hi,
I'm using block-based uniform LBP as features and training an SVM for gender classification on face images.
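For anyone unfamiliar with the feature, here is a minimal pure-Python sketch of uniform LBP over a whole image (my actual code is C++/OpenCV; all function names here are my own, and the block-based variant just computes one such histogram per block and concatenates them):

```python
# Illustrative sketch of uniform LBP, not my actual C++/OpenCV code.

def lbp_code(img, r, c):
    """8-neighbour LBP code for pixel (r, c), clockwise from top-left."""
    center = img[r][c]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code

def is_uniform(code):
    """A pattern is 'uniform' if its circular 8-bit string has
    at most two 0/1 transitions (58 such patterns for 8 bits)."""
    bits = [(code >> i) & 1 for i in range(8)]
    return sum(bits[i] != bits[(i + 1) % 8] for i in range(8)) <= 2

def uniform_lbp_histogram(img):
    """59-bin histogram: one bin per uniform pattern,
    plus one shared bin for all non-uniform patterns."""
    uniform_codes = sorted(c for c in range(256) if is_uniform(c))
    bin_of = {c: i for i, c in enumerate(uniform_codes)}
    hist = [0] * 59
    h, w = len(img), len(img[0])
    for r in range(1, h - 1):          # skip the 1-pixel border
        for c in range(1, w - 1):
            hist[bin_of.get(lbp_code(img, r, c), 58)] += 1
    return hist
```

For block-based LBP, the image is divided into blocks (8x8 in my case) and the per-block histograms are concatenated into one long feature vector.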
My first SVM model is trained on 1200 male and 500 female face images. (My CvSVMParams settings are exactly the same as in the OpenCV SVM tutorial: http://docs.opencv.org/2.4/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html) The result is not good: the hit rate is only about 6x%.
Then I tried to improve the hit rate by adding more training images. I used the first SVM model to predict on additional face images and took the misclassified ones as extra training samples. So my second SVM model is trained on 1200+200 male and 500+100 female face images. I expected the second model to work better than the first one, but it is always overfitted...
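The hard-example mining loop I mean is roughly the following sketch (a toy nearest-centroid "classifier" stands in for the SVM here, and all names are made up for illustration):

```python
# Toy sketch of one round of hard-example mining.
# A nearest-centroid classifier stands in for the real SVM.

def centroid(samples):
    dims = len(samples[0])
    return [sum(s[d] for s in samples) / len(samples) for d in range(dims)]

def train(pos, neg):
    """'Train' by storing the centroid of each class."""
    return centroid(pos), centroid(neg)

def predict(model, x):
    """Label 1 if x is closer to the positive centroid, else 0."""
    cp, cn = model
    dp = sum((a - b) ** 2 for a, b in zip(x, cp))
    dn = sum((a - b) ** 2 for a, b in zip(x, cn))
    return 1 if dp <= dn else 0

def mine_hard_examples(pos, neg, pool):
    """Train on (pos, neg), then return the (sample, label) pairs
    from the unlabeled-at-training-time pool that the model gets wrong.
    These are the samples I add to the training set for round two."""
    model = train(pos, neg)
    return [(x, y) for x, y in pool if predict(model, x) != y]
```

One caveat with this strategy: adding *only* the misclassified samples skews the training distribution toward the hardest (and possibly mislabeled or atypical) cases, which may be part of why the second model overfits.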
I'm wondering whether there is any other way to improve the hit rate, and why my approach gives a less accurate classifier. Hope anyone could kindly provide me some hints. Thanks.
You need to increase the dimensionality of your input data or try non-linear kernels!
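By non-linear kernels I mean e.g. the Gaussian RBF kernel, which OpenCV's CvSVM supports. A minimal sketch of what that kernel computes (the `gamma` value here is just a placeholder; in practice you'd tune it, e.g. with CvSVM::train_auto):

```python
import math

def rbf_kernel(x, y, gamma=0.1):
    """Gaussian RBF kernel: exp(-gamma * ||x - y||^2).
    Equals 1 when x == y and decays toward 0 as the vectors move apart."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

With this kernel the SVM can learn decision boundaries that are non-linear in the original feature space.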
AFAIK, when there are more images for training, accuracy should improve. To raise the recognition rate, you can try other advanced feature extraction methods, for example Local Phase Quantization (LPQ), or combine several methods. To pinpoint why you get a lower classification rate with more training images, you should provide more information: your code, your parameter settings, and your experimental images.
@tuannhtn that is not true: using a larger dataset can lead to larger variance in object appearance, and such appearance might not be as easily separable as in the smaller dataset. Take for example splitting dark and light colors. Separating just black and white samples in a binary classification is fairly easy, and high accuracy can be reached with a small dataset. But if you want to separate light gray from dark gray under different illumination conditions, far more data is needed to even achieve the same accuracy.
Yes, @StevenPuttemans, increasing the training set's size does not always improve accuracy: there is a threshold beyond which adding more images to the training set no longer helps. But below that threshold, it usually does.
I still do not agree; you are talking about oversampling and overfitting beyond a certain threshold. It all depends on the distribution of your data inside the feature space. I can easily add 1000 samples to a very simple weak classifier without increasing its accuracy or generalisation power.
Thanks to both of you for the discussion. Today I tried several things, but none of them gave a better result: 1) using an RBF kernel instead of the linear one; 2) changing block-based LBP (8x8 blocks on a 32x32-pixel image) to image-based LBP (computing the LBP code of every pixel and a single histogram over the whole image); 3) normalizing the data to [-1, 1] (I'm not sure this step is necessary, but LibSVM requires it and many papers mention the importance of normalization).
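For step 3, what I mean by normalizing to [-1, 1] is per-dimension min-max scaling, roughly like this sketch (function names are my own; the key point, which I only realized from the LibSVM guide, is to fit the min/max on the training set and reuse the same transform on test data):

```python
def scale_to_range(samples, lo=-1.0, hi=1.0):
    """Per-dimension min-max scaling of feature vectors to [lo, hi].
    Returns the scaled training samples plus a transform function
    that applies the *same* training-set scaling to new (test) data."""
    dims = len(samples[0])
    mins = [min(s[d] for s in samples) for d in range(dims)]
    maxs = [max(s[d] for s in samples) for d in range(dims)]

    def transform(vec):
        out = []
        for d in range(dims):
            span = maxs[d] - mins[d]
            if span == 0:
                out.append((lo + hi) / 2)  # constant dimension: mid-range
            else:
                out.append(lo + (hi - lo) * (vec[d] - mins[d]) / span)
        return out

    return [transform(s) for s in samples], transform
```

Re-fitting the min/max on the test set instead of reusing the training-set transform would leak information and make the reported hit rate unreliable.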
Following your discussion above, I may run another trial training my SVM on simpler face images (frontal faces only). I'm currently using the LFW face database, which may be too difficult.