# SVM classification when testing

Hi, I have created an SVM classifier for 4 different classes. When testing it with my training data (the same data used to train the SVM) I get very good predictions, ~98% correct.

However, when using data outside of my training set, even though it looks similar, I get very poor predictions.

Any suggestions on why this is happening? I've seen "normalization" mentioned and have read around about it, but I don't quite get what it's for. I'm using the natural log of the Hu moments as features.

Thanks


testing with the training data is not valid; you need to split some items off and keep them aside for testing later

(2015-10-29 21:58:51 -0500)

Your model is overfitted, which means it adjusts very well to the training data but does not generalize. One solution is to change your regularization parameter (C) to a better value (if C is small, the model generalizes better but you will have a bigger error on your training set). Normalization should also be done, so that every feature in your vector has approximately the same range (usually [-1, 1]) and therefore every feature carries roughly the same weight during training (so to say). A lot of info about SVMs can be found on the web, so just look around.

(2015-10-30 04:06:06 -0500)

@berak Indeed, hence why I also tested with data outside of the training set.

@LorenaGdL I have looked at many reports on normalization, but I don't get how to implement it on my feature set of the natural log of the Hu moments.

(2015-10-30 07:03:39 -0500)


To normalize data using the zero-mean, unit-standard-deviation approach, this is what I use in my programs (there are probably better/more optimized ways):

· For training

// Mat train_features has one descriptor vector per row (corresponding to one
// sample), and as many rows as samples in the training dataset
Mat means, sigmas;  // matrices to save all the means and standard deviations
for (int i = 0; i < train_features.cols; i++) {  // take each of the features in the vector
    Mat mean, sigma;
    meanStdDev(train_features.col(i), mean, sigma);  // get mean and std deviation
    means.push_back(mean);
    sigmas.push_back(sigma);
    train_features.col(i) = (train_features.col(i) - mean) / sigma;  // normalization
}
// optional steps to save all the parameters
Mat meansigma;
hconcat(means, sigmas, meansigma);  // both params in the same matrix
saveMatToCsv(meansigma, "meansigma.csv");  // custom function to save data to a .csv file


· For detection/testing (because you have to apply normalization there too)

// load previously saved means and sigmas (initialization, needed just once)
string file = "meansigma.csv";
Mat meansigma = loadCsv(file);  // custom function to read the .csv back (see below)
Mat means = meansigma.col(0).clone();
Mat sigmas = meansigma.col(1).clone();

// inside your for loop, for each frame
vector<float> descriptors = computeDescriptors();  // change this function appropriately
// normalize descriptors prior to classification, with the *training* means/sigmas
for (int idx = 0; idx < (int)descriptors.size(); idx++) {
    float mean = means.at<float>(idx);
    float sigma = sigmas.at<float>(idx);
    descriptors[idx] = (descriptors[idx] - mean) / sigma;  // normalize vector
}


Yes, the testing part might seem inefficient with such a loop instead of using Mat and the overloaded operators. I had my reasons to write it that way when I needed it, and I haven't reviewed it lately; everybody's welcome to improve it. However, for the purposes of the current question, I think it is clearer this way too.

About the saveMatToCsv() and loadCsv() functions: they're just my own custom functions to write to and read from a .csv file. Check this post for more info about them.

UPDATE - complete dummy sample (working without any problems in OpenCV 2.4.12, Win7 x64, VS 2013)

#include <opencv2/core/core.hpp>
#include <opencv2/ml/ml.hpp>
#include <iostream>
#include <fstream>

using namespace cv;
using namespace std;

void saveMatToCsv(Mat &matrix, string filename) {
    ofstream outputFile(filename);
    outputFile << format(matrix, "CSV") << endl;
    outputFile.close();
}

int main()
{
    // training data and labels ------------------
    // (9 samples, 4 features each, to match the 9 rows of values below)
    Mat train_features = (Mat_<float>(9, 4) <<
        1500, 25, -9, 6,
        1495, 31, -8, 8,
        1565, 30, -8, 7,
        1536, 28, -10, 8,
        1504, 29, -4, 6,
        2369, 87, 15, 69,
        526, 2, 47, 2,
        8965, 45, 25, 14,
        4500, 14, 36, 8);

    Mat labels = (Mat_<int>(9, 1) << 1, 1, 1, 1, 1, -1, -1, -1, -1);

    // normalizing data --------------------------
    Mat means, sigmas;  // matrices to save all the means and standard deviations
    for (int i = 0; i < train_features.cols; i++) {  // take each of the features in the vector
        Mat mean, sigma;
        meanStdDev(train_features.col(i), mean, sigma);  // get mean and std deviation
        means.push_back(mean);
        sigmas.push_back(sigma);
        train_features.col(i) = (train_features.col(i) - mean) / sigma;  // normalization
    }
    // optional steps to save all the parameters
    Mat meansigma;
    hconcat(means ...

just curious, why the meanStdDev per col?

(2015-10-30 10:03:37 -0500)

do you mean why per col instead of per row, or just why I loop through cols?

(2015-10-30 10:14:08 -0500)

yes, the mean per column (the iterating just depends on rows or cols)?

I misunderstood it, sorry. Your whole train_features goes column-wise; I did not see that.

(2015-10-30 10:24:58 -0500)

@LorenaGdL thanks for the help. I followed your link but couldn't find the code for loadCsv? So this normalization should work fine on the natural log of the Hu moments, right? Thanks

(2015-10-30 10:45:45 -0500)

@berak Yep, the train_features matrix has the usual sample-per-row structure; or, another way to see it, the i-th feature runs along the i-th column

(2015-10-30 10:46:01 -0500)

@RPH: it is the code inside the reading part of @theodore's answer, just not explicitly wrapped as a loadCsv function. That normalization works fine with every kind of data; it does not depend on the descriptors used

(2015-10-30 10:49:13 -0500)

@LorenaGdL Is there any other way to normalize the test data without having to save the means and std. deviations using those functions? I looked at the read function in the link, but I don't get how to make it work when passing those variables to it as you have above.

(2015-10-30 19:37:38 -0500)

@RPH I'm sure there are other ways; just find one suitable for you. I only wanted to point out that you need to normalize features both during training and testing. Still, I don't know what your problem is with the provided link; it is pretty obvious (I'm beginning to think you're not making enough effort, and that pisses me off incredibly...). The reading code in the linked answer:

CvMLData mlData;
mlData.read_csv("file.csv");  // the file has to be read in before get_values()
const CvMat* tmp = mlData.get_values();
cv::Mat img(tmp, true);
tmp->CvMat::~CvMat();


What you need to use according to my answer:

CvMLData mlData;
mlData.read_csv("meansigma.csv");  // read the saved means and sigmas
const CvMat* tmp = mlData.get_values();
cv::Mat meansigma(tmp, true);
tmp->CvMat::~CvMat();


So... as I said, pretty obvious changes

(2015-10-31 05:08:13 -0500)

@LorenaGdL Thanks, I'll try it out to see if it improves my testing results... It should work for 4 classes of data, right? I just keep pushing back the values for all classes; I don't need to restart anything?

(2015-10-31 06:39:25 -0500)

I think it should work with 4 classes

(2015-10-31 07:02:11 -0500)
