1

SVM classification when testing

asked 2015-10-29 18:45:01 -0600

RPH

updated 2015-10-29 19:31:49 -0600

Hi, I have created an SVM classifier for 4 different classes. When testing it with my training data (same data used to train the svm) I get very good predictions ~98% correct.

However when using data outside of my training set, even though it looks similar, I get very poor predictions.

Any suggestions on why this is happening ? I've seen mention of "normalization" and read around about it but I don't quite get what it's for. I'm using the natural log of Hu Moments as features.

Thanks


Comments

2

testing with the train data is not valid, you need to split some items off for testing later

berak ( 2015-10-29 21:58:51 -0600 )
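
A minimal sketch of the hold-out split suggested above, in plain C++ with no OpenCV dependency. The `Sample` struct, the `splitTrainTest` name, and the `test_fraction` and `seed` parameters are assumptions made for illustration, not something from the thread:

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Hypothetical container: one feature vector plus its class label.
struct Sample {
    std::vector<float> features;
    int label;
};

// Shuffle the samples, then split them into (train, test).
// test_fraction is the share of samples held out for testing.
std::pair<std::vector<Sample>, std::vector<Sample>>
splitTrainTest(std::vector<Sample> samples, double test_fraction, unsigned seed)
{
    std::mt19937 rng(seed);
    std::shuffle(samples.begin(), samples.end(), rng);

    const std::size_t n_test =
        static_cast<std::size_t>(samples.size() * test_fraction);
    const std::ptrdiff_t n_train =
        static_cast<std::ptrdiff_t>(samples.size() - n_test);

    // The first n_train shuffled samples train the SVM, the rest are never shown to it.
    std::vector<Sample> train(samples.begin(), samples.begin() + n_train);
    std::vector<Sample> test(samples.begin() + n_train, samples.end());
    return std::make_pair(train, test);
}
```

The accuracy measured on the second half of the pair is the number to look at; the ~98% on the training half says little about generalization.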
1

Your model is overfitted, which means it fits the training data very well but does not generalize. One solution is to change your regularization parameter (C) to a better value (if it is small, the model generalizes better, but you will get a bigger error on your training set). Normalization should also be done, so that every feature in your vector has approximately the same range (usually [-1,1]) and therefore every feature has the same weight while training (so to say). A lot of info about SVM can be found on the web, so just look around.

LorenaGdL ( 2015-10-30 04:06:06 -0600 )
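
To make the "[-1,1] range" idea concrete, here is a rough sketch of per-feature min-max scaling in plain C++. The `MinMax` struct and the `fitMinMax` / `scaleToUnitRange` names are hypothetical, and mapping a constant feature to 0 is an assumption of this sketch:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Per-feature minimum and maximum, learned from the training set only.
struct MinMax {
    std::vector<float> lo;
    std::vector<float> hi;
};

// rows: one feature vector per sample, all of the same length; must be non-empty.
MinMax fitMinMax(const std::vector<std::vector<float>>& rows)
{
    MinMax mm;
    mm.lo = rows.front();
    mm.hi = rows.front();
    for (const std::vector<float>& r : rows) {
        for (std::size_t j = 0; j < r.size(); ++j) {
            mm.lo[j] = std::min(mm.lo[j], r[j]);
            mm.hi[j] = std::max(mm.hi[j], r[j]);
        }
    }
    return mm;
}

// Map each feature linearly so that lo -> -1 and hi -> +1.
// A constant feature (hi == lo) is mapped to 0 to avoid dividing by zero.
void scaleToUnitRange(std::vector<float>& x, const MinMax& mm)
{
    for (std::size_t j = 0; j < x.size(); ++j) {
        const float range = mm.hi[j] - mm.lo[j];
        x[j] = (range > 0.0f) ? 2.0f * (x[j] - mm.lo[j]) / range - 1.0f : 0.0f;
    }
}
```

The same `MinMax` learned from the training data would then be applied to every test vector; test values outside the training range simply land slightly outside [-1,1].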

@berak Indeed, hence why I tested with data outside of the training data also.

@LorenaGdL I have looked at many reports on normalisation but I don't get how to implement it on my feature set of the natural log of Hu Moments....

RPH ( 2015-10-30 07:03:39 -0600 )

1 answer

3

answered 2015-10-30 09:10:48 -0600

LorenaGdL

updated 2015-11-01 06:08:02 -0600

To normalize data using the zero-mean, unit-standard-deviation approach, this is what I use in my programs (there are probably better/more optimized ways):

· For training

//Mat train_features has one descriptor vector per row (corresponding to one sample), and as many rows as samples in the training dataset
        Mat means, sigmas;  //matrices to save all the means and standard deviations
        for (int i = 0; i < train_features.cols; i++){  //take each of the features in vector
            Mat mean; Mat sigma;
            meanStdDev(train_features.col(i), mean, sigma);  //get mean and std deviation
            means.push_back(mean);
            sigmas.push_back(sigma);
            train_features.col(i) = (train_features.col(i) - mean) / sigma;  //normalization
        }
        //optional steps to save all the parameters
        Mat meansigma;
        hconcat(means, sigmas, meansigma);  //both params in same matrix
        saveMatToCsv(meansigma, "meansigma.csv");  //custom function to save data to .csv file

· For detection/testing (because you have to apply normalization there too)

    //load previously saved means and sigmas (initialization, needed just once)
    Mat meansigma;
    string file = "meansigma.csv";
    loadCsv(file, meansigma);
    Mat means = meansigma.col(0).clone();
    Mat sigmas = meansigma.col(1).clone();

    //inside your for loop, for each frame
    vector<float> descriptors = computeDescriptors();  //change function appropriately
    //normalize descriptors prior to classification
    for (int idx = 0; idx < (int)descriptors.size(); idx++){
        float mean = means.at<float>(idx);
        float sigma = sigmas.at<float>(idx);
        descriptors[idx] = (descriptors[idx] - mean) / sigma;  //normalize vector
    }

Yes, the testing part might seem inefficient with such a loop instead of Mat and overloaded operators. I had my reasons to write it that way when I needed it, and I haven't reviewed it lately... everybody's welcome to improve it. However, for the purposes of the current question, I think it is clearer this way too.
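
For readers who want the same fit-on-training, apply-at-test logic in one place, here is a rough sketch in plain C++ (no OpenCV). The `Standardizer` struct is hypothetical and not part of the answer's code; it divides by N for the standard deviation, as `meanStdDev` does, and it falls back to a divisor of 1 when a feature has zero spread, which is an added assumption:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: learns per-feature mean and sigma, then applies them.
struct Standardizer {
    std::vector<double> means;
    std::vector<double> sigmas;

    // rows: one feature vector per training sample (non-empty, equal lengths).
    void fit(const std::vector<std::vector<float>>& rows)
    {
        const std::size_t n = rows.size();
        const std::size_t d = rows.front().size();
        means.assign(d, 0.0);
        sigmas.assign(d, 0.0);
        for (const auto& r : rows)
            for (std::size_t j = 0; j < d; ++j) means[j] += r[j];
        for (std::size_t j = 0; j < d; ++j) means[j] /= static_cast<double>(n);
        for (const auto& r : rows)
            for (std::size_t j = 0; j < d; ++j) {
                const double diff = r[j] - means[j];
                sigmas[j] += diff * diff;
            }
        // Population standard deviation (divide by N, not N - 1).
        for (std::size_t j = 0; j < d; ++j)
            sigmas[j] = std::sqrt(sigmas[j] / static_cast<double>(n));
    }

    // Normalize one feature vector in place with the stored parameters.
    void apply(std::vector<float>& x) const
    {
        for (std::size_t j = 0; j < x.size(); ++j) {
            const double s = (sigmas[j] > 0.0) ? sigmas[j] : 1.0;  // guard: constant feature
            x[j] = static_cast<float>((x[j] - means[j]) / s);
        }
    }
};
```

The intended use would be to call `fit` once on the training rows, `apply` on each training row before training the SVM, and `apply` again on every descriptor vector before prediction.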

About the saveMatToCsv() and loadCsv() functions, they're just my own custom functions to write to and read from a .csv file. Check this post for more info about them.


UPDATE - complete dummy sample (working without any problems in OpenCV 2.4.12, Win7 x64, VS 2013)

#include <opencv2/core/core.hpp>
#include <opencv2/ml/ml.hpp>
#include <iostream>
#include <fstream>

using namespace cv;
using namespace std;

void saveMatToCsv(Mat &matrix, string filename){
    ofstream outputFile(filename);
    outputFile << format(matrix, "CSV") << endl;
    outputFile.close();
}

int main()
{
    //training data and labels ------------------
    Mat train_features = (Mat_<float>(10, 4) <<
                                            1500, 25, -9, 6,
                                            1495, 31, -8, 8,
                                            1565, 30, -8, 7,
                                            1536, 28, -10, 8,
                                            1504, 29, -4, 6,
                                            2369, 87, 15, 69,
                                            526, 2, 47, 2,
                                            8965, 45, 25, 14,
                                            4500, 14, 36, 8);

    Mat labels = (Mat_<int>(10, 1) << 1, 1, 1, 1, 1, -1, -1, -1, -1, -1);

    //normalizing data --------------------------
    Mat means, sigmas;  //matrices to save all the means and standard deviations
    for (int i = 0; i < train_features.cols; i++){  //take each of the features in vector
        Mat mean; Mat sigma;
        meanStdDev(train_features.col(i), mean, sigma);  //get mean and std deviation
        means.push_back(mean);
        sigmas.push_back(sigma);
        train_features.col(i) = (train_features.col(i) - mean) / sigma;  //normalization
    }
    //optional steps to save all the parameters
    Mat meansigma;
    hconcat(means ...

Comments

just curious, why the meanStdDev per col ?

berak ( 2015-10-30 10:03:37 -0600 )

do you mean why per col instead of per row or just why looping through cols?

LorenaGdL ( 2015-10-30 10:14:08 -0600 )

yes, mean per column (iterating just depends on rows or cols) ?

i misunderstood it, sorry. your whole train_feature goes colwise, did not see that.

berak ( 2015-10-30 10:24:58 -0600 )

@LorenaGdL thanks for the help. I followed your link but couldn't find the code for loadCSV ? So this normalization should work fine on the natural log of Hu Moments right ? Thanks

RPH ( 2015-10-30 10:45:45 -0600 )

@berak Yep, the train_features matrix has the usual sample_per_row structure, or another way to see it, same i-th feature along i-th column

LorenaGdL ( 2015-10-30 10:46:01 -0600 )

@RPH: it is the code inside the reading part of @theodore's answer, just not explicitly wrapped as loadCsv function. That normalization should work fine with every kind of data, does not depend on the used descriptors

LorenaGdL ( 2015-10-30 10:49:13 -0600 )
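
One thing that may be worth checking with log Hu moments specifically: a plain natural log of a negative value gives NaN, and the log of 0 gives minus infinity, either of which would spoil the mean and standard deviation of that feature. A commonly used workaround is a sign-preserving log. The sketch below assumes that transform; the `signedLog` name and the `eps` cutoff are hypothetical, and the thread does not say how the asker actually computes the log:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sign-preserving log transform: sign(h) * ln(|h|).
// Values with |h| below eps are mapped to 0 so that log(0) is never taken.
std::vector<double> signedLog(const std::vector<double>& hu, double eps)
{
    std::vector<double> out(hu.size(), 0.0);
    for (std::size_t i = 0; i < hu.size(); ++i) {
        const double a = std::fabs(hu[i]);
        if (a < eps) continue;  // leave near-zero moments at 0
        const double sign = (hu[i] > 0.0) ? 1.0 : -1.0;
        out[i] = sign * std::log(a);
    }
    return out;
}
```

The transformed values can then go through the same per-feature normalization as any other descriptor.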

@LorenaGdL Is there any other way to normalize the test data without having to save the means and std dev. using those functions ? I looked at the read function in the link but I don't get how to make it work when passing those variables to it like you have above.

RPH ( 2015-10-30 19:37:38 -0600 )

@RPH I'm sure there are other ways, just find one that suits you. I just wanted to point out that you need to normalize features both during training and testing. Still, I don't know what your problem with the provided link is; it is pretty obvious (I'm beginning to think you're not making enough effort, and that pisses me off incredibly...)

Original code in provided link:

CvMLData mlData;
mlData.read_csv("cameraFrame1.csv");
const CvMat* tmp = mlData.get_values();
cv::Mat img(tmp, true);
tmp->CvMat::~CvMat();

What you need to use according to my answer:

CvMLData mlData;
mlData.read_csv("meansigma.csv");
const CvMat* tmp = mlData.get_values();
cv::Mat meansigma(tmp, true);
tmp->CvMat::~CvMat();

So... as I said, pretty obvious changes

LorenaGdL ( 2015-10-31 05:08:13 -0600 )
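
If the OpenCV CSV helpers are the sticking point, the means and sigmas could also be stored with the standard library alone. This is only a sketch under that assumption: `saveMeanSigmaCsv` and `loadMeanSigmaCsv` are hypothetical names, not the custom functions from the answer, and the file layout (one `mean,sigma` pair per line) is chosen here for illustration:

```cpp
#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Write one "mean,sigma" pair per line. Returns false on any I/O failure.
bool saveMeanSigmaCsv(const std::string& path,
                      const std::vector<double>& means,
                      const std::vector<double>& sigmas)
{
    std::ofstream out(path.c_str());
    if (!out) return false;
    out.precision(17);  // enough digits to round-trip a double
    for (std::size_t i = 0; i < means.size() && i < sigmas.size(); ++i)
        out << means[i] << ',' << sigmas[i] << '\n';
    out.flush();
    return out.good();
}

// Read the pairs back. Returns false if the file is missing or a line is malformed.
bool loadMeanSigmaCsv(const std::string& path,
                      std::vector<double>& means,
                      std::vector<double>& sigmas)
{
    std::ifstream in(path.c_str());
    if (!in) return false;
    means.clear();
    sigmas.clear();
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;  // tolerate blank lines
        std::istringstream fields(line);
        double m = 0.0, s = 0.0;
        char comma = 0;
        if (!(fields >> m >> comma >> s) || comma != ',') return false;
        means.push_back(m);
        sigmas.push_back(s);
    }
    return true;
}
```

If training and testing happen in the same program run, the two vectors can of course just be kept in memory and no file is needed at all.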

@LorenaGdL Thanks, I'll try it out to see if it improves my testing results...It should work for 4 classes of data right ? I just keep pushing back the values for all classes, I don't need to restart or anything ?

RPH ( 2015-10-31 06:39:25 -0600 )

I think it should work with 4 classes

LorenaGdL ( 2015-10-31 07:02:11 -0600 )
