machine learning split set not working properly?

asked 2014-07-21 04:17:25 -0600

lups789

updated 2014-07-21 06:35:02 -0600

Hi,

I am using C++ and OpenCV 2.4.9 to build a machine learning application. I have a .csv file with classes and features. I load it, store it in a Mat, and then want to split the data into a training set and a test set. However, when I access the sample indices of the two sets and store them in a Mat, I only get values between 0 and 255, even though my original set has considerably more entries. The dimensions of the matrices are correct, but the values never exceed 255.

    //   load data from csv
    CvMLData mlData;
    mlData.read_csv(dataDir.c_str());
    Mat mlMat = mlData.get_values();

    //   split into training and test set
    float train_sample_portion = 0.7;       // use 70% as training
    bool random_split = false;      // true = random
    CvTrainTestSplit spl(train_sample_portion, random_split);
    mlData.set_train_test_split(&spl);
    Mat trainSetIds = mlData.get_train_sample_idx();            // !!! values from 0 to 255 !!!
    Mat testSetIds = mlData.get_test_sample_idx();          // !!! values from 0 to 255 !!!

(I ultimately want a random split, but I turned it off here.)

I figured that the problem could have something to do with the type of the matrices, but adding the following did not solve the problem:

    Mat trainSetIds(mlMat.size().width, train_sample_portion * mlMat.size().height, CV_32SC1);
    Mat testSetIds(mlMat.size().width, (1-train_sample_portion) * mlMat.size().height, CV_32SC1);

In fact, the output of get_train_sample_idx() itself also contains only values between 0 and 255. I hope someone can help here.

Cheers.

edit: to clarify the problem

My .csv has the class in the first column and then multiple columns with feature values ranging from -1 to 1. mlMat is showing correct values. trainSetIds and testSetIds should give me the row index of the split data.

This mlMat

(class  feat1   feat2   ...) [not included]
1        0.3    0.6     ...
0        -0.6   -1      ...
0        -0.1   0.1     ...
1        0.2    0.8     ...

should give trainSetIds filled with (0 ; 1 ; 2) and testSetIds with (3). For this minimal example it works, but if there are more than 255 rows it does not.
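To make the expected result concrete, here is a minimal sketch in plain C++11 (no OpenCV) of the sequential split described above. `splitRowIndices` and its parameters are hypothetical names introduced for illustration. It takes the number of training rows directly instead of a portion, and it is not how `CvMLData` computes its split, only what the resulting index lists should look like.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not part of OpenCV): sequential, non-random split.
// Rows [0, n_train) become training indices, rows [n_train, n_rows) test indices.
std::pair<std::vector<int>, std::vector<int> >
splitRowIndices(std::size_t n_rows, std::size_t n_train)
{
    assert(n_train <= n_rows);

    std::vector<int> train;
    std::vector<int> test;
    train.reserve(n_train);
    test.reserve(n_rows - n_train);

    for (std::size_t i = 0; i < n_rows; ++i) {
        if (i < n_train) {
            train.push_back(static_cast<int>(i));
        } else {
            test.push_back(static_cast<int>(i));
        }
    }
    return std::make_pair(train, test);
}
```

For the four-row example, `splitRowIndices(4, 3)` yields training indices 0, 1, 2 and test index 3. With 300 rows and 210 training rows, the test indices run up to 299, so a correct result must contain values above 255.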


Comments

How do you know that your values are only between 0 and 255? Did you use the template accessor to get an integer?

Mathieu Barnachon (2014-07-21 04:37:24 -0600)
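The accessor question matters because `Mat::at<T>()` does not convert values: `T` has to match the element type of the matrix. The sketch below is plain C++11 without OpenCV and is only an illustration, not a confirmed diagnosis of this problem. It shows that any 8-bit view of 32-bit indices is bounded to 0..255. `narrowTo8Bit` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: keeps only the low 8 bits of each non-negative index,
// to show what is lost when 32-bit values end up in an 8-bit type.
std::vector<int> narrowTo8Bit(const std::vector<int>& indices)
{
    std::vector<int> out;
    out.reserve(indices.size());
    for (int idx : indices) {
        assert(idx >= 0);
        // Conversion to an unsigned 8-bit type wraps modulo 256.
        out.push_back(static_cast<std::uint8_t>(idx));
    }
    return out;
}
```

Indices below 256 pass through unchanged, which would be consistent with the small example working, while larger ones wrap around (256 becomes 0, 300 becomes 44).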

I used "cout << trainSetIds / testSetIds / mlData.get_...._sample_idx()" to output the values.

lups789 (2014-07-21 05:12:28 -0600)

And does mlMat have values greater than 255?

Mathieu Barnachon (2014-07-21 06:09:10 -0600)

It does not, but this is because there are no values greater than 255 in the csv. mlMat is showing all values correctly. I edited the original post for clarification.

lups789 (2014-07-21 06:36:11 -0600)