opencv might not be the best ml toolkit for this purpose, but it is entirely possible to work with categorical values like those in your example. the main problem, as with all ml, is beating the data into submission ;)

but first, a mandatory read

you can use ml::TrainData::loadFromCSV() to read the file. for your train.csv it would look like:

Ptr<ml::TrainData> data = ml::TrainData::loadFromCSV("train.csv", // our file
                                                     1,           // headerLineCount: yes, it has a header line
                                                     -1,          // responseStartIdx: labels are in the last column
                                                     -1,          // responseEndIdx: a single label column
                                                     "",          // varTypeSpec: we *only* have categorical
                                                                  // values (not a mix)
                                                     '\t'         // delimiter
                                                     );
cout << data->getTrainSamples() << endl;
cout << data->getTrainResponses().t() << endl;

[1, 2, 3, 4;
 6, 7, 3, 4;
 1, 9, 10, 4;
 11, 2, 3, 4;
 6, 7, 3, 4;
 11, 2, 10, 4]

[5, 8, 8, 8, 8, 8]

as you can see, each categorical name is simply replaced by its index in a lookup list (one id per distinct name).
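
to make that replacement concrete, here is a minimal sketch in plain c++ that reproduces the numbering pattern seen in the output above: one lookup shared by all columns (labels included), ids starting at 1, handed out in order of first appearance. CategoryIndex and encodeTable are hypothetical names made up for illustration. this is not opencv's actual implementation, and the cell names you feed it are up to you:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper (not part of OpenCV): a single lookup shared by all
// columns. Ids start at 1 and are assigned in order of first appearance.
struct CategoryIndex {
    std::map<std::string, int> ids;   // name -> id
    std::vector<std::string> names;   // (id - 1) -> name

    // Return the id of a name, registering the name if it is new.
    int idOf(const std::string& name) {
        auto it = ids.find(name);
        if (it != ids.end()) return it->second;
        names.push_back(name);
        const int id = static_cast<int>(names.size());
        ids[name] = id;
        return id;
    }

    // Reverse lookup, handy for printing predictions as names again.
    const std::string& nameOf(int id) const {
        assert(id >= 1 && id <= static_cast<int>(names.size()));
        return names[id - 1];
    }
};

// Encode a table of strings row by row (the label column is treated
// like any other column).
std::vector<std::vector<int>> encodeTable(
        const std::vector<std::vector<std::string>>& rows,
        CategoryIndex& index) {
    std::vector<std::vector<int>> out;
    for (const auto& row : rows) {
        std::vector<int> encoded;
        for (const auto& cell : row) encoded.push_back(index.idOf(cell));
        out.push_back(encoded);
    }
    return out;
}
```

keeping the reverse lookup around (nameOf) is a design choice for this sketch only, so that a predicted id can be turned back into a readable name.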

decision trees (and their derivatives) can properly handle this (they check for node equality). but if you want to use other ml algos, like knn, ann or svm (which use some concept of "distance"), you'd need to switch to "one-hot" encoding instead, since the index values themselves carry no numeric meaning.
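
a rough sketch of what "one-hot" means for index data like the samples printed above: every column is expanded into one 0/1 slot per distinct id found in that column. oneHot is a hypothetical helper in plain c++ (std::vector instead of cv::Mat, to keep it self-contained), and it assumes all rows have the same length:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of OpenCV): expand integer category ids
// into one-hot columns. Each input column gets one output slot per
// distinct id seen in that column. Assumes all rows have equal length.
std::vector<std::vector<float>> oneHot(
        const std::vector<std::vector<int>>& samples) {
    if (samples.empty()) return {};
    const std::size_t cols = samples[0].size();

    // Collect the sorted distinct ids of every column.
    std::vector<std::vector<int>> levels(cols);
    for (std::size_t c = 0; c < cols; ++c) {
        for (const auto& row : samples) levels[c].push_back(row[c]);
        std::sort(levels[c].begin(), levels[c].end());
        levels[c].erase(std::unique(levels[c].begin(), levels[c].end()),
                        levels[c].end());
    }

    // Write a 1 into the slot matching the row's id, 0 everywhere else.
    std::vector<std::vector<float>> out;
    for (const auto& row : samples) {
        std::vector<float> encoded;
        for (std::size_t c = 0; c < cols; ++c) {
            for (int level : levels[c])
                encoded.push_back(row[c] == level ? 1.0f : 0.0f);
        }
        out.push_back(encoded);
    }
    return out;
}
```

for the 6 x 4 sample matrix above this gives 6 rows of 9 floats (3 + 3 + 2 + 1 distinct ids per column). note that the slot layout depends on the ids present in the data you pass in, so train and test rows should be encoded together to get the same layout.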

i'd propose you put both train and test data into the same csv file, so both get the same name-to-index mapping, and then split them after loading (e.g. with TrainData::setTrainTestSplitRatio()).

take another look at the sample code here -- aaaaand good luck ;)