Random Forest with categorical features. [closed]
Hello,
I am trying to use a random forest on mixed data with continuous and categorical features, but I do not understand how to call the predict function on one of these samples.
Find the data format below:
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
I have 35000 records in the data-set.
Please find the code below:
#include <opencv2/opencv.hpp>
#include <opencv2/highgui.hpp>
#include <opencv2/core.hpp>
#include <opencv2/ml.hpp>
#include <iostream>
using namespace std;
using namespace cv;
using namespace cv::ml;
int main()
{
    cout << "Loading Data..." << endl;
    Ptr<TrainData> raw_data = TrainData::loadFromCSV("real.csv", 0, -1, -1, "ord[0,2,4,10-12]cat[1,3,5-9,13-14]", ',');
    Mat data = raw_data->getSamples();
    Mat labels = raw_data->getResponses();
    auto rtrees = RTrees::create();
    rtrees->setMaxDepth(10);
    rtrees->setMinSampleCount(2);
    rtrees->setUseSurrogates(false);
    rtrees->setMaxCategories(2);
    rtrees->setCalculateVarImportance(false);
    rtrees->setActiveVarCount(0);
    rtrees->setTermCriteria({ cv::TermCriteria::MAX_ITER, 100, 0 });
    cout << "Training Model..." << endl;
    rtrees->train(data, cv::ml::ROW_SAMPLE, labels);
    cout << "Saving Model..." << endl;
    rtrees->save("rt_classifier.xml");
    cout << "Loading Model..." << endl;
    auto rtrees2 = cv::ml::RTrees::create();
    cv::FileStorage read("rt_classifier.xml", cv::FileStorage::READ);
    rtrees2->read(read.root());
    //rtrees2->predict();
    return 0;
}
Sample to predict:
53, Private, 144361, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 38, United-States
Can I get any help formatting this data so that it can be passed to predict()?
Thanks in advance.
what's the problem, exactly ?
predict( InputArray samples, OutputArray results=noArray(), int flags=0 ) const = 0;
The input sample needs to be a floating-point matrix, but the data to be predicted is a mix of floating-point values and strings, like the training data. So I am wondering how to form the input data for prediction.
what is the actual input to predict()? you probably need to process it in the same way as the training data.
/** @brief Predicts response(s) for the provided sample(s)
The processing of the training data is taken care of by TrainData::loadFromCSV(), so I assume it internally generates a mapping for the categorical values. When a new sample arrives for prediction, it should be processed with the same mapping that was generated for the training data. So your suggestion that we should process the prediction data separately is not clear to me. Please suggest.
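One way to get this behaviour without depending on what TrainData does internally is to own the mapping yourself: build a string-to-id table from the training rows, encode the training data with it, and reuse the same table for every sample passed to predict(). The sketch below uses only the standard library. `CategoryEncoder`, `fit` and `transform` are hypothetical names, not OpenCV API, and encoding unseen categories as -1 is an assumption made for illustration.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of OpenCV): keeps one string -> id table
// per categorical column so training and prediction share the same encoding.
class CategoryEncoder {
public:
    explicit CategoryEncoder(std::set<std::size_t> categoricalColumns)
        : catCols_(std::move(categoricalColumns)) {}

    // Learn ids from the training rows, in order of first appearance.
    void fit(const std::vector<std::vector<std::string>>& rows) {
        for (const auto& row : rows) {
            for (std::size_t c : catCols_) {
                if (c >= row.size()) continue;
                auto& table = tables_[c];
                // emplace leaves the table unchanged if the value is already known.
                table.emplace(row[c], static_cast<float>(table.size()));
            }
        }
    }

    // Encode one row with the tables learned in fit().
    // Assumption: categories never seen during fit() are encoded as -1.
    // Ordered columns are parsed with std::stof, which throws on non-numeric text.
    std::vector<float> transform(const std::vector<std::string>& row) const {
        std::vector<float> out;
        out.reserve(row.size());
        for (std::size_t c = 0; c < row.size(); ++c) {
            if (catCols_.count(c) == 0) {
                out.push_back(std::stof(row[c]));
                continue;
            }
            auto t = tables_.find(c);
            if (t == tables_.end()) {
                out.push_back(-1.0f);
                continue;
            }
            auto it = t->second.find(row[c]);
            out.push_back(it == t->second.end() ? -1.0f : it->second);
        }
        return out;
    }

private:
    std::set<std::size_t> catCols_;
    std::map<std::size_t, std::map<std::string, float>> tables_;
};
```

The encoded row could then be copied into a 1 x N CV_32F cv::Mat and handed to predict(). This is only meaningful if the training matrix was encoded with the same encoder rather than by loadFromCSV().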
use TrainData::loadFromCSV() again? or put both train & test data into the same csv, read that, and use setTrainTestSplit?
OK, let me put it this way. I have training data with numerical and categorical features, and I use it to train my tree.
Now I want to predict the label for new data on the fly. The challenges I see with the methods you suggested are:
1. If we go the TrainData::loadFromCSV() way, I am afraid it will generate different mappings for the prediction data than the ones we had for the training data.
2. Putting the test (prediction) data into the same CSV as the training data is not an option, as we do not know the test data at training time.
Looking forward to your thoughts on this.
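Since prediction happens later, possibly in another process that only loads rt_classifier.xml, whatever mapping was used for training has to be stored as well. A minimal sketch, assuming the mapping is kept as a per-column table from category name to id and that names contain no tab characters. `CategoryMapping`, `saveMapping`, `loadMapping` and the tab-separated format are hypothetical choices, not something OpenCV provides.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <ostream>
#include <sstream>
#include <string>

// column index -> (category name -> numeric id)
using CategoryMapping = std::map<std::size_t, std::map<std::string, int>>;

// Write one "column<TAB>id<TAB>name" line per category.
void saveMapping(const CategoryMapping& mapping, std::ostream& out) {
    for (const auto& column : mapping)
        for (const auto& entry : column.second)
            out << column.first << '\t' << entry.second << '\t'
                << entry.first << '\n';
}

// Read the lines written by saveMapping back into a table.
CategoryMapping loadMapping(std::istream& in) {
    CategoryMapping mapping;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::size_t column = 0;
        int id = 0;
        std::string name;
        if (!(fields >> column >> id)) continue;  // skip malformed lines
        fields.ignore(1, '\t');                   // drop the separator before the name
        std::getline(fields, name);               // the rest of the line is the name
        if (!name.empty()) mapping[column][name] = id;
    }
    return mapping;
}
```

Both functions take streams, so the same code works with a std::ofstream at training time and a std::ifstream at prediction time.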
oh, apologies, your objection is entirely correct: you get a different mapping with different data if you let it start from scratch.
but that's maybe a flaw in the TrainData class. please note that it will just assign ascending ids in the order of their appearance.
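To make the effect concrete, here is a small standard-library sketch of "ascending ids in the order of their appearance". It only imitates the behaviour described above and is not the TrainData code. `assignIdsByAppearance` is a hypothetical name, and starting the ids at 0 is an assumption.

```cpp
#include <map>
#include <string>
#include <vector>

// Give each distinct string the next free id, in order of first appearance.
std::map<std::string, int> assignIdsByAppearance(const std::vector<std::string>& values) {
    std::map<std::string, int> ids;
    int next = 0;
    for (const auto& value : values) {
        if (ids.find(value) == ids.end())
            ids[value] = next++;
    }
    return ids;
}
```

For a training column containing State-gov, Self-emp-not-inc, Private in that order, Private gets id 2. If a file holding only the sample to predict is encoded from scratch, Private gets id 0, so the trained forest would be looking at a different category.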
that's kaggle's titanic data? please have a look at their site again, there are a lot of interesting recipes for finding a better mapping strategy than the one used here.