Random Forest with categorical features.

asked 2019-05-17 08:52:44 -0500

frogeye

updated 2019-06-18 02:58:56 -0500

Hello,

I am trying to use a random forest on a data set that mixes continuous and categorical features, but I do not understand how to call the predict function on one of these samples.

Find the data format below:

39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K

I have 35000 records in the data-set.

Please find the code below:

#include <opencv2/opencv.hpp>
#include <opencv2/highgui.hpp>
#include <opencv2/core.hpp>
#include <opencv2/ml.hpp>
#include <iostream>

using namespace std;
using namespace cv;
using namespace cv::ml;

int main()
{
    cout << "Loading Data..." << endl;
    // Columns 0, 2, 4, 10-12 are ordered (numeric); columns 1, 3, 5-9, 13-14 are categorical.
    // With responseStartIdx = -1 the last column is taken as the response.
    Ptr<TrainData> raw_data = TrainData::loadFromCSV("real.csv", 0, -1, -1, "ord[0,2,4,10-12]cat[1,3,5-9,13-14]", ',');
    if (!raw_data)
    {
        cerr << "Could not load real.csv" << endl;
        return 1;
    }
    Mat data = raw_data->getSamples();
    Mat labels = raw_data->getResponses();

    auto rtrees = RTrees::create();
    rtrees->setMaxDepth(10);
    rtrees->setMinSampleCount(2);
    rtrees->setUseSurrogates(false);
    rtrees->setMaxCategories(2);
    rtrees->setCalculateVarImportance(false);
    rtrees->setActiveVarCount(0);
    rtrees->setTermCriteria({ cv::TermCriteria::MAX_ITER, 100, 0 });
    cout << "Training Model..." << endl;
    rtrees->train(data, cv::ml::ROW_SAMPLE, labels);
    cout << "Saving Model..." << endl;
    rtrees->save("rt_classifier.xml");

    cout << "Loading Model..." << endl;
    auto rtrees2 = cv::ml::RTrees::create();

    cv::FileStorage read("rt_classifier.xml", cv::FileStorage::READ);
    // save() wraps the model in a named top-level node, so read that node rather than the file root.
    rtrees2->read(read.getFirstTopLevelNode());

    //rtrees2->predict();

    return 0;
}

Sample to predict:

53, Private, 144361, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 38, United-States

Could I get some help formatting this data so that it can be fed to predict()?

Thanks in advance.


Comments

what's the problem, exactly ?

berak (2019-05-17 08:57:18 -0500)

predict( InputArray samples, OutputArray results=noArray(), int flags=0 ) const = 0;

The input sample needs to be a floating-point matrix.

But the data to be predicted is a mix of floating-point values and strings, like the training data. So I am wondering how to form the input data for prediction.

frogeye (2019-05-17 09:00:49 -0500)

what is the actual input to predict() ?

you probably need to process it in the same way as the training data.

berak (2019-05-17 09:04:02 -0500)

/** @brief Predicts response(s) for the provided sample(s)

    @param samples The input samples, floating-point matrix
    @param results The optional output matrix of results.
    @param flags The optional flags, model-dependent. See cv::ml::StatModel::Flags.
 */
CV_WRAP virtual float predict( InputArray samples, OutputArray results=noArray(), int flags=0 ) const = 0;

The processing of the training data is taken care of by TrainData::loadFromCSV(), so I assume it must internally generate a mapping for the categorical values. When a new sample arrives for prediction, it should be processed with the same mapping that was generated for the training data. So your suggestion that we should process the prediction data separately is not clear to me. Please advise.

frogeye (2019-05-17 09:13:04 -0500)

use TrainData::loadFromCSV() again?

or, put both train & test data into the same csv, read that, and use setTrainTestSplit?

berak (2019-05-17 10:04:04 -0500)

OK, let me put it this way. I have training data that contains both numerical and categorical features, and I use it to train my tree.

Now I want to predict the label for new data on the fly. The challenges I see with the methods you suggested are:

1. If we go the TrainData::loadFromCSV() way, I am afraid it will generate different mappings for the prediction data than the ones we had for the training data.
2. Putting the test (prediction) data into the same CSV is not an option, as we do not know the test data at training time.

Looking forward to your thoughts on this.

frogeye (2019-06-18 02:50:29 -0500)
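
One way around the first challenge is to keep the category mapping in your own code instead of relying on whatever TrainData builds internally. The following is a minimal sketch under that assumption, not OpenCV's own mechanism: CategoryEncoder, fitTransform and transform are hypothetical names, the fields of each record are assumed to be already split on commas and trimmed, and ids are simply assigned in order of first appearance.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper (not part of OpenCV): turns one record of strings into
// a row of floats and remembers the category -> id mapping per column.
class CategoryEncoder {
public:
    // isCategorical[i] is true when column i holds category names.
    explicit CategoryEncoder(std::vector<bool> isCategorical)
        : isCategorical_(std::move(isCategorical)), maps_(isCategorical_.size()) {}

    // Training-time encoding: a category seen for the first time gets the next free id.
    std::vector<float> fitTransform(const std::vector<std::string>& row) {
        checkWidth(row);
        std::vector<float> out(row.size());
        for (std::size_t i = 0; i < row.size(); ++i) {
            if (!isCategorical_[i]) { out[i] = std::stof(row[i]); continue; }
            auto& m = maps_[i];
            // emplace() leaves an existing entry untouched and returns it.
            auto it = m.emplace(row[i], static_cast<int>(m.size())).first;
            out[i] = static_cast<float>(it->second);
        }
        return out;
    }

    // Prediction-time encoding: the mapping is frozen, unseen categories are an error.
    std::vector<float> transform(const std::vector<std::string>& row) const {
        checkWidth(row);
        std::vector<float> out(row.size());
        for (std::size_t i = 0; i < row.size(); ++i) {
            if (!isCategorical_[i]) { out[i] = std::stof(row[i]); continue; }
            auto it = maps_[i].find(row[i]);
            if (it == maps_[i].end())
                throw std::invalid_argument("unseen category: " + row[i]);
            out[i] = static_cast<float>(it->second);
        }
        return out;
    }

private:
    void checkWidth(const std::vector<std::string>& row) const {
        if (row.size() != isCategorical_.size())
            throw std::invalid_argument("unexpected number of columns");
    }

    std::vector<bool> isCategorical_;
    std::vector<std::unordered_map<std::string, int>> maps_;
};
```

Each returned row could then be copied into a CV_32F cv::Mat (one row per sample), both for training (for example through TrainData::create with a varType that marks the categorical columns) and later for predict(). That wiring is left out so the sketch stays free of OpenCV dependencies.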

oh, apologies, your objection is entirely correct: you get a different mapping with different data if you allow it to start from scratch.

but that's maybe a flaw in the TrainData class. please note that it will just assign ascending ids in the order of their appearance.

that's kaggle's titanic data? please have a look at their site again, there are a lot of interesting recipes for finding a better mapping strategy than the one used here.

berak (2019-06-18 04:49:34 -0500)