Random Forest with categorical features.

asked 2019-05-17 08:52:44 -0500


I am trying to use a random forest on mixed data with continuous and categorical features, but I do not understand how to call the predict function on one of these samples.

Find the data format below:

39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K

I have 35000 records in the data-set.

Please find the code below:

#include <iostream>
#include <opencv2/highgui.hpp>
#include <opencv2/core.hpp>
#include <opencv2/ml.hpp>

using namespace std;
using namespace cv;
using namespace cv::ml;

int main()
{
    cout << "Loading Data..." << endl;
    // Columns 0,2,4,10-12 are ordered (numeric); columns 1,3,5-9,13-14 are categorical.
    Ptr<TrainData> raw_data = TrainData::loadFromCSV("C:/mlpack/samples/mlpack/sample-ml-app/sample-ml-app/data/real.csv", 0, -1, -1, "ord[0,2,4,10-12]cat[1,3,5-9,13-14]", ',');
    Mat data = raw_data->getSamples();
    Mat labels = raw_data->getResponses();

    auto rtrees = RTrees::create();
    rtrees->setTermCriteria({ cv::TermCriteria::MAX_ITER, 100, 0 });
    cout << "Training Model..." << endl;
    rtrees->train(data, cv::ml::ROW_SAMPLE, labels);
    cout << "Saving Model..." << endl;
    // Write the trained model to the file that is read back below.
    rtrees->save("rt_classifier.xml");

    cout << "Loading Model..." << endl;
    auto rtrees2 = cv::ml::RTrees::create();
    cv::FileStorage read("rt_classifier.xml", cv::FileStorage::READ);
    // Restore the saved model into the second instance.
    rtrees2->read(read.root());

    return 0;
}


Sample to predict:

53, Private, 144361, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 38, United-States

Can I get any help formatting this data so that it can be fed to predict()?

Thanks in advance.



what's the problem, exactly ?

berak ( 2019-05-17 08:57:18 -0500 )

predict( InputArray samples, OutputArray results=noArray(), int flags=0 ) const = 0;

The input sample needs to be a floating-point matrix.

But the data to be predicted is a mix of floating-point values and strings, like the training data. So I am wondering how to form the input data for prediction.

frogeye ( 2019-05-17 09:00:49 -0500 )

what is the actual input to predict() ?

you probably need to process it in the same way as the training data.

berak ( 2019-05-17 09:04:02 -0500 )

/** @brief Predicts response(s) for the provided sample(s)

    @param samples The input samples, floating-point matrix
    @param results The optional output matrix of results.
    @param flags The optional flags, model-dependent. See cv::ml::StatModel::Flags.
 */
CV_WRAP virtual float predict( InputArray samples, OutputArray results=noArray(), int flags=0 ) const = 0;

The processing of the training data is taken care of by TrainData::loadFromCSV(), so I assume that it internally generates a mapping for the categorical values. When a new sample arrives for prediction, it should be processed using the same mappings that were generated while processing the training data. So your suggestion that we should process the prediction data separately is not clear to me. Please suggest.

frogeye ( 2019-05-17 09:13:04 -0500 )
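To make the "same mappings" point concrete, here is a minimal sketch of a string-to-number mapping that is learned from the training rows and then reused for a new sample. It uses only the standard library. CategoryEncoder and encodeRow are hypothetical names made up for this illustration; they are not part of OpenCV and do not reproduce whatever codes loadFromCSV() assigns internally.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not an OpenCV class): gives every distinct string in
// one categorical column a numeric code the first time it is seen.
class CategoryEncoder {
public:
    // Training time: return the existing code or assign the next free one.
    float fit(const std::string& value) {
        auto it = codes_.find(value);
        if (it != codes_.end()) return it->second;
        float code = static_cast<float>(codes_.size());
        codes_[value] = code;
        return code;
    }

    // Prediction time: only look up; a category never seen in training is an error.
    float transform(const std::string& value) const {
        auto it = codes_.find(value);
        if (it == codes_.end())
            throw std::out_of_range("unseen category: " + value);
        return it->second;
    }

private:
    std::map<std::string, float> codes_;
};

// Turns one row of text fields into floats. Ordered columns are parsed as
// numbers; categorical columns go through that column's encoder.
std::vector<float> encodeRow(const std::vector<std::string>& fields,
                             const std::vector<bool>& isCategorical,
                             std::vector<CategoryEncoder>& encoders,
                             bool training)
{
    std::vector<float> row;
    row.reserve(fields.size());
    for (std::size_t i = 0; i < fields.size(); ++i) {
        if (isCategorical[i])
            row.push_back(training ? encoders[i].fit(fields[i])
                                   : encoders[i].transform(fields[i]));
        else
            row.push_back(std::stof(fields[i]));
    }
    return row;
}
```

With something like this, every training row would be encoded with training = true before building the training matrix, and the sample to predict would be encoded with training = false and wrapped in a 1 x N CV_32F Mat. The codes only make sense for a model that was trained on rows encoded by the same encoder objects.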

use TrainData::loadFromCSV() again ?

or, put both train & test data into the same csv, read that, and use setTrainTestSplit ?

berak ( 2019-05-17 10:04:04 -0500 )