Recommended values for OpenCV RTrees parameters

Any idea on the recommended parameters for OpenCV RTrees? I have read the documentation and I'm trying to apply it to the MNIST dataset, i.e. 60000 training images and 10000 test images. I'm trying to tune MaxDepth, MinSampleCount, MaxCategories, and Priors, e.g.

    Ptr<RTrees> model = RTrees::create();

    /* Depth of the tree. 
    A low value will likely underfit and conversely 
    a high value will likely overfit. 
    The optimal value can be obtained using cross validation 
    or other suitable methods.
    */
    model->setMaxDepth(?); // letter_recog.cpp uses 10


    /* minimum samples required at a leaf node for it to be split. 
    A reasonable value is a small percentage of the total data e.g. 1%.
    MNIST 70000 * 0.01 = 700 
    */
    model->setMinSampleCount(700?); // letter_recog.cpp uses 10



    /* regression_accuracy – Termination criteria for regression trees. 
    If all absolute differences between an estimated value in a node and 
    values of train samples in this node are less than this parameter 
    then the node will not be split. */
    model->setRegressionAccuracy(0); // I think this is already correct


    /* 
     use_surrogates – If true then surrogate splits will be built. 
     These splits make it possible to work with missing data and to compute variable importance correctly.
     To compute variable importance correctly, the surrogate splits must be enabled in 
     the training parameters, even if there is no missing data.
    */
    model->setUseSurrogates(true);  // I think this is already correct



    /* 
     Cluster possible values of a categorical variable into K <= max_categories clusters 
     to find a suboptimal split. If a discrete variable, on which the training procedure 
     tries to make a split, takes more than max_categories values, the precise best subset
     estimation may take a very long time because the algorithm is exponential. 
     Instead, many decision tree engines (including ML) try to find a sub-optimal split 
     in this case by clustering all the samples into max_categories clusters, that is, 
     some categories are merged together. The clustering is applied only in n>2-class
     classification problems for categorical variables with N > max_categories possible values.
     In case of regression and 2-class classification the optimal split can be found
     efficiently without employing clustering, thus the parameter is not used in these cases.
    */
    model->setMaxCategories(?); // letter_recog.cpp uses 15



    /* 
    priors – The array of a priori class probabilities, sorted by the class label value. 
    The parameter can be used to tune the decision tree preferences toward a certain class. 
    For example, if you want to detect some rare anomaly occurrence, the training base will
    likely contain much more normal cases than anomalies, so a very good classification
    performance will be achieved just by considering every case as normal. 

    To avoid this, the priors can be specified, where the anomaly probability is 
    artificially increased (up to 0.5 or even greater), so the weight of the misclassified
    anomalies becomes much bigger, and the tree is adjusted properly. You can also think about
    this parameter as weights of prediction categories which determine relative weights that 
    you give to misclassification. That is, if the weight of the first category is 1 and 
    the weight of the second category is 10, then each mistake in predicting the 
    second category is equivalent to making 10 mistakes in predicting the first category.
    */
    model->setPriors(Mat()); // ?

    /* If true, variable importance will be calculated and 
     can then be retrieved with RTrees::getVarImportance()
     (CvRTrees::get_var_importance() in the 2.x API). 
    */
    model->setCalculateVarImportance(true); // I think this is already correct

    /*
     The size of the randomly selected subset of features that is used at each 
     tree node to find the best split(s). If you set it to 0 then the size 
     will be set to the square root of the total number of features.
    */
    model->setActiveVarCount(0); // I think this is already correct



    /*
    CV_TERMCRIT_ITER Terminate learning by the max_num_of_trees_in_the_forest;
    CV_TERMCRIT_EPS Terminate learning by the forest_accuracy;
    CV_TERMCRIT_ITER | CV_TERMCRIT_EPS Use both termination criteria.
    */
    model->setTermCriteria(TC(100,0.01f)); // TC() is the helper from letter_recog.cpp; I think this is already correct
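
To make the arithmetic in the comments above concrete, here is a minimal sketch that only computes candidate starting values; it does not call OpenCV and it is not a recommendation from the documentation. The struct `RTreesStartingValues` and the function `suggestStartingValues` are hypothetical names made up for this illustration. It assumes the 1% rule of thumb is applied to the 60000 training images only (which gives 600 rather than 700), that `setActiveVarCount(0)` resolves to the square root of the feature count as the comment states, and that uniform priors are a reasonable stand-in for an empty `Mat()` on a roughly balanced dataset such as MNIST.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container for the values discussed in the question.
struct RTreesStartingValues {
    int minSampleCount;         // 1% of the training samples (rule of thumb, an assumption)
    int activeVarCount;         // sqrt(feature count), what a setting of 0 is described to select
    std::vector<float> priors;  // uniform class weights, one entry per class
};

// Hypothetical helper: derives starting values from the dataset shape.
// These are starting points for cross validation, not tuned optima.
RTreesStartingValues suggestStartingValues(int trainSamples, int featureCount, int classCount) {
    assert(trainSamples > 0 && featureCount > 0 && classCount > 0);

    RTreesStartingValues v;
    v.minSampleCount = trainSamples / 100;
    v.activeVarCount = static_cast<int>(
        std::lround(std::sqrt(static_cast<double>(featureCount))));
    v.priors.assign(static_cast<std::size_t>(classCount),
                    1.0f / static_cast<float>(classCount));
    return v;
}
```

For MNIST this would be called as `suggestStartingValues(60000, 28 * 28, 10)`. The resulting numbers are only a first guess; MaxDepth and MinSampleCount would still need to be checked against a held-out part of the training set.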