Recommended values for OpenCV RTrees parameters

asked 2015-06-28 01:26:12 -0500

Any idea on the recommended parameters for OpenCV RTrees? I have read the documentation and I'm trying to apply it to the MNIST dataset, i.e. 60000 training images and 10000 test images. I'm trying to optimize MaxDepth, MinSampleCount, setMaxCategories, and setPriors, e.g.:

    Ptr<RTrees> model = RTrees::create();

    /* Depth of the tree.
       A low value will likely underfit and, conversely,
       a high value will likely overfit.
       The optimal value can be obtained using cross-validation
       or other suitable methods. */
    model->setMaxDepth(?); // letter_recog.cpp uses 10

    /* Minimum number of samples required at a node for it to be split.
       A reasonable value is a small percentage of the total data, e.g. 1%.
       MNIST: 70000 * 0.01 = 700 */
    model->setMinSampleCount(700?); // letter_recog.cpp uses 10

    /* regression_accuracy – termination criterion for regression trees.
       If all absolute differences between an estimated value in a node and
       the values of the train samples in this node are less than this
       parameter, then the node will not be split. */
    model->setRegressionAccuracy(0); // I think this is already correct

    /* use_surrogates – if true, surrogate splits will be built.
       These splits allow working with missing data and computing variable
       importance correctly. To compute variable importance correctly, the
       surrogate splits must be enabled in the training parameters, even if
       there is no missing data. */
    model->setUseSurrogates(true); // I think this is already correct

    /* max_categories – cluster possible values of a categorical variable
       into K <= max_categories clusters to find a suboptimal split. If a
       discrete variable on which the training procedure tries to make a
       split takes more than max_categories values, finding the precise best
       subset may take a very long time because the algorithm is exponential.
       Instead, many decision-tree engines (including ML) try to find a
       sub-optimal split in this case by clustering all the samples into
       max_categories clusters, i.e. some categories are merged together.
       The clustering is applied only in n > 2-class classification problems
       for categorical variables with N > max_categories possible values.
       In the case of regression and 2-class classification, the optimal
       split can be found efficiently without employing clustering, so the
       parameter is not used in these cases. */
    model->setMaxCategories(?); // letter_recog.cpp uses 15

    /* priors – the array of a priori class probabilities, sorted by class
       label value. The parameter can be used to tune the decision tree's
       preference toward a certain class. For example, if you want to detect
       some rare anomaly, the training base will likely contain many more
       normal cases than anomalies, so very good classification performance
       can be achieved just by labeling every case as normal.

       To avoid this, the priors can be specified, where the anomaly
       probability is artificially increased (up to 0.5 or even greater), so
       the weight of the misclassified anomalies becomes much bigger and the
       tree is adjusted accordingly. You can also think of this parameter as
       the weights of the prediction categories, which determine the relative
       weights you give to misclassification. That is, if the weight of the
       first category is 1 and the weight of the second ... */


I ran it recently:

    model->setUseSurrogates(true); // This should be set to false; it gave me an error that using surrogates is not yet implemented.

And it follows that the following must be set to false as well:

    model->setCalculateVarImportance(false); // Computing variable importance needs surrogates, which are not yet implemented
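Putting these corrections together, here is a hedged configuration sketch (assuming the OpenCV 3.x `cv::ml` API; the numeric values are the `letter_recog.cpp`-style defaults discussed above, not tuned MNIST settings):

```cpp
// Configuration sketch only – assumes OpenCV 3.x (opencv2/ml.hpp).
#include <opencv2/ml.hpp>
using namespace cv;
using namespace cv::ml;

Ptr<RTrees> makeModel() {
    Ptr<RTrees> model = RTrees::create();
    model->setMaxDepth(10);                  // letter_recog.cpp uses 10
    model->setMinSampleCount(10);            // letter_recog.cpp uses 10
    model->setRegressionAccuracy(0);         // classification: unused
    model->setUseSurrogates(false);          // surrogates not yet implemented
    model->setMaxCategories(15);             // letter_recog.cpp uses 15
    model->setPriors(Mat());                 // empty Mat = equal class priors
    model->setCalculateVarImportance(false); // would require surrogates
    model->setActiveVarCount(0);             // 0 = sqrt(number of features)
    model->setTermCriteria(TermCriteria(TermCriteria::MAX_ITER, 100, 0));
    return model;
}
```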
mkc ( 2015-07-04 06:01:47 -0500 )

    model->setCalculateVarImportance(true);
    Mat var_importance = model->getVarImportance(); // each element of var_importance always has the same value

BY ( 2015-10-20 03:08:04 -0500 )