Recommended values for OpenCV RTrees parameters

mkc — Sun, 28 Jun 2015 01:26:12 -0500

Any idea on the recommended parameters for OpenCV RTrees? I have read the documentation and I'm trying to apply it to the MNIST dataset, i.e. 60000 training images and 10000 testing images. I'm trying to optimize MaxDepth, MinSampleCount, MaxCategories, and Priors, e.g.:

    Ptr<RTrees> model = RTrees::create();

    /* Depth of the tree. A low value will likely underfit and conversely a
       high value will likely overfit. The optimal value can be obtained using
       cross validation or other suitable methods. */
    model->setMaxDepth(?); // letter_recog.cpp uses 10

    /* Minimum samples required at a leaf node for it to be split. A reasonable
       value is a small percentage of the total data, e.g. 1%.
       MNIST 70000 * 0.01 = 700 */
    model->setMinSampleCount(700?); // letter_recog.cpp uses 10

    /* regression_accuracy - Termination criteria for regression trees. If all
       absolute differences between an estimated value in a node and values of
       train samples in this node are less than this parameter then the node
       will not be split. */
    model->setRegressionAccuracy(0); // I think this is already correct

    /* use_surrogates - If true then surrogate splits will be built. These
       splits allow to work with missing data and compute variable importance
       correctly. To compute variable importance correctly, the surrogate
       splits must be enabled in the training parameters, even if there is no
       missing data. */
    model->setUseSurrogates(true); // I think this is already correct

    /* Cluster possible values of a categorical variable into
       K <= max_categories clusters to find a suboptimal split. If a discrete
       variable, on which the training procedure tries to make a split, takes
       more than max_categories values, the precise best subset estimation may
       take a very long time because the algorithm is exponential. Instead,
       many decision tree engines (including ML) try to find a sub-optimal
       split in this case by clustering all the samples into max_categories
       clusters, that is, some categories are merged together.
       The clustering is applied only in n > 2-class classification problems
       for categorical variables with N > max_categories possible values. In
       case of regression and 2-class classification the optimal split can be
       found efficiently without employing clustering, thus the parameter is
       not used in these cases. */
    model->setMaxCategories(?); // letter_recog.cpp uses 15

    /* priors - The array of a priori class probabilities, sorted by the class
       label value. The parameter can be used to tune the decision tree
       preferences toward a certain class. For example, if you want to detect
       some rare anomaly occurrence, the training base will likely contain
       much more normal cases than anomalies, so a very good classification
       performance will be achieved just by considering every case as normal.
       To avoid this, the priors can be specified, where the anomaly
       probability is artificially increased (up to 0.5 or even greater), so
       the weight of the misclassified anomalies becomes much bigger, and the
       tree is adjusted properly. You can also think about this parameter as
       weights of prediction categories which determine relative weights that
       you give to misclassification. That is, if the weight of the first
       category is 1 and the weight of the second category is 10, then each
       mistake in predicting the second category is equivalent to making 10
       mistakes in predicting the first category. */
    model->setPriors(Mat()); // ?

    /* If true then variable importance will be calculated and then it can be
       retrieved by CvRTrees::get_var_importance(). */
    model->setCalculateVarImportance(true); // I think this is already correct

    /* The size of the randomly selected subset of features at each tree node
       that are used to find the best split(s). If you set it to 0 then the
       size will be set to the square root of the total number of features. */
    model->setActiveVarCount(0); // I think this is already correct

    /* CV_TERMCRIT_ITER: terminate learning by the
       max_num_of_trees_in_the_forest;
       CV_TERMCRIT_EPS: terminate learning by the forest_accuracy;
       CV_TERMCRIT_ITER | CV_TERMCRIT_EPS: use both termination criteria. */
    model->setTermCriteria(TC(100,0.01f)); // I think this is already correct
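
To make the arithmetic in the quoted documentation concrete, here is a minimal sketch in plain C++ (no OpenCV calls). The helper names minSampleCountFromFraction, defaultActiveVarCount and weightedMistakes are hypothetical and not part of OpenCV. It assumes raw MNIST pixels are used as features (28 x 28 = 784), that the square root is rounded down, and that the "percentage of the data" rule is applied to whichever sample count is actually passed to training.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: node-size threshold as a fraction of the training set,
// following the "small percentage of the total data" rule of thumb.
int minSampleCountFromFraction(std::size_t trainSamples, double fraction) {
    assert(fraction > 0.0 && fraction <= 1.0);
    return static_cast<int>(std::lround(trainSamples * fraction));
}

// Hypothetical helper: feature-subset size implied by setActiveVarCount(0),
// i.e. the square root of the feature count (rounding down is an assumption).
int defaultActiveVarCount(int featureCount) {
    assert(featureCount > 0);
    return static_cast<int>(std::sqrt(static_cast<double>(featureCount)));
}

// Hypothetical helper: total misclassification cost when the mistakes of each
// class are scaled by that class's weight, as in the priors description.
double weightedMistakes(const std::vector<int>& mistakesPerClass,
                        const std::vector<double>& classWeights) {
    assert(mistakesPerClass.size() == classWeights.size());
    double total = 0.0;
    for (std::size_t i = 0; i < mistakesPerClass.size(); ++i) {
        total += mistakesPerClass[i] * classWeights[i];
    }
    return total;
}
```

Under these assumptions, 1% of the 60000 training images gives 600 (700 only if all 70000 images are counted), 784 pixel features give a subset of 28 variables per node, and with weights 1 and 10 one mistake on the second class costs as much as ten mistakes on the first. These are starting points to cross-validate, not recommended values.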

OpenCV Q&A Forum
