Question on the Random Forest Classifier implementation

asked 2018-01-26 08:58:54 -0600

pefue gravatar image

updated 2018-01-26 09:04:23 -0600

Hi,

I am currently comparing OpenCV's RandomTrees Classifier with an own implementation of a random forest classifier. However, the classification rates vary strongly between the two implementations (more than expected, even when considering the random number generation).

Is there someone who is familiar with the implementation details of OpenCV's RandomTrees and who can verify if my understanding of OpenCV's implementation?

This is how I understand the training algorithm:

First, let's assume we have a set of 10000 labeled training samples for a binary classification problem. Furthermore, the number of trees is limited to 10 (CV_TERMCRIT_ITER=10), the tree depth to 7 (maxDepth=7) and the number of active samples at each split is set to sqrt(N), with N being the number of samples that are considered at a particular split (nactiveVars=0).

This means, that for each tree, a training set of N=10000 samples is randomly drawn (with replacement). Next, out of the N=10000 samples, sqrt(N)=100 samples are used to find a single split (basic threshold) that minimizes the weighted Gini-impurity. This weighted Gini-impurity is calculated on the Gini-impurities of all samples that go either right or left (weighted by their share of the total number of features). Next, two new nodes (left&right) are created and appended to the root and training continues recursively for the two nodes until some stop criteria is met.

Let's say the algorithm continues with the left branch. At this node, the number of features N equals 6000 (arbitrarily chosen in this example, corresponds to 60% of the tree's training data). Again, only a subset of all samples is used to find the best split, or more specifically sqrt(6000) = 77.

... the algorithm continues recursively to create new split nodes until:

  • too few samples end up being in a new node (parameter: minSampleCount)
  • the maximum tree depth has been reached
  • all features belong to only a single class

Questions:

One question that I am particularly interested in is, whether it is correct that only sqrt(N) of all N samples that are available at a particular node, are used for training.

Furthermore, I an wondering how the optimum split value is obtained. After reading through the source I got the impression that it is basically a "brute-force"-like search which tries basically all possible values in the range of the current subset of samples (Sort all responses ascending, use the average of the i-th and (i+1)-th value as split candidate). Is this correct?

Thanks for your help, Peter

edit retag flag offensive close merge delete