Hello everyone,
I am using OpenCV for a machine learning classification task, currently with random forests. To analyse the predictive quality I would like to look at more than just the classifier's bare class output.
Looking one step deeper into the decision tree/random forest algorithm, each leaf node votes for the class with the largest fraction of the training samples reaching that leaf. Consequently, for each leaf l and each class k there is a score s(l,k) = number of training samples from class k in leaf l. For my analysis I want to compute this score s(l,k) for all leaves l and all classes k. (The reason is that a prediction from a leaf where all classes have more or less the same fraction of samples is expected to be less reliable than a prediction from a leaf where all samples vote for the same class.)
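To make the reliability idea concrete, here is a minimal sketch of what I would do with the scores of a single leaf once I have them. The name leafConfidence is my own invention and nothing here is OpenCV API; counts[k] simply stands for s(l,k) of one leaf l.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of OpenCV): counts[k] holds s(l,k) for one leaf l.
// Returns the fraction of samples in the leaf that belong to the majority class,
// or 0 if the leaf received no samples at all.
double leafConfidence(const std::vector<int>& counts)
{
    const int total = std::accumulate(counts.begin(), counts.end(), 0);
    if (total == 0)
        return 0.0;  // also covers an empty vector
    const int best = *std::max_element(counts.begin(), counts.end());
    return static_cast<double>(best) / total;
}
```

A leaf with counts {10, 0, 0} would get confidence 1, while a leaf with counts {5, 5} would get 0.5, which is the kind of distinction I am after.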
Computing this number s(l,k) is trivial in theory: while growing the tree, don't just store the bare class prediction from the majority vote, but store the number of votes for each class instead. Unfortunately, in practice this turns out to be a little harder.
Since I don't want to modify the OpenCV source code, I tried to find another solution based on the CvRTrees::get_tree() method: after growing the forest, run the training set through each tree (starting at the root) and accumulate in every leaf the number of votes for each class. But for this I need the training set of each individual tree. The problem is that this training set does not coincide with the training set of my random forest, because the training set for each tree is obtained by drawing N = |training set| random samples with replacement from the forest's training set.
So to go this way I need the individual training sets of each of the forest's trees. Re-computing them is not possible (due to the randomisation), so I have to extract them from the grown trees. To tackle this problem I tried to use the CvRTrees::get_tree() method, which returns a CvForestTree, a subclass of CvDTree; so with CvDTree::get_data() I should have access to the tree's specific training data, or at least that is what I expected. I did the following:
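To illustrate the sampling I mean, here is a minimal sketch of drawing N row indices with replacement. This is only how I understand the principle; the function name bootstrapIndices, the seed parameter and the use of std::mt19937 are my own assumptions and not OpenCV's actual sampling code.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical illustration (not OpenCV code): draw n row indices uniformly
// with replacement from {0, ..., n-1}. Some rows appear several times,
// others not at all, so every tree sees a different sample.
std::vector<std::size_t> bootstrapIndices(std::size_t n, unsigned seed)
{
    std::vector<std::size_t> idx(n);
    if (n == 0)
        return idx;
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, n - 1);
    for (std::size_t i = 0; i < n; ++i)
        idx[i] = pick(rng);
    return idx;
}
```

The result always has N entries, but which rows it contains depends on the state of the random number generator, which is why I cannot simply redo the draw after training.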
CvRTrees* randomForest = new CvRTrees();
randomForest->train(...);
for (int t = 0; t < 100; ++t) // ntrees = 100
{
    CvForestTree* tree = randomForest->get_tree(t);
    const CvMat* currentTrainData = tree->get_data()->train_data;
    printf("%d \n", currentTrainData->rows);
}
At least in theory this should give the training data of each tree, but unfortunately currentTrainData equals the whole training set of the random forest every time.
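For reference, this is roughly the accumulation step I have in mind once the per-tree training rows are available. It is only a sketch on a stand-in tree: the Node struct, the "go left if value <= threshold" convention and the function names are assumptions of mine for illustration, not OpenCV's CvDTreeNode or its split logic.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a binary tree node (NOT OpenCV's CvDTreeNode).
struct Node {
    int featureIdx = -1;      // feature tested at this node (unused in leaves)
    float threshold = 0.0f;   // assumed rule: go left if sample[featureIdx] <= threshold
    int left = -1;            // index of the left child, -1 marks a leaf
    int right = -1;           // index of the right child, -1 marks a leaf
    std::vector<int> counts;  // s(l,k), filled only for leaves
};

// Route one sample from the root (index 0) to its leaf and return the leaf index.
int findLeaf(const std::vector<Node>& tree, const std::vector<float>& sample)
{
    int i = 0;
    while (tree[i].left >= 0) {
        const Node& n = tree[i];
        i = (sample[n.featureIdx] <= n.threshold) ? n.left : n.right;
    }
    return i;
}

// Accumulate s(l,k) in every leaf by sending each (sample, label) pair down the tree.
void accumulateLeafCounts(std::vector<Node>& tree,
                          const std::vector<std::vector<float> >& samples,
                          const std::vector<int>& labels,
                          int numClasses)
{
    assert(samples.size() == labels.size());
    for (Node& n : tree)
        if (n.left < 0)
            n.counts.assign(numClasses, 0);  // reset the per-class votes of each leaf
    for (std::size_t s = 0; s < samples.size(); ++s) {
        assert(labels[s] >= 0 && labels[s] < numClasses);
        ++tree[findLeaf(tree, samples[s])].counts[labels[s]];
    }
}
```

The samples passed in would have to be exactly the rows (with repetitions) that the tree was grown on, which is the part I cannot get hold of.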
Does anyone know what I am doing wrong or how to get the scores s(l,k) / the real train data of each tree?
I tried to find out as much as possible from the OpenCV reference manual (http://docs.opencv.org/opencv2refman.pdf, pages 451-454) and the online documentation (http://physics.nyu.edu/grierlab/manuals/opencv/classCvDTree.html#ab497953ff96cc5f21d198343ce36486d), but unfortunately there is no detailed information on these points.
Thanks a lot!