How to set the per-tree sample size for a random forest
When training a random forest, I can choose the number of features that are considered at each node. But how can I reduce the number of samples used to train a single tree in the forest? It seems each tree can only be trained on as many samples as there are in the input data set.
The problem is that the forests I train are too large to be practical. I could reduce the number of trees, but I would rather have many smaller trees (~100) than fewer large ones (~30); I think using many trees is what gives random forests their power. I could also reduce the number of samples by throwing away part of the input dataset, but I'd rather do this per tree (by reducing the bootstrap sample size) than shrink the dataset used to train all trees. That way I can still use all the input data and get more diverse trees.
Background information copied from https://docs.opencv.org/2.4/modules/m... (I want to change N to fraction*N):
"All the trees are trained with the same parameters but on different training sets. These sets are generated from the original training set using the bootstrap procedure: for each training set, you randomly select the same number of vectors as in the original set ( =N ). The vectors are chosen with replacement. That is, some vectors will occur more than once and some will be absent."
A workaround might be to write my own wrapper that subsamples the data as described above and trains a single-tree random forest 100 times. However, the selection of the feature subset at each node involves randomness, and this turns out to be initialized with a constant seed that is the same for every tree. That is bad for my final random forest, which should also be random in its choice of features. See also the bug report at https://github.com/opencv/opencv/issu... I'm very disappointed this is not yet solved.
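For what it's worth, here is a rough sketch of that wrapper. It uses the newer cv2.ml Python bindings rather than the 2.4 C++ API, so it illustrates the idea rather than being a drop-in fix; the names `train_subsampled_forest`, `predict_majority`, `fraction`, and `n_trees` are mine, not OpenCV's. If your build exposes `cv2.setRNGSeed`, re-seeding before each `train()` call may also work around the constant-seed bug:

```python
import numpy as np
import cv2

def train_subsampled_forest(samples, responses, n_trees=100, fraction=0.3, seed=0):
    """Train n_trees single-tree forests, each on a bootstrap sample of
    size fraction*N, so every tree sees different data.
    samples: float32 (N, D) feature matrix; responses: int32 (N,) class labels."""
    rng = np.random.default_rng(seed)
    n = samples.shape[0]
    k = max(1, int(fraction * n))
    trees = []
    for i in range(n_trees):
        idx = rng.integers(0, n, size=k)        # bootstrap draw, with replacement
        cv2.setRNGSeed(seed + i)                # vary OpenCV's internal RNG per tree
        tree = cv2.ml.RTrees_create()
        tree.setTermCriteria((cv2.TERM_CRITERIA_MAX_ITER, 1, 0))  # grow exactly one tree
        tree.train(samples[idx], cv2.ml.ROW_SAMPLE, responses[idx])
        trees.append(tree)
    return trees

def predict_majority(trees, samples):
    """Combine the single-tree forests by majority vote (classification,
    assuming non-negative integer class labels)."""
    votes = np.stack([t.predict(samples)[1].ravel() for t in trees])
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```

With fraction=0.3 and n_trees=100, each tree is trained on only ~30% of N rows, yet every input sample still has a chance of appearing in some tree's bootstrap.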
Are you looking for this? Or this?
No, the first link you mention describes the minimum number of samples a node must contain before it is split into deeper child nodes. It does not reduce the input data; it is just a stopping threshold for training (the number of samples that ends up in a node, not the number of samples used to train the tree).
The second link you mention is what I already described: changing the number of features that are randomly selected per node, not the number of bootstrap samples used to train the tree. A feature (variable) is not the same thing as a sample; one sample has multiple features/variables.
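To make the distinction concrete, here is how those two parameters are set in the newer cv2.ml bindings (in the 2.4 API they correspond, as far as I know, to the min_sample_count and nactive_vars fields of CvRTParams):

```python
import cv2

rt = cv2.ml.RTrees_create()
rt.setMinSampleCount(10)   # first link: a node with fewer than 10 samples is not split further
rt.setActiveVarCount(4)    # second link: number of features tried at each split
# Neither parameter changes how many of the N input rows each tree is trained on.
```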
Thanks for the clarification!