How to set the sample size for random forest

When training a random forest, I can choose the number of features that are considered at each node. But how can I reduce the number of samples used to train a single tree in the forest? It seems each tree's bootstrap sample can only be the same size as the input data set.

The problem is that the forests I train are too large to be practical. I could reduce the number of trees, but I would rather have many smaller trees (~100) than fewer large ones (~30); using many trees is the main strength of random forests. I could also reduce the number of samples by throwing away part of the input dataset, but I would rather do this per tree (by reducing the bootstrap sample size) than shrink the dataset used to train all trees. That way every input vector can still be used somewhere, and the trees differ more from each other.

Background information copied from https://docs.opencv.org/2.4/modules/ml/doc/random_trees.html?highlight=rtrees: (I want to change N to fraction*N)

"All the trees are trained with the same parameters but on different training sets. These sets are generated from the original training set using the bootstrap procedure: for each training set, you randomly select the same number of vectors as in the original set ( =N ). The vectors are chosen with replacement. That is, some vectors will occur more than once and some will be absent."
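To make the change concrete, the bootstrap procedure quoted above, extended with the fraction parameter I would like to have, can be sketched in a few lines. Note that `bootstrap_sample` and `fraction` are names I made up for illustration; as far as I can tell, OpenCV's RTrees does not expose such a parameter.

```python
import random

def bootstrap_sample(data, fraction=1.0, rng=None):
    """Draw round(fraction * N) vectors with replacement from data.

    fraction=1.0 reproduces the standard bootstrap described in the
    docs quoted above; fraction < 1.0 is the subsampling I am after.
    """
    rng = rng or random.Random()
    n = round(fraction * len(data))
    return [data[rng.randrange(len(data))] for _ in range(n)]

data = list(range(1000))
sample = bootstrap_sample(data, fraction=0.3, rng=random.Random(42))
print(len(sample))  # 300: each tree would see only 30% as many vectors
```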

A possible workaround is to write my own wrapper that samples the data as described above and trains a one-tree random forest 100 times. However, the selection of the feature subset at each node is randomized, and that randomness turns out to be initialized with a constant seed that is the same for every tree. That defeats the purpose: the final forest should also be random in its choice of features. See the reported bug https://github.com/opencv/opencv/issues/4839. I am disappointed this has not been fixed yet.
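For completeness, here is a rough sketch of what such a wrapper could look like, in plain Python. The `train_tree` callback is a placeholder standing in for the real single-tree training call (e.g. an RTrees instance limited to one tree); `train_bagged_forest`, `predict`, and all parameter names are invented for illustration. The two details that matter are the per-tree subsample of round(fraction * N) vectors and the distinct seed per tree, which is exactly what the constant-seed bug prevents inside OpenCV:

```python
import random
from collections import Counter

def train_bagged_forest(data, labels, train_tree, n_trees=100,
                        fraction=0.3, base_seed=12345):
    """Train n_trees models, each on a fresh bootstrap sample of
    round(fraction * N) vectors, with a distinct RNG seed per tree.

    train_tree(samples, targets, seed) -> predictor is a placeholder
    for the real single-tree training call.
    """
    forest = []
    n = round(fraction * len(data))
    for t in range(n_trees):
        rng = random.Random(base_seed + t)            # distinct seed per tree
        idx = [rng.randrange(len(data)) for _ in range(n)]  # with replacement
        forest.append(train_tree([data[i] for i in idx],
                                 [labels[i] for i in idx],
                                 seed=base_seed + t))
    return forest

def predict(forest, x):
    """Majority vote over the individual trees' predictions."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]

# Toy demonstration: each "tree" just predicts the majority label it saw.
def train_tree(samples, targets, seed):
    majority = Counter(targets).most_common(1)[0][0]
    return lambda x: majority

data = [[v] for v in range(100)]
labels = [0] * 70 + [1] * 30
forest = train_bagged_forest(data, labels, train_tree, n_trees=25)
print(predict(forest, [5]))  # label 0 dominates the data, so the vote should be 0
```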