Out of bag error in Random Forest

The RandomForestClassifier in scikit-learn has the following parameter:

oob_score : bool (default=False)
Whether to use out-of-bag samples to estimate the generalization accuracy.

The Chinese term translates literally as 'outside-the-bag error'. Setting this parameter to True tells the classifier to use the out-of-bag (OOB) samples to estimate the test error.
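As a minimal sketch of how this parameter is used, the snippet below fits a forest with oob_score=True and reads the estimate back from the fitted oob_score_ attribute. The synthetic dataset from make_classification and all the sizes and seeds are my own assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, assumed purely for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# oob_score=True requires bootstrap=True, which is the default.
clf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
clf.fit(X, y)

oob_accuracy = clf.oob_score_   # accuracy measured on out-of-bag samples
oob_error = 1.0 - oob_accuracy  # the corresponding OOB error
print(f"OOB accuracy: {oob_accuracy:.3f}, OOB error: {oob_error:.3f}")
```

Note that oob_score_ is an accuracy, so the OOB error is one minus that value. No separate test split is needed to obtain it.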


There is a more comprehensive explanation of OOB on Stack Overflow (see the references at the end).
Here is my understanding:

A random forest builds each tree from a random sample of the original training set, and additionally considers a random subset of the features when splitting. The training sample of each tree is drawn from the original training set by bootstrapping. Because bootstrapping samples with replacement, the training set varies from tree to tree and contains only part of the original training set: for an original training set of size N, a bootstrap sample of size N contains on average about 63.2% of the distinct original samples. For the k-th tree, the data in the original training set that was not drawn for the k-th tree (its out-of-bag samples) can therefore be used to test the k-th tree. Now consider a single sample (xi, yi). Take all the trees whose bootstrap sample does not contain (xi, yi), which is roughly 36.8% of the trees; the result of their vote is the OOB prediction for (xi, yi). Comparing these predictions with the true labels over all samples gives the OOB error.
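To make the "only a part of the original training set" point concrete, here is a small simulation using only the Python standard library. It draws one bootstrap sample and measures the fraction of the original samples that were left out. The function name oob_fraction and the sample size are hypothetical choices of mine.

```python
import random

def oob_fraction(n_samples, seed=0):
    """Fraction of the original samples left out of one bootstrap sample."""
    rng = random.Random(seed)
    # Draw n_samples indices with replacement; the set keeps the distinct ones.
    in_bag = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1.0 - len(in_bag) / n_samples

frac = oob_fraction(10_000)
print(f"out-of-bag fraction: {frac:.3f}")  # expected to be near 1/e, about 0.368
```

The probability that a given sample is never drawn in N draws is (1 - 1/N)^N, which approaches 1/e as N grows, so each tree leaves out a bit more than a third of the data.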

This allows you to test while training, and experience shows that:

out-of-bag estimate is as accurate as using a test set of the same size as the training set.

The OOB error is generally treated as an unbiased estimate of the test error.
To sum up, let Zi = (xi, yi).

The out-of-bag (OOB) error is the average error for each Zi calculated using predictions from the trees that do not contain Zi in their respective bootstrap sample. This allows the RandomForestClassifier to be fit and validated whilst being trained.
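The definition above can be sketched by hand with bagged decision trees. This is only my illustration of the idea, not how scikit-learn implements it: it uses a hard majority vote, whereas RandomForestClassifier averages class probabilities for its OOB estimate. The function name manual_oob_error, the synthetic dataset and the number of trees are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def manual_oob_error(X, y, n_trees=100, seed=0):
    """OOB error by majority vote over trees that did not see each sample."""
    rng = np.random.default_rng(seed)
    n = len(X)
    classes = np.unique(y)                      # sorted class labels
    votes = np.zeros((n, len(classes)), dtype=int)
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)        # bootstrap: draw with replacement
        oob = np.ones(n, dtype=bool)
        oob[idx] = False                        # True where the sample was not drawn
        if not oob.any():
            continue
        tree = DecisionTreeClassifier(
            max_features="sqrt",
            random_state=int(rng.integers(1_000_000_000)),
        )
        tree.fit(X[idx], y[idx])
        pred = tree.predict(X[oob])
        cols = np.searchsorted(classes, pred)   # map predicted labels to columns
        votes[np.flatnonzero(oob), cols] += 1   # one vote per OOB sample
    covered = votes.sum(axis=1) > 0             # samples with at least one OOB vote
    oob_pred = classes[votes[covered].argmax(axis=1)]
    return float(np.mean(oob_pred != y[covered]))

# Synthetic data, assumed purely for illustration.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
err = manual_oob_error(X, y, n_trees=50)
print(f"manual OOB error: {err:.3f}")
```

Each Zi is only ever scored by trees that did not train on it, which is why the forest can be validated while it is being fit.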


References
OOB explanation on Stack Overflow
sklearn OOB explanation on Stack Overflow