Random forest algorithm learning

While doing Kaggle competitions recently, I found that the random forest algorithm performs very well on classification problems, in most cases far better than SVM, logistic regression, KNN and other algorithms. So I wanted to understand how this algorithm works.
To learn random forests, we first briefly introduce ensemble learning methods and the decision tree algorithm. The following is only a brief introduction to these two topics (for deeper study, see Chapters 5 and 8 of Statistical Learning Methods).


Bagging and Boosting concepts and differences
This part mainly draws on: http://www.cnblogs.com/liuwu265/p/4690486.html
Random forest belongs to the Bagging family of algorithms in ensemble learning. Ensemble learning methods are mainly divided into Bagging algorithms and Boosting algorithms. Let's first look at the characteristics of, and differences between, the two approaches.
Bagging
The algorithm process of Bagging is as follows:

    1. Randomly draw n training samples from the original sample set using the Bootstrapping method; conduct a total of k rounds of drawing, obtaining k training sets. (The k training sets are independent of each other, and elements may repeat.)
    2. For the k training sets, train k models. (The model type can be chosen according to the specific problem, e.g. decision tree, KNN, etc.)
    3. For classification problems, the result is produced by voting; for regression problems, the mean of the k models' predictions is taken as the final prediction. (All models carry equal weight.)

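The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not code from the original post: `base_learner` stands for any classifier class with `fit`/`predict` methods, and the function name is made up for this example.

```python
import numpy as np
from collections import Counter

def bagging_predict(X_train, y_train, X_test, base_learner, k=10, seed=0):
    """Train k models on bootstrap samples of the training set and
    combine their predictions by majority vote (equal weight per model)."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    votes = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)        # draw n samples with replacement
        model = base_learner()
        model.fit(X_train[idx], y_train[idx])   # each model sees its own bootstrap set
        votes.append(model.predict(X_test))
    votes = np.array(votes)
    # classification: majority vote over the k models' predictions
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])
```

For regression, the last line would simply become `votes.mean(axis=0)`.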
Boosting
Boosting algorithm process is as follows:

    Boosting establishes a weight wi for each sample in the training set, indicating how much attention is paid to that sample. When a sample has a high probability of being misclassified, its weight is increased. Each iteration produces a weak classifier, and some strategy is needed to combine them into the final model. (For example, AdaBoost assigns each weak classifier a weight and combines them linearly into the final classifier; weak classifiers with smaller error receive larger weights.)

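As a concrete illustration of the weight-update idea, here is one round of AdaBoost. This is a sketch of the standard formulas under the assumption that labels are +1/-1; the function name is illustrative.

```python
import numpy as np

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost round: compute the weak classifier's weight alpha from
    its weighted error rate, then re-weight the samples so misclassified
    ones get more attention in the next round. Labels must be +1/-1."""
    err = np.sum(weights[y_true != y_pred])   # weighted error rate e_m
    alpha = 0.5 * np.log((1 - err) / err)     # smaller error -> larger alpha
    new_w = weights * np.exp(-alpha * y_true * y_pred)
    return alpha, new_w / new_w.sum()         # renormalize to a distribution
```

After M such rounds, the final classifier is sign(sum over m of alpha_m * G_m(x)), i.e. a linear combination of the weak classifiers weighted by their alphas.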
The main differences between Bagging and Boosting

    - Sample selection: Bagging uses Bootstrap sampling with replacement, while Boosting keeps the training set unchanged each round and only changes the weight of each sample.
    - Sample weights: Bagging uses uniform sampling, with equal weight for every sample; Boosting adjusts the sample weights according to the error rate: the higher the error rate, the larger the weight.
    - Prediction functions: in Bagging, all prediction functions have equal weight; in Boosting, prediction functions with lower error receive larger weights.
    - Parallel computation: in Bagging, the prediction functions can be generated in parallel; in Boosting, each prediction function must be generated iteratively, in sequence.

The following is the new algorithm obtained by combining the decision tree with these algorithm frameworks:
1) Bagging + decision tree = random forest
2) AdaBoost + decision tree = boosting tree
3) Gradient Boosting + decision tree = GBDT


The decision tree
Common decision tree algorithms include ID3, C4.5 and CART. The model-building ideas of the three algorithms are very similar; they differ mainly in the splitting criterion used. The process of building a decision tree is roughly as follows:
Generation of ID3 and C4.5 decision trees
Input: training set D, feature set A, threshold eps
Output: decision tree T

    1. If all samples in D belong to the same class Ck, T is a single-node tree with Ck as the class label of that node; return T.
    2. If A is the empty set, i.e. there are no features left to split on, T is a single-node tree, with the class Ck having the most instances in D as the class label of that node; return T.
    3. Otherwise, compute the information gain (ID3) / information gain ratio (C4.5) of each feature in A with respect to D, and select the feature Ag with the largest gain.
    4. If the information gain (ratio) of Ag is less than the threshold eps, T is a single-node tree, with the class Ck having the most instances in D as the class label of that node; return T.
    5. Otherwise, split D into several non-empty subsets Di according to the values of feature Ag; for each Di, take the class with the most instances as the label to construct a child node, and form the tree T from the node and its children; return T.
    6. For the i-th child node, with Di as the training set and A - {Ag} as the feature set, recursively apply steps 1-5 to obtain the subtree Ti; return Ti.
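The information gain used in step 3 can be computed as follows. This is a minimal sketch with illustrative function names, not code from the original post.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Empirical entropy H(D) of a label sequence, in bits."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(feature_values, labels):
    """g(D, A) = H(D) - H(D|A), the ID3 criterion. C4.5 instead uses the
    gain ratio: g(D, A) divided by the entropy of the feature itself."""
    n = len(labels)
    h_cond = 0.0
    for v in set(feature_values):            # partition D by the feature's values
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        h_cond += len(subset) / n * entropy(subset)
    return entropy(labels) - h_cond
```

A feature that perfectly separates the classes has gain equal to H(D); a feature independent of the labels has gain 0.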
Generation of CART decision tree
Here is a brief introduction to the differences between CART and ID3 and C4.5.

    The CART tree is a binary tree, while ID3 and C4.5 trees can have multi-way splits. When generating subtrees, CART selects one feature and one value as the split point. The criterion for generating the two subtrees is the Gini index: the feature and split point with the minimum Gini index are selected.

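The Gini index criterion can be sketched like this (illustrative names, not from the original post); CART searches all (feature, value) pairs and keeps the one giving the smallest value.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2 (0 for a pure node)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def gini_index(feature, labels, threshold):
    """Weighted Gini index of the binary split `feature <= threshold`.
    CART chooses the feature and threshold minimizing this value."""
    feature, labels = np.asarray(feature), np.asarray(labels)
    left, right = labels[feature <= threshold], labels[feature > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)
```

A split that perfectly separates the classes scores 0; an uninformative split leaves the impurity unchanged.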
Pruning of a decision tree
Decision trees are pruned mainly to prevent overfitting; the detailed process is not described here.
The main idea is to trace back upward from the leaf nodes, tentatively pruning each node and comparing the loss function of the decision tree before and after pruning. A globally optimal pruning scheme can then be obtained via dynamic programming (tree DP; ACMers will be familiar with it).


Random Forests
Random forest is an important ensemble learning method based on Bagging, and can be used for classification, regression and other problems.
Random forests have many advantages:
    - Very high accuracy.
    - The introduced randomness makes random forests resistant to overfitting.
    - The introduced randomness also gives random forests good noise resistance.
    - Can handle high-dimensional data without doing feature selection.
    - Can handle both discrete and continuous data; the data set does not need to be standardized.
    - Fast to train, and variable importance can be obtained.
    - Easy to parallelize.
Disadvantages of random forest:
    - When the number of decision trees in the forest is large, training requires substantial space and time.
    - The model is hard to interpret: it is a black-box model.
Similar to the Bagging process described above, the construction of a random forest is roughly as follows:

    1. From the original training set, use the Bootstrapping method to randomly select m samples with replacement; repeat the sampling n_tree times, generating n_tree training sets.
    2. For the n_tree training sets, train n_tree decision tree models respectively.
    3. For a single decision tree model, assuming the number of training features is n, at each split select the best feature to split on according to the information gain / information gain ratio / Gini index.
    4. Each tree keeps splitting like this until all training samples in a node belong to the same class. The decision trees are not pruned during splitting.
    5. The random forest consists of the multiple decision trees so generated. For classification problems, the final result is decided by a vote of the tree classifiers; for regression problems, the mean of the trees' predictions determines the final result.

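Putting the pieces together, a toy random forest might look like the sketch below. For brevity the base learner is a depth-1 tree (a stump) and the random feature subset is drawn once per tree rather than at every split, as a full implementation would do; all class and parameter names here are illustrative.

```python
import numpy as np
from collections import Counter

class Stump:
    """Depth-1 decision tree: split one feature at one threshold by Gini index."""
    @staticmethod
    def _gini(y):
        _, c = np.unique(y, return_counts=True)
        p = c / c.sum()
        return 1.0 - np.sum(p ** 2)

    def fit(self, X, y):
        best = (np.inf, 0, 0.0)
        for j in range(X.shape[1]):                      # try every feature...
            for t in np.unique(X[:, j]):                 # ...and every observed value
                left, right = y[X[:, j] <= t], y[X[:, j] > t]
                score = sum(len(s) / len(y) * self._gini(s)
                            for s in (left, right) if len(s))
                if score < best[0]:
                    best = (score, j, t)
        _, self.j, self.t = best
        self.left = Counter(y[X[:, self.j] <= self.t]).most_common(1)[0][0]
        right_y = y[X[:, self.j] > self.t]
        self.right = Counter(right_y).most_common(1)[0][0] if len(right_y) else self.left

    def predict(self, X):
        return np.where(X[:, self.j] <= self.t, self.left, self.right)

class TinyForest:
    """Bootstrap n_tree stumps on random feature subsets and vote."""
    def __init__(self, n_tree=15, seed=0):
        self.n_tree, self.rng = n_tree, np.random.default_rng(seed)

    def fit(self, X, y):
        n, d = X.shape
        m = max(1, int(np.sqrt(d)))   # sqrt(d) features per tree, a common default
        self.trees = []
        for _ in range(self.n_tree):
            rows = self.rng.integers(0, n, size=n)            # bootstrap sample
            cols = self.rng.choice(d, size=m, replace=False)  # random feature subset
            s = Stump()
            s.fit(X[rows][:, cols], y[rows])
            self.trees.append((s, cols))
        return self

    def predict(self, X):
        votes = np.array([s.predict(X[:, cols]) for s, cols in self.trees])
        return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])
```

The two sources of randomness, bootstrap rows and random feature columns, are exactly what decorrelates the trees and gives the forest its resistance to overfitting and noise.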