
ISLR Reading Notes (3): Classification

Welcome to visit my personal homepage; traffic is still very low and it does not yet show up in Baidu search… Thank you for the encouragement.
These are reading notes. Rather than translating the full text, I plan to share the key points of the book in my own words, and attach the usage of the related R functions at the end, as a summary of my recent study of machine learning. If I have misunderstood anything, please correct me.

Preface
ISLR, whose full title is An Introduction to Statistical Learning with Applications in R, is a more accessible companion to The Elements of Statistical Learning. There is not much formula derivation in ISLR; it mainly explains the commonly used methods in statistical learning and how to apply them in R. The book has no official answers to the exercises, but someone has compiled a set, which you can consult: ISLR answers.
Understanding Chapter 4
This chapter explains three methods of classification:
1. Logistic Regression
2. Linear Discriminant Analysis
3. Quadratic Discriminant Analysis
Together with KNN (which appears in the R lab below), these methods are analyzed and compared one by one.
1. Logistic Regression
Formula:

log( p(x) / (1 − p(x)) ) = β₀ + β₁x₁ + β₂x₂ + …

where p(x) is the probability of belonging to a given class and is the model's final output, and the β's are the parameters of the logistic regression, usually estimated by maximum likelihood. The fitted curve is generally S-shaped:
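As a small illustration (my own, not from the book), the log-odds formula above can be inverted with R's built-in logistic function plogis(); the coefficients below are made up purely to show the shape of the curve:

> # made-up coefficients, only to illustrate the S-shaped curve of p(x)
> beta0 = -3; beta1 = 1.5
> x = seq(-2, 6, by = 0.1)
> p = plogis(beta0 + beta1 * x)          # equivalent to exp(.)/(1 + exp(.))
> plot(x, p, type = "l", ylab = "p(x)")  # rises from near 0 to near 1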


Generally speaking, logistic regression is suited to two-class problems, while discriminant analysis is generally used when there are more than two classes.
2. Linear Discriminant Analysis
In essence, the discriminant methods add to the Bayesian framework the assumption that the predictors within each class follow a normal distribution. Linear discriminant analysis further assumes that the covariance matrix is the same for all classes.

In the figure, the ellipses in the left panel are contours of the normal distributions, and the lines where neighbouring ellipses meet form the classification boundaries; the solid lines in the right panel are the Bayes decision boundaries, i.e. the true boundaries. The linear discriminant boundaries come out quite close to them, so LDA's accuracy is still very good.
3. Quadratic Discriminant Analysis
The only difference between quadratic discriminant analysis and linear discriminant analysis is that QDA assumes each class has its own covariance matrix. This makes the decision boundary curved, and it increases the number of covariance parameters to estimate from p(p+1)/2 to Kp(p+1)/2, where K is the number of classes and p is the number of predictors. The effect of this extra flexibility is discussed in reading notes (1).
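As a quick sanity check of these counts (my own example, not from the book), take p = 4 predictors and K = 3 classes:

> p = 4; K = 3
> p * (p + 1) / 2        # covariance parameters estimated by LDA
[1] 10
> K * p * (p + 1) / 2    # covariance parameters estimated by QDA
[1] 30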

In the figure, the purple dashed line is the Bayes (true) boundary, the black dotted line is the LDA boundary, and the green solid line is the QDA boundary. When the true boundary is linear, LDA performs better; when it is nonlinear, the opposite holds.
4. Summary
When the actual decision boundary is linear, LDA performs better if the data are close to the normality assumption, and logistic regression performs better if they are not.
When the actual boundary is nonlinear, quadratic discriminant analysis fits better. For even higher-order or irregular boundaries, KNN performs well.
Applications in R
1. Import and prepare the data

> library(ISLR)
> dim(Caravan)
[1] 5822   86
> attach(Caravan)
> summary(Purchase)
  No  Yes
5474  348
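One thing worth noticing before modelling (my own aside): only about 6% of customers actually buy the insurance, so the classes are very unbalanced. This matters when judging the accuracies reported below.

> mean(Purchase == "Yes")   # roughly 0.06, i.e. about 6% buyers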

Since KNN will be used later and it is based on distances, the predictors are standardized. The standardization itself is a single statement; the lines after it show its effect (each standardized column has variance 1).

> standardized.X = scale(Caravan[,-86])
> var(Caravan[,1])
[1] 165.0378
> var(Caravan[,2])
[1] 0.1647078
> var(standardized.X[,1])
[1] 1
> var(standardized.X[,2])
[1] 1

Split into a training set and a test set

> test = 1:1000
> train.X = standardized.X[-test,]
> test.X = standardized.X[test,]
> train.Y = Purchase[-test]
> test.Y = Purchase[test]

2. Logistic Regression

> glm.fit = glm(Purchase~., data=Caravan, family = binomial, subset = -test)
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
> glm.probs = predict(glm.fit, Caravan[test, ], type="response")
> glm.pred = rep("No", 1000)
> glm.pred[glm.probs>.5]="Yes"
> table(glm.pred, test.Y)
        test.Y
glm.pred  No Yes
     No  934  59
     Yes   7   0
> mean(glm.pred == test.Y)
[1] 0.934
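With a 0.5 cutoff almost everything is predicted "No", which is why the 93.4% accuracy is not very informative on its own. A possible follow-up (the ISLR lab does something similar) is to lower the cutoff, e.g. to 0.25, and look at how many of the predicted buyers actually buy:

> glm.pred = rep("No", 1000)
> glm.pred[glm.probs > .25] = "Yes"   # lower the probability cutoff from 0.5 to 0.25
> table(glm.pred, test.Y)
> sum(glm.pred == "Yes" & test.Y == "Yes") / sum(glm.pred == "Yes")  # hit rate among predicted buyers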

3. Linear Discriminant Analysis

> library(MASS)
> lda.fit = lda(Purchase~.,data = Caravan, subset = -test)
> lda.pred = predict(lda.fit, Caravan[test,])
> lda.class = lda.pred$class
> table(lda.class, test.Y)
         test.Y
lda.class  No Yes
      No  933  55
      Yes   8   4
> mean(lda.class==test.Y)
[1] 0.937
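predict() for an lda fit also returns the posterior probabilities (in lda.pred$posterior), so the same cutoff trick can be applied here; a small sketch:

> lda.probs = lda.pred$posterior[, "Yes"]   # posterior probability of "Yes" for each test observation
> lda.class2 = rep("No", 1000)
> lda.class2[lda.probs > .25] = "Yes"       # again use a 0.25 cutoff instead of 0.5
> table(lda.class2, test.Y)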

4. Quadratic Discriminant Analysis

> qda.fit = qda(Purchase~.,data = Caravan, subset = -test)
Error in qda.default(x, grouping, ...) : rank deficiency in group Yes
> qda.fit = qda(Purchase~ABYSTAND+AINBOED,data = Caravan, subset = -test)

Fitting on all predictors directly throws an error… Using just two variables works. The dimension seems to be too high; I have not found a full solution yet. Otherwise the usage is the same as for LDA.
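My guess (not verified) is that the error comes from qda() estimating a separate covariance matrix for each class: the training set contains only a few hundred "Yes" observations for 85 predictors, so that matrix can easily be singular. A rough check of the rank within the "Yes" class:

> train.yes = standardized.X[-test, ][train.Y == "Yes", ]  # training rows of the "Yes" class
> dim(train.yes)
> qr(train.yes)$rank   # if this is smaller than ncol(train.yes), the class covariance matrix is singular

If that is indeed the cause, dropping redundant or near-constant predictors might help, but I have not tried it.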
5. KNN
The parameter k can be chosen freely; note the order of the arguments to the knn() function (training set, test set, training labels).

> library(class)
> knn.pred = knn(train.X, test.X, train.Y,k=1)
> mean(test.Y==knn.pred)
[1] 0.882
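Other values of k can be compared in exactly the same way (the ISLR lab also tries k = 3 and k = 5):

> knn.pred3 = knn(train.X, test.X, train.Y, k = 3)
> mean(test.Y == knn.pred3)
> knn.pred5 = knn(train.X, test.X, train.Y, k = 5)
> mean(test.Y == knn.pred5)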