
Statistical learning: ANOVA (analysis of variance) (1)

Statistical learning

Recently, while processing data, I found that my theoretical knowledge was weak. I therefore started this series of posts to record the theory I am catching up on, in the hope that it will help later readers.

ANOVA (analysis of variance)

Analysis of variance (ANOVA) was invented by R. A. Fisher and is used to test the significance of differences between the means of two or more samples. Because of the influence of various factors, the data obtained in a study fluctuate. The causes of this fluctuation fall into two categories: uncontrollable random factors, and the controllable factors applied in the study.

Definition

Analysis of variance (ANOVA), also known as the F-test, was invented by R. A. Fisher. It tests one statistical characteristic of two or more sample groups: the significance of the difference between their means.
The basic assumption of ANOVA is that the difference between the means of different sample groups has two basic sources:
(1) The experimental variable, i.e. differences caused by the factor that distinguishes the samples (for example, male vs. female), called the between-group difference. It is expressed as the sum of squared deviations of each group's mean from the grand mean of all observations pooled together. It is denoted SSB, with degrees of freedom DFB.
(2) Random error, such as differences caused by measurement error or by individual variation, called the within-group difference. It is expressed as the sum, over all groups, of the squared deviations of each observation from its group's mean. It is denoted SSW, with degrees of freedom DFW.
The total sum of squared deviations SST = SSB + SSW.
Dividing the within-group SSW and between-group SSB by their respective degrees of freedom (within-group DFW = N − M, between-group DFB = M − 1, where N is the total number of samples and M is the number of groups) gives the mean squares MSW and MSB. In one case, the experimental condition has no effect, i.e. all groups are samples from the same identically distributed population, and MSB / MSW ≈ 1. In the other case, the treatment does have an effect: the between-group mean square reflects both error and treatment differences, i.e. the samples come from different populations, so MSB >> MSW (much larger than 1).
The ratio MSB / MSW follows an F distribution. Comparing the F value with its critical value is the basis for inferring, at a given significance level, whether the samples all come from the same population.
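As a concrete sketch of these formulas, the code below (Python with NumPy/SciPy; the two data groups are made-up numbers) computes SSB, SSW, the mean squares, and the F value directly from the definitions, and converts F into a tail probability with the F distribution:

```python
import numpy as np
from scipy import stats

def one_way_anova(*groups):
    """Compute SSB, SSW, MSB, MSW, and the F statistic from the definitions."""
    pooled = np.concatenate(groups)
    grand_mean = pooled.mean()
    n, m = len(pooled), len(groups)                      # N samples, M groups
    ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
    dfb, dfw = m - 1, n - m                              # DFB = M - 1, DFW = N - M
    msb, msw = ssb / dfb, ssw / dfw                      # mean squares
    f = msb / msw
    p = stats.f.sf(f, dfb, dfw)                          # upper tail of the F distribution
    return f, p

a = np.array([5.1, 4.9, 5.3, 5.0])
b = np.array([6.2, 6.0, 6.4, 6.1])
f, p = one_way_anova(a, b)
```

The same F and p values are returned by `scipy.stats.f_oneway(a, b)`, which can serve as a cross-check.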

The basic idea of ANOVA is to study the contribution of variation from different sources to the total variation, and thereby determine the significance of the controllable variables' effect on the research results.

For example:

Here we use a simple example to illustrate the basic idea of ANOVA:
For example, the blood phosphorus values (mmol/L) of 11 patients with Keshan disease and 13 healthy people in a Keshan disease area are as follows:
patients: 0.84 1.05 1.20 1.39 1.53 1.67 1.80 1.87 2.07 2.11
healthy people: 0.54 0.64 0.64 0.75 0.76 0.81 1.16 1.20 1.34 1.35 1.48 1.56 1.87
Is the blood phosphorus value of Keshan disease patients different from that of healthy people?
The data above show that the blood phosphorus values of the 24 patients and healthy people differ from one another. If the variation around the overall mean is described by the sum of squared deviations (SS), the total variation has the following two sources:
within-group variation, that is, the blood phosphorus values within each group are unequal because of random error;
between-group variation, that is, because of the effect of Keshan disease, the mean blood phosphorus values of patients and healthy people differ.
And: SS total = SS between groups + SS within groups; likewise, for the degrees of freedom ν, ν total = ν between groups + ν within groups.
If the mean square (the sum of squared deviations divided by its degrees of freedom) is used in place of the sum of squared deviations, to eliminate the effect of unequal group sizes, then analysis of variance amounts to dividing the between-group mean square by the within-group mean square (the F value) and comparing the quotient with 1. If the F value is close to 1, there is no significant difference between the group means. If the F value is far greater than 1, the difference between the group means is statistically significant. In practice, the probability of obtaining an F value larger than the observed one at the given significance level can be looked up in the F distribution table for analysis of variance.
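Assuming SciPy is available, the blood phosphorus comparison above can be run directly with `scipy.stats.f_oneway`, using the values exactly as printed:

```python
from scipy import stats

# Blood phosphorus values (mmol/L) as listed above
patients = [0.84, 1.05, 1.20, 1.39, 1.53, 1.67, 1.80, 1.87, 2.07, 2.11]
healthy  = [0.54, 0.64, 0.64, 0.75, 0.76, 0.81, 1.16, 1.20, 1.34, 1.35,
            1.48, 1.56, 1.87]
f, p = stats.f_oneway(patients, healthy)  # one-way ANOVA with two groups
```

With only two groups, this F value equals the square of the equal-variance two-sample t statistic, so the ANOVA and the t-test give the same conclusion here.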

The difference between ANOVA, ttest and ttest2

1. The ttest test is generally used on data assumed to come from a single distribution: it tests whether the mean of that distribution is 0 (a statistic can also be constructed to test against other values). It focuses on the significance of the difference between the mean of the data's distribution and the value you supply, for example testing whether the heights of a group of people have a given mean.
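A one-sample test of this kind can be sketched with SciPy's `ttest_1samp` (the height data and the hypothesized mean of 170 cm below are made up for illustration):

```python
from scipy import stats

heights = [172.1, 168.4, 175.0, 170.2, 169.8, 173.5, 171.1, 170.6]  # cm, made-up
t, p = stats.ttest_1samp(heights, popmean=170.0)  # H0: the population mean is 170 cm
```

If p exceeds the chosen significance level, the null hypothesis that the population mean is 170 cm is not rejected.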

2. The ttest2 test is aimed at the means of two different samples; the null hypothesis is that the two means are equal. It does not care which distributions the two groups of data come from. For example, given the height and weight data of a group of people, the test's hypothesis is that the means of the two distributions are equal, i.e. it asks whether the mean height and the mean weight are equal.
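The two-sample version corresponds to SciPy's `ttest_ind`; the samples below are made up so that the null hypothesis of equal means is clearly rejected:

```python
from scipy import stats

# Two made-up samples with clearly different means
group_a = [5.1, 4.8, 5.3, 5.0, 4.9, 5.2]
group_b = [6.4, 6.1, 6.6, 6.3, 6.2, 6.5]
t, p = stats.ttest_ind(group_a, group_b)  # H0: the two population means are equal
```

A small p value here means the difference between the two sample means is statistically significant.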

3. One-way ANOVA
One-way ANOVA studies whether the different levels of a single control variable have a significant effect on the observed variable. Roughly speaking, it analyzes whether a change in a variable X has a significant effect on a variable Y; if some relationship exists between the variables, it is a test of the significance of the effect that a change in one variable has on another.
In general, ANOVA is applied to grouped comparisons.
From a computational point of view, independent-sample t-tests only require the mean and standard deviation within each group, whereas ANOVA also requires the between-group and within-group variation.
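With more than two levels of the control variable, the same SciPy call extends naturally; the three made-up dosage groups below illustrate a one-way ANOVA over three levels of a single factor:

```python
from scipy import stats

# Made-up response values at three levels of one control variable
low    = [12.1, 11.8, 12.4, 12.0]
medium = [13.0, 13.3, 12.9, 13.1]
high   = [14.2, 14.5, 14.0, 14.4]
f, p = stats.f_oneway(low, medium, high)  # H0: all three level means are equal
```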

4. Multivariate analysis of variance
Multivariate analysis of variance is the statistical analysis of how several factors jointly affect a variable. Analysis of covariance examines the effect of the remaining factors on the variable after the effect of some given factor is removed. For example, consider corn sales, the price of corn, and the price of popcorn (the example is not ideal, but the idea is that A has an effect on B, and B has an effect on C, while A does not necessarily affect C). The higher the price of popcorn, the larger the sales of corn appear, as if the two were proportional; but this relationship is clearly spurious, because both the price of popcorn and the sales of corn are related to the temperature. Analysis of covariance should be used to analyze such problems.
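The popcorn/corn intuition can be sketched numerically. In the simulated data below (all numbers and coefficients are invented), both popcorn price and corn sales are driven by temperature and not by each other; regressing the temperature out of each series, in the spirit of covariate adjustment, makes the apparent correlation between them vanish:

```python
import numpy as np

rng = np.random.default_rng(42)
temperature = rng.uniform(10, 35, size=200)                         # simulated daily temperature
popcorn_price = 2.0 + 0.05 * temperature + rng.normal(0, 0.1, 200)  # both series depend on
corn_sales    = 50.0 + 3.0 * temperature + rng.normal(0, 5.0, 200)  # temperature, not each other

raw_corr = np.corrcoef(popcorn_price, corn_sales)[0, 1]             # large: spurious correlation

def residualize(y, x):
    """Remove the linear effect of the covariate x from y."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

partial_corr = np.corrcoef(residualize(popcorn_price, temperature),
                           residualize(corn_sales, temperature))[0, 1]  # near zero
```

The raw correlation is large while the temperature-adjusted correlation is close to zero, which is exactly the situation where covariance analysis is needed.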


References:
the difference between the t test and ANOVA in SPSS
one-way ANOVA
ANOVA table