ANOVA
ANOVA (ANalysis Of VAriance) is a statistical method for testing whether a factor that splits observations into groups influences the measured value, by comparing the variation between the groups with the variation inside them. The analysis of variance was introduced by Ronald Fisher, an English statistician who made a huge contribution to the development of science.
Example
Suppose you want to conduct an empirical study of gasoline quality. You fill up the tank at one gas station and drive n kilometers, repeat the experiment, say, five times, and then run the same experiment at a different gas station. You now have two sets of data: gas station A and gas station B. The figures are certainly scattered, but there may still be some dependence. To determine whether the choice of gas station affects gasoline consumption (or whether the data are unrelated), you use analysis of variance.
The analysis of variance compares two sources of variation: the spread within each group and the spread between the groups. In the example above, you will be able to determine how much the choice of gas station affects gasoline consumption. This is the essence of the analysis of variance: to find out whether the selected factor is significant for the selected observations.
In a sense, the analysis of variance is similar to regression and correlation analyses, because it allows you to determine the influence of variables on each other.
Analysis
In theory, a simple model is built to analyze the variance, similar to the one studied in time series analysis.
Model
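The model of the analysis of variance includes the overall mean, the effect of the experiment and a random error:
$y_{ij} = \mu + \tau_i + \varepsilon_{ij}$
$y_{ij}$ - the $j$-th measurement in the $i$-th experiment, $\mu$ - overall mean, $\tau_i$ - effect of the $i$-th experiment, $\varepsilon_{ij}$ - random error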
Single-factor
One-factor analysis of variance considers the influence of a single factor. Several series of experiments are conducted, each at a different level of the factor, and the analysis checks whether the factor has changed the result. As initial data, consider the results of four series of experiments (E1 to E4) with five measurements each:
N | E1 | E2 | E3 | E4 |
---|---|---|---|---|
1 | 60 | 44 | 89 | 32 |
2 | 40 | 60 | 76 | 39 |
3 | 40 | 30 | 133 | 33 |
4 | 48 | 53 | 76 | 35 |
5 | 32 | 60 | 125 | 43 |
μi | 44 | 49.4 | 99.8 | 36.4 |
μ = (44 + 49.4 + 99.8 + 36.4) / 4 = 57.4
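As a minimal sketch of how the model terms can be estimated from this table, the snippet below computes the group means, the grand mean, the estimated effects (group mean minus grand mean) and the residuals. The variable names are illustrative choices, not part of the original text.

```python
# Data from the table above: four series (E1..E4), five measurements each.
groups = {
    "E1": [60, 40, 40, 48, 32],
    "E2": [44, 60, 30, 53, 60],
    "E3": [89, 76, 133, 76, 125],
    "E4": [32, 39, 33, 35, 43],
}


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Group means (the mu_i row of the table).
group_means = {name: mean(ys) for name, ys in groups.items()}

# With equal group sizes the grand mean equals the mean of the group means.
grand_mean = mean(list(group_means.values()))

# Estimated effect of each series: tau_i = mu_i - mu.
effects = {name: m - grand_mean for name, m in group_means.items()}

# Residuals: epsilon_ij = y_ij - mu - tau_i = y_ij - mu_i.
residuals = {
    name: [y - group_means[name] for y in ys] for name, ys in groups.items()
}

for name in groups:
    print(f"{name}: mean = {group_means[name]:.1f}, effect = {effects[name]:+.1f}")
print(f"grand mean = {grand_mean:.1f}")
```

By construction the estimated effects sum to zero, and so do the residuals inside each group.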
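The sum of squares within groups ($SS_w$), where $a = 4$ is the number of groups and $n = 5$ is the number of measurements per group:
$SS_w = \sum_i \sum_j (y_{ij} - \mu_i)^2 = 4161.2$
The sum of squares between groups ($SS_b$); each squared deviation of a group mean is weighted by the group size $n$:
$SS_b = n \sum_i (\mu_i - \mu)^2 = 5 \cdot 2482.32 = 12411.6$
Dividing by the degrees of freedom gives the mean squares:
$MS_w = SS_w / (a(n-1)) = 4161.2 / 16 = 260.08$
$MS_b = SS_b / (a-1) = 12411.6 / 3 = 4137.2$
The value of the F statistic:
$F_0 = MS_b / MS_w = 15.91$
Fisher's test: if the value of $F_0$ turns out to be greater than the critical value $F_{\alpha, a-1, a(n-1)}$, then the factor has an impact.
For $\alpha = 0.05$, $a = 4$ and $N = an = 20$: $F_{0.05, 3, 16} = 3.24$
Since $F_0 = 15.91 > 3.24$, we conclude that the introduced factor did have an effect on the results of the experiment (the mean of E3, 99.8, is far from the other group means).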
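The single-factor computation can be sketched in a few lines of plain Python. This is an illustrative sketch using the standard definitions, in which the between-group sum of squares is weighted by the group size and the degrees of freedom are a - 1 (between) and N - a (within); the function name `one_way_anova` is a hypothetical helper, not something from the original text.

```python
def one_way_anova(groups):
    """Return (SSw, SSb, MSw, MSb, F) for a list of groups of measurements."""
    a = len(groups)                      # number of groups
    total_n = sum(len(g) for g in groups)  # total number of measurements N
    grand_mean = sum(sum(g) for g in groups) / total_n
    means = [sum(g) / len(g) for g in groups]

    # Within-group sum of squares: deviations from each group's own mean.
    ss_w = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    # Between-group sum of squares: deviations of group means from the
    # grand mean, weighted by the group size.
    ss_b = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))

    ms_w = ss_w / (total_n - a)          # a(n-1) when all groups have size n
    ms_b = ss_b / (a - 1)
    return ss_w, ss_b, ms_w, ms_b, ms_b / ms_w


# Columns E1..E4 of the table above.
table = [
    [60, 40, 40, 48, 32],
    [44, 60, 30, 53, 60],
    [89, 76, 133, 76, 125],
    [32, 39, 33, 35, 43],
]
ss_w, ss_b, ms_w, ms_b, f0 = one_way_anova(table)
print(f"SSw = {ss_w:.1f}, SSb = {ss_b:.1f}")
print(f"MSw = {ms_w:.2f}, MSb = {ms_b:.2f}, F0 = {f0:.2f}")
```

The resulting F0 is then compared with the critical value of the F distribution for the chosen significance level and degrees of freedom (a - 1, N - a), taken from a table or a statistics package.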
Two-factor
In two-factor analysis, three hypotheses are put forward for verification:
- The interaction of factors A and B does not affect the result
- Factor A does not affect the result
- Factor B does not affect the result
To carry out a two-factor analysis, it is necessary to group the results: several measurements are taken for every combination of the levels of the two factors, i.e.:
 | A1 | A2 |
---|---|---|
B1 | $x_{111}, \dots, x_{11n}$ | $x_{211}, \dots, x_{21n}$ |
B2 | $x_{121}, \dots, x_{12n}$ | $x_{221}, \dots, x_{22n}$ |
Here $x_{ijl}$ is the $l$-th measurement at level $i$ of factor A and level $j$ of factor B. Next, the total (or, equivalently, the average) is calculated for each level of each factor, i.e. for A1, for B1, etc., and then the grand total over all results. Let $k = 2$ be the number of levels of factor A and $m = 2$ the number of levels of factor B; $n$ is the number of measurements in each cell and $N = nkm$ is the total number of measurements.
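The grand total, where the sums run over the levels $i = 1, \dots, k$ of factor A, the levels $j = 1, \dots, m$ of factor B and the measurements $l = 1, \dots, n$ in each cell ($N = nkm$ measurements in all):
$T = \sum_i \sum_j \sum_l x_{ijl}$
The sum of elements at level $i$ of factor A:
$T_{A_i} = \sum_j \sum_l x_{ijl}$
The sum of elements at level $j$ of factor B:
$T_{B_j} = \sum_i \sum_l x_{ijl}$
The sum of elements in the cell with level $i$ of factor A and level $j$ of factor B:
$T_{A_iB_j} = \sum_l x_{ijl}$
The sums of squares:
$SS_T = \sum_i \sum_j \sum_l x_{ijl}^2 - T^2/N$
$SS_A = \sum_i T_{A_i}^2/(nm) - T^2/N$
$SS_B = \sum_j T_{B_j}^2/(nk) - T^2/N$
$SS_{AB} = \sum_i \sum_j T_{A_iB_j}^2/n - T^2/N - SS_A - SS_B$
$SS_E = \sum_i \sum_j \sum_l x_{ijl}^2 - \sum_i \sum_j T_{A_iB_j}^2/n$
$SS_T = SS_A + SS_B + SS_{AB} + SS_E$
The mean squares:
$MS_E = SS_E/((n-1)mk)$
$MS_A = SS_A/(k-1)$
$MS_B = SS_B/(m-1)$
$MS_{AB} = SS_{AB}/((m-1)(k-1))$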
Test "Factor A does not affect the result", $\nu_1 = k-1$:
$F_A = MS_A/MS_E$
Test "Factor B does not affect the result", $\nu_1 = m-1$:
$F_B = MS_B/MS_E$
Test "The interaction of factors A and B does not affect the result", $\nu_1 = (k-1)(m-1)$:
$F_{AB} = MS_{AB}/MS_E$
For each F, if $F > F_{\alpha,\nu_1,\nu_2}$, then the hypothesis is rejected. Here $\nu_2 = N - mk$.
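The total-based formulas of this section can be sketched as a small Python function. This is an illustrative sketch for a balanced design (the same number of measurements in every cell); the function name `two_way_anova`, the dictionary keys and the tiny 2 x 2 data set are hypothetical and chosen only so that the arithmetic is easy to follow by hand.

```python
def two_way_anova(cells):
    """cells[i][j] is the list of n measurements at level i of A, level j of B."""
    k = len(cells)           # number of levels of factor A
    m = len(cells[0])        # number of levels of factor B
    n = len(cells[0][0])     # measurements per cell (balanced design assumed)
    total_n = n * k * m      # N

    t_ab = [[sum(cell) for cell in row] for row in cells]        # cell totals
    t_a = [sum(row) for row in t_ab]                             # totals for A
    t_b = [sum(t_ab[i][j] for i in range(k)) for j in range(m)]  # totals for B
    t = sum(t_a)                                                 # grand total
    correction = t * t / total_n                                 # T^2 / N

    sum_sq = sum(x * x for row in cells for cell in row for x in cell)
    sum_cell_sq = sum(c * c for row in t_ab for c in row) / n

    ss_a = sum(v * v for v in t_a) / (n * m) - correction
    ss_b = sum(v * v for v in t_b) / (n * k) - correction
    ss_ab = sum_cell_sq - correction - ss_a - ss_b
    ss_e = sum_sq - sum_cell_sq

    ms_e = ss_e / ((n - 1) * m * k)
    ms_a = ss_a / (k - 1)
    ms_b = ss_b / (m - 1)
    ms_ab = ss_ab / ((k - 1) * (m - 1))
    return {
        "SS_A": ss_a, "SS_B": ss_b, "SS_AB": ss_ab, "SS_E": ss_e,
        "SS_T": sum_sq - correction,
        "F_A": ms_a / ms_e, "F_B": ms_b / ms_e, "F_AB": ms_ab / ms_e,
    }


# Hypothetical 2 x 2 layout with n = 2 measurements per cell.
example = [
    [[1, 3], [2, 4]],      # A1: cells B1, B2
    [[5, 7], [10, 12]],    # A2: cells B1, B2
]
result = two_way_anova(example)
for key, value in result.items():
    print(f"{key} = {value:g}")
```

Each F value is then compared with the critical value for its own degrees of freedom, as described above.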
Multifactorial
Multifactor analysis is similar to two-factor analysis: the same operations are performed, but the criteria are grouped and the influence of each of the factors is found iteratively.
With repeated measurements
The analysis of variance with repeated measurements means that several measurements of the random variable were taken for each criterion to obtain a more accurate result (since ANOVA uses the intra-group sum of squares).
Application
The analysis of variance is used in a wide variety of branches of science and production when it is necessary to study whether a criterion changes the average value of a result. The averages are not compared directly; instead, the spread of the results around the mean, i.e. the variance, is compared.
Solving problems
As an example, let's take a problem from metrology. The plant houses five machines that produce shafts. It is necessary to determine whether the choice of a machine tool or the training of an employee affects the result of production. For the analysis, ten measurements are made for each machine and each of two operators; the result is a table:
Operator | Machine | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | M1 | 30.212 | 30.102 | 30.173 | 30.171 | 30.289 | 30.193 | 30.178 | 30.11 | 30.231 | 30.1 |
1 | M2 | 30.462 | 30.324 | 30.39 | 30.413 | 30.389 | 30.433 | 30.37 | 30.321 | 30.595 | 30.439 |
1 | M3 | 30.372 | 30.386 | 30.309 | 30.387 | 30.306 | 30.356 | 30.372 | 30.322 | 30.301 | 30.361 |
1 | M4 | 30.249 | 30.205 | 30.274 | 30.205 | 30.288 | 30.207 | 30.232 | 30.23 | 30.3 | 30.294 |
1 | M5 | 30.286 | 30.118 | 30.151 | 29.902 | 30.041 | 29.917 | 29.981 | 30.074 | 29.908 | 29.848 |
2 | M1 | 30.3 | 30.296 | 30.153 | 30.279 | 30.299 | 30.262 | 30.216 | 30.287 | 30.268 | 30.14 |
2 | M2 | 30.422 | 30.433 | 30.47 | 30.585 | 30.437 | 30.416 | 30.592 | 30.319 | 30.489 | 30.314 |
2 | M3 | 30.326 | 30.335 | 30.323 | 30.366 | 30.345 | 30.335 | 30.317 | 30.335 | 30.357 | 30.316 |
2 | M4 | 30.302 | 30.337 | 30.36 | 30.329 | 30.309 | 30.31 | 30.368 | 30.344 | 30.38 | 30.317 |
2 | M5 | 30.32 | 30.327 | 30.366 | 30.332 | 30.363 | 30.355 | 30.307 | 30.385 | 30.314 | 30.375 |
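Let's use the method of two-factor analysis: factor A is the operator (k = 2 levels), factor B is the machine (m = 5 levels), with n = 10 measurements per cell and N = 100 measurements in total. To calculate the sums of squares, first compute the grand total and the totals for each level of each factor: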
T | TA1 | TA2 | TB1 | TB2 | TB3 | TB4 | TB5 |
---|---|---|---|---|---|---|---|
3029.209 | 1512.077 | 1517.132 | 604.259 | 608.613 | 606.827 | 605.84 | 603.67 |
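$SS_A = 0.256$
$SS_B = 0.794$
$SS_{AB} = 0.334$
$SS_E = 0.408$
$MS_A = SS_A/(k-1) = 0.256$
$MS_B = SS_B/(m-1) = 0.198$
$MS_{AB} = SS_{AB}/((k-1)(m-1)) = 0.084$
$MS_E = SS_E/(N - mk) = 0.408/90 = 0.00453$
The F values, computed from the unrounded sums of squares:
$F_A = MS_A/MS_E = 56.4$
$F_B = MS_B/MS_E = 43.8$
$F_{AB} = MS_{AB}/MS_E = 18.4$
Critical values for the Fisher test at $\alpha = 0.1$:
$F_{crit,A} = F_{0.1, 1, 90} = 2.77$
$F_{crit,B} = F_{0.1, 4, 90} = 2.01$
$F_{crit,AB} = F_{0.1, 4, 90} = 2.01$
Results table:
Hypothesis tested | Effect | Comparison |
---|---|---|
The impact of the employee's qualifications (operator, factor A) on the result | Yes | 56.4 > 2.77 |
The impact of the machine (factor B) on the result | Yes | 43.8 > 2.01 |
The mutual influence of the employee's qualifications and the choice of the machine on the result | Yes | 18.4 > 2.01 |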
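As a cross-check, the sums of squares for the shaft data can also be computed directly from deviations about the means, which is algebraically equivalent to the total-based formulas for a balanced design. The sketch below is illustrative only: the variable names are arbitrary, and it assumes factor A is the operator and factor B is the machine, as stated above.

```python
# data[operator][machine] holds the ten measurements from the table above.
data = [
    [  # Operator 1, machines M1..M5
        [30.212, 30.102, 30.173, 30.171, 30.289, 30.193, 30.178, 30.11, 30.231, 30.1],
        [30.462, 30.324, 30.39, 30.413, 30.389, 30.433, 30.37, 30.321, 30.595, 30.439],
        [30.372, 30.386, 30.309, 30.387, 30.306, 30.356, 30.372, 30.322, 30.301, 30.361],
        [30.249, 30.205, 30.274, 30.205, 30.288, 30.207, 30.232, 30.23, 30.3, 30.294],
        [30.286, 30.118, 30.151, 29.902, 30.041, 29.917, 29.981, 30.074, 29.908, 29.848],
    ],
    [  # Operator 2, machines M1..M5
        [30.3, 30.296, 30.153, 30.279, 30.299, 30.262, 30.216, 30.287, 30.268, 30.14],
        [30.422, 30.433, 30.47, 30.585, 30.437, 30.416, 30.592, 30.319, 30.489, 30.314],
        [30.326, 30.335, 30.323, 30.366, 30.345, 30.335, 30.317, 30.335, 30.357, 30.316],
        [30.302, 30.337, 30.36, 30.329, 30.309, 30.31, 30.368, 30.344, 30.38, 30.317],
        [30.32, 30.327, 30.366, 30.332, 30.363, 30.355, 30.307, 30.385, 30.314, 30.375],
    ],
]


def mean(values):
    return sum(values) / len(values)


k, m, n = len(data), len(data[0]), len(data[0][0])   # 2 operators, 5 machines, 10 repeats

grand = mean([x for op in data for cell in op for x in cell])
a_mean = [mean([x for cell in op for x in cell]) for op in data]       # per operator
b_mean = [mean([x for op in data for x in op[j]]) for j in range(m)]   # per machine
cell_mean = [[mean(cell) for cell in op] for op in data]

ss_a = n * m * sum((am - grand) ** 2 for am in a_mean)
ss_b = n * k * sum((bm - grand) ** 2 for bm in b_mean)
ss_ab = n * sum(
    (cell_mean[i][j] - a_mean[i] - b_mean[j] + grand) ** 2
    for i in range(k) for j in range(m)
)
ss_e = sum(
    (x - cell_mean[i][j]) ** 2
    for i in range(k) for j in range(m) for x in data[i][j]
)

ms_e = ss_e / (n * k * m - m * k)          # nu_2 = N - mk
f_a = ss_a / (k - 1) / ms_e
f_b = ss_b / (m - 1) / ms_e
f_ab = ss_ab / ((k - 1) * (m - 1)) / ms_e

print(f"SSA = {ss_a:.3f}, SSB = {ss_b:.3f}, SSAB = {ss_ab:.3f}, SSE = {ss_e:.3f}")
print(f"FA = {f_a:.2f}, FB = {f_b:.2f}, FAB = {f_ab:.2f}")
```

Working with deviations rather than raw totals also avoids subtracting two large, nearly equal numbers, which matters when the measurements are all close to 30.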
In Excel/OpenOffice Calc
To carry out the analysis of variance in a spreadsheet, you will need the following functions:
Function | Purpose |
---|---|
SUMPRODUCT | Sum of products, used to find the sums of squares |
FINV | Inverse of the F distribution, used to find the critical value for the Fisher test |
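For readers working outside a spreadsheet, rough Python counterparts of these two functions might look as follows. This is a sketch under stated assumptions: `sumproduct` and `finv` are hypothetical helper names chosen to mirror the spreadsheet functions, and `finv` assumes that SciPy is installed, using `scipy.stats.f.ppf` to invert the F distribution for a right-tail probability.

```python
def sumproduct(xs, ys):
    """Sum of element-wise products, like the spreadsheet SUMPRODUCT."""
    return sum(x * y for x, y in zip(xs, ys))


def finv(alpha, df1, df2):
    """Critical F value with right-tail probability alpha.

    Assumes SciPy is available; the import is kept local so that
    sumproduct can be used without it.
    """
    from scipy.stats import f
    return f.ppf(1.0 - alpha, df1, df2)


# Sum of squares of a column is the SUMPRODUCT of the column with itself.
column = [60, 40, 40, 48, 32]
print(sumproduct(column, column))
```

With SciPy installed, a call such as `finv(0.1, 1, 90)` gives the critical value used in the Fisher test for the corresponding degrees of freedom.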