k-tree
E-learning book

Correlation analysis

To understand the true nature of a phenomenon, a researcher studies how the result depends on the variables involved. In a controlled experiment we can identify a dependence by fixing some parameters and varying the others. When the experiment is hard to reproduce, or the parameters cannot be set at will, we have to rely on observational data, which may be related to the phenomenon only indirectly or not at all.

Dependence between random variables

Let's continue with an example: is there a relationship between the number of hours of sunshine per day and the number of hours of activity? Since we cannot control the number of hours of sunshine, we can only record the data and observe them.

Rest hours   Sunshine hours   Activity hours
10.1         4.51             13.9
9.6          0.1              14.4
9.8          3.23             14.2
9.8          3.15             14.2
10           6.36             14
9.5          3.17             14.5
9.5          0.44             14.5
9.9          3.15             14.1
10           2.43             14
10           7.23             14
10.2         0.54             13.8
10.4         3.29             13.6
10.2         7.13             13.8
10           0.17             14
9.9          7.4              14.1
10.2         3.05             13.8
10.1         7.91             13.9
10.1         5.93             13.9
10.2         7.1              13.8
9.7          5.52             14.3
Table 1. Hours of sunshine and rest hours

How can we tell whether these numbers are related? To make the picture clearer, we plot the data on an X-Y graph. Since we are interested in the relationship rather than the dependence, it does not matter which variable goes on which axis; by convention, though, the independent variable is plotted along the X-axis.

Graph 1. Correlation diagram of hours of sunshine and rest hours

The correlation diagram (scatter plot) lets us judge the relationship between two quantities "by eye". Now to the term itself: correlation is a statistical relationship between random variables in which a change in the values of one is accompanied by a systematic change in the values of the other. This does not mean that one quantity influences the other; only the interrelation is considered. To identify the dependence of one variable on another, regression analysis is used.

Correlation

Let's move on to the mathematical models that let us measure the correlation between quantities. There are three main correlation coefficients: Pearson, Spearman and Kendall. The choice of method depends on the source data, but in every case the coefficient takes values from -1 to +1. A value of zero means that there is no correlation, and a value of one in absolute terms means a perfect relationship between the two quantities. The sign gives the direction of the relationship: a negative coefficient means that an increase in X is associated with a decrease in Y, and a positive coefficient means that an increase in one quantity is associated with an increase in the other.
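To make the three coefficients concrete, here is a minimal pure-Python sketch. The function names and the sample data are assumptions made for illustration; the Kendall variant shown is tau-a, which does not correct for ties, and Spearman is computed as the Pearson coefficient of average ranks.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson coefficient: covariance divided by the product of standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman coefficient: Pearson coefficient applied to the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def kendall_tau_a(xs, ys):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs, no tie correction."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(xs, ys)), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical data: y grows monotonically with x, but not linearly.
xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]
print(pearson(xs, ys), spearman(xs, ys), kendall_tau_a(xs, ys))
```

On monotone but nonlinear data like this, the two rank-based coefficients reach one while the Pearson coefficient stays below it, which is one reason the choice of method depends on the data.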

Covariance

Covariance is a measure of the relationship between two quantities, based on their second-order mixed central moment. The covariance formula is:

\[ \operatorname{Cov}(X,Y) = E\big[(X - E[X])(Y - E[Y])\big] \]

where E denotes the expected value. Expanding the product gives:

\[ E\big[(X - E[X])(Y - E[Y])\big] = E\big[XY - X E[Y] - E[X] Y + E[X]E[Y]\big] = E[XY] - E[X]E[Y] - E[X]E[Y] + E[X]E[Y] = E[XY] - E[X]E[Y] \]

so that

\[ \operatorname{Cov}(X,Y) = E\big[(X - E[X])(Y - E[Y])\big] = E[XY] - E[X]E[Y] \]
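The equality of the two forms of the covariance can be checked numerically. The sketch below uses a small hypothetical sample in which every pair is treated as equally likely, so the expected value is simply the arithmetic mean.

```python
# Hypothetical sample; each (x, y) pair is treated as equally likely.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 9]


def mean(values):
    return sum(values) / len(values)


ex, ey = mean(xs), mean(ys)

# Definition: E[(X - E[X])(Y - E[Y])]
cov_definition = mean([(x - ex) * (y - ey) for x, y in zip(xs, ys)])

# Shortcut: E[XY] - E[X]E[Y]
cov_shortcut = mean([x * y for x, y in zip(xs, ys)]) - ex * ey

print(cov_definition, cov_shortcut)  # 2.75 2.75
```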

For discrete values, where the pair (x_i, y_i) occurs with probability p_i, the expected value becomes a sum:

\[ \operatorname{Cov}(X,Y) = \sum_{i=1}^{n} p_i \,(x_i - E[X])(y_i - E[Y]) \]
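The discrete formula translates almost directly into code. In this sketch the outcomes and their probabilities are hypothetical, and `discrete_cov` is an illustrative helper name, not a library function.

```python
# Hypothetical discrete distribution: pairs (x_i, y_i) with probabilities p_i.
outcomes = [(0, 0), (1, 1), (2, 4)]
probs = [0.25, 0.5, 0.25]  # must sum to 1


def discrete_cov(outcomes, probs):
    """Sum of p_i * (x_i - E[X]) * (y_i - E[Y]) over all outcomes."""
    ex = sum(p * x for (x, _), p in zip(outcomes, probs))
    ey = sum(p * y for (_, y), p in zip(outcomes, probs))
    return sum(p * (x - ex) * (y - ey) for (x, y), p in zip(outcomes, probs))


cov = discrete_cov(outcomes, probs)
print(cov)  # 1.0
```

Here E[X] = 1 and E[Y] = 1.5, so the three terms are 0.375, 0 and 0.625.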

The magnitude of the covariance depends on the scale of the source data, so on its own it says nothing about the strength of the dependence; the resulting value has to be compared with the original data. For example, suppose that for two data sets ranging over [23, 48] and [1000, 2322] the covariance is -300. If we simply divide the second data set by 100, the covariance becomes -3, although the relationship itself has not changed. To normalize this value, the linear correlation coefficient was introduced.
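The scale dependence is easy to reproduce. The data below are hypothetical values chosen inside the two ranges mentioned above (they are not the data behind the figures quoted in the text); dividing the second set by 100 divides the covariance by 100 as well.

```python
def covariance(xs, ys):
    """Population covariance of two equally long samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n


# Hypothetical samples within the ranges [23, 48] and [1000, 2322].
xs = [23, 30, 35, 41, 48]
ys = [2322, 1900, 1650, 1300, 1000]
ys_scaled = [y / 100 for y in ys]  # same data, different unit

cov_original = covariance(xs, ys)
cov_scaled = covariance(xs, ys_scaled)
print(cov_original, cov_scaled)
```

The second value is the first divided by 100, even though both pairs of samples describe exactly the same relationship.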

Graph 2. Cov = 3.3
Graph 3. Cov = -2.9

Linear correlation coefficient

The linear correlation coefficient or Pearson correlation coefficient is calculated by the formula:

\[ r_{xy} = \frac{\operatorname{Cov}(X,Y)}{\sigma_x \sigma_y} \]
Graph 4. Cov = 40.1, rxy= 0.51.
Expressed relationship
Graph 5. Cov = 27.2, rxy= 1.
Direct relationship
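As an illustration, the coefficient can be computed for the data in Table 1. This is a minimal sketch: the variable names are assumptions, and the second column of the table is read as hours of sunshine.

```python
from math import sqrt

# Columns of Table 1.
rest = [10.1, 9.6, 9.8, 9.8, 10, 9.5, 9.5, 9.9, 10, 10,
        10.2, 10.4, 10.2, 10, 9.9, 10.2, 10.1, 10.1, 10.2, 9.7]
sunshine = [4.51, 0.1, 3.23, 3.15, 6.36, 3.17, 0.44, 3.15, 2.43, 7.23,
            0.54, 3.29, 7.13, 0.17, 7.4, 3.05, 7.91, 5.93, 7.1, 5.52]
activity = [13.9, 14.4, 14.2, 14.2, 14, 14.5, 14.5, 14.1, 14, 14,
            13.8, 13.6, 13.8, 14, 14.1, 13.8, 13.9, 13.9, 13.8, 14.3]


def covariance(xs, ys):
    """Population covariance of two equally long samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n


def pearson(xs, ys):
    """r_xy = Cov(X, Y) / (sigma_x * sigma_y)."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


r_rest = pearson(sunshine, rest)
r_activity = pearson(sunshine, activity)
print(round(r_rest, 2), round(r_activity, 2))
```

In every row of the table the rest and activity hours add up to 24, so the two coefficients have the same magnitude and opposite signs.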

The Pearson correlation coefficient should be used only when the quantities follow a normal distribution, so the data must first be checked for normality (article).

The Pearson correlation coefficient shows the presence of a linear relationship between the values.
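A short sketch shows what "linear" means in practice. On the hypothetical data below y is completely determined by x, yet the relationship is a parabola, and the Pearson coefficient comes out as zero.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson coefficient from sums of centred products."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical data: y = x^2 on a range symmetric around zero.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson(xs, ys)
print(r)
```

A coefficient near zero therefore rules out only a linear relationship, not a relationship in general.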

In OpenOffice the coefficient is computed with the function =CORREL(data1,data2); in the Russian-language version of Excel the function is =КОРРЕЛ(data1,data2).

The difference between correlation and regression

Regression expresses the quantitative dependence between two quantities: how a change in one quantity affects the other. In correlation analysis we only check whether a linear relationship between two quantities exists, treating them symmetrically, with neither one depending on the other. Regression need not be linear, whereas the Pearson correlation coefficient captures only a linear relationship.
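The difference can be sketched with simple least-squares linear regression, used here only as one illustrative kind of regression. The data and names are hypothetical. The correlation coefficient is the same whichever variable comes first, while the regression slope depends on which variable is treated as dependent.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Population covariance of two equally long samples."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical measurements that lie close to the line y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

cov_xy = covariance(xs, ys)
var_x = covariance(xs, xs)
var_y = covariance(ys, ys)

# Correlation: symmetric in x and y, unit-free.
r = cov_xy / sqrt(var_x * var_y)

# Regression of y on x: y is approximately slope * x + intercept.
slope_y_on_x = cov_xy / var_x
intercept_y_on_x = mean(ys) - slope_y_on_x * mean(xs)

# Regression of x on y gives a different line.
slope_x_on_y = cov_xy / var_y

print(r, slope_y_on_x, intercept_y_on_x, slope_x_on_y)
```

The two slopes are linked through the coefficient: their product equals r squared, so they describe the same line only when the relationship is perfectly linear.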
