Linear regression

This article covers the basics of regression analysis: what regression models exist, how to choose one, and why a model is needed at all. It also shows which measures are used to judge the quality of a model.

Regression problem

In the study of any real process, whether it is cooking pasta or analyzing investments, one general principle holds: the outcome depends on some set of parameters. The taste of pasta depends on the temperature of the stove, the amount of water, the amount of salt, the quality of the pasta and so on. Mathematically this is written as follows:

Taste = f(temperature, volume of water, salt, ...)

So, let's cook a portion of pasta. We have a set of random variables: the temperature of the stove, the volume of water and the amount of salt. Let's set a goal: find out how the amount of water affects the taste of the pasta.

Problem statement

How can we determine the effect of water volume on the taste of pasta? We need to run a series of experiments in which each batch of pasta is cooked with a different volume of water while the other conditions (temperature and amount of salt) stay fixed. We set the temperature and the amount of salt as follows:

Temperature	t=500°C
Amount of salt	15 g
Table 1. Fixed values for the experiment

Let's run the experiments for different volumes of water, from 500 ml to 2200 ml. Each time we taste the pasta and write down the result:

#	Water volume	Rating
1	500 ml	2
2	600 ml	3
3	700 ml	4
4	800 ml	6
5	900 ml	6
6	1000 ml	8
7	1100 ml	11
8	1200 ml	12
9	1300 ml	14
10	1400 ml	17
11	1500 ml	21
12	1600 ml	24
13	1700 ml	33
14	1800 ml	36
15	1900 ml	48
16	2000 ml	52
17	2100 ml	71
18	2200 ml	83
Table 2. Evaluation of the taste of pasta depending on the volume of water

Detection of dependence

So, we evaluate the taste of pasta depending on the volume of water; mathematically, we study the function Taste = f(Volume). Regression analysis as a whole consists of identifying the function f in this dependence.

In regression analysis, functions (models) are divided into two types: linear and nonlinear.

Linear model
y = a + bx
Nonlinear model
y = ab^x + c

To build a simple regression model (function), you need the courage to make an assumption, for example:

— This function is similar to a linear one!

Once you have chosen a regression model, you select its coefficients; for example, in the linear model y = a + bx you need to select the coefficients a and b. The task is relatively simple if we draw the line through the first and last observations: b is the slope, the difference between the last and first ratings divided by the difference between the last and first volumes, (83 - 2) / (2200 - 500) ≈ 0.048, and a is chosen so that the model reproduces the first value, a = 2 - 0.048 · 500 = -22. For our example we get:

a = -22
b = 0.048
Taste = -22 + 0.048x
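
As a minimal sketch of this two-point estimate, the line through the first and last rows of Table 2 can be computed in Python. The helper name two_point_line is hypothetical and not part of the article. Unrounded, the result is roughly a ≈ -21.8 and b ≈ 0.0476, which the article rounds to -22 and 0.048.

```python
# Water volumes (ml) and taste ratings copied from Table 2.
volumes = list(range(500, 2300, 100))
ratings = [2, 3, 4, 6, 6, 8, 11, 12, 14, 17, 21, 24, 33, 36, 48, 52, 71, 83]


def two_point_line(xs, ys):
    """Return (a, b) of the line y = a + b*x through the first and last points."""
    b = (ys[-1] - ys[0]) / (xs[-1] - xs[0])  # slope: rise over run
    a = ys[0] - b * xs[0]                    # intercept: line passes through the first point
    return a, b


a, b = two_point_line(volumes, ratings)
print(f"a = {a:.2f}, b = {b:.4f}")
```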

Let's tabulate the values of our model:

500 ml	600 ml	700 ml	800 ml	900 ml	1000 ml	1100 ml	1200 ml	1300 ml
2	6.8	11.6	16.4	21.2	26	30.8	35.6	40.4
1400 ml	1500 ml	1600 ml	1700 ml	1800 ml	1900 ml	2000 ml	2100 ml	2200 ml
45.2	50	54.8	59.6	64.4	69.2	74	78.8	83.6
Table 3. Tabulated values of the regression model
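
The tabulated values can be reproduced with a few lines of Python. This is only a sketch; the function name linear_model is an illustrative assumption.

```python
def linear_model(x, a=-22.0, b=0.048):
    """First (hand-picked) model from the article: Taste = a + b * volume."""
    return a + b * x


# Same volumes as in Table 2: 500 ml to 2200 ml in steps of 100 ml.
volumes = range(500, 2300, 100)

# Pairs (volume, modeled taste), rounded to one decimal as in Table 3.
table = [(x, round(linear_model(x), 1)) for x in volumes]

for volume, taste in table:
    print(f"{volume} ml\t{taste}")
```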

Here's how it looks on the graph:

Graph 1. Linear regression model and initial data

Getting the result

With some stretch the line resembles the data, but a mathematical conclusion requires measuring the spread between the model values and the real values. This spread is described by the sum of squared deviations and the standard error:

RSS (sum of squared deviations) = (2 - 2)² + (6.8 - 3)² + ... + (83.6 - 83)² = 7475
√RSS (root of the sum of squared deviations) = 86.46

S (standard error, root of the mean squared deviation) = √(RSS / n) = √(7475 / 18) = 20.38
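
To check these figures, the sketch below recomputes the spread measures for the first model from the data in Table 2. The variable names are illustrative, and the divisor n = 18 (the number of observations) is an assumption inferred from the reported value 20.38.

```python
import math

# Observations from Table 2.
volumes = list(range(500, 2300, 100))
ratings = [2, 3, 4, 6, 6, 8, 11, 12, 14, 17, 21, 24, 33, 36, 48, 52, 71, 83]


def model(x):
    """First model from the article: Taste = -22 + 0.048 * volume."""
    return -22.0 + 0.048 * x


# Deviation of the model from each observation.
residuals = [model(x) - y for x, y in zip(volumes, ratings)]

rss = sum(r * r for r in residuals)   # sum of squared deviations, about 7475
root_rss = math.sqrt(rss)             # about 86.46
mse = rss / len(residuals)            # mean squared deviation
rmse = math.sqrt(mse)                 # about 20.38

print(f"RSS = {rss:.1f}, sqrt(RSS) = {root_rss:.2f}, RMSE = {rmse:.2f}")
```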

What to do with this regression model? The regression model allows you to predict what will happen, for example, if we take 2300 ml, 2400 ml, etc. without conducting the experiment itself:

Taste_{2300 ml} = -22 + 0.048· 2300 = 88.4
Taste_{2400 ml} = -22 + 0.048· 2400 = 93.2

And, of course, we can find out how much water is needed for perfect pasta (a rating of 100) by solving 100 = -22 + 0.048·x for x:

Water_{perfect pasta} = (100 + 22) / 0.048 = 2542 ml
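
Both uses of the model, predicting a rating and inverting the model to find a volume, can be sketched as follows. The function names are hypothetical; the inversion simply solves Taste = a + b·x for x, giving x = (Taste - a) / b.

```python
def predict_taste(volume_ml, a=-22.0, b=0.048):
    """Predicted rating for a given water volume, Taste = a + b * volume."""
    return a + b * volume_ml


def volume_for_taste(taste, a=-22.0, b=0.048):
    """Invert the linear model: the volume at which it predicts `taste`."""
    return (taste - a) / b


print(round(predict_taste(2300), 1))   # 88.4
print(round(predict_taste(2400), 1))   # 93.2
print(round(volume_for_taste(100)))    # 2542
```

Note that these are extrapolations beyond the measured range of 500 ml to 2200 ml, so they are only as trustworthy as the assumption of linearity.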

Minimizing the error

So, we have our model y = a + bx and the real values of the function. The difference between the real value and the model is the error we make in each experiment. This lets us build an error function, and once we have a function, we can look for its minimum. That is what we will do: find the minimum of the error function.

The error is the difference between the real value and the modeled one. Since this difference can be either positive or negative, we need a measure that ignores the sign. The absolute value (modulus) would work, and it can be obtained by squaring the error and then taking the root, but it is simpler to use the squared error itself. So our error on each known result is:

Y_o - value from observation, Y_m - value from model
e = (Y_o - Y_m)² = (Y_o - a - bx)²
Total error
S = Σe = Σ(Y_o - a - bx)²

The function S is the error function to be minimized; it depends on the parameters a and b. To find its minimum we use a simple method: we take the derivatives with respect to a and b and set them to zero (more complex methods of searching for a minimum are omitted here):

Derivatives of the error function with respect to the parameters a and b (writing y for Y_o):
dS/da = Σ2(a+bx-y)
dS/db = Σ2(a+bx-y)x
Minimum condition of the function:
Σ2(a+bx-y) = 0
Σ2(a+bx-y)x = 0
Simplify, reduce by 2 and expand the brackets (n is the number of observations):
na + bΣx = Σy
aΣx + bΣx² = Σxy
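
As a sanity check on the derivatives above, the sketch below compares them with central finite differences of S at the first model's parameters (a = -22, b = 0.048). All function names are illustrative assumptions, and the finite-difference check is an addition for verification only, not part of the article's method.

```python
# Observations from Table 2: water volume (ml) and taste rating.
xs = list(range(500, 2300, 100))
ys = [2, 3, 4, 6, 6, 8, 11, 12, 14, 17, 21, 24, 33, 36, 48, 52, 71, 83]


def total_error(a, b):
    """S(a, b): sum of (y - a - b*x)^2 over all observations."""
    return sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))


def gradient(a, b):
    """Analytic partial derivatives (dS/da, dS/db)."""
    dS_da = sum(2 * (a + b * x - y) for x, y in zip(xs, ys))
    dS_db = sum(2 * (a + b * x - y) * x for x, y in zip(xs, ys))
    return dS_da, dS_db


def numeric_gradient(a, b, h=1e-4):
    """Central finite differences of S, used only to check the formulas."""
    dS_da = (total_error(a + h, b) - total_error(a - h, b)) / (2 * h)
    dS_db = (total_error(a, b + h) - total_error(a, b - h)) / (2 * h)
    return dS_da, dS_db


analytic = gradient(-22.0, 0.048)
numeric = numeric_gradient(-22.0, 0.048)
print(analytic, numeric)
```

Both derivatives are far from zero at this point, which already suggests that the hand-picked coefficients are not the minimum of S.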

Find a solution:

Σx = 24 300
Σx² = 37 650 000
Σy = 451
Σxy = 810 000

18·a + 24300·b = 451
24300·a + 37650000·b = 810000

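Eliminating b (multiply the first equation by 37650000 / 24300 ≈ 1549.38 and subtract it from the second):
-3589·a = 111228 ∴ a ≈ -31
b ≈ 0.042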
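
The same solution can be reproduced in a short Python sketch that computes the sums from Table 2 and solves the two equations by Cramer's rule. The choice of Cramer's rule and the variable names are assumptions made for illustration; any method of solving a 2×2 linear system gives the same result.

```python
# Observations from Table 2: water volume (ml) and taste rating.
xs = list(range(500, 2300, 100))
ys = [2, 3, 4, 6, 6, 8, 11, 12, 14, 17, 21, 24, 33, 36, 48, 52, 71, 83]

n = len(xs)                                   # 18
sum_x = sum(xs)                               # 24 300
sum_x2 = sum(x * x for x in xs)               # 37 650 000
sum_y = sum(ys)                               # 451
sum_xy = sum(x * y for x, y in zip(xs, ys))   # 810 000

# Solve the system
#   n*a     + sum_x*b  = sum_y
#   sum_x*a + sum_x2*b = sum_xy
# by Cramer's rule.
det = n * sum_x2 - sum_x ** 2
a = (sum_y * sum_x2 - sum_x * sum_xy) / det
b = (n * sum_xy - sum_x * sum_y) / det

print(f"a = {a:.2f}, b = {b:.4f}")
```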

Let's try our new model in action:

Graph 3. Linear regression model adjusted by the least squares method, y = -31 + 0.042·x

RSS (sum of squared deviations) = (-10 - 2)² + (-5.8 - 3)² + ... + (61.4 - 83)² = 1612.4
√RSS (root of the sum of squared deviations) = 40.15

S (standard error, root of the mean squared deviation) = √(RSS / n) = √(1612.4 / 18) = 9.46

Taste_{2300 ml} = -31 + 0.042· 2300 = 65.6
Taste_{2400 ml} = -31 + 0.042· 2400 = 69.8

As you may have noticed, the first model's predictions beyond the measured range look closer to the truth than those of the adjusted model, even though the adjusted model fits the measured data better (RSS of 1612.4 versus 7475): at 2300 ml the adjusted model predicts 65.6, which is below the 83 already observed at 2200 ml. Why? Because the model was chosen incorrectly: the graph of the data looks more like an exponential, and even from knowledge of the process it is clear that a linear dependence does not belong here. But this was just an example of a linear regression model; read about more complex models and how to choose a model in the following articles.
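
The comparison between the two models can be checked numerically. The sketch below uses illustrative names and the rounded coefficients from the article. It computes the sum of squared deviations of each model on the measured data and then the prediction of each at 2300 ml, just past the last measurement.

```python
# Observations from Table 2: water volume (ml) and taste rating.
xs = list(range(500, 2300, 100))
ys = [2, 3, 4, 6, 6, 8, 11, 12, 14, 17, 21, 24, 33, 36, 48, 52, 71, 83]

first_model = (-22.0, 0.048)    # hand-picked line through the end points
fitted_model = (-31.0, 0.042)   # least squares solution, rounded as in the article


def predict(model, x):
    """Value of the linear model (a, b) at volume x."""
    a, b = model
    return a + b * x


def rss(model):
    """Sum of squared deviations of the model from the observations."""
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys))


rss_first = rss(first_model)      # about 7475
rss_fitted = rss(fitted_model)    # about 1612.4

pred_first = predict(first_model, 2300)     # about 88.4
pred_fitted = predict(fitted_model, 2300)   # about 65.6

print(rss_first, rss_fitted, pred_first, pred_fitted, ys[-1])
```

The fitted line has the smaller error on the measured data, yet its prediction at 2300 ml falls below the last measured rating, which is what one would expect when a straight line is fitted to data that curves upward.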

Author: Zakhar Telyatnikov
Last edit time: 23.10.2023

07.06.2017
