k-tree
E-learning book

Linear regression

In this article you will learn the basics of regression analysis: what regression models exist, how to choose one, and why such a model is needed at all. You will also learn which measures are used to assess the quality of a model.

Regression problem

In the study of any real process, whether it's cooking pasta or analyzing investments, there is one general principle: the result depends on some parameters. The taste of pasta depends on the temperature of the stove, the amount of water, the amount of salt, the quality of the pasta and so on. Mathematically this is written as follows:

Taste = f(temperature, volume of water, salt, ...)

So, let's cook a portion of pasta. You have a set of random variables: the temperature of the stove, the volume of water, the amount of salt. Let's set a goal: find out how the amount of water affects the taste of pasta.

Problem statement

How do we determine the effect of water volume on the taste of pasta? We need to conduct a series of experiments in which each portion of pasta is cooked with a different volume of water while the other conditions (temperature and amount of salt) stay fixed. Let's set the temperature and the amount of salt:

Temperature      t = 500°C
Amount of salt   15 g
Table 1. Fixed values for the experiment

Let's run our experiments for different volumes of water, from 500 ml to 2200 ml. Each time we will taste the pasta and write down the result:

#    Water volume   Rating
1    500 ml         2
2    600 ml         3
3    700 ml         4
4    800 ml         6
5    900 ml         6
6    1000 ml        8
7    1100 ml        11
8    1200 ml        12
9    1300 ml        14
10   1400 ml        17
11   1500 ml        21
12   1600 ml        24
13   1700 ml        33
14   1800 ml        36
15   1900 ml        48
16   2000 ml        52
17   2100 ml        71
18   2200 ml        83
Table 2. Evaluation of the taste of pasta depending on the volume of water

Detection of dependence

So, we evaluate the taste of pasta depending on the volume of water. Mathematically, we study the function Taste = f(Volume). Regression analysis as a whole is the process of identifying the function f in this dependence.

In regression analysis, functions (models) are divided into two types: linear and nonlinear.

Linear model
y = a + bx
Nonlinear model
y = a·b^x + c

To build a simple regression model (function), you need the courage to make an assumption, for example:

— This function is similar to a linear one!

Once you have chosen a regression model, you have to select its coefficients. For the linear model y = a + bx these are a and b. A quick first guess is to draw the line through the first and the last observations. The slope b is the difference between the last and the first ratings divided by the difference between the corresponding volumes, b = (83 - 2) / (2200 - 500) ≈ 0.048. The intercept a is then chosen so that the line passes through the first point, a = 2 - 0.048·500 = -22. For our example this gives:

a = -22
b = 0.048
Taste = -22 + 0.048x
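
The first-guess coefficients can be reproduced with a few lines of Python. This is only a sketch of one reading of the procedure: it assumes the line is drawn through the first and the last observations and that the slope is rounded to three decimals before the intercept is computed. The function name `line_through_endpoints` is our own.

```python
# Observed data from Table 2: water volume (ml) and taste rating.
volumes = list(range(500, 2300, 100))
ratings = [2, 3, 4, 6, 6, 8, 11, 12, 14, 17, 21, 24, 33, 36, 48, 52, 71, 83]


def line_through_endpoints(xs, ys):
    """Return (a, b) of y = a + b*x passing through the first and last points."""
    b = (ys[-1] - ys[0]) / (xs[-1] - xs[0])
    a = ys[0] - b * xs[0]
    return a, b


a_exact, b_exact = line_through_endpoints(volumes, ratings)

# Assumption: the slope is rounded first, then the intercept is derived from it.
b = round(b_exact, 3)
a = ratings[0] - b * volumes[0]

print(f"a = {a:.0f}, b = {b}")
```

Without the rounding step the intercept comes out slightly different (about -21.8), which is why the rounded values are used in the rest of the example.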

Let's tabulate the values of our model:

Water volume   Model value
500 ml         2
600 ml         6.8
700 ml         11.6
800 ml         16.4
900 ml         21.2
1000 ml        26
1100 ml        30.8
1200 ml        35.6
1300 ml        40.4
1400 ml        45.2
1500 ml        50
1600 ml        54.8
1700 ml        59.6
1800 ml        64.4
1900 ml        69.2
2000 ml        74
2100 ml        78.8
2200 ml        83.6
Table 3. Tabulated values of the regression model
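
The tabulated values can be generated rather than computed by hand. A minimal sketch, assuming the coefficients a = -22 and b = 0.048 from the model above (the variable names are ours):

```python
# Coefficients of the first-guess model from the article.
a, b = -22, 0.048

volumes = range(500, 2300, 100)  # 500 ml to 2200 ml in 100 ml steps
model_values = [a + b * volume for volume in volumes]

# Print one "volume, model value" pair per line.
for volume, value in zip(volumes, model_values):
    print(f"{volume} ml  {value:.1f}")
```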

Here's how it looks on the graph:

Graph 1. Linear regression model and initial data

Getting the result

With a stretch, of course, the line resembles the data. For a mathematical conclusion, however, we need to measure the spread between the model values and the real values. The measures used are the sum of the squared deviations and the standard error:

RSS (sum of squared deviations) = (2 - 2)² + (6.8 - 3)² + ... + (83.6 - 83)² = 7475
√RSS (root of the sum of squared deviations) = 86.46

S (standard error) = √(RSS / n) = √(7475 / 18) = 20.38
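
The three figures above can be checked with a short script. It is a sketch under one assumption: the last figure is taken to be the square root of RSS divided by the number of observations n = 18, which is the reading that reproduces 20.38. The helper name `rss` is ours.

```python
import math

# Observed data from Table 2.
volumes = list(range(500, 2300, 100))
ratings = [2, 3, 4, 6, 6, 8, 11, 12, 14, 17, 21, 24, 33, 36, 48, 52, 71, 83]


def rss(a, b, xs, ys):
    """Sum of squared deviations between the model a + b*x and the observations."""
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys))


total = rss(-22, 0.048, volumes, ratings)
root_rss = math.sqrt(total)
std_error = math.sqrt(total / len(volumes))  # assumption: divide by n, not n - 2

print(round(total, 2), round(root_rss, 2), round(std_error, 2))
```

Dividing by n - 2 instead of n is also common for a two-parameter model; that would give a somewhat larger value than the one quoted here.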

What is this regression model good for? It lets you predict the result for, say, 2300 ml or 2400 ml of water without conducting the experiment itself:

Taste(2300 ml) = -22 + 0.048 · 2300 = 88.4
Taste(2400 ml) = -22 + 0.048 · 2400 = 93.2

And, of course, we can find out how much water is needed for perfect pasta:

Water(perfect pasta) = (100 + 22) / 0.048 = 2542 ml
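
Both uses of the model, predicting a rating and solving for the water volume, fit in two small functions. This sketch assumes that a "perfect" pasta corresponds to a rating of 100. Solving 100 = a + bx for x gives x = (100 - a) / b, which with a = -22 is 122 / 0.048. The function names are ours.

```python
def predict(volume, a=-22, b=0.048):
    """Predicted taste rating for a given water volume in ml."""
    return a + b * volume


def volume_for(rating, a=-22, b=0.048):
    """Water volume in ml at which the model reaches the given rating."""
    # Solve rating = a + b * volume for volume.
    return (rating - a) / b


print(round(predict(2300), 1), round(predict(2400), 1))
print(round(volume_for(100)))  # assumption: a rating of 100 means "perfect"
```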

Minimizing the error

So, we have our model y = a + bx and the real values of the function. The difference between the real value and the model is the error we make in every experiment. This means we can build an error function, and once we have a function we can look for its minimum. That is what we will do: find the minimum of the error function.

The error is the difference between the real value and the modelled one. Since this difference can be either positive or negative, its sign has to be removed. One option is the absolute value (modulus) of the difference, which amounts to squaring the error and then taking the root, but it is simplest to work with the squared error directly. So our error for every known result is:

Yo is the value from observation, Ym is the value from the model
e = (Yo - Ym)² = (Yo - a - bx)²
Total error
S = Σe = Σ(Yo - a - bx)²

The function S is the error function that needs to be minimized; it depends on the parameters a and b. To find its minimum we will use a simple method: take the derivatives with respect to a and b and set them to zero (more complex methods of searching for a minimum are omitted here):

Derivatives of the error function with respect to the parameters a and b (y denotes the observed value Yo):
dS/da = Σ2(a + bx - y)
dS/db = Σ2(a + bx - y)x
Condition for the minimum of the function:
Σ2(a + bx - y) = 0
Σ2(a + bx - y)x = 0
Divide by 2 and expand the brackets (n is the number of observations):
na + bΣx = Σy
aΣx + bΣx² = Σxy
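
As a quick check of the derivative formulas, the sketch below evaluates both of them at the first-guess coefficients a = -22, b = 0.048. If that line were already the minimum of S, both values would be zero. The function name `gradient` is our own.

```python
# Observed data from Table 2.
volumes = list(range(500, 2300, 100))
ratings = [2, 3, 4, 6, 6, 8, 11, 12, 14, 17, 21, 24, 33, 36, 48, 52, 71, 83]


def gradient(a, b, xs=volumes, ys=ratings):
    """Return (dS/da, dS/db) for S = sum((y - a - b*x)**2)."""
    ds_da = sum(2 * (a + b * x - y) for x, y in zip(xs, ys))
    ds_db = sum(2 * (a + b * x - y) * x for x, y in zip(xs, ys))
    return ds_da, ds_db


ds_da, ds_db = gradient(-22, 0.048)

# Both derivatives are far from zero, so the first guess is not the minimum.
print(round(ds_da, 1), round(ds_db, 1))
```

Both derivatives are positive at this point, which suggests that S can still be reduced by lowering the coefficients.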

Find a solution:

Σx = 24 300
Σx² = 37 650 000
Σy = 451
Σxy = 810 000

18·a + 24300·b = 451
24300·a + 37650000·b = 810000

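To eliminate b, multiply the first equation by 37650000 / 24300 ≈ 1549.4 and subtract it from the second:

-3589·a = 111228 ∴ a ≈ -31
b = (451 - 18·a) / 24300 ≈ 0.042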
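
The sums and the solution of the two equations can be verified in code. This is a sketch that solves the 2×2 system with Cramer's rule rather than by elimination. The result is the same, only the route differs, and the variable names are ours.

```python
# Observed data from Table 2.
volumes = list(range(500, 2300, 100))
ratings = [2, 3, 4, 6, 6, 8, 11, 12, 14, 17, 21, 24, 33, 36, 48, 52, 71, 83]

n = len(volumes)
sum_x = sum(volumes)
sum_x2 = sum(x * x for x in volumes)
sum_y = sum(ratings)
sum_xy = sum(x * y for x, y in zip(volumes, ratings))

# Normal equations:
#   n*a     + sum_x*b  = sum_y
#   sum_x*a + sum_x2*b = sum_xy
det = n * sum_x2 - sum_x ** 2
a = (sum_y * sum_x2 - sum_x * sum_xy) / det
b = (n * sum_xy - sum_x * sum_y) / det

print(sum_x, sum_x2, sum_y, sum_xy)
print(round(a, 2), round(b, 4))  # roughly -30.99 and 0.0415
```

Rounded to the precision used in the article, these are a = -31 and b = 0.042.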

Let's try our new model in action:

Graph 3. Linear regression model adjusted by the least squares method, y = -31 + 0.042·x
RSS (sum of squared deviations) = (-10 - 2)² + (-5.8 - 3)² + ... + (61.4 - 83)² = 1612.4
√RSS (root of the sum of squared deviations) = 40.15

S (standard error) = √(RSS / n) = √(1612.4 / 18) = 9.46
Taste(2300 ml) = -31 + 0.042 · 2300 = 65.6
Taste(2400 ml) = -31 + 0.042 · 2400 = 69.8
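
To compare the two models side by side, the sketch below computes the sum of squared deviations and the prediction for 2300 ml for both sets of rounded coefficients. The dictionary labels and the helper name `rss` are ours.

```python
# Observed data from Table 2.
volumes = list(range(500, 2300, 100))
ratings = [2, 3, 4, 6, 6, 8, 11, 12, 14, 17, 21, 24, 33, 36, 48, 52, 71, 83]


def rss(a, b):
    """Sum of squared deviations of the model a + b*x from the observations."""
    return sum((a + b * x - y) ** 2 for x, y in zip(volumes, ratings))


# Rounded coefficients as used in the article.
models = {"first guess": (-22, 0.048), "least squares": (-31, 0.042)}

results = {}
for name, (a, b) in models.items():
    results[name] = (rss(a, b), a + b * 2300)
    print(f"{name}: RSS = {results[name][0]:.1f}, Taste(2300 ml) = {results[name][1]:.1f}")
```

The least squares line has the smaller RSS on the 18 known points, while its prediction for 2300 ml is lower than that of the first guess.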

As you may have noticed, the extrapolated predictions of our first model look closer to the truth than those of the adjusted model, even though the adjusted model has the smaller error on the known data. Why? Because the model was chosen incorrectly: the graph of the data looks more like an exponential curve, and even from knowledge of the process it is clear that a linear dependence does not belong here. But this was just an example of a linear regression model; read about more complex models and how to choose a model in the following articles.
