Parameters of the discrete distribution law

In the examples in this article, data is generated every time the page loads. To see an example with different values, reload the page.

Mathematical description

The law of distribution tells us the probability of a single event and the probability that a group of events will occur. In this article we will look at how to turn conclusions made "by eye" into mathematically sound statements.

An extremely important definition: mathematical expectation is the probability-weighted average of the values of a random variable. For a discrete distribution, this is the sum of the values multiplied by the corresponding probabilities, also known as the first moment:

(2) E(X) = Σ(p_i•X_i)

E comes from the English word Expected (expected value).
For mathematical expectation, the following equalities hold:

(3) E(X + Y) = E(X) + E(Y)
(4) E(X•Y) = E(X) • E(Y), if X and Y are independent

Moment of degree k:

(5) ν_k = E(X^k)

The central moment of degree k:

(6) μ_k = E[(X - E(X))^k]
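Formulas (5) and (6) can be checked numerically. The following is a minimal sketch, assuming the distribution is given as two parallel lists of values and probabilities; the function names `raw_moment` and `central_moment` and the fair-die example are illustrative and not part of the article.

```python
def raw_moment(values, probs, k):
    """Moment of degree k, formula (5): sum of p_i * x_i**k."""
    return sum(p * x ** k for x, p in zip(values, probs))


def central_moment(values, probs, k):
    """Central moment of degree k, formula (6): sum of p_i * (x_i - E(X))**k."""
    mean = raw_moment(values, probs, 1)
    return sum(p * (x - mean) ** k for x, p in zip(values, probs))


# Hypothetical example: a fair six-sided die
die_values = [1, 2, 3, 4, 5, 6]
die_probs = [1 / 6] * 6

die_mean = raw_moment(die_values, die_probs, 1)          # first moment = expectation
die_variance = central_moment(die_values, die_probs, 2)  # second central moment
print(die_mean, die_variance)
```

The first moment is the mathematical expectation, and the second central moment is the variance discussed later in the article.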

Average value

The average value (μ) of the distribution law is the mathematical expectation of a random variable (here the random variable is the outcome of an event). For example, how many visitors come to the store per hour on average:

Number of visitors	0	1	2	3	4	5	6
Number of observations	89	10	120	39	37	12	93
Table 1. Number of visitors per hour

Graph 1. Number of visitors per hour

To find the average value of all the results, you need to add everything together and divide by the number of results:

μ = (89 • 0 + 10 • 1 + 120 • 2 + 39 • 3 + 37 • 4 + 12 • 5 + 93 • 6) / 400 = 1133/400 = 2.83

We can do the same using formula (2), taking as probabilities the observation counts from Table 1 divided by 400:

μ = E(X) = Σ(X_i•p_i) = 0 • 0.2225 + 1 • 0.025 + 2 • 0.3 + 3 • 0.0975 + 4 • 0.0925 + 5 • 0.03 + 6 • 0.2325 = 2.8325 ≈ 2.83

This is the moment of the first degree, formula (5) with k = 1.

In fact, formula (2) is the arithmetic mean of all observed values.
Total: on average, 2.83 visitors per hour

Number of visitors	0	1	2	3	4	5	6
Probability (%)	22.3	2.5	30	9.8	9.3	3	23.3
Table 2. The law of distribution of the number of visitors
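The two ways of computing the average can be compared in a few lines. This is a minimal sketch using the data of Table 1; the variable names are illustrative.

```python
visitors = [0, 1, 2, 3, 4, 5, 6]
observations = [89, 10, 120, 39, 37, 12, 93]

total = sum(observations)  # number of observed hours

# Arithmetic mean: add everything together and divide by the number of results
mean_from_counts = sum(x * n for x, n in zip(visitors, observations)) / total

# Formula (2): values weighted by their probabilities
probabilities = [n / total for n in observations]
mean_from_probs = sum(x * p for x, p in zip(visitors, probabilities))

print(round(mean_from_counts, 2), round(mean_from_probs, 2))
```

Both results agree because dividing each count by the total before summing is the same operation as dividing the sum by the total.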

Deviation from the mean

Looking at this distribution, we can assume that on average the random variable is 100±5, because such values seem to be far more frequent than those below 95 or above 105:

Graph 2. Graph of the probability function. Distribution ≈ 100±5

The average value according to formula (2) is μ = 99.95, but how do we calculate how far the values are from the average? The notation 100±5 should be familiar to you. To get the value after the ± sign, we need to define a range of values around the mean. We could use the difference between each value of the random variable and the mean as a measure of distance:

(7) x_i - μ

but the sum of such differences, and therefore any quantity derived from that sum, is always zero, because positive and negative deviations cancel out. That is why the square of the difference was chosen as the measure of distance between a value and the mean:

(8) (x_i - μ)²
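Why the plain differences (7) are useless as a measure of spread can be seen on a tiny sample. This is a sketch with hypothetical numbers that do not come from the article.

```python
sample = [2, 4, 9]                 # hypothetical values
mean = sum(sample) / len(sample)   # 5.0

deviations = [x - mean for x in sample]             # formula (7)
squared_deviations = [d ** 2 for d in deviations]   # formula (8)

sum_of_deviations = sum(deviations)          # positive and negative terms cancel
sum_of_squares = sum(squared_deviations)     # squares cannot cancel

print(sum_of_deviations, sum_of_squares)
```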

Accordingly, the average distance is the mathematical expectation of the squared distances:

(9) σ² = E[(X - E(X))²]

If all n values are equally probable, the probability of each of them is 1/n, from which:

(10) σ² = E[(X - E(X))²] = ∑[(X_i - μ)²]/n

This is also the formula of the central moment (6) of the second degree.

σ is squared because we took squared distances instead of distances. σ² is called the variance. The square root of the variance is called the mean square deviation, or the standard deviation, and it is used as a measure of spread:

(11) μ±σ
(12) σ = √(σ²) = √[∑[(X_i - μ)²]/n]

Returning to the example, let's calculate the standard deviation for graph 2:

σ = √(∑(x-μ)²/n) = √{[(90 - 99.95)² + (91 - 99.95)² + (92 - 99.95)² + (93 - 99.95)² + (94 - 99.95)² + (95 - 99.95)² + (96 - 99.95)² + (97 - 99.95)² + (98 - 99.95)² + (99 - 99.95)² + (100 - 99.95)² + (101 - 99.95)² + (102 - 99.95)² + (103 - 99.95)² + (104 - 99.95)² + (105 - 99.95)² + (106 - 99.95)² + (107 - 99.95)² + (108 - 99.95)² + (109 - 99.95)² + (110 - 99.95)²]/21} = 6.06

So, for graph 2 we got:

X = 99.95±6.06 ≈ 100±6, which is slightly different from the estimate made "by eye"
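The calculation by formula (12) can be reproduced as follows. This is a minimal sketch that, like the calculation above, treats the 21 values from 90 to 110 as equally probable and takes μ = 99.95 as given.

```python
import math

values = list(range(90, 111))  # 90, 91, ..., 110
mu = 99.95                     # mean taken from the text

# Formula (10): mean of squared distances from the mean
variance = sum((x - mu) ** 2 for x in values) / len(values)

# Formula (12): standard deviation is the square root of the variance
sigma = math.sqrt(variance)

print(round(sigma, 2))  # 6.06
```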

Quantile

Graph 3. Distribution function. Median

Graph 4. Distribution function. 4-quantile or quartile

Graph 5. Distribution function. 0.34-quantile

To analyze the distribution function, the concept of a quantile was introduced. A quantile is the value of the random variable at which the distribution function reaches a given probability level. For example, the quantile for a probability level of 50% is the value at which the distribution function equals 50%. In the example with graph 3, the quantile of level 0.5 is 99 (the nearest value, since the distribution is discrete and events with a value of 99.3 simply do not exist).

2-quantile - median
4-quantile - quartile
10-quantile - decile
100-quantile - percentile

That is, if we are talking about deciles (10-quantiles), it means that we have divided the graph into 10 parts, which corresponds to nine dividing lines, and for each line we have found the value of the random variable.

The notation x-quantile is also used, where x is a fractional number, for example the 0.34-quantile. This notation means the value of the random variable at which p = 0.34.

For a discrete distribution, the quantile must be chosen as follows: the quantile guarantees the probability, so if the calculated quantile does not coincide with one of the values, the smaller value must be chosen.

For example, suppose we have a discrete distribution of 1325 values, each with a probability of 1/1325. The 10th percentile is then the value whose position in the sorted list does not exceed 10% of 1325, that is, a position equal to or less than 132.5, so we take the 132nd value.
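The rule of choosing the smaller value can be sketched as follows. This assumes equally probable values sorted in ascending order; `lower_quantile` is a hypothetical helper, and the sample 1..1325 is made up for illustration. Statistical libraries often use other conventions (for example, interpolation between neighbors), so results may differ from theirs.

```python
import math


def lower_quantile(sorted_values, p):
    """Value at position floor(p * n) of an ascending list (positions start at 1).

    Rounding the position down implements the "choose the smaller value" rule.
    """
    n = len(sorted_values)
    position = max(math.floor(p * n), 1)  # never go below the first value
    return sorted_values[position - 1]


# Hypothetical data: 1325 distinct values, each with probability 1/1325
sample = list(range(1, 1326))

q10 = lower_quantile(sample, 0.10)  # 10% of 1325 = 132.5, position 132
print(q10)
```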

Building intervals

Quantiles are used to construct confidence intervals, which are needed to study not one specific event (for example, the random number equals 98) but a group of events (for example, the random number lies between 96 and 99). A confidence interval can be one-sided or two-sided. The parameter of a confidence interval is the confidence level. The confidence level is the percentage of events that can be considered successful, that is, that fall inside the interval.

Two-sided confidence interval

The two-sided confidence interval is constructed as follows: we set the significance level, for example 10%, and select an area on the graph so that 90% of all events fall into it. Since the interval is two-sided, we cut off 5% on each side, i.e. we look for the 5th and the 95th percentiles. The values of the random variable between them form the confidence area; values outside the confidence area are called the "critical area".

Graph 6. Probability density

Graph 7. Distribution function with the 5th and 95th percentiles. The confidence interval with a confidence level of 0.9 is highlighted in color

Graph 8. Probability function and two-sided confidence interval with a confidence level of 90%
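The construction of a two-sided interval can be sketched in code. The counts below are hypothetical and do not reproduce the graphs. The helper `quantile_from_counts` is an assumption of this sketch: it takes the smallest value at which the cumulative share reaches the requested level, which is one common convention for discrete data.

```python
def quantile_from_counts(values, counts, level):
    """Smallest value whose cumulative share of observations reaches `level`."""
    total = sum(counts)
    cumulative = 0
    for value, count in zip(values, counts):
        cumulative += count
        if cumulative >= level * total:
            return value
    return values[-1]


# Hypothetical distribution concentrated around 100
values = list(range(95, 106))                    # 95, 96, ..., 105
counts = [1, 2, 5, 10, 18, 28, 18, 10, 5, 2, 1]  # 100 observations in total

significance = 0.10
lower = quantile_from_counts(values, counts, significance / 2)      # 5th percentile
upper = quantile_from_counts(values, counts, 1 - significance / 2)  # 95th percentile

# Share of observations that actually fall inside [lower, upper]
inside = sum(c for v, c in zip(values, counts) if lower <= v <= upper)
coverage = inside / sum(counts)

print(lower, upper, coverage)
```

For a discrete distribution the actual coverage is usually not exactly equal to the requested confidence level, because whole values are either included or excluded.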

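One-sided confidence interval

The left-sided and right-sided confidence intervals are constructed similarly to the two-sided one. For the left-sided interval, we find the percentile at the significance level; for the right-sided interval, we find the percentile at the level ['one' minus 'significance level']. Thus, to construct a left-sided confidence interval with a significance level of 4%, we find the 4th percentile: everything to the right of it is the confidence interval, and everything to the left is the critical area. For a right-sided interval with the same significance level, we find the 96th percentile: everything to the left of it is the confidence interval, and everything to the right is the critical area.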

Graph 9. Left-sided confidence interval with a significance level of 4%. The fill highlights the confidence interval

Graph 10. Right-sided confidence interval with a significance level of 4%. The fill highlights the confidence interval
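One-sided bounds can be found with a cumulative sum and a binary search. This is a sketch on hypothetical counts that do not reproduce graphs 9 and 10. It assumes the convention that a percentile is the smallest value at which the cumulative count reaches the requested level; `percentile_value` is an illustrative name.

```python
from bisect import bisect_left
from itertools import accumulate

# Hypothetical distribution concentrated around 100
values = list(range(95, 106))                    # 95, 96, ..., 105
counts = [1, 2, 5, 10, 18, 28, 18, 10, 5, 2, 1]  # 100 observations in total

cumulative = list(accumulate(counts))  # running totals: 1, 3, 8, ...
total = cumulative[-1]


def percentile_value(level):
    """Smallest value whose cumulative count reaches level * total."""
    index = bisect_left(cumulative, level * total)
    return values[min(index, len(values) - 1)]


significance = 0.04

# Left-sided interval: cut off 4% on the left, keep everything to the right
left_bound = percentile_value(significance)

# Right-sided interval: cut off 4% on the right, keep everything to the left
right_bound = percentile_value(1 - significance)

print(left_bound, right_bound)
```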

Summary

The average value is the mathematical expectation of a random variable, found by the formula:

μ = E(X) = Σ(p_i•X_i)

The standard deviation is the square root of the mathematical expectation of the squared distance of values from the average. It is found by the formula:

σ = √(σ²) = √[∑[(X_i - μ)²]/n]

n-quantile - division of the distribution function into n equal segments, the main types of quantiles:

2-quantile - median
4-quantile - quartiles
10-quantile - deciles
100-quantile - percentiles

The confidence interval of level α is a section of the probability function that contains a share α of all possible outcomes. The two-sided confidence interval is constructed by clipping (1-α)/2 on the right and on the left. The left-sided and right-sided confidence intervals are constructed by clipping an area of (1-α) on the left and on the right respectively.

Construct a distribution series

Suppose we have 100 values and all are different, for example the body weight of Somali pirates. Such a data set is inconvenient to process, and we cannot even present it on a regular graph. Therefore, we need to group the available data, and for this we do the following:

Let's write down our data in a table:

68	95	63	103	101	82	61	75	86	102
105	90	78	114	113	96	95	100	62	77
74	109	92	90	94	104	58	96	61	60
70	63	92	106	89	79	69	97	88	115
105	97	92	97	108	64	89	115	58	80
104	60	85	75	61	68	77	76	92	72
71	93	82	85	81	100	91	90	113	90
112	104	106	58	113	90	72	89	101	64
83	84	64	81	67	73	59	97	80	58
58	71	72	76	99	87	97	84	80	70
Table 3. Weight of Somali pirates

We will divide the data into groups. To begin with, I suggest splitting it into nine intervals.

Find the maximum and minimum values, subtract the minimum from the maximum and divide the difference by the number of intervals to get the interval length:
Maximum value: 115
Minimum value: 58
Difference: 115 - 58 = 57
Interval length: 57 / 9 ≈ 6.33, rounded up to 6.34 so that nine intervals cover the whole range

Now let's count the number of pirates (weights, I mean) in each interval:

#	Interval	Number of elements
1.	58 - 64.34	17
2.	64.34 - 70.68	6
3.	70.68 - 77.02	13
4.	77.02 - 83.36	10
5.	83.36 - 89.7	10
6.	89.7 - 96.04	16
7.	96.04 - 102.38	11
8.	102.38 - 108.72	9
9.	108.72 - 115.06	8
Table 4. Number of elements in intervals
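The counts in Table 4 can be reproduced with a short script. This is a minimal sketch using the data of Table 3. Rounding the interval length up to two decimals is an assumption about how the value 6.34 was obtained.

```python
import math

weights = [
    68, 95, 63, 103, 101, 82, 61, 75, 86, 102,
    105, 90, 78, 114, 113, 96, 95, 100, 62, 77,
    74, 109, 92, 90, 94, 104, 58, 96, 61, 60,
    70, 63, 92, 106, 89, 79, 69, 97, 88, 115,
    105, 97, 92, 97, 108, 64, 89, 115, 58, 80,
    104, 60, 85, 75, 61, 68, 77, 76, 92, 72,
    71, 93, 82, 85, 81, 100, 91, 90, 113, 90,
    112, 104, 106, 58, 113, 90, 72, 89, 101, 64,
    83, 84, 64, 81, 67, 73, 59, 97, 80, 58,
    58, 71, 72, 76, 99, 87, 97, 84, 80, 70,
]

n_intervals = 9
low, high = min(weights), max(weights)

# Interval length rounded up to two decimals (assumption: 6.333... -> 6.34)
width = math.ceil((high - low) / n_intervals * 100) / 100

counts = [0] * n_intervals
for w in weights:
    # Index of the interval the weight falls into; the maximum stays in the last one
    index = min(int((w - low) / width), n_intervals - 1)
    counts[index] += 1

for i, count in enumerate(counts):
    print(f"{low + i * width:.2f} - {low + (i + 1) * width:.2f}: {count}")
```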

Voila, our distribution on the graph:

Graph 11. Body mass distribution series of Somali pirates

Bonus

It is better for the interval length to be an integer, so if the selected number of intervals gives a non-integer length, you can expand the range of values, for example:

The interval length is 6.34, which is not an integer, so we push back the upper bound:
Remainder of the division: (115 - 58) mod 9 = 3
Move the upper bound by: 9 - 3 = 6
New range: [58;121], interval length: (121 - 58) / 9 = 7

The range can be moved both up and down, but preferably in both directions.
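The widening of the range can be sketched as follows. For simplicity this sketch moves only the upper bound, as in the example above; the variable names are illustrative.

```python
low, high, n_intervals = 58, 115, 9

# How far the current range is from a multiple of the number of intervals
remainder = (high - low) % n_intervals

# Amount to add so that the range divides evenly (0 if it already does)
shift = (n_intervals - remainder) % n_intervals

new_high = high + shift
width = (new_high - low) // n_intervals  # integer interval length

# Interval boundaries of the widened range
edges = [low + i * width for i in range(n_intervals + 1)]

print(remainder, shift, new_high, width)
print(edges)
```

To move the range in both directions, the shift could be split between the lower and the upper bound.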

Tip

It is customary to divide the distribution into 7-8 intervals, but in each specific situation you can choose a different number of intervals, as well as make them of different lengths.

List of parameters

So, here is a list of the main parameters of the discrete distribution law:

Name	Symbol	Formula
Mathematical expectation (average)	E(X)	Σ(p_i•X_i)
Standard deviation (root of the second central moment)	σ_x	σ = √(σ²) = √[∑[(X_i - μ)²]/n]
Range	R	max(x) - min(x)
Mode	m_o	max P(x = m_o)
1st quartile	-	F(x) = 0.25
Median	m_e	F(x) = 0.5
1st decile	-	F(x) = 0.1
Table 5. Basic parameters of the discrete distribution law

Histogram template in OpenOffice Calc

File histogram_mock.ods contains a histogram construction template.


Author: Zakhar Telyatnikov
Last edit time: 23.10.2023

24.10.2016
