A discrete variation series is constructed for discrete characteristics.

To construct a discrete variation series, perform the following steps:

1) arrange the units of observation in ascending order of the studied value of the characteristic;

2) determine all possible values $x_i$ of the characteristic and arrange them in ascending order;

3) count how many times each value occurs; this number is called the frequency of the value and is denoted $f_i$;

4) write the values and their frequencies into a table.

The sum of all frequencies of a series is equal to the number of elements in the population being studied.

Example 1 .

List of grades received by students in exams: 3; 4; 3; 5; 4; 2; 2; 4; 4; 3; 5; 2; 4; 5; 4; 3; 4; 3; 3; 4; 4; 2; 2; 5; 5; 4; 5; 2; 3; 4; 4; 3; 4; 5; 2; 5; 5; 4; 3; 3; 4; 2; 4; 4; 5; 4; 3; 5; 3; 5; 4; 4; 5; 4; 4; 5; 4; 5; 5; 5.

Here X, the grade, is a discrete random variable, and the resulting list of grades is the statistical (observed) data.

1) Arrange the observation units in ascending order of the studied characteristic value:

2; 2; 2; 2; 2; 2; 2; 2; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5.

2) Determine all possible values $x_i$ of the characteristic and arrange them in ascending order.

In this example, all grades fall into four groups with the following values: 2; 3; 4; 5.

The value of the random variable corresponding to a particular group of observed data is called a value of the characteristic, or a variant, and is denoted $x_i$.

3) Count the frequency of each value. The number that shows how many times a given value of the characteristic occurs in the series of observations is called the frequency of the value and is denoted $f_i$.

In our example:

grade 2 occurs 8 times,

grade 3 occurs 12 times,

grade 4 occurs 23 times,

grade 5 occurs 17 times.

There are 60 grades in total.

4) Write the data into a table of two rows (or columns), $x_i$ and $f_i$:

Grade, x_i        2    3    4    5
Frequency, f_i    8   12   23   17

This table is the discrete variation series for the example.

A discrete variation series is a table that lists the observed values of the characteristic being studied individually, in ascending order, together with their frequencies.
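The four steps above can be sketched in a few lines of Python. This is only an illustration: the function name `discrete_series` is hypothetical, and the grade list is rebuilt from the frequencies stated in Example 1 rather than typed in observation order.

```python
from collections import Counter

# Grades from Example 1, rebuilt from the stated frequencies (8, 12, 23, 17).
grades = [2] * 8 + [3] * 12 + [4] * 23 + [5] * 17

def discrete_series(values):
    """Return (x_i, f_i) pairs sorted by x_i in ascending order."""
    counts = Counter(values)        # step 3: count each value
    return sorted(counts.items())   # steps 1-2: values in ascending order

series = discrete_series(grades)
for x_i, f_i in series:             # step 4: print the two-column table
    print(f"x_i = {x_i}   f_i = {f_i}")

# The frequencies add up to the number of observations.
total = sum(f_i for _, f_i in series)
print("total:", total)
```

The final sum is a quick check of the rule that the frequencies of a series add up to the size of the population.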

  1. Construction of an interval variation series

In addition to the discrete variation series, another common way of grouping data is the interval variation series.

An interval series is constructed if:

    the characteristic varies continuously;

    there are many discrete values (more than 10);

    the frequencies of the discrete values are very small (no more than 1-3 with a relatively large number of observation units);

    many discrete values of the characteristic have the same frequency.

An interval variation series is a way of grouping data in the form of a table with two columns: the values of the characteristic given as intervals, and the frequency of each interval.

Unlike a discrete series, the values of the characteristic in an interval series are represented not by individual values but by intervals of values (“from - to”).

The number that shows how many observation units fall into each interval is called the frequency of the interval and is denoted $f_i$. The sum of all frequencies of a series is equal to the number of elements (units of observation) in the population being studied.

If a unit has a characteristic value equal to the upper limit of the interval, then it should be assigned to the next interval.

For example, a child with a height of 100 cm will fall into the 2nd interval, and not into the first; and a child with a height of 130 cm will fall into the last interval, and not into the third.
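The boundary rule can be illustrated with a short Python sketch. The boundaries 100, 110 and 130 cm are assumptions inferred from the example (the original table is not reproduced here), and `interval_index` is a hypothetical helper name.

```python
from bisect import bisect_right

# Interior boundaries (cm) separating four height intervals.
# Assumed values, inferred from the example in the text.
boundaries = [100, 110, 130]

def interval_index(value, bounds):
    """Return the 1-based number of the interval that contains value.

    bisect_right places a value equal to a boundary to the right of it,
    so a value on an upper limit is assigned to the next interval.
    """
    return bisect_right(bounds, value) + 1

print(interval_index(100, boundaries))  # 2: second interval, not the first
print(interval_index(130, boundaries))  # 4: last interval, not the third
print(interval_index(99, boundaries))   # 1
```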

Based on these data, an interval variation series can be constructed.

Each interval has a lower limit ($x_n$), an upper limit ($x_v$) and an interval width ($i$).

The interval boundary is the value of the attribute that lies on the border of two intervals.

Table. Children's height (cm) and number of children; the last interval is “more than 130”.

If an interval has both an upper and a lower boundary, it is called a closed interval. If an interval has only a lower or only an upper boundary, it is an open interval. Only the very first or the very last interval can be open. In the above example, the last interval is open.

Interval width ($i$) is the difference between the upper and lower limits:

$$i = x_v - x_n$$

The width of an open interval is assumed to be the same as the width of the adjacent closed interval.

In the example, the open interval “more than 130” is given a width of 20, because the width of the adjacent closed interval is 20; for calculations, its upper limit is taken as 130 + 20 = 150.

All interval series are divided into series with equal intervals and series with unequal intervals. In a series with equal intervals, the width of all intervals is the same. In a series with unequal intervals, the widths of the intervals differ.

The example under consideration is an interval series with unequal intervals.

Laboratory work No. 1

in mathematical statistics

Topic: Primary processing of experimental data


Goal of the work

Acquiring skills in primary processing of empirical data using methods of mathematical statistics.

Based on the totality of experimental data, complete the following tasks:

Task 1. Construct an interval variation distribution series.

Task 2. Construct a histogram of frequencies of the interval variation series.

Task 3. Construct the empirical distribution function and plot its graph.

Task 4. Find the numerical characteristics of the sample:

a) mode and median;

b) conditional initial moments;

c) sample mean;

d) sample variance, corrected variance of the population, corrected standard deviation;

e) coefficient of variation;

f) asymmetry;

g) kurtosis.

Task 5. Determine the boundaries of the true values of the numerical characteristics of the random variable being studied with a given reliability.

Task 6. Give a substantive interpretation of the results of primary processing according to the conditions of the task.

Score in points

Tasks 1-5: 6 points

Task 6: 2 points

Defense of the laboratory work (oral interview on the test questions and the laboratory work): 2 points

The work must be submitted in written form on A4 sheets and must include:

1) Title page (Appendix 1).

2) Initial data.

3) The work laid out according to the specified template.

4) Calculation results (done manually and/or using MS Excel) in the specified order.

5) Conclusions: a substantive interpretation of the results of primary processing according to the conditions of the problem.

6) Oral interview on the work and the test questions.



5. Test questions


Methodology for performing laboratory work

Task 1. Construct an interval variational distribution series

To present statistical data as a variation series with equally spaced variants, proceed as follows:

1. In the original data table, find the smallest and largest values, $x_{\min}$ and $x_{\max}$.

2. Determine the range of variation:

$$R = x_{\max} - x_{\min}$$

3. Determine the length of the interval $h$. If the sample contains up to 1000 values, use the formula:

$$h = \frac{R}{1 + 3.322 \lg n}$$

where $n$ is the sample size, that is, the number of values in the sample (the decimal logarithm $\lg n$ is used in the calculation).

The calculated ratio is rounded to a convenient integer value.

4. To determine the beginning of the first interval for an even number of intervals, it is recommended to take the value ; and for an odd number of intervals .

5. Write down the grouping intervals in ascending order of their boundaries. The lower limit of the first interval is a convenient number no greater than the smallest value in the sample; the upper limit of the last interval should be no less than the largest value. It is recommended that the intervals cover all the original values of the random variable and that there be between 5 and 20 intervals.

6. Distribute the original data over the grouping intervals, i.e. use the source table to count the number of values of the random variable falling within each interval. If some values coincide with interval boundaries, assign them consistently either only to the preceding or only to the following interval.

Note 1. The intervals do not have to be equal in length. Where the values are denser, it is more convenient to take shorter intervals, and where they are sparser, longer ones.

Note 2. If zero or very small frequencies are obtained for some intervals, the data must be regrouped with larger intervals (a larger step).
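The procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the sample is hypothetical, the interval length is taken from the ratio R / (1 + 3.322 lg n) rounded up to an integer, the first interval is assumed to start at the smallest value (the recommended starting point is not reproduced here), and a value lying on a boundary is assigned to the following interval.

```python
import math

def interval_series(data):
    """Group data into equal-width intervals.

    Returns a list of ((lower, upper), frequency) pairs.
    """
    x_min, x_max = min(data), max(data)
    n = len(data)
    spread = x_max - x_min                      # range of variation R
    # Interval length h, rounded up to a convenient integer (at least 1).
    h = max(math.ceil(spread / (1 + 3.322 * math.log10(n))), 1)
    k = int(spread // h) + 1                    # enough intervals to cover x_max
    freqs = [0] * k
    for x in data:
        # Boundary values go to the following interval.
        freqs[int((x - x_min) // h)] += 1
    return [((x_min + j * h, x_min + (j + 1) * h), f)
            for j, f in enumerate(freqs)]

# Hypothetical sample of 16 measurements.
sample = [12, 15, 11, 19, 23, 17, 14, 21, 16, 18, 25, 13, 20, 22, 17, 15]
series = interval_series(sample)
for (lower, upper), f in series:
    print(f"{lower}-{upper}: {f}")
```

For this sample the sketch produces five intervals of length 3, which is at the lower end of the recommended 5 to 20 intervals.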

In many cases, when a statistical population includes a large or even infinite number of variants, which most often occurs with continuous variation, it is practically impossible and impractical to form a group of units for each variant. In such cases, statistical units can be combined into groups only on the basis of an interval, i.e. a group that has certain limits for the values of the varying characteristic. These limits are given by two numbers indicating the upper and lower limits of each group. The use of intervals leads to the formation of an interval distribution series.

An interval series is a variation series whose variants are presented in the form of intervals.

An interval series can be formed with equal or unequal intervals; the choice depends mainly on how representative the statistical population is and on convenience. If the population is large enough (representative) in terms of the number of units and is homogeneous in composition, it is advisable to form the interval series with equal intervals. This principle is usually applied to populations in which the range of variation is relatively small, i.e. the maximum and minimum variants differ from each other by only a few times. In this case, the size of the equal interval is the ratio of the range of variation of the characteristic to the chosen number of intervals. To determine the equal interval, Sturges' formula can be used (usually when the variation of the characteristic is small and the number of units in the population is large):

$$x_i = \frac{X_{\max} - X_{\min}}{1 + 3.322 \lg n} \qquad (5.1)$$

where $x_i$ is the size of the equal interval; $X_{\max}$ and $X_{\min}$ are the maximum and minimum variants in the statistical population; $n$ is the number of units in the population.

Example. Calculate the size of the equal interval for the density of radioactive contamination with cesium-137 in 100 settlements of the Krasnopolsky district of the Mogilev region, given that the initial (minimum) variant is 1 Ci/km² and the final (maximum) variant is 65 Ci/km². Using formula (5.1), we get:

$$x_i = \frac{65 - 1}{1 + 3.322 \lg 100} = \frac{64}{7.644} \approx 8.4$$

Consequently, to form an interval series with equal intervals for the density of cesium-137 contamination of settlements in the Krasnopolsky district, the size of the equal interval can be taken as 8 Ci/km².
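The calculation in this example can be checked with a short Python sketch. It assumes the usual form of Sturges' rule, range / (1 + 3.322 lg n); the function name `sturges_interval` is hypothetical.

```python
import math

def sturges_interval(x_min, x_max, n):
    """Size of the equal interval: range of variation / (1 + 3.322 lg n)."""
    return (x_max - x_min) / (1 + 3.322 * math.log10(n))

# Cesium-137 example: minimum 1, maximum 65 Ci/km^2, 100 settlements.
size = sturges_interval(1, 65, 100)
print(round(size, 2))   # 8.37
print(round(size))      # 8
```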

When the distribution is uneven, i.e. when the maximum and minimum variants differ by a factor of hundreds, the principle of unequal intervals can be applied when forming an interval series. Unequal intervals usually widen as the values of the characteristic increase.

Intervals can be closed or open. Closed intervals are those that have both a lower and an upper boundary. Open intervals have only one boundary: the first interval has only an upper boundary, and the last one only a lower boundary.

Interval series, especially those with unequal intervals, should be evaluated taking into account the distribution density, which is most simply calculated as the ratio of the local frequency (or relative frequency) to the size of the interval.

To form an interval series in practice, the layout of Table 5.3 can be used.

Table 5.3. The procedure for forming an interval series of settlements in the Krasnopolsky district according to the density of radioactive contamination with cesium-137

The main advantage of an interval series is its maximum compactness. At the same time, in an interval distribution series the individual variants of the characteristic are hidden within the corresponding intervals.

When an interval series is depicted graphically in a rectangular coordinate system, the upper boundaries of the intervals are plotted on the abscissa axis, and the local frequencies of the series on the ordinate axis. The graphical construction of an interval series differs from the construction of a distribution polygon in that each interval has a lower and an upper boundary, so two abscissas correspond to one ordinate value. Therefore, the graph of an interval series shows not a point, as in a polygon, but a line connecting two points. These horizontal lines are connected to each other by vertical lines, producing a stepped polygon, which is commonly called a distribution histogram (Fig. 5.3).

When an interval series is constructed graphically for a sufficiently large statistical population, the histogram approaches a symmetrical form of distribution. When the statistical population is small, the histogram is, as a rule, asymmetrical.

In some cases, it is advisable to form a series of accumulated frequencies, i.e. a cumulative series. A cumulative series can be formed on the basis of a discrete or an interval distribution series. When a cumulative series is depicted graphically in a rectangular coordinate system, the variants are plotted on the abscissa axis, and the accumulated frequencies (or relative frequencies) on the ordinate axis. The resulting curve is usually called the cumulate of the distribution (Fig. 5.4).
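As a small illustration, a cumulative series can be computed from the discrete series of exam grades given earlier (values 2, 3, 4, 5 with frequencies 8, 12, 23, 17). The variable names below are chosen for this sketch only.

```python
from itertools import accumulate

values = [2, 3, 4, 5]        # variants x_i
freqs = [8, 12, 23, 17]      # frequencies f_i

# Accumulated frequencies: running totals of f_i.
cumulative = list(accumulate(freqs))

# Accumulated relative frequencies: running totals divided by the sample size.
total = cumulative[-1]
relative = [c / total for c in cumulative]

for x, c, r in zip(values, cumulative, relative):
    print(f"x <= {x}: {c} observations ({r:.2f})")
```

Plotting `cumulative` (or `relative`) against `values` gives the points of the cumulate.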

The formation and graphical representation of various types of variation series simplifies the calculation of the main statistical characteristics, which are discussed in detail in topic 6, and helps to better understand the laws of distribution of the statistical population. Analysis of a variation series becomes particularly important when it is necessary to identify and trace the relationship between the variants and the frequencies (or relative frequencies). This dependence shows itself in the fact that the number of cases per variant is related in a certain way to the size of that variant, i.e. as the values of the varying characteristic increase, the frequencies (or relative frequencies) of these values undergo definite, systematic changes. This means that the numbers in the frequency column are not subject to chaotic fluctuations but change in a certain direction, in a certain order and sequence.

If the frequencies show a certain systematicity in their changes, then this means that we are on the way to identifying a pattern. The system, order, sequence in changes in frequencies is a reflection of general causes, general conditions characteristic of the entire population.

It should not be assumed that the distribution pattern is always given in ready-made form. There are quite a lot of variation series in which the frequencies jump erratically, sometimes increasing, sometimes decreasing. In such cases, it is advisable to find out what kind of distribution the researcher is dealing with: either this distribution has no inherent pattern at all, or its nature has not yet been revealed. The first case is rare, while the second is quite common.

For example, when an interval series is formed, the total number of statistical units may be small, and each interval may contain a small number of variants (for example, 1-3 units). In such cases, one cannot count on any pattern showing itself. For a regular result to be obtained from random observations, the law of large numbers must come into play, i.e. each interval must contain not a few but tens or hundreds of statistical units. To this end, we must try to increase the number of observations as much as possible. This is the surest way of detecting patterns in mass processes. If there is no real opportunity to increase the number of observations, a pattern can be identified by reducing the number of intervals in the distribution series. Reducing the number of intervals in a variation series increases the frequency in each interval. As a result, the random fluctuations of individual statistical units are superimposed on one another and “smoothed out”, turning into a pattern.

The formation and construction of variation series gives only a general, approximate picture of the distribution of the statistical population. For example, a histogram expresses the relationship between the values of a characteristic and its frequencies (or relative frequencies) only in rough form. Therefore, variation series are essentially only the basis for further, in-depth study of the internal regularity of the statistical distribution.

TEST QUESTIONS FOR TOPIC 5

1. What is variation? What causes variation in a trait in a statistical population?

2. What types of varying characteristics can occur in statistics?

3. What is a variation series? What types of variation series can there be?

4. What is a ranked series? What are its advantages and disadvantages?

5. What is a discrete series, and what are its advantages and disadvantages?

6. What is the procedure for forming an interval series, what are its advantages and disadvantages?

7. How are ranked, discrete and interval distribution series represented graphically?

8. What is the cumulate of distribution and what does it characterize?

Mathematical statistics is a branch of mathematics devoted to mathematical methods for processing, systematizing and using statistical data to draw scientific and practical conclusions.

3.1. BASIC CONCEPTS OF MATHEMATICAL STATISTICS

In medical and biological problems it is often necessary to study the distribution of a particular characteristic over a very large number of individuals. The characteristic takes different values in different individuals, so it is a random variable. For example, any medicinal drug has varying effectiveness when applied to different patients. However, to get an idea of the effectiveness of the drug, there is no need to give it to every patient. It is possible to follow the results of using the drug in a relatively small group of patients and, based on the data obtained, identify the essential features (efficacy, contraindications) of the treatment process.

Population (general population) - a set of homogeneous elements characterized by some attribute to be studied. This attribute is a continuous random variable with distribution density f(x).

For example, if we are interested in the prevalence of a disease in a certain region, then the general population is the entire population of the region. If we want to find out the susceptibility of men and women to this disease separately, then we should consider two general populations.

To study the properties of a general population, a certain part of its elements is selected.

Sample - the part of the general population selected for examination (treatment).

When it does not cause confusion, the term sample is used both for the set of objects selected for examination and for the set of values of the studied characteristic obtained during the examination. These values can be represented in several ways.

Simple statistical series - the values of the characteristic being studied, recorded in the order in which they were obtained.

An example of a simple statistical series obtained by measuring the surface wave velocity (m/s) in the skin of the forehead in 20 patients is given in Table 3.1.

Table 3.1. Simple statistical series

A simple statistical series is the main and most complete way of recording examination results. It can contain hundreds of elements, and it is very difficult to take in such a set at a glance. Therefore, large samples are usually divided into groups. To do this, the range of the characteristic is divided into several (N) intervals of equal width, and the relative frequencies ($n_i/n$) with which the characteristic falls into these intervals are calculated. The width of each interval is:

$$d = \frac{x_{\max} - x_{\min}}{N}$$

where $x_{\max}$ and $x_{\min}$ are the largest and smallest values in the sample.

The interval boundaries have the following values:

$$x_i = x_{\min} + d \cdot i, \quad i = 0, 1, \ldots, N$$

If a sample element lies on the boundary between two adjacent intervals, it is assigned to the left interval. Data grouped in this way are called an interval statistical series.

An interval statistical series is a table that shows intervals of values of the characteristic and the relative frequencies with which the characteristic occurs within these intervals.

In our case, we can form, for example, the following interval statistical series (N = 5, d = 4), Table 3.2.

Table 3.2. Interval statistical series

Here, the interval 28-32 includes two values equal to 28 (Table 3.1), and the interval 32-36 includes the values 32, 33, 34 and 35.

An interval statistical series can be depicted graphically. To do this, the intervals of values of the characteristic are plotted along the abscissa axis, and on each of them, as on a base, a rectangle is built with a height equal to the relative frequency. The resulting bar chart is called a histogram.

Fig. 3.1. Histogram

In the histogram, the statistical patterns of the distribution of the characteristic are visible quite clearly.

With a large sample size (several thousand) and narrow columns, the shape of the histogram is close to the shape of the graph of the distribution density of the characteristic.

The number of histogram columns can be selected using the following formula:

Constructing a histogram manually is a long process. Therefore, computer programs have been developed to construct histograms automatically.

3.2. NUMERIC CHARACTERISTICS OF STATISTICAL SERIES

Many statistical procedures use sample estimates of the population expectation and variance (or standard deviation).

Sample mean ($\bar{X}$) is the arithmetic mean of all elements of a simple statistical series:

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

For our example $\bar{X} = 37.05$ (m/s).

The sample mean is the best estimate of the general mean $M$.

Sample variance $s^2$ is equal to the sum of squared deviations of the elements from the sample mean, divided by $n - 1$:

$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{X})^2 \qquad (3.3)$$

In our example, $s^2 = 25.2$ (m/s)².

Note that when calculating the sample variance, the denominator of the formula is not the sample size $n$ but $n - 1$. This is because, when calculating the deviations in formula (3.3), the unknown mathematical expectation is replaced by its estimate, the sample mean.

The sample variance is the best estimate of the general variance ($\sigma^2$).

Sample standard deviation ($s$) is the square root of the sample variance:

$$s = \sqrt{s^2}$$

For our example $s = 5.02$ (m/s).

The sample standard deviation is the best estimate of the general standard deviation ($\sigma$).

With an unlimited increase in sample size, all sample characteristics tend to the corresponding characteristics of the general population.

Computer functions are used to calculate sample characteristics. In Excel, these calculations are performed by the statistical functions AVERAGE, VAR and STDEV.
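The same characteristics can be computed with Python's standard `statistics` module, as in the sketch below. The data are hypothetical (Table 3.1 is not reproduced here); the point is only that `variance` and `stdev` divide by n - 1, matching the definitions above.

```python
import statistics

# Hypothetical wave velocities (m/s); not the data of Table 3.1.
velocities = [28, 35, 41, 33, 38, 44, 36, 40]

mean = statistics.mean(velocities)
var = statistics.variance(velocities)   # divides by n - 1
std = statistics.stdev(velocities)      # square root of the variance above

# The same variance computed directly from the definition.
n = len(velocities)
manual_var = sum((x - mean) ** 2 for x in velocities) / (n - 1)

print(f"mean = {mean:.3f}, variance = {var:.3f}, std = {std:.3f}")
```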

3.3. INTERVAL ESTIMATION

All sample characteristics are random variables. This means that for another sample of the same size, the values of the sample characteristics will be different. Thus, sample characteristics are only estimates of the corresponding characteristics of the population.

The shortcomings of sample estimates are compensated by interval estimation, which gives a numerical interval inside which the true value of the estimated parameter lies with a given probability $P_d$.

Let $U_r$ be some parameter of the general population (general mean, general variance, etc.).

An interval estimate of the parameter $U_r$ is an interval $(U_1, U_2)$ satisfying the condition:

$$P(U_1 < U_r < U_2) = P_d. \qquad (3.5)$$

The probability $P_d$ is called the confidence probability.

Confidence probability $P_d$ - the probability that the true value of the estimated quantity lies inside the specified interval.

The interval $(U_1, U_2)$ is then called the confidence interval for the parameter being estimated.

Often, instead of the confidence probability, the related quantity $\alpha = 1 - P_d$ is used, which is called the significance level.

Significance level - the probability that the true value of the estimated parameter lies outside the confidence interval.

Sometimes $\alpha$ and $P_d$ are expressed as percentages, for example, 5% instead of 0.05 and 95% instead of 0.95.

In interval estimation, first choose the appropriate confidence probability (usually 0.95 or 0.99), and then find the corresponding range of values for the parameter being estimated.

Let us note some general properties of interval estimates.

1. The lower the significance level (the higher $P_d$), the wider the interval estimate. For example, if at a significance level of 0.05 the interval estimate of the general mean is $34.7 < M < 39.4$, then for a level of 0.01 it will be much wider: $33.85 < M < 40.25$.

2. The larger the sample size $n$, the narrower the interval estimate at the chosen significance level. Suppose, for example, that the 5% estimate of the general mean ($\alpha = 0.05$) obtained from a sample of 20 elements is $34.7 < M < 39.4$.

By increasing the sample size to 80, we get a more accurate estimate at the same significance level: $35.5 < M < 38.6$.

In the general case, constructing reliable confidence estimates requires knowledge of the law according to which the estimated random characteristic is distributed in the population. Let us look at how an interval estimate of the general mean is constructed for a characteristic that is normally distributed in the population.

3.4. INTERVAL ESTIMATION OF THE GENERAL AVERAGE FOR THE NORMAL DISTRIBUTION LAW

The construction of an interval estimate of the general mean $M$ for a population with a normal distribution law is based on the following property. For a sample of size $n$, the ratio

$$\frac{\bar{X} - M}{s / \sqrt{n}}$$

obeys the Student distribution with $\nu = n - 1$ degrees of freedom.

Here $\bar{X}$ is the sample mean and $s$ is the sample standard deviation.

Using Student distribution tables or their computer equivalent, one can find a boundary value $t$ such that, with a given confidence probability, the following inequality holds:

$$\left| \frac{\bar{X} - M}{s / \sqrt{n}} \right| < t$$

This inequality corresponds to the inequality for $M$:

$$\bar{X} - \varepsilon < M < \bar{X} + \varepsilon$$

where $\varepsilon$ is the half-width of the confidence interval.

Thus, a confidence interval for $M$ is constructed in the following sequence.

1. Choose a confidence probability $P_d$ (usually 0.95 or 0.99) and, using the Student distribution table, find the parameter $t$ for it.

2. Calculate the half-width of the confidence interval $\varepsilon$:

$$\varepsilon = t \cdot \frac{s}{\sqrt{n}} \qquad (3.6)$$

3. Obtain the interval estimate of the general mean with the chosen confidence probability:

$$\bar{X} - \varepsilon < M < \bar{X} + \varepsilon$$

Briefly, this is written as:

$$M = \bar{X} \pm \varepsilon$$

Computer procedures have been developed to find interval estimates.

Let us explain how to use the Student distribution table. The table has two “inputs”: the left column, which gives the number of degrees of freedom $\nu = n - 1$, and the top row, which gives the significance level $\alpha$. The Student coefficient $t$ is found at the intersection of the corresponding row and column.

Let's apply this method to our sample. A fragment of the Student distribution table is presented below.

Table 3.3. Fragment of the Student distribution table

A simple statistical series for a sample of 20 people ($n = 20$, $\nu = 19$) is presented in Table 3.1. For this series, calculations using formulas (3.1-3.3) give: $\bar{X} = 37.05$; $s = 5.02$.

Let us choose $\alpha = 0.05$ ($P_d = 0.95$). At the intersection of row “19” and column “0.05” we find $t = 2.09$.

Let us calculate the accuracy of the estimate using formula (3.6): $\varepsilon = 2.09 \cdot 5.02 / \sqrt{20} \approx 2.34$.

Let us construct the interval estimate: with a probability of 95%, the unknown general mean satisfies the inequality:

$37.05 - 2.34 < M < 37.05 + 2.34$, or $M = 37.05 \pm 2.34$ (m/s), $P_d = 0.95$.
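The three steps can be reproduced with a few lines of Python. In this sketch the Student coefficient is passed in by hand (2.09, the table value quoted above) rather than computed, and the function name `confidence_interval` is hypothetical.

```python
import math

def confidence_interval(mean, s, n, t):
    """Return (lower, upper, eps) for the general mean.

    eps = t * s / sqrt(n) is the half-width of the confidence interval;
    t is the Student coefficient taken from a table.
    """
    eps = t * s / math.sqrt(n)
    return mean - eps, mean + eps, eps

# Values from the worked example: mean 37.05, s = 5.02, n = 20, t = 2.09.
low, high, eps = confidence_interval(37.05, 5.02, 20, 2.09)
print(f"eps = {eps:.3f}")
print(f"{low:.2f} < M < {high:.2f}")
```

Computed without intermediate rounding, the half-width comes out at roughly 2.35, which agrees with the 2.34 quoted above up to rounding in the last digit.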

3.5. METHODS FOR TESTING STATISTICAL HYPOTHESES

Statistical hypotheses

Before formulating what a statistical hypothesis is, consider the following example.

To compare two methods of treating a certain disease, two groups of 20 patients each were selected and treated using these methods. For each patient, the number of procedures after which a positive effect was achieved was recorded. Based on these data, the sample means ($\bar{X}$), sample variances ($s^2$) and sample standard deviations ($s$) were found for each group.

The results are presented in Table 3.4.

Table 3.4

The number of procedures required to obtain a positive effect is a random variable, and all the information about it is currently contained in the given sample.

Table 3.4 shows that the sample mean in the first group is smaller than in the second. Does this mean that the same relationship holds for the general means: $M_1 < M_2$? Are the statistical data sufficient for such a conclusion? Statistical hypothesis testing answers these questions.

Statistical hypothesis - an assumption about the properties of populations.

We will consider hypotheses about the properties of two general populations.

If the populations have a known, identical distribution of the quantity being estimated, and the assumptions concern the values of some parameter of this distribution, then the hypotheses are called parametric. For example, the samples are drawn from populations with a normal distribution law and equal variances, and we need to find out whether the general means of these populations are the same.

If nothing is known about the distribution laws of the general populations, then hypotheses about their properties are called nonparametric. An example is the question of whether the distribution laws of the general populations from which the samples are drawn are the same.

Null and alternative hypotheses.

The task of testing hypotheses. Significance level

Let's get acquainted with the terminology used when testing hypotheses.

$H_0$ - the null hypothesis (the skeptic's hypothesis) is the hypothesis that there are no differences between the compared samples. The skeptic believes that the differences between the sample estimates obtained from the research results are random;

$H_1$ - the alternative hypothesis (the optimist's hypothesis) is the hypothesis that there are differences between the compared samples. The optimist believes that the differences between the sample estimates are caused by objective reasons and correspond to differences between the general populations.

Testing statistical hypotheses is feasible only when it is possible to construct some quantity (a criterion) whose distribution law is known if $H_0$ is true. Then for this quantity we can specify a confidence interval into which its value falls with a given probability $P_d$. This interval is called the critical region. If the criterion value falls into the critical region, then the hypothesis $H_0$ is accepted. Otherwise, the hypothesis $H_1$ is accepted.

In medical research, $P_d = 0.95$ or $P_d = 0.99$ is used. These values correspond to the significance levels $\alpha = 0.05$ or $\alpha = 0.01$.

When testing statistical hypotheses, the significance level ($\alpha$) is the probability of rejecting the null hypothesis when it is true.

Note that, at its core, the hypothesis testing procedure is aimed at detecting differences, not at confirming their absence. When the criterion value goes beyond the critical region, we can say to the “skeptic” with a clear conscience: well, what more do you want?! If there were no differences, then with a probability of 95% (or 99%) the calculated value would be within the specified limits. But it is not!

If, on the other hand, the value of the criterion falls into the critical region, this does not give grounds to conclude that the hypothesis $H_0$ is true. Most likely it points to one of two possible reasons.

1. The sample sizes are not large enough to detect the differences. It is likely that continued experimentation will bring success.

2. There are differences, but they are so small that they have no practical significance. In this case, continuing the experiments does not make sense.

Let's move on to consider some statistical hypotheses used in medical research.

3.6. TESTING HYPOTHESES ABOUT EQUALITY OF VARIANCES, FISCHER'S F-CRITERION

In some clinical studies, the positive effect is evidenced not so much magnitude of the parameter being studied, how much of it stabilization, reducing its fluctuations. In this case, the question arises about comparing two general variances based on the results of a sample survey. This problem can be solved using Fisher's test.

Formulation of the problem

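Two samples $(X_1)$ and $(X_2)$ were obtained, extracted from general populations with a normal distribution law. The sample sizes are $n_1$ and $n_2$, and the sample variances are $s_1^2$ and $s_2^2$ respectively. It is necessary to compare the general variances.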

Testable hypotheses:

$H_0$: the general variances are the same;

$H_1$: the general variances are different.

It can be shown that if the samples are drawn from populations with a normal distribution law, then, when hypothesis $H_0$ is true, the ratio of the sample variances follows the Fisher distribution. Therefore, the value $F$ is taken as the criterion for testing $H_0$. It is calculated by the formula:

$$F = \frac{s_1^2}{s_2^2},$$

where $s_1^2$ and $s_2^2$ are the sample variances.

This ratio obeys the Fisher distribution with $\nu_1 = n_1 - 1$ degrees of freedom in the numerator and $\nu_2 = n_2 - 1$ degrees of freedom in the denominator. The boundaries of the critical region are found using Fisher distribution tables or using the computer function BRASPOBR.

For the example presented in Table 3.4, we get: $\nu_1 = \nu_2 = 20 - 1 = 19$; $F = 2.16/4.05 = 0.53$. At α = 0.05, the left and right boundaries of the critical region are 0.40 and 2.53 respectively.

The criterion value falls into the critical region, so hypothesis $H_0$ is accepted: the general variances are the same.
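The check above can be sketched in a few lines of Python. This is only an illustration: the function names `f_ratio` and `within_bounds` are hypothetical, and the boundaries 0.40 and 2.53 are taken from the text (that is, from the tables) rather than computed.

```python
def f_ratio(var_1: float, var_2: float) -> float:
    """Ratio of two sample variances, used as the F statistic."""
    if var_1 <= 0 or var_2 <= 0:
        raise ValueError("sample variances must be positive")
    return var_1 / var_2


def within_bounds(value: float, left: float, right: float) -> bool:
    """True when the statistic lies between the two tabulated boundaries."""
    return left <= value <= right


# Figures quoted in the text for the example from Table 3.4.
F = f_ratio(2.16, 4.05)
LEFT, RIGHT = 0.40, 2.53  # tabulated boundaries for alpha = 0.05, 19 and 19 d.f.

print(round(F, 2))                    # 0.53
print(within_bounds(F, LEFT, RIGHT))  # True: H0 is accepted
```

Because both samples have the same number of degrees of freedom here, the left boundary is simply the reciprocal of the right one (1/2.53 ≈ 0.40).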

3.7. TESTING HYPOTHESES REGARDING EQUALITY OF MEANS, STUDENT'S t-TEST

The task of comparing the means of two general populations arises when it is precisely the magnitude of the characteristic being studied that has practical significance, for example, when comparing the duration of treatment with two different methods or the number of complications arising from their use. In this case, you can use Student's t-test.

Formulation of the problem

Two samples $(X_1)$ and $(X_2)$ were obtained, extracted from general populations with a normal distribution law and identical variances. The sample sizes are $n_1$ and $n_2$, the sample means are $\bar{X}_1$ and $\bar{X}_2$, and the sample variances are $s_1^2$ and $s_2^2$ respectively. It is necessary to compare the general means.

Testable hypotheses:

$H_0$: the general means are the same;

$H_1$: the general means are different.

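It can be shown that if hypothesis $H_0$ is true, the value $t$ calculated by the formula

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \qquad (3.10)$$

is distributed according to Student's law with $\nu = \nu_1 + \nu_2 = n_1 + n_2 - 2$ degrees of freedom.

Here $\nu_1 = n_1 - 1$ is the number of degrees of freedom for the first sample and $\nu_2 = n_2 - 1$ is the number of degrees of freedom for the second sample.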

The boundaries of the critical region are found using t-distribution tables or using the computer function STUDRIST. The Student distribution is symmetrical about zero, so the left and right boundaries of the critical region are identical in magnitude and opposite in sign.

For the example presented in Table 3.4, we get:

$\nu_1 = \nu_2 = 20 - 1 = 19$; $\nu = 38$; $t = -2.51$. At α = 0.05, the boundaries of the critical region are -2.02 and 2.02.

The criterion value goes beyond the left boundary of the critical region, so we accept hypothesis $H_1$: the general means are different. Moreover, the general mean of the population from which the first sample was drawn is the SMALLER one.
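As a minimal sketch of how such a statistic can be computed, the Python snippet below evaluates the standard two-sample t statistic with a pooled variance, which is the usual form when the general variances are assumed equal. The function name `pooled_t` and the two small data sets are hypothetical; they are not the data of Table 3.4.

```python
import math
from statistics import mean, variance


def pooled_t(sample_1, sample_2):
    """Two-sample Student t statistic with a pooled variance.

    Returns the pair (t, degrees_of_freedom).
    """
    n_1, n_2 = len(sample_1), len(sample_2)
    if n_1 < 2 or n_2 < 2:
        raise ValueError("each sample needs at least two observations")
    nu = n_1 + n_2 - 2
    # Weighted average of the two sample variances.
    pooled_var = ((n_1 - 1) * variance(sample_1)
                  + (n_2 - 1) * variance(sample_2)) / nu
    std_error = math.sqrt(pooled_var * (1 / n_1 + 1 / n_2))
    return (mean(sample_1) - mean(sample_2)) / std_error, nu


# Hypothetical data chosen so the result is easy to check by hand.
group_1 = [1, 2, 3]
group_2 = [2, 3, 4]
t, nu = pooled_t(group_1, group_2)
print(round(t, 3), nu)  # -1.225 4
```

A negative value of t means that the first sample mean is the smaller one. The decision itself still requires comparing t with the tabulated boundary for the given α and ν.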

Applicability of Student's t-test

Student's t-test is applicable only to samples from normal populations with identical general variances. If at least one of these conditions is violated, the applicability of the test is questionable. The requirement of normality of the general population is usually ignored, citing the central limit theorem. Indeed, the difference between the sample means in the numerator of (3.10) can be considered normally distributed for ν > 30. But the equality of variances cannot be verified, and the fact that the Fisher test did not detect differences cannot be taken as proof of it. Nevertheless, the t-test is widely used to detect differences in population means, although without sufficient justification.

A nonparametric test that is successfully used for the same purpose, and that requires neither normality nor equality of variances, is discussed below.

3.8. NONPARAMETRIC COMPARISON OF TWO SAMPLES: MANN-WHITNEY TEST

Nonparametric tests are designed to detect differences in the distribution laws of two populations. Tests that are sensitive to differences in the general means are called shift tests. Tests that are sensitive to differences in the general variances are called scale tests. The Mann-Whitney test is a shift test and is used to detect differences in the means of two populations whose samples are presented on a rank scale. The measured values are arranged on this scale in ascending order and then numbered with the integers 1, 2, ... These numbers are called ranks. Equal values are assigned equal ranks. What matters is not the value of the characteristic itself but only the ordinal position it occupies among the other values.

Table 3.5 presents the first group from Table 3.4 in expanded form (row 1) and ranked (row 2), after which the ranks of identical values are replaced by their arithmetic mean. For example, elements 4 and 4 in the first row were given ranks 2 and 3, which were then replaced by the same value, 2.5.

Table 3.5
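The ranking rule with averaged ranks for ties can be sketched as follows. The helper name `average_ranks` and the sample values are hypothetical (they are not the contents of Table 3.5), but the tie between the two values 4 reproduces the ranks 2 and 3 being replaced by 2.5.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + 1 + end + 1) / 2  # mean of ranks start+1 .. end+1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


# Hypothetical values: the two 4s would get ranks 2 and 3, so both become 2.5.
print(average_ranks([3, 4, 4, 6, 5]))  # [1.0, 2.5, 2.5, 5.0, 4.0]
```

The ranks are returned in the original order of the data, so each observation keeps its own rank.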

Formulation of the problem

Independent samples $(X_1)$ and $(X_2)$ are extracted from general populations with unknown distribution laws. The sample sizes are $n_1$ and $n_2$ respectively. The values of the sample elements are presented on a rank scale. It is necessary to check whether these general populations differ from each other.

Testable hypotheses:

$H_0$: the samples belong to the same general population;

$H_1$: the samples belong to different general populations.

To test such hypotheses, the Mann-Whitney U test is used.

First, a combined sample (X) is compiled from the two samples, and its elements are ranked. Then the sum of the ranks corresponding to the elements of the first sample is found. This sum is the criterion for testing the hypotheses.

$U$ = sum of the ranks of the first sample. (3.11)

For independent samples whose sizes are greater than 20, the value $U$ obeys the normal distribution, the mathematical expectation and standard deviation of which are equal to:

$$\mu = \frac{n_1 (n_1 + n_2 + 1)}{2}, \qquad \sigma = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}.$$

Therefore, the boundaries of the critical region are found from normal distribution tables.

For the example presented in Table 3.4, we get: $n_1 = n_2 = 20$, $U = 339$, μ = 410, σ = 37. For α = 0.05 we get: left = 338 and right = 482.

The value of the criterion goes beyond the left boundary of the critical region, therefore hypothesis $H_1$ is accepted: the general populations have different distribution laws. Moreover, the general mean of the population from which the first sample was drawn is the SMALLER one.
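The figures μ = 410, σ = 37 and the boundaries 338 and 482 can be reproduced with a short Python sketch. It assumes the standard normal approximation for a rank sum without a correction for ties, μ = n₁(n₁ + n₂ + 1)/2 and σ = √(n₁n₂(n₁ + n₂ + 1)/12), and the function name `rank_sum_bounds` is hypothetical.

```python
import math
from statistics import NormalDist


def rank_sum_bounds(n_1, n_2, alpha=0.05):
    """Mean, standard deviation and two-sided boundaries for the
    rank sum of the first sample (normal approximation, no tie correction)."""
    total = n_1 + n_2
    mu = n_1 * (total + 1) / 2
    sigma = math.sqrt(n_1 * n_2 * (total + 1) / 12)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return mu, sigma, mu - z * sigma, mu + z * sigma


mu, sigma, left, right = rank_sum_bounds(20, 20)
print(mu, round(sigma), round(left), round(right))  # 410.0 37 338 482
```

The quantile comes from the standard library class `statistics.NormalDist`, so no tables are needed for this step.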

What a grouping of statistical data is and how it relates to distribution series was discussed in this lecture, where you can also learn what discrete and variational distribution series are.

Distribution series are one of the varieties of statistical series (time series are also used in statistics); they are used to analyze data on the phenomena of social life. Constructing a variation series is quite a feasible task for anyone. However, there are rules that need to be remembered.

How to construct a discrete variational distribution series

Example 1. There is data on the number of children in 20 surveyed families. Construct a discrete variation series of the distribution of families by number of children.

0 1 2 3 1
2 1 2 1 0
4 3 2 1 1
1 0 1 0 2

Solution:

  1. Let's start with a table layout, into which we will then enter the data. Since a distribution series has two elements, the table will consist of two columns. The first column is always the variant, that is, what we are studying. We take its name from the task (the end of the sentence stating the task): "by number of children" means that our variant is the number of children.

The second column is the frequency, that is, how often our variant occurs in the phenomenon under study. We also take the name of this column from the task: "distribution of families" means that our frequency is the number of families with the corresponding number of children.

  1. Now from the source data we select those values that occur at least once. In our case these are 0, 1, 2, 3 and 4.

Let's arrange these values in the first column of our table in logical order, in this case increasing from 0 to 4.

And finally, let's count how many times each value of the variant appears.

0 1 2 3 1

2 1 2 1 0

4 3 2 1 1

1 0 1 0 2

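As a result, we obtain the completed table, that is, the required series of the distribution of families by number of children: 0 children - 4 families, 1 child - 8 families, 2 children - 5 families, 3 children - 2 families, 4 children - 1 family; 20 families in total.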
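The same counting can be done mechanically. The sketch below is one possible way to do it in Python, using `collections.Counter` on the data of Example 1; the variable names are illustrative only.

```python
from collections import Counter

# Number of children in the 20 surveyed families (data of Example 1).
children = [0, 1, 2, 3, 1,
            2, 1, 2, 1, 0,
            4, 3, 2, 1, 1,
            1, 0, 1, 0, 2]

# Variants in ascending order, each paired with its frequency.
series = sorted(Counter(children).items())
for value, frequency in series:
    print(value, frequency)
# 0 4
# 1 8
# 2 5
# 3 2
# 4 1

# The frequencies must add up to the number of observations.
print(sum(frequency for _, frequency in series))  # 20
```

Checking that the frequencies sum to the number of surveyed units is a quick way to catch a counting mistake.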

Exercise. There is data on the wage grades of 30 workers at the enterprise. Construct a discrete variation series of the distribution of workers by wage grade.

2 3 2 4 4 5 5 4 6 3
1 4 4 5 5 6 4 3 2 3
4 5 4 5 5 6 6 3 3 4

How to construct an interval variational distribution series

Let's construct an interval distribution series and see how its construction differs from a discrete series.

Example 2. There is data on the amount of profit received by 16 enterprises, million rubles: 23 48 57 12 118 9 16 22 27 48 56 87 45 98 88 63. Construct an interval variation series of the distribution of enterprises by amount of profit, identifying 3 groups with equal intervals.

The general principle of constructing the series, of course, remains the same: the same two columns, the same variants and frequencies. In this case, however, the variants are intervals, and the frequencies are counted differently.

Solution:

  1. As in the previous task, let's start by building a table layout, into which we will then enter the data. Since a distribution series has two elements, the table will consist of two columns. The first column is always the variant, that is, what we are studying. We take its name from the task (the end of the sentence stating the task): "by amount of profit" means that our variant is the amount of profit received.

The second column is the frequency, that is, how often our variant occurs in the phenomenon under study. We also take the name of this column from the task: "distribution of enterprises" means that our frequency is the number of enterprises with the corresponding profit, in this case the profit falling into the interval.

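As a result, our table layout has two columns: the amount of profit, million rubles (the intervals), and the number of enterprises (the frequency).

The size of the interval is determined by the formula

$$i = \frac{X_{max} - X_{min}}{n},$$

where $i$ is the size or length of the interval,

$X_{max}$ and $X_{min}$ are the maximum and minimum values of the attribute,

$n$ is the required number of groups according to the conditions of the problem.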

Let's calculate the size of the interval for our example. To do this, we find the largest and smallest values among the initial data:

23 48 57 12 118 9 16 22 27 48 56 87 45 98 88 63

The maximum value is 118 million rubles, and the minimum is 9 million rubles. Let's carry out the calculation using the formula:

$$i = \frac{118 - 9}{3} = 36.33\ldots$$

The calculation gives 36.(3), that is, 36.33... with the digit 3 repeating. In such situations the size of the interval must be rounded up so that the maximum value is not lost after the calculations, which is why the size of the interval is taken to be 36.4 million rubles.

  1. Now let's construct the intervals, which are the variants in this problem. The first interval starts from the minimum value; the size of the interval is added to it to obtain the upper limit of the first interval. The upper limit of the first interval then becomes the lower limit of the second interval, the size of the interval is added to it, and the second interval is obtained. And so on, as many times as required by the condition. For our example the intervals are 9 - 45.4, 45.4 - 81.8 and 81.8 - 118.2.

Let us note that if we had rounded the size of the interval down to 36.3 instead of up to 36.4, the last upper limit would have been 117.9, and the maximum value of 118 would have fallen outside the series. It is to avoid such data loss that the size of the interval must be rounded up.
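The calculation of the interval size and of the interval limits can be sketched in Python as follows. Rounding up to one decimal place is an assumption that matches the value 36.4 used in this example; the variable names are illustrative.

```python
import math

# Profit of the 16 enterprises, million rubles (data of Example 2).
profits = [23, 48, 57, 12, 118, 9, 16, 22, 27, 48, 56, 87, 45, 98, 88, 63]
groups = 3

raw_width = (max(profits) - min(profits)) / groups  # 36.333...
# Round UP to one decimal place so that the maximum value is not lost.
width = math.ceil(raw_width * 10) / 10

# Limits of the intervals: minimum, minimum + width, and so on.
bounds = [round(min(profits) + k * width, 1) for k in range(groups + 1)]
print(width, bounds)  # 36.4 [9.0, 45.4, 81.8, 118.2]
```

With ordinary rounding to 36.3 the last limit would be 117.9, which is below the maximum value of 118.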

  1. Let's count the number of enterprises falling into each interval. When processing the data, remember that the lower boundary of an interval is included in that interval, while the upper boundary is not: a value equal to the upper boundary is counted in the next interval. The only exception is the last interval, which includes its upper boundary.

When carrying out data processing, it is best to indicate the selected data with symbols or colors to simplify processing.

23 48 57 12 118 9 16 22

27 48 56 87 45 98 88 63

We mark the first interval in yellow and determine how many values fall into the interval from 9 to 45.4, bearing in mind that the value 45.4 itself would be counted in the second interval (if it were present in the data). As a result, we get 7 enterprises in the first interval. Proceeding in the same way for the other intervals, we get 5 enterprises in the second interval and 4 in the third.

  1. (additional action) Let's calculate the total amount of profit received by the enterprises in each interval and overall. To do this, add up the values marked with each color and get the total profit for each interval.

For the first interval - 23 + 12 + 9 + 16 + 22 + 27 + 45 = 154 million rubles.

For the second interval - 48 + 57 + 48 + 56 + 63 = 272 million rubles.

For the third interval - 118 + 87 + 98 + 88 = 391 million rubles.
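The counting rule and the sums above can be checked with a short Python sketch. The interval limits are those of this example (size 36.4 starting from 9), and the variable names are illustrative.

```python
# Profit of the 16 enterprises, million rubles (data of Example 2).
profits = [23, 48, 57, 12, 118, 9, 16, 22, 27, 48, 56, 87, 45, 98, 88, 63]
bounds = [9.0, 45.4, 81.8, 118.2]  # limits of the three intervals

counts = [0] * (len(bounds) - 1)
totals = [0] * (len(bounds) - 1)
for value in profits:
    for k in range(len(counts)):
        is_last = k == len(counts) - 1
        # The lower limit is included; the upper limit is included
        # only for the last interval.
        in_interval = bounds[k] <= value < bounds[k + 1]
        if in_interval or (is_last and value == bounds[k + 1]):
            counts[k] += 1
            totals[k] += value
            break

print(counts, totals, sum(totals))  # [7, 5, 4] [154, 272, 391] 817
```

The frequencies add up to 16 enterprises, and the three sums add up to the overall profit of 817 million rubles.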

Exercise. There is data on the amount of deposits of 30 bank depositors, thousand rubles:

150, 120, 300, 650, 1500, 900, 450, 500, 380, 440,
600, 80, 150, 180, 250, 350, 90, 470, 1100, 800,
500, 520, 480, 630, 650, 670, 220, 140, 680, 320

Construct an interval variation series of the distribution of depositors by deposit size, identifying 4 groups with equal intervals. For each group, calculate the total amount of deposits.