When processing large amounts of information, which is especially important in modern scientific work, the researcher faces the serious task of grouping the source data correctly. If the data are discrete, then, as we have seen, no problems arise: you only need to count the frequency of each value of the characteristic. If the characteristic under study is continuous (which is more common in practice), then choosing the optimal number of grouping intervals is by no means a trivial task.
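For the discrete case, the frequency count can be sketched in a few lines of Python. The function name `discrete_frequencies` and the sample values are illustrative assumptions, not taken from the text.

```python
from collections import Counter

def discrete_frequencies(values):
    """Return {value: (frequency, relative frequency)}, ordered by value."""
    counts = Counter(values)
    n = len(values)
    return {v: (m, m / n) for v, m in sorted(counts.items())}

# Hypothetical sample of a discrete characteristic (ten observations).
sample = [0, 1, 1, 2, 2, 2, 3, 1, 0, 2]
for value, (m, w) in discrete_frequencies(sample).items():
    print(value, m, w)
```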

To group continuous random variables, the entire range of variation of the characteristic is divided into a certain number of intervals k.

A grouped interval (continuous) variation series is a set of intervals ranked by the value of the characteristic, where each interval is listed together with its frequency m_i (the number of observations falling into the i-th interval) or its relative frequency:

Characteristic value intervals | Frequency m_i
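A minimal sketch of how such a series could be built for a given number of intervals k is shown below. It assumes equal-width, half-open intervals [a_i, b_i) with the last interval closed on the right; this boundary convention, the function name `interval_series` and the symbol w_i for relative frequency are assumptions made for illustration.

```python
def interval_series(data, k):
    """Group data into k equal-width intervals.

    Returns a list of (a_i, b_i, m_i, w_i) tuples, where m_i is the frequency
    and w_i = m_i / n is the relative frequency of the i-th interval.
    """
    lo, hi = min(data), max(data)
    if lo == hi:
        raise ValueError("all observations are equal; nothing to group")
    width = (hi - lo) / k
    counts = [0] * k
    for x in data:
        # The maximum would fall into a (k+1)-th interval, so clamp it
        # into the last one (assumed convention: last interval is closed).
        i = min(int((x - lo) / width), k - 1)
        counts[i] += 1
    n = len(data)
    return [(lo + i * width, lo + (i + 1) * width, m, m / n)
            for i, m in enumerate(counts)]

# Hypothetical sample of a continuous characteristic.
sample = [1.0, 2.0, 2.5, 3.0, 4.5, 5.0, 6.0, 7.5, 8.0, 9.0]
for a, b, m, w in interval_series(sample, 4):
    print(f"{a:.1f} to {b:.1f}: m = {m}, w = {w:.2f}")
```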

The histogram and the cumulate (ogive), which we have already discussed in detail, are excellent means of data visualization that give a first impression of the data structure. Such graphs (Fig. 1.15) are constructed for continuous data in the same way as for discrete data, except that continuous data completely fill the region of their possible values and can take any value within it.

Fig. 1.15.

This is why the bars of the histogram and of the cumulate must touch each other, with no gaps inside the range of possible values of the characteristic (i.e., the histogram and the cumulate should not have "holes" along the abscissa axis that contain no values of the variable being studied, as in Fig. 1.16). The height of a bar corresponds to the frequency (the number of observations falling within a given interval) or to the relative frequency (the proportion of observations). Intervals must not overlap and usually have the same width.

Fig. 1.16.

The histogram and the polygon are approximations of the probability density curve (differential function) f(x) of the theoretical distribution, which is studied in the probability theory course. Their construction is therefore important in the primary statistical processing of quantitative continuous data: their shape suggests a hypothetical distribution law.

The cumulate is the curve of accumulated frequencies (or relative frequencies) of an interval variation series. It is the empirical counterpart of the graph of the cumulative distribution function F(x), also discussed in the probability theory course.
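The accumulated frequencies that make up the cumulate can be obtained as running sums of the interval frequencies. The sketch below assumes the frequencies m_i are already known; the function name `cumulate` and the sample frequencies are illustrative.

```python
from itertools import accumulate

def cumulate(frequencies):
    """Return accumulated frequencies and accumulated relative frequencies."""
    n = sum(frequencies)
    accumulated = list(accumulate(frequencies))
    return accumulated, [c / n for c in accumulated]

# Hypothetical interval frequencies m_i.
cum, cum_rel = cumulate([3, 2, 2, 3])
print(cum)      # [3, 5, 7, 10]
print(cum_rel)  # [0.3, 0.5, 0.7, 1.0]
```

The last accumulated relative frequency is always 1, matching the behaviour of a distribution function at the upper end of the range.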

In general, the concepts of histogram and cumulate are associated specifically with continuous data and their interval variation series, since these graphs are empirical estimates of the probability density function and the distribution function, respectively.

The construction of an interval variation series begins with determining the number of intervals k. This task is perhaps the most difficult, important and controversial part of the problem under study.

The number of intervals should not be too small, since this makes the histogram too smooth (oversmoothed) and it loses all the features of variability of the original data. Fig. 1.17 (left graph) shows the same data as in Fig. 1.15, used to construct a histogram with a smaller number of intervals.

At the same time, the number of intervals should not be too large, otherwise we will not be able to estimate the distribution density of the data along the numerical axis: the histogram will be undersmoothed, uneven, and with empty intervals (see Fig. 1.17, right graph).

Fig. 1.17.
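The effect of the number of intervals can also be checked numerically by counting empty intervals for several values of k. The sketch below uses a simulated normal sample with a fixed seed; the data, the function names and the chosen values of k are assumptions for illustration and are not the data behind the figures.

```python
import random

def bin_counts(data, k):
    """Frequencies of k equal-width intervals covering [min(data), max(data)]."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / k
    counts = [0] * k
    for x in data:
        # Clamp the maximum into the last interval.
        counts[min(int((x - lo) / width), k - 1)] += 1
    return counts

def empty_intervals(data, k):
    """Number of intervals that contain no observations."""
    return sum(1 for m in bin_counts(data, k) if m == 0)

random.seed(0)  # fixed seed so the illustration is reproducible
data = [random.gauss(0, 1) for _ in range(100)]
for k in (3, 10, 200):
    print(k, empty_intervals(data, k))
```

With 100 observations and 200 intervals at least half of the intervals must be empty, which is the undersmoothed situation described above; with very few intervals every bar is filled but the shape of the distribution is lost.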

How can the most suitable number of intervals be determined?

Back in 1926, Herbert Sturges proposed a formula for calculating the number of intervals into which the original set of values of the characteristic under study should be divided. This formula has become extremely popular: most statistics textbooks present it, and many statistical packages use it by default. Whether this is justified in all cases is a very serious question.
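In its usual form the Sturges rule is k = 1 + log2(n), where n is the sample size. A minimal Python sketch follows; rounding the result up to the next integer is an assumed convention (some sources round to the nearest integer instead), and the function name `sturges_intervals` is illustrative.

```python
import math

def sturges_intervals(n):
    """Number of intervals by the Sturges rule, k = 1 + log2(n), rounded up."""
    if n < 1:
        raise ValueError("n must be a positive sample size")
    return math.ceil(1 + math.log2(n))

for n in (50, 100, 1000):
    print(n, sturges_intervals(n))
```

Because k grows only logarithmically, increasing the sample from 100 to 1000 observations adds just a few intervals.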

So, what is the Sturges formula based on?

Consider the binomial distribution