A person can recognize his abilities only by trying to apply them. (Seneca)

Confidence intervals

General overview

By taking a sample from the population, we obtain a point estimate of the parameter of interest and calculate the standard error to indicate the precision of the estimate.

In most cases, however, the standard error on its own is not sufficient. It is much more useful to combine this measure of precision with an interval estimate for the population parameter.

This can be done by using knowledge of the theoretical probability distribution of the sample statistic in order to calculate a confidence interval (CI) for the parameter.

In general, a confidence interval extends the estimate in both directions by a certain multiple of the standard error (of the given parameter); the two values (confidence limits) that define the interval are usually separated by a comma and enclosed in parentheses.

Confidence interval for the mean

Using the normal distribution

The sample mean is normally distributed if the sample size is large, so you can apply knowledge of the normal distribution when considering the sample mean.

Specifically, 95% of the distribution of sample means is within 1.96 standard deviations (SD) of the population mean.

When we have only one sample, the standard deviation of the sample means is called the standard error of the mean (SEM), and we calculate the 95% confidence interval for the mean as follows:

(x̄ − 1.96 × SEM, x̄ + 1.96 × SEM)

where x̄ is the sample mean.

If we repeated the experiment many times, about 95% of the intervals constructed in this way would contain the true population mean.

This interval is usually interpreted as the range of values within which the true population mean (general mean) lies with 95% confidence.

While it is not entirely rigorous (the population mean is a fixed value and therefore cannot have a probability attached to it) to interpret a confidence interval this way, it is conceptually easier to understand.
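As a minimal sketch of the calculation described above, the following Python snippet computes the mean ± 1.96 × SEM interval. The function name `normal_ci_mean` and the data are hypothetical and serve only as an illustration; the sample is assumed to be large enough for the normal approximation to be reasonable.

```python
import math
import statistics

def normal_ci_mean(sample, z=1.96):
    """Return (lower, upper) limits computed as mean ± z * SEM."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # SEM = sample standard deviation divided by the square root of n
    sem = statistics.stdev(sample) / math.sqrt(n)
    return mean - z * sem, mean + z * sem

# Hypothetical data: 70 measurements cycling through the values 47..53
data = [50 + (i % 7) - 3 for i in range(70)]

lower, upper = normal_ci_mean(data)
print(f"95% CI for the mean: ({lower:.3f}, {upper:.3f})")
```

Passing `z=2.576` instead of the default would give the corresponding 99% interval under the same assumptions.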

Using the t-distribution

You can use the normal distribution if you know the value of the variance in the population. Also, when the sample size is small, the sample mean follows a normal distribution if the underlying population data are normally distributed.

If the population data are not normally distributed and/or the population variance is unknown and has to be estimated from the sample, the sample mean follows Student's t-distribution.

We calculate the 95% confidence interval for the population mean as follows:

(x̄ − t0.05 × s/√n, x̄ + t0.05 × s/√n)

where s is the sample standard deviation and t0.05 is the percentage point (percentile) of Student's t-distribution with (n − 1) degrees of freedom that gives a two-sided probability of 0.05.

In general, this gives a wider interval than the normal distribution does, since it takes into account the additional uncertainty introduced by estimating the population standard deviation and/or by the small sample size.

When the sample size is large (on the order of 100 or more), the difference between the two distributions (Student's t and normal) is negligible. However, the t-distribution is always used when calculating confidence intervals, even if the sample size is large.

Typically the 95% CI is reported. Other confidence intervals can be calculated, such as the 99% CI for the mean.

Instead of multiplying the standard error by the tabulated value of the t-distribution that corresponds to a two-sided probability of 0.05, we multiply it by the value that corresponds to a two-sided probability of 0.01. This gives a wider confidence interval than the 95% interval, reflecting the increased confidence that the interval actually includes the population mean.

Confidence interval for proportion

The sampling distribution of a proportion is binomial. However, if the sample size n is reasonably large, the sampling distribution of the proportion is approximately normal with mean π, the population proportion.

We estimate π by the sample proportion p = r/n (where r is the number of individuals in the sample with the characteristic of interest), and its standard error is estimated by:

SE(p) = √(p(1 − p)/n)

The 95% confidence interval for the proportion is estimated by:

(p − 1.96 × SE(p), p + 1.96 × SE(p))

If the sample size is small (usually when np or n(1 − p) is less than 5), the binomial distribution must be used to calculate exact confidence intervals.

Note that if p is expressed as a percentage, then (1 − p) is replaced by (100 − p).

Interpretation of confidence intervals

When interpreting a confidence interval, we are interested in the following questions:

How wide is the confidence interval?

A wide confidence interval indicates that the estimate is imprecise; a narrow one indicates a precise estimate.

The width of the confidence interval depends on the size of the standard error, which in turn depends on the sample size and, for a numerical variable, on the variability of the data. Small studies of highly variable data therefore produce wider confidence intervals than large studies of less variable data.

Does the CI include any values of particular interest?

You can check whether a hypothesized value for the population parameter falls within the confidence interval. If so, the results are consistent with this value. If not, it is unlikely (for a 95% confidence interval, the chance is at most 5%) that the parameter has that value.

There are two types of estimates in statistics: point and interval. A point estimate is a single sample statistic that is used to estimate a population parameter. For example, the sample mean is a point estimate of the mathematical expectation of the population, and the sample variance S² is a point estimate of the population variance σ². It has been shown that the sample mean is an unbiased estimate of the mathematical expectation of the population. The sample mean is called unbiased because the average of all sample means (for the same sample size n) is equal to the mathematical expectation of the general population.

For the sample variance S² to be an unbiased estimate of the population variance σ², the denominator of the sample variance must be n − 1 rather than n. In other words, the population variance is then the average of all possible sample variances.

When estimating population parameters, keep in mind that sample statistics such as the sample mean depend on the particular sample drawn. To take this into account, an interval estimate of the mathematical expectation of the general population is obtained by analyzing the distribution of sample means. The constructed interval is characterized by a certain confidence level, which is the probability that the interval covers the true population parameter. Similar confidence intervals can be used to estimate the share of a characteristic p and the main mass of the population distribution.
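The unbiasedness claims can be checked exactly on a tiny example. The sketch below uses a hypothetical five-element population and enumerates every ordered sample of size 2 drawn with replacement; the population values and variable names are illustrative assumptions, not taken from the source.

```python
from fractions import Fraction
from itertools import product
import statistics

# Hypothetical population; Fractions keep the arithmetic exact
population = [Fraction(v) for v in (2, 4, 4, 7, 9)]
n = 2

# Every ordered sample of size n drawn with replacement
samples = list(product(population, repeat=n))

pop_mean = statistics.mean(population)
pop_var = statistics.pvariance(population)  # population variance (divides by N)

# Averages over all possible samples
avg_sample_mean = sum(statistics.mean(s) for s in samples) / len(samples)
avg_s2_n_minus_1 = sum(statistics.variance(s) for s in samples) / len(samples)
avg_s2_n = sum(statistics.pvariance(s) for s in samples) / len(samples)

print("population mean:", pop_mean, "average sample mean:", avg_sample_mean)
print("population variance:", pop_var)
print("average S^2 with n - 1:", avg_s2_n_minus_1)
print("average S^2 with n:", avg_s2_n)
```

With the n − 1 denominator the average of the sample variances matches the population variance, whereas with the n denominator it is too small by the factor (n − 1)/n.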


Constructing a confidence interval for the mathematical expectation of the population with a known standard deviation

Constructing a confidence interval for the share of a characteristic in the population

This section extends the concept of a confidence interval to categorical data. This allows us to estimate the share of a characteristic in the population, p, using the sample share pS = X/n. As indicated, if the quantities np and n(1 − p) exceed 5, the binomial distribution can be approximated by the normal distribution. Therefore, to estimate the share of a characteristic in the population p, we can construct an interval whose confidence level is (1 − α) × 100%:

pS − Z · √(pS(1 − pS)/n) ≤ p ≤ pS + Z · √(pS(1 − pS)/n)

where pS is the sample share of the characteristic, equal to X/n (the number of successes divided by the sample size), p is the share of the characteristic in the general population, Z is the critical value of the standardized normal distribution, and n is the sample size.

Example 3. Suppose that a sample of 100 invoices filled out during the last month is extracted from the information system, and that 10 of these invoices contain errors. Thus, pS = 10/100 = 0.1. The 95% confidence level corresponds to the critical value Z = 1.96, so

0.1 − 1.96 · √(0.1 · 0.9/100) ≤ p ≤ 0.1 + 1.96 · √(0.1 · 0.9/100), i.e. 0.0412 ≤ p ≤ 0.1588.

Thus, we can be 95% confident that between 4.12% and 15.88% of invoices contain errors.
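The invoice example can be reproduced with a few lines of Python. This is only a sketch: the function name `proportion_ci` is hypothetical, and the check on np and n(1 − p) follows the rule of thumb stated in the text, applied to the sample share.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation interval for a population share."""
    p_s = successes / n
    # Rule of thumb from the text: both counts should exceed 5
    if n * p_s <= 5 or n * (1 - p_s) <= 5:
        raise ValueError("sample too small for the normal approximation")
    se = math.sqrt(p_s * (1 - p_s) / n)
    return p_s - z * se, p_s + z * se

# 10 erroneous invoices out of 100, 95% confidence level (Z = 1.96)
lower, upper = proportion_ci(10, 100)
print(f"{lower:.2%} to {upper:.2%}")
```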

For a given sample size, the confidence interval for the share of a characteristic in the population appears wider than that for a continuous random variable. This is because measurements of a continuous random variable contain more information than measurements of categorical data. In other words, categorical data that take only two values carry less information for estimating the parameters of their distribution.

Calculating estimates drawn from a finite population

Estimation of mathematical expectation. The finite population correction factor (fpc) is used to reduce the standard error by a factor of √((N − n)/(N − 1)). When calculating confidence intervals for population parameter estimates, the correction factor is applied when samples are drawn without replacement. Thus, a confidence interval for the mathematical expectation with a confidence level of (1 − α) × 100% is calculated by the formula:

X̄ ± t(n−1) · (S/√n) · √((N − n)/(N − 1))    (6)

Example 4. To illustrate the use of the correction factor for a finite population, let us return to the problem of calculating the confidence interval for the average invoice amount, discussed above in Example 3. Suppose that a company issues 5,000 invoices per month, and X̄ = $110.27, S = $28.95, N = 5000, n = 100, α = 0.05, t99 = 1.9842. Using formula (6) we obtain:

110.27 ± 1.9842 · (28.95/√100) · √((5000 − 100)/(5000 − 1)) = 110.27 ± 5.69, i.e. 104.58 ≤ μ ≤ 115.96.

Estimation of the share of a characteristic. When sampling without replacement, the confidence interval for the share of the characteristic with a confidence level of (1 − α) × 100% is calculated by the formula:

pS ± Z · √(pS(1 − pS)/n) · √((N − n)/(N − 1))
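A short Python sketch of Example 4 follows. It assumes that the interval has the form X̄ ± t · (S/√n) · √((N − n)/(N − 1)), the usual finite population correction; the function name `mean_ci_finite` is hypothetical, and the critical value t = 1.9842 is taken from the example rather than computed.

```python
import math

def mean_ci_finite(mean, s, n, N, t_crit):
    """Interval for the mean when sampling without replacement from N items."""
    # Finite population correction factor (assumed standard form)
    fpc = math.sqrt((N - n) / (N - 1))
    half_width = t_crit * s / math.sqrt(n) * fpc
    return mean - half_width, mean + half_width

# Values from Example 4: mean $110.27, S = $28.95, n = 100, N = 5000
lower, upper = mean_ci_finite(110.27, 28.95, 100, 5000, 1.9842)
print(f"${lower:.2f} to ${upper:.2f}")
```

Because the sample is only 2% of the population here, the correction narrows the interval only slightly.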

Confidence intervals and ethical issues

When sampling a population and drawing statistical conclusions, ethical issues often arise. The main one is how confidence intervals and point estimates of sample statistics are reported together. Publishing point estimates without the associated confidence intervals (usually at the 95% confidence level) and the sample size from which they are derived can be misleading. It may give the reader the impression that the point estimate is all that is needed to predict the properties of the entire population. It is therefore necessary to understand that in any research the focus should be on interval estimates rather than point estimates. In addition, special attention should be paid to choosing the right sample size.

Most often, the objects of statistical manipulation are the results of sociological surveys of the population on political issues. The survey results are published on the front pages of newspapers, while the sampling error and the methodology of the statistical analysis are printed somewhere in the middle. To demonstrate the validity of the point estimates obtained, it is necessary to state the sample size on which they are based, the limits of the confidence interval, and its confidence level.


Based on materials from the book Levine et al., Statistics for Managers. Moscow: Williams, 2004, pp. 448–462.

The central limit theorem states that, for a sufficiently large sample size, the sampling distribution of the mean can be approximated by a normal distribution. This property does not depend on the shape of the population distribution.

Suppose we have a large number of objects whose characteristics are normally distributed (for example, a full warehouse of vegetables of the same type, varying in size and weight). You want to know the average characteristics of the whole batch, but you have neither the time nor the desire to measure and weigh every vegetable, and you understand that this is not necessary. But how many items should be taken for a spot check?
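The statement of the central limit theorem can be illustrated with a small simulation. The sketch below is an illustration under assumed settings (an exponential population with mean 1, samples of size 50, 2000 repetitions, a fixed random seed), none of which come from the source.

```python
import random
import statistics

random.seed(0)  # fixed seed so that the run is repeatable

SAMPLE_SIZE = 50
REPETITIONS = 2000

def sample_mean(n):
    """Mean of n draws from a strongly skewed (exponential) population."""
    return statistics.fmean([random.expovariate(1.0) for _ in range(n)])

means = [sample_mean(SAMPLE_SIZE) for _ in range(REPETITIONS)]

grand_mean = statistics.fmean(means)
spread = statistics.stdev(means)

# Theory for this population: mean 1, standard error 1 / sqrt(50)
print(f"mean of sample means: {grand_mean:.3f}")
print(f"SD of sample means:   {spread:.3f}")
```

Although the underlying population is far from normal, the sample means cluster around the population mean with a spread close to the theoretical standard error.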

Before giving several formulas useful for this situation, let us recall some notation.

First, if we did measure the entire warehouse of vegetables (this set of elements is called the general population), we would know the average weight of the entire batch as accurately as our measurements allow. Let us call this average X avg.gen., the general mean. We already know that a normal distribution is completely determined if its mean and its standard deviation s are known. For now, however, we know neither X avg.gen. nor s for the general population. We can only take a sample, measure the values we need, and calculate for this sample both the mean X avg. and the standard deviation S sel.

It is known that if our sample contains a large number of elements (usually n greater than 30) and they are chosen truly at random, then s of the general population will hardly differ from S sel.

In addition, for the case of a normal distribution we can use the following formulas:

With a probability of 95%:

X avg. − 1.96 · s/√n < X avg.gen. < X avg. + 1.96 · s/√n

With a probability of 99%:

X avg. − 2.58 · s/√n < X avg.gen. < X avg. + 2.58 · s/√n

In general form, with probability P(t):

X avg. − t · s/√n < X avg.gen. < X avg. + t · s/√n    (1)


The relationship between the t value and the probability value P (t), with which we want to know the confidence interval, can be taken from the following table:


Thus, we have determined in which range the average value for the population lies (with a given probability).
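When no table is at hand, the value of t for a given probability P(t) under the normal distribution can be obtained from the Python standard library. This is a sketch; the function name `confidence_coefficient` is hypothetical.

```python
from statistics import NormalDist

def confidence_coefficient(probability):
    """Two-sided multiplier t such that P(|Z| <= t) equals the probability."""
    # A two-sided probability P leaves (1 - P) / 2 in each tail
    return NormalDist().inv_cdf((1 + probability) / 2)

for p in (0.95, 0.99):
    print(p, round(confidence_coefficient(p), 3))
```

`NormalDist` is available in the `statistics` module from Python 3.8 onward.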

If the sample is not large enough, we cannot claim that s = S sel. for the population. In addition, in this case it is doubtful how closely the sample follows a normal distribution. We still use S sel. in place of s in the formula:

X avg. − t · S sel./√n < X avg.gen. < X avg. + t · S sel./√n




but the value of t for a fixed probability P(t) now depends on the number of elements in the sample n. The larger n is, the closer the resulting confidence interval is to the one given by formula (1). The t values in this case are taken from another table (Student's t-distribution), which we present below:

Student's t values for probabilities 0.95 and 0.99


Example 3. Thirty people were randomly selected from the company's employees. In the sample, the average monthly salary was 30 thousand rubles with a standard deviation of 5 thousand rubles. Determine a confidence interval for the average salary in the company with a probability of 0.99.

Solution: By the conditions of the problem, n = 30, X avg. = 30000, S = 5000, P = 0.99. To find the confidence interval, we use the formula based on Student's t-distribution. From the table for n = 30 and P = 0.99 we find t = 2.756; therefore,

30000 − 2.756 · 5000/√30 < X avg.gen. < 30000 + 2.756 · 5000/√30

i.e. the required confidence interval is 27484 < X avg.gen. < 32516.

So, with a probability of 0.99 we can say that the interval (27484; 32516) contains the average salary in the company.

We hope that you will use this method; you do not need to have a table with you every time. The calculation can be done automatically in Excel. In the Excel file, click the fx button in the top menu. Then choose the “Statistical” category and select TINV (СТЬЮДРАСПОБР in the Russian-language version of Excel) from the list. At the prompt, place the cursor in the “Probability” field and enter the complementary probability (in our case, instead of the probability 0.95 you need to enter 0.05). Evidently the function is designed so that its argument is the probability of being wrong. Similarly, in the “Deg_freedom” field, enter the value (n − 1) for your sample.
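The salary example can be checked with a short Python sketch. The function name `t_ci_mean` is hypothetical, and the critical value 2.756 is taken from the example rather than computed.

```python
import math

def t_ci_mean(mean, s, n, t_crit):
    """Interval mean ± t * S / sqrt(n) using a tabulated Student's t value."""
    half_width = t_crit * s / math.sqrt(n)
    return mean - half_width, mean + half_width

# Values from the example: n = 30, mean 30000, S = 5000, t = 2.756
lower, upper = t_ci_mean(30000, 5000, 30, 2.756)
print(f"{lower:.0f} < average salary < {upper:.0f}")
```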

A confidence interval gives the limiting values of a statistical quantity that, with a given confidence probability γ, will lie within this interval when a larger sample is taken. It is denoted P(θ − ε < x < θ + ε) = γ. In practice, the confidence probability γ is chosen from values quite close to one: γ = 0.9, γ = 0.95, γ = 0.99.


Example No. 1. On a collective farm, 100 sheep out of a total herd of 1000 underwent a sample control shearing. As a result, the average wool yield was found to be 4.2 kg per sheep. Determine, with a probability of 0.99, the standard sampling error of the average wool yield per sheep and the limits within which the yield lies, if the variance is 2.5. The sample is non-repeated.
Example No. 2. From a batch of imported products at a post of the Moscow Northern Customs, 20 samples of product “A” were taken by random repeated sampling. Testing established the average moisture content of product “A” in the sample, which turned out to be 6% with a standard deviation of 1%.
Determine, with a probability of 0.683, the limits of the average moisture content of the product in the entire batch of imported products.
Example No. 3. A survey of 36 students showed that the average number of textbooks they read per academic year was 6. Assuming that the number of textbooks read by a student per semester follows a normal distribution with a standard deviation of 6, find: A) with a reliability of 0.99, an interval estimate for the mathematical expectation of this random variable; B) the probability with which we can say that the average number of textbooks read by a student per semester, calculated from this sample, deviates from the mathematical expectation by no more than 2 in absolute value.
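One possible way to work through Example No. 3 is sketched below. It assumes that the standard deviation of 6 is the known population value, so the normal distribution applies; the variable names are hypothetical.

```python
import math
from statistics import NormalDist

n = 36             # number of students surveyed
sample_mean = 6.0  # average number of textbooks in the sample
sigma = 6.0        # standard deviation, assumed known

# Standard error of the mean
se = sigma / math.sqrt(n)

# A) interval estimate with reliability 0.99
z = NormalDist().inv_cdf((1 + 0.99) / 2)
lower, upper = sample_mean - z * se, sample_mean + z * se

# B) probability that the sample mean deviates from the expectation by at most 2
prob_within_2 = 2 * NormalDist().cdf(2 / se) - 1

print(f"A) {lower:.3f} < expectation < {upper:.3f}")
print(f"B) probability = {prob_within_2:.4f}")
```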

Classification of confidence intervals

By type of parameter being assessed:

By sample type:

  1. Confidence interval for an infinite sample;
  2. Confidence interval for a finite sample.
A sample is called repeated (sampling with replacement) if the selected object is returned to the population before the next one is selected. A sample is called non-repeated (sampling without replacement) if the selected object is not returned to the population. In practice, we usually deal with non-repeated samples.

Calculation of the average sampling error for random sampling

The discrepancy between the values of indicators obtained from the sample and the corresponding parameters of the general population is called the representativeness error.
Designations of the main parameters of the general and sample populations.
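Average sampling error formulas

Repeated selection, for the mean: μ = √(σ²/n)
Repeated selection, for the share: μ = √(w(1 − w)/n)
Non-repeated selection, for the mean: μ = √((σ²/n) · (1 − n/N))
Non-repeated selection, for the share: μ = √((w(1 − w)/n) · (1 − n/N))

Here σ² is the variance, w is the sample share, n is the sample size, and N is the size of the general population.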
The sampling error limit Δ, guaranteed with some probability P(t), is related to the average sampling error μ by Δ = t·μ, where t is the confidence coefficient, determined from the probability level P(t) using the table of the Laplace integral function.
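The relation Δ = t·μ can be applied to Example No. 1 (the sheep shearing problem) as follows. This is a sketch that assumes the standard textbook form of the average sampling error for the mean, μ = √(σ²/n) for repeated selection and μ = √((σ²/n)(1 − n/N)) for non-repeated selection; the function name `average_error_mean` is hypothetical, and t is computed from the normal distribution instead of being read from a table.

```python
import math
from statistics import NormalDist

def average_error_mean(variance, n, population_size=None):
    """Average sampling error of the mean.

    With population_size given, the non-repeated selection form is used
    (assumed standard formula); otherwise the repeated selection form.
    """
    value = variance / n
    if population_size is not None:
        value *= 1 - n / population_size
    return math.sqrt(value)

# Example No. 1: N = 1000 sheep, n = 100 sheared, mean 4.2 kg, variance 2.5
mu = average_error_mean(2.5, 100, population_size=1000)

# Confidence coefficient t for P(t) = 0.99
t = NormalDist().inv_cdf((1 + 0.99) / 2)

delta = t * mu  # sampling error limit
lower, upper = 4.2 - delta, 4.2 + delta
print(f"mu = {mu:.3f}, delta = {delta:.3f}, limits: {lower:.3f} to {upper:.3f} kg")
```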

Formulas for calculating the sample size using a purely random sampling method