When designing a sample survey, the question of the required sample size arises. This size is determined by three things: the permissible sampling error, the probability with which that error bound must be guaranteed, and the selection method.

Formulas for the required sample size for various sampling methods can be derived from the corresponding relationships used in calculating the maximum sampling errors. Here are the most commonly used expressions for the required sample size in practice:

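· proper random and mechanical sampling:

$n = \frac{t^2 \sigma^2}{\Delta^2}$ (repeated selection)

$n = \frac{t^2 \sigma^2 N}{\Delta^2 N + t^2 \sigma^2}$ (non-repeated selection)

· typical sampling:

$n = \frac{t^2 \overline{\sigma^2}}{\Delta^2}$ (repeated selection)

$n = \frac{t^2 \overline{\sigma^2} N}{\Delta^2 N + t^2 \overline{\sigma^2}}$ (non-repeated selection)

· serial sampling:

$r = \frac{t^2 \delta^2}{\Delta^2}$ (repeated selection)

$r = \frac{t^2 \delta^2 R}{\Delta^2 R + t^2 \delta^2}$ (non-repeated selection)

Here n is the sample size and N the population size; r is the number of series selected and R the total number of series; t is the confidence coefficient; $\Delta$ is the maximum permissible error; $\sigma^2$ is the variance; $\overline{\sigma^2}$ is the mean of the within-group variances; $\delta^2$ is the between-series variance.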

Moreover, depending on the purposes of the study, variances and sampling errors can be calculated for the average value or proportion of the characteristic.

Let us look at examples of determining the required sample size for various methods of forming the sample.

Example 5. A survey of the average monthly number of vouchers sold is planned for the 100 travel agencies in the city, using mechanical selection. What should the sample size be so that, with a probability of 0.683, the error does not exceed 3 vouchers, if a pilot survey gave a variance of 225?

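Solution. For a probability of 0.683 the confidence coefficient is t = 1. Let us calculate the required sample size using the formula for non-repeated selection:

$n = \frac{t^2 \sigma^2 N}{\Delta^2 N + t^2 \sigma^2} = \frac{1^2 \cdot 225 \cdot 100}{3^2 \cdot 100 + 1^2 \cdot 225} = \frac{22500}{1125} = 20$ agencies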
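As a check on the arithmetic of Example 5, the calculation can be sketched in Python. The function name `required_sample_size` and its parameter names are illustrative assumptions, not taken from the source; the formulas are the standard ones for proper random and mechanical sampling with repeated and non-repeated selection.

```python
import math


def required_sample_size(t, variance, max_error, population=None):
    """Required sample size for proper random or mechanical sampling.

    population=None -> repeated selection:      t^2 * s^2 / D^2
    population=N    -> non-repeated selection:  t^2 * s^2 * N / (D^2 * N + t^2 * s^2)
    """
    numerator = t ** 2 * variance
    if population is None:
        return numerator / max_error ** 2
    return numerator * population / (max_error ** 2 * population + numerator)


# Example 5: N = 100 agencies, t = 1 (P = 0.683), variance 225, error 3 vouchers
n = required_sample_size(t=1, variance=225, max_error=3, population=100)
print(math.ceil(n))  # rounded up to whole agencies -> 20
```

Without the finite-population term (repeated selection) the same inputs give 25, so non-repeated selection from a small population noticeably reduces the required size.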

Example 6. In order to determine the proportion of employees of commercial banks in the region over the age of 40, it is proposed to organize a typical sample proportional to the number of male and female employees with mechanical selection within the groups. The total number of bank employees is 12 thousand people, including 7 thousand men and 5 thousand women.

Based on previous surveys, it is known that the average of the within-group variances is 1600. Determine the required sample size with a probability of 0.997 and an error of 5%.

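Solution. For a probability of 0.997 the confidence coefficient is t = 3. Let us calculate the total size of the typical sample using the formula for non-repeated selection:

$n = \frac{t^2 \overline{\sigma^2} N}{\Delta^2 N + t^2 \overline{\sigma^2}} = \frac{3^2 \cdot 1600 \cdot 12000}{5^2 \cdot 12000 + 3^2 \cdot 1600} = \frac{172800000}{314400} \approx 550$ people

Let us now calculate the sizes of the individual typical groups in proportion to their shares in the population (7/12 ≈ 0.58 for men and 5/12 ≈ 0.42 for women):

$n_{men} = 550 \cdot 0.58 = 319$ people

$n_{women} = 550 \cdot 0.42 = 231$ people

Thus, the required sample of bank employees is 550 people, including 319 men and 231 women.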
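The proportional split between the groups can be sketched as follows. The helper name `proportional_allocation` is hypothetical. Note that exact shares (7/12 and 5/12) give 321 and 229, while the figures 319 and 231 in the text correspond to shares rounded to 0.58 and 0.42 before multiplying; both variants are shown.

```python
def proportional_allocation(total_sample, group_sizes):
    """Split total_sample across groups in proportion to their population sizes."""
    population = sum(group_sizes.values())
    return {name: total_sample * size / population
            for name, size in group_sizes.items()}


groups = {"men": 7000, "women": 5000}

# exact proportional shares
exact = {name: round(value)
         for name, value in proportional_allocation(550, groups).items()}

# shares rounded to two decimals first (0.58 and 0.42), as the text's figures imply
population = sum(groups.values())
rounded = {name: round(550 * round(size / population, 2))
           for name, size in groups.items()}

print(exact)    # {'men': 321, 'women': 229}
print(rounded)  # {'men': 319, 'women': 231}
```

Either way the group sizes sum to 550; the difference comes only from when the rounding is done.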

Example 7. A joint-stock company has 200 teams of workers. A sample survey is planned to determine the proportion of workers with occupational diseases. It is known that the between-series variance of the proportion is 225. With a probability of 0.954, calculate the required number of teams to survey if the sampling error should not exceed 5%.

Solution. For a probability of 0.954 the confidence coefficient is t = 2. We calculate the required number of teams using the formula for serial non-repeated sampling:

$r = \frac{t^2 \delta^2 R}{\Delta^2 R + t^2 \delta^2} = \frac{2^2 \cdot 225 \cdot 200}{5^2 \cdot 200 + 2^2 \cdot 225} = \frac{180000}{5900} \approx 31$ teams

3. Determination of the required sample size

Determining the optimal sample size, one that ensures the specified accuracy of the results with a given probability, is very important. As the sample size increases, the sampling error decreases. But since the units selected for a survey are often destroyed in the process, the number of units taken into the sample must be kept to the optimum. The optimal sample size can be obtained from the sampling error formulas.

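Table 8.4

Formulas for determining the optimal sample size

Selection method: formula for the mean

Proper random, repeated: $n = \frac{t^2 \sigma^2}{\Delta^2}$

Random and mechanical, non-repeated: $n = \frac{t^2 \sigma^2 N}{N \Delta^2 + t^2 \sigma^2}$

Typological, non-repeated: $n = \frac{t^2 \overline{\sigma^2} N}{N \Delta^2 + t^2 \overline{\sigma^2}}$

Serial, non-repeated, with equal series: $r = \frac{t^2 \delta^2 R}{R \Delta^2 + t^2 \delta^2}$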

The formulas show that as the estimated sampling error increases, the required sample size decreases significantly.
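How strongly the permissible error drives the sample size can be seen from a short sketch for repeated selection, where n = t²σ²/Δ². The input values (t = 2, variance 225) are illustrative assumptions, not data from the source.

```python
def repeated_sample_size(t, variance, max_error):
    """Sample size for proper random repeated selection: t^2 * s^2 / D^2."""
    return t ** 2 * variance / max_error ** 2


# illustrative inputs (assumed): t = 2, variance = 225
for max_error in (1, 2, 4):
    print(max_error, repeated_sample_size(2, 225, max_error))
# 1 900.0
# 2 225.0
# 4 56.25
```

Because the error enters the formula squared, doubling the permissible error cuts the required sample size to a quarter.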

To calculate the sample size, you need to know the variance. It can be borrowed from previous surveys of the same or similar population, or a special small-scale sample survey can be conducted.

Example 2. At an enterprise, 100 of the 1000 workers were surveyed in a random non-repeated sample, and the following data on their income for October were obtained (Table 8.5).

Table 8.5

Distribution of workers by average monthly income

Determine:

1) the average monthly income of the employees of the enterprise, guaranteeing the result with a probability of 0.997;

2) the share of the enterprise's workers with a monthly income of 19 thousand rubles or more, guaranteeing the result with a probability of 0.954;

3) the required sample size for determining the average monthly income of the enterprise's employees, so that with a probability of 0.954 the maximum sampling error does not exceed 200 rubles.

Solution:

1) Let us determine the average monthly income of employees of this enterprise, guaranteeing the result with a probability of 0.997.

n= 100 people

N= 1000 people

Solution: to determine the interval for the average monthly income of the enterprise's employees in the general population, we need the maximum sampling error $\Delta_{\bar{x}}$ and the average monthly income of workers according to the sample data, $\bar{x}$.

The maximum sampling error is determined by the formula $\Delta_{\bar{x}} = t \mu_{\bar{x}}$. It depends on the confidence coefficient t and the average sampling error $\mu_{\bar{x}}$.

Since P = 0.997, then (according to Table 8.2) t = 3.

The selection was random and non-repeated, so from Table 8.3 we choose the formula for the average sampling error of the mean:

$\mu_{\bar{x}} = \sqrt{\frac{\sigma^2}{n}\left(1 - \frac{n}{N}\right)}$, where $\sigma^2$ is the sample variance.

The average monthly income of workers according to the sample data is determined using the weighted arithmetic mean formula:
$\bar{x} = \frac{\sum x_i f_i}{\sum f_i}$.

We will carry out additional calculations in the following table:

Table columns: monthly income (thousand rubles); number of workers (people); middle of the interval (thousand rubles).

$\bar{x} = 18.5$ thousand rubles.

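Knowing t and $\mu_{\bar{x}}$, let us determine the maximum sampling error:

$\Delta_{\bar{x}} = t \mu_{\bar{x}} = 3 \cdot 0.14 = 0.42$ thousand rubles

Then the interval for the average monthly income of workers of this enterprise is:

$\bar{x} - \Delta_{\bar{x}} \le \bar{X} \le \bar{x} + \Delta_{\bar{x}}$;

$18.5 - 0.42 \le \bar{X} \le 18.5 + 0.42$, i.e. $18.08 \le \bar{X} \le 18.92$.

Answer: with a probability of 0.997, the average monthly income of employees of this enterprise lies between 18.08 thousand and 18.92 thousand rubles.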

2) Let us determine the share of enterprise workers who have a monthly income of 19 thousand rubles. and higher, guaranteeing a result with a probability of 0.954.

n= 100 people

N= 1000 people

Solution: to determine the interval for the share of workers with a monthly income of 19 thousand rubles or more, we need the maximum sampling error of the share $\Delta_w$ and the share of workers with such income in the sample, w.

The maximum sampling error is determined by the formula $\Delta_w = t \mu_w$. It depends on the confidence coefficient t and the average sampling error $\mu_w$.

Since P = 0.954, then (according to Table 8.2) t = 2.

The selection was random and non-repeated, so from Table 8.3 we choose the formula for the average sampling error of the share:

$\mu_w = \sqrt{\frac{w(1-w)}{n}\left(1 - \frac{n}{N}\right)}$, where w is the share of the enterprise's workers with an average monthly income of 19 thousand rubles or more in the sample.

The sample share is the ratio of the number of units possessing the characteristic being studied, m, to the total number of sample units, n:
$w = \frac{m}{n} = \frac{28}{100} = 0.28$.

Then the average error of the share is

$\mu_w = \sqrt{\frac{0.28 \cdot 0.72}{100}\left(1 - \frac{100}{1000}\right)} \approx 0.043$

Knowing t and $\mu_w$, we determine the maximum sampling error for the share:

$\Delta_w = t \mu_w = 2 \cdot 0.043 = 0.086$

Then the interval for the share of workers with a monthly income of 19 thousand rubles or more in the general population is:

$0.28 - 0.086 \le p \le 0.28 + 0.086$, i.e. $0.194 \le p \le 0.366$.

Answer: with a probability of 0.954, the share of the enterprise's workers with a monthly income of 19 thousand rubles or more lies between 19.4% and 36.6%.
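The interval for the share can be recomputed with a short sketch. The function name `share_interval` is hypothetical, and the sample share w = 0.28 is an assumption: it is the midpoint of the interval stated in the answer, since the underlying table data is not available here.

```python
import math


def share_interval(w, n, population, t):
    """Confidence interval for a share under random non-repeated selection."""
    # average sampling error with the finite-population factor (1 - n/N)
    mu = math.sqrt(w * (1 - w) / n * (1 - n / population))
    delta = t * mu  # maximum sampling error
    return w - delta, w + delta


# w = 0.28 is assumed (midpoint of the stated interval); n = 100, N = 1000, t = 2
low, high = share_interval(w=0.28, n=100, population=1000, t=2)
print(round(low, 3), round(high, 3))
```

Without intermediate rounding the bounds come out at roughly 19.5% and 36.5%; the figures 19.4% and 36.6% in the text follow from rounding the average error to 0.043 before doubling it.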

3) Let us determine the required sample size for estimating the average monthly income of the enterprise's employees so that, with a probability of 0.954, the maximum sampling error does not exceed 200 rubles.

N = 1000 people

Solution: the required sample size for the mean is determined by the formula (from Table 8.4):

$n = \frac{t^2 \sigma^2 N}{N \Delta_{\bar{x}}^2 + t^2 \sigma^2}$

From the conditions of the problem: for P = 0.954, t = 2 (see Table 8.2); $\Delta_{\bar{x}}$ = 0.2 thousand rubles; the variance $\sigma^2$ is taken from the data of the previous sample.

Substituting these values gives n ≈ 189 people.

Answer: for the maximum sampling error not to exceed 200 rubles with a probability of 0.954, 189 people must be surveyed.

4.5. Determining sample size

Drawing up a sampling plan involves solving the following three tasks in sequence:

Definition of the research object;

Determining the sampling structure;

Determining the sample size.

Usually, the object of marketing research is a set of observation units, which may be consumers, company employees, intermediaries, and so on. If this population is small enough that the research team has the labor, financial, and time resources to contact each of its elements, a complete study of the entire population is quite feasible. In that case, once the object of research has been defined, you can proceed to the next procedure (choosing a data collection method, research instrument, and method of communication with the audience).

However, in practice, it is often not possible or advisable to conduct a comprehensive study of the entire population. There may be the following reasons for this:

Inability to establish contact with some elements of the totality;

Unreasonably high costs for conducting a continuous study or the presence of financial restrictions that do not allow conducting a complete study;

Short time frames for the research, caused by information losing relevance over time or by other reasons, which do not allow extensive data on the entire population to be collected, systematized, and analyzed.

Therefore, large and dispersed populations are often studied using a sample, which, as is known, is understood as a part of the population intended to represent the population as a whole.

The accuracy with which a sample reflects the population as a whole depends on sample structure and size.

There are two approaches to sample design: probabilistic and deterministic.

The probabilistic approach to sample design assumes that any element of the population can be selected with a certain (non-zero) probability. There are different kinds of samples based on probability theory (typical, nested, etc.). The simplest and most common in practice is simple random sampling, in which each element of the population has an equal probability of being selected for the study.

Probability sampling is more accurate and allows the researcher to assess the degree of reliability of the data he collected, although it is more complex and more expensive than deterministic sampling.

The deterministic approach to sample design assumes that population elements are selected by methods based on considerations of convenience, on the researcher's decision, or on contingent groups.

The sampling method based on convenience consists in selecting any elements of the population that are easy to contact. Its weakness is the potentially low representativeness of the resulting sample: elements that are convenient for the researcher may not represent the population well, because their selection is non-random and unsubstantiated.

However, on the other hand, the simplicity, economy and efficiency of research carried out by this method have earned it quite widespread use in practice and, above all, during preliminary research aimed at clarifying the main problems.

The sampling method based on the researcher's decision consists in selecting those elements of the population that, in the researcher's opinion, are its characteristic representatives. This method is more advanced than the previous one, since it is oriented towards characteristic representatives of the population under study, although they are selected on the basis of the researcher's subjective ideas about it.

The sampling method based on contingent standards consists in selecting characteristic elements of the population in accordance with previously obtained characteristics of the population as a whole. These characteristics can be obtained through preliminary research and, unlike in the previous method, are not subjective. This method is therefore more advanced: it makes it possible to obtain samples no less representative than probability samples at a significantly lower survey cost.

Having chosen the sample structure (the approach to its formation and the type of probabilistic or deterministic sampling), the researcher must determine its size, i.e. the number of elements in the sample population.

Sample size determines the reliability of the information obtained from studying the sample, as well as the costs of conducting the research. Sample size depends on how homogeneous or diverse the objects being studied are.

The larger the sample size, the higher its accuracy and the higher the costs of conducting its survey. With a probabilistic approach to the sample structure, its volume can be determined using well-known statistical formulas, based on specified requirements for its accuracy.

In practice, several approaches are used to determine sample size:

1. An arbitrary approach based on a "rule of thumb". For example, it is accepted without proof that the sample must be 5% of the population to obtain accurate results. This approach is simple and easy to apply, but it does not allow the accuracy of the results to be determined. With a sufficiently large population, it can also be quite expensive.

The sample size can be set based on certain pre-agreed conditions. For example, the client commissioning marketing research knows that public opinion studies usually use a sample of 1000-1200 people, so he recommends that the researcher stick to this figure. If a market is studied annually, a sample of the same size is used each year. Unlike the first approach, this one follows a certain logic, but that logic is very vulnerable.

For example, when conducting certain studies, less accuracy may be required than when studying public opinion, and the size of the population may be many times smaller than when studying public opinion. Thus, this approach does not take current circumstances into account and can be quite expensive.

In some cases, the cost of the survey is the main argument in determining the sample size. The marketing research budget sets survey costs that cannot be exceeded. Clearly, the value of the information obtained is then not taken into account. Yet in some cases even a small sample can give fairly accurate results.

It seems reasonable to consider costs not in absolute terms but in relation to the usefulness of the information obtained from the surveys. The client and the researcher should consider different sample sizes and data collection methods, costs, and other factors.

2. Sample size based on the confidence interval for the permissible error, which, as already mentioned, is set by the accuracy that the final generalizations require: from heightened to indicative. Here we mean the so-called random errors inherent in any statistical estimate. They are calculated as the representativeness errors of probability samples.

V.I. Paniotto provides the following calculations for a representative sample with the assumption of a 5 percent error (Table 4.2).

Table 4.2

Sample calculation table

For a population greater than 100,000, the sample size is 400 units. For general populations of 5 thousand or more, the same author's calculations give the actual sampling error as a function of sample size. This matters because the permissible error depends on the purpose of the study and does not have to be close to the 5 percent level.

Table 4.3

Calculation table

Along with random errors, systematic errors are possible. They depend on the organization of the sample survey. These are various sampling biases towards one of the poles of the sample parameter.

3. Sample size based on statistical analysis. This approach determines the minimum sample size from given requirements for the reliability and validity of the results. It is also used when results are analyzed for individual subgroups formed within the sample by gender, age, level of education, etc. Requirements for the reliability and accuracy of results for individual subgroups dictate requirements for the size of the sample as a whole.

The most theoretically sound and correct approach to determining sample size is based on calculating confidence intervals. The concept of variation characterizes how dissimilar (or similar) respondents' answers to a certain question are. In a stricter sense, the variation of a characteristic in a population is the difference in its values among different units of that population at the same period or point in time. The results of survey responses are usually presented as a distribution curve (Figure 4.1). When the answers are highly similar, we speak of low variation (a narrow distribution curve); when they are not, we speak of high variation (a wide distribution curve).

The standard deviation is usually taken as the measure of variation; it characterizes the average distance of each respondent's answers to a specific question from the mean answer.

Fig. 4.1. Variation and distribution curves (panels: low variation; high variation)

Since all marketing decisions are made under conditions of uncertainty, it is advisable to take this into account when determining the sample size. Since the values of interest for the population are determined from sample statistics, it is necessary to establish the range (confidence interval) in which the estimates for the population as a whole are expected to fall, and the error in their determination.

A confidence interval is a range whose extreme points correspond to a certain percentage of certain answers to a question. The confidence interval is closely related to the standard deviation of the characteristic being studied in the population: the larger it is, the wider the confidence interval must be in order to include a certain percentage of responses.

A confidence level of either 95% or 99% is standard in marketing research. No company conducts marketing research using multiple samples, and mathematical statistics makes it possible to obtain some information about the sampling distribution from data on the variation of a single sample.

The standard error indicates how far the true value for the population as a whole may differ from the estimate expected for a typical sample. The larger the sample size, the smaller the error. High variation leads to a large error, and vice versa.

When the question asked has only two answer options, expressed as percentages (a percentage measure is used), the sample size is determined by the following formula:

$n = \frac{z^2 p q}{e^2}$

where n is the sample size; z is the normalized deviation determined by the chosen confidence level; p is the variation found for the sample (in percent); q = 100 - p; e is the permissible error (in percent).

To determine the variation for a given population, it is advisable to start with a preliminary qualitative analysis of the population, above all to establish how similar its units are in the demographic, social, and other respects of interest to the researcher. A pilot study can be conducted, or the results of similar past studies can be used. With the percentage measure, maximum variability is reached at p = 50%, which is the worst case. Moreover, this indicator does not radically affect the sample size. The opinion of the research client regarding the sample size is also taken into account.
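The percentage formula n = z²pq/e² can be sketched as follows. The function name `sample_size_percent` is hypothetical; z is obtained from the standard normal quantile via Python's `statistics.NormalDist` rather than from a printed table.

```python
from statistics import NormalDist


def sample_size_percent(confidence, p, e):
    """n = z^2 * p * q / e^2 with p, q = 100 - p and e expressed in percent."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided normal quantile
    q = 100 - p
    return z ** 2 * p * q / e ** 2


worst_case = sample_size_percent(confidence=0.95, p=50, e=5)
milder = sample_size_percent(confidence=0.95, p=30, e=5)
print(round(worst_case), round(milder))
```

At 95% confidence and a 5% error the worst case p = 50% gives a little over 384 respondents, while p = 30% gives roughly 323, which illustrates that the variation term changes the result only moderately.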

Sample size can also be determined using means rather than percentages:

$n = \frac{z^2 s^2}{e^2}$

where s is the standard deviation.

In practice, if the sample is being formed for the first time and no similar surveys have been conducted, s is unknown. In this case, it is advisable to set the error e in fractions of the standard deviation. The formula then takes the following form:

$n = \frac{z^2}{e_1^2}$

where $e_1 = \frac{e}{s}$.
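A small sketch shows that expressing the error in fractions of the standard deviation gives the same sample size as the formula for means, since dividing numerator and denominator of z²s²/e² by s² leaves z²/(e/s)². The function names and the numbers (z = 1.96, s = 10, e = 2) are illustrative assumptions.

```python
def sample_size_mean(z, s, e):
    """n = z^2 * s^2 / e^2, with the error e in the units of the characteristic."""
    return z ** 2 * s ** 2 / e ** 2


def sample_size_relative(z, e_rel):
    """n = z^2 / e_rel^2, with the error given as a fraction of the standard deviation."""
    return z ** 2 / e_rel ** 2


# illustrative values (assumed): z = 1.96, s = 10, e = 2, so e/s = 0.2
absolute = sample_size_mean(z=1.96, s=10, e=2)
relative = sample_size_relative(z=1.96, e_rel=2 / 10)
print(round(absolute, 2), round(relative, 2))
```

The second form is convenient precisely when s is unknown: the researcher only has to decide what fraction of a standard deviation is an acceptable error.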

The discussion above concerned very large populations. However, in some cases the population is not large. Typically, if the sample is less than five percent of the population, the population is considered large and the calculations follow the rules above. If the sample size exceeds 5% of the population, the population is considered small and a correction factor is introduced into the formulas above.

The sample size in this case is determined as follows:

,

Practical work No. 8. “Determination of the required sample size”


The most widespread type of non-continuous observation is sample observation, in which not all units of the population being studied are examined, but only a part of them selected in a certain way.

The entire set of objects (observations) to be studied is called the general population. The sample population, or sample, is the part of the general population selected for study in a way that ensures representativeness.

Selection from the population is carried out so that the sample gives a fairly accurate idea of the main parameters of the population as a whole. This concerns both a point estimate, which takes the corresponding value of the mean, share, etc. obtained from the sample, and an interval estimate, i.e. the limits within which the value of the desired parameter in the population lies with a certain probability. The main requirement a sample population must meet is representativeness.

In statistics, the results of continuous observation are sometimes treated as sample characteristics. This interpretation applies when the number of surveyed units is small and there is no firm confidence that the characteristics being studied cannot take values other than those observed. When conducting experiments, the number of possible values can be infinitely large; therefore, when formulating conclusions based on a limited number of them, the data obtained must be treated as sample characteristics.

When extending the results of a sample survey to the general population, it should be borne in mind that there may be a discrepancy between the characteristics of the general and sample populations due to the fact that not the entire population is being surveyed, but only a part of it.

The error of statistical observation is the deviation between the calculated and actual values of the characteristics of the objects under study.

The sampling method provides significant savings in material and financial resources when conducting statistical observation, which makes it possible to expand the survey program and increase its efficiency. The second advantage is the high reliability of the data obtained, since with a relatively small sample size it is possible to organize effective control over the quality of the collected information. Thus, the likelihood of registration errors occurring and not being detected at the stage of checking the primary information is reduced. And finally, in a number of cases when continuous observation is associated with the destruction or damage of the surveyed units (for example, when checking the quality of food products going on sale), only a sample survey is possible.

The accuracy of estimates obtained using the sampling method depends not on the proportion of units surveyed, but on their number.
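This claim can be illustrated with the average sampling error of a share under non-repeated selection, sqrt(w(1-w)/n · (1 - n/N)). The function name and the numbers below are illustrative assumptions.

```python
import math


def share_standard_error(w, n, population):
    """Average sampling error of a share for non-repeated selection."""
    return math.sqrt(w * (1 - w) / n * (1 - n / population))


# same number of units (n = 400) from populations of very different size
small_pop = share_standard_error(0.5, 400, 10_000)
large_pop = share_standard_error(0.5, 400, 1_000_000)

# same proportion surveyed (4%) of the large population: n = 40,000
same_fraction = share_standard_error(0.5, 40_000, 1_000_000)

print(round(small_pop, 4), round(large_pop, 4), round(same_fraction, 4))
```

With 400 units the error is about 0.024 to 0.025 whether the population is ten thousand or a million, whereas surveying the same 4% of the larger population (40,000 units) shrinks the error roughly tenfold.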

Main stages of sample observation:

1) defining goals, objectives and drawing up an observation program;

2) sampling;

3) data collection based on the developed program;

4) analysis of the results obtained and calculation of the main characteristics of the sample population;

5) calculation of sampling error and distribution of its results to the general population.

The following types of sampling are distinguished:

1) random (proper random);

2) mechanical (for example, every 10th, 20th, etc. unit);

3) typical (stratified), when the population is divided into groups and several objects are examined in each group;

4) serial (nested), when entire series are randomly selected.
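The four types of selection listed above can be sketched on a toy population. Everything here is a hypothetical illustration: the population of 100 numbered units, the two groups, the series of 10 units, and the 10% sampling rate are assumptions made only for the example.

```python
import random

rng = random.Random(0)             # fixed seed so the sketch is reproducible
population = list(range(1, 101))   # hypothetical units numbered 1..100

# 1) proper random selection (non-repeated): 10 units drawn at random
simple = rng.sample(population, 10)

# 2) mechanical selection: every 10th unit from a random starting point
start = rng.randrange(10)
mechanical = population[start::10]

# 3) typical (stratified) selection: 10% drawn at random within each group
groups = {"A": population[:60], "B": population[60:]}
typical = {name: rng.sample(units, len(units) // 10)
           for name, units in groups.items()}

# 4) serial (nested) selection: whole series of 10 units are drawn at random
series = [population[i:i + 10] for i in range(0, 100, 10)]
serial = [unit for chosen in rng.sample(series, 2) for unit in chosen]

print(len(simple), len(mechanical), len(typical["A"]), len(typical["B"]), len(serial))
# 10 10 6 4 20
```

In the typical sample every group is guaranteed to be represented, while in the serial sample the unit of selection is the series, so all units of a chosen series are surveyed.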

The simplest way to form a sample population is proper random selection. The theoretical foundations of the sampling method, originally developed for proper random sampling, are also used to determine sampling errors for other methods of observation.

Proper random selection can be repeated or non-repeated. In repeated selection, each unit randomly drawn from the general population is returned to it after observation and can be surveyed again. In practice this method is rare. Much more common is proper random non-repeated selection, in which surveyed units are not returned to the general population and cannot be surveyed again. With repeated selection, the probability of being included in the sample stays the same for every unit of the population. With non-repeated selection it changes, but for all units remaining in the population after several units have been drawn, the probability of being included in the sample is the same.

One of the main components of a well-designed study is defining the sample and what a representative sample is. It's like the cake example. After all, you don’t have to eat the whole dessert to understand its taste? A small part is enough.

So, the cake is the general population (that is, all respondents who are eligible for the survey). It can be defined geographically, for example, only residents of the Moscow region; by gender, women only; or by age, Russians over 65 years old.

Calculating the general population is difficult: you need data from the population census or from preliminary surveys. Therefore the general population is usually estimated, and the sample population, or sample, is calculated from the resulting number.

What is a representative sample?

A sample is a clearly defined number of respondents. Its structure should match the structure of the general population as closely as possible in terms of the main selection characteristics.

For example, if potential respondents are the entire population of Russia, where 54% are women and 46% are men, then the sample should contain exactly the same percentage. If the parameters coincide, then the sample can be called representative. This means that inaccuracies and errors in the study are reduced to a minimum.

The sample size is determined by the requirements of accuracy and economy. These requirements pull in opposite directions: the larger the sample, the more accurate the result, but the higher the cost of the study. Conversely, the smaller the sample, the lower the cost, and the less accurately and more randomly the properties of the general population are reproduced.

Therefore, to calculate the sample size, sociologists devised a formula and built special calculators.

Confidence probability and confidence error

What do the terms "confidence probability" and "confidence error" mean? The confidence probability is an indicator of measurement accuracy, and the confidence error is the possible error of the research results. For example, for a population of more than 500,000 people (say, the residents of Novokuznetsk), the sample will be 384 people at a confidence probability of 95% and a margin of error of 5% (a confidence interval of 95±5%).

What follows from this? If 100 studies are conducted with such a sample (384 people), then in 95 percent of cases the answers obtained will, by the laws of statistics, lie within ±5% of the true value. This gives a representative sample with a minimal probability of statistical error.
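The "95 cases out of 100" statement can be checked with a small simulation. This is a sketch under stated assumptions: the true share is set to 50% (the worst case), answers are modeled as independent yes/no draws, and the function name `coverage` is hypothetical.

```python
import random


def coverage(true_share, n, margin, repetitions, seed=0):
    """Fraction of simulated surveys whose sample share lands within the margin."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repetitions):
        # one survey: n independent respondents answering "yes" with prob. true_share
        successes = sum(rng.random() < true_share for _ in range(n))
        if abs(successes / n - true_share) <= margin:
            hits += 1
    return hits / repetitions


result = coverage(true_share=0.5, n=384, margin=0.05, repetitions=2000)
print(result)  # expected to be close to 0.95
```

The exact figure varies slightly with the random seed, but it stays near 95%, which is what the sample size of 384 was designed to deliver.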


Interval estimation of event probability. Formulas for calculating the sample size using a purely random sampling method.

To determine the probabilities of events of interest, we use the sampling method: we conduct n independent experiments, in each of which event A may or may not occur (the probability p of event A occurring in each experiment is constant). The relative frequency p* of occurrences of event A in a series of n trials is then taken as a point estimate of the probability p of event A occurring in a single trial. The value p* is called the sample share of occurrences of event A, and p the general share.

By a corollary of the central limit theorem (the de Moivre-Laplace theorem), for a large sample the relative frequency of an event can be considered normally distributed with parameters $M(p^*) = p$ and $\sigma(p^*) = \sqrt{\frac{p(1-p)}{n}}$.

Therefore, for n > 30, a confidence interval for the general share can be constructed using the formulas:

$p^* - \varepsilon \le p \le p^* + \varepsilon$, $\varepsilon = u_{cr} \sqrt{\frac{p^*(1-p^*)}{n}}$,
where $u_{cr}$ is found from the tables of the Laplace function for the given confidence probability γ: $2\Phi(u_{cr}) = \gamma$.

For a small sample, n ≤ 30, the maximum error ε is determined from the Student distribution table: $\varepsilon = t_{cr} \sqrt{\frac{p^*(1-p^*)}{n}}$,
where $t_{cr} = t(k; \alpha)$, with k = n - 1 degrees of freedom and probability α = 1 - γ (two-sided area).

The formulas are valid if the selection was carried out in a random, repeated manner (the general population is infinite), otherwise it is necessary to make an adjustment for the non-repetition of selection (table).

Average sampling error for the general share

Infinite population, repeated selection: $\mu = \sqrt{\frac{w(1-w)}{n}}$
Finite population of size N, non-repeated selection: $\mu = \sqrt{\frac{w(1-w)}{n}\left(1 - \frac{n}{N}\right)}$

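Formulas for calculating the sample size using a purely random sampling method

Repeated selection: for the mean $n = \frac{t^2 \sigma^2}{\varepsilon^2}$; for the share $n = \frac{t^2 w(1-w)}{\varepsilon^2}$
Non-repeated selection: for the mean $n = \frac{t^2 \sigma^2 N}{\varepsilon^2 N + t^2 \sigma^2}$; for the share $n = \frac{t^2 w(1-w) N}{\varepsilon^2 N + t^2 w(1-w)}$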

General share problems

The question "Does the confidence interval cover the given value p0?" can be answered by testing the statistical hypothesis H0: p = p0. It is assumed that the experiments follow the Bernoulli trial scheme (they are independent, and the probability p of event A occurring is constant). From a sample of size n, the relative frequency p* of event A is determined: $p^* = \frac{m}{n}$, where m is the number of occurrences of event A in a series of n trials. To test the hypothesis H0, statistics are used that have a standard normal distribution for a sufficiently large sample size (Table 1).
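Table 1 - Hypotheses about the general share

Hypothesis H0: p = p0
Assumptions: Bernoulli trial scheme
Sample estimate: $p^* = \frac{m}{n}$
Statistic K: $U = \frac{(p^* - p_0)\sqrt{n}}{\sqrt{p_0(1 - p_0)}}$
Distribution of statistic K: standard normal N(0,1)

Hypothesis H0: p1 = p2
Assumptions: Bernoulli trial scheme
Sample estimates: $p_1^* = \frac{m_1}{n_1}$, $p_2^* = \frac{m_2}{n_2}$, $\hat{p} = \frac{m_1 + m_2}{n_1 + n_2}$
Statistic K: $U = \frac{p_1^* - p_2^*}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$
Distribution of statistic K: standard normal N(0,1)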
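A test of the hypothesis H0: p = p0 can be sketched with the usual large-sample statistic U = (p* - p0)·sqrt(n) / sqrt(p0(1 - p0)), which is approximately standard normal under H0. The function name `share_z_test` and the data (45 occurrences in 100 trials, p0 = 0.5) are hypothetical and serve only as an illustration.

```python
import math
from statistics import NormalDist


def share_z_test(m, n, p0):
    """Two-sided large-sample test of H0: p = p0 for a share."""
    p_star = m / n  # relative frequency of the event
    u = (p_star - p0) * math.sqrt(n) / math.sqrt(p0 * (1 - p0))
    p_value = 2 * (1 - NormalDist().cdf(abs(u)))
    return u, p_value


# hypothetical data: the event occurred 45 times in 100 trials, p0 = 0.5
u, p_value = share_z_test(m=45, n=100, p0=0.5)
print(round(u, 2), round(p_value, 3))
```

Here U is about -1 and the two-sided p-value is about 0.32, so at the usual significance levels H0 would not be rejected for these illustrative numbers.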

Example No. 1. Using random repeated sampling, the firm's management surveyed 900 of its employees. Among the respondents there were 270 women. Construct a confidence interval that covers the true proportion of women in the firm's entire staff with a probability of 0.95.
Solution. By the condition, the sample proportion of women is $w = \frac{270}{900} = 0.3$ (the relative frequency of women among all respondents). Since the selection is repeated and the sample size is large (n = 900), the maximum sampling error is determined by the formula
$\varepsilon = u_{cr} \sqrt{\frac{w(1-w)}{n}}$
The value of $u_{cr}$ is found from the table of the Laplace function from the relation $2\Phi(u_{cr}) = \gamma$, i.e. the Laplace function (Appendix 1) takes the value 0.475 at $u_{cr} = 1.96$. Therefore, the maximum error is $\varepsilon = 1.96 \sqrt{\frac{0.3 \cdot 0.7}{900}} \approx 0.03$ and the desired confidence interval is
(w - ε, w + ε) = (0.3 - 0.03; 0.3 + 0.03) = (0.27; 0.33)
So, with a probability of 0.95, we can guarantee that the proportion of women in the firm's entire staff lies between 0.27 and 0.33.

Example No. 2. The owner of a parking lot considers a day "lucky" if the lot is more than 80% full. During the year, 40 inspections of the parking lot were carried out, of which 24 fell on "lucky" days. With a probability of 0.98, find a confidence interval for the true proportion of "lucky" days during the year.
Solution. The sample proportion of "lucky" days is w = 24/40 = 0.6.
Using the table of the Laplace function, we find the value of u_cr for the given confidence probability: Ф(u_cr) = 0.98/2 = 0.49. Since Ф(2.33) = 0.49, u_cr = 2.33.
Considering the selection to be non-repeated (i.e., no two inspections were carried out on the same day), we find the marginal error:
$\varepsilon = u_{cr}\sqrt{\dfrac{w(1-w)}{n}\left(1-\dfrac{n}{N}\right)}$
where n = 40, N = 365 (days). From here
$\varepsilon = 2.33\sqrt{\dfrac{0.6\cdot 0.4}{40}\left(1-\dfrac{40}{365}\right)} \approx 0.17$
and the confidence interval for the general share is (w - ε, w + ε) = (0.6 - 0.17; 0.6 + 0.17) = (0.43; 0.77)
With a probability of 0.98, we can expect that the proportion of "lucky" days during the year will be in the range from 0.43 to 0.77.
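The two interval calculations above can be reproduced with a short script. This is a sketch rather than a reference implementation: the function name `share_confidence_interval` is hypothetical, and the critical value is taken from Python's `statistics.NormalDist` instead of the printed Laplace table, so it differs from the table value in the third decimal place.

```python
from math import sqrt
from statistics import NormalDist
from typing import Optional, Tuple


def share_confidence_interval(m: int, n: int, gamma: float,
                              population: Optional[int] = None
                              ) -> Tuple[float, float]:
    """Return (w, eps): the sample share and the marginal error.

    m, n       -- number of "successes" and sample size
    gamma      -- confidence probability, e.g. 0.95
    population -- N for non-repeated selection; None for repeated selection
    """
    w = m / n
    # 2 * Phi(u_cr) = gamma is the same as cdf(u_cr) = (1 + gamma) / 2
    u_cr = NormalDist().inv_cdf((1 + gamma) / 2)
    variance_of_share = w * (1 - w) / n
    if population is not None:
        # correction for non-repeated selection
        variance_of_share *= 1 - n / population
    return w, u_cr * sqrt(variance_of_share)


# Example No. 1: 270 women among 900 respondents, repeated selection
w1, eps1 = share_confidence_interval(270, 900, 0.95)
print(f"{w1 - eps1:.2f} .. {w1 + eps1:.2f}")

# Example No. 2: 24 "lucky" days among 40 inspections, N = 365
w2, eps2 = share_confidence_interval(24, 40, 0.98, population=365)
print(f"{w2 - eps2:.2f} .. {w2 + eps2:.2f}")
```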

Example No. 3. Having checked n = 2500 products in a batch, we found that m = 400 of them are of the highest grade and the remaining n - m = 2100 are not. How many products need to be checked in order to determine the proportion of the highest grade with 95% confidence and an accuracy of 0.01?
Solution. We use the formula for determining the sample size for repeated selection:

$n = \dfrac{t^2 w(1-w)}{\varepsilon^2}$

Ф(t) = γ/2 = 0.95/2 = 0.475, and according to the Laplace table this value corresponds to t = 1.96.
The sample proportion is w = 400/2500 = 0.16; the sampling error is ε = 0.01. Then n = 1.96² · 0.16 · 0.84 / 0.01² ≈ 5163.1, so at least 5164 products need to be checked.

Example No. 4. A batch of products is accepted if the probability that a product complies with the standard is at least 0.97. Among 200 randomly selected products of the tested batch, 193 were found to meet the standard. Can the batch be accepted at the significance level α = 0.02?
Solution. Let us formulate the main and alternative hypotheses.
H0: p = p0 = 0.97: the unknown general share p is equal to the specified value p0 = 0.97. In terms of the problem, the probability that a part from the inspected batch complies with the standard is 0.97, i.e. the batch can be accepted.
H1: p < 0.97: the probability that a part from the inspected batch complies with the standard is less than 0.97, i.e. the batch cannot be accepted. With this alternative hypothesis the critical region is left-sided.
The observed value of the statistic K (Table 1) is calculated for the given values p0 = 0.97, n = 200, m = 193:

$p^* = \dfrac{193}{200} = 0.965, \qquad K_{obs} = \dfrac{(0.965-0.97)\sqrt{200}}{\sqrt{0.97\cdot 0.03}} \approx -0.415$

We find the critical value from the table of the Laplace function from the equality

$\Phi(K_{cr}) = \dfrac{1-2\alpha}{2}$

According to the condition, α = 0.02, hence Ф(K_cr) = 0.48 and K_cr = 2.05. The critical region is left-sided, i.e. it is the interval (-∞; -K_cr) = (-∞; -2.05). The observed value K_obs = -0.415 does not belong to the critical region; therefore, at this level of significance there is no reason to reject the main hypothesis. The batch of products can be accepted.
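The left-sided test from Example No. 4 can be sketched in a few lines. The function name `left_sided_share_test` is hypothetical, and the critical value comes from `statistics.NormalDist` rather than the Laplace table, so it is slightly more precise than the tabulated 2.05.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def left_sided_share_test(m: int, n: int, p0: float,
                          alpha: float) -> Tuple[float, float, bool]:
    """Test H0: p = p0 against H1: p < p0.

    Returns (k_obs, k_cr, reject), where the critical region is
    (-inf, -k_cr) and reject is True when k_obs falls inside it.
    """
    p_star = m / n
    k_obs = (p_star - p0) * sqrt(n) / sqrt(p0 * (1 - p0))
    # Phi(k_cr) = (1 - 2 * alpha) / 2 is equivalent to cdf(k_cr) = 1 - alpha
    k_cr = NormalDist().inv_cdf(1 - alpha)
    return k_obs, k_cr, k_obs < -k_cr


k_obs, k_cr, reject = left_sided_share_test(m=193, n=200, p0=0.97, alpha=0.02)
print(f"K_obs = {k_obs:.3f}, K_cr = {k_cr:.2f}, reject H0: {reject}")
```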

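Example No. 5. Two factories produce the same type of parts. To assess their quality, samples were taken from the products of these factories and the following results were obtained. Among 200 selected products from the first factory, 20 were defective, and among 300 products from the second factory, 15 were defective.
At a significance level of 0.025, determine whether there is a significant difference in the quality of parts manufactured by these factories.
Solution. We test the main hypothesis H0: p1 = p2 against the two-sided alternative H1: p1 ≠ p2. The sample estimates are p1* = 20/200 = 0.1, p2* = 15/300 = 0.05 and the pooled share p* = (20 + 15)/(200 + 300) = 0.07, so the observed value of the statistic K (Table 1) is

$K_{obs} = \dfrac{0.1-0.05}{\sqrt{0.07\cdot 0.93\left(\dfrac{1}{200}+\dfrac{1}{300}\right)}} \approx 2.15$

According to the condition, α = 0.025, hence Ф(K_cr) = (1 - α)/2 = 0.4875 and K_cr = 2.24. With a two-sided alternative, the acceptance region has the form (-2.24; 2.24). The observed value K_obs = 2.15 falls within this interval, i.e. at this level of significance there is no reason to reject the main hypothesis. The difference in quality between the two factories is not statistically significant.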
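A two-sample comparison of shares such as Example No. 5 can be sketched the same way. The function name `two_share_test` is hypothetical; the statistic uses the pooled share, which is the usual choice when H0 states that the two general shares are equal.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def two_share_test(m1: int, n1: int, m2: int, n2: int,
                   alpha: float) -> Tuple[float, float, bool]:
    """Test H0: p1 = p2 against the two-sided alternative H1: p1 != p2.

    Returns (k_obs, k_cr, reject); H0 is rejected when |k_obs| > k_cr.
    """
    p1, p2 = m1 / n1, m2 / n2
    pooled = (m1 + m2) / (n1 + n2)  # common share estimate under H0
    k_obs = (p1 - p2) / sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    # Phi(k_cr) = (1 - alpha) / 2 is equivalent to cdf(k_cr) = 1 - alpha / 2
    k_cr = NormalDist().inv_cdf(1 - alpha / 2)
    return k_obs, k_cr, abs(k_obs) > k_cr


k_obs, k_cr, reject = two_share_test(20, 200, 15, 300, alpha=0.025)
print(f"K_obs = {k_obs:.2f}, K_cr = {k_cr:.2f}, reject H0: {reject}")
```

The observed value is close to the critical one here, so a somewhat larger significance level (for example 0.05, where the critical value is about 1.96) would lead to the opposite conclusion.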

Every profession has its own set of favorite questions. For market researchers, the question of sample size certainly tops the list. It is usually formulated like this:

  • We would like to commission a study on visitors to Moscow shopping centers. What sample do we need?
  • Our target audience is approximately 300,000 people. How many people do we need to survey to be representative? What if the target audience is 3 million?
  • We need to assess the potential for sales of apartments in St. Petersburg to residents of northern Russian cities. What sample should I make?
Sample size really is important because it determines the cost of the future research, not to mention the quality of the results and conclusions. In this article, we'll cover how to calculate the optimal sample size for a mass survey. Our material will be useful to anyone who, in one way or another, needs to conduct marketing research on their own or order it from a specialized agency.

The main misconception about sample size

Many people believe that the larger the target group, the larger the sample size should be. Therefore, supposedly, to find out the opinion of the residents of a small town, it is enough to interview 200-300 people, but to find out the opinion on Russia as a whole, 5000 will not be enough.

Meanwhile, this stereotype has nothing to do with reality. The sample size does not depend on the size of the target group (in statistical parlance it is called the “general population”) and is determined by two completely different factors. The only exception to this rule is cases when the population is very small, for example, 1-2 thousand people, but such situations are rare in the actual practice of marketing research.

Two Factors That Determine Sample Size

The sample size of a mass survey depends on two factors:

  1. The required accuracy of the data, i.e. the "statistical error". For a sample of 100 respondents it is within plus or minus 10%, and for a sample of 1,000 respondents within plus or minus 3.1% (at a 95% confidence level, for a share close to 50%). More on this below.
  2. The number and size of subgroups into which the sample should be divided during analysis. For example, if an electoral study is being conducted, then we will be mainly interested in the core of active voters. As a rule, the share of the “core” rarely exceeds 20-25% of the total population. Therefore, the sample size should be calculated so that one quarter of its total volume allows for full statistical analysis.
Contrary to popular belief, the quality of a sample is determined not by its size, but by its representativeness. Representativeness is the correspondence between the sample and the population on key parameters. Most often, easily measured socio-demographic indicators are used as such “reference points”: gender, age, education, occupation and place of residence.

Two types of sampling error

Any selective observation (that is, when we do not interview everyone, but make a random selection from the general population) is associated with data error. This error is usually called "sampling error". It can be of two types:

  1. Systematic error is associated with errors in the sample design. Assessing its size and direction is very difficult, and most often impossible. For example, if the interviewers who put questions to respondents come from marginalized social groups, this will reduce the willingness of more affluent groups of the population to take part in the study. The result is a systematic error and a distortion of the data that are extremely difficult to assess.
  2. Random error is governed by the laws of statistics. Its size is easily calculated using the formulas of mathematical statistics and probability theory. They allow well-founded conclusions about the confidence interval of a characteristic. For example, if the statistical error is plus or minus 10% and the resulting indicator value is 25%, then the confidence interval is from 15% to 35%.

The researcher's goal is to collect data in a way that minimizes sampling bias. Then it will be possible to reduce the statistical error only to a random error, which can be calculated using formulas.

How to Calculate the Size of Random Sampling Error

Random sampling error depends not only on the sample size, but also on the variance, that is, on how heterogeneous the data are. The more homogeneous the data (i.e., the smaller the spread of the obtained values, or variance), the smaller the sampling error.

There is a formula for calculating random sampling error, but for convenience we recommend using an online sample size calculator. Such a calculator makes it easy to carry out two types of calculations:

  • calculate the amount of statistical error based on sample size and estimated variance;
  • determine the sample size required to obtain an estimate of the desired degree of precision.

The confidence parameter (one of the fields in the calculator) is usually set to 95%. This means that in 95% of cases the value of the characteristic in the population will fall within the calculated confidence interval (i.e. the sample value plus or minus the statistical error). Less commonly, a confidence level of 97% or 99% is used, meaning that the interval will cover the population value in 97% or 99% of cases, respectively. A higher confidence level makes the conclusions more reliable, but it also increases the required sample size.

The most difficult part of determining sample size is the trade-off between the required accuracy and the cost of data collection. This process is complicated by the fact that quadrupling the sample size only results in a doubling of the accuracy (corresponding to the square root of the sample increase).
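The square-root relationship can be seen directly from the formula for the random error of a share. The sketch below assumes simple random sampling, a 95% confidence level and the worst-case share of 50%; the function name `margin_of_error` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(n: int, share: float = 0.5,
                    confidence: float = 0.95) -> float:
    """Random sampling error of a share for a simple random sample of size n."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return z * sqrt(share * (1 - share) / n)


for n in (100, 400, 1000, 4000):
    print(f"n = {n:5d}: +/- {100 * margin_of_error(n):.1f}%")

# Quadrupling the sample halves the error
print(margin_of_error(100) / margin_of_error(400))
```

Under these assumptions the error is about 9.8% for 100 respondents and about 3.1% for 1,000, which matches the figures quoted earlier in the article.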

Case: determining the sample size to assess the potential of the market for sales of metropolitan real estate to buyers from the regions

In November-December 2016, we conducted a study of the demand for apartments in new buildings in Moscow and St. Petersburg from residents of different cities of Russia. The study included three data collection methods: a mass representative survey of the population aged 20 to 60 years (conducted using CATI technology), as well as a series of expert interviews with realtors and in-depth interviews with potential apartment buyers.

The study covered 33 cities characterized by increased demand for St. Petersburg and Moscow real estate. The planned sample of the study, calculated using formulas, amounted to 21,500 respondents. This size is significantly larger than the “standard” sample size used in marketing research. What is the reason for such a large sample size?

The reason is that the client needed estimates separately for each city, and not just "for the whole country". In effect, we were working not with one sample but with 33 separate samples, one for each city. The share of people interested in buying an apartment in St. Petersburg or Moscow was estimated by experts at 5% of the residents of the cities surveyed.

Depending on the importance of the city for the customer, the project manager from the Agency determined the permissible statistical error within which the final results should fit. We used a special macro in MS Excel for this, but these calculations can also be performed using a sampling calculator. As a result, the sample size varied from 500 to 1,000 respondents for each of the cities in the study, which gave a total of 21,500 people.
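The per-city calculation described in this case can be sketched as follows. Everything except the 5% expected share is an assumption made for illustration: the article does not state the confidence level (95% is assumed here) or the permissible errors chosen for individual cities, so the two error tiers below are hypothetical values picked to land near the reported range of 500 to 1,000 respondents.

```python
from math import ceil
from statistics import NormalDist


def required_n(share: float, error: float, confidence: float = 0.95) -> int:
    """Sample size needed to estimate a share with a given absolute error."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return ceil(z ** 2 * share * (1 - share) / error ** 2)


EXPECTED_SHARE = 0.05  # expert estimate quoted in the article

# Hypothetical permissible errors, in share units (0.019 = 1.9 points)
error_tiers = {"high-priority city": 0.0135, "standard city": 0.019}

for label, error in error_tiers.items():
    print(f"{label}: n = {required_n(EXPECTED_SHARE, error)}")
```

Because each city is treated as its own sample, the total is the sum of the per-city sizes, which is why 33 cities lead to a five-digit overall sample.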

  1. Determine the structure of the target group. Are you planning to analyze individual subgroups, or will analysis of the sample as a whole suffice?
  2. Determine the desired data accuracy. For example, if you need to estimate the dynamics of the market share over a year, plug in the approximate value of the share into a special calculator and “play” with different sample sizes.
  3. Find a balance between the cost of data collection (directly proportional to sample size) and the required accuracy.

In practice, deciding on sample size is a trade-off between the expected accuracy of the survey results and the feasibility of obtaining them (i.e., the cost of conducting the survey).

In practice, several approaches are used to determine sample size. Let us start with the simplest. The first is called the arbitrary approach and is based on a "rule of thumb".

For example, it is accepted without proof that, to obtain accurate results, the sample must be 5% of the population. This approach is simple, cheap and easy to implement, but it does not guarantee accurate results.

In the second approach, the sample size is set on the basis of pre-specified conditions. The customer of marketing research, for example, knows that when studying public opinion the sample is usually 1000 - 1200 people, so he recommends that the researcher stick to this figure.

In the third approach, the main consideration in determining sample size is the cost of conducting the survey, while the value and reliability of the information obtained are not taken into account.

In the fourth approach, the sample size is determined based on statistical analysis. This approach involves determining the minimum sample size, taking into account the requirements for the reliability and validity of the results obtained.

The fifth approach is considered the most theoretically based and correct approach in determining the sample size. It is based on the calculation of a confidence interval.

A confidence interval is a range whose end points bound a certain percentage of the answers to a question. This concept is closely related to the concept of "the standard deviation of the resulting characteristic in the general population". The larger the standard deviation, the wider the confidence interval must be in order to include, for example, 95% of responses.

From the properties of the normal distribution curve it follows that the end points of a confidence interval of, for example, 95% lie at a distance from the mean equal to the product of 1.96 (the normalized deviation) and the standard deviation.

The numbers 1.96 and 2.58 (for the 99% confidence interval) are designated z.

There are tables of the "values of the probability integral" that make it possible to determine z for various confidence levels. A confidence level of 95% or 99% is standard in marketing research.

For example, a study was conducted of the number of visits car owners make to service workshops per year. The confidence interval for the mean number of visits was calculated to be 5 to 7 visits at the 99% confidence level. This means that if independent sample studies could be repeated 100 times, then in about 99 of them the interval constructed in this way would cover the true average number of visits. Note that this is a statement about the average: it does not mean that 99% of individual car owners make between 5 and 7 visits.

Let's say a study was conducted on, for example, 50 independent samples. The mean scores for these samples form a normal distribution curve called the sampling distribution.

The mean of the sampling distribution is equal to the mean for the population as a whole. The concept of the "sampling distribution" is one of the basic concepts of the theory that underlies the determination of sample size.

Naturally, no company is able to form 10, 20, 50 independent samples. Typically only one sample is used.

Mathematical statistics allows you to obtain some information about the sampling distribution by having accurate data about the variation of a single sample.
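The idea that one sample already tells us about the whole sampling distribution can be illustrated with a small simulation. The population parameters below (mean of 6 visits, standard deviation of 2, normally distributed) and the function name `simulate_sample_means` are assumptions chosen only for the illustration; nothing here comes from real survey data.

```python
import random
from math import sqrt
from statistics import mean, stdev
from typing import List


def simulate_sample_means(num_samples: int, n: int, mu: float = 6.0,
                          sigma: float = 2.0, seed: int = 0) -> List[float]:
    """Draw num_samples independent samples of size n; return their means."""
    rng = random.Random(seed)
    return [mean([rng.gauss(mu, sigma) for _ in range(n)])
            for _ in range(num_samples)]


# 50 independent samples of 100 respondents each
means = simulate_sample_means(num_samples=50, n=100)
print("mean of the sample means:", mean(means))
print("spread of the sample means:", stdev(means))

# A single sample gives an estimate of that spread: s / sqrt(n)
rng = random.Random(1)
single_sample = [rng.gauss(6.0, 2.0) for _ in range(100)]
print("standard error from one sample:", stdev(single_sample) / sqrt(100))
```

In this setup the theoretical spread of the sample means is sigma / sqrt(n) = 2 / 10 = 0.2, and both printed estimates should be close to it.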

A measure of how much an estimate obtained from a typical sample can be expected to differ from the true value for the population as a whole is the standard error (root mean square error). For example, the opinion of consumers about a new product is being studied, and the customer of the study has indicated that he would be satisfied with an accuracy of plus or minus 5%.

Let's assume that 30% of the sample is in favor of the new product. This means that the range of possible estimates for the entire population is 25 - 35%. Moreover, the larger the sample size, the smaller the error. A high variation value causes a high error value and vice versa.

Let's determine the sample size based on calculating the confidence interval. The initial information necessary to implement this approach is:

· the amount of variation that a population is believed to have;

· desired accuracy;

· the level of reliability that the results of the survey must satisfy.

When there are only two possible answers to a given question, expressed as a percentage (a percentage measure is used), the sample size is determined by the following formula:

$n = \dfrac{z^2\, p\, q}{e^2}$   (1)

where n is the sample size;

z - normalized deviation, determined based on the selected confidence level (Table 7);

p - the variation found for the sample, i.e. the percentage of respondents giving one of the two answers;

q = 100 - p;

e - permissible error, in percent.

Table 7

Value of the normalized deviation z from the mean, depending on the confidence probability (α) of the result obtained

α, % | z
95 | 1.96
99 | 2.58

For example, a tire manufacturing company conducts a survey of motorists who use radial tires.

To the question "Do you use radial tires?" only two answers are possible: "Yes" or "No". If we assume that the population of motorists has low variation, this means that almost everyone surveyed uses radial tires. In this case, a fairly small sample is sufficient. In formula (1), the product pq expresses the variation inherent in the population. For example, suppose 90% of the units in the population use radial tires. Then pq = 90 × 10 = 900. If we assume that the variation is higher (p = 70%), then pq = 70 × 30 = 2100. The greatest variation is reached when one half of the population (50%) uses radial tires and the other half does not. In this case the product reaches 50 × 50 = 2500.

When conducting a survey, it is important to indicate the accuracy of the estimates obtained. For example, it was found that 44% of respondents use radial tires. The result must then be presented in the form: the percentage of motorists using radial tires is 44% plus or minus e%, where e is the permissible error. The permissible error is agreed in advance by the research customer and the contractor.

Confidence in marketing research is usually set at one of two levels: 95% or 99%. The first corresponds to z = 1.96, the second to z = 2.58. Choosing a confidence level of 99% means the following: we are 99% confident (in other words, the confidence probability is 0.99) that the percentage in the population lies within plus or minus e% of the percentage obtained in the sample. Assuming a variation of 50% and an accuracy of 10% at a 95% confidence level, we calculate the sample size:

$n = \dfrac{1.96^2\,(50 \times 50)}{10^2} \approx 96$

With a confidence level of 95% and e = ±3%, n = 1067; with a confidence level of 99% and the same error, n = 1849.
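The percentage-measure calculation, n = z² · p · q / e² with q = 100 - p, is easy to tabulate for different settings. The sketch below uses a hypothetical function name, `sample_size_percent`, and rounds to the nearest whole respondent as the text does; rounding up would be the more cautious choice.

```python
def sample_size_percent(z: float, p: float, e: float) -> float:
    """Sample size for a percentage measure.

    z -- normalized deviation (1.96 for 95%, 2.58 for 99%)
    p -- expected percentage giving one of the two answers (0..100)
    e -- permissible error, in percentage points
    """
    q = 100 - p
    return z ** 2 * p * q / e ** 2


# Variation p * q for the three cases discussed in the text
for p in (90, 70, 50):
    print(f"p = {p}%: pq = {p * (100 - p)}")

# Worst-case variation (50%) under different confidence levels and errors
for z, e in ((1.96, 10), (1.96, 3), (2.58, 3)):
    print(f"z = {z}, e = {e}%: n = {round(sample_size_percent(z, 50, e))}")
```

Tightening the error from 10% to 3% multiplies the sample by about eleven, while moving from 95% to 99% confidence at the same error multiplies it by roughly 1.7.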

When determining the variation for a specific population, it is advisable to conduct a preliminary qualitative analysis of the population under study and establish how similar its units are in demographic, social and other respects of interest to the researcher.

It is also possible to determine sample size using means rather than percentages. Assume that the confidence level is chosen to be 95% (z = 1.96), the standard deviation (S) is calculated to be 100, and the desired precision (permissible error) is ±10. Then the sample size will be

$n = \dfrac{z^2 S^2}{e^2} = \dfrac{1.96^2 \times 100^2}{10^2} \approx 384$

In practice, if the sample is being formed for the first time and similar surveys have not been conducted before, S is unknown.

In this case, it is advisable to set the error e in fractions of the standard deviation. The calculation formula is transformed and takes the following form:

$n = \dfrac{z^2}{e_S^2}$, where $e_S = e/S$ is the permissible error expressed in fractions of the standard deviation.
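Both variants of the calculation for means can be sketched briefly. The function names are hypothetical, and the second function assumes that expressing the error as a fraction of the standard deviation means substituting e = e_s · S into n = z² · S² / e², which cancels S.

```python
def sample_size_mean(z: float, s: float, e: float) -> float:
    """Sample size for a mean: n = z^2 * S^2 / e^2."""
    return z ** 2 * s ** 2 / e ** 2


def sample_size_mean_relative(z: float, e_s: float) -> float:
    """Same calculation when the error is given as a fraction of S.

    e_s -- permissible error divided by the standard deviation (e / S)
    """
    return z ** 2 / e_s ** 2


# 95% confidence, S = 100, permissible error of +/- 10
print(round(sample_size_mean(1.96, 100, 10)))

# The same requirement stated as "one tenth of a standard deviation"
print(round(sample_size_mean_relative(1.96, 0.1)))
```

The second form is convenient when S is unknown: the researcher only has to decide how precise the estimate should be relative to the natural spread of the data.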

So far we have been talking mostly about the very large populations that characterize consumer goods markets. But in some cases the populations are not so large, for example in the markets for certain types of industrial products.

Typically, if the sample is less than 5% of the population, then the population is considered large, and calculations are carried out according to the above rules.

If the sample size exceeds 5% of the population, then the latter is considered small, and a correction factor is introduced into the above formulas. The sample size in this case is determined as follows:

where n1 is the sample size for a small population,

n is the sample size (either for percentage measures or for averages), calculated using the above formulas,

N is the volume of the general population.

For example, the opinion of a population consisting of 1000 companies is being studied regarding the construction of a chemical plant within the boundaries of the city of Tomsk. Due to the lack of information about the variation, the worst case is assumed: 50:50. The researcher decided to use a confidence level of 95%. The customer of the study indicated that he would be satisfied with an accuracy of plus or minus 5%. In this case, the formula for the percentage measure is applied first:

$n = \dfrac{1.96^2\,(50 \times 50)}{5^2} \approx 384$

Since 384 exceeds 5% of the population (50 companies), the population is considered small, and the correction factor is applied to this value.

This approach to determining the sample size can also be used, with certain reservations, when calculating the size of a panel or an expert group.

The sample calculation formulas given are based on the assumption that all sampling rules have been followed, and the only error is the error due to its size.

Kotler, P. Marketing Management. Transl. from English. 9th international ed. St. Petersburg: Piter Kom, 1998. P. 174.

Kotler, P. et al. Principles of Marketing. Moscow; St. Petersburg, 1999. P. 370.