Ministry of Education and Science of the Russian Federation

Federal Agency for Education

State educational institution of higher professional education

All-Russian Correspondence Financial and Economic Institute

Branch in Tula

Test

in the discipline "Econometrics"

Tula - 2010

Problem 2 (a, b)

For light industry enterprises, information was obtained characterizing the dependence of the volume of output (Y, million rubles) on the volume of capital investments (X, million rubles); the data are given in Table 1.

X 33 17 23 17 36 25 39 20 13 12
Y 43 27 32 29 45 35 47 32 22 24

Required:

1. Find the parameters of the linear regression equation and give an economic interpretation of the regression coefficient.

2. Calculate the residuals; find the residual sum of squares; estimate the residual variance; plot the residuals.

3. Check that the prerequisites of OLS (the least squares method) are fulfilled.

4. Check the significance of the parameters of the regression equation using Student's t-test (α=0.05).

5. Calculate the coefficient of determination, check the significance of the regression equation using Fisher's F test (α=0.05), find the average relative error of approximation. Draw a conclusion about the quality of the model.

6. Predict the average value of indicator Y at the significance level α=0.1, if the predicted value of factor X is 80% of its maximum value.

7. Present graphically: actual and model Y values, forecast points.

8. Construct nonlinear regression equations:

hyperbolic;

power;

exponential.

Provide graphs of the constructed regression equations.

9. For the indicated models, find the coefficients of determination and average relative errors of approximation. Compare the models based on these characteristics and draw a conclusion.

1. The linear model has the form y = a + b·x.

We find the parameters of the linear regression equation using the formulas

b = (mean(x·y) − mean(x)·mean(y)) / (mean(x²) − mean(x)²),  a = mean(y) − b·mean(x).

The calculation of the parameter values is presented in Table 2.

t     y     x     y·x    x²     ŷ        ε = y − ŷ   ε²       (x − x̄)²   (y − ȳ)²   |ε|/y
1     43    33    1419   1089   42.236    0.764      0.584     90.25      88.36     0.018
2     27    17     459    289   27.692   -0.692      0.479     42.25      43.56     0.026
3     32    23     736    529   33.146   -1.146      1.313      0.25       2.56     0.036
4     29    17     493    289   27.692    1.308      1.711     42.25      21.16     0.045
5     45    36    1620   1296   44.963    0.037      0.001    156.25     129.96     0.001
6     35    25     875    625   34.964    0.036      0.001      2.25       1.96     0.001
7     47    39    1833   1521   47.690   -0.690      0.476    240.25     179.56     0.015
8     32    20     640    400   30.419    1.581      2.500     12.25       2.56     0.049
9     22    13     286    169   24.056   -2.056      4.227    110.25     134.56     0.093
10    24    12     288    144   23.147    0.853      0.728    132.25      92.16     0.036
Σ     336   235   8649   6351                       12.020    828.5      696.4      0.32
Avg.  33.6  23.5  864.9  635.1

Let us determine the parameters of the linear model:

b = (864.9 − 23.5·33.6) / (635.1 − 23.5²) = 75.3 / 82.85 ≈ 0.909, a = 33.6 − 0.909·23.5 ≈ 12.24.

The linear model has the form ŷ = 12.24 + 0.909·x.

The regression coefficient b = 0.909 shows that output Y increases by an average of 0.909 million rubles when the volume of capital investments X increases by 1 million rubles.
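As a cross-check, the parameter estimates can be reproduced from the raw data of Table 1; this is an illustrative Python sketch (not part of the original solution), with variable names chosen freely:

```python
# Sketch: OLS estimates for y = a + b*x recomputed from the data of Table 1.
x = [33, 17, 23, 17, 36, 25, 39, 20, 13, 12]
y = [43, 27, 32, 29, 45, 35, 47, 32, 22, 24]
n = len(x)

mx = sum(x) / n                                  # mean of x = 23.5
my = sum(y) / n                                  # mean of y = 33.6
mxy = sum(xi * yi for xi, yi in zip(x, y)) / n   # mean of x*y = 864.9
mx2 = sum(xi * xi for xi in x) / n               # mean of x^2 = 635.1

b = (mxy - mx * my) / (mx2 - mx ** 2)            # ≈ 0.909
a = my - b * mx                                  # ≈ 12.24
```

The estimates agree with the values used in Table 2.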

2. We calculate the residuals ε_t = y_t − ŷ_t and the residual sum of squares, and find the residual variance using the formula

S² = Σε² / (n − 2) = 12.02 / 8 ≈ 1.50.

The calculations are presented in Table 2.


Fig. 1. Plot of the residuals ε.
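The residual computations can be sketched as follows (a and b are the rounded OLS estimates obtained above):

```python
# Sketch: residuals, residual sum of squares and residual variance
# for the fitted line y_hat = a + b*x.
x = [33, 17, 23, 17, 36, 25, 39, 20, 13, 12]
y = [43, 27, 32, 29, 45, 35, 47, 32, 22, 24]
a, b = 12.2415, 0.90887                              # rounded OLS estimates

resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]  # e_t = y_t - y_hat_t
rss = sum(e * e for e in resid)                      # ≈ 12.02 (Table 2)
s2 = rss / (len(x) - 2)                              # residual variance ≈ 1.50
```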

3. Let us check the fulfillment of the OLS prerequisites using the Durbin-Watson test.

t    (ε_t − ε_{t−1})²    ε_t²
1        —               0.584
2        2.120           0.479
3        0.206           1.313
4        6.022           1.711
5        1.615           0.001
6        0.000           0.001
7        0.527           0.476
8        5.157           2.500
9       13.228           4.227
10       8.462           0.728
Σ       37.337          12.020

d1 = 0.88; d2 = 1.32 for α = 0.05, n = 10, k = 1.

d = Σ(ε_t − ε_{t−1})² / Σε_t² = 37.337 / 12.020 ≈ 3.11.

Since d > 2, we compare 4 − d = 0.89 with the critical values: d1 = 0.88 < 0.89 < d2 = 1.32, so the test falls into the zone of uncertainty and gives no definite conclusion about autocorrelation of the residual series.
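The statistic can be recomputed directly from the residuals of Table 2 (rounded to three decimals), as a sketch:

```python
# Sketch: Durbin-Watson statistic d from the residuals of Table 2.
resid = [0.764, -0.692, -1.146, 1.308, 0.037,
         0.036, -0.690, 1.581, -2.056, 0.853]

num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
den = sum(e * e for e in resid)
d = num / den   # compare d (or 4 - d when d > 2) with d1 = 0.88, d2 = 1.32
```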

4. Let us check the significance of the parameters of the regression equation using Student's t-test (α = 0.05).

t_cr = 2.31 for ν = 8, α = 0.05.

The calculations use the sums from Table 2:

S_b = √(S² / Σ(x − x̄)²) = √(1.50 / 828.5) ≈ 0.043, t_b = 0.909 / 0.043 ≈ 21.3;
S_a = √(S² · Σx² / (n · Σ(x − x̄)²)) ≈ 1.07, t_a = 12.24 / 1.07 ≈ 11.4.

Since t_a > t_cr and t_b > t_cr, we can conclude that the regression coefficients a and b are significant with a probability of 0.95.
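A sketch of the t-statistics, using the standard paired-regression formulas and the sums from Table 2 (the exact standard errors are not printed in the original text, so these values are a reconstruction):

```python
import math

# Sketch: t statistics for the OLS estimates a and b.
a, b = 12.2415, 0.90887        # OLS estimates
s2 = 12.02 / 8                 # residual variance S^2 = RSS / (n - 2)
sxx = 828.5                    # sum of (x - x_mean)^2
sum_x2, n = 6351, 10           # sum of x^2, sample size

se_b = math.sqrt(s2 / sxx)                 # standard error of b, ≈ 0.043
se_a = math.sqrt(s2 * sum_x2 / (n * sxx))  # standard error of a, ≈ 1.07
t_b = b / se_b                             # ≈ 21.3
t_a = a / se_a                             # ≈ 11.4
t_cr = 2.31                                # Student's t, v = 8, alpha = 0.05
```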

5. We find the correlation coefficient using the formula

r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² · Σ(y − ȳ)²).

The calculations use Table 2: r = 753 / √(828.5 · 696.4) ≈ 0.991. Thus, the relationship between the volume of capital investments X and output Y can be considered close, because r is close to 1.

We find the coefficient of determination using the formula R² = r²; here R² = 0.991² ≈ 0.983, i.e. about 98% of the variation in output is explained by the variation in capital investments.
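The quality indicators of task 5 can be sketched together; the F statistic and the mean relative approximation error are computed here as a reconstruction (the critical value F_cr ≈ 5.32 for (1, 8) degrees of freedom at α = 0.05):

```python
import math

# Sketch: correlation coefficient, coefficient of determination,
# Fisher's F statistic and the mean relative approximation error.
x = [33, 17, 23, 17, 36, 25, 39, 20, 13, 12]
y = [43, 27, 32, 29, 45, 35, 47, 32, 22, 24]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))   # 753
sxx = sum((xi - mx) ** 2 for xi in x)                      # 828.5
syy = sum((yi - my) ** 2 for yi in y)                      # 696.4

r = sxy / math.sqrt(sxx * syy)      # ≈ 0.991
r2 = r ** 2                         # ≈ 0.983
F = r2 / (1 - r2) * (n - 2)         # ≈ 455, far above F_cr ≈ 5.32

a, b = 12.2415, 0.90887             # OLS estimates
mape = 100 / n * sum(abs(yi - (a + b * xi)) / yi
                     for xi, yi in zip(x, y))   # ≈ 3.2 %
```

Since F greatly exceeds F_cr and the mean relative error is about 3%, the model can be considered of good quality.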

The study of correlation dependences is based on examining such connections between variables in which the values of one variable, taken as the dependent variable, change "on average" depending on the values taken by another variable, considered as a cause in relation to the dependent variable. The action of this cause takes place under conditions of complex interaction of various factors, so the manifestation of the pattern is obscured by the influence of chance. By calculating the average values of the effective attribute for a given group of values of the factor attribute, the influence of chance is partly eliminated. By calculating the parameters of the theoretical regression line, it is eliminated further, and an unambiguous (in form) change of "y" with a change of the factor "x" is obtained.

To study stochastic relationships, the method of comparing two parallel series, the method of analytical groupings, correlation analysis, regression analysis and some nonparametric methods are used. In general, the task of statistics in the field of studying relationships is not only to quantify their presence, direction and strength, but also to determine the form (the analytical expression) of the influence of the factor characteristics on the resultant one. To solve it, the methods of correlation and regression analysis are used.

CHAPTER 1. REGRESSION EQUATION: THEORETICAL FOUNDATIONS

1.1. Regression equation: essence and types of functions

Regression (Latin regressio, "backward movement": a transition from more complex forms of development to less complex ones) is one of the basic concepts of probability theory and mathematical statistics, expressing the dependence of the average value of a random variable on the values of another random variable or of several random variables. This concept was introduced by Francis Galton in 1886.

The theoretical regression line is the line around which the points of the correlation field are grouped and which indicates the main direction, the main tendency of the connection.

The theoretical regression line should reflect the change in the average values of the effective attribute "y" as the values of the factor attribute "x" change, provided that all other causes, random with respect to the factor "x", are completely eliminated. Consequently, this line must be drawn so that the sum of the deviations of the points of the correlation field from the corresponding points of the theoretical regression line is equal to zero, and the sum of the squares of these deviations is minimal.

y=f(x) - regression equation is a formula for the statistical relationship between variables.

A straight line on a plane (in two-dimensional space) is given by the equation y=a+b*x. In more detail, the variable y can be expressed in terms of a constant (a) and a slope (b) multiplied by the variable x. The constant is sometimes also called the intercept term, and the slope is sometimes called the regression or B-coefficient.

An important stage of regression analysis is determining the type of function with which the dependence between characteristics is characterized. The main basis should be a meaningful analysis of the nature of the dependence being studied and its mechanism. At the same time, it is not always possible to theoretically substantiate the form of connection between each of the factors and the performance indicator, since the socio-economic phenomena under study are very complex and the factors that shape their level are closely intertwined and interact with each other. Therefore, on the basis of theoretical analysis, the most general conclusions can often be drawn regarding the direction of the relationship, the possibility of its change in the population under study, the legitimacy of using a linear relationship, the possible presence of extreme values, etc. A necessary complement to such assumptions must be an analysis of specific factual data.

An approximate idea of the relationship line can be obtained from the empirical regression line. The empirical regression line is usually a broken line with more or less significant kinks. This is explained by the fact that the influence of other, unaccounted-for factors affecting the variation of the effective characteristic is not completely extinguished in the averages, because the number of observations is insufficiently large; therefore, the empirical regression line can be used to select and justify the type of theoretical curve, provided that the number of observations is large enough.

One of the elements of specific studies is the comparison of various dependence equations, based on the use of quality criteria for approximating empirical data by competing versions of models. The following types of functions are most often used to characterize the relationships of economic indicators:

1. Linear: y = a + b·x

2. Hyperbolic: y = a + b/x

3. Exponential: y = a·b^x

4. Parabolic: y = a + b·x + c·x²

5. Power: y = a·x^b

6. Logarithmic: y = a + b·ln x

7. Logistic: y = a / (1 + b·e^(−c·x))

A model with one explanatory and one explained variable is a paired regression model. If two or more explanatory (factor) variables are used, then we speak of a multiple regression model. In this case, linear, power, hyperbolic, exponential and other types of functions connecting these variables can be selected as options.
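Several of these functional forms (the hyperbolic, power and exponential models of task 8 among them) can be fitted by linearizing each one and reusing ordinary least squares. This is an illustrative sketch, not the original worked solution, using the data of Table 1:

```python
import math

# Sketch: fitting hyperbolic, power and exponential models by linearization.
x = [33, 17, 23, 17, 36, 25, 39, 20, 13, 12]
y = [43, 27, 32, 29, 45, 35, 47, 32, 22, 24]

def ols(u, v):
    """Intercept and slope of v = c0 + c1*u by least squares."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    c1 = (sum((a - mu) * (b - mv) for a, b in zip(u, v))
          / sum((a - mu) ** 2 for a in u))
    return mv - c1 * mu, c1

def r_squared(obs, fitted):
    m = sum(obs) / len(obs)
    ss_res = sum((a - b) ** 2 for a, b in zip(obs, fitted))
    ss_tot = sum((a - m) ** 2 for a in obs)
    return 1 - ss_res / ss_tot

# hyperbolic: y = a + b/x  (linear in 1/x)
a_h, b_h = ols([1 / v for v in x], y)
fit_h = [a_h + b_h / v for v in x]

# power: y = a * x**b  (ln y linear in ln x)
a_p, b_p = ols([math.log(v) for v in x], [math.log(v) for v in y])
fit_p = [math.exp(a_p) * v ** b_p for v in x]

# exponential: y = a * b**x  (ln y linear in x)
a_e, b_e = ols(x, [math.log(v) for v in y])
fit_e = [math.exp(a_e + b_e * v) for v in x]
```

The coefficients of determination of the three fits can then be compared with that of the linear model, as task 9 requires.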

To find the parameters a and b of the regression equation, the least squares method is used. When it is applied to find the function that best fits the empirical data, it is required that the sum of squared deviations of the empirical points from the theoretical regression line be minimal.

The least squares criterion can be written as follows: S = Σ(y_t − a − b·x_t)² → min.

Consequently, the use of the least squares method to determine the parameters a and b of the line that best matches the empirical data is reduced to an extremum problem.
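For the paired linear model, this extremum problem and the resulting system of normal equations can be written out explicitly (standard textbook form):

```latex
S(a,b) = \sum_{t=1}^{n} (y_t - a - b x_t)^2 \to \min_{a,b},
\qquad
\frac{\partial S}{\partial a} = 0,\quad
\frac{\partial S}{\partial b} = 0
\;\Longrightarrow\;
\begin{cases}
a\,n + b \sum x_t = \sum y_t,\\
a \sum x_t + b \sum x_t^2 = \sum x_t y_t.
\end{cases}
```

Solving this system gives the estimates of a and b used above.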

Regarding the assessments, the following conclusions can be drawn:

1. Least squares estimators are functions of the sample, making them easy to calculate.

2. Least squares estimates are point estimates of the theoretical regression coefficients.

3. The empirical regression line necessarily passes through the point (x̄, ȳ).

4. The empirical regression equation is constructed in such a way that the sum of the deviations Σ(y_t − ŷ_t) = 0.

A graphical representation of the empirical and theoretical regression lines is presented in Figure 1.


The parameter b in the equation is the regression coefficient. If there is a direct relationship, the regression coefficient is positive; in the case of an inverse relationship, it is negative. The regression coefficient shows by how much, on average, the value of the effective attribute "y" changes when the factor attribute "x" changes by one. Geometrically, the regression coefficient is the slope of the straight line depicting the correlation equation relative to the "x" axis (for the equation y = a + b·x).

The branch of multivariate statistical analysis devoted to recovering dependences is called regression analysis. The term "linear regression analysis" is used when the function under consideration depends linearly on the estimated parameters (the dependence on the independent variables can be arbitrary). The theory of estimating the unknown parameters is well developed specifically for linear regression analysis; if there is no linearity and it is impossible to pass to a linear problem, then, as a rule, one cannot expect good properties from the estimates. Approaches can be demonstrated for dependences of various types, for example when the dependence has the form of a polynomial. If calculating a correlation characterizes the strength of the relationship between two variables, then regression analysis serves to determine the type of this relationship and makes it possible to predict the value of one (dependent) variable from the value of another (independent) variable. Linear regression analysis requires the dependent variable to have an interval (or ordinal) scale. Binary logistic regression reveals the dependence of a dichotomous variable on some other variable measured on any scale; the same conditions of application hold for probit analysis. If the dependent variable is categorical but has more than two categories, multinomial logistic regression is a suitable method. Nonlinear relationships between variables on an interval scale can also be analyzed; the nonlinear regression method is designed for this purpose.

Students very often encounter a variety of equations during their studies. One of them, the regression equation, is discussed in this article. This type of equation is used specifically to describe the characteristics of the relationship between mathematical parameters. This type of equality is used in statistics and econometrics.

Definition of regression

In mathematics, regression means a quantity that describes the dependence of the average value of one set of data on the values of another quantity. The regression equation shows the average value of one characteristic as a function of another characteristic. The regression function has the form of a simple equation y = f(x), in which y acts as the dependent variable and x as the independent variable (the factor characteristic).

What are the types of relationships between variables?

In general, there are two opposite types of relationship: correlation and regression.

The first is characterized by the equality of conditional variables. In this case, it is not reliably known which variable depends on the other.

If there is no equality between the variables and the conditions say which variable is explanatory and which is dependent, then we can talk about the presence of a connection of the second type. In order to construct a linear regression equation, it will be necessary to find out what type of relationship is observed.

Types of regressions

Today, there are 7 different types of regression: hyperbolic, linear, multiple, nonlinear, pairwise, inverse, logarithmically linear.

Hyperbolic, linear and logarithmic

The linear regression equation is used in statistics to clearly explain the parameters of the equation. It looks like y = c + t·x + E. The hyperbolic equation has the form of a regular hyperbola: y = c + m/x + E. The log-linear equation expresses the relationship using a logarithmic function: ln y = ln c + t·ln x + ln E.

Multiple and nonlinear

The two more complex types of regression are multiple and nonlinear. The multiple regression equation is expressed by the function y = f(x1, x2, ..., xc) + E. In this situation, y acts as the dependent variable and the x's act as explanatory variables. The variable E is stochastic; it absorbs the influence of the other factors in the equation. The nonlinear regression equation is somewhat ambiguous: on the one hand, it is not linear in the indicators taken into account, but on the other hand, it is linear in the role of the estimated parameters.

Inverse and paired types of regressions

An inverse function is a type of function that needs to be converted to a linear form. In the most traditional application programs it has the form y = 1/(c + m·x + E). A pairwise regression equation shows the relationship between the data as a function y = f(x) + E. Just as in the other equations, y depends on x, and E is a stochastic parameter.

Concept of correlation

This is an indicator demonstrating the existence of a relationship between two phenomena or processes. The strength of the relationship is expressed by the correlation coefficient, whose value lies within the interval [-1; +1]. A negative value indicates an inverse relationship, a positive one a direct relationship. If the coefficient equals 0, there is no relationship. The closer the value is to 1 in absolute value, the stronger the relationship between the parameters; the closer to 0, the weaker it is.

Methods

Parametric methods of correlation can assess the strength of the relationship. They are based on estimates of the distribution and are used to study parameters that obey the normal distribution law.

The parameters of the linear regression equation are necessary to identify the type of dependence, the function of the regression equation and evaluate the indicators of the selected relationship formula. The correlation field is used as a connection identification method. To do this, all existing data must be depicted graphically. All known data must be plotted in a rectangular two-dimensional coordinate system. This is how a correlation field is formed. The values ​​of the describing factor are marked along the abscissa axis, while the values ​​of the dependent factor are marked along the ordinate axis. If there is a functional relationship between the parameters, they are lined up in the form of a line.

If the correlation coefficient of such data is less than 30%, we can speak of a practically complete absence of connection. If it is between 30% and 70%, this indicates a connection of medium closeness. A 100% indicator is evidence of a functional connection.

A nonlinear regression equation, just like a linear one, must be supplemented with a correlation index (R).

Correlation for Multiple Regression

The coefficient of determination is the square of the multiple correlation indicator. It describes how closely the presented set of indicators is related to the characteristic being studied. It can also describe the nature of the influence of the parameters on the result. The multiple regression equation is estimated using this indicator.

In order to calculate the multiple correlation indicator, it is necessary to calculate its index.

Least squares method

This method is a way of estimating regression factors. Its essence is to minimize the sum of squared deviations of the observed values from those given by the function.

A pairwise linear regression equation can be estimated using such a method. This type of equations is used when a paired linear relationship is detected between indicators.

Equation Parameters

Each parameter of the linear regression function has a specific meaning. The paired linear regression equation contains two parameters: c and m. The parameter m shows the average change in the final indicator of the function y when the variable x decreases (increases) by one conventional unit. If the variable x is zero, then the function equals the parameter c. If the variable x is nonzero, then the factor c carries no economic meaning. The only influence on the function is the sign in front of the factor c: a minus suggests that the result changes slowly compared to the factor, while a plus indicates an accelerated change in the result.

Each parameter of the regression equation can itself be expressed through an equation. For example, the factor c has the form c = ȳ − m·x̄.

Grouped data

There are task conditions in which all information is grouped by attribute x, but for a certain group the corresponding average values ​​of the dependent indicator are indicated. In this case, the average values ​​characterize how the indicator depending on x changes. Thus, the grouped information helps to find the regression equation. It is used as an analysis of relationships. However, this method has its drawbacks. Unfortunately, average indicators are often subject to external fluctuations. These fluctuations do not reflect the pattern of the relationship; they just mask its “noise.” Averages show patterns of relationship much worse than a linear regression equation. However, they can be used as a basis for finding an equation. By multiplying the number of an individual population by the corresponding average, one can obtain the sum y within the group. Next, you need to add up all the amounts received and find the final indicator y. It is a little more difficult to make calculations with the sum indicator xy. If the intervals are small, we can conditionally take the x indicator for all units (within the group) to be the same. You should multiply it with the sum of y to find out the sum of the products of x and y. Next, all the amounts are added together and the total amount xy is obtained.
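The recipe above can be sketched in code. The group boundaries and means here are made-up illustrative numbers, not data from the text:

```python
# Sketch with hypothetical numbers: recovering the totals needed for a
# regression from data grouped by x (group midpoint, group size, mean of y).
groups = [
    (10, 4, 21.5),
    (20, 3, 30.0),
    (30, 3, 44.0),
]

n = sum(nj for xj, nj, myj in groups)
sum_y = sum(nj * myj for xj, nj, myj in groups)           # total of y
sum_x = sum(xj * nj for xj, nj, myj in groups)
sum_xy = sum(xj * nj * myj for xj, nj, myj in groups)     # x common in group
sum_x2 = sum(xj ** 2 * nj for xj, nj, myj in groups)

b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
a = sum_y / n - b * sum_x / n
```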

Multiple regression equation: assessing the significance of a relationship

As discussed earlier, multiple regression has a function of the form y = f(x1, x2, …, xm) + E. Most often, such an equation is used to solve problems of supply and demand for a product, of interest income on repurchased shares, and to study the causes and the form of the production cost function. It is also actively used in a wide variety of macroeconomic studies and calculations, while at the microeconomic level it is used somewhat less frequently.

The main task of multiple regression is to build a model of data containing a huge amount of information in order to further determine what influence each of the factors individually and in their totality has on the indicator that needs to be modeled and its coefficients. The regression equation can take on a wide variety of values. In this case, to assess the relationship, two types of functions are usually used: linear and nonlinear.

The linear function is depicted in the form of the following relationship: y = a0 + a1·x1 + a2·x2 + … + am·xm. Here a1, a2, …, am are considered "pure" regression coefficients. They characterize the average change in the parameter y with a change (decrease or increase) in each corresponding parameter x by one unit, provided the values of the other indicators remain stable.

Nonlinear equations have, for example, the form of the power function y = a·x1^b1 · x2^b2 · … · xm^bm. In this case, the indicators b1, b2, …, bm are called elasticity coefficients: they show by how many percent the result will change with an increase (decrease) of the corresponding indicator x by 1%, with the other factors held stable.

What factors need to be taken into account when constructing multiple regression

In order to build a multiple regression correctly, it is necessary to find out which factors deserve special attention.

It is necessary to have some understanding of the nature of the relationships between economic factors and what is being modeled. Factors that will need to be included must meet the following criteria:

  • They must be subject to quantitative measurement. To use a factor that describes the quality of an object, it must in any case be given a quantitative form.
  • There should be no intercorrelation of the factors and no functional relationship between them; otherwise the system of normal equations becomes ill-conditioned, which entails unreliable and unclear estimates.
  • When the correlation indicator is very large, there is no way to isolate the influence of individual factors on the final result, so the coefficients become uninterpretable.

Construction methods

There are a great many methods explaining how factors can be selected for an equation. However, all of them are based on selecting coefficients using a correlation indicator. Among them are:

  • The elimination method.
  • The inclusion method.
  • Stepwise regression analysis.

The first method involves filtering factors out of the total set. The second involves introducing many additional factors. The third is the elimination of factors that were previously used for the equation. Each of these methods has a right to exist: they have their pros and cons, but they can all solve the issue of discarding unnecessary indicators in their own way. As a rule, the results obtained by each individual method are quite close.

Multivariate analysis methods

Such methods for determining factors are based on considering individual combinations of interrelated characteristics. These include discriminant analysis, pattern recognition, principal component analysis, and cluster analysis. There is also factor analysis, which grew out of the principal component method. All of them apply in certain circumstances, subject to certain conditions and factors.

If there is a correlation between a factor characteristic and a performance characteristic, doctors often have to establish by how much the value of one characteristic can change when the other changes by a generally accepted unit of measurement, or by one established by the researcher.

For example, how will the body weight of 1st grade schoolchildren (girls or boys) change if their height increases by 1 cm? For these purposes, the method of regression analysis is used.

The regression analysis method is most often used to develop normative scales and standards of physical development.

  1. Definition of regression. Regression is a function that makes it possible, from the average value of one characteristic, to determine the average value of another characteristic correlated with the first.

    For this purpose, the regression coefficient and a number of other parameters are used. For example, one can calculate the average number of colds at certain values of the average monthly air temperature in the autumn-winter period.

  2. Determination of the regression coefficient. The regression coefficient is the absolute value by which, on average, the value of one attribute changes when another, associated attribute changes by the established unit of measurement.
  3. Regression coefficient formula. R y/x = r xy x (σ y / σ x)
    where R у/х - regression coefficient;
    r xy - correlation coefficient between characteristics x and y;
    (σ y and σ x) - standard deviations of characteristics x and y.

    In our example:
    σ x = 4.6 (standard deviation of the air temperature in the autumn-winter period);
    σ y = 8.65 (standard deviation of the number of infectious and cold diseases).
    Thus:
    R y/x = -0.96 × (8.65 / 4.6) ≈ -1.8, i.e. when the average monthly air temperature (x) decreases by 1 degree, the average number of infectious and cold diseases (y) in the autumn-winter period will increase by 1.8 cases.

  4. Regression equation. y = M y + R y/x (x - M x)
    where y is the average value of the characteristic, which should be determined when the average value of another characteristic changes (x);
    x is the known average value of another characteristic;
    R y/x - regression coefficient;
    M x, M y - known average values ​​of characteristics x and y.

    For example, the average number of infectious and cold diseases (y) can be determined without special measurements for any average value of the average monthly air temperature (x). So, if x = -9°, R y/x = -1.8, M x = -7°, M y = 20 diseases, then y = 20 + (-1.8) × (-9 - (-7)) = 20 + 3.6 = 23.6 diseases.
    This equation is applied in the case of a linear relationship between two characteristics (x and y).
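The computations of points 3-4 can be sketched directly from the numbers quoted in the text:

```python
# Sketch: regression coefficient and prediction for the colds vs.
# average monthly air temperature example.
r_xy = -0.96                  # correlation coefficient
sigma_x, sigma_y = 4.6, 8.65  # st. dev. of temperature, of disease counts
M_x, M_y = -7.0, 20.0         # mean temperature, mean number of diseases

R = r_xy * (sigma_y / sigma_x)  # regression coefficient, ≈ -1.8 cases/degree

x = -9.0                        # a new average monthly temperature
y = M_y + R * (x - M_x)         # predicted number of diseases, ≈ 23.6
```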

  5. Purpose of the Regression Equation. The regression equation is used to construct a regression line. The latter makes it possible, without special measurements, to determine any average value (y) of one characteristic if the value (x) of another characteristic changes. Based on these data, a graph is constructed - regression line, which can be used to determine the average number of colds at any value of the average monthly temperature within the range between the calculated values ​​of the number of colds.
  6. Regression sigma (formula). σ Ry/x = σ y × √(1 - r²xy),
    where σ Ry/x - sigma (standard deviation) of the regression;
    σ y - standard deviation of the characteristic y;
    r xy - correlation coefficient between the characteristics x and y.

    So, if σ y = 8.65 (the standard deviation of the number of colds) and r xy = -0.96 (the correlation coefficient between the number of colds (y) and the average monthly air temperature in the autumn-winter period (x)), then σ Ry/x = 8.65 × √(1 - 0.96²) = 8.65 × 0.28 ≈ 2.42.

  7. Purpose of the regression sigma. It characterizes the measure of diversity of the resulting characteristic (y).

    For example, it characterizes the diversity of the number of colds at a certain value of the average monthly air temperature in the autumn-winter period. Thus, the average number of colds at air temperature x 1 = -6° can range from 15.78 diseases to 20.62 diseases.
    At x 2 = -9°, the average number of colds can range from 21.18 diseases to 26.02 diseases, etc.

    Regression sigma is used to construct a regression scale, which reflects the deviation of the values ​​of the resulting characteristic from its average value plotted on the regression line.

  8. Data required to calculate and plot the regression scale:
    • the regression coefficient R y/x;
    • the regression equation y = M y + R y/x (x - M x);
    • the regression sigma σ Ry/x.
  9. Sequence of calculations and graphical representation of the regression scale.
    • determine the regression coefficient using the formula (see paragraph 3). For example, it is necessary to determine how much body weight will change on average (at a certain age depending on gender) if the average height changes by 1 cm.
    • using the regression equation formula (see point 4), determine what, for example, the body weight will be on average (y1, y2, y3 ...)* for certain height values (x1, x2, x3 ...).
      ________________
      * The value of "y" should be calculated for at least three known values of "x".

      At the same time, the average values of body weight and height (M x and M y) for a certain age and sex are known.

    • calculate the regression sigma, knowing the corresponding values of σ y and r xy and substituting them into the formula (see point 6).
    • based on the known values x1, x2, x3 and the corresponding average values y1, y2, y3, as well as the smallest (y - σ Ry/x) and largest (y + σ Ry/x) values of y, construct the regression scale.

      To graphically represent the regression scale, the values x1, x2, x3 are first marked on the abscissa axis and the corresponding values y1, y2, y3 on the ordinate axis, i.e. a regression line is constructed, for example, for the dependence of body weight (y) on height (x).

      Then the numeric values of the regression sigma are marked at the corresponding points y1, y2, y3, i.e. the smallest and largest values of y1, y2, y3 are found on the graph.

  10. Practical use of the regression scale. Normative scales and standards are developed, in particular for physical development. Using a standard scale, one can give an individual assessment of children's development. Physical development is assessed as harmonious if, for example, at a certain height the child's body weight is within one regression sigma of the average calculated body weight (y) for the given height (x): (y ± 1 σ Ry/x).

    Physical development is considered disharmonious in terms of body weight if the child’s body weight for a certain height is within the second sigma of regression: (y ± 2 σ Ry/x)

    Physical development will be sharply disharmonious due to both excess and insufficient body weight if the body weight for a certain height is within the third sigma of regression (y ± 3 σ Ry/x).

According to the results of a statistical study of the physical development of 5-year-old boys, it is known that their average height (x) is 109 cm and their average body weight (y) is 19 kg. The correlation coefficient between height and body weight is +0.9; the standard deviations are presented in the table.

Required:

  • calculate the regression coefficient;
  • using the regression equation, determine what the expected body weight of 5-year-old boys will be with a height equal to x1 = 100 cm, x2 = 110 cm, x3 = 120 cm;
  • calculate the regression sigma, construct a regression scale, and present the results of its solution graphically;
  • draw appropriate conclusions.

The conditions of the problem and the results of its solution are presented in the summary table.

Table 1

Conditions of the problem:

                 M        σ         r xy
Height (x)       109 cm   ±4.4 cm   +0.9
Body mass (y)    19 kg    ±0.8 kg

Results of solving the problem: regression coefficient R y/x = 0.16; regression sigma σ Ry/x = ±0.35 kg.

Regression scale (expected body mass, kg):

X         y        y - σ Ry/x   y + σ Ry/x
100 cm    17.56    17.21        17.91
110 cm    19.16    18.81        19.51
120 cm    20.76    20.41        21.11

Solution.

Conclusion. Thus, the regression scale, within the limits of the calculated body-weight values, makes it possible to determine the expected body weight for any other height and to assess a child's individual development. To do this, a perpendicular is restored from the height value to the regression line.
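The calculations summarized in Table 1 can be sketched in Python. The numerical inputs (means, sigmas, r = +0.9) are taken from the problem statement; the standard formulas R_y/x = r·σ_y/σ_x and σ_Ry/x = σ_y·√(1 − r²) are assumed, with rounding to two decimals as in the table:

```python
import math

# Inputs from the problem statement: mean height, mean weight,
# correlation coefficient, and the two standard deviations.
x_mean, y_mean = 109.0, 19.0
r, sigma_x, sigma_y = 0.9, 4.4, 0.8

# Regression coefficient of weight on height: R_y/x = r * sigma_y / sigma_x
R = round(r * sigma_y / sigma_x, 2)                # 0.16, as in Table 1

# Regression sigma: sigma_Ry/x = sigma_y * sqrt(1 - r^2)
sigma_R = round(sigma_y * math.sqrt(1 - r ** 2), 2)  # 0.35

for x in (100, 110, 120):
    y_hat = y_mean + R * (x - x_mean)              # expected body weight
    print(x, round(y_hat, 2),
          round(y_hat - sigma_R, 2), round(y_hat + sigma_R, 2))
```

Running the loop reproduces the three table rows (17.56, 19.16 and 20.76 kg with their ±0.35 kg bands).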


The concept of regression. The dependence between variables x and y can be described in different ways. In particular, any form of connection can be expressed by a general equation in which y is treated as the dependent variable, or function, of another, independent variable x, called the argument. The correspondence between an argument and a function can be specified by a table, a formula, a graph, etc. The change of a function depending on a change in one or more arguments is called regression. All the means used to describe correlations constitute the content of regression analysis.

To express regression, one uses correlation equations (regression equations), empirical and theoretically calculated regression series, their graphs, called regression lines, as well as linear and nonlinear regression coefficients.

Regression indicators express the correlation relationship bilaterally: they take into account the change in the average values of characteristic Y as the values x_i of characteristic X change and, conversely, show the change in the average values of characteristic X for changed values y_i of characteristic Y. The exception is time series, or dynamics series, which show changes in characteristics over time; the regression of such series is one-sided.

There are many different forms and types of correlations. The task comes down to identifying the form of the connection in each specific case and expressing it by the corresponding correlation equation, which allows one to anticipate possible changes in one characteristic Y on the basis of known changes in another characteristic X, correlated with the first.

12.1 Linear regression

The regression equation. The results of observations carried out on a particular biological object with respect to correlated characteristics x and y can be represented by points on a plane in a rectangular coordinate system. The result is a scatter diagram that allows one to judge the form and closeness of the relationship between the varying characteristics. Quite often this relationship looks like a straight line or can be approximated by a straight line.

A linear relationship between the variables x and y is described by a general equation of the form y = a + bx1 + cx2 + dx3 + … , where a, b, c, d, … are parameters of the equation that determine the relationships between the arguments x1, x2, x3, …, xm and the function y.

In practice, not all possible arguments are taken into account, but only some of them; in the simplest case, only one:

y = a + bx. (1)

In the linear regression equation (1), a is the free term, and the parameter b determines the slope of the regression line relative to the rectangular coordinate axes. In analytical geometry this parameter is called the slope, and in biometrics the regression coefficient. A visual representation of this parameter and of the position of the regression lines of Y on X and of X on Y in the rectangular coordinate system is given in Fig. 1.

Fig. 1. Regression lines of Y on X and X on Y in the rectangular coordinate system

The regression lines, as shown in Fig. 1, intersect at the point O (x̄, ȳ), corresponding to the arithmetic mean values of the mutually correlated characteristics Y and X. When constructing regression graphs, the values of the independent variable X are plotted along the abscissa axis and the values of the dependent variable, or function, Y along the ordinate axis. The line AB passing through the point O (x̄, ȳ) corresponds to a complete (functional) relationship between the variables Y and X, when the correlation coefficient r = ±1. The stronger the connection between Y and X, the closer the regression lines are to AB; conversely, the weaker the connection between these quantities, the farther the regression lines are from AB. If there is no connection between the characteristics, the regression lines are at right angles to each other and r = 0.

Since regression indicators express the correlation relationship bilaterally, regression equation (1) should be written in two forms:

ŷ_x = a_yx + b_yx·x and x̂_y = a_xy + b_xy·y.

The first formula determines the average values ŷ_x as characteristic X changes by one unit of measure; the second determines the average values x̂_y as characteristic Y changes by one unit of measure.

The regression coefficient. The regression coefficient shows how much, on average, the value of one characteristic y changes when the measure of another characteristic X, correlated with Y, changes by one unit. This indicator is determined by the formula

b_yx = r_xy·(s_y / s_x).

Here the values of s are multiplied by the size of the class intervals λ if they were found from variation series or correlation tables.

The regression coefficient can also be calculated without computing the standard deviations s_y and s_x, using the formula

If the correlation coefficient is unknown, the regression coefficient is determined as follows:

b_yx = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)². (12.5)

Relationship between the regression and correlation coefficients. Comparing formulas (11.1) (topic 11) and (12.5), we see that their numerators are the same, which indicates a connection between these indicators. This relationship is expressed by the equality

r_xy = √(b_yx · b_xy). (6)

Thus, the correlation coefficient is equal to the geometric mean of the coefficients b_yx and b_xy. Formula (6) makes it possible, first, given the known values of the regression coefficients b_yx and b_xy, to determine the correlation coefficient r_xy and, second, to check the correctness of the calculation of this correlation indicator r_xy between the varying characteristics X and Y.
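Equality (6) is easy to check numerically. The sketch below uses a small hypothetical data set; the sign of the root is taken from the sign of the covariance, since both regression coefficients carry the same sign:

```python
import math

# Hypothetical data set for checking r_xy = sqrt(b_yx * b_xy).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
sxx = sum((xi - x_mean) ** 2 for xi in x)
syy = sum((yi - y_mean) ** 2 for yi in y)

b_yx = sxy / sxx                       # regression coefficient of y on x
b_xy = sxy / syy                       # regression coefficient of x on y
r_direct = sxy / math.sqrt(sxx * syy)  # correlation coefficient directly

# Geometric mean of the two regression coefficients, sign from sxy.
r_from_b = math.copysign(math.sqrt(b_yx * b_xy), sxy)
assert abs(r_direct - r_from_b) < 1e-12
```

Both routes give the same value of r_xy, confirming formula (6) for this data set.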

Like the correlation coefficient, the regression coefficient characterizes only a linear relationship and carries a plus sign for a positive relationship and a minus sign for a negative relationship.

Determination of the linear regression parameters. It is known that the sum of squared deviations of the variants x_i from their mean is the smallest possible value, i.e. Σ(x_i − x̄)² = min. This theorem forms the basis of the least squares method. As applied to linear regression [see formula (1)], the requirement of this theorem is satisfied by a system of equations called normal:

n·a + b·Σx = Σy;
a·Σx + b·Σx² = Σxy.

The joint solution of these equations with respect to the parameters a and b leads to the following results:

b = (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²);

a = (Σy − b·Σx) / n,

whence a = ȳ − b·x̄.

Considering the two-way nature of the relationship between the variables Y and X, the formula for determining the parameter a should be expressed in two forms:

a_yx = ȳ − b_yx·x̄ and a_xy = x̄ − b_xy·ȳ. (7)

The parameter b, or regression coefficient, is determined by the formulas

b_yx = r_xy·(s_y / s_x) and b_xy = r_xy·(s_x / s_y).
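The closed-form solutions of the normal equations can be coded directly. A minimal sketch on a hypothetical data set generated from y = 1 + 2x, so the fit should recover a = 1 and b = 2:

```python
# Least-squares fit of y = a + b*x via the solved normal equations;
# the data are hypothetical and lie exactly on y = 1 + 2x.
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# b = (n*Sum(xy) - Sum(x)*Sum(y)) / (n*Sum(x^2) - (Sum(x))^2)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
# a = y_mean - b * x_mean
a = sum_y / n - b * sum_x / n

print(a, b)  # recovers a = 1.0, b = 2.0
```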

Construction of empirical regression series. When a large number of observations is available, regression analysis begins with the construction of empirical regression series. An empirical regression series is formed by calculating, for the values of one varying characteristic X, the average values of another characteristic Y correlated with it. In other words, the construction of empirical regression series comes down to finding group averages from the corresponding values of the characteristics Y and X.

An empirical regression series is a double series of numbers that can be represented by points on a plane; by connecting these points with straight-line segments, an empirical regression line is obtained. Empirical regression series, and especially their graphs, called regression lines, give a clear idea of the form and closeness of the correlation between the varying characteristics.

Alignment of empirical regression series. The graphs of empirical regression series turn out, as a rule, to be not smooth but broken lines. This is explained by the fact that, along with the main causes that determine the general pattern in the variability of the correlated characteristics, their magnitude is affected by numerous secondary causes that produce random fluctuations in the nodal points of the regression. To identify the main tendency (trend) of the conjugate variation of the correlated characteristics, the broken lines must be replaced by smooth, evenly running regression lines. The process of replacing broken lines by smooth ones is called the alignment of empirical series and regression lines.

The graphic alignment method. This is the simplest method, requiring no computational work. Its essence is as follows. The empirical regression series is depicted as a graph in a rectangular coordinate system. The midpoints of the regression are then outlined by eye, and a solid line is drawn through them with a ruler or French curve. The disadvantage of this method is obvious: it does not exclude the influence of the researcher's individual traits on the results of the alignment of the empirical regression lines. Therefore, when higher accuracy is needed in replacing broken regression lines by smooth ones, other methods of aligning empirical series are used.

The moving average method. The essence of this method comes down to the sequential calculation of arithmetic averages of two or three adjacent terms of the empirical series. It is especially convenient when the empirical series is represented by a large number of terms, so that the loss of two of them, the extreme ones, which is inevitable with this method of alignment, does not noticeably affect its structure.
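A minimal sketch of three-term moving-average smoothing; the series is hypothetical, and the two extreme terms are lost, as noted above:

```python
# Three-term moving average: each inner term is replaced by the
# arithmetic mean of itself and its two neighbors.
def moving_average_3(series):
    """Return arithmetic means of each three adjacent terms."""
    return [sum(series[i - 1:i + 2]) / 3 for i in range(1, len(series) - 1)]

raw = [1, 2, 9, 4, 5]          # hypothetical empirical series
print(moving_average_3(raw))   # [4.0, 5.0, 6.0]
```

The random spike at the third term (9) is flattened into the smoothed values.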

The least squares method. This method was proposed at the beginning of the 19th century by A.M. Legendre and, independently of him, by C. Gauss. It makes it possible to align empirical series most accurately. As shown above, it is based on the premise that the sum of squared deviations of the variants x_i from their mean is a minimum value; hence the name of the method, which is used not only in ecology but also in technology. The least squares method is objective and universal; it is applied in the most varied cases when finding empirical equations for regression series and determining their parameters.

The requirement of the least squares method is that the theoretical points of the regression line must be obtained in such a way that the sum of the squared deviations of the empirical observations y_i from these points is minimal, i.e. Σ(y_i − ŷ_i)² = min.

By finding the minimum of this expression in accordance with the principles of mathematical analysis and transforming it in a certain way, one can obtain a system of so-called normal equations, in which the unknowns are the required parameters of the regression equation, while the known coefficients are determined by the empirical values of the characteristics, usually the sums of their values and their cross products.

Multiple linear regression. The relationship between several variables is usually expressed by a multiple regression equation, which can be linear or nonlinear. In its simplest form, multiple regression is expressed as an equation with two independent variables (x, z):

ŷ = a + b·x + c·z, (10)

where a is the free term of the equation and b and c are its parameters. To find the parameters of equation (10) by the least squares method, the following system of normal equations is used:

n·a + b·Σx + c·Σz = Σy;
a·Σx + b·Σx² + c·Σxz = Σxy;
a·Σz + b·Σxz + c·Σz² = Σyz.
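A minimal sketch of solving this three-equation normal system with Cramer's rule; the data set is hypothetical and generated exactly from y = 1 + 2x + 3z, so the fit should recover those parameters:

```python
# Hypothetical data lying exactly on y = 1 + 2x + 3z.
x = [1, 2, 1, 2, 3]
z = [1, 1, 2, 2, 1]
y = [1 + 2 * xi + 3 * zi for xi, zi in zip(x, z)]
n = len(x)

sx, sz, sy = sum(x), sum(z), sum(y)
sxx = sum(xi * xi for xi in x)
szz = sum(zi * zi for zi in z)
sxz = sum(xi * zi for xi, zi in zip(x, z))
sxy = sum(xi * yi for xi, yi in zip(x, y))
szy = sum(zi * yi for zi, yi in zip(z, y))

# Normal equations:  n*a  + sx*b  + sz*c  = sy
#                    sx*a + sxx*b + sxz*c = sxy
#                    sz*a + sxz*b + szz*c = szy
M = [[n, sx, sz], [sx, sxx, sxz], [sz, sxz, szz]]
rhs = [sy, sxy, szy]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

d = det3(M)
params = []
for col in range(3):                 # Cramer's rule, column by column
    mc = [row[:] for row in M]
    for i in range(3):
        mc[i][col] = rhs[i]
    params.append(det3(mc) / d)

a, b, c = params
print(a, b, c)  # recovers a = 1, b = 2, c = 3
```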

Dynamics series. Alignment of series. Changes in characteristics over time form so-called time series, or dynamics series. A characteristic feature of such series is that the independent variable X here is always the time factor, while the dependent variable Y is a changing characteristic. Unlike regression series, the relationship between the variables X and Y here is one-sided, since the time factor does not depend on the variability of the characteristics. Despite these features, dynamics series can be likened to regression series and processed by the same methods.

Like regression series, empirical dynamics series bear the influence not only of the main factors but also of numerous secondary (random) factors that obscure the main tendency in the variability of the characteristics, which in the language of statistics is called the trend.

The analysis of time series begins with identifying the shape of the trend. To do this, the time series is depicted as a line graph in a rectangular coordinate system, with the time points (years, months and other units of time) plotted along the abscissa axis and the values of the dependent variable Y along the ordinate axis. If there is a linear relationship between the variables X and Y (a linear trend), the most appropriate way to align the time series by the least squares method is a regression equation in the form of deviations of the terms of the series of the dependent variable Y from the arithmetic mean of the series of the independent variable X:

ŷ = ȳ + b·(x − x̄).

Here b is the linear regression parameter.
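The deviation-form trend fit can be sketched as follows; the time series is hypothetical, with the time points serving as the independent variable:

```python
# Aligning a time series with a linear trend in deviation form:
# y_hat = y_mean + b * (t - t_mean); the series is hypothetical.
t = [1, 2, 3, 4, 5]             # time points
y = [2.0, 4.0, 6.0, 8.0, 10.0]  # observed levels of the series
n = len(t)
t_mean = sum(t) / n
y_mean = sum(y) / n

# b = Sum((t - t_mean)(y - y_mean)) / Sum((t - t_mean)^2)
b = (sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
     / sum((ti - t_mean) ** 2 for ti in t))

trend = [y_mean + b * (ti - t_mean) for ti in t]
print(b, trend)
```

For this noiseless series the trend coincides with the observations; in practice the trend line smooths out the random fluctuations of the series.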

Numerical characteristics of dynamics series. The main generalizing numerical characteristics of dynamics series include the geometric mean and the arithmetic mean close to it. They characterize the average rate at which the value of the dependent variable changes over certain periods of time:

K̄ = (y_n / y_1)^(1/(n−1)).
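Assuming the usual definition of the average growth coefficient as the geometric mean of the chain growth coefficients, which telescopes to (y_n / y_1)^(1/(n−1)), a minimal sketch on a hypothetical series:

```python
# Average growth coefficient of a dynamics series: the geometric mean
# of the chain coefficients y[i+1]/y[i] reduces to the formula below.
y = [100.0, 110.0, 121.0]   # hypothetical series growing 10% per period
n = len(y)
k_mean = (y[-1] / y[0]) ** (1 / (n - 1))
print(k_mean)               # average growth coefficient, about 1.1
```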

The variability of the members of a dynamics series is assessed by the standard deviation. When choosing regression equations to describe time series, the shape of the trend is taken into account, which may be linear (or reducible to linear) or nonlinear. The correctness of the choice of regression equation is usually judged by the similarity between the empirically observed and the calculated values of the dependent variable. A more accurate solution of this problem is given by the analysis-of-variance method of regression analysis (topic 12, paragraph 4).

Correlation of time series. It is often necessary to compare the dynamics of parallel time series related to each other by certain general conditions, for example, to find out the relationship between agricultural production and the growth of livestock numbers over a certain period of time. In such cases, the characteristic of the relationship between the variables X and Y is the correlation coefficient r_xy (in the presence of a linear trend).

It is known that the trend of time series is, as a rule, obscured by fluctuations of the members of the series of the dependent variable Y. This gives rise to a twofold problem: measuring the dependence between the compared series without excluding the trend, and measuring the dependence between neighboring members of the same series with the trend excluded. In the first case, the indicator of the closeness of the connection between the compared time series is the correlation coefficient (if the relationship is linear); in the second, the autocorrelation coefficient. These indicators have different meanings, although they are calculated by the same formulas (see topic 11).

It is easy to see that the value of the autocorrelation coefficient is affected by the variability of the series members of the dependent variable: the less the series members deviate from the trend, the higher the autocorrelation coefficient, and vice versa.
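A lag-1 autocorrelation coefficient can be sketched by applying the ordinary correlation formula to the series and its shifted copy (one common convention; the series below is hypothetical):

```python
import math

# Lag-1 autocorrelation: the correlation formula applied to the
# series paired with its own copy shifted by one time step.
def autocorr_lag1(y):
    a, b = y[:-1], y[1:]
    n = len(a)
    ma = sum(a) / n
    mb = sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / math.sqrt(va * vb)

print(autocorr_lag1([1, 2, 3, 4, 5]))  # a pure linear trend gives 1.0
```

Consistent with the remark above: a series that never departs from its trend yields the maximum autocorrelation, while large random deviations from the trend drive the coefficient down.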