Regression analysis is a method of modeling measured data and studying their properties. The data consist of pairs of values of the dependent variable (the response variable) and the independent variable (the explanatory variable). A regression model is a function of the independent variable and of parameters, with an added random variable.

Correlation analysis and regression analysis are related branches of mathematical statistics intended for studying, from sample data, the statistical dependence of a number of quantities, some of which are random. Under statistical dependence the quantities are not functionally related; rather, they are defined as random variables by a joint probability distribution.

The study of the dependence of random variables leads to regression models and regression analysis based on sample data. Probability theory and mathematical statistics represent only a tool for studying statistical dependence, but do not aim to establish a causal relationship. Ideas and hypotheses about a causal relationship must be brought from some other theory that allows a meaningful explanation of the phenomenon being studied.

Numerical data usually has explicit (known) or implicit (hidden) relationships with each other.

Indicators obtained by direct calculation methods, i.e. computed using previously known formulas, are explicitly related: for example, percentages of plan completion, levels, specific weights, absolute deviations, percentage deviations, growth rates, rates of increase, indices, etc.

Connections of the second type (implicit) are unknown in advance. However, it is necessary to be able to explain and predict (forecast) complex phenomena in order to manage them. Therefore, specialists, with the help of observations, strive to identify hidden dependencies and express them in the form of formulas, that is, to mathematically model phenomena or processes. One such opportunity is provided by correlation-regression analysis.

Mathematical models are built and used for three general purposes:

  • for explanation;
  • for prediction;
  • for control (management).

Using the methods of correlation-regression analysis, analysts measure the closeness of the connections between indicators with the correlation coefficient. The connections discovered may differ in strength (strong, moderate, weak, etc.) and in direction (direct, inverse). If the connections prove significant, it is advisable to find their mathematical expression in the form of a regression model and to evaluate the statistical significance of the model.

Regression analysis is considered the main method of modern mathematical statistics for identifying implicit, veiled connections between observational data.

The problem statement of regression analysis is formulated as follows.

There is a set of observational results. In this set, one column corresponds to an indicator for which it is necessary to establish a functional relationship with the parameters of the object and environment represented by the remaining columns. It is required to establish a quantitative relationship between the indicator and the factors. In this case, the problem of regression analysis is understood as the task of identifying a functional dependence y = f(x2, x3, ..., xm) that best describes the available experimental data.

Assumptions:

the number of observations is sufficient to demonstrate statistical patterns regarding factors and their relationships;

the processed data contains some errors (noise) due to measurement errors and the influence of unaccounted random factors;

the matrix of observation results is the only information about the object being studied that is available before the start of the study.

The function f(x2, x3, ..., xm), which describes the dependence of the indicator on the parameters, is called the regression equation (function). The term “regression” (from Latin regressio, a retreat, a return to something) is associated with the specifics of one of the particular problems solved at the stage when the method was being developed.

It is advisable to split the solution to the problem of regression analysis into several stages:

data pre-processing;

choosing the type of regression equations;

calculation of regression equation coefficients;

checking the adequacy of the constructed function to the observation results.

Pre-processing includes standardizing the data matrix, calculating correlation coefficients, checking their significance and excluding insignificant parameters from consideration.

Choosing the type of regression equation

The task of determining the functional relationship that best describes the data involves overcoming a number of fundamental difficulties. In the general case, for standardized data, the functional dependence of the indicator on the parameters can be represented as

y = f (x1, x2, …, xm) + e

where f is a previously unknown function to be determined;

e - data approximation error.

This equation is usually called the sample regression equation. It characterizes the relationship between the variation of the indicator and the variations of the factors, and the correlation measures the proportion of the variation of the indicator that is associated with the variation of the factors. In other words, the correlation between an indicator and factors cannot be interpreted as a connection between their levels, and regression analysis does not explain the role of the factors in creating the indicator.

Another feature concerns the assessment of the degree of influence of each factor on the indicator. The regression equation does not provide an assessment of the separate influence of each factor on the indicator; such an assessment is possible only when all other factors are unrelated to the one being studied. If the factor being studied is related to others that influence the indicator, then the result will be a mixed characteristic of the factor's influence. This characteristic contains both the direct influence of the factor and the indirect influence exerted through its connection with other factors and their influence on the indicator.

It is not recommended to include in the regression equation factors that are weakly related to the indicator but closely related to other factors. Factors that are functionally related to each other (their correlation coefficient is 1) are not included in the equation. The inclusion of such factors leads to degeneracy of the system of equations for estimating the regression coefficients and to an indeterminate solution.

The function f must be selected so that the error e is minimal in some sense. To select a functional connection, a hypothesis is put forward in advance about the class to which the function f may belong, and then the "best" function within this class is selected. The selected class of functions must have a certain "smoothness": "small" changes in the argument values should cause "small" changes in the function values.

A special case widely used in practice is a first-degree polynomial, i.e. the linear regression equation y = a0 + a1x1 + a2x2 + … + amxm + e.

To select the type of functional dependence, the following approach can be recommended:

points with indicator values are plotted in the parameter space. With a large number of parameters, points can be plotted against each parameter separately, yielding two-dimensional distributions of values;

based on the location of the points and based on an analysis of the essence of the relationship between the indicator and the parameters of the object, a conclusion is made about the approximate type of regression or its possible options;

after calculating the parameters, the quality of the approximation is assessed, i.e. the degree of similarity between the calculated and actual values is evaluated;

if the calculated and actual values are close throughout the entire task area, the problem of regression analysis can be considered solved. Otherwise, one can try a different type of polynomial or another analytical function, for example a periodic one.

Calculating Regression Equation Coefficients

It is impossible to solve the system of equations unambiguously from the available data, since the number of unknowns is always greater than the number of equations. Additional assumptions are needed to overcome this problem. Common sense suggests choosing the coefficients of the polynomial so as to minimize the error of data approximation. Various measures can be used to evaluate approximation errors; the root mean square error is the most widely used. On its basis a special method for estimating the coefficients of regression equations has been developed: the least squares method (LSM). This method yields maximum likelihood estimates of the unknown coefficients of the regression equation when the errors are normally distributed, but it can also be applied under any other distribution of the factors.
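As a concrete illustration of the idea, here is a minimal least squares sketch in Python (the data and numbers are hypothetical, invented for this example):

```python
import numpy as np

# Hypothetical observations: factor x and indicator y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least squares: choose a0, a1 minimizing sum((y - (a0 + a1*x))**2);
# np.polyfit solves the corresponding normal equations internally.
a1, a0 = np.polyfit(x, y, deg=1)

residuals = y - (a0 + a1 * x)
rms = np.sqrt(np.mean(residuals ** 2))  # root mean square error of the fit
print(f"y = {a0:.3f} + {a1:.3f} x, RMS error = {rms:.3f}")
```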

The LSM is based on the following provisions:

the values of the errors and factors are independent of one another, and therefore uncorrelated, i.e. it is assumed that the mechanisms generating the noise are not related to the mechanism generating the factor values;

the expected value of the error e must be zero (the constant component is included in the coefficient a0); in other words, the error is a centered quantity;

the sample estimate of error variance should be minimal.

If the linear model is inaccurate or the parameters are measured imprecisely, the least squares method still allows us to find the coefficient values for which the linear model best describes the real object in the sense of the chosen standard deviation criterion.

The quality of the resulting regression equation is assessed by the closeness between the observed values of the indicator and the values predicted by the regression equation at the given points of the parameter space. If they are close, the problem of regression analysis can be considered solved. Otherwise, the regression equation should be changed and the parameter-estimation calculations repeated.

If there are several indicators, the problem of regression analysis is solved independently for each of them.

When analyzing the essence of the regression equation, the following points should be noted. The approach considered does not provide separate (independent) estimates of the coefficients: a change in the value of one coefficient entails a change in the values of the others. The coefficients obtained should not be regarded as the contribution of the corresponding parameter to the value of the indicator. The regression equation is just a good analytical description of the available data, not a law describing the relationship between the parameters and the indicator. The equation is used to calculate the values of the indicator within a given range of parameter variation. It is of limited suitability for calculation outside this range, i.e. it can be used for solving interpolation problems and, to a limited extent, for extrapolation.

The main reason for forecast inaccuracy is not so much the uncertainty of extrapolating the regression line as the significant variation of the indicator due to factors not taken into account in the model. The forecasting ability is limited by the condition that the parameters not accounted for in the model, and the nature of the influence of the model factors that are accounted for, remain stable. If the external environment changes abruptly, the compiled regression equation loses its meaning.

The forecast obtained by substituting the expected value of the parameter into the regression equation is a point forecast. The likelihood that such a forecast will be realized exactly is negligible, so it is advisable to determine a confidence interval for the forecast. For individual values of the indicator, the interval should take into account errors in the position of the regression line as well as deviations of individual values from this line.

Lecture 3.

Regression analysis.

1) Numerical characteristics of regression

2) Linear regression

3) Nonlinear regression

4) Multiple regression

5) Using MS EXCEL to perform regression analysis

Control and evaluation tool - test tasks

1. Numerical characteristics of regression

Regression analysis is a statistical method for studying the influence of one or more independent variables on a dependent variable. Independent variables are otherwise called regressors or predictors, and dependent variables are called criterion variables. This terminology reflects only the mathematical dependence of the variables, not cause-and-effect relationships.

Goals of Regression Analysis

  • Determining the degree of determination of the variation of a criterion (dependent) variable by predictors (independent variables).
  • Predicting the value of a dependent variable using the independent variable(s).
  • Determination of the contribution of individual independent variables to the variation of the dependent variable.

Regression analysis cannot be used to determine whether there is a relationship between variables, since the presence of such a relationship is a prerequisite for applying the analysis.

To conduct regression analysis, one first needs to become familiar with the basic concepts of statistics and probability theory.

The basic numerical characteristics of discrete and continuous random variables are the mathematical expectation, the dispersion (variance) and the standard deviation.

Random variables are divided into two types:

  • discrete, which can take only specific, predetermined values (for example, the numbers on the upper face of a thrown die or the ordinal number of the current month);
  • continuous (most often the values of physical quantities: weight, distance, temperature, etc.), which by the laws of nature can take any value, at least within a certain interval.

The distribution law of a discrete random variable is the correspondence between its possible values and their probabilities; it is usually written as a table.

The statistical definition of probability is expressed through the relative frequency of a random event, that is, it is found as the ratio of the number of occurrences of the event to the total number of trials.

The mathematical expectation of a discrete random variable X is the sum of the products of the values of X and the probabilities of these values. The mathematical expectation is denoted by M(X).

M(X) = x1p1 + x2p2 + … + xnpn = Σ xi pi (the sum is taken over i = 1, …, n)

The scatter of a random variable about its mathematical expectation is described by a numerical characteristic called the dispersion (variance). Simply put, the variance is the spread of a random variable around its mean value. To understand the essence of dispersion, consider an example. The average wage nationwide is about 25 thousand rubles. Where does this figure come from? Most likely, all salaries are added up and divided by the number of employees. In this case there is very large dispersion (the minimum salary is about 4 thousand rubles, the maximum about 100 thousand rubles). If everyone's salary were the same, the variance would be zero and there would be no spread.

The dispersion of a discrete random variable X is the mathematical expectation of the squared difference between the random variable and its mathematical expectation:

D = M[(X − M(X))²]

Using the definition of mathematical expectation to calculate the variance, we obtain the formula:

D = Σ (xi − M(X))² pi

The variance has the dimension of the square of the random variable. In cases where a numerical characteristic of the scatter of possible values is needed in the same units as the random variable itself, the standard deviation is used.

The standard deviation of a random variable is the square root of its variance: σ = √D.

The standard deviation is a measure of the dispersion of the values ​​of a random variable around its mathematical expectation.

Example.

The distribution law of the random variable X is given by the following table:

X:  1    2    4    5
p:  0.1  0.4  0.4  0.1

Find its mathematical expectation, variance and standard deviation.

We use the above formulas:

M(X) = 1·0.1 + 2·0.4 + 4·0.4 + 5·0.1 = 3

D = (1 − 3)²·0.1 + (2 − 3)²·0.4 + (4 − 3)²·0.4 + (5 − 3)²·0.1 = 1.6

σ = √1.6 ≈ 1.26
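The same calculation is easy to verify programmatically; a small sketch in Python using the numbers from the example above:

```python
import numpy as np

# Distribution law from the example above.
x = np.array([1.0, 2.0, 4.0, 5.0])
p = np.array([0.1, 0.4, 0.4, 0.1])

m = np.sum(x * p)              # mathematical expectation M(X)
d = np.sum((x - m) ** 2 * p)   # dispersion D
sigma = np.sqrt(d)             # standard deviation

print(m, d, sigma)  # 3.0 1.6 1.2649...
```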

Example.

In a cash lottery, 1 win of 1000 rubles, 10 wins of 100 rubles each and 100 wins of 1 ruble each are drawn among a total of 10,000 tickets. Draw up the distribution law of the random winnings X for the owner of one lottery ticket and determine the mathematical expectation, dispersion and standard deviation of this random variable.

X1 = 1000, X2 = 100, X3 = 1, X4 = 0;

P1 = 1/10000 = 0.0001, P2 = 10/10000 = 0.001, P3 = 100/10000 = 0.01, P4 = 1 − (P1 + P2 + P3) = 0.9889.

Let's put the results in a table:

X:  1000    100    1     0
P:  0.0001  0.001  0.01  0.9889

The mathematical expectation is the sum of the pairwise products of the values of the random variable and their probabilities. For this problem it is convenient to calculate it as

M(X) = 1000·0.0001 + 100·0.001 + 1·0.01 + 0·0.9889 = 0.21 rubles.

We received a real “fair” ticket price.

D = Σ (xi − M(X))² pi = (1000 − 0.21)²·0.0001 + (100 − 0.21)²·0.001 + (1 − 0.21)²·0.01 + (0 − 0.21)²·0.9889 ≈ 109.97

σ = √109.97 ≈ 10.49

Distribution function of continuous random variables

A quantity that, as a result of a test, takes on one of its possible values (not known in advance) is called a random variable. As mentioned above, random variables can be discrete (discontinuous) or continuous.

A discrete random variable is one that takes on separate, isolated possible values with certain probabilities; these values can be enumerated.

A continuous random variable is one that can take all values from some finite or infinite interval.

Up to this point we limited ourselves to only one "type" of random variable: discrete, i.e. taking a finite set of values.

But the theory and practice of statistics require the concept of a continuous random variable, which admits any numerical values from some interval.

It is convenient to define the distribution law of a continuous random variable using the so-called probability density function f(x). The probability P(a < X < b) that the value taken by the random variable X falls within the interval (a, b) is determined by the equality

P(a < X < b) = ∫[a, b] f(x) dx

The graph of the function f(x) is called the distribution curve. Geometrically, the probability of the random variable falling within the interval (a, b) is equal to the area of the corresponding curvilinear trapezoid bounded by the distribution curve, the Ox axis and the straight lines x = a and x = b.

P(a ≤ X ≤ b) = P(a < X < b): for a continuous random variable the probability of any single value is zero, so if a finite or countable set of points is removed from an event, the probability of the new event remains unchanged.

The function f(x), a numerical scalar function of the real argument x, is called the probability density; it exists at a point x if the following limit exists at that point:

f(x) = lim (Δx → 0) P(x < X < x + Δx) / Δx

Properties of probability density:

  1. The probability density is a non-negative function, i.e. f(x) ≥ 0.
  2. The probability density is normalized: ∫[−∞, +∞] f(x) dx = 1 (if all values of the random variable X are contained in the interval (a, b), then the last equality can be written as ∫[a, b] f(x) dx = 1).

Let us now consider the function F(x) = P(X < x). This function is called the probability distribution function of the random variable X. The function F(x) exists for both discrete and continuous random variables. If f(x) is the probability density function of a continuous random variable X, then

F(x) = ∫[−∞, x] f(t) dt

From the last equality it follows that f(x) = F′(x).

Sometimes the function f(x) is called the differential probability distribution function, and the function F(x) is called the cumulative probability distribution function.

Let us note the most important properties of the probability distribution function:

  1. F(x) is a non-decreasing function.
  2. F (- ∞) = 0.
  3. F (+ ∞) = 1.

The concept of distribution function is central to probability theory. Using this concept, we can give another definition of a continuous random variable. A random variable is called continuous if its cumulative distribution function F(x) is continuous.

Numerical characteristics of continuous random variables

The mathematical expectation, dispersion and other parameters of any random variables are almost always calculated using formulas arising from the distribution law.

For a continuous random variable, the mathematical expectation is calculated using the formula:

M(X) = ∫[−∞, +∞] x f(x) dx

Dispersion:

D(X) = ∫[−∞, +∞] (x − M(X))² f(x) dx, or D(X) = ∫[−∞, +∞] x² f(x) dx − (M(X))²
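As an illustration, these integrals can be evaluated numerically. The sketch below assumes a uniform density on (0, 2), chosen purely for the example:

```python
import numpy as np
from scipy.integrate import quad

# Assumed density for illustration: uniform on (0, 2), f(x) = 0.5 there.
f = lambda x: 0.5
a, b = 0.0, 2.0

m, _ = quad(lambda x: x * f(x), a, b)              # M(X) = integral of x f(x) dx
d, _ = quad(lambda x: (x - m) ** 2 * f(x), a, b)   # D(X) = integral of (x - M(X))^2 f(x) dx

print(m, d, np.sqrt(d))  # 1.0, 0.3333..., 0.5773...
```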

2. Linear regression

Let the components X and Y of a two-dimensional random variable (X, Y) be dependent. We will assume that one of them can be approximately represented as a linear function of the other, for example

Y ≈ g(X) = α + βX, and we determine the parameters α and β using the least squares method.

Definition. The function g(X) = α + βX is called the best approximation of Y in the sense of the least squares method if the mathematical expectation M(Y − g(X))² takes the smallest possible value; the function g(X) is called the mean square regression of Y on X.

Theorem. The linear mean square regression of Y on X has the form

g(X) = m_y + ρ (σ_y / σ_x)(X − m_x),

where m_x = M(X), m_y = M(Y), σ_x and σ_y are the standard deviations of X and Y, and ρ is the correlation coefficient of X and Y.

Equation coefficients: β = ρ σ_y / σ_x, α = m_y − ρ (σ_y / σ_x) m_x.

It can be verified that for these values the function

F(α, β) = M(Y − α − βX)² has a minimum, which proves the theorem.

Definition. The coefficient β = ρ σ_y / σ_x is called the regression coefficient of Y on X, and the straight line y = m_y + ρ (σ_y / σ_x)(x − m_x) is the line of mean square regression of Y on X.

By substituting the coordinates of the stationary point into the equality, we can find the minimum value of the function F(α, β), equal to σ_y²(1 − ρ²). This quantity is called the residual variance of Y relative to X; it characterizes the size of the error made when replacing Y with g(X) = α + βX. When ρ = ±1 the residual variance is zero, that is, the equality is not approximate but exact, and Y and X are related by a linear functional dependence. Similarly, one can obtain the line of mean square regression of X on Y:

x = m_x + ρ (σ_x / σ_y)(y − m_y),

and the residual variance σ_x²(1 − ρ²) of X relative to Y. When ρ = ±1 both regression lines coincide. Comparing the regression equations of Y on X and of X on Y and solving the system of equations, one finds the point of intersection of the regression lines: the point with coordinates (m_x, m_y), called the center of the joint distribution of the X and Y values.

We will consider the algorithm for composing regression equations following the textbook by V. E. Gmurman, "Probability Theory and Mathematical Statistics", p. 256.

1) Draw up a calculation table in which the numbers of the sample elements, the sample values, their squares and their product are recorded.

2) Calculate the sums of all columns except the first (the item number).

3) Calculate the average values, the variances and the standard deviations of each variable.

5) Test the hypothesis that a connection between X and Y exists.

6) Compose the equations of both regression lines and draw the graphs of these equations.

The slope of the regression line of Y on X is the sample regression coefficient ρ_yx.

Coefficient b = 0.202.

We obtain the required equation of the regression line of Y on X:

Y = 0.202X + 1.024

The regression equation of X on Y is obtained similarly. The slope of the regression line of X on Y is the sample regression coefficient ρ_xy:

Coefficient b = 4.119.

X = 4.119Y − 3.714
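The two lines are easy to compute from a paired sample. A hedged sketch in Python (the data below are invented; the coefficients 0.202 and 4.119 above come from the textbook's own table, which is not reproduced here):

```python
import numpy as np

# Hypothetical paired sample (x_i, y_i).
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([3.1, 5.2, 7.0, 9.3, 11.0])

r = np.corrcoef(x, y)[0, 1]      # sample correlation coefficient
b_yx = r * y.std() / x.std()     # regression coefficient of Y on X
b_xy = r * x.std() / y.std()     # regression coefficient of X on Y

# Both lines pass through the center (mean x, mean y) of the distribution.
print(f"Y on X: y = {y.mean():.2f} + {b_yx:.4f} (x - {x.mean():.1f})")
print(f"X on Y: x = {x.mean():.1f} + {b_xy:.4f} (y - {y.mean():.2f})")
```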

3. Nonlinear regression

If the relationships between economic phenomena are nonlinear, they are expressed using appropriate nonlinear functions.

There are two classes of nonlinear regressions:

1. Regressions that are nonlinear with respect to the explanatory variables included in the analysis, but linear with respect to the estimated parameters, for example:

Polynomials of different degrees: y = a0 + a1x + a2x² + …;

Equilateral hyperbola: y = a0 + a1/x;

Semilogarithmic function: y = a0 + a1 ln x.

2. Regressions that are nonlinear in terms of the parameters being estimated, for example:

Power: y = a · x^b;

Exponential: y = a · b^x;

Exponential of a linear function: y = e^(a + bx).

Regressions that are nonlinear with respect to the included variables are brought to a linear form by simply replacing variables, and further estimation of the parameters is carried out using the least squares method. Let's look at some features.

A second-degree parabola y = a0 + a1x + a2x² is reduced to linear form by the substitution x1 = x, x2 = x². As a result we arrive at a two-factor equation, the estimation of whose parameters by the least squares method leads to a system of normal equations.

A second-degree parabola is usually used in cases where, over a certain interval of factor values, the nature of the connection between the characteristics under consideration changes: a direct connection changes to an inverse one, or an inverse connection to a direct one.

An equilateral hyperbola can be used to characterize the relationship between the specific costs of raw materials, materials and fuel and the volume of output, or between the time goods spend in circulation and the amount of turnover. Its classic example is the Phillips curve, which characterizes the nonlinear relationship between the unemployment rate x and the percentage growth of wages y.

The hyperbola is reduced to a linear equation by the simple substitution z = 1/x; the least squares method can then again be used to construct a system of linear equations.

The semilogarithmic dependence y = a0 + a1 ln x and others are reduced to a linear form in a similar way (here by the substitution z = ln x).

An equilateral hyperbola and a semilogarithmic curve are used to describe the Engel curve (a mathematical description of the relationship between the share of expenditure on durable goods and total expenditure or income). Equations of these types are also used in studies of the productivity and labor intensity of agricultural production.
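As an illustration of such a variable substitution, here is a sketch that fits the power model y = a·x^b by taking logarithms (the data are hypothetical):

```python
import numpy as np

# Hypothetical data roughly following y = 2 * x**0.5.
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = np.array([2.0, 2.9, 4.1, 5.8, 8.2])

# The substitution Y = ln y, X = ln x turns y = a * x**b into Y = ln a + b*X,
# which is then fitted by ordinary least squares.
b, ln_a = np.polyfit(np.log(x), np.log(y), deg=1)
a = np.exp(ln_a)

print(f"y = {a:.2f} * x^{b:.2f}")
```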

4. Multiple regression

Multiple regression is an equation of connection with several independent variables:

y = f(x1, x2, …, xm) + ε,

where y is the dependent variable (the resulting attribute) and x1, x2, …, xm are the independent variables (factors).

To build a multiple regression equation, the following functions are most often used:

linear: y = a0 + a1x1 + a2x2 + … + amxm;

power: y = a0 · x1^a1 · x2^a2 · … · xm^am;

exponential: y = e^(a0 + a1x1 + a2x2 + … + amxm);

hyperbola: y = 1 / (a0 + a1x1 + a2x2 + … + amxm).

You can use other functions that can be reduced to linear form.

To estimate the parameters of the multiple regression equation, the least squares method (OLS) is used. For linear equations, and for nonlinear equations reducible to linear form, a system of normal equations is constructed whose solution yields estimates of the regression parameters.

To solve it, the method of determinants (Cramer's rule) can be used:

a0 = Δa0 / Δ, a1 = Δa1 / Δ, …, am = Δam / Δ,

where Δ is the determinant of the system, and Δa0, Δa1, …, Δam are the partial determinants, obtained by replacing the corresponding column of the system's matrix with the column of free terms of the system.

Another type of multiple regression equation is the regression equation on a standardized scale; OLS is applied to it as well.
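A minimal sketch of estimating a linear two-factor equation by solving the normal equations directly (the data are hypothetical; np.linalg.solve here plays the role of the determinant method):

```python
import numpy as np

# Hypothetical data: two factors x1, x2 and the outcome y.
X = np.column_stack([np.ones(6),
                     [1, 2, 3, 4, 5, 6],     # x1
                     [2, 1, 4, 3, 6, 5]])    # x2
y = np.array([4.1, 5.0, 8.9, 9.8, 13.1, 13.9])

# Normal equations (X'X) a = X'y; solving them directly is equivalent to
# Cramer's determinant method, but numerically more convenient.
a = np.linalg.solve(X.T @ X, X.T @ y)
print(a)  # estimates of a0, a1, a2
```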

5. Using MS EXCEL to perform regression analysis

Regression analysis establishes the forms of dependence between the random variable Y (dependent) and the values ​​of one or more variable quantities (independent), and the values ​​of the latter are considered to be precisely specified. Such a dependence is usually determined by some mathematical model (regression equation) containing several unknown parameters. During regression analysis, based on sample data, estimates of these parameters are found, statistical errors in estimates or boundaries of confidence intervals are determined, and the compliance (adequacy) of the adopted mathematical model with experimental data is checked.

In linear regression analysis, the relationship between random variables is assumed to be linear. In the simplest case, in a paired linear regression model there are two variables X and Y. And it is required to construct (fit) a straight line using n pairs of observations (X1, Y1), (X2, Y2), ..., (Xn, Yn), called the regression line that "best" approximates the observed values. The equation of this line y=ax+b is a regression equation. Using a regression equation, you can predict the expected value of the dependent variable y corresponding to a given value of the independent variable x. In the case when the dependence between one dependent variable Y and several independent variables X1, X2, ..., Xm is considered, we speak of multiple linear regression.

In this case, the regression equation has the form

y = a0 + a1x1 + a2x2 + … + amxm,

where a0, a1, a2, …, am are regression coefficients that require determination.

The coefficients of the regression equation are determined by the least squares method, which achieves the minimum possible sum of squared differences between the actual values of the variable Y and those calculated from the regression equation. Thus a linear regression equation can be constructed even in the case when there is no linear correlation.

A measure of the effectiveness of a regression model is the coefficient of determination R² (R-square). The coefficient of determination can take values between 0 and 1; it indicates the degree of accuracy with which the resulting regression equation describes (approximates) the original data. The significance of the regression model is also examined using the F-test (Fisher), and the reliability of the difference of the coefficients a0, a1, a2, ..., am from zero is checked using Student's t-test.

In Excel, experimental data can be approximated by a linear equation in up to 16 variables:

y = a0 + a1x1 + a2x2 + … + a16x16

To obtain the linear regression coefficients, the "Regression" procedure from the analysis package can be used. Complete information about the linear regression equation is also provided by the LINEST function. In addition, the SLOPE and INTERCEPT functions can be used to obtain the parameters of the regression equation, and the TREND and FORECAST functions to obtain the predicted Y values at the desired points (for pairwise regression).

Let us consider in detail the use of the LINEST function (known_y, [known_x], [constant], [statistics]):

  • known_y is the range of known values of the dependent parameter Y. In paired regression analysis it can have any shape; in multiple regression it must be a row or a column;
  • known_x is the range of known values of one or more independent parameters. It must have the same shape as the Y range (for several parameters, several columns or rows, respectively);
  • constant is a logical argument. If, based on the practical meaning of the regression analysis problem, the regression line must pass through the origin, that is, the free coefficient must equal 0, the value of this argument should be set to 0 (or "FALSE"). If it is set to 1 (or "TRUE") or omitted, the free coefficient is calculated in the usual way;
  • statistics is a logical argument. If it is set to 1 (or "TRUE"), regression statistics are additionally returned (see the table), used to evaluate the effectiveness and significance of the model.

In general, for the paired regression y = ax + b, the result of applying the LINEST function has the form:

Table. Output range of the LINEST function for pairwise regression analysis

In the case of multiple regression analysis for the equation y = a0 + a1x1 + a2x2 + … + amxm, the first row displays the coefficients am, …, a1, a0, and the second row the standard errors of these coefficients. Rows 3-5 return #N/A everywhere except the first two columns, which are filled with regression statistics.

The LINEST function should be entered as an array formula: first select an array of the required size for the result (m + 1 columns and 5 rows if regression statistics are required), then complete the entry of the formula by pressing CTRL+SHIFT+ENTER.
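For readers working outside Excel, a rough Python analogue of what LINEST returns for pairwise regression may be useful (an illustrative sketch, not the exact Excel layout or algorithm):

```python
import numpy as np

def linest(y, x):
    """Coefficients, their standard errors, R-square, estimation error and F,
    roughly matching the statistics LINEST reports for y = ax + b."""
    X = np.column_stack([x, np.ones_like(x)])       # columns: slope, intercept
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ coef
    n, k = len(y), X.shape[1] - 1                   # sample size, factors
    ss_res = np.sum((y - y_hat) ** 2)               # unexplained sum of squares
    ss_reg = np.sum((y_hat - y.mean()) ** 2)        # explained sum of squares
    se2 = ss_res / (n - k - 1)                      # residual variance
    se = np.sqrt(np.diag(se2 * np.linalg.inv(X.T @ X)))  # std errors of coefs
    r2 = ss_reg / (ss_reg + ss_res)
    f = (r2 / k) / ((1 - r2) / (n - k - 1))
    return coef, se, r2, np.sqrt(se2), f

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.5, 1.6, 1.9, 2.1, 2.2])
print(linest(y, x))
```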

Result for our example:

In addition, the program has a built-in function - Data Analysis on the Data tab.

It can also be used to perform regression analysis:

The slide shows the result of regression analysis performed using Data Analysis.

SUMMARY OUTPUT

Regression statistics: Multiple R; R-square; Adjusted R-square; Standard error; Observations.

Analysis of variance: Regression; Significance F.

Coefficient table (rows: Y-intercept, Variable X 1; columns: Coefficients, Standard error, t-statistic, P-value, Lower 95%, Upper 95%, Lower 95.0%, Upper 95.0%). The numerical values of the report are not reproduced here.

The regression equations that we examined earlier were also built in MS Excel. To reproduce them, first build a scatter chart, then via the context menu select Add Trendline; in the window that opens, check the boxes to show the equation on the chart and to place the approximation reliability value (R^2) on the chart.

Literature:

  1. Gmurman V. E. Probability Theory and Mathematical Statistics. Textbook for universities. 10th ed., stereotyped. Moscow: Vysshaya Shkola, 2010. 479 p.
  2. Danko P. E., Popov A. G., Kozhevnikova T. Ya., Danko S. P. Higher Mathematics in Exercises and Problems. Textbook for universities. In 2 parts. 6th ed., stereotyped. Moscow: Oniks; Mir i Obrazovanie, 2007. 416 p.
  3. http://www.machinelearning.ru/wiki/index.php?title=%D0%A0%D0%B5%D0%B3%D1%80%D0%B5%D1%81%D1%81%D0%B8%D1%8F - some information about regression analysis

Regression analysis underlies the creation of most econometric models, including cost estimation models. To build valuation models, this method can be used if the number of analogues (comparable objects) n and the number of cost factors (comparison elements) k are related as n > (5…10) · k, i.e. there should be 5-10 times more analogues than cost factors. The same requirement on the ratio of the amount of data to the number of factors applies to other tasks: establishing a connection between the cost and the consumer parameters of an object; justifying the procedure for calculating corrective indices; identifying price trends; establishing a connection between wear and changes in influencing factors; obtaining dependences for calculating cost standards, etc. Compliance with this requirement is necessary in order to reduce the likelihood of working with a data sample that does not satisfy the requirement of normal distribution of random variables.

A regression relationship reflects only the average trend of change in the resulting variable, for example cost, in response to changes in one or more factor variables, for example location, number of rooms, area, floor, etc. This is the difference between a regression relationship and a functional one, in which the value of the resulting variable is strictly determined for a given value of the factor variables.

The presence of a regression relationship f between the resulting variable y and the factor variables x1, …, xk (factors) indicates that this relationship is determined not only by the influence of the selected factor variables, but also by the influence of variables some of which are generally unknown and others of which cannot be assessed and taken into account:

y = f(x1, …, xk) + ε

The influence of the unaccounted variables is represented by the second term ε of this equation, which is called the approximation error.

The following types of regression dependencies are distinguished:

  • paired regression - a relationship between two variables (the resulting variable and one factor variable);
  • multiple regression - a relationship between one resulting variable and two or more factor variables included in the study.

The main task of regression analysis is to quantify the closeness of the relationship between the variables (in paired regression) and between multiple variables (in multiple regression). The closeness of the connection is expressed quantitatively by the correlation coefficient.

The use of regression analysis makes it possible to establish the pattern of influence of the main factors (hedonic characteristics) on the indicator being studied, both in their entirety and for each of them separately. With the help of regression analysis, as a method of mathematical statistics, it is possible, firstly, to find and describe the form of the analytical dependence of the resulting (searched) variable on the factor ones and, secondly, to evaluate the closeness of this dependence.

By solving the first problem, a mathematical regression model is obtained, with the help of which the desired indicator is then calculated for given values ​​of the factors. Solving the second problem allows us to establish the reliability of the calculated result.

Thus, regression analysis can be defined as a set of formal (mathematical) procedures designed to measure the closeness, the direction, and the analytical form of the relationship between the resulting and factor variables, i.e. the output of such an analysis should be a structurally and quantitatively defined statistical model of the form

y = f(x1, …, xk),

where y is the average value of the resulting variable (the indicator sought, for example cost, rent, capitalization rate) over its n observations; xi is the value of the i-th factor variable (the i-th cost factor); k is the number of factor variables.

The function f(x1, …, xk), describing the dependence of the resulting variable on the factor variables, is called the regression equation (function). The term “regression” (from Latin regressio, a retreat, a return to something) is associated with the specifics of one of the particular problems solved at the stage when the method was formed; it no longer reflects the entire essence of the method, but continues to be used.

Regression analysis generally includes the following steps:

  • forming a sample of homogeneous objects and collecting initial information about these objects;
  • selecting the main factors influencing the resulting variable;
  • checking the sample for normality using the χ² test or the binomial criterion;
  • adopting a hypothesis about the form of the connection;
  • mathematical processing of the data;
  • obtaining a regression model;
  • assessing its statistical indicators;
  • performing verification calculations using the regression model;
  • analyzing the results.

The specified sequence of operations applies both when studying a paired relationship between one factor variable and the resulting variable, and when studying a multiple relationship between the resulting variable and several factor variables.

The use of regression analysis imposes certain requirements on the initial information:

  • the statistical sample of objects must be homogeneous in functional and structural-technological terms;
  • the sample must be sufficiently numerous;
  • the cost indicator under study - the resulting variable (price, cost, expenses) - must be brought to the same calculation conditions for all objects in the sample;
  • the factor variables must be measured sufficiently accurately;
  • the factor variables must be independent or minimally dependent.

The requirements for homogeneity and completeness of the sample are in conflict: the stricter the selection of objects based on their homogeneity, the smaller the sample obtained, and, conversely, to enlarge the sample it is necessary to include objects that are not very similar to each other.

After data on a group of homogeneous objects have been collected, they are analyzed to establish the form of the connection between the resulting and factor variables in the form of a theoretical regression line. The process of finding the theoretical regression line consists of a reasoned choice of the approximating curve and the calculation of the coefficients of its equation. The regression line is a smooth curve (in a particular case, a straight line) that describes, by means of a mathematical function, the general trend of the relationship under study and smooths out irregular, random outliers arising from the influence of side factors.

To represent paired regression dependences in valuation problems, the following functions are most often used: linear y = a0 + a1x + ε; power y = a0 · x^a1 + ε; exponential y = a0 · a1^x + ε; linear-exponential y = a0 + a1 · b^x + ε. Here ε is the approximation error caused by the action of unaccounted random factors.

In these functions, y is the resulting variable; x is the factor variable (factor); a0, a1, a2 are the parameters of the regression model, the regression coefficients.

The linear-exponential model belongs to the class of so-called hybrid models, whose equation combines additive components A, B and Z with a common multiplier Q; here xi (i = 1, …, l) are the values of the factors and bi (i = 0, …, l) are the coefficients of the regression equation.

In such an equation the components A, B and Z correspond to the cost of individual components of the asset being valued, for example the cost of a land plot and the cost of improvements, while the common parameter Q adjusts the value of all components of the asset being valued for a common influencing factor, such as location.

The factors appearing in the exponents of the corresponding coefficients are binary variables (0 or 1). The factors in the base of a power are discrete or continuous variables.

Factors associated with multiplication coefficients are also continuous or discrete.

Specification is carried out, as a rule, using an empirical approach and includes two stages:

  • plotting the points of the regression field on a graph;
  • graphical (visual) analysis of the type of possible approximating curve.

The type of the regression curve cannot always be selected immediately. To determine it, the points of the regression field are first plotted from the original data. Then a line is drawn visually along the position of the points, trying to discern the qualitative pattern of the connection: uniform growth or uniform decline, growth (decline) with an increasing (decreasing) rate of change, or a smooth approach to a certain level.

This empirical approach is complemented by logical analysis, starting from already known ideas about the economic and physical nature of the factors under study and their mutual influence.

For example, it is known that the dependences of resulting variables - economic indicators (price, rent) - on a number of factor variables - price-forming factors (distance from the center of the settlement, area, etc.) - are nonlinear in nature, and they can be described quite strictly by power, exponential or quadratic functions. But for small ranges of factor variation, acceptable results can also be obtained using a linear function.

If, however, it is still impossible to immediately make a confident choice of any one function, then two or three functions are selected, their parameters are calculated, and then, using the appropriate criteria for the closeness of the connection, the function is finally selected.

In regression theory, the process of finding the shape of the curve is called model specification, and the calculation of its coefficients model calibration.

If it is found that the resulting variable y depends on several factor variables (factors) x1, x2, …, xk, then a multiple regression model is built. Typically, three forms of multiple connection are used: linear y = a0 + a1x1 + a2x2 + … + akxk; exponential y = a0 · a1^x1 · a2^x2 · … · ak^xk; power y = a0 · x1^a1 · x2^a2 · … · xk^ak; or combinations thereof.

Exponential and power functions are more universal, since they approximate nonlinear relationships, which are the majority of those studied in the assessment of dependencies. In addition, they can be used when assessing objects and in the method of statistical modeling in mass assessment, and in the method of direct comparison in individual assessment when establishing correction factors.

At the calibration stage, the parameters of the regression model are calculated by the least squares method, the essence of which is that the sum of squared deviations of the calculated values of the resulting variable ŷi, i.e. those calculated from the selected connection equation, from the actual values yi must be minimal:

Q = Σ (yi − ŷi)² → min

The values ŷi and yi are known, therefore Q is a function only of the coefficients of the equation. To find the minimum of Q, one takes the partial derivatives of Q with respect to the coefficients of the equation and sets them equal to zero:

As a result, we obtain a system of normal equations, the number of which is equal to the number of determined coefficients of the desired regression equation.

Suppose we need to find the coefficients of the linear equation y = a0 + a1x. The sum of squared deviations has the form:

Q = Σ (i = 1..n) (yi − a0 − a1xi)²

Differentiating the function Q with respect to the unknown coefficients a0 and a1 and equating the partial derivatives to zero, after the transformations we obtain:

Σ yi = n·a0 + a1·Σ xi,
Σ xi yi = a0·Σ xi + a1·Σ xi²,

where n is the number of original actual values yi (the number of analogues).

The above procedure for calculating the coefficients of the regression equation is also applicable to nonlinear dependences, if these dependences can be linearized, i.e. reduced to a linear form by a change of variables. Power and exponential functions acquire a linear form after taking logarithms and an appropriate change of variables. For example, a power function after taking logarithms takes the form ln y = ln a0 + a1 ln x. After the change of variables Y = ln y, A0 = ln a0, X = ln x we obtain the linear function

Y = A0 + a1X,

whose coefficients are found in the manner described above.

The least squares method is also used to calculate the coefficients of a multiple regression model. Thus, the system of normal equations for a linear function with two variables x1 and x2, after a series of transformations, has the following form:

Σ y = n·a0 + a1·Σ x1 + a2·Σ x2,
Σ x1y = a0·Σ x1 + a1·Σ x1² + a2·Σ x1x2,
Σ x2y = a0·Σ x2 + a1·Σ x1x2 + a2·Σ x2².

Typically, this system of equations is solved by methods of linear algebra. A multiple power function is reduced to linear form by taking logarithms and changing variables in the same way as the paired power function.

When using hybrid models, multiple regression coefficients are found using numerical procedures of the method of successive approximations.

To make a final choice from several regression equations, it is necessary to test each equation for the strength of the relationship, which is measured by the correlation coefficient, variance and coefficient of variation. Student's and Fisher's tests can also be used for evaluation. The greater the closeness of the connection a curve exhibits, the more preferable it is, all other things being equal.

If a problem of this class is being solved, when it is necessary to establish the dependence of a cost indicator on cost factors, then the desire to take into account as many influencing factors as possible, and thereby build a more accurate multiple regression model, is understandable. However, expanding the number of factors is hampered by two objective limitations. First, building a multiple regression model requires a much larger sample of objects than building a paired model. It is generally accepted that the number of objects in the sample should exceed the number of factors k by at least 5-10 times. It follows that to build a model with three influencing factors it is necessary to collect a sample of approximately 20 objects with different sets of factor values. Second, the factors selected for the model must, in their influence on the cost indicator, be sufficiently independent of each other. This is not easy to ensure, since the sample usually combines objects belonging to the same family, for which many factors change naturally from object to object.

The quality of regression models is usually checked using the following statistical indicators.

Standard deviation of the regression equation error (estimation error):

σe = √( Σ (yi − ŷi)² / (n − k − 1) ),

where n is the sample size (the number of analogues); k is the number of factors (cost factors); yi is the actual value of the resulting variable (for example, cost); ŷi is the calculated value of the resulting variable; yi − ŷi is the error not explained by the regression equation (Figure 3.2).

This indicator is also called the standard error of the estimate (RMS error). In the figure, the dots indicate specific sample values, the horizontal line marks the sample mean ȳ, and the sloping dash-dot line is the regression line.


Fig. 3.2 (scatter of sample values, sample mean line and regression line; figure not reproduced).

The standard deviation of the estimation error measures the deviation of the actual values of y from the corresponding calculated values ŷi obtained with the regression model. If the sample on which the model is based obeys the normal distribution law, it can be argued that 68% of the real values of y lie within ŷ ± σe of the regression line, and 95% within ŷ ± 2σe. This indicator is convenient because the units of σe match the units of y. In this regard, it can be used to indicate the accuracy of the result obtained in the valuation process. For example, in a certificate of value one can state that the market value V obtained with the regression model lies, with 95% probability, in the range from (V − 2σe) to (V + 2σe).
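A small sketch of computing the estimation error and the resulting 95% range (all numbers here are hypothetical):

```python
import numpy as np

def estimation_error(y_actual, y_model, k):
    """Standard error of the estimate for a model with k factors."""
    n = len(y_actual)
    return np.sqrt(np.sum((y_actual - y_model) ** 2) / (n - k - 1))

# Hypothetical actual and model-calculated values of the cost indicator.
y_actual = np.array([100.0, 120.0, 140.0, 155.0, 180.0])
y_model = np.array([98.0, 123.0, 138.0, 158.0, 176.0])

s_e = estimation_error(y_actual, y_model, k=1)
v = 150.0  # some value predicted by the model
print(f"95% range: {v - 2 * s_e:.1f} .. {v + 2 * s_e:.1f}")
```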

Coefficient of variation of the resulting variable:

var = σe / ȳ · 100%,

where ȳ is the average value of the resulting variable (Fig. 3.2).

In regression analysis, the coefficient of variation var is the standard deviation of the estimation error expressed as a percentage of the mean of the resulting variable. The coefficient of variation can serve as a criterion of the predictive quality of the resulting regression model: the smaller var, the higher the predictive quality of the model. The use of the coefficient of variation is preferable to the σe indicator, since it is a relative indicator. In practice, it can be recommended not to use a model whose coefficient of variation exceeds 33%, since in that case it cannot be said that the sample data obey the normal distribution law.

Coefficient of determination (squared multiple correlation coefficient):

R² = 1 − Σ (yi − ŷi)² / Σ (yi − ȳ)²

This indicator is used to analyze the overall quality of the resulting regression model. It indicates what percentage of the variation of the resulting variable is explained by the influence of all the factor variables included in the model. The coefficient of determination always lies between zero and one. The closer its value is to one, the better the model describes the original data series. The coefficient of determination can be represented differently:

R² = Σ (ŷi − ȳ)² / [Σ (ŷi − ȳ)² + Σ (yi − ŷi)²],

where Σ (ŷi − ȳ)² is the error (variation) explained by the regression model and Σ (yi − ŷi)² is the error unexplained by the regression model. From an economic point of view, this criterion allows us to judge what percentage of the price variation is explained by the regression equation.

An exact limit of acceptability of the indicator R² cannot be specified for all cases; both the sample size and the meaningful interpretation of the equation must be taken into account. As a rule, when studying data on objects of the same type obtained at approximately the same point in time, the value of R² does not exceed 0.6-0.7. If all forecast errors are zero, i.e. when the relationship between the resulting and factor variables is functional, then R² = 1.

Adjusted coefficient of determination:

R²adj = 1 − (1 − R²)·(n − 1)/(n − k − 1)

The need for an adjusted coefficient of determination is explained by the fact that as the number of factors k increases, the usual coefficient of determination almost always grows, while the number of degrees of freedom (n − k − 1) decreases. The adjustment always reduces the value of R², because (n − 1) > (n − k − 1). As a result, R²adj may even become negative. This means that R² was close to zero before the adjustment and that the proportion of the variance of the variable y explained by the regression equation is very small.

Of two variants of regression models that differ in the value of the adjusted coefficient of determination but have equally good other quality criteria, the variant with the larger value of the adjusted coefficient of determination is preferable. The coefficient of determination is not adjusted if (n − k) : k > 20.
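Both coefficients follow directly from the formulas above; a short sketch:

```python
import numpy as np

def r_squared(y, y_hat, k=None):
    """R-square; if the number of factors k is given, also the adjusted R-square."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    if k is None:
        return r2
    n = len(y)
    return r2, 1 - (1 - r2) * (n - 1) / (n - k - 1)

y = np.array([10.0, 12.0, 14.0, 15.0, 18.0])
y_hat = np.array([9.8, 12.3, 13.8, 15.8, 17.6])
print(r_squared(y, y_hat, k=1))
```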

Fisher's criterion:

F = [R² / (1 − R²)] · [(n − k − 1) / k]

This criterion is used to assess the significance of the coefficient of determination. The residual sum of squares Σ (yi − ŷi)² is a measure of the error of prediction by the regression of the known cost values yi. Its comparison with the regression sum of squares shows how many times better the regression dependence predicts the result than the mean ȳ does. There is a table of critical values FR of Fisher's criterion depending on the number of degrees of freedom of the numerator ν1 = k, of the denominator ν2 = n − k − 1, and on the significance level α. If the calculated value of Fisher's criterion exceeds the table value, then the hypothesis that the coefficient of determination is insignificant, i.e. that the connections embodied in the regression equation do not correspond to those actually existing, is rejected with probability p = 1 − α.

The average approximation error (average percentage deviation) is calculated as the average relative difference, expressed as a percentage, between the actual and calculated values of the resulting variable:

δ = (1/n) · Σ |(yi − ŷi) / yi| · 100%

The smaller the value of this indicator, the better the predictive quality of the model. If it does not exceed 7%, the model is highly accurate; δ > 15% indicates unsatisfactory accuracy of the model.

Standard error of a regression coefficient:

S_ai = σe · √((XᵀX)⁻¹)ii,

where ((XᵀX)⁻¹)ii is the i-th diagonal element of the matrix (XᵀX)⁻¹; k is the number of factors; X is the matrix of factor variable values; Xᵀ is the transposed matrix of factor variable values; (XᵀX)⁻¹ is the matrix inverse to XᵀX.

The smaller these indicators for each regression coefficient, the more reliable the estimate of the corresponding regression coefficient.

Student's test (t-statistic):

t_i = a_i / S_ai

This criterion allows us to measure the degree of reliability (significance) of the relationship determined by a given regression coefficient. If the calculated value t_i is greater than the table value t_cr, where ν = n − k − 1 is the number of degrees of freedom, then the hypothesis that this coefficient is statistically insignificant is rejected with probability (100 − α)%. There are special tables of the t-distribution that allow the critical value of the criterion to be determined from a given significance level α and the number of degrees of freedom ν. The most commonly used value of α is 5%.
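A sketch of both significance checks, taking the critical values from scipy instead of printed tables (n, k, R² and t are assumed figures for illustration):

```python
from scipy import stats

n, k = 20, 3       # sample size and number of factors (assumed)
r2 = 0.65          # coefficient of determination of the fitted model (assumed)
alpha = 0.05       # significance level

F = (r2 / k) / ((1 - r2) / (n - k - 1))
F_cr = stats.f.ppf(1 - alpha, k, n - k - 1)   # critical value of Fisher's test
print(F > F_cr)    # True -> R-square is significant at the 5% level

t = 2.4            # |coefficient / its standard error| (assumed)
t_cr = stats.t.ppf(1 - alpha / 2, n - k - 1)  # two-sided critical value
print(abs(t) > t_cr)  # True -> the coefficient is significant
```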

Multicollinearity, i.e. the effect of mutual relationships between the factor variables, leads to the need to be content with a limited number of them. If this is not taken into account, one can end up with an illogical regression model. To avoid the negative effect of multicollinearity, before building a multiple regression model the pairwise correlation coefficients r_xixj between the selected variables xi and xj are calculated:

r_xixj = (x̄ixj − x̄i · x̄j) / (σxi · σxj),

where x̄ixj is the average value of the product of the two factor variables; x̄i · x̄j is the product of the average values of the two factor variables; σxi and σxj are estimates of the standard deviations of the factor variables.

Two variables are considered regression-related to each other (i.e. collinear) if their pairwise correlation coefficient is strictly greater than 0.8 in absolute value. In this case, one of these variables must be excluded from consideration.
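A sketch of such a pairwise correlation screen on synthetic data (the fourth factor is deliberately made collinear with the first):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))               # hypothetical factor matrix
X[:, 3] = 0.95 * X[:, 0] + 0.05 * X[:, 3]  # make factor 4 collinear with factor 1

R = np.corrcoef(X, rowvar=False)           # pairwise correlation matrix
i_idx, j_idx = np.triu_indices_from(R, k=1)
for i, j in zip(i_idx, j_idx):
    if abs(R[i, j]) > 0.8:
        print(f"factors {i + 1} and {j + 1} are collinear: r = {R[i, j]:.2f}")
```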

In order to expand the possibilities of economic analysis of the resulting regression models, average elasticity coefficients are used, determined by the formula

E_i = a_i · (x̄i / ȳ),

where x̄i is the average value of the corresponding factor variable; ȳ is the average value of the resulting variable; a_i is the regression coefficient of the corresponding factor variable.

The elasticity coefficient shows by what percentage, on average, the value of the resulting variable changes when the factor variable changes by 1%, i.e. how the resulting variable reacts to a change in the factor variable: for example, how the price of a square meter of apartment area reacts to the distance from the city center.

From the point of view of analyzing the significance of a particular regression coefficient, it is useful to estimate the partial coefficient of determination, computed relative to the estimate of the variance of the resulting variable. This coefficient shows what percentage of the variation of the resulting variable is explained by the variation of the i-th factor variable included in the regression equation.

  • Hedonic characteristics are understood as characteristics of an object that reflect its useful (valuable) properties from the point of view of buyers and sellers.

A) Graphical analysis of simple linear regression.

The simple linear regression equation is y = a + bx. If there is a correlation between the random variables Y and X, then the value y = ŷ + ε,

where ŷ is the theoretical value of y obtained from the equation ŷ = f(x),

ε is the error of deviation of the theoretical equation ŷ from the actual (experimental) data.

The equation of the dependence of the average value ŷ on x, that is ŷ = f(x), is called the regression equation. Regression analysis consists of four stages:

1) setting the problem and establishing the causes of the connection;

2) delimiting the object of research and collecting statistical information;

3) selecting the connection equation based on analysis of the nature of the collected data;

4) calculating the numerical values of the characteristics of the correlation connections.

If two variables are related in such a way that a change in one corresponds to a systematic change in the other, then regression analysis is used to estimate and select the equation relating them, using the observed values of these variables. Unlike regression analysis, correlation analysis is used to assess the closeness of the relationship between X and Y.

Let's consider finding a straight line in regression analysis:

Theoretical regression equation.

The term "simple regression" indicates that the value of one variable is estimated from knowledge of one other variable. In contrast, multiple (multivariate) regression is used to estimate a variable from knowledge of two, three, or more variables. Let's look at the graphical analysis of simple linear regression.

Suppose we have the results of pre-employment selection tests together with subsequent labor productivity ratings. The data form two columns: selection test results (100-point scale), x, and labor productivity (20-point scale), y.

By plotting the points on a graph, we obtain a scatter diagram (field). We use it to analyze the results of selection tests and labor productivity.

Using the scatterplot, let's analyze the regression line. In regression analysis, at least two variables are always specified, and a systematic change in one variable is associated with a change in the other. The primary goal of regression analysis is to estimate the value of one variable when the value of the other is known. For our task, the estimate of labor productivity is what matters.

The independent variable in regression analysis is the quantity used as the basis for analyzing the other variable; in this case, it is the selection test results (along the X axis).

The dependent variable is the estimated quantity (along the Y axis). In regression analysis there can be only one dependent variable but more than one independent variable.

For simple regression analysis, the dependence can be represented in a two-coordinate system (x and y), with the X axis for the independent variable and the Y axis for the dependent variable. We plot a point for each pair of values. The resulting graph is called a scatterplot. Its construction is the second stage of regression analysis; the first is the selection of the variables to analyze and the collection of sample data. In our chart, the relationship between the sample data is linear.
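A scatterplot of this kind can be drawn, for example, with matplotlib; the scores below are invented, since the source table's values are not reproduced here:

```python
import matplotlib.pyplot as plt

# hypothetical sample: selection test scores (x) and productivity ratings (y)
x = [45, 52, 58, 63, 70, 74, 80, 85, 90, 95]   # 100-point scale
y = [7, 9, 8, 11, 12, 14, 13, 16, 17, 19]      # 20-point scale

plt.scatter(x, y)
plt.xlabel("Selection test score (out of 100)")
plt.ylabel("Productivity (out of 20)")
plt.title("Scatter diagram: test scores vs. productivity")
plt.show()
```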

To estimate the magnitude of the variable y from the variable x, it is necessary to determine the position of the line that best represents the relationship between x and y, based on the location of the points in the scatterplot; in our example, this is the productivity analysis. A line drawn through the scatter of points is a regression line. One way to construct a regression line is the freehand method, based on visual judgment; such a line can already be used to estimate labor productivity. When finding the equation of the regression line analytically, the least squares criterion is usually applied: the most suitable line is the one for which the sum of squared deviations is minimal.

The mathematical equation of a straight growth line expresses growth in arithmetic progression:

ŷ = a + bx.

The equation ŷ = a, with a single parameter, is the simplest type of relationship equation and is acceptable only for average values. To express the relationship between x and y more accurately, the proportionality coefficient b is introduced, which indicates the slope of the regression line.

B) Construction of a theoretical regression line.

The process of finding the line consists in choosing and justifying the type of curve and calculating the parameters a, b, c, etc. The construction process is called fitting (leveling), and the stock of curves offered by mathematical analysis is varied. Most often in economic problems, a family of curves is used whose equations are polynomials of positive integer powers.

1) ŷ = a + bx — equation of a straight line,

2) ŷ = a + b/x — equation of a hyperbola,

3) ŷ = a + bx + cx² — equation of a parabola,

where ŷ denotes the ordinates of the theoretical regression line.

Having chosen the type of equation, you need to find the parameters on which this equation depends. For example, the nature of the location of points in the scattering field showed that the theoretical regression line is straight.

A scatterplot allows you to represent labor productivity using regression analysis. In economics, regression analysis is used to predict many characteristics that affect the final product (taking into account pricing).

C) The least squares criterion for finding a straight line.

One criterion we might apply for a suitable regression line in a scatterplot is based on choosing the line for which the sum of squared errors is minimal.

The proximity of the scatter points to the straight line is measured by the ordinate segments (vertical deviations). These deviations can be positive or negative, but the sum of the squares of the deviations of the theoretical line from the experimental points is always positive, and it should be minimal. The fact that the scatter points do not all coincide with the regression line indicates a discrepancy between the experimental and theoretical data. Thus, no regression line other than the one found can give a smaller sum of squared deviations between the experimental and theoretical data. Therefore, having found the theoretical equation ŷ and the regression line, we satisfy the least squares requirement.

This is done using the relationship equation ŷ = a + bx and formulas for finding the parameters a and b. Taking the theoretical value ŷᵢ = a + bxᵢ and denoting the sum of squared deviations by f, we obtain the function

f(a, b) = Σ(yᵢ − a − bxᵢ)²

of the unknown parameters a and b. The values of a and b must minimize the function f and are found from the partial derivative equations ∂f/∂a = 0 and ∂f/∂b = 0. This is a necessary condition; for a positive definite quadratic function it is also a sufficient condition for finding a and b.

Let us derive the formulas for the parameters a and b from the partial derivative equations:

∂f/∂a = −2·Σ(yᵢ − a − bxᵢ) = 0,

∂f/∂b = −2·Σxᵢ(yᵢ − a − bxᵢ) = 0;

we obtain the system of normal equations:

Σyᵢ = n·a + b·Σxᵢ,

Σxᵢyᵢ = a·Σxᵢ + b·Σxᵢ²,

where, in terms of the arithmetic means x̄ and ȳ, the solution is

b = (Σxᵢyᵢ − n·x̄·ȳ) / (Σxᵢ² − n·x̄²), a = ȳ − b·x̄.

Substituting the numerical values, we find the parameters a and b.
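A minimal NumPy sketch of this computation on hypothetical data, solving the normal equations exactly as written above:

```python
import numpy as np

def fit_line(x, y):
    """Least-squares fit of ŷ = a + b·x via the normal equations."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    b = (n * np.sum(x * y) - x.sum() * y.sum()) / (n * np.sum(x**2) - x.sum()**2)
    a = y.mean() - b * x.mean()
    return a, b

a, b = fit_line([45, 52, 58, 63, 70], [7, 9, 8, 11, 12])
print(f"ŷ = {a:.3f} + {b:.3f}·x")
```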

There is also the concept of the mean approximation error,

ε̄ = (1/n) · Σ|(yᵢ − ŷᵢ)/yᵢ| · 100%,

the approximation coefficient, which measures how closely the model reproduces the data.

If ε̄ < 33%, the model is acceptable for further analysis;

if ε̄ > 33%, we take a hyperbola, parabola, etc. instead. This gives room for analysis in various situations.

Conclusion: by the criterion of the approximation coefficient, the most suitable line is the one for which ε̄ is minimal, and no other regression line for our problem gives a smaller deviation.
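A small sketch of the ε̄ check (the observed and fitted values are invented; the 33% threshold is the one stated above):

```python
import numpy as np

def approximation_error(y, y_hat):
    """Mean approximation error: ε̄ = (1/n)·Σ|(y_i − ŷ_i)/y_i|·100%."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.mean(np.abs((y - y_hat) / y)) * 100.0

y     = [7.0, 9.0, 8.0, 11.0, 12.0]   # observed values (hypothetical)
y_hat = [7.5, 8.6, 9.3, 10.4, 11.9]   # fitted values (hypothetical)
eps = approximation_error(y, y_hat)
print(f"ε̄ = {eps:.1f}% -> {'model acceptable' if eps < 33 else 'try another curve'}")
```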

D) Mean square error of the estimates and checking their typicality.

For a population in which the number of observations is small (n < 30), Student's t-test is used to check the typicality of the regression equation parameters. The actual values of the t-criterion are calculated as

t_a = a / m_a, t_b = b / m_b,

where m_a and m_b are the standard errors of the parameters, computed from the residual root-mean-square error. The obtained t_a and t_b are compared with the critical value t_k from the Student table for the accepted significance level (α = 0.01, i.e., 99% confidence, or α = 0.05, i.e., 95% confidence) and the number of degrees of freedom f = n − (m + 1), where m is the number of factor terms in the equation under study; for example, for y = a + bx the number of degrees of freedom is n − 2, where n is the number of observations.

The parameters are considered typical (significant) if t_a > t_k and t_b > t_k.
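A sketch of this typicality check for the straight line ŷ = a + bx (NumPy plus scipy.stats; the helper name and the data are hypothetical):

```python
import numpy as np
from scipy import stats

def coefficient_t_tests(x, y, alpha=0.05):
    """t-statistics t_a = a/m_a and t_b = b/m_b for ŷ = a + b·x,
    with the critical Student value at n − 2 degrees of freedom."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    b = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    a = y.mean() - b * x.mean()
    resid = y - (a + b * x)
    s = np.sqrt(np.sum(resid**2) / (n - 2))            # residual RMS error
    sxx = np.sum((x - x.mean())**2)
    m_b = s / np.sqrt(sxx)                             # standard error of b
    m_a = s * np.sqrt(np.sum(x**2) / (n * sxx))        # standard error of a
    t_k = stats.t.ppf(1 - alpha / 2, df=n - 2)         # critical table value
    return a / m_a, b / m_b, t_k

t_a, t_b, t_k = coefficient_t_tests([45, 52, 58, 63, 70, 74, 80],
                                    [7, 9, 8, 11, 12, 14, 13])
print(f"t_a = {t_a:.2f}, t_b = {t_b:.2f}, critical t_k = {t_k:.2f}")
```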

Conclusion: using the parameters of the regression equation that have been tested for typicality, a mathematical model of the relationship ŷ = f(x) is built. The parameters of the mathematical function used in the analysis (linear, hyperbola, parabola) receive specific quantitative values. The meaning of models obtained in this way is that they characterize the average value of the resulting characteristic ŷ as a function of the factor characteristic x.

E) Curvilinear regression.

Quite often a curvilinear relationship occurs, in which the character of the connection between the variables changes: the intensity of the increase (or decrease) in Y depends on the level of X. There are different types of curvilinear dependence. For example, consider the relationship between crop yield and precipitation. With increasing precipitation under otherwise equal natural conditions, yield grows intensively, but only up to a certain limit. Beyond the critical point, precipitation becomes excessive and yields drop catastrophically. The example shows that the relationship was at first positive and then negative. The critical point is the optimal level of attribute X, which corresponds to the maximum or minimum value of attribute Y.

In economics, such a relationship is observed between price and consumption, productivity and experience.

Parabolic dependence.

If the data show that as the factor characteristic increases the resulting characteristic increases unevenly (accelerating, slowing, or eventually reversing), a second-order equation (parabola) is taken as the regression equation:

ŷ = a + bx + cx².

The coefficients a, b, c are found from the partial derivative equations ∂f/∂a = ∂f/∂b = ∂f/∂c = 0, which give the system of normal equations:

Σy = n·a + b·Σx + c·Σx²,

Σxy = a·Σx + b·Σx² + c·Σx³,

Σx²y = a·Σx² + b·Σx³ + c·Σx⁴.
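Rather than solving the three normal equations by hand, a sketch can use NumPy's least-squares polynomial fit, which solves the same system (the data are invented):

```python
import numpy as np

# hypothetical data with a visible bend (rise, then fall)
x = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
y = np.array([2.1, 3.9, 5.2, 5.8, 5.9, 5.3, 4.2])

# polyfit minimizes the same sum of squared deviations as the normal equations
c2, c1, c0 = np.polyfit(x, y, deg=2)   # coefficients, highest power first
print(f"ŷ = {c0:.3f} + {c1:.3f}·x + {c2:.3f}·x²")
```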

Types of curvilinear equations:

ŷ = a + bx + cx² — parabola,

ŷ = a + b/x — hyperbola.

We have reason to assume a curvilinear relationship between labor productivity and selection test scores: as scores increase, at some level productivity will begin to decrease, so the straight-line model may turn out to be curvilinear.

The third model will be a hyperbola: in all the equations, the variable x is replaced by the expression 1/x.
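A sketch of that substitution on invented data: replacing x with 1/x reduces the hyperbola ŷ = a + b/x to a straight-line fit in the new variable z = 1/x:

```python
import numpy as np

x = np.array([1, 2, 4, 5, 8, 10], dtype=float)
y = np.array([9.8, 6.1, 4.2, 3.9, 3.1, 2.9])

z = 1.0 / x                       # the substituted variable
b, a = np.polyfit(z, y, deg=1)    # straight-line fit: slope b, intercept a
print(f"ŷ = {a:.3f} + {b:.3f}/x")
```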

What is regression?

Consider two continuous variables x = (x₁, x₂, ..., xₙ) and y = (y₁, y₂, ..., yₙ).

Let's place the points on a two-dimensional scatterplot and say that we have a linear relationship if the data are approximated by a straight line.

If we believe that y depends on x, and changes in y are caused precisely by changes in x, we can determine the regression line (regression y on x), which best describes the linear relationship between these two variables.

The statistical use of the word regression comes from the phenomenon known as regression to the mean, attributed to Sir Francis Galton (1889).

He showed that although tall fathers tend to have tall sons, the average height of sons is shorter than that of their tall fathers. The average height of sons "regressed" and "moved backward" towards the average height of all fathers in the population. Thus, on average, tall fathers have shorter (but still quite tall) sons, and short fathers have taller (but still quite short) sons.

Regression line

The mathematical equation that estimates a simple (pairwise) linear regression line is:

Y = a + bx.

x is called the independent variable or predictor.

Y is the dependent variable or response variable; it is the value we expect for y (on average) if we know the value of x, i.e., the "predicted value of y".

  • a is the free term (intercept) of the estimated line: the value of Y when x = 0 (Fig. 1).
  • b is the slope or gradient of the estimated line; it represents the amount by which Y increases on average if we increase x by one unit.
  • a and b are called the regression coefficients of the estimated line, although this term is often used only for b.

Pairwise linear regression can be extended to include more than one independent variable; in this case it is known as multiple regression.

Fig. 1. Linear regression line showing the intercept a and the slope b (the amount by which Y increases when x increases by one unit).

Least squares method

We perform regression analysis using a sample of observations, where a and b are sample estimates of the true (population) parameters α and β, which determine the linear regression line in the population.

The simplest method for determining the coefficients a and b is the least squares method (OLS).

The fit is assessed by examining the residuals (the vertical distance of each point from the line, e.g., residual = observed y − predicted y; Fig. 2).

The line of best fit is chosen so that the sum of the squares of the residuals is minimal.

Fig. 2. Linear regression line with the residuals shown (vertical dotted lines) for each point.

Linear Regression Assumptions

So, for each observed value, the residual is equal to the difference between the observed value and the corresponding predicted value. Each residual can be positive or negative.

You can use the residuals to test the following assumptions behind linear regression:

  • the residuals are normally distributed with a mean of zero;
  • the relationship between x and y is linear;
  • the variance of the residuals is constant across the values of x.

If the assumptions of linearity, normality, and/or constant variance are questionable, we can transform x or y and calculate a new regression line for which these assumptions are satisfied (for example, by using a logarithmic transformation).
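A brief sketch of checking the zero-mean and normality assumptions on the residuals (using scipy.stats; the residuals are invented):

```python
import numpy as np
from scipy import stats

# hypothetical residuals from a fitted regression line
resid = np.array([0.4, -0.7, 1.1, -0.2, 0.3, -0.9, 0.5, -0.4, 0.6, -0.7])

print("mean of residuals:", resid.mean())   # should be close to zero
stat, p = stats.shapiro(resid)              # Shapiro-Wilk normality test
print(f"Shapiro-Wilk p = {p:.3f}")          # a small p suggests non-normality
# if the checks fail, refit on transformed data, e.g. log(y), and recheck
```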

Anomalous values (outliers) and influential points

An "influential" observation, if omitted, changes one or more model parameter estimates (ie, slope or intercept).

An outlier (an observation that is inconsistent with the majority of values ​​in a data set) can be an "influential" observation and can be easily detected visually by inspecting a bivariate scatterplot or residual plot.

Both for outliers and for "influential" observations, models are fitted with and without their inclusion, and attention is paid to the change in the estimates (regression coefficients).

When conducting an analysis, you should not automatically discard outliers or influential points, since simply ignoring them can affect the results obtained. Always study the reasons for these outliers and analyze them.

Linear regression hypothesis

When constructing linear regression, the null hypothesis is tested that the general slope of the regression line β is equal to zero.

If the slope of the line is zero, there is no linear relationship between x and y: changes in x do not affect y.

To test the null hypothesis that the true slope is zero, you can use the following algorithm:

Calculate the test statistic equal to the ratio t = b / SE(b), which follows a t-distribution with n − 2 degrees of freedom, where the standard error of the coefficient b is

SE(b) = s / √(Σ(xᵢ − x̄)²),

and s² = Σ(yᵢ − ŷᵢ)² / (n − 2) is the estimate of the variance of the residuals.

Typically, if the achieved significance level is below 0.05, the null hypothesis is rejected.

A 95% confidence interval for the true slope β is

b ± t* × SE(b),

where t* is the percentage point of the t-distribution with n − 2 degrees of freedom that gives a two-sided probability of 0.05. This is the interval that contains the general slope with a probability of 95%.

For large samples, we can approximate t* by 1.96 (that is, the test statistic tends to be normally distributed).
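A compact sketch of this test and interval using scipy.stats.linregress (the paired sample is invented):

```python
import numpy as np
from scipy import stats

x = np.array([45, 52, 58, 63, 70, 74, 80, 85, 90, 95], dtype=float)
y = np.array([7, 9, 8, 11, 12, 14, 13, 16, 17, 19], dtype=float)

res = stats.linregress(x, y)            # slope b, its stderr, p-value for H0: β = 0
n = len(x)
t_star = stats.t.ppf(0.975, df=n - 2)   # two-sided 95% percentage point
ci = (res.slope - t_star * res.stderr, res.slope + t_star * res.stderr)
print(f"b = {res.slope:.3f}, p = {res.pvalue:.4f}, "
      f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```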

Assessing the quality of linear regression: coefficient of determination R 2

Because of the linear relationship between x and y, we expect y to change as x changes; we call this the variation that is due to, or explained by, the regression. The residual variation should be as small as possible.

If this is true, then most of the variation will be explained by regression, and the points will lie close to the regression line, i.e. the line fits the data well.

The proportion of the total variance that is explained by the regression is called the coefficient of determination. It is usually expressed as a percentage and denoted R² (in paired linear regression it equals r², the square of the correlation coefficient), and it allows a subjective assessment of the quality of the regression equation.

The difference, 100% − R², represents the percentage of the variance that cannot be explained by the regression.

There is no formal test for evaluating R²; we must rely on subjective judgment to determine the goodness of fit of the regression line.

Applying a Regression Line to Forecast

You can use a regression line to predict a value of y from a value of x within the observed range (never extrapolate beyond these limits).

We predict the mean of y for observations that have a particular value of x by substituting that value into the equation of the regression line.

So, if we predict ŷ in this way, we can use the predicted value and its standard error to estimate a confidence interval for the true population mean.

Repeating this procedure for different values of x allows you to construct confidence limits for the entire line: the band or region that contains the true line, for example, at the 95% confidence level.
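One way to sketch such a confidence band for the mean response is with statsmodels (the data are the invented sample used above; get_prediction evaluates the interval at the requested x values):

```python
import numpy as np
import statsmodels.api as sm

x = np.array([45, 52, 58, 63, 70, 74, 80, 85, 90, 95], dtype=float)
y = np.array([7, 9, 8, 11, 12, 14, 13, 16, 17, 19], dtype=float)

model = sm.OLS(y, sm.add_constant(x)).fit()
new_x = sm.add_constant(np.array([50.0, 75.0]), has_constant="add")
pred = model.get_prediction(new_x)
print(pred.conf_int(alpha=0.05))   # 95% CI for the mean of y at x = 50 and 75
```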

Simple regression designs

Simple regression designs contain one continuous predictor. If there are 3 observations with predictor values P of 7, 4, and 9, and the design includes the first-order effect of P, then the design matrix X will be

  1  7
  1  4
  1  9

and the regression equation using P for X1 is

Y = b0 + b1 P

If a simple regression design contains a higher-order effect of P, such as a quadratic effect, then the values in column X1 of the design matrix are raised to the second power:

  1  49
  1  16
  1  81

and the equation takes the form

Y = b0 + b1·P²

Sigma-constrained and overparameterized coding methods do not apply to simple regression designs or other designs containing only continuous predictors (there are simply no categorical predictors to encode). Regardless of the coding method chosen, the values of the continuous variables are used directly as values of the X variables; no recoding is performed. Moreover, when describing regression designs, you can omit the design matrix X and work only with the regression equation.
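For instance, the two design matrices above can be formed directly in NumPy (a sketch):

```python
import numpy as np

P = np.array([7.0, 4.0, 9.0])                             # predictor values

X_linear    = np.column_stack([np.ones_like(P), P])       # first-order effect
X_quadratic = np.column_stack([np.ones_like(P), P**2])    # quadratic effect

print(X_linear)      # [[1. 7.] [1. 4.] [1. 9.]]
print(X_quadratic)   # [[1. 49.] [1. 16.] [1. 81.]]
```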

Example: Simple Regression Analysis

This example uses the data presented in the table:

Fig. 3. Table of initial data.

The data were compiled from a comparison of the 1960 and 1970 censuses for 30 randomly selected counties. County names are given as the observation names. Information about each variable is presented below:

Fig. 4. Table of variable specifications.

Research problem

For this example, we will analyze the predictors of poverty, that is, of the percentage of families below the poverty line. Therefore, we treat variable 3 (Pt_Poor) as the dependent variable.

We can put forward the hypothesis that changes in population size and the percentage of families below the poverty line are related. It seems reasonable to expect that poverty leads to out-migration, so there should be a negative correlation between the percentage of people below the poverty line and population change. Therefore, we treat variable 1 (Pop_Chng) as the predictor variable.

View results

Regression coefficients

Fig. 5. Regression coefficients of Pt_Poor on Pop_Chng.

At the intersection of the Pop_Chng row and the Param. column, the unstandardized coefficient for the regression of Pt_Poor on Pop_Chng is -0.40374. This means that for every unit decrease in population there is an increase in the poverty rate of 0.40374. The upper and lower (default) 95% confidence limits for this unstandardized coefficient do not include zero, so the regression coefficient is significant at the p < .05 level. Note also the standardized coefficient, which for simple regression designs is also the Pearson correlation coefficient: it equals -.65, meaning that for each standard-deviation decrease in population size there is a .65 standard-deviation increase in the poverty rate.

Variable distribution

Correlation coefficients can become significantly over- or underestimated if large outliers are present in the data. Let's study the distribution of the dependent variable Pt_Poor by county by building a histogram of it.

Fig. 6. Histogram of the Pt_Poor variable.

As you can see, the distribution of this variable differs markedly from a normal distribution. However, although two counties (the two rightmost bars) have a higher percentage of families below the poverty line than would be expected under a normal distribution, they appear to be "within range".

Fig. 7. Histogram of the Pt_Poor variable.

This judgment is somewhat subjective. A rule of thumb is to treat an observation as an outlier when it falls outside the interval mean ± 3 standard deviations. In that case, it is worth repeating the analysis with and without the outliers to make sure they do not seriously affect the correlation between the variables.

Scatterplot

If there is an a priori hypothesis about the relationship between the given variables, it is useful to test it on the corresponding scatterplot.

Fig. 8. Scatter diagram.

The scatterplot shows a clear negative correlation (-.65) between the two variables. It also shows the 95% confidence interval for the regression line, i.e., there is a 95% probability that the regression line lies between the two dotted curves.

Significance criteria

Fig. 9. Table containing the significance criteria.

The test for the Pop_Chng regression coefficient confirms that Pop_Chng is strongly related to Pt_Poor, p < .001.

Bottom line

This example showed how to analyze a simple regression design. Interpretations of the unstandardized and standardized regression coefficients were presented, the importance of examining the distribution of the dependent variable was discussed, and a technique for determining the direction and strength of the relationship between a predictor and a dependent variable was demonstrated.