What is regression?

Consider two continuous variables x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn).

Let's place the points on a two-dimensional scatter plot and say that we have a linear relation if the data are approximated by a straight line.

If we believe that y depends on x, and that changes in y are caused precisely by changes in x, we can determine the regression line (the regression of y on x), which best describes the linear relationship between these two variables.

The statistical use of the word regression comes from the phenomenon known as regression to the mean, attributed to Sir Francis Galton (1889).

He showed that although tall fathers tend to have tall sons, the average height of sons is shorter than that of their tall fathers. The average height of sons "regressed" and "moved backward" towards the average height of all fathers in the population. Thus, on average, tall fathers have shorter (but still quite tall) sons, and short fathers have taller (but still quite short) sons.

Regression line

The mathematical equation that estimates a simple (paired) linear regression line is:

Y = a + bx

x is called the independent variable or predictor.

Y is the dependent variable or response variable. This is the value we expect for y (on average) if we know the value of x, i.e. it is the "predicted value of y".

  • a is the free term (intercept) of the estimated line; this is the value of Y when x = 0 (Fig. 1).
  • b is the slope or gradient of the estimated line; it represents the amount by which Y increases on average if we increase x by one unit.
  • a and b are called the regression coefficients of the estimated line, although this term is often used only for b.

Pairwise linear regression can be extended to include more than one independent variable; in this case it is known as multiple regression.

Fig.1. Linear regression line showing the intercept a and the slope b (the amount Y increases as x increases by one unit)

Least squares method

We perform regression analysis using a sample of observations, where a and b are sample estimates of the true (population) parameters, α and β, which determine the linear regression line in the population.

The simplest method for determining the coefficients a and b is the least squares method (OLS).

The fit is assessed by looking at the residuals (the vertical distance of each point from the line; e.g., residual = observed y - predicted y, Fig. 2).

The line of best fit is chosen so that the sum of the squares of the residuals is minimal.

Fig. 2. Linear regression line with the residuals depicted (vertical dotted lines) for each point.
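As a minimal sketch of this criterion (hypothetical data; Python with numpy assumed), np.polyfit chooses a and b so that the sum of the squared residuals is as small as possible:

```python
import numpy as np

# Hypothetical sample: x is the predictor, y the response.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least squares fit of y = a + b*x; np.polyfit returns the highest
# power first, so the slope comes before the intercept.
b, a = np.polyfit(x, y, deg=1)

y_hat = a + b * x       # predicted values
residuals = y - y_hat   # observed y minus predicted y

print(f"a = {a:.3f}, b = {b:.3f}")
print("sum of squared residuals:", np.sum(residuals ** 2))
```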

Linear Regression Assumptions

So, for each observed value, the residual equals the difference between the observed y and the corresponding predicted ŷ. Each residual can be positive or negative.

You can use the residuals to check the following assumptions behind linear regression:

  • There is a linear relationship between x and y;
  • The residuals are normally distributed with a mean of zero;
  • The variance of the residuals is constant for all values of x.

If the assumptions of linearity, normality and/or constant variance are questionable, we can transform x or y and calculate a new regression line for which these assumptions are satisfied (for example, use a logarithmic transformation, etc.).
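As a sketch of such a transformation, assuming hypothetical, strictly positive y values: the response is log-transformed, the line is refitted, and the predictions are transformed back:

```python
import numpy as np

# Hypothetical data in which the spread of y grows with its level.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.7, 7.1, 19.5, 55.0, 148.0])  # must be positive for log

y_log = np.log(y)                    # transform the response
b, a = np.polyfit(x, y_log, deg=1)   # fit the line on the log scale
y_pred = np.exp(a + b * x)           # back-transform the predictions
print(np.round(y_pred, 1))
```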

Anomalous values (outliers) and influential points

An "influential" observation, if omitted, changes one or more model parameter estimates (ie, slope or intercept).

An outlier (an observation that is inconsistent with the majority of values in the data set) can be an "influential" observation, and it can easily be detected visually by inspecting a bivariate scatterplot or a residual plot.

For both outliers and "influential" observations (points), the model is fitted both with and without their inclusion, and attention is paid to changes in the estimates (regression coefficients).

When conducting an analysis, you should not automatically discard outliers or influential points, since simply ignoring them can affect the results obtained. Always study the reasons for these outliers and analyze them.

Linear regression hypothesis

When constructing a linear regression, the null hypothesis is tested that the population slope of the regression line, β, is equal to zero.

If the slope of the line is zero, there is no linear relationship between x and y: changes in x do not affect y.

To test the null hypothesis that the true slope is zero, you can use the following algorithm:

Calculate the test statistic

T = b / SE(b),

which follows a t-distribution with n - 2 degrees of freedom, where the standard error of the coefficient b is

SE(b) = s_res / √( Σ (x_i - x̄)² ),

and s_res² is the estimate of the variance of the residuals.

Typically, if the significance level p < 0.05 is reached, the null hypothesis is rejected.

The 95% confidence interval for the slope is

b ± t* × SE(b),

where t* is the percentage point of the t-distribution with n - 2 degrees of freedom that gives a probability of 0.05 for a two-sided test.

This is the interval that contains the population slope β with a probability of 95%.

For large samples, say n > 100, we can approximate t* with the value 1.96 (that is, the test statistic will tend to be normally distributed).
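A sketch of this algorithm on hypothetical data; scipy's t-distribution supplies the two-sided p-value and the percentage point:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.9, 4.2, 4.8, 6.1, 6.4])
n = len(x)

b, a = np.polyfit(x, y, deg=1)
resid = y - (a + b * x)

s_res = np.sqrt(np.sum(resid ** 2) / (n - 2))        # residual std. deviation
se_b = s_res / np.sqrt(np.sum((x - x.mean()) ** 2))  # SE of the slope

t_stat = b / se_b                                    # H0: beta = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)      # two-sided p-value

t_crit = stats.t.ppf(0.975, df=n - 2)                # 97.5% point, n-2 df
ci = (b - t_crit * se_b, b + t_crit * se_b)          # 95% CI for beta
print(t_stat, p_value, ci)
```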

Assessing the quality of linear regression: the coefficient of determination R²

Because of the linear relationship between x and y, we expect that y changes as x changes, and we call this the variation that is due to, or explained by, the regression. The residual variation should be as small as possible.

If this is true, then most of the variation in y will be explained by the regression, and the points will lie close to the regression line, i.e. the line fits the data well.

The share of the total variance that is explained by the regression is called the coefficient of determination. It is usually expressed as a percentage and denoted R² (in paired linear regression it is the quantity r², the square of the correlation coefficient), and it allows you to subjectively assess the quality of the regression equation.

The difference (100% - R²) represents the percentage of variance that cannot be explained by the regression.

There is no formal test to evaluate R²; we must rely on subjective judgment to determine the goodness of fit of the regression line.
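A short sketch (hypothetical data) that computes R² from the residual and total sums of squares and confirms that, in paired regression, it equals the squared Pearson correlation coefficient:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.8, 4.1, 5.9, 8.2, 9.9])

b, a = np.polyfit(x, y, deg=1)
y_hat = a + b * x

ss_res = np.sum((y - y_hat) ** 2)     # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation
r2 = 1 - ss_res / ss_tot              # coefficient of determination

# In paired linear regression R^2 equals the squared Pearson r:
r = np.corrcoef(x, y)[0, 1]
assert np.isclose(r2, r ** 2)
print(f"R^2 = {r2:.4f}  (explains {100 * r2:.1f}% of the variance)")
```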

Applying a Regression Line to Forecast

You can use the regression line to predict a value of y from a value of x within the observed range (never extrapolate beyond these limits).

We predict the mean of y for observations that have a particular value of x by plugging that value into the equation of the regression line.

So, if we predict ŷ for some value x0, we use this predicted value and its standard error to estimate a confidence interval for the true population mean.

Repeating this procedure for different values of x allows you to construct confidence limits for the line as a whole. This is the band or area that contains the true line, for example at the 95% confidence level.

Simple regression designs

Simple regression designs contain one continuous predictor. If there are 3 observations with predictor values P, such as 7, 4, and 9, and the design includes a first-order effect of P, then the design matrix X will be

1 7
1 4
1 9

A regression equation using P for X1 looks like

Y = b0 + b1 P

If a simple regression design contains a higher-order effect of P, such as a quadratic effect, then the values in column X1 of the design matrix are raised to the second power:

1 49
1 16
1 81

and the equation will take the form

Y = b0 + b1 P²

Sigma-constrained and overparameterized coding methods do not apply to simple regression designs or to other designs containing only continuous predictors (because there are simply no categorical predictors to encode). Regardless of the coding method chosen, the values of the continuous variables are raised to the appropriate power and used as the values for the X variables; no recoding is performed. In addition, when describing regression designs, you can omit consideration of the design matrix X and work only with the regression equation.
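A sketch of these design matrices, using the predictor values 7, 4, and 9 from the text; the response y is hypothetical:

```python
import numpy as np

P = np.array([7.0, 4.0, 9.0])  # predictor values from the text

# First-order design: a column of ones (intercept) and P itself.
X1 = np.column_stack([np.ones_like(P), P])

# Quadratic design: the predictor column is squared.
X2 = np.column_stack([np.ones_like(P), P ** 2])

# OLS estimates for a hypothetical response y:
y = np.array([12.0, 9.0, 15.0])
b0, b1 = np.linalg.lstsq(X1, y, rcond=None)[0]
print(X1, X2, (b0, b1), sep="\n")
```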

Example: Simple Regression Analysis

This example uses the data presented in the table:

Fig. 3. Table of initial data.

The data are compiled from a comparison of the 1960 and 1970 censuses in 30 randomly selected counties. County names are given as the observation names. Information about each variable is presented below:

Fig. 4. Table of variable specifications.

Research problem

For this example, we will analyze the predictors of poverty, that is, the variables that best predict the percentage of families below the poverty line. Therefore, we will treat variable 3 (Pt_Poor) as the dependent variable.

We can put forward a hypothesis: changes in population size and the percentage of families that are below the poverty line are related. It seems reasonable to expect that poverty leads to out-migration, so there would be a negative correlation between the percentage of people below the poverty line and population change. Therefore, we will treat variable 1 (Pop_Chng) as a predictor variable.

View results

Regression coefficients

Fig. 5. Regression coefficients of Pt_Poor on Pop_Chng.

At the intersection of the Pop_Chng row and the Param. column, the unstandardized coefficient for the regression of Pt_Poor on Pop_Chng is -.40374. This means that for every one-unit decrease in population change, there is an increase in the poverty rate of .40374. The upper and lower (default) 95% confidence limits for this unstandardized coefficient do not include zero, so the regression coefficient is significant at the p < .05 level. Note that the standardized coefficient, which is also the Pearson correlation coefficient for simple regression designs, is -.65; this means that for every standard-deviation decrease in population change there is a standard-deviation increase of .65 in the poverty rate.

Variable distribution

Correlation coefficients can become significantly overestimated or underestimated if large outliers are present in the data. Let's study the distribution of the dependent variable Pt_Poor by county. To do this, let's build a histogram of the variable Pt_Poor.

Fig. 6. Histogram of the Pt_Poor variable.

As you can see, the distribution of this variable differs markedly from the normal distribution. However, although two counties (the two right columns) have a higher percentage of families below the poverty line than would be expected under a normal distribution, they appear to be "within the range."

Fig. 7. Histogram of the Pt_Poor variable.

This judgment is somewhat subjective. The rule of thumb is that an observation (or observations) should be treated as an outlier if it does not fall within the interval (mean ± 3 standard deviations). In this case, it is worth repeating the analysis both with and without the outliers to make sure that they do not seriously affect the correlation between population change and the poverty rate.

Scatterplot

If there is an a priori hypothesis about the relationship between the given variables, it is useful to test it on the corresponding scatterplot.

Fig. 8. Scatterplot.

The scatterplot shows a clear negative correlation (-.65) between the two variables. It also shows the 95% confidence interval for the regression line, i.e., there is a 95% probability that the regression line lies between the two dotted curves.

Significance criteria

Fig. 9. Table containing the significance criteria.

The test for the Pop_Chng regression coefficient confirms that Pop_Chng is strongly related to Pt_Poor, p < .001.

Bottom line

This example showed how to analyze a simple regression design. Interpretations of the unstandardized and standardized regression coefficients were also presented. The importance of studying the distribution of the dependent variable was discussed, and a technique for determining the direction and strength of the relationship between a predictor and a dependent variable was demonstrated.

A) Graphical analysis of simple linear regression.

The simple linear regression equation is y = a + bx. If there is a correlation between the random variables Y and X, then the value y = ŷ + ε,

where ŷ is the theoretical value of y obtained from the equation ŷ = f(x),

ε is the error of deviation of the theoretical equation ŷ from the actual (experimental) data.

The equation for the dependence of the average value ŷ on x, that is, ŷ = f(x), is called the regression equation. Regression analysis consists of four stages:

1) setting the problem and establishing the reasons for the connection.

2) limitation of the research object, collection of statistical information.

3) selection of the relationship equation based on the analysis and nature of the data collected.

4) calculation of the numerical values of the characteristics of the correlation relationship.

If two variables are related in such a way that a change in one variable corresponds to a systematic change in the other, then regression analysis is used to estimate and select the equation of the relationship between them. Unlike regression analysis, correlation analysis is used to analyze the closeness of the relationship between X and Y.

Let's consider finding a straight line in regression analysis:

Theoretical regression equation.

The term "simple regression" indicates that the value of one variable is estimated based on knowledge about another variable. Unlike simple multivariate regression, it is used to estimate a variable based on knowledge of two, three or more variables. Let's look at the graphical analysis of simple linear regression.

Suppose we have the results of pre-employment screening tests and labor productivity data.

Selection results (100 points), x

Productivity (20 points), y

By plotting the points on a graph, we obtain a scatter diagram (field). We use it to analyze the results of selection tests and labor productivity.

Let's analyze the regression line using the scatterplot. In regression analysis, at least two variables are always specified, and a systematic change in one variable is associated with a change in the other. The primary goal of regression analysis is to estimate the value of one variable if the value of another variable is known. For our specific task, the estimate of labor productivity is what matters.

The independent variable in regression analysis is the quantity that is used as the basis for analyzing another variable. In this case, these are the results of the selection tests (along the X axis).

The dependent variable is the estimated value (along the Y axis). In regression analysis, there can be only one dependent variable and more than one independent variable.

For simple regression analysis, the dependence can be represented in a two-coordinate system (x and y), with the X axis for the independent variable and the Y axis for the dependent variable. We plot the points so that each pair of values is represented on the graph. The resulting graph is called a scatterplot. Its construction is the second stage of regression analysis, since the first is the selection of the analyzed values and the collection of sample data. In our chart, the relationship between the sample data is linear.

To estimate the magnitude of the variable y based on the variable x, it is necessary to determine the position of the line that best represents the relationship between x and y, based on the location of the points on the scatterplot. In our example, this is the analysis of productivity. The line drawn through the scattered points is the regression line. One way to construct a regression line, based on visual experience, is the freehand method. Our regression line can be used to determine labor productivity. When finding the equation of the regression line, the least squares criterion is often used: the most suitable line is the one for which the sum of the squared deviations is minimal.

The mathematical equation of a growth line represents the law of growth in an arithmetic progression:

y = a + bX.

An equation with a single factor is the simplest type of relationship equation. It is acceptable for average values. To express the relationship between X and y more accurately, the proportionality coefficient b is introduced, which indicates the slope of the regression line.

B) Construction of a theoretical regression line.

The process of finding it consists in choosing and justifying the type of curve and calculating the parameters a, b, c, etc. The construction process is called fitting, and the family of curves offered by mathematical analysis is varied. Most often, in economic problems, a family of curves is used whose equations are expressed by polynomials of positive integer powers.

1) ŷ = a + bx, the equation of a straight line;

2) ŷ = a + b/x, the equation of a hyperbola;

3) ŷ = a + bx + cx², the equation of a parabola,

where ŷ are the ordinates of the theoretical regression line.

Having chosen the type of equation, you need to find the parameters on which this equation depends. For example, the nature of the location of points in the scattering field showed that the theoretical regression line is straight.

A scatterplot allows you to represent labor productivity using regression analysis. In economics, regression analysis is used to predict many characteristics that affect the final product (taking into account pricing).

C) The least squares criterion for finding a straight line.

One criterion we might apply for a suitable regression line in a scatterplot is based on choosing the line for which the sum of squared errors is minimal.

The proximity of the scatter points to the straight line is measured by the ordinates of the segments. The deviations of these points can be positive or negative, but the sum of the squares of the deviations of the theoretical line from the experimental points is always positive and should be minimal. The fact that not all scatter points coincide with the position of the regression line indicates a discrepancy between the experimental and theoretical data. Thus, we can say that no other regression line, except the one found, can give a smaller sum of squared deviations between the experimental and theoretical data. Therefore, having found the theoretical equation ŷ and the regression line, we satisfy the least squares requirement.

This is done using the relationship equation ŷ = a + bx and formulas for finding the parameters a and b. Taking the theoretical value ŷ = a + bx and denoting the sum of squared deviations by f, we obtain the function

f(a, b) = Σ (y_i - a - bx_i)²

of the unknown parameters a and b. The values of a and b that minimize f are found from the partial derivative equations ∂f/∂a = 0 and ∂f/∂b = 0. This is a necessary condition; however, for a positive quadratic function it is also a sufficient condition for finding a and b.

Let us derive the formulas for the parameters a and b from the partial derivative equations. Setting ∂f/∂a = 0 and ∂f/∂b = 0, we obtain the system of normal equations:

Σy = n·a + b·Σx
Σxy = a·Σx + b·Σx²

Solving this system gives

b = (Σxy - n·x̄·ȳ) / (Σx² - n·x̄²),   a = ȳ - b·x̄,

where x̄ and ȳ are the arithmetic means of x and y. Substituting numerical values, we find the parameters a and b.
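A minimal numeric check of these formulas on hypothetical data: the normal-equation system is built and solved directly:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([3.1, 5.2, 6.8, 9.1])
n = len(x)

# Normal equations:
#   sum(y)  = n*a      + b*sum(x)
#   sum(xy) = a*sum(x) + b*sum(x^2)
A = np.array([[n,       x.sum()],
              [x.sum(), (x ** 2).sum()]])
rhs = np.array([y.sum(), (x * y).sum()])
a, b = np.linalg.solve(A, rhs)
print(f"a = {a:.4f}, b = {b:.4f}")  # same result as np.polyfit(x, y, 1)
```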

There is also the concept of the approximation coefficient (the mean approximation error):

ē = (1/n) · Σ | (y_i - ŷ_i) / y_i | · 100%.

If ē < 33%, the model is acceptable for further analysis;

If ē > 33%, we take a hyperbola, parabola, etc. instead. This gives grounds for analysis in various situations.

Conclusion: by the criterion of the approximation coefficient, the most suitable line is the one for which ē is minimal, and no other regression line for our problem gives a smaller deviation between the experimental and theoretical data.
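A quick sketch of this check, assuming hypothetical observed and fitted values:

```python
import numpy as np

y = np.array([3.1, 5.2, 6.8, 9.1])      # observed (must be nonzero)
y_hat = np.array([3.0, 5.3, 6.9, 9.0])  # fitted by the regression

e_bar = np.mean(np.abs((y - y_hat) / y)) * 100  # mean approximation error, %
verdict = "model acceptable" if e_bar < 33 else "try another curve"
print(f"e = {e_bar:.2f}% -> {verdict}")
```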

D) Mean square error of the estimates and checking their typicality.

For a population in which the number of observations is small (n < 30), the Student's t-test is used to check the typicality (significance) of the regression equation parameters. The actual values of the t-criterion are calculated:

t_a = a / m_a,   t_b = b / m_b,

where m_a and m_b are the standard errors of a and b, computed from the residual root-mean-square error. The obtained t_a and t_b are compared with the critical value t_k from the Student's table, taking into account the accepted significance level (α = 0.01, i.e. 99%, or α = 0.05, i.e. 95%) and the degrees of freedom: k1 = m, the number of parameters of the equation under study (for y = a + bx, m = 2), and k2 = n - (m + 1), where n is the number of studied observations. A parameter is considered typical if its t value exceeds t_k; in our example, t_a < t_k < t_b.

Conclusion: using the parameters of the regression equation that have been tested for typicality, a mathematical model of the relationship ŷ = f(x) is built. In this case, the parameters of the mathematical function used in the analysis (linear, hyperbola, parabola) receive the corresponding quantitative values. The semantic content of models obtained in this way is that they characterize the average value of the resulting characteristic ŷ as a function of the factor characteristic x.

E) Curvilinear regression.

Quite often, a curvilinear relationship occurs when a changing relationship is established between variables. The intensity of the increase (decrease) depends on the level of X. There are different types of curvilinear dependence. For example, consider the relationship between crop yield and precipitation. With an increase in precipitation under equal natural conditions, there is an intensive increase in yield, but up to a certain limit. After the critical point, precipitation turns out to be excessive, and yields drop catastrophically. The example shows that at first the relationship was positive and then negative. The critical point is the optimal level of attribute X, which corresponds to the maximum or minimum value of attribute Y.

In economics, such a relationship is observed between price and consumption, productivity and experience.

Parabolic dependence.

If the data show that within a certain interval of factor values the character of the relationship changes (growth is followed by decline), then a second-order equation (parabola) is taken as the regression equation:

ŷ = a + bx + cx².

The coefficients a, b, c are found from the partial derivative equations. We get the system of normal equations:

Σy = n·a + b·Σx + c·Σx²
Σxy = a·Σx + b·Σx² + c·Σx³
Σx²y = a·Σx² + b·Σx³ + c·Σx⁴
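Solving this system by hand is tedious; the sketch below (hypothetical data) uses np.polyfit, whose degree-2 fit solves the same normal equations:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.8, 6.1, 5.9, 4.2])  # rises, then falls

# np.polyfit returns the highest power first: y ≈ a + b*x + c*x^2.
c, b, a = np.polyfit(x, y, deg=2)
print(f"a = {a:.3f}, b = {b:.3f}, c = {c:.3f}")
```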

Other types of curvilinear equations, such as power and exponential functions, are also used.

We have the right to assume that there is a curvilinear relationship between labor productivity and selection test scores: as scores increase, productivity will begin to decrease at some level, so the straight-line model may turn out to be curvilinear.

The third model will be a hyperbola: in all the equations, the variable x is replaced by the expression 1/x.

The main feature of regression analysis: with its help, you can obtain specific information about what form and nature the relationship between the variables under study has.

Sequence of stages of regression analysis

Let us briefly consider the stages of regression analysis.

    Problem formulation. At this stage, preliminary hypotheses about the dependence of the phenomena under study are formed.

    Definition of dependent and independent (explanatory) variables.

    Collection of statistical data. Data must be collected for each of the variables included in the regression model.

    Formulation of a hypothesis about the form of connection (simple or multiple, linear or nonlinear).

    Determination of the regression function (calculating the numerical values of the parameters of the regression equation).

    Assessing the accuracy of regression analysis.

    Interpretation of the results obtained. The obtained results of regression analysis are compared with preliminary hypotheses. The correctness and credibility of the results obtained are assessed.

    Predicting unknown values ​​of a dependent variable.

Using regression analysis, it is possible to solve the problems of forecasting and classification. Predicted values are calculated by substituting the values of the explanatory variables into the regression equation. The classification problem is solved in this way: the regression line divides the entire set of objects into two classes; the part of the set where the function value is greater than zero belongs to one class, and the part where it is less than zero belongs to the other class.

Regression Analysis Problems

Let's consider the main tasks of regression analysis: establishing the form of dependence, determining regression functions, estimation of unknown values ​​of the dependent variable.

Establishing the form of dependence.

The nature and form of the relationship between variables can form the following types of regression:

    positive linear regression (expressed in uniform growth of the function);

    positive uniformly accelerating regression;

    positive uniformly decelerating regression;

    negative linear regression (expressed as a uniform decline in the function);

    negative uniformly accelerating regression;

    negative uniformly decelerating regression.

However, the described varieties are usually not found in pure form, but in combination with each other. In this case, we talk about combined forms of regression.

Definition of the regression function.

The second task comes down to identifying the effect on the dependent variable of the main factors or causes, other things being equal, and subject to the exclusion of the influence of random elements on the dependent variable. The regression function is defined in the form of a mathematical equation of one type or another.

Estimation of unknown values ​​of the dependent variable.

The solution to this problem comes down to solving a problem of one of the following types:

    Estimation of the values ​​of the dependent variable within the considered interval of the initial data, i.e. missing values; in this case, the interpolation problem is solved.

    Estimation of future values ​​of the dependent variable, i.e. finding values ​​outside the specified interval of the source data; in this case, the problem of extrapolation is solved.

Both problems are solved by substituting the found parameter estimates for the values ​​of independent variables into the regression equation. The result of solving the equation is an estimate of the value of the target (dependent) variable.

Let's look at some of the assumptions that regression analysis relies on.

Linearity assumption: the relationship between the variables under consideration is assumed to be linear. In this example, we plotted a scatterplot and were able to see a clear linear relationship. If the scatter diagram of the variables shows a clear absence of a linear relationship, that is, a nonlinear relationship is present, nonlinear analysis methods should be used.

Normality assumption for the residuals. It assumes that the distribution of the differences between the predicted and observed values is normal. To visually determine the nature of the distribution, you can use histograms of the residuals.

When using regression analysis, its main limitation should be considered. It consists in the fact that regression analysis allows us to detect only dependencies, and not the connections underlying these dependencies.

Regression analysis allows you to estimate the strength of the relationship between variables by calculating the estimated value of a variable based on several known values.

Regression equation.

The regression equation looks like this: Y=a+b*X

Using this equation, the variable Y is expressed in terms of a constant a and the slope of the line b, multiplied by the value of the variable X. The constant a is also called the intercept term, and the slope b is called the regression coefficient or B-coefficient.

In most cases (if not always) there is a certain scatter of observations relative to the regression line.

A residual is the deviation of an individual point (observation) from the regression line (from its predicted value).

To solve a regression analysis problem in MS Excel, select Tools > Data Analysis from the menu and choose the Regression tool. Specify the input ranges X and Y. The input range Y is the range of dependent data being analyzed; it must consist of one column. The input range X is the range of independent data to be analyzed. The number of input ranges should not exceed 16.

At the output of the procedure, we obtain in the output range the report given in tables 8.3a-8.3c.

SUMMARY OUTPUT

Table 8.3a. Regression statistics

Regression statistics

Multiple R

R Square

Adjusted R Square

Standard Error

Observations

Let us first look at the top part of the calculations, presented in table 8.3a: the regression statistics.

The R Square value, also called the measure of certainty, characterizes the quality of the resulting regression line. This quality is expressed by the degree of correspondence between the source data and the regression model (the calculated data). The measure of certainty always lies within the interval [0; 1].

In most cases the R Square value falls between these extreme values, that is, between zero and one.

If the R Square value is close to one, the constructed model explains almost all of the variability of the corresponding variables. Conversely, an R Square value close to zero means poor quality of the constructed model.

In our example, the measure of certainty is 0.99673, which indicates a very good fit of the regression line to the original data.

Multiple R, the multiple correlation coefficient R, expresses the degree of dependence between the independent variables (X) and the dependent variable (Y).

Multiple R is equal to the square root of the coefficient of determination; this quantity takes values in the range from zero to one.

In simple linear regression analysis, Multiple R is equal to the Pearson correlation coefficient. Indeed, Multiple R in our case equals the Pearson correlation coefficient from the previous example (0.998364).

Table 8.3b. Regression coefficients

Coefficients

Standard Error

t Stat

Y-intercept

Variable X 1

* A truncated version of the calculations is provided

Now consider the middle part of the calculations, presented in table 8.3b. Here the regression coefficient b (2.305454545) and the intercept on the ordinate axis, that is, the constant a (2.694545455), are given.

Based on the calculations, we can write the regression equation as follows:

Y = 2.305454545·X + 2.694545455

The direction of the relationship between variables is determined based on the signs (negative or positive) of the regression coefficients (coefficient b).

If the sign of the regression coefficient is positive, the relationship between the dependent variable and the independent variable will be positive. In our case, the sign of the regression coefficient is positive, therefore, the relationship is also positive.

If the sign of the regression coefficient is negative, the relationship between the dependent variable and the independent variable is negative (inverse).

Table 8.3c presents the output of the residuals. For these results to appear in the report, you must activate the "Residuals" checkbox when running the "Regression" tool.

RESIDUAL OUTPUT

Table 8.3c. Residuals

Observation

Predicted Y

Residuals

Standard Residuals

Using this part of the report, we can see the deviation of each point from the constructed regression line. The largest residual in absolute value in our case is 0.778, the smallest is 0.043. To interpret these data better, we use the graph of the original data and the constructed regression line presented in Fig. 8.3. As you can see, the regression line is fitted quite accurately to the values of the original data.

It should be taken into account that the example under consideration is quite simple and it is not always possible to qualitatively construct a linear regression line.

Fig. 8.3. Source data and regression line

It remains to consider the problem of estimating unknown future values of the dependent variable from known values of the independent variable, that is, the forecasting problem.

Having the regression equation, the forecasting problem reduces to evaluating the equation Y = 2.305454545·X + 2.694545455 at known values of x. The results of predicting the dependent variable Y six steps ahead are presented in table 8.4.

Table 8.4. Y variable forecast results

Y(predicted)

Thus, as a result of using regression analysis in Microsoft Excel, we:

    built a regression equation;

    established the form of dependence and the direction of the relationship between the variables: positive linear regression, which is expressed in uniform growth of the function;

    assessed the quality of the resulting regression line;

    were able to see deviations of the calculated data from the data of the original set;

    predicted future values ​​of the dependent variable.

If the regression function is defined, interpreted, and justified, and the assessment of the accuracy of the regression analysis meets the requirements, then the constructed model and the predicted values can be considered sufficiently reliable.

The predicted values ​​obtained in this way are the average values ​​that can be expected.

In this work we reviewed the main characteristics of descriptive statistics, among them such concepts as the mean value, median, maximum, minimum, and other characteristics of data variation.

The concept of outliers was also briefly discussed. The characteristics considered relate to so-called exploratory data analysis; its conclusions may apply not to the general population but only to the data sample. Exploratory data analysis is used to draw preliminary conclusions and form hypotheses about the population.

The basics of correlation and regression analysis, their tasks and possibilities for practical use were also discussed.

Lecture 3.

Regression analysis.

1) Numerical characteristics of regression

2) Linear regression

3) Nonlinear regression

4) Multiple regression

5) Using MS EXCEL to perform regression analysis

Control and evaluation tool - test tasks

1. Numerical characteristics of regression

Regression analysis is a statistical method for studying the influence of one or more independent variables on a dependent variable. Independent variables are otherwise called regressors or predictors, and dependent variables are called criterion variables. The terminology of dependent and independent variables reflects only the mathematical dependence of the variables, and not cause-and-effect relationships.

Goals of Regression Analysis

  • Determining the degree of determination of the variation of a criterion (dependent) variable by predictors (independent variables).
  • Predicting the value of a dependent variable using the independent variable(s).
  • Determination of the contribution of individual independent variables to the variation of the dependent variable.

Regression analysis cannot be used to determine whether there is a relationship between variables, since the presence of such a relationship is a prerequisite for applying the analysis.

To conduct regression analysis, you first need to become familiar with the basic concepts of statistics and probability theory.

Basic numerical characteristics of discrete and continuous random variables: mathematical expectation, dispersion and standard deviation.

Random variables are divided into two types:

  • discrete, which can take only specific, predetermined values (for example, the numbers on the top face of a thrown die or the ordinal number of the current month);
  • continuous (most often, the values of some physical quantities: weight, distance, temperature, etc.), which, according to the laws of nature, can take any value, at least within a certain interval.

The distribution law of a discrete random variable is the correspondence between its possible values and their probabilities; it is usually written as a table:

The statistical definition of probability is expressed through the relative frequency of a random event: it is found as the ratio of the number of occurrences of the event to the total number of trials.

The mathematical expectation of a discrete random variable X is the sum of the products of the values of X and the probabilities of these values. The mathematical expectation is denoted by μ or M(X).

μ = M(X) = x1·p1 + x2·p2 + … + xn·pn = Σ (i = 1..n) xi·pi

The spread of a random variable around its mathematical expectation is described by a numerical characteristic called the variance (dispersion). Simply put, variance is the scatter of a random variable around its mean value. To understand the essence of variance, consider an example. The average salary in the country is about 25 thousand rubles. Where does this figure come from? Most likely, all salaries are added up and divided by the number of employees. In this case, there is a very large spread (the minimum salary is about 4 thousand rubles, and the maximum is about 100 thousand rubles). If everyone's salary were the same, the variance would be zero, and there would be no spread.

The variance of a discrete random variable X is the mathematical expectation of the squared difference between the random variable and its mathematical expectation:

D = M[ (X - M(X))² ]

Using the definition of the mathematical expectation to calculate the variance, we obtain the formula:

D = Σ (xi - M(X))² · pi

The variance has the dimension of the square of the random variable. In cases where it is necessary to have a numerical characteristic of the dispersion of possible values ​​in the same dimension as the random variable itself, the standard deviation is used.

The standard deviation of a random variable is the square root of its variance.

The standard deviation is a measure of the dispersion of the values ​​of a random variable around its mathematical expectation.

Example.

The distribution law of the random variable X is given by the following table (values 1, 2, 4, 5 with probabilities 0.1, 0.4, 0.4, 0.1):

Find its mathematical expectation, variance, and standard deviation.

We use the above formulas:

M(X) = 1·0.1 + 2·0.4 + 4·0.4 + 5·0.1 = 3

D = (1 - 3)²·0.1 + (2 - 3)²·0.4 + (4 - 3)²·0.4 + (5 - 3)²·0.1 = 1.6

σ = √D = √1.6 ≈ 1.26
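The same computation as a short Python sketch, using the values from this example:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0])  # values from the example table
p = np.array([0.1, 0.4, 0.4, 0.1])  # their probabilities

m = np.sum(x * p)             # mathematical expectation
d = np.sum((x - m) ** 2 * p)  # variance
sigma = np.sqrt(d)            # standard deviation
print(m, d, round(sigma, 3))  # 3.0, 1.6, 1.265
```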

Example.

In a cash lottery, 1 win of 1000 rubles, 10 wins of 100 rubles and 100 wins of 1 ruble each are played out with a total number of tickets of 10,000. Create a distribution law for the random win X for the owner of one lottery ticket and determine the mathematical expectation, dispersion and standard deviation of the random variable .

X1 = 1000, X2 = 100, X3 = 1, X4 = 0,

P1 = 1/10000 = 0.0001, P2 = 10/10000 = 0.001, P3 = 100/10000 = 0.01, P4 = 1 - (P1 + P2 + P3) = 0.9889.

Let's put the results in the table:

The mathematical expectation is the sum of the pairwise products of the values of the random variable and their probabilities. For this task, it is convenient to calculate it as

M(X) = 1000·0.0001 + 100·0.001 + 1·0.01 + 0·0.9889 = 0.21 rubles.

We have obtained the real, "fair" price of a ticket.

D = Σ (xi - M(X))²·pi = (1000 - 0.21)²·0.0001 + (100 - 0.21)²·0.001 + (1 - 0.21)²·0.01 + (0 - 0.21)²·0.9889 ≈ 109.97

Distribution function of continuous random variables

A quantity that, as a result of a test, takes one of its possible values (not known in advance) is called a random variable. As mentioned above, random variables can be discrete (discontinuous) or continuous.

Discrete is a random variable that takes on separate possible values ​​with certain probabilities that can be numbered.

Continuous is a random variable that can take all values ​​from some finite or infinite interval.

Up to this point, we were limited to only one “type” of random variables - discrete, i.e. taking finite values.

But the theory and practice of statistics require the use of the concept of a continuous random variable - allowing any numerical values ​​from any interval.

It is convenient to define the distribution law of a continuous random variable using the so-called probability density function f(x). The probability P(a < X < b) that the value taken by the random variable X falls in the interval (a; b) is determined by the equality

P(a < X < b) = ∫ (from a to b) f(x) dx

The graph of the function f(x) is called the distribution curve. Geometrically, the probability of the random variable falling in the interval (a; b) is equal to the area of the corresponding curvilinear trapezoid bounded by the distribution curve, the Ox axis, and the straight lines x = a, x = b.

P(a ≤ X ≤ b) = P(a < X < b): if a finite or countable set of points is removed from an event, the probability of the resulting event remains unchanged.

The function f(x), a numerical scalar function of the real argument x, is called the probability density; it exists at a point x if the following limit exists at this point:

f(x) = lim (Δx → 0) P(x < X < x + Δx) / Δx

Properties of the probability density:

  1. The probability density is a non-negative function, i.e. f(x) ≥ 0.
  2. ∫ (from -∞ to +∞) f(x) dx = 1 (if all values of the random variable X are contained in the interval (a; b), then the last equality can be written as ∫ (from a to b) f(x) dx = 1).

Let us now consider the function F(x) = P(X < x). This function is called the probability distribution function of the random variable X. The function F(x) exists for both discrete and continuous random variables. If f(x) is the probability density function of a continuous random variable X, then

F(x) = ∫ (from -∞ to x) f(t) dt

From the last equality it follows that f(x) = F′(x).

Sometimes the function f(x) is called the differential probability distribution function, and the function F(x) is called the cumulative probability distribution function.

Let us note the most important properties of the probability distribution function:

  1. F(x) is a non-decreasing function.
  2. F (- ∞) = 0.
  3. F (+ ∞) = 1.

The concept of distribution function is central to probability theory. Using this concept, we can give another definition of a continuous random variable. A random variable is called continuous if its cumulative distribution function F(x) is continuous.

Numerical characteristics of continuous random variables

The mathematical expectation, dispersion and other parameters of any random variables are almost always calculated using formulas arising from the distribution law.

For a continuous random variable, the mathematical expectation is calculated using the formula:

M(X) = ∫ (from -∞ to +∞) x·f(x) dx

Dispersion:

D(X) = ∫ (from -∞ to +∞) (x - M(X))²·f(x) dx, or D(X) = ∫ (from -∞ to +∞) x²·f(x) dx - (M(X))²
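A small sketch of these integrals for an assumed density (uniform on the interval (0, 2)), evaluated by numerical integration:

```python
import numpy as np
from scipy import integrate

# Example density: uniform on (0, 2), f(x) = 1/2 there, 0 elsewhere.
f = lambda x: 0.5

m, _ = integrate.quad(lambda x: x * f(x), 0, 2)             # M(X) = 1
d, _ = integrate.quad(lambda x: (x - m) ** 2 * f(x), 0, 2)  # D(X) = 1/3
print(m, d)
```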

2. Linear regression

Let the components X and Y of a two-dimensional random variable (X, Y) be dependent. We will assume that one of them can be approximately represented as a linear function of the other, for example

Y ≈ g(X) = α + βX, and we determine the parameters α and β using the least squares method.

Definition. The function g(X) = α + βX is called the best approximation of Y in the sense of the least squares method if the mathematical expectation M(Y - g(X))² takes the smallest possible value; the function g(X) is called the mean square regression of Y on X.

Theorem. The linear mean square regression of Y on X has the form:

g(X) = m_Y + r·(σ_Y/σ_X)·(X - m_X),

where m_X = M(X), m_Y = M(Y), σ_X and σ_Y are the standard deviations, and r is the correlation coefficient of X and Y.

These are the coefficients of the equation: it can be verified that for these values the function

F(α, β) = M(Y - α - βX)²

has a minimum, which proves the theorem.

Definition. The coefficient β = r·σ_Y/σ_X is called the regression coefficient of Y on X, and the straight line

y - m_Y = r·(σ_Y/σ_X)·(x - m_X)

is called the direct mean square regression line of Y on X.

By substituting the coordinates of the stationary point into the equality, we can find the minimum value of the function F(α, β), equal to σ_Y²·(1 - r²). This quantity is called the residual variance of Y relative to X and characterizes the amount of error allowed when replacing Y with g(X) = α + βX. When r = ±1, the residual variance is equal to 0; that is, the equality is not approximate but exact. Therefore, at r = ±1, Y and X are related by a linear functional dependence. Similarly, one can obtain the direct mean square regression line of X on Y:

x - m_X = r·(σ_X/σ_Y)·(y - m_Y),

and the residual variance σ_X²·(1 - r²) of X relative to Y. At r = ±1 both regression lines coincide. By comparing the regression equations of Y on X and X on Y and solving the system of equations, one can find the point of intersection of the regression lines: the point with coordinates (m_x, m_y), called the centre of the joint distribution of X and Y.

We will consider the algorithm for composing regression equations from the textbook by V. E. Gmurman “Probability Theory and Mathematical Statistics” p. 256.

1) Draw up a calculation table in which the numbers of sample elements, sampling options, their squares and product will be recorded.

2) Calculate the sum for all columns except the number.

3) Calculate the mean values, the variance, and the standard deviations for each variable.

4) Calculate the sample correlation coefficient.

5) Test the hypothesis about the existence of a connection between X and Y.

6) Create equations for both regression lines and draw graphs of these equations.

The slope of the regression line of Y on X is the sample regression coefficient b = r·(s_y/s_x):

b = 0.202.

We obtain the required equation of the regression line of Y on X:

Y = 0.202·X + 1.024

The regression equation of X on Y is obtained similarly. The slope of the regression line of X on Y is the sample regression coefficient ρ_xy = r·(s_x/s_y):

b = 4.119.

X = 4.119·Y - 3.714
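The original data table is not reproduced here, so the sketch below applies the same formulas, b_yx = r·s_y/s_x and b_xy = r·s_x/s_y, to hypothetical paired data:

```python
import numpy as np

x = np.array([2.0, 3.0, 5.0, 6.0, 8.0])
y = np.array([1.4, 1.7, 2.0, 2.3, 2.6])

r = np.corrcoef(x, y)[0, 1]
sx, sy = x.std(ddof=0), y.std(ddof=0)

# Regression of Y on X: slope r*sy/sx, line through (x̄, ȳ).
b_yx = r * sy / sx
a_yx = y.mean() - b_yx * x.mean()

# Regression of X on Y: slope r*sx/sy.
b_xy = r * sx / sy
a_xy = x.mean() - b_xy * y.mean()

print(f"Y = {b_yx:.3f}*X + {a_yx:.3f}")
print(f"X = {b_xy:.3f}*Y + {a_xy:.3f}")
# Both lines intersect at the distribution centre (x̄, ȳ).
```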

3. Nonlinear regression

If there are nonlinear relationships between economic phenomena, then they are expressed using the corresponding nonlinear functions.

There are two classes of nonlinear regressions:

1. Regressions that are nonlinear with respect to the explanatory variables included in the analysis, but linear with respect to the estimated parameters, for example:

polynomials of different degrees: y = a + b·x + c·x² + …;

an equilateral hyperbola: y = a + b/x;

a semilogarithmic function: y = a + b·ln x.

2. Regressions that are nonlinear in terms of the parameters being estimated, for example:

a power function: y = a·x^b;

an exponential function: y = a·b^x;

an exponent: y = e^(a + b·x).

Regressions that are nonlinear with respect to the included variables are brought to a linear form by simply replacing variables, and further estimation of the parameters is carried out using the least squares method. Let's look at some features.

A parabola of the second degree is reduced to linear form using the replacement x1 = x, x2 = x². As a result, we arrive at a two-factor equation, the estimation of whose parameters by the least squares method leads to the system of normal equations for the parabola given above.

A parabola of the second degree is usually used in cases where, for a certain interval of factor values, the nature of the connection between the characteristics under consideration changes: direct connection changes to reverse or reverse to direct.

An equilateral hyperbola can be used to characterize the relationship between the specific costs of raw materials, materials, fuel and the volume of output, the time of circulation of goods and the amount of turnover. Its classic example is the Phillips curve, which characterizes the nonlinear relationship between the unemployment rate x and the percentage of wage growth y.

The hyperbola is reduced to a linear equation by the simple substitution z = 1/x. The least squares method can then be used to construct a system of linear equations.

In a similar way, semilogarithmic and other dependences are reduced to linear form.

An equilateral hyperbola and a semilogarithmic curve are used to describe the Engel curve (a mathematical description of the relationship between the share of expenditure on durable goods and total expenditure or income). Equations that include ln x are used in studies of productivity and of the labor intensity of agricultural production.
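A minimal sketch of the substitution z = 1/x on hypothetical data:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0, 10.0])
y = np.array([10.5, 6.1, 3.9, 3.4, 2.2])  # roughly y = a + b/x

z = 1.0 / x                     # substitution z = 1/x linearizes the model
b, a = np.polyfit(z, y, deg=1)  # fit y = a + b*z by ordinary least squares
print(f"y ≈ {a:.3f} + {b:.3f}/x")
```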

4. Multiple regression

Multiple regression is a relationship equation with several independent variables:

y = f(x1, x2, …, xp) + ε,

where y is the dependent variable (resultant characteristic) and x1, x2, …, xp are the independent variables (factors).

To build a multiple regression equation, the following functions are most often used:

linear: y = a + b1·x1 + b2·x2 + … + bp·xp;

power: y = a · x1^b1 · x2^b2 · … · xp^bp;

exponent: y = e^(a + b1·x1 + … + bp·xp);

hyperbola: y = 1 / (a + b1·x1 + … + bp·xp).

You can use other functions that can be reduced to linear form.

To estimate the parameters of the multiple regression equation, the least squares method (OLS) is used. For linear equations, and for nonlinear equations reducible to linear ones, a system of normal equations is constructed whose solution yields estimates of the regression parameters.

To solve it, the method of determinants can be used:

b_i = Δ_i / Δ,

where Δ is the determinant of the system, and the Δ_i are the partial determinants, obtained by replacing the corresponding column of the system determinant matrix with the column of free terms of the system.

Another type of multiple regression equation is a regression equation on a standardized scale; OLS is applied to a multiple regression equation on a standardized scale.
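As a sketch on hypothetical data: solving the normal equations (XᵀX)b = Xᵀy directly is numerically equivalent to the determinant (Cramer) method described above:

```python
import numpy as np

# Two regressors x1, x2 and a response y (hypothetical data).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y  = np.array([4.1, 5.0, 8.9, 9.2, 12.8])

X = np.column_stack([np.ones_like(x1), x1, x2])

# Normal equations (X'X) b = X'y; Cramer's rule on this system uses
# the partial determinants described in the text.
b = np.linalg.solve(X.T @ X, X.T @ y)
print("a, b1, b2 =", np.round(b, 4))
```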

5. Using MS EXCEL to perform regression analysis

Regression analysis establishes the forms of dependence between the random variable Y (dependent) and the values ​​of one or more variable quantities (independent), and the values ​​of the latter are considered to be precisely specified. Such a dependence is usually determined by some mathematical model (regression equation) containing several unknown parameters. During regression analysis, based on sample data, estimates of these parameters are found, statistical errors in estimates or boundaries of confidence intervals are determined, and the compliance (adequacy) of the adopted mathematical model with experimental data is checked.

In linear regression analysis, the relationship between random variables is assumed to be linear. In the simplest case, in a paired linear regression model there are two variables X and Y. And it is required to construct (fit) a straight line using n pairs of observations (X1, Y1), (X2, Y2), ..., (Xn, Yn), called the regression line that "best" approximates the observed values. The equation of this line y=ax+b is a regression equation. Using a regression equation, you can predict the expected value of the dependent variable y corresponding to a given value of the independent variable x. In the case when the dependence between one dependent variable Y and several independent variables X1, X2, ..., Xm is considered, we speak of multiple linear regression.

In this case, the regression equation has the form

y = a0 + a1·x1 + a2·x2 + … + am·xm,

where a0, a1, a2, …, am are regression coefficients that require determination.

The coefficients of the regression equation are determined using the least squares method, that is, by achieving the minimum possible sum of squared differences between the actual values of the variable Y and those calculated from the regression equation. Note that a linear regression equation can be constructed formally even in the case where there is no linear correlation.

A measure of the effectiveness of a regression model is the coefficient of determination R2 (R-square). The coefficient of determination can take values ​​between 0 and 1; it determines the degree of accuracy with which the resulting regression equation describes (approximates) the original data. The significance of the regression model is also examined using the F-test (Fisher) and the reliability of the difference between the coefficients a0, a1, a2, ..., am and zero is checked using the Student’s t-test.

In Excel, experimental data can be approximated by a linear equation with up to 16 independent variables:

y = a0 + a1·x1 + a2·x2 + … + a16·x16

To obtain linear regression coefficients, the “Regression” procedure from the analysis package can be used. Also, complete information about the linear regression equation is provided by the LINEST function. In addition, the SLOPE and INTERCEPT functions can be used to obtain the parameters of the regression equation, and the TREND and FORECAST functions can be used to obtain the predicted Y values ​​at the desired points (for pairwise regression).

Let us consider in detail the use of the LINEST function (known_y's, [known_x's], [const], [stats]):

  • known_y's is the range of known values of the dependent parameter Y. In paired regression analysis it can have any shape; in multiple regression it must be a row or a column;
  • known_x's is the range of known values of one or more independent parameters. It must have the same shape as the Y range (for several parameters: several columns or rows, respectively);
  • const is a logical argument. If, based on the practical meaning of the regression analysis problem, the regression line must pass through the origin, that is, the free coefficient must equal 0, the value of this argument should be set to 0 (or "FALSE"). If it is set to 1 (or "TRUE") or omitted, the free coefficient is calculated in the usual way;
  • stats is a logical argument. If it is set to 1 (or "TRUE"), then regression statistics are additionally returned (see table), used to evaluate the effectiveness and significance of the model.

In general, for paired regression y = ax + b, the result of applying the LINEST function has the form:

Table. Output range of the LINEST function for pairwise regression analysis

In the case of multiple regression analysis for the equation y = a0 + a1x1 + a2x2 + … + amxm, the first row displays the coefficients am, …, a1, a0, and the second row the standard errors of these coefficients. Rows 3-5, except for the first two columns, which are filled with regression statistics, will return #N/A.

The LINEST function should be entered as an array formula, first selecting an array of the required size for the result (m+1 columns and 5 rows if regression statistics are required) and completing the entry of the formula by pressing CTRL+SHIFT+ENTER.
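Outside Excel, the slope, intercept, and R² that LINEST returns for paired regression can be reproduced, for example, with scipy (hypothetical data shown):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([5.0, 7.3, 9.6, 11.9, 14.2, 16.5])

res = stats.linregress(x, y)
print("slope a:", res.slope)          # LINEST's first output cell
print("intercept b:", res.intercept)  # LINEST's second output cell
print("R^2:", res.rvalue ** 2)        # part of the regression statistics
```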

Result for our example:

In addition, the program has a built-in function - Data Analysis on the Data tab.

It can also be used to perform regression analysis:

The slide shows the result of regression analysis performed using Data Analysis.

SUMMARY OUTPUT

Regression statistics

Multiple R

R Square

Adjusted R Square

Standard Error

Observations

ANOVA

Significance F

Regression

Coefficients

Standard Error

t Stat

P-value

Lower 95%

Upper 95%

Lower 95.0%

Upper 95.0%

Y-intercept

Variable X 1

The regression equations that we looked at earlier were also built in MS Excel. To do this, first build a scatter chart, then select Add Trendline from the context menu. In the new window, check the boxes to display the equation on the chart and to display the approximation reliability value (R^2) on the chart.

Literature:

  1. Gmurman V. E. Probability Theory and Mathematical Statistics: Textbook for universities. 10th ed., stereotyped. Moscow: Vysshaya Shkola, 2010. 479 p.
  2. Danko P. E., Popov A. G., Kozhevnikova T. Ya., Danko S. P. Higher Mathematics in Exercises and Problems: Textbook for universities. In 2 parts. 6th ed., stereotyped. Moscow: Oniks: Mir i Obrazovanie, 2007. 416 p.
  3. http://www.machinelearning.ru/wiki/index.php?title=%D0%A0%D0%B5%D0%B3%D1%80%D0%B5%D1%81%D1%81%D0%B8%D1%8F - some information about regression analysis

During their studies, students very often encounter a variety of equations. One of them - the regression equation - is discussed in this article. This type of equation is used specifically to describe the characteristics of the relationship between mathematical parameters. This type of equality is used in statistics and econometrics.

Definition of regression

In mathematics, regression means a quantity that describes the dependence of the average value of one quantity on the values of another quantity. The regression equation shows the average value of one characteristic as a function of another characteristic. The regression function has the form of a simple equation y = f(x), in which y acts as the dependent variable and x as the independent variable (factor characteristic).

What are the types of relationships between variables?

In general, there are two opposing types of relationships: correlation and regression.

The first is characterized by the equality of conditional variables. In this case, it is not reliably known which variable depends on the other.

If there is no equality between the variables and the conditions say which variable is explanatory and which is dependent, then we can talk about the presence of a connection of the second type. In order to construct a linear regression equation, it will be necessary to find out what type of relationship is observed.

Types of regressions

Today, there are 7 different types of regression: hyperbolic, linear, multiple, nonlinear, pairwise, inverse, logarithmically linear.

Hyperbolic, linear and logarithmic

The linear regression equation is used in statistics to clearly explain the parameters of the equation. It looks like y = c + m·x + E. A hyperbolic equation has the form of a regular hyperbola: y = c + m/x + E. A logarithmically linear equation expresses the relationship using a logarithmic function: ln y = ln c + m·ln x + ln E.

Multiple and nonlinear

The two more complex types of regression are multiple and nonlinear. The multiple regression equation is expressed by the function y = f(x1, x2, …, xc) + E. In this situation, y acts as the dependent variable and the x's act as explanatory variables. The variable E is stochastic; it includes the influence of the other factors in the equation. The nonlinear regression equation is a bit controversial: relative to the indicators taken into account it is not linear, but in the role of evaluating indicators it is linear.

Inverse and pairwise regression

Inverse regression is a kind of function that must be transformed into linear form. In the most traditional application packages it has the form y = 1/(c + m*x + E); taking the reciprocal of y linearizes it. A pairwise regression equation shows the relationship between the data as a function y = f(x) + E. Just as in the other equations, y depends on x, and E is a stochastic term.

Concept of correlation

This is an indicator demonstrating the existence of a relationship between two phenomena or processes. The strength of the relationship is expressed by the correlation coefficient, whose value lies in the interval [-1; +1]. A negative value indicates an inverse relationship, a positive value a direct one. A coefficient equal to 0 means there is no relationship. The closer the absolute value is to 1, the stronger the relationship between the parameters; the closer to 0, the weaker it is.

Methods

Parametric correlation methods make it possible to assess the strength of a relationship. They rely on estimates of the distribution and are used to study parameters that obey the normal distribution law.

The parameters of the linear regression equation are needed to identify the type of dependence, the function of the regression equation, and to evaluate the indicators of the chosen relationship formula. The correlation field serves as a method of identifying the relationship: all available data are depicted graphically in a rectangular two-dimensional coordinate system. The values of the explanatory factor are plotted along the abscissa, and the values of the dependent factor along the ordinate. If there is a functional relationship between the parameters, the points line up along a line.

If the correlation coefficient of such data is less than 0.3 (30%), one can speak of an almost complete absence of a relationship. A value between 0.3 and 0.7 indicates a relationship of medium closeness. A value of 1 (100%) is evidence of a functional relationship.
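As a quick illustration, the coefficient and the strength categories from the previous paragraph can be computed as follows (a sketch with invented data; NumPy assumed):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Pearson correlation coefficient, always in [-1, +1]
r = np.corrcoef(x, y)[0, 1]

# Rough strength categories from the text, applied to |r|
strength = ("almost none" if abs(r) < 0.3
            else "medium" if abs(r) < 0.7
            else "strong")
print(f"r = {r:.3f} ({strength} relationship)")
```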

A nonlinear regression equation, just like a linear one, must be supplemented with a correlation index (R).

Correlation for Multiple Regression

The coefficient of determination is the square of the multiple correlation index. It describes how closely the presented set of indicators is related to the characteristic under study, and it can also describe the nature of the parameters' influence on the result. The multiple regression equation is evaluated using this indicator.

To obtain the multiple correlation indicator, its index is calculated as the square root of the coefficient of determination.
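A minimal sketch of this calculation, assuming NumPy and invented data: the coefficient of determination R^2 is computed from the residual and total sums of squares, and the multiple correlation index is its square root:

```python
import numpy as np

# Hypothetical data: two explanatory factors and a response
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1])

# Add an intercept column and solve the least-squares problem
A = np.column_stack([np.ones(len(y)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Coefficient of determination and multiple correlation index
y_hat = A @ coef
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
r2 = 1 - ss_res / ss_tot
R = np.sqrt(r2)
print(f"R^2 = {r2:.3f}, multiple correlation index R = {R:.3f}")
```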

Least squares method

This method is a way of estimating regression factors. Its essence is to minimize the sum of squared deviations of the observed values from the values predicted by the function.

A pairwise linear regression equation can be estimated by this method. Equations of this type are used when a paired linear relationship is detected between indicators.
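In the paired case the method reduces to a small calculus problem: minimizing the sum of squared residuals with respect to c and m, and setting the partial derivatives to zero, yields closed-form estimates:

```latex
S(c, m) = \sum_{i=1}^{n} \left( y_i - c - m x_i \right)^2 \to \min_{c,\, m}

\frac{\partial S}{\partial c} = 0, \qquad \frac{\partial S}{\partial m} = 0
\quad\Longrightarrow\quad
m = \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i} (x_i - \bar{x})^2},
\qquad
c = \bar{y} - m\,\bar{x}.
```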

Equation Parameters

Each parameter of the linear regression function has a specific meaning. The paired linear regression equation contains two parameters: c and m. The parameter m shows the average change in the final indicator y when the variable x increases (or decreases) by one conventional unit. When the variable x is zero, the function equals the parameter c. If x cannot take the value zero, the factor c has no direct economic meaning; only its sign matters: a minus indicates that the result changes more slowly than the factor, while a plus indicates an accelerated change in the result.

Each parameter of the regression equation can be expressed through the others. For example, the factor c has the form c = ȳ - m*x̄: the intercept equals the mean of y minus the slope times the mean of x.
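A quick numerical check of these formulas (a sketch with invented data; NumPy assumed), comparing the closed-form estimates against NumPy's built-in fit:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 4.0, 5.7, 8.1, 9.9])

# Closed-form least-squares estimates: m from centered cross-products,
# c = mean(y) - m * mean(x), as in the text
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c = y.mean() - m * x.mean()

# Cross-check against NumPy's polynomial fit of degree 1
m_ref, c_ref = np.polyfit(x, y, 1)
assert np.allclose([m, c], [m_ref, c_ref])
print(f"m = {m:.3f}, c = {c:.3f}")
```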

Grouped data

There are problem settings in which all the information is grouped by the attribute x, and for each group the corresponding average value of the dependent indicator is given. In this case the average values characterize how the indicator depends on x, so the grouped information can be used to find the regression equation and to analyze the relationship. This method has its drawbacks, however: averages are often subject to external fluctuations that do not reflect the pattern of the relationship but merely mask it with noise, so averages show the pattern much worse than a linear regression equation does. Nevertheless, they can serve as a basis for finding an equation.

Multiplying the size of an individual group by the corresponding group average gives the sum of y within the group; adding up all such sums gives the overall total of y. It is a little harder to compute the sum of the products xy. If the intervals are small, the value of x can conditionally be taken to be the same for all units within a group; multiplying it by the group's sum of y gives the sum of the products of x and y within the group, and adding the group sums together gives the overall total of xy.
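The bookkeeping described above is easy to mechanize. A sketch with invented grouped data (NumPy assumed), reconstructing the totals and then the regression coefficients from them:

```python
import numpy as np

# Hypothetical grouped data: group value of x, group size n, group mean of y
x_mid = np.array([10.0, 20.0, 30.0, 40.0])
n     = np.array([5,    8,    6,    4])
y_avg = np.array([12.0, 18.5, 25.0, 31.5])

# Reconstruct the totals the text describes
sum_y  = np.sum(n * y_avg)           # sum of y over all units
sum_xy = np.sum(x_mid * n * y_avg)   # sum of x*y, x taken constant within a group
sum_x  = np.sum(n * x_mid)
N      = n.sum()

# Least-squares slope and intercept from the grouped totals
m = (sum_xy - sum_x * sum_y / N) / (np.sum(n * x_mid**2) - sum_x**2 / N)
c = sum_y / N - m * sum_x / N
print(f"m = {m:.3f}, c = {c:.3f}")
```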

Multiple regression equation: assessing the significance of the relationship

As discussed earlier, multiple regression is a function of the form y = f(x1, x2, …, xm) + E. It is most often used in problems of supply and demand for a product and of interest income on repurchased shares, as well as to study the causes and shape of the production cost function. It is also actively used in a wide variety of macroeconomic studies and calculations; at the microeconomic level this equation is applied somewhat less frequently.

The main task of multiple regression is to build a model from a large amount of data and to determine what influence each factor, individually and in combination, has on the indicator being modeled and on its coefficients. The regression equation can take a wide variety of forms. To assess the relationship, two types of functions are usually used: linear and nonlinear.

The linear function is depicted as the following relationship: y = a0 + a1*x1 + a2*x2 + … + am*xm. Here a1, a2, …, am are called "pure" regression coefficients. They characterize the average change in the parameter y for a change (increase or decrease) in the corresponding parameter x by one unit, on the condition that the values of the other indicators remain fixed.
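This "pure" interpretation can be demonstrated directly: with hypothetical coefficients (invented here), increasing x1 by one unit while x2 is held fixed changes the prediction by exactly a1. A minimal sketch:

```python
import numpy as np

# Hypothetical fitted coefficients of y = a0 + a1*x1 + a2*x2
a = np.array([1.5, 0.8, -0.3])  # [a0, a1, a2]

def predict(x1, x2):
    return a[0] + a[1] * x1 + a[2] * x2

# Increasing x1 by one unit while holding x2 fixed changes the
# prediction by exactly a1, the "pure" regression coefficient
delta = predict(3.0, 5.0) - predict(2.0, 5.0)
assert np.isclose(delta, a[1])
print(f"change in y per unit of x1, x2 held fixed: {delta:.2f}")
```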

Nonlinear equations can take, for example, the form of a power function y = a * x1^b1 * x2^b2 * … * xm^bm. Here the exponents b1, b2, …, bm are called elasticity coefficients: they show by how many percent the result changes when the corresponding indicator x increases (or decreases) by 1%, with the other factors held constant.
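Because taking logarithms makes the power model linear in its parameters, the elasticities can be estimated with the same least-squares machinery. A sketch with synthetic data (NumPy assumed):

```python
import numpy as np

# Synthetic data for a power model y = a * x1^b1 * x2^b2
rng = np.random.default_rng(0)
x1 = rng.uniform(1, 10, 50)
x2 = rng.uniform(1, 10, 50)
y = 2.0 * x1**0.7 * x2**0.2 * rng.lognormal(0, 0.05, 50)

# Taking logs turns the power model into one linear in the parameters
A = np.column_stack([np.ones(50), np.log(x1), np.log(x2)])
(lna, b1, b2), *_ = np.linalg.lstsq(A, np.log(y), rcond=None)

# b1 and b2 are elasticities: % change in y per 1% change in x1 or x2
print(f"a = {np.exp(lna):.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}")
```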

What factors need to be taken into account when constructing multiple regression

To build a multiple regression correctly, it is necessary to find out which factors deserve special attention.

Some understanding of the nature of the relationships between the economic factors and the quantity being modeled is required. The factors to be included must meet the following criteria:

  • They must be quantitatively measurable. To use a factor that describes the quality of an object, it must in any case be given a quantitative form.
  • The factors must not be intercorrelated or functionally related. Otherwise the system of normal equations becomes ill-conditioned, which entails unreliable and unstable estimates.
  • When the correlation between factors is high, the isolated influence of each factor on the final result cannot be determined, and the coefficients become uninterpretable; a screening sketch follows this list.
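A simple way to screen for such intercorrelation is to inspect the pairwise correlations between candidate factors, as in this sketch (NumPy assumed; the factor matrix is invented, with the second column nearly proportional to the first):

```python
import numpy as np

# Hypothetical factor matrix: columns are candidate explanatory variables
X = np.array([[1.0, 2.1, 3.0],
              [2.0, 4.2, 1.0],
              [3.0, 6.1, 4.0],
              [4.0, 8.3, 2.0],
              [5.0, 9.9, 5.0]])

# Pairwise correlations between factors; values near +/-1 signal
# intercorrelation that will make the normal equations ill-conditioned
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))

# Condition number of the factor matrix as a second warning sign
print(f"condition number: {np.linalg.cond(X):.1f}")
```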

Construction methods

There are a great many methods and techniques for selecting the factors of an equation. All of them, however, are based on selecting coefficients using a correlation indicator. Among them are:

  • The exclusion method.
  • The inclusion method.
  • Stepwise regression analysis.

The first method screens factors out of the full set; the second introduces additional factors one by one; the third removes factors that were previously introduced into the equation. Each of these methods has a right to exist: they have their pros and cons, but all of them in their own way solve the problem of discarding unnecessary indicators. As a rule, the results obtained by the individual methods are quite close.
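As an illustration of the first (exclusion) approach, here is a sketch of backward elimination by p-value, assuming the statsmodels library and synthetic data; real selection procedures differ mainly in their stopping rules:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data: three candidate factors, the third one irrelevant
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 2.0 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(size=100)

# Backward elimination: repeatedly drop the factor with the largest
# p-value until all remaining p-values are below the threshold
cols = [0, 1, 2]
while cols:
    model = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
    pvals = model.pvalues[1:]          # skip the intercept
    worst = int(np.argmax(pvals))
    if pvals[worst] < 0.05:
        break
    cols.pop(worst)

print("kept factors:", cols)
```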

Multivariate analysis methods

Such methods of selecting factors are based on considering individual combinations of interrelated characteristics. They include discriminant analysis, pattern recognition, principal component analysis, and cluster analysis. There is also factor analysis, which grew out of the principal component method. All of them are applied in particular circumstances, subject to certain conditions and factors.