For the territories of the region, data for 200X is provided.

Region number Average per capita living wage per day of one able-bodied person, rub., x Daily average wage, rub., y
1 78 133
2 82 148
3 87 134
4 79 154
5 89 162
6 106 195
7 67 139
8 88 158
9 73 152
10 87 162
11 76 159
12 115 173

Exercise:

1. Construct a correlation field and formulate a hypothesis about the form of the connection.

2. Calculate the parameters of the equation linear regression

4. Using the average (general) elasticity coefficient, give a comparative assessment of the strength of the relationship between the factor and the result.

7. Calculate the predicted value of the result if the predicted value of the factor increases by 10% from its average level. Determine the forecast confidence interval for the significance level.

Solution:

Let's solve this problem using Excel.

1. By comparing the available data x and y, for example, ranking them in increasing order of factor x, one can observe the presence of a direct relationship between the characteristics, when an increase in the average per capita subsistence level increases the average daily wage. Based on this, we can make the assumption that the relationship between the characteristics is direct and can be described by a straight line equation. The same conclusion is confirmed based on graphical analysis.

To build a correlation field, you can use Excel PPP. Enter the initial data in sequence: first x, then y.

Select the area of ​​cells that contains data.

Then choose: Insert / Scatter Plot / Scatter with Markers as shown in Figure 1.

Figure 1 Construction of the correlation field

Analysis of the correlation field shows the presence of a close to rectilinear dependence, since the points are located almost in a straight line.

2. To calculate the parameters of the linear regression equation
Let's use the built-in statistical function LINEST.

For this:

1) Open an existing file containing the analyzed data;
2) Select a 5x2 area of ​​empty cells (5 rows, 2 columns) to display the results of regression statistics.
3) Activate Function Wizard: in the main menu select Formulas / Insert Function.
4) In the window Category you are taking Statistical, in the function window - LINEST. Click the button OK as shown in Figure 2;

Figure 2 Function Wizard Dialog Box

5) Fill in the function arguments:

Known values ​​for

Known values ​​of x

Constant- a logical value that indicates the presence or absence of a free term in the equation; if Constant = 1, then the free term is calculated in the usual way, if Constant = 0, then the free term is 0;

Statistics- a logical value that indicates whether to display additional information on regression analysis or not. If Statistics = 1, then additional information is displayed, if Statistics = 0, then only estimates of the equation parameters are displayed.

Click the button OK;

Figure 3 LINEST Function Arguments Dialog Box

6) The first element of the final table will appear in the upper left cell of the selected area. To open the entire table, press the key , and then to the key combination ++ .

Additional regression statistics will be output in the order shown in the following diagram:

Coefficient value b Coefficient a value
Standard error b Standard error a
Standard error y
F-statistic
Regression sum of squares

Figure 4 Result of calculating the LINEST function

We got the regression level:

We conclude: With an increase in the average per capita subsistence level by 1 rub. the average daily wage increases by an average of 0.92 rubles.

This means that 52% of the variation in wages (y) is explained by the variation of factor x - the average per capita living wage, and 48% - by the action of other factors not included in the model.

Using the calculated coefficient of determination, the correlation coefficient can be calculated: .

The connection is assessed as close.

4. Using the average (general) elasticity coefficient, we determine the strength of the factor’s influence on the result.

For a straight line equation, we determine the average (total) elasticity coefficient using the formula:

We will find the average values ​​by selecting the area of ​​cells with x values ​​and selecting Formulas / AutoSum / Average, and we will do the same with the values ​​of y.

Figure 5 Calculation of average function values ​​and argument

Thus, if the average per capita cost of living changes by 1% from its average value, the average daily wage will change by an average of 0.51%.

Using a data analysis tool Regression available:
- results of regression statistics,
- results of analysis of variance,
- results confidence intervals,
- residuals and regression line fitting graphs,
- residuals and normal probability.

The procedure is as follows:

1) check access to Analysis package. In the main menu, select: File/Options/Add-ons.

2) In the dropdown list Control select item Excel add-ins and press the button Go.

3) In the window Add-ons check the box Analysis package and then click the button OK.

If Analysis package not in the field list Available add-ons, press the button Review to perform a search.

If you receive a message indicating that the analysis package is not installed on your computer, click Yes to install it.

4) In the main menu, select: Data / Data Analysis / Analysis Tools / Regression and then click the button OK.

5) Fill out the data input and output parameters dialog box:

Input interval Y- range containing data of the resultant attribute;

Input interval X- range containing data of the factor characteristic;

Tags- a flag that indicates whether the first line contains column names or not;

Constant - zero- a flag indicating the presence or absence of a free term in the equation;

Output interval- it is enough to indicate the upper left cell of the future range;

6) New worksheet - can be set arbitrary name new sheet.

Then click the button OK.

Figure 6 Dialog box for entering parameters for the Regression tool

The results of the regression analysis for the problem data are presented in Figure 7.

Figure 7 Result of using the regression tool

5. Let us evaluate the quality of the equations using the average approximation error. Let's use the results of the regression analysis presented in Figure 8.

Figure 8 Result of using the regression tool “Withdrawal of remainder”

Let's create a new table as shown in Figure 9. In column C, we calculate the relative approximation error using the formula:

Figure 9 Calculation of average approximation error

The average approximation error is calculated using the formula:

The quality of the constructed model is assessed as good, since it does not exceed 8 - 10%.

6. From table c regression statistics(Figure 4) we write down the actual value of Fisher’s F-test:

Because the at a 5% significance level, then we can conclude that the regression equation is significant (the relationship has been proven).

8. We will assess the statistical significance of the regression parameters using Student’s t-statistics and by calculating the confidence interval of each indicator.

We put forward the hypothesis H 0 about a statistically insignificant difference between the indicators and zero:

.

for the number of degrees of freedom

Figure 7 has the actual t-statistic values:

The t-test for the correlation coefficient can be calculated in two ways:

Method I:

Where - random error correlation coefficient.

We will take the data for calculation from the table in Figure 7.

Method II:

The actual t-statistic values ​​exceed the table values:

Therefore, the hypothesis H 0 is rejected, that is, the regression parameters and the correlation coefficient do not differ from zero by chance, but are statistically significant.

The confidence interval for parameter a is defined as

For parameter a, the 95% limits as shown in Figure 7 were:

The confidence interval for the regression coefficient is defined as

For the regression coefficient b, the 95% limits as shown in Figure 7 were:

Analysis of the upper and lower limits of confidence intervals leads to the conclusion that with probability parameters a and b, being within the specified limits, do not take zero values, i.e. are not statistically insignificant and significantly different from zero.

7. The obtained estimates of the regression equation allow it to be used for forecasting. If the predicted cost of living is:

Then the predicted value of the cost of living will be:

We calculate the forecast error using the formula:

Where

We will also calculate the variance using Excel PPP. For this:

1) Activate Function Wizard: in the main menu select Formulas / Insert Function.

3) Fill in the range containing the numerical data of the factor characteristic. Click OK.

Figure 10 Calculation of variance

We got the variance value

To calculate the residual variance per degree of freedom, we will use the results of analysis of variance as shown in Figure 7.

Confidence intervals for predicting individual values ​​of y with a probability of 0.95 are determined by the expression:

The interval is quite wide, primarily due to the small volume of observations. In general, the forecast for the average monthly salary turned out to be reliable.

The condition of the problem is taken from: Workshop on econometrics: Proc. allowance / I.I. Eliseeva, S.V. Kurysheva, N.M. Gordeenko and others; Ed. I.I. Eliseeva. - M.: Finance and Statistics, 2003. - 192 p.: ill.

Regression and correlation analysis are statistical research methods. These are the most common ways to show the dependence of a parameter on one or more independent variables.

Below, using specific practical examples, we will consider these two very popular analyzes among economists. We will also give an example of obtaining results when combining them.

Regression Analysis in Excel

Shows the influence of some values ​​(independent, independent) on the dependent variable. For example, how does the number of economically active population depend on the number of enterprises, wages and other parameters. Or: how do foreign investments, energy prices, etc. affect the level of GDP.

The result of the analysis allows you to highlight priorities. And based on the main factors, predict, plan the development of priority areas, and make management decisions.

Regression happens:

  • linear (y = a + bx);
  • parabolic (y = a + bx + cx 2);
  • exponential (y = a * exp(bx));
  • power (y = a*x^b);
  • hyperbolic (y = b/x + a);
  • logarithmic (y = b * 1n(x) + a);
  • exponential (y = a * b^x).

Let's look at an example of building a regression model in Excel and interpreting the results. Let's take the linear type of regression.

Task. At 6 enterprises, the average monthly salary and the number of quitting employees were analyzed. It is necessary to determine the dependence of the number of quitting employees on the average salary.

The linear regression model looks like this:

Y = a 0 + a 1 x 1 +…+a k x k.

Where a are regression coefficients, x are influencing variables, k is the number of factors.

In our example, Y is the indicator of quitting employees. The influencing factor is wages (x).

Excel has built-in functions that can help you calculate the parameters of a linear regression model. But the “Analysis Package” add-on will do this faster.

We activate a powerful analytical tool:

Once activated, the add-on will be available in the Data tab.

Now let's do the regression analysis itself.



First of all, we pay attention to R-squared and coefficients.

R-squared is the coefficient of determination. In our example – 0.755, or 75.5%. This means that the calculated parameters of the model explain 75.5% of the relationship between the studied parameters. The higher the coefficient of determination, the better the model. Good - above 0.8. Bad – less than 0.5 (such an analysis can hardly be considered reasonable). In our example – “not bad”.

The coefficient 64.1428 shows what Y will be if all variables in the model under consideration are equal to 0. That is, the value of the analyzed parameter is also influenced by other factors not described in the model.

The coefficient -0.16285 shows the weight of variable X on Y. That is, the average monthly salary within this model affects the number of quitters with a weight of -0.16285 (this is a small degree of influence). The “-” sign indicates a negative impact: the higher the salary, the fewer people quit. Which is fair.



Correlation Analysis in Excel

Correlation analysis helps determine whether there is a relationship between indicators in one or two samples. For example, between the operating time of a machine and the cost of repairs, the price of equipment and the duration of operation, the height and weight of children, etc.

If there is a connection, then does an increase in one parameter lead to an increase (positive correlation) or a decrease (negative) of the other. Correlation analysis helps the analyst determine whether the value of one indicator can be used to predict possible meaning another.

The correlation coefficient is denoted by r. Varies from +1 to -1. Classification of correlations for different areas will be different. When the coefficient is 0, there is no linear relationship between samples.

Let's look at how to find the correlation coefficient using Excel.

To find paired coefficients, the CORREL function is used.

Objective: Determine whether there is a relationship between the operating time of a lathe and the cost of its maintenance.

Place the cursor in any cell and press the fx button.

  1. In the “Statistical” category, select the CORREL function.
  2. Argument “Array 1” - the first range of values ​​– machine operating time: A2:A14.
  3. Argument “Array 2” - second range of values ​​– repair cost: B2:B14. Click OK.

To determine the type of connection, you need to look at the absolute number of the coefficient (each field of activity has its own scale).

For correlation analysis of several parameters (more than 2), it is more convenient to use “Data Analysis” (the “Analysis Package” add-on). You need to select correlation from the list and designate the array. All.

The resulting coefficients will be displayed in the correlation matrix. Like this:

Correlation and regression analysis

In practice, these two techniques are often used together.

Example:


Now the regression analysis data has become visible.

IN scientific research Often there is a need to find a connection between outcome and factor variables (the yield of a crop and the amount of precipitation, the height and weight of a person in homogeneous groups by gender and age, pulse rate and body temperature, etc.).

The second are signs that contribute to changes in those associated with them (the first).

The concept of correlation analysis

There are many Based on the above, we can say that correlation analysis is a method used to test the hypothesis about the statistical significance of two or more variables if the researcher can measure them, but not change them.

There are other definitions of the concept in question. Correlation analysis is a processing method that involves studying correlation coefficients between variables. In this case, correlation coefficients between one pair or many pairs of characteristics are compared to establish statistical relationships between them. Correlation analysis is a method for studying statistical dependence between random variables with the optional presence of a strict functional nature, in which the dynamics of one random variable leads to dynamics mathematical expectation another.

The concept of false correlation

When conducting correlation analysis, it is necessary to take into account that it can be carried out in relation to any set of characteristics, often absurd in relation to each other. Sometimes they have no causal connection with each other.

In this case, they talk about a false correlation.

Problems of correlation analysis

Based on the above definitions, we can formulate the following tasks of the described method: obtain information about one of the sought variables using another; determine the closeness of the relationship between the studied variables.

Correlation analysis involves determining the relationship between the characteristics being studied, and therefore the tasks of correlation analysis can be supplemented with the following:

  • identification of factors that have the greatest impact on the resulting characteristic;
  • identification of previously unexplored causes of connections;
  • construction of a correlation model with its parametric analysis;
  • study of the significance of communication parameters and their interval assessment.

Relationship between correlation analysis and regression

The method of correlation analysis is often not limited to finding the closeness of the relationship between the studied quantities. Sometimes it is supplemented by the compilation of regression equations, which are obtained using the analysis of the same name, and which represent a description of the correlation dependence between the resulting and factor (factor) characteristic (features). This method, together with the analysis under consideration, constitutes the method

Conditions for using the method

Effective factors depend on one to several factors. The correlation analysis method can be used if there is a large number of observations about the value of effective and factor indicators (factors), while the studied factors must be quantitative and reflected in specific sources. The first can be determined normal law— in this case, the result of the correlation analysis is the Pearson correlation coefficients, or, if the characteristics do not obey this law, the coefficient is used rank correlation Spearman.

Rules for selecting correlation analysis factors

When using this method it is necessary to determine the factors influencing performance indicators. They are selected taking into account the fact that there must be cause-and-effect relationships between the indicators. In the case of creating a multifactor correlation model, those that have an impact are selected. significant influence on the resulting indicator, while it is preferable not to include interdependent factors with a pair correlation coefficient of more than 0.85 in the correlation model, as well as those for which the connection with the resulting parameter is not linear or functional.

Displaying results

The results of correlation analysis can be presented in text and graphic forms. In the first case they are presented as a correlation coefficient, in the second - in the form of a scatter diagram.

In the absence of correlation between the parameters, the points on the diagram are located chaotically, the average degree of connection is characterized by a greater degree of order and is characterized by a more or less uniform distance of the marked marks from the median. A strong connection tends to be straight and at r=1 the dot plot is a flat line. Reverse correlation differs in the direction of the graph from the upper left to the lower right, direct correlation - from the lower left to the upper right corner.

3D representation of a scatter plot

In addition to the traditional 2D scatter plot display, a 3D graphical representation of correlation analysis is now used.

A scatterplot matrix is ​​also used, which displays all paired plots in a single figure in a matrix format. For n variables, the matrix contains n rows and n columns. The chart located at the intersection of the i-th row and the j-th column is a plot of the variables Xi versus Xj. Thus, each row and column is one dimension, a single cell displays a scatterplot of two dimensions.

Assessing the tightness of the connection

The closeness of the correlation connection is determined by the correlation coefficient (r): strong - r = ±0.7 to ±1, medium - r = ±0.3 to ±0.699, weak - r = 0 to ±0.299. This classification is not strict. The figure shows a slightly different diagram.

An example of using the correlation analysis method

In the UK it was attempted interesting research. It is devoted to the connection between smoking and lung cancer, and was carried out through correlation analysis. This observation is presented below.

Initial data for correlation analysis

Professional group

mortality

Farmers, foresters and fishermen

Miners and quarry workers

Manufacturers of gas, coke and chemicals

Manufacturers of glass and ceramics

Workers of furnaces, forges, foundries and rolling mills

Electrical and electronics workers

Engineering and related professions

Woodworking industries

Leatherworkers

Textile workers

Manufacturers of work clothes

Workers in the food, drink and tobacco industries

Paper and Print Manufacturers

Manufacturers of other products

Builders

Painters and decorators

Drivers of stationary engines, cranes, etc.

Workers not elsewhere included

Transport and communications workers

Warehouse workers, storekeepers, packers and filling machine workers

Office workers

Sellers

Sports and recreation workers

Administrators and managers

Professionals, technicians and artists

We begin correlation analysis. For clarity, it is better to start the solution with a graphical method, for which we will construct a scatter diagram.

It demonstrates a direct connection. However, it is difficult to draw an unambiguous conclusion based on the graphical method alone. Therefore, we will continue to perform correlation analysis. An example of calculating the correlation coefficient is presented below.

By using software(using MS Excel as an example will be described below) we determine the correlation coefficient, which is 0.716, which means a strong connection between the parameters under study. Let's determine the statistical reliability of the obtained value using the corresponding table, for which we need to subtract 2 from 25 pairs of values, as a result we get 23 and using this line in the table we find r critical for p = 0.01 (since these are medical data, a more strict dependence, in other cases p=0.05 is sufficient), which is 0.51 for this correlation analysis. The example demonstrated that the calculated r is greater than the critical r, and the value of the correlation coefficient is considered statistically reliable.

Using software when conducting correlation analysis

The described type of statistical data processing can be carried out using software, in particular, MS Excel. Correlation involves calculating the following parameters using functions:

1. The correlation coefficient is determined using the CORREL function (array1; array2). Array1,2 - cell of the interval of values ​​of the resultant and factor variables.

The linear correlation coefficient is also called the Pearson correlation coefficient, and therefore, starting with Excel 2007, you can use the function with the same arrays.

Graphical display of correlation analysis in Excel is done using the “Charts” panel with the “Scatter Plot” selection.

After specifying the initial data, we get a graph.

2. Assessing the significance of the pairwise correlation coefficient using Student’s t-test. The calculated value of the t-criterion is compared with the tabulated (critical) value of this indicator from the corresponding table of values ​​of the parameter under consideration, taking into account the specified level of significance and the number of degrees of freedom. This estimation is carried out using the function STUDISCOVER(probability; degrees_of_freedom).

3. Matrix of pair correlation coefficients. The analysis is carried out using the Data Analysis tool, in which Correlation is selected. Statistical assessment of pair correlation coefficients is carried out by comparing it absolute value with a tabular (critical) value. When the calculated pairwise correlation coefficient exceeds the critical one, we can say, taking into account the given degree of probability, that the null hypothesis about the significance of the linear relationship is not rejected.

Finally

The use of correlation analysis method in scientific research allows us to determine the relationship between various factors and performance indicators. It is necessary to take into account that a high correlation coefficient can be obtained from an absurd pair or set of data, and therefore this type of analysis must be carried out on a sufficiently large array of data.

After obtaining the calculated value of r, it is advisable to compare it with the critical r to confirm the statistical reliability of a certain value. Correlation analysis can be carried out manually using formulas, or using software, in particular MS Excel. Here you can also construct a scatter diagram for the purpose of visually representing the relationship between the studied factors of correlation analysis and the resulting characteristic.

Have you already encountered the need to calculate the degree of connection between two statistical quantities and determine the formula by which they correlate? Normal person may ask why this might be necessary at all. Oddly enough, this is actually necessary. Knowing reliable correlations can help you make crazy money if you're, say, a stock trader. The problem is that for some reason no one reveals these correlations (surprising, isn’t it?).

Let's count them ourselves! For example, I decided to try to calculate the correlation of the ruble to the dollar through the euro. Let's look at how this is done in detail.

This article is intended for advanced levels of proficiency. Microsoft Excel. If you don't have time to read the entire article, you can download the file and figure it out yourself.

If you often find yourself needing to do something like this I highly recommend that you consider purchasing the book. Statistical calculations in Excel.

What is important to know about correlations

To calculate a reliable correlation, you need to have a reliable sample; the larger it is, the more reliable the result will be. For purposes this example I took a daily sample of exchange rates for 10 years. The data is freely available, I took it from the site http://oanda.com.

What did I actually do

(1) Once I had the raw data, I started by checking the degree of correlation between the two data sets. To do this, I used the CORREL function - there is a little information about it. It returns the degree of correlation between two data ranges. The result, frankly speaking, was not particularly impressive (only about 70%). Generally speaking, the degree of correlation between two quantities is usually considered to be the square of this quantity, that is, the correlation turned out to be reliable by approximately 49%. This is very little!

(2) This seemed very strange to me. What errors could have crept into my calculations? So I decided to make a graph and see what could happen. The graph was specially broken down by year so that you could visually see where the correlation breaks down. The schedule turned out like this

(3) It is obvious from the graph that in the range of about 35 rubles per euro the correlation begins to break into two parts. Because of this, it turned out to be unreliable. It was necessary to determine why this was happening.

(4) The color shows that these data refer to 2007, 2008, 2009. Certainly! Periods of economic peaks and recessions are usually statistically unreliable, which is what happened in this case. Therefore, I tried to exclude these periods from the data (and to check, I checked the degree of correlation of the data in this period). The degree of correlation of these data alone is 0.01%, that is, it is completely absent. But without them, the data correlates by approximately 81%. This is already a fairly reliable correlation. Here is a graph with the function.

Next steps

Theoretically, the correlation function can be refined by converting it from linear to exponential or logarithmic. In this case, the statistical reliability of the correlation increases by approximately one percent, but the complexity of applying the formula increases incredibly. Therefore, I ask myself the question: is this really necessary? It's up to you to decide - for each specific case.

To determine the degree of dependence between several indicators, multiple correlation coefficients are used. They are then summarized in a separate table, which is called the correlation matrix. The names of the rows and columns of such a matrix are the names of the parameters whose dependence on each other is established. At the intersection of rows and columns, the corresponding correlation coefficients are located. Let's find out how you can do a similar calculation using Excel tools.

It is customary to determine the level of relationship between various indicators as follows, depending on the correlation coefficient:

  • 0 – 0.3 – no connection;
  • 0.3 – 0.5 – weak connection;
  • 0.5 – 0.7 – average connection;
  • 0.7 – 0.9 – high;
  • 0.9 – 1 – very strong.

If the correlation coefficient is negative, this means that the relationship between the parameters is inverse.

In order to create a correlation matrix in Excel, you use one tool included in the package "Data analysis". That's what it's called - "Correlation". Let's learn how it can be used to calculate multiple correlation metrics.

Step 1: Activate the analysis package

It must be said right away that the default package "Data analysis" disabled. Therefore, before proceeding with the procedure of directly calculating correlation coefficients, you need to activate it. Unfortunately, not every user knows how to do this. Therefore, we will dwell on this issue.


After the specified action, the tool package "Data analysis" will be activated.

Stage 2: coefficient calculation

Now you can proceed directly to calculating the multiple correlation coefficient. Let's use the example of the table below of indicators of labor productivity, capital-labor ratio and energy-labor ratio at various enterprises to calculate the multiple correlation coefficient of these factors.


Stage 3: analysis of the result obtained

Now let's figure out how to understand the result that we received in the process of data processing with the tool "Correlation" in Excel.

As we can see from the table, the correlation coefficient of capital-labor ratio (Column 2) and energy availability ( Column 1) is 0.92, which corresponds to a very strong relationship. Between labor productivity ( Column 3) and energy availability ( Column 1) this indicator is 0.72, which is a high degree of dependence. The correlation coefficient between labor productivity ( Column 3) and capital-labor ratio ( Column 2) is equal to 0.88, which also corresponds to a high degree of dependence. Thus, we can say that the relationship between all the studied factors is quite strong.

As you can see, the package "Data analysis" in Excel is a very convenient and fairly easy-to-use tool for determining the multiple correlation coefficient. With its help, you can also calculate the usual correlation between two factors.