LABORATORY WORK

CORRELATION ANALYSIS IN EXCEL

1.1 Correlation analysis in MS Excel

Correlation analysis determines the degree of association between two random variables X and Y. The correlation coefficient serves as the measure of this association. It is estimated from a sample of n paired observations (x_i, y_i) drawn from the joint population of X and Y. When X and Y are measured on quantitative scales, the linear correlation coefficient (Pearson coefficient) is used; it assumes that the samples X and Y are normally distributed.

The correlation coefficient ranges from -1 (perfect inverse linear relationship) to 1 (perfect direct linear relationship). A value of 0 means there is no linear relationship between the two samples.

General classification of correlations (according to Ivanter E.V., Korosov A.V., 1992):

There are several types of correlation coefficients, depending on the variables X and Y, which can be measured on different scales. It is this fact that determines the choice of the appropriate correlation coefficient (see Table 13):

In MS Excel, pairwise linear correlation coefficients are calculated with the function CORREL(array1; array2),

where array1 is a reference to the range of cells of the first sample (X) and array2 is a reference to the range of cells of the second sample (Y).

Example 1: 10 schoolchildren were given tests for visual-figurative and verbal thinking. The average time for solving test tasks was measured in seconds. The researcher is interested in the question: is there a relationship between the time it takes to solve these problems? Variable X denotes the average time for solving visual-figurative tests, and variable Y denotes the average time for solving verbal test tasks.

Solution: To assess the degree of relationship, first enter the data into an MS Excel worksheet (see the table, Fig. 1). Then calculate the correlation coefficient: place the cursor in cell C1 and click the Insert Function (fx) button on the toolbar.

In the Function Wizard dialog box that appears, select the Statistical category and the CORREL function, and then click OK. Using the mouse, enter the range of sample X (A1:A10) in the array1 field and the range of sample Y (B1:B10) in the array2 field. Click OK. The value of the correlation coefficient, 0.54119, will appear in cell C1. Next, look at the absolute value of the correlation coefficient and determine the strength of the relationship (strong, weak, moderate, etc.).

Fig. 1. Results of calculating the correlation coefficient

Thus, the connection between the time of solving visual-figurative and verbal test tasks has not been proven.

Task 1. Data are available for 20 agricultural holdings. Find the correlation coefficient between grain crop yield and land quality, and evaluate its significance. The data are shown in the table.

Table 2. Dependence of grain yield on land quality

Farm number | Land quality, score | Yield, c/ha


Task 2. Determine whether there is a relationship between the operating time of a fitness machine (thousand hours) and the cost of its repair (thousand rubles):

Machine operating time (thousand hours) | Cost of repairs (thousand rubles)

1.2 Multiple correlation in MS Excel

With a large number of observations, when correlation coefficients must be calculated in turn for several samples, the resulting coefficients are conveniently summarized in tables called correlation matrices.

A correlation matrix is a square table in which the cell at the intersection of a row and a column holds the correlation coefficient between the corresponding parameters.

In MS Excel, correlation matrices are calculated with the Correlation procedure from the Data Analysis package. The procedure produces a correlation matrix containing the correlation coefficients between the various parameters.

To implement the procedure you need:

1. execute the command Tools - Data Analysis;

2. in the Analysis Tools list that appears, select the Correlation line and press the OK button;

3. in the dialog box that appears, specify the Input Range, that is, enter a reference to the cells containing the data to be analyzed. The input range must contain at least two columns;

4. in the Grouped By section, set the switch according to how the data were entered (by columns or by rows);

5. specify the Output Range, that is, enter a reference to the cell from which the analysis results will be displayed. The size of the output range is determined automatically, and a message is displayed if the output range might overlap the source data. Press the OK button.

The correlation matrix is written to the output range; at the intersection of each row and column is the correlation coefficient between the corresponding parameters. Cells in the output range whose row and column coordinates coincide contain the value 1, because each column in the input range is perfectly correlated with itself.
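The matrix produced by the Correlation tool can be reproduced outside Excel for checking. The following is a minimal Python sketch, not the tool's actual implementation; the function names `pearson` and `correlation_matrix` and all data values are hypothetical and chosen only for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mean_x) ** 2 for x in xs)
                    * sum((y - mean_y) ** 2 for y in ys))
    return num / den

def correlation_matrix(columns):
    """Return {row_name: {column_name: r}} for a dict of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}

# Hypothetical observations, invented for illustration only.
data = {
    "sunny_days": [5, 10, 15, 20, 25],
    "museum": [500, 450, 400, 380, 300],
    "park": [100, 250, 400, 600, 800],
}

matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, [round(row[other], 2) for other in data])
```

As in the Excel output, the diagonal cells equal 1 and the matrix is symmetric, so only one triangle needs to be read.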

Example 2. There are monthly observational data on weather conditions and attendance at museums and parks (see Table 3). It is necessary to determine whether there is a relationship between weather conditions and attendance at museums and parks.

Table 3. Observation results

Number of clear days | Number of museum visitors | Number of park visitors

Solution. To perform the correlation analysis, enter the original data into the range A1:G3 (Fig. 2). Then, in the Tools menu, select Data Analysis and choose the Correlation line. In the dialog box that appears, specify the Input Range (A2:C7). Specify that the data are grouped by columns. Specify the output range (E1) and press the OK button.

Fig. 2 shows that the correlation between weather conditions and museum attendance is -0.92, between weather conditions and park attendance is 0.97, and between park and museum attendance is -0.92.

Thus, as a result of the analysis, dependencies were revealed: a strong degree of inverse linear relationship between museum attendance and the number of sunny days and an almost linear (very strong direct) relationship between park attendance and weather conditions. There is a strong inverse relationship between museum and park attendance.

Fig. 2. Results of calculating the correlation matrix from example 2

Task 3. Ten managers were assessed on psychological personality traits using the expert assessment method. Fifteen experts rated each psychological trait on a five-point scale (see Table 4). The psychologist is interested in the relationship between these traits of a manager.

Table 4. Study results

Subjects | Tact | Exactingness | Criticality

Microsoft Excel is a utility that is widely used in many companies and enterprises. Almost any employee needs to be proficient in Excel to some degree, since the program is used to solve a very wide range of problems. When working with tables, you often have to determine whether certain variables are related to each other. Correlation is used for this purpose. This article takes a detailed look at how to calculate the correlation coefficient in Excel.

Let's start with what a correlation coefficient is. It shows the degree of relationship between two variables and always ranges from -1 (strong inverse relationship) to 1 (strong direct relationship). A coefficient of 0 indicates that there is no linear relationship between the values.

Now, having dealt with the theory, let's move on to practice. To find the relationship between the variables x and y, use the built-in Microsoft Excel function CORREL. To do this, click the Function Wizard button (it is located next to the formula bar). In the window that opens, select CORREL from the list of functions. Then set the ranges in the Array1 and Array2 fields. For example, select the y values for Array1 and the x values for Array2. As a result, you will get the correlation coefficient calculated by the program.

The following method is relevant for students who are required to find the dependence using a given formula. First, you need the average values of the variables x and y. To get them, select the values of each variable and use the AVERAGE function. Next, calculate the difference between each x and x_avg, and between each y and y_avg. In the selected cells, write the formulas x - x_avg and y - y_avg. Don't forget to fix the cells holding the averages with absolute references. Then drag the formula down so that it applies to the rest of the numbers.

Now that we have all the necessary data, we can calculate the correlation. Multiply the resulting differences: (x - x_avg) * (y - y_avg). Once you have the result for each pair of values, add up the resulting numbers using AutoSum. This gives the numerator.

Now let's move on to the denominator. The calculated differences must be squared. To do this, enter the formulas (x - x_avg)^2 and (y - y_avg)^2 in separate columns. Then drag the formulas across the entire range. Then, using the AutoSum button, find the sum of each column (for x and for y). It remains to multiply the two sums and take the square root of the product. The last step is to divide the numerator by the denominator. The result is the desired correlation coefficient.
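The manual procedure described above maps column by column onto a few lines of code. The sketch below is in Python rather than Excel, and the five data pairs are hypothetical values invented for illustration; each variable corresponds to one worksheet column or AutoSum cell.

```python
import math

# Hypothetical paired observations, invented for illustration only.
x = [2.0, 4.0, 6.0, 8.0, 10.0]
y = [1.0, 3.0, 7.0, 9.0, 10.0]

x_avg = sum(x) / len(x)          # AVERAGE over the x column
y_avg = sum(y) / len(y)          # AVERAGE over the y column

dx = [xi - x_avg for xi in x]    # column of x - x_avg
dy = [yi - y_avg for yi in y]    # column of y - y_avg

# Numerator: sum of the products of the paired deviations.
numerator = sum(a * b for a, b in zip(dx, dy))

# Denominator: square root of the product of the two sums of squares.
sum_dx2 = sum(a ** 2 for a in dx)
sum_dy2 = sum(b ** 2 for b in dy)
denominator = math.sqrt(sum_dx2 * sum_dy2)

r = numerator / denominator
print(round(r, 4))
```

Keeping the intermediate columns as separate lists mirrors the worksheet layout, which makes it easy to compare each step with the cells in Excel.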

As you can see, knowing how to work with Microsoft Excel functions can significantly simplify the calculation of complex mathematical expressions. Thanks to the tools built into the program, you can do correlation analysis in Excel in just a couple of minutes, saving time and effort.

Let's calculate the correlation coefficient and covariance for different types of relationships between random variables.

The correlation coefficient (Pearson's correlation criterion, English: Pearson product-moment correlation coefficient) measures the degree of linear relationship between random variables.

As follows from the definition, calculating the correlation coefficient requires knowing the distributions of the random variables X and Y. If the distributions are unknown, the correlation coefficient is estimated by the sample correlation coefficient r (also denoted R_xy or r_xy):

$$r_{xy} = \frac{\mathrm{Cov}(x, y)}{S_x S_y}$$

where Cov(x, y) is the sample covariance and S_x is the sample standard deviation of the random variable x (S_y is defined likewise), calculated by the formula:

$$S_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

As the formula for the correlation shows, the denominator (the product of the standard deviations) simply normalizes the numerator so that the correlation is a dimensionless number from -1 to 1. Correlation and covariance carry the same information (if the standard deviations are known), but correlation is more convenient to use because it is dimensionless.

Calculating the correlation coefficient and the sample covariance in MS EXCEL is not difficult, since the special functions CORREL() and COVAR() exist for this purpose. It is much harder to work out how to interpret the values obtained; most of the article is devoted to this.

Theoretical digression

Recall that a correlation is a statistical relationship in which different values of one variable correspond to different mean values of the other (as the value of X changes, the mean value of Y changes in a regular way). Both variables X and Y are assumed to be random variables with some random scatter about their mean values.

Note. If only one variable, for example, Y, has a random nature, and the values ​​of the other are deterministic (set by the researcher), then we can only talk about regression.

Thus, for example, when studying how the average annual temperature depends on the year, one cannot speak of a correlation between temperature and year of observation, nor apply correlation measures with their usual interpretation.

Correlation between variables can arise in several ways:

  1. A causal relationship between the variables. For example, the amount invested in scientific research (variable X) and the number of patents received (Y). The first variable acts as the independent variable (factor), the second as the dependent variable (outcome). Remember that dependence between quantities implies a correlation between them, but not vice versa.
  2. A common cause (conjugation). For example, as an organization grows, both the payroll and the cost of renting premises increase. Obviously, it would be wrong to assume that rent depends on the payroll. In many cases both variables depend linearly on the number of personnel.
  3. Mutual influence of the variables (when one changes, the other changes, and vice versa). With this approach, two formulations of the problem are possible: either variable can act as the independent variable or as the dependent variable.

Thus, the correlation coefficient shows how strong the linear relationship between two factors is (if there is one), while regression allows you to predict one factor from the other.

Correlation, like any other statistical measure, is useful when applied correctly, but it has limitations. If the data show a clearly defined linear relationship or a complete absence of any relationship, the correlation reflects this well. But if the data show a nonlinear relationship (for example, quadratic), separate groups of values, or outliers, the calculated correlation coefficient can be misleading (see the example file).
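The quadratic case is easy to demonstrate numerically. In the Python sketch below (an illustration with synthetic values, not taken from the example file), y is fully determined by x in both cases, yet the coefficient detects only the linear dependence. The helper name `pearson` is hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mean_x) ** 2 for x in xs)
                    * sum((y - mean_y) ** 2 for y in ys))
    return num / den

# Synthetic x values placed symmetrically around zero.
x = list(range(-5, 6))
y_linear = [2 * v + 1 for v in x]      # exact linear dependence
y_quadratic = [v ** 2 for v in x]      # exact quadratic dependence

r_linear = pearson(x, y_linear)
r_quadratic = pearson(x, y_quadratic)
print(round(r_linear, 4), round(r_quadratic, 4))
```

The quadratic series gives a coefficient of zero because the positive and negative deviations of x cancel; a scatter plot would reveal the dependence immediately.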

A correlation close to 1 or -1 (i.e. close to 1 in absolute value) indicates a strong linear relationship between the variables; a value close to 0 indicates no linear relationship. A positive correlation means that as one variable increases, the other increases on average; a negative correlation means that it decreases.

To calculate the correlation coefficient, it is required that the compared variables satisfy the following conditions:

  • the number of variables must be equal to two;
  • the variables must be quantitative (e.g. frequency, weight, price). The average of such variables is meaningful: average price or average patient weight. Unlike quantitative variables, qualitative (nominal) variables take values only from a finite set of categories (for example, gender or blood type). Such values are only conventionally assigned numbers (for example, female is 1 and male is 2). Clearly, computing the average, which is required to find the correlation, is meaningless in this case, and so is the correlation itself;
  • the variables must be random variables and have a normal distribution.

Two-dimensional data can have different structures. Some of them require certain approaches to work with:

  • For data with non-linear relationship correlation must be used with caution. For some problems, it may be useful to transform one or both variables to produce a linear relationship (this requires making an assumption about the type of nonlinear relationship in order to suggest the type of transformation needed).
  • Scatter plots may show that some data have unequal variation (scatter). The problem with unequal variation is that the regions with high variation not only provide the least accurate information but also have the greatest influence on the calculated statistics. This problem is also often solved by transforming the data, for example by taking logarithms.
  • Some data can be observed to be divided into groups (clustering), which may indicate the need to divide the population into parts.
  • An outlier (a sharply deviating value) can distort the calculated value of the correlation coefficient. An outlier may be due to chance, an error in data collection, or may actually reflect some feature of the relationship. Since the outlier deviates greatly from the average value, it makes a large contribution to the calculation of the indicator. Statistical indicators are often calculated with and without taking into account outliers.

Using MS EXCEL to calculate correlation

As an example, take two variables X and Y and, correspondingly, a sample consisting of several pairs of values (X_i; Y_i). For clarity, let's build a scatter plot.

Note: In the example file, the scatter plots are built with a chart type other than Scatter, because here we have deviated from the requirement that variable X be random (this simplifies generating various types of relationships: constructing trends with a given scatter). For real data, you must use a Scatter chart (see below).

We will calculate the correlation for several kinds of relationship between the variables: linear, quadratic, and no relationship.

Note: In the example file, you can set the parameters of the linear trend (slope, Y-intercept) and the degree of scatter relative to this trend line. You can also adjust the quadratic parameters.

In the example file, when the variables are independent, a Scatter chart is used to build the scatter plot. In this case, the points on the chart form a cloud.

Note: Please note that by changing the scale of the diagram along the vertical or horizontal axis, the cloud of points can be given the appearance of a vertical or horizontal line. It is clear that the variables will remain independent.

As mentioned above, to calculate correlation coefficient in MS EXCEL there is a CORREL() function. You can also use the similar function PEARSON(), which returns the same result.

To confirm that the CORREL() function calculates the correlation using the formulas above, the example file also computes the correlation with more detailed formulas:

=COVARIANCE.P(B28:B88;D28:D88)/STDEV.P(B28:B88)/STDEV.P(D28:D88)

=COVARIANCE.S(B28:B88;D28:D88)/STDEV.S(B28:B88)/STDEV.S(D28:D88)
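The two worksheet formulas above give the same value because the divisor (n or n - 1) cancels between the covariance and the two standard deviations. A small Python sketch of this, using hypothetical data and hypothetical helper names (`covariance`, `stdev`):

```python
import math

def covariance(xs, ys, sample):
    """Covariance with divisor n - 1 (sample=True) or n (sample=False)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    return total / (n - 1 if sample else n)

def stdev(xs, sample):
    """Standard deviation as the square root of the covariance of xs with itself."""
    return math.sqrt(covariance(xs, xs, sample))

# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

# Population version: covariance and deviations all divide by n.
r_population = covariance(x, y, False) / stdev(x, False) / stdev(y, False)
# Sample version: covariance and deviations all divide by n - 1.
r_sample = covariance(x, y, True) / stdev(x, True) / stdev(y, True)

print(round(r_population, 6), round(r_sample, 6))
```

Mixing the two conventions (for example, a population covariance with sample standard deviations) would not cancel and would bias the result by a factor of (n - 1)/n.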

Note: The square of the correlation coefficient r equals the coefficient of determination R2, which is calculated when constructing a regression line and is returned by the RSQ() function. The value of R2 can also be displayed on a scatter plot by adding a linear trend line with the standard MS EXCEL functionality (select the chart, select the Layout tab, then in the Analysis group click the Trendline button and select Linear Trendline).

Using MS EXCEL to Calculate Covariance

Covariance is close in meaning to variance (it is also a measure of dispersion), with the difference that it is defined for two variables, whereas variance is defined for one. Therefore, cov(x;x)=VAR(x).

To calculate covariance in MS EXCEL (starting with version 2010), the functions COVARIANCE.P() and COVARIANCE.S() are used. The first divides the sum of products of deviations by n (the ending .P stands for Population); the second uses the multiplier 1/(n-1) instead of 1/n (the ending .S stands for Sample).

Note: The COVAR() function, available in earlier versions of MS EXCEL, is equivalent to COVARIANCE.P().

Note: The function names are given here as in the English version of MS EXCEL: CORREL, COVAR, COVARIANCE.P and COVARIANCE.S. In the Russian version the last two end in .Г and .В, which is sometimes transliterated as .G and .B.

Additional formulas for calculating covariance:

=SUMPRODUCT(B28:B88-AVERAGE(B28:B88);(D28:D88-AVERAGE(D28:D88)))/COUNT(D28:D88)

=SUMPRODUCT(B28:B88-AVERAGE(B28:B88);(D28:D88))/COUNT(D28:D88)

=SUMPRODUCT(B28:B88;D28:D88)/COUNT(D28:D88)-AVERAGE(B28:B88)*AVERAGE(D28:D88)
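The three worksheet formulas above are algebraically equivalent ways of computing the population covariance. A short Python sketch with hypothetical data shows them agreeing; `cov_1`, `cov_2` and `cov_3` correspond to the three formulas in order.

```python
# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Formula 1: mean product of the deviations of both variables.
cov_1 = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n

# Formula 2: only x is centered; the term with mean_y sums to zero.
cov_2 = sum((a - mean_x) * b for a, b in zip(x, y)) / n

# Formula 3: mean of the products minus the product of the means.
cov_3 = sum(a * b for a, b in zip(x, y)) / n - mean_x * mean_y

print(cov_1, cov_2, cov_3)
```

The third form needs only one pass over the data, but it can lose precision when the means are large compared with the spread, so the first form is usually the safer choice.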

These formulas use the following property of covariance:

$$\mathrm{Cov}(x, y) = E[(x - E[x])(y - E[y])] = E[xy] - E[x]\,E[y]$$

If the variables x and y are independent, their covariance is 0. If the variables are not independent, the variance of their sum is equal to:

VAR(x+y)= VAR(x)+ VAR(y)+2COV(x;y)

and the variance of their difference is equal to:

VAR(x-y)= VAR(x)+ VAR(y)-2COV(x;y)
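Both identities can be verified numerically. The Python sketch below uses population (divide by n) variance and covariance and hypothetical data; the helper names `cov` and `var` are illustrative only.

```python
def cov(xs, ys):
    """Population covariance (divisor n)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / n

def var(xs):
    """Population variance, i.e. the covariance of a variable with itself."""
    return cov(xs, xs)

# Hypothetical data, invented for illustration only.
x = [3.0, 5.0, 2.0, 8.0, 7.0]
y = [1.0, 4.0, 2.0, 6.0, 9.0]

var_sum = var([a + b for a, b in zip(x, y)])
var_diff = var([a - b for a, b in zip(x, y)])

print(var_sum, var(x) + var(y) + 2 * cov(x, y))
print(var_diff, var(x) + var(y) - 2 * cov(x, y))
```

The same identities hold with the sample (n - 1) versions, provided the variance and the covariance use the same divisor.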

Estimation of statistical significance of the correlation coefficient

To test the hypothesis that there is no linear relationship (the null hypothesis that the correlation is zero), we must know the distribution of the random variable, i.e. of the correlation coefficient r. Usually, the hypothesis is tested not for r itself but for the random variable t_r:

$$t_r = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

which follows Student's t-distribution with n-2 degrees of freedom.

If the calculated value |t_r| is greater than the critical value t_{α,n-2} (for a specified α), the null hypothesis is rejected (the relationship between the variables is statistically significant).
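As an illustration, the test can be applied to the coefficient of magnitude 0.54119 obtained for the 10 schoolchildren in Example 1. This Python sketch assumes the usual statistic t = r·sqrt(n-2)/sqrt(1-r²); the critical value 2.306 (two-sided, α = 0.05, 8 degrees of freedom) is taken from a standard Student's t table, not from this text.

```python
import math

def t_statistic(r, n):
    """t value for testing a correlation coefficient r from n pairs."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r = 0.54119          # magnitude of the coefficient reported in Example 1
n = 10               # number of schoolchildren in Example 1

t = t_statistic(r, n)

# Assumed critical value: two-sided test, alpha = 0.05, n - 2 = 8 degrees
# of freedom, from a standard Student's t table.
T_CRITICAL = 2.306

significant = abs(t) > T_CRITICAL
print(round(t, 2), significant)
```

Here |t| falls below the critical value, which agrees with the conclusion of Example 1 that the relationship has not been proven.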

Analysis ToolPak add-in

The Analysis ToolPak add-in contains analysis tools of the same names for calculating covariance and correlation.

After calling the tool, a dialog box appears containing the following fields:

  • Input Range: enter a reference to the range containing the source data for the 2 variables
  • Grouped By: as a rule, the source data are entered in 2 columns
  • Labels in First Row: if this box is checked, the Input Range must include the column headers. Checking the box is recommended so that the add-in's output has informative column labels
  • Output Range: the range of cells where the results will be placed. It is enough to specify the upper-left cell of this range.

The add-in returns the calculated correlation and covariance values ​​(for covariance, the variances of both random variables are also calculated).

In a correlation relationship, the same value of one characteristic corresponds to different values of another. For example, there is a correlation between height and weight, between the incidence of malignant neoplasms and age, etc.

There are 2 methods for calculating the correlation coefficient: the method of squares (Pearson), the method of ranks (Spearman).

The most accurate is the method of squares (Pearson), in which the correlation coefficient is determined by the formula $r_{xy} = \frac{\sum d_x d_y}{\sqrt{\sum d_x^2 \cdot \sum d_y^2}}$, where

r xy is the correlation coefficient between the statistical series X and Y.

d x is the deviation of each of the numbers of the statistical series X from its arithmetic mean.

d y is the deviation of each of the numbers of the statistical series Y from its arithmetic mean.

Depending on the strength and direction of the relationship, the correlation coefficient ranges from 0 to 1 for a direct relationship and from 0 to -1 for an inverse one. A correlation coefficient of 0 indicates a complete absence of relationship. The closer the correlation coefficient is to 1 or -1, the stronger the direct or inverse relationship it measures. When the correlation coefficient equals 1 or -1, the relationship is complete (functional).

Scheme for assessing the strength of correlation using the correlation coefficient

Strength of relationship | Direct relationship (+) | Inverse relationship (-)
No relationship | 0 | 0
Weak | from 0 to +0.29 | from 0 to -0.29
Moderate | from +0.3 to +0.69 | from -0.3 to -0.69
Strong | from +0.7 to +0.99 | from -0.7 to -0.99
Complete (functional) | +1 | -1
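The assessment scheme above can be expressed as a small lookup. In this Python sketch the function name `connection_strength` is hypothetical, and values that fall between the tabulated bands (for example 0.295) are assigned to the lower band, which is an assumption rather than part of the scheme.

```python
def connection_strength(r):
    """Classify a correlation coefficient by strength and direction."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    size = abs(r)
    if size == 0:
        return "no connection"
    direction = "direct" if r > 0 else "inverse"
    if size < 0.3:
        strength = "weak"
    elif size < 0.7:
        strength = "moderate"
    elif size < 1.0:
        strength = "strong"
    else:
        strength = "complete (functional)"
    return f"{strength} {direction}"

print(connection_strength(0.54119))
print(connection_strength(-0.92))
```

Such a label describes only the size of the coefficient; whether the relationship is statistically reliable has to be checked separately.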

To calculate the correlation coefficient by the method of squares, a table of 7 columns is compiled. Let's look at the calculation using an example:

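DETERMINE THE STRENGTH AND NATURE OF THE RELATIONSHIP BETWEEN THE IODINE CONTENT OF WATER AND THE INCIDENCE OF GOITER

Iodine content of water, mg/l (V_x) | Incidence of goiter, % (V_y) | d_x = V_x - M_x | d_y = V_y - M_y | d_x d_y | d_x^2 | d_y^2
Totals: Σ d_x d_y = -1345.0 | Σ d_x^2 = 13996.0 | Σ d_y^2 = 313.47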

1. Determine the average iodine content in water (in mg/l): M_x = 138 mg/l.

2. Determine the average incidence of goiter in %: M_y = 3.8%.

3. Determine the deviation of each V_x from M_x, i.e. d_x.

201-138=63; 178-138=40, etc.

4. Similarly, determine the deviation of each V_y from M_y, i.e. d_y.

0.2-3.8=-3.6; 0.6-3.8=-3.2, etc.

5. Determine the products of the deviations and sum them: Σ d_x d_y = -1345.0.

6. Square each d_x and sum the results: Σ d_x^2 = 13996.0.

7. Similarly, square each d_y and sum the results: Σ d_y^2 = 313.47.

8. Finally, substitute all the sums into the formula:

$$r_{xy} = \frac{\sum d_x d_y}{\sqrt{\sum d_x^2 \cdot \sum d_y^2}} = \frac{-1345.0}{\sqrt{13996.0 \cdot 313.47}} \approx -0.64$$
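The final substitution can be checked with a few lines of Python. The sketch uses the three column totals given in the table for this example (-1345.0, 13996.0 and 313.47) and assumes the method-of-squares formula r = Σ d_x d_y / sqrt(Σ d_x² · Σ d_y²); the variable names are illustrative.

```python
import math

# Column totals taken from the worked example in the text.
sum_dxdy = -1345.0    # sum of the products d_x * d_y
sum_dx2 = 13996.0     # sum of the squared deviations d_x^2
sum_dy2 = 313.47      # sum of the squared deviations d_y^2

# Method of squares (Pearson): ratio of the sum of products to the
# square root of the product of the sums of squares.
r_xy = sum_dxdy / math.sqrt(sum_dx2 * sum_dy2)

print(round(r_xy, 2))
```

The negative sign indicates an inverse relationship: higher iodine content in water goes with a lower incidence of goiter.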

To resolve the issue of the reliability of the correlation coefficient, its average error is determined using the formula:

(If the number of observations is less than 30, then the denominator is n–1).

In our example

The value of the correlation coefficient is considered reliable if it is at least 3 times higher than its average error.

In our example

Thus, the correlation coefficient is not reliable, which necessitates an increase in the number of observations.

The correlation coefficient can be determined in a slightly less accurate, but much easier way - the method of ranks (Spearman).

Spearman method: $\rho = 1 - \frac{6\sum d^2}{n(n^2-1)}$

1. Make two series of paired comparable characteristics, denoting the first and second series x and y, respectively. Arrange the first series in descending or ascending order, and place the numerical values of the second series opposite the values of the first series to which they correspond.

2. Replace the value of the characteristic in each of the compared series with a serial number (rank). The ranks indicate the places of the values in the first and second series. Ranks must be assigned to the values of the second characteristic in the same order as was adopted for the values of the first. When a series contains identical values, their rank is the average of the ordinal numbers these values would occupy.

3. Determine the rank difference between x and y (d): d = x - y.

4. Square each rank difference (d^2).

5. Obtain the sum of the squared differences (Σ d^2) and substitute the resulting values into the Spearman formula.

Example: Using the rank method, establish the direction and strength of the relationship between years of work experience and the frequency of injuries if the following data are obtained:

Justification for choosing the method: Only the rank correlation method can be used to solve this problem, because the first series, "work experience in years", has open-ended categories (work experience up to 1 year and 7 or more years), which rules out the more accurate method of squares for establishing the relationship between the compared characteristics.

Solution. The sequence of calculations is described in the text; the results are presented in Table 2.

Table 2

Work experience in years (x) | Number of injuries (y) | Rank of x | Rank of y | Rank difference d = x - y | Squared rank difference d^2

Each of the rows of paired characteristics is designated by “x” and “y” (columns 1-2).

The value of each characteristic is replaced by a rank (ordinal number). Ranks in the series x are assigned as follows: the minimum value of the characteristic (experience up to 1 year) is assigned rank 1, and the subsequent values of the same series, in increasing order, are assigned ranks 2, 3, 4 and 5 (see column 3). The same order is followed when assigning ranks to the second characteristic y (column 4). When several values are equal (in this problem, 12 and 12 injuries per 100 workers with 3-4 and 5-6 years of experience), their rank is the average of the ordinal numbers they occupy. The two values of 12 injuries should occupy places 2 and 3 in the ranking, so their average is (2 + 3) / 2 = 2.5. Thus, both values of 12 injuries (characteristic y) are assigned the same rank, 2.5 (column 4).

Determine the rank difference d = (x - y) (column 5).

Square the rank difference (d^2) and obtain the sum of squared rank differences Σ d^2 (column 6).

Calculate the rank correlation coefficient using the formula:

$$\rho = 1 - \frac{6\sum d^2}{n(n^2-1)}$$

where n is the number of pairs of values being compared in the series x and y.
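The ranking procedure, including average ranks for tied values, can be sketched in Python as follows. The function names are hypothetical. The injury counts are illustrative: only the tied pair 12 and 12 is stated in the text, and the other three values are assumptions made so that the example can run.

```python
def average_ranks(values):
    """Rank values in ascending order; tied values share the mean of their places."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + 1 + end + 1) / 2   # mean of the 1-based places pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    rank_x = average_ranks(xs)
    rank_y = average_ranks(ys)
    n = len(xs)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Five experience groups coded 1..5 in increasing order of experience.
experience = [1, 2, 3, 4, 5]
# Injuries per 100 workers: hypothetical except for the tied pair 12 and 12.
injuries = [24, 16, 12, 12, 6]

rho = spearman(experience, injuries)
print(average_ranks(injuries), round(rho, 3))
```

With these values the tied pair receives rank 2.5, as described in the text, and the coefficient is strongly negative, indicating an inverse relationship. When there are many ties, the simplified formula is only approximate.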

The correlation coefficient reflects the degree of relationship between two indicators. It always takes a value from -1 to 1. If the coefficient is close to 0, there is no linear relationship between the variables.

If the value is close to one (from 0.9, for example), there is a strong direct relationship between the observed objects. If the coefficient is close to the other end of the range (-1), there is a strong inverse relationship between the variables. When the value lies somewhere between 0 and 1 or between 0 and -1, the relationship is weak (direct or inverse). Such a relationship is usually disregarded: it is considered not to exist.

Calculation of correlation coefficient in Excel

Let's look at an example of methods for calculating the correlation coefficient, features of direct and inverse relationships between variables.

Values ​​of indicators x and y:

Y is an independent variable, x is a dependent variable. It is necessary to find the strength (strong/weak) and direction (forward/inverse) of the connection between them. The correlation coefficient formula looks like this:


To make it easier to understand, let's break it down into several simple elements.

A strong direct relationship is determined between the variables.

The built-in CORREL function avoids the complex calculations. Let's calculate the pair correlation coefficient in Excel with it. Open the Function Wizard and find CORREL. The function arguments are the array of y values and the array of x values:

Let's show the values ​​of the variables on the graph:


A strong connection between y and x is visible, because the lines run almost parallel to each other. The relationship is direct: y increases - x increases, y decreases - x decreases.



Pair correlation coefficient matrix in Excel

A correlation matrix is a table in which the correlation coefficients between the corresponding values are located at the intersections of rows and columns. It makes sense to build it for several variables.

The matrix of correlation coefficients in Excel is constructed using the “Correlation” tool from the “Data Analysis” package.


A strong direct relationship was found between the values of y and x1. There is a strong inverse relationship between x1 and x2. There is practically no relationship with the values in column x3.