Spearman's rank correlation coefficient is a quantitative measure of the statistical relationship between phenomena, used in nonparametric methods.

The indicator shows how much the observed sum of squared differences between ranks differs from the case of no relationship.

Purpose of the service. Using this online calculator you can:

  • calculate Spearman's rank correlation coefficient;
  • construct a confidence interval for the coefficient and assess its significance.

Spearman's rank correlation coefficient is one of the indicators used to assess the strength of a relationship. As with other correlation coefficients, the qualitative strength of the rank correlation can be assessed using the Chaddock scale.

Calculation of the coefficient consists of the following steps: rank both variables, compute the difference d between the ranks of each pair and its square d², then substitute Σd² into the formula.

Properties of Spearman's rank correlation coefficient

Application area. The rank correlation coefficient is used to assess the strength of the relationship between two series. In addition, its test of statistical significance is used when checking data for heteroskedasticity.

Example. Based on a sample of observed variables X and Y:

  1. create a ranking table;
  2. find Spearman's rank correlation coefficient and check its significance at significance level α = 0.1;
  3. assess the nature of the dependence.
Solution. Let's assign ranks to feature Y and factor X.
X    Y    rank X, d_x    rank Y, d_y
28 21 1 1
30 25 2 2
36 29 4 3
40 31 5 4
30 32 3 5
46 34 6 6
56 35 8 7
54 38 7 8
60 39 10 9
56 41 9 10
60 42 11 11
68 44 12 12
70 46 13 13
76 50 14 14

Rank matrix.
rank X, d_x    rank Y, d_y    (d_x − d_y)²
1 1 0
2 2 0
4 3 1
5 4 1
3 5 4
6 6 0
8 7 1
7 8 1
10 9 1
9 10 1
11 11 0
12 12 0
13 13 0
14 14 0
Sum: 105    105    10

Checking the correctness of the matrix based on the checksum:

Σd_x = Σd_y = n(n + 1)/2 = 14 · 15/2 = 105

The column sums of the matrix equal each other and the checksum, which means the matrix is composed correctly.
Using the formula, we calculate the Spearman rank correlation coefficient:

r_s = 1 − 6 · Σd²/(n(n² − 1)) = 1 − 6 · 10/(14 · (14² − 1)) ≈ 0.978
The relationship between trait Y and factor X is strong and direct.
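A minimal Python sketch (plain arithmetic, no libraries) that reproduces this calculation from the (d_x − d_y)² column above:

```python
# The (d_x - d_y)^2 column of the rank matrix above:
d_squared = [0, 0, 1, 1, 4, 0, 1, 1, 1, 1, 0, 0, 0, 0]
n = len(d_squared)                                   # 14 pairs of observations
r_s = 1 - 6 * sum(d_squared) / (n * (n ** 2 - 1))    # 1 - 60/2730
print(round(r_s, 3))  # 0.978 -- a strong, direct relationship
```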
Significance of Spearman's rank correlation coefficient
In order to test, at significance level α, the null hypothesis that the population Spearman rank correlation coefficient equals zero against the competing hypothesis H₁: ρ ≠ 0, we need to calculate the critical point:

T_kp = t(α, k) · √((1 − ρ²)/(n − 2)),
where n is the sample size; ρ is the sample Spearman rank correlation coefficient; t(α, k) is the critical point of the two-sided critical region, found from the table of critical points of Student's distribution for significance level α and k = n − 2 degrees of freedom.
If |ρ| < T_kp, there is no reason to reject the null hypothesis: the rank correlation between the characteristics is not significant. If |ρ| > T_kp, the null hypothesis is rejected: there is a significant rank correlation between the characteristics.
Using the Student's table we find t(α/2, k) = t(0.05; 12) = 1.782.

T_kp = 1.782 · √((1 − 0.978²)/12) ≈ 0.11
Since T_kp < ρ, we reject the hypothesis that the Spearman rank correlation coefficient equals 0. In other words, the rank correlation coefficient is statistically significant, and the rank correlation between the two series is significant.
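A small sketch, assuming SciPy is available, reproducing both the Student critical point and the critical value T_kp for this example:

```python
from scipy.stats import t

n, rho, alpha = 14, 0.978, 0.10
t_crit = t.ppf(1 - alpha / 2, df=n - 2)              # two-sided critical point
T_kp = t_crit * ((1 - rho ** 2) / (n - 2)) ** 0.5    # critical value for |rho|
print(round(t_crit, 3), round(T_kp, 3))              # 1.782 0.107
if abs(rho) > T_kp:
    print("Reject H0: the rank correlation is significant")
```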

Brief theory

Rank correlation is a method of correlation analysis that reflects the relationships of variables ordered by increasing value.

Ranks are the serial numbers of the units of a population in a ranked series. If we rank a population according to the two characteristics whose relationship is being studied, then complete coincidence of ranks means the closest possible direct relationship, and completely opposite ranks mean the closest possible inverse relationship. Both characteristics must be ranked in the same order: either from smaller values of the characteristic to larger ones, or vice versa.

For practical purposes, rank correlation is very useful. For example, if a high rank correlation is established between two quality characteristics of products, then it is enough to inspect products on only one of them, which reduces the cost and speeds up inspection.

The rank correlation coefficient proposed by C. Spearman is a nonparametric measure of the relationship between variables measured on a rank scale. When calculating this coefficient, no assumptions are required about the distributions of the characteristics in the population. The coefficient measures the strength of the relationship between ordinal characteristics, which in this case are the ranks of the compared quantities.

The value of the Spearman correlation coefficient lies between −1 and +1. It can be positive or negative, characterizing the direction of the relationship between two characteristics measured on a rank scale.

Spearman's rank correlation coefficient is calculated using the formula:

r_s = 1 − 6 · Σd²/(n(n² − 1)),

where d is the difference between the ranks on the two variables and n is the number of matched pairs.

The first step in calculating the rank correlation coefficient is to rank the series of variables. Ranking begins by arranging the values in ascending order. Distinct values are assigned ranks denoted by natural numbers. If several values are equal, they are assigned the average rank.

The advantage of the Spearman rank correlation coefficient is that it allows ranking by characteristics that cannot be expressed numerically: candidates for a position can be ranked by professional level, by ability to lead a team, by personal charm, and so on. With expert assessments, one can rank the assessments of different experts and find their correlations with each other, in order then to exclude from consideration the assessments of an expert that are weakly correlated with the assessments of the other experts. Spearman's rank correlation coefficient is also used to assess the stability of a trend. A disadvantage of the rank correlation coefficient is that identical differences in ranks can correspond to completely different differences in the values of the characteristics (in the case of quantitative characteristics). Therefore, for the latter, rank correlation should be considered an approximate measure of the strength of the relationship, less informative than the correlation coefficient of the numerical values of the characteristics.

Example of problem solution

The task

A survey of 10 randomly selected students living in a university dormitory reveals the relationship between the average score of the previous session and the number of hours per week a student spends on independent study.

Determine the strength of the relationship using the Spearman rank correlation coefficient.


Solution of the problem

Let's calculate the rank correlation coefficient.

№    x, hours    y, score    rank x    rank y    d = rank x − rank y    d²
1    26    4.7    8    10    −2    4
2    22    4.4    7    9    −2    4
3    8     3.8    1    4    −3    9
4    12    3.7    3    3    0     0
5    15    4.2    4    7    −3    9
6    30    4.3    9    8    1     1
7    20    3.6    6    2    4     16
8    31    4.0    10   6    4     16
9    10    3.1    2    1    1     1
10   17    3.9    5    5    0     0
Sum: 60

Spearman's rank correlation coefficient:

r_s = 1 − 6 · Σd²/(n(n² − 1))

Substituting numerical values, we get:

r_s = 1 − 6 · 60/(10 · (10² − 1)) = 1 − 360/990 ≈ 0.64

Conclusion to the problem

The relationship between the average score of the previous session and the number of hours per week the student spends on independent study is moderately strong (r_s ≈ 0.64).
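As a cross-check, scipy.stats.spearmanr (assuming SciPy is installed) gives the same value on the raw data, since there are no tied ranks here:

```python
from scipy.stats import spearmanr

hours = [26, 22, 8, 12, 15, 30, 20, 31, 10, 17]               # x
gpa = [4.7, 4.4, 3.8, 3.7, 4.2, 4.3, 3.6, 4.0, 3.1, 3.9]      # y
r_s, p_value = spearmanr(hours, gpa)
print(round(r_s, 3))  # 0.636 -- a moderately strong direct relationship
```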


Examples of related problems

Fechner sign correlation coefficient
A brief theory is given and an example problem on calculating the Fechner sign correlation coefficient is solved.

Mutual contingency coefficients of Chuprov and Pearson
The page contains information on methods for studying relationships between qualitative characteristics using the Chuprov and Pearson mutual contingency coefficients.

In cases where the characteristics under study are measured on an ordinal scale, or the form of the relationship differs from linear, the relationship between two random variables is studied using rank correlation coefficients. Consider the Spearman rank correlation coefficient. To calculate it, the sample values must be ranked (ordered). Ranking is the arrangement of experimental data in a certain order, either ascending or descending.

The ranking operation is carried out according to the following algorithm:

1. A lower value is assigned a lower rank; the smallest value is assigned rank 1. The highest value is assigned a rank equal to the number of ranked values. For example, if n = 7, the highest value receives rank 7, except as provided by the second rule.

2. If several values are equal, they are assigned a rank equal to the average of the ranks they would receive if they were not equal. As an example, consider an ascending-ordered sample of 7 elements: 22, 23, 25, 25, 25, 28, 30. The values 22 and 23 appear once each, so their ranks are R22 = 1 and R23 = 2. The value 25 appears 3 times; if these values were not repeated, their ranks would be 3, 4, 5, so their rank R25 equals the arithmetic mean of 3, 4 and 5: (3 + 4 + 5)/3 = 4. The values 28 and 30 are not repeated, so their ranks are R28 = 6 and R30 = 7. Finally, we have the correspondence: 22 → 1, 23 → 2, 25 → 4, 25 → 4, 25 → 4, 28 → 6, 30 → 7.

3. The total sum of ranks must coincide with the calculated one, which is determined by the formula:

ΣR_i = n(n + 1)/2,

where n is the total number of ranked values.

A discrepancy between the actual and calculated rank sums will indicate an error made when calculating ranks or summing them up. In this case, you need to find and fix the error.
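A short sketch of this ranking rule and checksum, using scipy.stats.rankdata (which assigns average ranks to ties by default):

```python
from scipy.stats import rankdata

sample = [22, 23, 25, 25, 25, 28, 30]
ranks = rankdata(sample)               # ties receive the average of their ranks
print(ranks)                           # [1. 2. 4. 4. 4. 6. 7.]

n = len(sample)
assert ranks.sum() == n * (n + 1) / 2  # checksum: 28 = 7 * 8 / 2
```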

Spearman's rank correlation coefficient is a method that allows one to determine the strength and direction of the relationship between two traits or two hierarchies of traits. The use of the rank correlation coefficient has a number of limitations:

  • a) The assumed correlation must be monotonic.
  • b) The size of each sample must be at least 5. The upper limit of the sample size is determined by the available tables of critical values (Table 3 of the Appendix); the maximum value of n in the table is 40.
  • c) A large number of identical ranks may occur during the analysis. In this case, a correction must be made. The most favorable case is when both samples under study represent two sequences of non-coinciding values.

To conduct a correlation analysis, the researcher must have two samples that can be ranked, for example:

  • - two characteristics measured in the same group of subjects;
  • - two individual hierarchies of traits identified in two subjects using the same set of traits;
  • - two group hierarchies of characteristics;
  • - individual and group hierarchies of characteristics.

We begin the calculation by ranking the studied indicators separately for each of the characteristics.

Let us analyze the case of two characteristics measured in the same group of subjects. First, the individual values obtained by different subjects are ranked on the first characteristic, then the individual values are ranked on the second characteristic. If lower ranks of one indicator correspond to lower ranks of the other, and higher ranks of one correspond to higher ranks of the other, the two characteristics are positively related. If higher ranks of one indicator correspond to lower ranks of the other, the two characteristics are negatively related. To find rs, we determine the differences between the ranks (d) for each subject. The smaller the differences between the ranks, the closer the rank correlation coefficient rs is to +1. If there is no relationship, there will be no correspondence between the ranks, and rs will be close to zero. The greater the differences between the subjects' ranks on the two variables, the closer the coefficient rs is to −1. Thus, the Spearman rank correlation coefficient is a measure of any monotonic relationship between the two characteristics under study.

Let us consider the case with two individual hierarchies of traits identified in two subjects using the same set of traits. In this situation, the individual values obtained by each of the two subjects are ranked according to a certain set of characteristics. The feature with the lowest value is assigned the first rank; the feature with the next higher value, the second rank, and so on. Special attention should be paid to ensuring that all characteristics are measured in the same units. For example, indicators cannot be ranked if they are expressed in points of different "cost", since it is impossible to determine which factor takes first place in severity until all values are brought to a single scale. If features that have low ranks in one subject also have low ranks in the other, and vice versa, then the individual hierarchies are positively related.

In the case of two group hierarchies of characteristics, the average group values ​​obtained in two groups of subjects are ranked according to the same set of characteristics for the studied groups. Next, we follow the algorithm given in previous cases.

Let us analyze the case with an individual and a group hierarchy of characteristics. One begins by ranking separately the individual values of the subject and the average group values according to the same set of characteristics, the group averages being obtained excluding the subject himself, since his individual hierarchy will be compared with the group one. Rank correlation allows us to assess the degree of consistency between the individual and group hierarchies of traits.

Let us consider how the significance of the correlation coefficient is determined in the cases listed above. In the case of two characteristics, it is determined by the sample size. In the case of two individual hierarchies of characteristics, the significance depends on the number of characteristics included in the hierarchy. In the last two cases, significance is determined by the number of characteristics studied, not by the number of groups. Thus, the significance of rs in all cases is determined by the number of ranked values n.

When testing the statistical significance of rs, tables of critical values of the rank correlation coefficient are used, compiled for various numbers of ranked values and different significance levels. If the absolute value of rs reaches or exceeds the critical value, the correlation is reliable.

When considering the first option (a case with two signs measured in the same group of subjects), the following hypotheses are possible.

H0: The correlation between variables x and y is not different from zero.

H1: The correlation between variables x and y is significantly different from zero.

If we work with any of the three remaining cases, then it is necessary to put forward another pair of hypotheses:

H0: The correlation between hierarchies x and y is not different from zero.

H1: The correlation between hierarchies x and y is significantly different from zero.

The sequence of actions when calculating the Spearman rank correlation coefficient rs is as follows.

  • - Determine which two features or two hierarchies of features will participate in the comparison as variables x and y.
  • - Rank the values of the variable x, assigning rank 1 to the lowest value, in accordance with the ranking rules. Place the ranks in the first column of the table in order of the subjects' or features' numbers.
  • - Rank the values of the variable y. Place the ranks in the second column of the table in order of the subjects' or features' numbers.
  • - Calculate the differences d between the ranks x and y for each row of the table. Place the results in the next column of the table.
  • - Calculate the squared differences (d²). Place the resulting values in the fourth column of the table.
  • - Calculate the sum of squared differences Σd².
  • - If identical ranks occur, calculate the corrections:

Tx = Σ(tx³ − tx)/12,  Ty = Σ(ty³ − ty)/12,

where tx is the volume of each group of identical ranks in sample x;

ty is the volume of each group of identical ranks in sample y.

Calculate the rank correlation coefficient depending on the presence or absence of identical ranks. If there are no identical ranks, calculate the rank correlation coefficient rs using the formula:

rs = 1 − 6 · Σd²/(n(n² − 1)).

If there are identical ranks, calculate the rank correlation coefficient rs using the formula:

rs = 1 − 6 · (Σd² + Tx + Ty)/(n(n² − 1)),

where Σd² is the sum of squared differences between ranks;

Tx and Ty are the corrections for equal ranks;

n is the number of subjects or features participating in the ranking.

Determine the critical value of rs from Appendix Table 3 for the given number of subjects n. The correlation coefficient differs significantly from zero provided that rs is not less than the critical value.
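A sketch implementing this procedure, with the tie corrections Tx and Ty as defined above (the helper name spearman_with_ties is mine, not a library function):

```python
from collections import Counter
from scipy.stats import rankdata

def spearman_with_ties(x, y):
    """Spearman's r_s with the corrections Tx, Ty for groups of identical ranks."""
    n = len(x)
    rx, ry = rankdata(x), rankdata(y)                 # steps 2-3: average ranks
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))    # steps 4-6: sum of d^2

    def correction(values):                           # step 7: T = sum(t^3 - t)/12
        return sum(t ** 3 - t for t in Counter(values).values()) / 12

    t_x, t_y = correction(x), correction(y)
    return 1 - 6 * (d2 + t_x + t_y) / (n * (n ** 2 - 1))

# With no tied values the corrections are zero and the simple formula is recovered:
print(round(spearman_with_ties([28, 30, 36, 40, 46], [21, 25, 29, 31, 34]), 3))  # 1.0
```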

37. Spearman's rank correlation coefficient.

S. 56 (64) 063.JPG

http://psystat.at.ua/publ/1-1-0-33

Spearman's rank correlation coefficient is used in cases where:
- the variables are measured on a rank (ordinal) scale;
- the data distribution differs too much from normal or is not known at all;
- the samples are small (N < 30).

The interpretation of the Spearman rank correlation coefficient does not differ from that of the Pearson coefficient, but its meaning is somewhat different. To understand the difference between these methods and logically justify their areas of application, let us compare their formulas.

Pearson correlation coefficient:

r_xy = Σ(x_i − x̄)(y_i − ȳ)/((N − 1) · s_x · s_y)

Spearman correlation coefficient:

r_s = 1 − 6 · Σd²/(N(N² − 1))

As you can see, the formulas differ significantly.

The Pearson correlation formula uses the arithmetic mean and the standard deviation of the correlated series, while the Spearman formula does not. Thus, to obtain an adequate result with the Pearson formula, the correlated series must be close to normally distributed (the mean and standard deviation are the parameters of a normal distribution). This is not relevant for the Spearman formula.

An element of the Pearson formula is the standardization of each series on the z-scale: z = (x − x̄)/s_x.

As you can see, the conversion of variables to the z-scale is present in the formula for the Pearson correlation coefficient. Accordingly, for the Pearson coefficient the scale of the data does not matter at all: for example, we can correlate two variables, one of which has min = 0 and max = 1, and the second min = 100 and max = 1000. No matter how different the ranges of values are, they will all be converted to standard z-values of the same scale.

Such normalization does not occur in the Spearman coefficient, therefore

A MANDATORY CONDITION FOR USING THE SPEARMAN COEFFICIENT IS THE EQUALITY OF THE RANGE OF THE TWO VARIABLES.

Before using the Spearman coefficient for data series with different ranges, they must be ranked. Ranking makes the values of both series share the same minimum = 1 (the minimum rank) and a maximum equal to the number of values (the maximum, last rank = N, i.e., the number of cases in the sample).

In what cases can you do without ranking?

These are cases when the data are already on a rank scale, for example Rokeach's value orientations test.

Also, these are cases when the number of value options is small and the sample contains a fixed minimum and maximum. For example, in a semantic differential, the minimum = 1 and the maximum = 7.
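A small sketch of this point, assuming SciPy: after ranking, both series share the scale 1..N, and Spearman's r_s coincides with Pearson's r computed on the ranks:

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

x = np.array([0.1, 0.4, 0.2, 0.9, 0.7])          # range about 0..1
y = np.array([120, 450, 230, 980, 640])          # range about 100..1000

rho, _ = spearmanr(x, y)                         # Spearman on the raw values
r_ranks, _ = pearsonr(rankdata(x), rankdata(y))  # Pearson on ranks 1..N
print(rho, r_ranks)                              # both 1.0: identical orderings
```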

Example of calculating Spearman's rank correlation coefficient

Rokeach's value orientations test was administered to two samples, X and Y. Objective: to find out how close the value hierarchies of these samples are (literally, how similar they are).

The resulting value r = 0.747 is checked against the table of critical values. According to the table, with N = 18 the obtained value is significant at p ≤ 0.005.

Spearman and Kendall rank correlation coefficients

For variables belonging to an ordinal scale, or for variables that are not normally distributed (as well as for variables belonging to an interval scale), the Spearman rank correlation is calculated instead of the Pearson coefficient. To do this, individual values of the variables are assigned ranks, which are then processed using the appropriate formulas. To obtain the rank correlation, clear the default Pearson check box in the Bivariate Correlations... dialog box and activate the Spearman calculation instead. This calculation gives the following result: the rank correlation coefficients are very close to the corresponding Pearson values (the original variables have a normal distribution).

titkova-matmetody.pdf p. 45

The Spearman rank correlation method allows you to determine the closeness (strength) and direction of the correlation between two characteristics or two profiles (hierarchies) of characteristics.

To calculate rank correlation, it is necessary to have two series of values that can be ranked. Such series of values could be:

1) two characteristics measured in the same group of subjects;

2) two individual hierarchies of characteristics identified in two subjects using the same set of characteristics;

3) two group hierarchies of characteristics;

4) individual and group hierarchies of characteristics.

First, the indicators are ranked separately for each characteristic. As a rule, a lower rank is assigned to a lower value of the characteristic.

In the first case (two characteristics), the individual values for the first characteristic obtained by different subjects are ranked, and then the individual values for the second characteristic.

If two characteristics are positively related, then subjects with low ranks on one of them will have low ranks on the other, and subjects with high ranks on one characteristic will also have high ranks on the other characteristic. To calculate rs, the differences (d) between the ranks obtained by a given subject on both characteristics must be determined. These indicators d are then transformed in a certain way and subtracted from 1. The smaller the differences between the ranks, the larger rs will be, the closer to +1.

If there is no correlation, all the ranks will be mixed and there will be no correspondence between them. The formula is designed so that in this case rs will be close to 0.

In the case of a negative correlation, low ranks of subjects on one characteristic will correspond to high ranks on the other characteristic, and vice versa. The greater the discrepancy between the subjects' ranks on the two variables, the closer rs is to −1.

In the second case (two individual profiles), the individual values obtained by each of the two subjects on a certain set of characteristics (the same for both of them) are ranked. The first rank is given to the characteristic with the lowest value; the second rank to the characteristic with a higher value, and so on. Obviously, all characteristics must be measured in the same units, otherwise ranking is impossible. For example, it is impossible to rank indicators on the Cattell Personality Inventory (16PF) if they are expressed in "raw" points, since the ranges of values differ across factors: from 0 to 13, from 0 to 20, and from 0 to 26. We cannot say which factor takes first place in severity until we bring all the values to a single scale (most often the sten scale).

If the individual hierarchies of two subjects are positively related, then characteristics with low ranks in one of them will have low ranks in the other, and vice versa. For example, if one subject's factor E (dominance) has the lowest rank, then the other subject's factor E should also have a low rank; if one subject's factor C (emotional stability) has the highest rank, then the other subject's factor C must also have a high rank, and so on.

In the third case (two group profiles), the group average values obtained in two groups of subjects are ranked according to a set of characteristics that is the same for both groups. The reasoning then follows the previous two cases.

In the fourth case (individual and group profiles), the individual values of the subject and the group average values are ranked separately according to the same set of characteristics. The group averages are obtained, as a rule, by excluding the individual subject, who does not participate in the average group profile with which his individual profile will be compared. Rank correlation allows one to check how consistent the individual and group profiles are.

In all four cases, the significance of the resulting correlation coefficient is determined by the number of ranked values N. In the first case, this number coincides with the sample size n. In the second case, the number of observations is the number of characteristics making up the hierarchy. In the third and fourth cases, N is likewise the number of compared characteristics, not the number of subjects in the groups. Detailed explanations are given in the examples. If the absolute value of rs reaches or exceeds the critical value, the correlation is reliable.

Hypotheses.

There are two possible hypotheses. The first applies to case 1, the second to the other three cases.

First version of the hypotheses:

H0: The correlation between variables A and B does not differ from zero.

H1: The correlation between variables A and B differs significantly from zero.

Second version of the hypotheses:

H0: The correlation between hierarchies A and B does not differ from zero.

H1: The correlation between hierarchies A and B differs significantly from zero.

Limitations of the rank correlation coefficient

1. For each variable, at least 5 observations must be presented. The upper limit of the sample size is determined by the available tables of critical values.

2. Spearman's rank correlation coefficient rs gives rough values when there is a large number of identical ranks on one or both compared variables. Ideally, both correlated series should be two sequences of non-coinciding values. If this condition is not met, a correction for identical ranks must be made.

Spearman's rank correlation coefficient is calculated using the formula:

rs = 1 − 6 · Σd²/(N(N² − 1))

If both compared rank series contain groups of identical ranks, corrections for identical ranks Ta and Tb must be made before calculating the rank correlation coefficient:

Ta = Σ(a³ − a)/12,

Tb = Σ(b³ − b)/12,

where a is the size of each group of identical ranks in rank series A, and b is the size of each group of identical ranks in rank series B.

To calculate the empirical value of rs, use the formula:

rs = 1 − 6 · (Σd² + Ta + Tb)/(N(N² − 1))

38. Point-biserial correlation coefficient.

About correlation in general, see question No. 36, p. 56 (64) 063.JPG

harchenko-korranaliz.pdf

Let variable X be measured on a metric (interval or ratio) scale, and variable Y on a dichotomous scale. The point-biserial correlation coefficient r_pb is calculated using the formula:

r_pb = ((x̄₁ − x̄₀)/s_x) · √(n₁n₀/(n(n − 1)))

Here x̄₁ is the mean value of X over objects with the value "one" on Y;

x̄₀ is the mean value of X over objects with the value "zero" on Y;

s_x is the standard deviation of all values of X;

n₁ is the number of objects with "one" on Y, n₀ is the number of objects with "zero" on Y;

n = n₁ + n₀ is the sample size.

The point-biserial correlation coefficient can also be calculated through other equivalent expressions, for example through the overall mean:

r_pb = ((x̄₁ − x̄)/s_x) · √(n·n₁/(n₀(n − 1)))

Here x̄ is the overall mean value of the variable X.

The point-biserial correlation coefficient r_pb varies from −1 to +1. Its value is zero if objects with a one on Y have the same mean X as objects with a zero on Y.

Testing the hypothesis about the significance of the point-biserial correlation coefficient means testing the null hypothesis H0 that the population correlation coefficient equals zero, ρ = 0, which is carried out using Student's t-test. The empirical value

t = r_pb · √((n − 2)/(1 − r_pb²))

is compared with the critical value t_α(df) for the number of degrees of freedom df = n − 2.

If the condition |t| ≤ t_α(df) is met, the null hypothesis ρ = 0 is not rejected. The point-biserial correlation coefficient differs significantly from zero if the empirical value |t| falls into the critical region, that is, if |t| > t_α(n − 2). The reliability of the relationship calculated using the point-biserial correlation coefficient r_pb can also be determined using the χ² criterion for the number of degrees of freedom df = 2.
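A minimal sketch using scipy.stats.pointbiserialr (SciPy computes it as Pearson's r with the 0/1 coding, which is equivalent to the mean-difference formula above); the data here are made up for illustration:

```python
import numpy as np
from scipy.stats import pointbiserialr

y = np.array([1, 0, 1, 1, 0, 0, 1, 0])                       # dichotomous variable
x = np.array([14.2, 9.1, 15.3, 13.8, 8.7, 10.2, 16.0, 9.9])  # metric variable
r_pb, p_value = pointbiserialr(y, x)
print(round(r_pb, 3), round(p_value, 4))
```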

Point biserial correlation

A subsequent modification of the product-moment correlation coefficient is the point-biserial r. This statistic shows the relationship between two variables, one of which is assumed to be continuous and normally distributed, and the other discrete in the exact sense of the word. The point-biserial correlation coefficient is denoted r_pbis. Since in r_pbis the dichotomy reflects the true nature of the discrete variable, rather than being artificial as in the case of r_bis, its sign is determined arbitrarily. Therefore, for all practical purposes, r_pbis is considered in the range from 0.00 to +1.00.

There is also the case where two variables are assumed to be continuous and normally distributed, but both are artificially dichotomized, as in biserial correlation. To assess the relationship between such variables, the tetrachoric correlation coefficient r_tet is used, which was also derived by Pearson. The basic (exact) formulas and procedures for calculating r_tet are quite complex; therefore, in practice, approximations of r_tet obtained on the basis of abbreviated procedures and tables are used.

/on-line/dictionary/dictionary.php?term=511

POINT-BISERIAL COEFFICIENT is the correlation coefficient between two variables, one measured on a dichotomous scale and the other on an interval scale. It is used in classical and modern testing as an indicator of the quality of a test item: its reliability and consistency with the overall test score.

To correlate variables measured on dichotomous and interval scales, the point-biserial correlation coefficient is used.
The point-biserial correlation coefficient is a method of correlation analysis for the relationship between variables one of which is measured on a nominal scale and takes only 2 values (for example, men/women, correct answer/wrong answer, feature present/absent), and the other on a ratio or interval scale. The formula for calculating the point-biserial correlation coefficient:

r_pb = ((m₁ − m₀)/σ_x) · √(n₁n₀/(n(n − 1)))

Where:
m₁ and m₀ are the mean values of X for objects with a value of 1 or 0 on Y;
σ_x is the standard deviation of all values of X;
n₁, n₀ are the numbers of X values with a 1 or 0 on Y;
n is the total number of pairs of values.

Most often, this type of correlation coefficient is used to calculate the relationship between test items and the total scale. This is one type of validity check.

39. Rank-biserial correlation coefficient.

About correlation in general, see question No. 36, p. 56 (64) 063.JPG

harchenko-korranaliz.pdf p. 28

The rank-biserial correlation coefficient, used in cases where one of the variables (X) is presented on an ordinal scale and the other (Y) is dichotomous, is calculated by the formula:

r_rb = 2(x̄₁ − x̄₀)/n.

Here x̄₁ is the average rank of objects with a one on Y; x̄₀ is the average rank of objects with a zero on Y; n is the sample size.

Testing the hypothesis about the significance of the rank-biserial correlation coefficient is carried out in the same way as for the point-biserial correlation coefficient, using Student's test with r_pb replaced by r_rb in the formulas.

In cases where one variable is measured on a dichotomous scale (variable X) and the other on a rank scale (variable Y), the rank-biserial correlation coefficient is used. Recall that the variable X measured on the dichotomous scale takes only two values (codes), 0 and 1. We especially emphasize: although this coefficient varies in the range from −1 to +1, its sign does not matter for the interpretation of the results. This is another exception to the general rule.

This coefficient is calculated using the formula:

r_rb = 2(X̄₁ − X̄₀)/N,

where X̄₁ is the average rank of those elements of the variable Y that correspond to code (feature) 1 in the variable X;

X̄₀ is the average rank of those elements of the variable Y that correspond to code (feature) 0 in the variable X;

N is the total number of elements in the variable X.

To apply the rank-biserial correlation coefficient, the following conditions must be met:

1. The variables being compared must be measured on different scales: one, X, on a dichotomous scale; the other, Y, on a rank scale.

2. The number of varying values in the compared variables X and Y must be the same.

3. To assess the reliability of the rank-biserial correlation coefficient, use formula (11.9) and the table of critical values for Student's test with k = n − 2.

http://psystat.at.ua/publ/drugie_vidy_koehfficienta_korreljacii/1-1-0-38

Cases where one of the variables is represented on a dichotomous scale and the other on a rank (ordinal) scale require the rank-biserial correlation coefficient:

r_rb = (2/n) · (m₁ − m₀)

Where:
n is the number of measured objects;
m₁ and m₀ are the average ranks of objects with a 1 or 0 on the second variable.
This coefficient is also used when checking the validity of tests.
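A sketch of this formula (the helper rank_biserial is hypothetical, not a library function); the ranks are computed with scipy.stats.rankdata:

```python
import numpy as np
from scipy.stats import rankdata

def rank_biserial(binary, ordinal):
    """r_rb = 2 * (mean rank of the '1' group - mean rank of the '0' group) / n."""
    binary = np.asarray(binary)
    ranks = rankdata(ordinal)            # ranks 1..n of the ordinal variable
    m1 = ranks[binary == 1].mean()       # average rank of objects coded 1
    m0 = ranks[binary == 0].mean()       # average rank of objects coded 0
    return 2 * (m1 - m0) / len(ranks)

# Perfect separation of the two groups gives |r_rb| = 1:
print(rank_biserial([0, 0, 1, 1], [3, 1, 7, 9]))  # 1.0
```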

40. Linear correlation coefficient.

For correlation in general (and linear correlation in particular), see question No. 36, p. 56 (64) 063.JPG

PEARSON'S COEFFICIENT r

r-Pearson (Pearson r) is used to study the relationship between two metric variables measured on the same sample. There are many situations in which its use is appropriate. Does intelligence affect academic performance in the senior years of university? Is the size of an employee's salary related to his friendliness toward colleagues? Does a student's mood affect the success of solving a complex arithmetic problem? To answer such questions, the researcher must measure the two indicators of interest for each member of the sample. The data for studying the relationship are then tabulated, as in the example below.

EXAMPLE 6.1

The table shows an example of initial data for measuring two indicators of intelligence (verbal and nonverbal) for 20 8th grade students.

The relationship between these variables can be depicted using a scatterplot (see Figure 6.3). The diagram shows that there is some relationship between the measured indicators: the greater the value of verbal intelligence, the (mostly) the greater the value of non-verbal intelligence.

Before giving the formula for the correlation coefficient, let us trace the logic of its derivation using the data of example 6.1. The position of each i-th point (the subject with number i) on the scatter diagram relative to the other points (Fig. 6.3) can be specified by the values and signs of the deviations of the corresponding variable values from their means: (x_i − M_x) and (y_i − M_y). If the signs of these deviations coincide, this indicates a positive relationship (larger values of x correspond to larger values of y, or smaller values of x correspond to smaller values of y).

For subject No. 1, the deviations from the mean on both x and y are positive, and for subject No. 3 both deviations are negative. Consequently, the data of both indicate a positive relationship between the studied characteristics. Conversely, if the signs of the deviations from the mean on x and y differ, this indicates a negative relationship between the characteristics. Thus, for subject No. 4 the deviation from the mean on x is negative and on y positive, and for subject No. 9 it is vice versa.

Thus, if the product of deviations (x_i − M_x) × (y_i − M_y) is positive, the data of the i-th subject indicate a direct (positive) relationship, and if it is negative, an inverse (negative) relationship. Accordingly, if x and y are generally related in direct proportion, most of the products of deviations will be positive, and if they are inversely related, most of the products will be negative. Therefore, a general indicator of the strength and direction of the relationship can be the sum of all products of deviations for a given sample:

Σ(x_i − M_x)(y_i − M_y)

With a directly proportional relationship between the variables, this value is large and positive: for most subjects the deviations coincide in sign (large values of one variable correspond to large values of the other, and vice versa). If x and y are inversely related, then for most subjects large values of one variable correspond to small values of the other; the signs of the products will be negative, and the sum of the products will again be large in absolute value, but negative in sign. If there is no systematic relationship between the variables, the positive terms (products of deviations) will be balanced by negative ones, and the sum of all products of deviations will be close to zero.

To make the sum of the products independent of the sample size, it is enough to average it. But we are interested in the measure of the relationship not as a population parameter, but as a computable estimate of it, a statistic. Therefore, as in the variance formula, we divide the sum of the products of deviations not by N but by N − 1. The result is a measure of relationship, widely used in physics and the technical sciences, called the covariance (Covariance):

cov_xy = Σ(x_i − M_x)(y_i − M_y)/(N − 1)

In psychology, unlike physics, most variables are measured on arbitrary scales, since psychologists are interested not in the absolute value of a characteristic but in the relative position of subjects in a group. In addition, the covariance is very sensitive to the scale (the variance) on which the characteristics are measured. To make the measure of relationship independent of the units of measurement of both characteristics, it is enough to divide the covariance by the corresponding standard deviations. This yields the formula for the Pearson correlation coefficient:

r_xy = cov_xy/(s_x · s_y)

or, after substituting the expressions for s_x and s_y:

r_xy = Σ(x_i − M_x)(y_i − M_y)/((N − 1) · s_x · s_y)
If the values of both variables are converted to z-scores using the formula

z = (x − M_x)/s_x

then the formula for the r-Pearson correlation coefficient looks simpler (071.JPG):

r_xy = Σ(z_x · z_y)/(N − 1)
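A numeric sketch of this chain on made-up data: the covariance divided by the standard deviations, and the z-score form, both agree with numpy.corrcoef:

```python
import numpy as np

x = np.array([12.0, 15.0, 11.0, 14.0, 13.0, 17.0])
y = np.array([60.0, 70.0, 58.0, 66.0, 65.0, 76.0])
n = len(x)

mx, my = x.mean(), y.mean()
sx, sy = x.std(ddof=1), y.std(ddof=1)            # sample standard deviations
cov_xy = ((x - mx) * (y - my)).sum() / (n - 1)   # covariance, divided by N - 1
r = cov_xy / (sx * sy)                           # Pearson's r

zx, zy = (x - mx) / sx, (y - my) / sy            # z-scores
r_z = (zx * zy).sum() / (n - 1)                  # the "simpler" z-score form

print(round(r, 4), round(r_z, 4), round(np.corrcoef(x, y)[0, 1], 4))  # all equal
```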

/dict/sociology/article/soc/soc-0525.htm

LINEAR CORRELATION is a statistical, non-causal linear relationship between two quantitative variables x and y. It is measured using Pearson's linear correlation coefficient, which is the result of dividing the covariance by the standard deviations of both variables:

r_xy = s_xy/(s_x · s_y),

where s_xy is the covariance between the variables x and y;

s_x, s_y are the standard deviations of the variables x and y;

x_i, y_i are the values of the variables x and y for the object with number i;

x̄, ȳ are the arithmetic means of the variables x and y.

The Pearson coefficient r can take values from the interval [−1; +1]. The value r = 0 means there is no linear relationship between the variables x and y (but does not exclude a nonlinear statistical relationship). Positive values (r > 0) indicate a direct linear relationship; the closer the value to +1, the stronger the statistical linear relationship. Negative values (r < 0) indicate an inverse linear relationship; the closer the value to −1, the stronger the inverse relationship. Values r = ±1 mean a complete linear relationship, direct or inverse. In the case of a complete relationship, all points with coordinates (x_i, y_i) lie on the straight line y = a + bx.

"Coefficient K.L." Pearson is also used to measure the strength of connection in a linear pairwise regression model.

41. Correlation matrix and correlation graph.

About correlation in general, see question No. 36, p. 56 (64) 063.JPG

Correlation matrix. Correlation analysis often involves studying the relationships not between two, but between many variables measured on a quantitative scale in one sample. In this case, correlations are calculated for each pair of this set of variables. The calculations are usually carried out on a computer, and the result is a correlation matrix.

A correlation matrix (Correlation Matrix) is the result of calculating correlations of one type for each pair of a set of P variables measured on a quantitative scale in one sample.

EXAMPLE

Suppose we are studying the relationships between 5 variables (v1, v2, ..., v5; P = 5) measured on a sample of N = 30 people. The source data table and the resulting correlation matrix are not reproduced here.

The correlation matrix is square: the number of rows and columns equals the number of variables. It is symmetric about the main diagonal, since the correlation of x with y equals the correlation of y with x. Ones lie on its main diagonal, since the correlation of a characteristic with itself equals one. Consequently, not all elements of the correlation matrix are subject to analysis, but only those above or below the main diagonal.

The number of correlation coefficients to be analyzed when studying the relationships of P characteristics is determined by the formula P(P − 1)/2. In the example above, the number of such correlation coefficients is 5(5 − 1)/2 = 10.
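A sketch with pandas (the data are simulated for illustration): DataFrame.corr() returns exactly such a P × P matrix:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)                       # simulated data
df = pd.DataFrame(rng.normal(size=(30, 5)), columns=["v1", "v2", "v3", "v4", "v5"])

R = df.corr()            # P x P Pearson correlation matrix
print(R.round(2))        # symmetric, with ones on the main diagonal

P = len(df.columns)
print(P * (P - 1) // 2)  # 10 distinct coefficients above the diagonal
```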

The main task of analyzing a correlation matrix is to identify the structure of the relationships among many characteristics. Visual analysis of correlation pleiades, graphic representations of the structure of the statistically significant connections, is possible if there are not very many such connections (up to 10-15). Another way is to use multivariate methods: multiple regression, factor, or cluster analysis (see the section "Multivariate methods..."). Using factor or cluster analysis, one can identify groupings of variables that are more closely related to each other than to other variables. A combination of these methods is also very effective, for example when there are many characteristics and they are not homogeneous.

Comparing correlations is an additional task of analyzing a correlation matrix, and it has two variants. If it is necessary to compare correlations in one of the rows of the correlation matrix (for one of the variables), the comparison method for dependent samples is used (pp. 148-149). When comparing same-name correlations calculated for different samples, the comparison method for independent samples is used (pp. 147-148).

Methods for comparing correlations on the diagonals of a correlation matrix (to assess the stationarity of a random process) and for comparing several correlation matrices obtained for different samples (for their homogeneity) are labor-intensive and beyond the scope of this book. You can get acquainted with these methods from the book by G. V. Sukhodolsky.

The problem of the statistical significance of correlations. The problem is that the procedure for statistical hypothesis testing assumes a single test carried out on one sample. If the same method is applied repeatedly, even to different variables, the probability of obtaining a result purely by chance increases. In general, if we repeat the same hypothesis testing method k times in relation to different variables or samples, then with the established value of α we should expect to receive confirmation of the hypothesis in about α·k of the cases by chance alone.

Suppose a correlation matrix for 15 variables is analyzed, that is, 15(15 − 1)/2 = 105 correlation coefficients are calculated. To test the hypotheses, the level α = 0.05 is set. Testing the hypothesis 105 times, we will receive confirmation of it about five times (!), regardless of whether the connection actually exists. Knowing this and having, say, 15 "statistically significant" correlation coefficients, can we tell which of them were obtained by chance and which reflect a real relationship?

Strictly speaking, to make a statistical decision it would be necessary to reduce the level α by as many times as the number of hypotheses tested. But this is hardly advisable, since the probability of ignoring a really existing connection (making a type II error) then increases in an unpredictable way.

A correlation matrix alone is not a sufficient basis for statistical conclusions about the individual correlation coefficients it contains!

There is only one truly convincing way to solve this problem: divide the sample randomly into two parts and take into account only those correlations that are statistically significant in both parts of the sample. An alternative may be the use of multivariate methods (factor, cluster or multiple regression analysis) to identify and subsequently interpret groups of statistically significantly related variables.
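A sketch of this split-half check, assuming pandas and SciPy; the data are simulated, and significant_pairs is a hypothetical helper, not a library function:

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

def significant_pairs(df, alpha=0.05):
    """Set of variable pairs whose correlation is significant at the given alpha."""
    cols = df.columns
    return {(a, b)
            for i, a in enumerate(cols) for b in cols[i + 1:]
            if pearsonr(df[a], df[b])[1] < alpha}

rng = np.random.default_rng(1)                        # simulated data
df = pd.DataFrame(rng.normal(size=(60, 5)), columns=list("abcde"))

half = rng.permutation(len(df)) < len(df) // 2        # random split into two halves
stable = significant_pairs(df[half]) & significant_pairs(df[~half])
print(stable)   # only pairs significant in BOTH halves are taken into account
```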

The problem of missing values. If there are missing values in the data, two options are possible for calculating the correlation matrix: (a) listwise deletion of values (Exclude cases listwise); (b) pairwise deletion of values (Exclude cases pairwise). With listwise deletion, the entire row for an object (subject) that has at least one missing value on one of the variables is deleted. This method leads to a "correct" correlation matrix in the sense that all coefficients are calculated from the same set of objects. However, if the missing values are distributed randomly across the variables, this method can lead to there being not a single object left in the data set (each row contains at least one missing value). To avoid this situation, another method called pairwise deletion is used. It considers only the gaps in each selected pair of variables and ignores gaps in the other variables: the correlation for a pair of variables is calculated over those objects that have no gaps in that pair.

In many situations, especially when the number of gaps is relatively small (say, 10%) and the gaps are distributed fairly randomly, this method does not lead to serious errors. However, sometimes this is not the case. For example, a systematic arrangement of omissions may hide a systematic bias (shift) of the estimates, which causes the correlation coefficients built on different subsets (for example, on different subgroups of objects) to differ. Another problem with a correlation matrix calculated with pairwise deletion of gaps arises when this matrix is used in other types of analysis (for example, in multiple regression or factor analysis). These analyses assume a "correct" correlation matrix with a certain level of consistency and agreement among the various coefficients. Using a matrix with "bad" (biased) estimates leads either to the program being unable to analyze such a matrix, or to erroneous results. Therefore, if the pairwise method of excluding missing data is used, it is necessary to check whether there are systematic patterns in the distribution of the gaps.

If pairwise deletion of missing data does not lead to any systematic shift in the means and variances (standard deviations), then these statistics will be similar to those calculated with listwise deletion. If a significant difference is observed, there is reason to assume a shift in the estimates. For example, if the mean (or standard deviation) of the values of variable A that were used in calculating its correlation with variable B is much less than the mean (or standard deviation) of the values of the same variable A that were used in calculating its correlation with variable C, then there is every reason to expect that these two correlations (A-B and A-C) are based on different subsets of data, and there will be a bias in the correlations caused by the non-random placement of the gaps in the variable values.
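A small pandas sketch of the two options: DataFrame.corr() uses pairwise deletion by default, while dropping incomplete rows first gives the listwise variant:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, np.nan, 4.0, 5.0],
                   "B": [2.0, 1.0, 4.0, np.nan, 5.0],
                   "C": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Pairwise deletion: each coefficient uses every row complete for that pair.
print(df.corr())

# Listwise deletion: first drop every row with at least one missing value.
print(df.dropna().corr())
```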

Analysis of correlation pleiades. After solving the problem of the statistical significance of the elements of the correlation matrix, the statistically significant correlations can be represented graphically in the form of a correlation pleiade (galaxy). A correlation pleiade is a figure consisting of vertices and the lines connecting them. The vertices correspond to the characteristics and are usually designated by numbers, the numbers of the variables. The lines correspond to statistically significant connections and graphically express the sign, and sometimes the p-level of significance, of the connection.

A correlation pleiade can reflect all the statistically significant connections of the correlation matrix (in this case it is sometimes called a correlation graph) or only a meaningfully selected part of them (for example, those corresponding to one factor according to the results of factor analysis).

EXAMPLE OF CONSTRUCTING A CORRELATION PLEIADE


Calculation of Spearman's rank correlation coefficient rs

1. Determine which two characteristics or two hierarchies of characteristics will participate in the comparison as variables A and B.

2. Rank the values of variable A, assigning rank 1 to the smallest value, in accordance with the ranking rules (see P.2.3). Enter the ranks in the first column of the table in order of the subjects' numbers or characteristics.

3. Rank the values of variable B in accordance with the same rules. Enter the ranks in the second column of the table in order of the subjects' numbers or characteristics.

4. Calculate the difference d between the ranks of A and B for each row of the table and enter it in the third column.

5. Square each difference: d². Enter these values in the fourth column of the table.

6. Calculate the sum of the squared differences Σd².

7. If there are identical ranks, calculate the corrections:

Ta = Σ(a³ − a)/12,

Tb = Σ(b³ − b)/12,

where a is the size of each group of identical ranks in rank series A, and b is the size of each group of identical ranks in rank series B.

8. Calculate the rank correlation coefficient:

a) in the absence of identical ranks

rs = 1 − 6 · Σd²/(N(N² − 1));

b) in the presence of identical ranks

rs = 1 − 6 · (Σd² + Ta + Tb)/(N(N² − 1)),

where Σd² is the sum of the squared differences between ranks; Ta and Tb are the corrections for identical ranks; N is the number of subjects or characteristics participating in the ranking.

9. Determine from the table (see Appendix 4.3) the critical values of rs for the given N. If rs exceeds the critical value, or is at least equal to it, the correlation differs significantly from 0.

Example 4.1. To determine how the oculomotor reaction depends on alcohol consumption, reaction data were obtained for a test group before and after drinking alcohol. Does the subject's reaction depend on the state of intoxication?

    Experiment results:

    Before: 16, 13, 14, 9, 10, 13, 14, 14, 18, 20, 15, 10, 9, 10, 16, 17, 18. After: 24, 9, 10, 23, 20, 11, 12, 19, 18, 13, 14, 12, 14, 7, 9, 14. Let’s formulate hypotheses:

H0: the correlation between the reaction indicators before and after drinking alcohol does not differ from zero.

H1: the correlation between the reaction indicators before and after drinking alcohol differs significantly from zero.

Table 4.1. Calculation of d² for Spearman's rank correlation coefficient rs when comparing oculomotor reaction indicators before and after the experiment (N = 17); the table yields Σd² = 767.75.

Since there are repeated ranks, we apply the formula with the correction for identical ranks:

Ta = ((2³ − 2) + (3³ − 3) + (2³ − 2) + (3³ − 3) + (2³ − 2) + (2³ − 2))/12 = 6

Tb = ((2³ − 2) + (2³ − 2) + (3³ − 3))/12 = 3

Let us find the empirical value of the Spearman coefficient:

rs = 1 − 6 · (767.75 + 6 + 3)/(17 · (17² − 1)) ≈ 0.05

Using the table (Appendix 4.3), we find the critical values of the correlation coefficient:

0.48 (p ≤ 0.05)

0.62 (p ≤ 0.01)

We get: rs = 0.05 < r_cr(0.05) = 0.48.

Conclusion: hypothesis H1 is rejected and H0 is accepted, i.e., the correlation between the reaction indicators before and after drinking alcohol does not differ from zero.
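A one-line check of this arithmetic in Python, using the values Σd², Ta, Tb and N from the example:

```python
sum_d2, Ta, Tb, n = 767.75, 6, 3, 17
r_s = 1 - 6 * (sum_d2 + Ta + Tb) / (n * (n ** 2 - 1))
print(round(r_s, 2))  # 0.05, below r_cr(0.05) = 0.48, so H0 is retained
```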