12.4 Testing the Significance of the Correlation Coefficient

The correlation coefficient, r , tells us about the strength and direction of the linear relationship between x and y . However, the reliability of the linear model also depends on how many observed data points are in the sample. We need to look at both the value of the correlation coefficient r and the sample size n , together.

We perform a hypothesis test of the "significance of the correlation coefficient" to decide whether the linear relationship in the sample data is strong enough to use to model the relationship in the population.

The sample data are used to compute r , the correlation coefficient for the sample. If we had data for the entire population, we could find the population correlation coefficient. But because we have only sample data, we cannot calculate the population correlation coefficient. The sample correlation coefficient, r , is our estimate of the unknown population correlation coefficient.

  • The symbol for the population correlation coefficient is ρ , the Greek letter "rho."
  • ρ = population correlation coefficient (unknown)
  • r = sample correlation coefficient (known; calculated from sample data)

The hypothesis test lets us decide whether the value of the population correlation coefficient ρ is "close to zero" or "significantly different from zero". We decide this based on the sample correlation coefficient r and the sample size n .

If the test concludes that the correlation coefficient is significantly different from zero, we say that the correlation coefficient is "significant."

  • Conclusion: There is sufficient evidence to conclude that there is a significant linear relationship between x and y because the correlation coefficient is significantly different from zero.
  • What the conclusion means: There is a significant linear relationship between x and y . We can use the regression line to model the linear relationship between x and y in the population.

If the test concludes that the correlation coefficient is not significantly different from zero (it is close to zero), we say that the correlation coefficient is "not significant."

  • Conclusion: "There is insufficient evidence to conclude that there is a significant linear relationship between x and y because the correlation coefficient is not significantly different from zero."
  • What the conclusion means: There is not a significant linear relationship between x and y . Therefore, we CANNOT use the regression line to model a linear relationship between x and y in the population.
  • If r is significant and the scatter plot shows a linear trend, the line can be used to predict the value of y for values of x that are within the domain of observed x values.
  • If r is not significant OR if the scatter plot does not show a linear trend, the line should not be used for prediction.
  • If r is significant and if the scatter plot shows a linear trend, the line may NOT be appropriate or reliable for prediction OUTSIDE the domain of observed x values in the data.

PERFORMING THE HYPOTHESIS TEST

  • Null Hypothesis: H 0 : ρ = 0
  • Alternate Hypothesis: H a : ρ ≠ 0

WHAT THE HYPOTHESES MEAN IN WORDS:

  • Null Hypothesis H 0 : The population correlation coefficient IS NOT significantly different from zero. There IS NOT a significant linear relationship (correlation) between x and y in the population.
  • Alternate Hypothesis H a : The population correlation coefficient IS significantly DIFFERENT FROM zero. There IS A SIGNIFICANT LINEAR RELATIONSHIP (correlation) between x and y in the population.

DRAWING A CONCLUSION: There are two methods of making the decision. The two methods are equivalent and give the same result.

  • Method 1: Using the p -value
  • Method 2: Using a table of critical values

In this chapter of this textbook, we will always use a significance level of 5%, α = 0.05.

Using the p -value method, you could choose any appropriate significance level you want; you are not limited to using α = 0.05. But the table of critical values provided in this textbook assumes that we are using a significance level of 5%, α = 0.05. (If we wanted to use a different significance level than 5% with the critical value method, we would need different tables of critical values that are not provided in this textbook.)

METHOD 1: Using a p -value to make a decision

Using the TI-83, 83+, 84, 84+ calculator.

To calculate the p -value using LinRegTTEST: On the LinRegTTEST input screen, on the line prompt for β or ρ , highlight "≠ 0". The output screen shows the p-value on the line that reads "p =". (Most computer statistical software can calculate the p -value.)

If the p -value is less than the significance level ( α = 0.05):

  • Decision: Reject the null hypothesis.
  • Conclusion: "There is sufficient evidence to conclude that there is a significant linear relationship between x and y because the correlation coefficient is significantly different from zero."

If the p -value is NOT less than the significance level ( α = 0.05):

  • Decision: DO NOT REJECT the null hypothesis.
  • Conclusion: "There is insufficient evidence to conclude that there is a significant linear relationship between x and y because the correlation coefficient is NOT significantly different from zero."
  • You will use technology to calculate the p -value, but the following describes the underlying calculation of the test statistic and the p -value:
  • The p -value is calculated using a t -distribution with n − 2 degrees of freedom.
  • The formula for the test statistic is t = (r√(n − 2)) / √(1 − r²). The value of the test statistic, t , is shown in the computer or calculator output along with the p -value. The test statistic t has the same sign as the correlation coefficient r .
  • To compute t by hand, first find the numerator: take the square root of n − 2 and multiply it by r . Then find the denominator: square r , subtract the result from 1, and take the square root. Finally, divide the numerator by the denominator.
  • The p -value is the combined area in both tails.

An alternative way to calculate the p -value (p) given by LinRegTTest is the command 2*tcdf(abs(t),10^99, n-2) in 2nd DISTR.
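For readers who want to reproduce the calculation off the calculator, here is a minimal sketch in Python, assuming SciPy is installed; it implements the formula above and is checked against the third exam example that follows. It is an illustration, not the calculator's own routine.

```python
import math
from scipy import stats

def corr_test(r, n):
    """Two-tailed test of H0: rho = 0 from a sample correlation r and size n."""
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1 - r**2)  # test statistic, same sign as r
    p = 2 * stats.t.sf(abs(t), df)               # combined area in both tails
    return t, p

# Third exam / final exam example below: r = 0.6631, n = 11
t, p = corr_test(0.6631, 11)
print(round(t, 2), round(p, 3))  # 2.66 0.026 -- matches the LinRegTTest p-value
```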

THIRD-EXAM vs FINAL-EXAM EXAMPLE: p -value method

  • Consider the third exam/final exam example .
  • The line of best fit is: ŷ = -173.51 + 4.83 x with r = 0.6631 and there are n = 11 data points.
  • Can the regression line be used for prediction? Given a third exam score ( x value), can we use the line to predict the final exam score (predicted y value)?
  • H 0 : ρ = 0
  • H a : ρ ≠ 0
  • The p -value is 0.026 (from LinRegTTest on your calculator or from computer software).
  • The p -value, 0.026, is less than the significance level of α = 0.05.
  • Decision: Reject the Null Hypothesis H 0
  • Conclusion: There is sufficient evidence to conclude that there is a significant linear relationship between the third exam score ( x ) and the final exam score ( y ) because the correlation coefficient is significantly different from zero.

Because r is significant and the scatter plot shows a linear trend, the regression line can be used to predict final exam scores.

METHOD 2: Using a table of Critical Values to make a decision

The 95% Critical Values of the Sample Correlation Coefficient Table can be used to give you a good idea of whether the computed value of r is significant or not . Compare r to the appropriate critical value in the table. If r is not between the positive and negative critical values, then the correlation coefficient is significant. If r is significant, then you may want to use the line for prediction.
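The table's critical values can themselves be recovered from the t -distribution, because the boundary is exactly where the test statistic above equals the two-tailed critical t . A sketch in Python, assuming SciPy; the function name is a convenience, not a standard library routine:

```python
from scipy import stats

def critical_r(n, alpha=0.05):
    """Critical value of r for a two-tailed test of rho = 0 at level alpha."""
    df = n - 2
    t_crit = stats.t.ppf(1 - alpha / 2, df)   # two-tailed critical t
    # Invert t = r*sqrt(df)/sqrt(1 - r^2) for r:
    return t_crit / (t_crit**2 + df) ** 0.5

print(round(critical_r(10), 3))  # 0.632, the value used in Example 12.7
print(round(critical_r(11), 3))  # 0.602, as in the third exam example
```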

Example 12.7

Suppose you computed r = 0.801 using n = 10 data points. df = n – 2 = 10 – 2 = 8. The critical values associated with df = 8 are –0.632 and +0.632. If r < negative critical value or r > positive critical value, then r is significant. Since r = 0.801 > 0.632, r is significant and the line may be used for prediction. If you view this example on a number line, it will help you: r is not significant between –0.632 and +0.632, and r = 0.801 lies to the right of +0.632.

Try It 12.7

For a given line of best fit, you computed that r = 0.6501 using n = 12 data points and the critical value is 0.576. Can the line be used for prediction? Why or why not?

Answer: If the scatter plot looks linear, then yes, the line can be used for prediction, because r = 0.6501 > 0.576, the positive critical value.

Example 12.8

Suppose you computed r = –0.624 with 14 data points. df = 14 – 2 = 12. The critical values are –0.532 and 0.532. Since –0.624 < –0.532, r is significant and the line can be used for prediction.

Try It 12.8

For a given line of best fit, you compute that r = 0.5204 using n = 9 data points, and the critical value is 0.666. Can the line be used for prediction? Why or why not?

Answer: No, the line cannot be used for prediction, because r = 0.5204 < 0.666, the positive critical value.

Example 12.9

Suppose you computed r = 0.776 and n = 6. df = 6 – 2 = 4. The critical values are –0.811 and 0.811. Since –0.811 < 0.776 < 0.811, r is not significant, and the line should not be used for prediction.

Try It 12.9

For a given line of best fit, you compute that r = –0.7204 using n = 8 data points, and the critical value is 0.707. Can the line be used for prediction? Why or why not?

Answer: Yes, the line can be used for prediction, because r = –0.7204 < –0.707, the negative critical value.

THIRD-EXAM vs FINAL-EXAM EXAMPLE: critical value method

Consider the third exam/final exam example . The line of best fit is: ŷ = –173.51+4.83 x with r = 0.6631 and there are n = 11 data points. Can the regression line be used for prediction? Given a third-exam score ( x value), can we use the line to predict the final exam score (predicted y value)?

  • Use the "95% Critical Value" table for r with df = n – 2 = 11 – 2 = 9.
  • The critical values are –0.602 and +0.602
  • Since 0.6631 > 0.602, r is significant.
  • Conclusion: There is sufficient evidence to conclude that there is a significant linear relationship between the third exam score ( x ) and the final exam score ( y ) because the correlation coefficient is significantly different from zero.

Example 12.10

Suppose you computed the following correlation coefficients. Using the table at the end of the chapter, determine if r is significant and the line of best fit associated with each r can be used to predict a y value. If it helps, draw a number line.

  • r = –0.567 and the sample size, n , is 19. The df = n – 2 = 17. The critical value is –0.456. –0.567 < –0.456 so r is significant.
  • r = 0.708 and the sample size, n , is nine. The df = n – 2 = 7. The critical value is 0.666. 0.708 > 0.666 so r is significant.
  • r = 0.134 and the sample size, n , is 14. The df = 14 – 2 = 12. The critical value is 0.532. 0.134 is between –0.532 and 0.532 so r is not significant.
  • r = 0 and the sample size, n , is five. No matter what the dfs are, r = 0 is between the two critical values so r is not significant.

Try It 12.10

For a given line of best fit, you compute that r = 0 using n = 100 data points. Can the line be used for prediction? Why or why not?

Answer: No, the line cannot be used for prediction, because r = 0 is not significant no matter what the sample size is.

Assumptions in Testing the Significance of the Correlation Coefficient

Testing the significance of the correlation coefficient requires that certain assumptions about the data are satisfied. The premise of this test is that the data are a sample of observed points taken from a larger population. We have not examined the entire population because it is not possible or feasible to do so. We are examining the sample to draw a conclusion about whether the linear relationship that we see between x and y in the sample data provides strong enough evidence so that we can conclude that there is a linear relationship between x and y in the population.

The regression line equation that we calculate from the sample data gives the best-fit line for our particular sample. We want to use this best-fit line for the sample as an estimate of the best-fit line for the population. Examining the scatterplot and testing the significance of the correlation coefficient helps us determine if it is appropriate to do this.

The assumptions underlying the test of significance are:

  • There is a linear relationship in the population that models the average value of y for varying values of x . In other words, the expected value of y for each particular value of x lies on a straight line in the population. (We do not know the equation for the line for the population; our regression line from the sample is our best estimate of this line in the population.) In the sample, the data points should fall along an approximate straight-line pattern when plotted as ( x , y ) points on a scatter plot.
  • The y values for any particular x value are normally distributed about the line. This implies that there are more y values scattered closer to the line than are scattered farther away. Assumption (1) implies that these normal distributions are centered on the line: the means of these normal distributions of y values lie on the line.
  • The standard deviations of the population y values about the line are equal for each value of x . In other words, each of these normal distributions of y values has the same shape and spread about the line.
  • The residual errors are mutually independent (no pattern).
  • The data are produced from a well-designed, random sample or randomized experiment.




Lesson 15: Tests Concerning Regression and Correlation

Overview

In lessons 35 and 36, we learned how to calculate point and interval estimates of the intercept and slope parameters, \(\alpha\) and \(\beta\), of a simple linear regression model:

\(Y_i=\alpha+\beta(x_i-\bar{x})+\epsilon_i\)

with the random errors \(\epsilon_i\) following a normal distribution with mean 0 and variance \(\sigma^2\). In this lesson, we'll learn how to conduct a hypothesis test for testing the null hypothesis that the slope parameter equals some value, \(\beta_0\), say. Specifically, we'll learn how to test the null hypothesis \(H_0:\beta=\beta_0\) using a \(t\)-statistic.

Now, perhaps it is not a point that has been emphasized yet, but if you take a look at the form of the simple linear regression model, you'll notice that the response \(Y\)'s are denoted using a capital letter, while the predictor \(x\)'s are denoted using a lowercase letter. That's because, in the simple linear regression setting, we view the predictors as fixed values, whereas we view the responses as random variables whose possible values depend on the population \(x\) from which they came. Suppose instead that we had a situation in which we thought of the pair \((X_i, Y_i)\) as being a random sample, \(i=1, 2, \ldots, n\), from a bivariate normal distribution with parameters \(\mu_X\), \(\mu_Y\), \(\sigma^2_X\), \(\sigma^2_Y\) and \(\rho\). Then, we might be interested in testing the null hypothesis \(H_0:\rho=0\), because we know that if the correlation coefficient is 0, then \(X\) and \(Y\) are independent random variables. For this reason, we'll learn, not one, but three (!) possible hypothesis tests for testing the null hypothesis that the correlation coefficient is 0. Then, because we haven't yet derived an interval estimate for the correlation coefficient, we'll also take the time to derive an approximate confidence interval for \(\rho\).
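As a concrete preview, here is a small simulation sketch in Python (NumPy and SciPy assumed; the sample size and \(\rho\) are arbitrary illustrative choices) of the bivariate normal setting just described, using the t-based test of \(H_0:\rho=0\), which is one of the tests this lesson develops:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, rho = 25, 0.5                       # illustrative sample size and correlation
cov = [[1.0, rho], [rho, 1.0]]         # bivariate normal covariance matrix
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

r, p = stats.pearsonr(x, y)            # t-based test of H0: rho = 0
print(round(r, 3), round(p, 4))
```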


Statistics review 7: Correlation and regression


The present review introduces methods of analyzing the relationship between two quantitative variables. The calculation and interpretation of the sample product moment correlation coefficient and the linear regression equation are discussed and illustrated. Common misuses of the techniques are considered. Tests and confidence intervals for the population parameters are described, and failures of the underlying assumptions are highlighted.

Introduction

The most commonly used techniques for investigating the relationship between two quantitative variables are correlation and linear regression. Correlation quantifies the strength of the linear relationship between a pair of variables, whereas regression expresses the relationship in the form of an equation. For example, in patients attending an accident and emergency unit (A&E), we could use correlation and regression to determine whether there is a relationship between age and urea level, and whether the level of urea can be predicted for a given age.

Scatter diagram

When investigating a relationship between two variables, the first step is to show the data values graphically on a scatter diagram. Consider the data given in Table 1. These are the ages (years) and the logarithmically transformed admission serum urea (natural logarithm [ln] urea) for 20 patients attending an A&E. The reason for transforming the urea levels was to obtain a more Normal distribution [ 1 ]. The scatter diagram for ln urea and age (Fig. 1) suggests there is a positive linear relationship between these variables.

Figure 1. Scatter diagram for ln urea and age.

Table 1. Age and ln urea for 20 patients attending an accident and emergency unit.

Correlation

On a scatter diagram, the closer the points lie to a straight line, the stronger the linear relationship between two variables. To quantify the strength of the relationship, we can calculate the correlation coefficient. In algebraic notation, if we have two variables x and y, and the data take the form of n pairs (i.e. [x1, y1], [x2, y2], [x3, y3] ... [xn, yn]), then the correlation coefficient is given by the following equation:

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[Σ(xᵢ − x̄)² Σ(yᵢ − ȳ)²]

where x̄ is the mean of the x values and ȳ is the mean of the y values.

This is the product moment correlation coefficient (or Pearson correlation coefficient). The value of r always lies between -1 and +1. A value of the correlation coefficient close to +1 indicates a strong positive linear relationship (i.e. one variable increases with the other; Fig. 2). A value close to -1 indicates a strong negative linear relationship (i.e. one variable decreases as the other increases; Fig. 3). A value close to 0 indicates no linear relationship (Fig. 4); however, there could be a nonlinear relationship between the variables (Fig. 5).

Figure 2. Correlation coefficient (r) = +0.9. Positive linear relationship.

Figure 3. Correlation coefficient (r) = -0.9. Negative linear relationship.

Figure 4. Correlation coefficient (r) = 0.04. No relationship.

Figure 5. Correlation coefficient (r) = -0.03. Nonlinear relationship.

For the A&E data, the correlation coefficient is 0.62, indicating a moderate positive linear relationship between the two variables.
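In practice r is computed with software. A minimal sketch in Python with NumPy; the (age, ln urea) values below are made-up stand-ins for illustration, since the Table 1 data are not reproduced here:

```python
import numpy as np

# Hypothetical (age, ln urea) pairs standing in for the Table 1 data
age = np.array([25.0, 34, 41, 47, 52, 58, 63, 69, 74, 80])
ln_urea = np.array([1.1, 1.3, 1.3, 1.5, 1.4, 1.6, 1.7, 1.6, 1.9, 2.0])

r = np.corrcoef(age, ln_urea)[0, 1]    # Pearson product moment correlation
print(round(r, 2))
```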

Hypothesis test of correlation

We can use the correlation coefficient to test whether there is a linear relationship between the variables in the population as a whole. The null hypothesis is that the population correlation coefficient equals 0. The value of r can be compared with those given in Table 2, or alternatively exact P values can be obtained from most statistical packages. For the A&E data, r = 0.62 with a sample size of 20 is greater than the value highlighted bold in Table 2 for P = 0.01, indicating a P value of less than 0.01. Therefore, there is sufficient evidence to suggest that the true population correlation coefficient is not 0 and that there is a linear relationship between ln urea and age.

Table 2. 5% and 1% points for the distribution of the correlation coefficient under the null hypothesis that the population correlation is 0 in a two-tailed test.

Generated using the standard formula [ 2 ].

Confidence interval for the population correlation coefficient

Although the hypothesis test indicates whether there is a linear relationship, it gives no indication of the strength of that relationship. This additional information can be obtained from a confidence interval for the population correlation coefficient.

To calculate a confidence interval, r must be transformed to give a Normal distribution making use of Fisher's z transformation [ 2 ]:

z_r = ½ ln[(1 + r)/(1 − r)]

The standard error [ 3 ] of z_r is approximately:

1/√(n − 3)

and hence a 95% confidence interval for the true population value of the transformed correlation coefficient z_r is given by z_r − (1.96 × standard error) to z_r + (1.96 × standard error). Because z_r is Normally distributed, 1.96 standard errors on either side of the statistic give a 95% confidence interval.

For the A&E data the transformed correlation coefficient z_r between ln urea and age is:

z_r = ½ ln[(1 + 0.62)/(1 − 0.62)] = 0.725

The standard error of z_r is:

1/√(20 − 3) = 0.242

The 95% confidence interval for z_r is therefore 0.725 - (1.96 × 0.242) to 0.725 + (1.96 × 0.242), giving 0.251 to 1.199.

We must use the inverse of Fisher's transformation on the lower and upper limits of this confidence interval to obtain the 95% confidence interval for the correlation coefficient. The lower limit is:

(e^(2 × 0.251) − 1)/(e^(2 × 0.251) + 1)

giving 0.25, and the upper limit is:

(e^(2 × 1.199) − 1)/(e^(2 × 1.199) + 1)

giving 0.83. Therefore, we are 95% confident that the population correlation coefficient is between 0.25 and 0.83.

The width of the confidence interval clearly depends on the sample size, and therefore it is possible to calculate the sample size required for a given level of accuracy. For an example, see Bland [ 4 ].
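The whole calculation is short in code. A sketch, assuming Python with the standard math module; the check values come from the A&E worked example above, with small differences due to rounding of the intermediate steps:

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for the population correlation via Fisher's z."""
    zr = 0.5 * math.log((1 + r) / (1 - r))   # Fisher's z transformation
    se = 1 / math.sqrt(n - 3)                # approximate standard error of zr
    lo, hi = zr - z_crit * se, zr + z_crit * se
    return math.tanh(lo), math.tanh(hi)      # inverse of Fisher's transformation

lo, hi = fisher_ci(0.62, 20)
print(round(lo, 2), round(hi, 2))  # about 0.24 and 0.83 (the article rounds to 0.25, 0.83)
```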

Misuse of correlation

There are a number of common situations in which the correlation coefficient can be misinterpreted.

One of the most common errors in interpreting the correlation coefficient is failure to consider that there may be a third variable related to both of the variables being investigated, which is responsible for the apparent correlation. Correlation does not imply causation. To strengthen the case for causality, consideration must be given to other possible underlying variables and to whether the relationship holds in other populations.

A nonlinear relationship may exist between two variables that would be inadequately described, or possibly even undetected, by the correlation coefficient.

A data set may sometimes comprise distinct subgroups, for example males and females. This could result in clusters of points leading to an inflated correlation coefficient (Fig. 6). A single outlier may produce the same sort of effect.

Figure 6. Subgroups in the data resulting in a misleading correlation. All data: r = 0.57; males: r = -0.41; females: r = -0.26.

It is important that the values of one variable are not determined in advance or restricted to a certain range. This may lead to an invalid estimate of the true correlation coefficient because the subjects are not a random sample.

Another situation in which a correlation coefficient is sometimes misinterpreted is when comparing two methods of measurement. A high correlation can be incorrectly taken to mean that there is agreement between the two methods. An analysis that investigates the differences between pairs of observations, such as that formulated by Bland and Altman [ 5 ], is more appropriate.

Regression

In the A&E example we are interested in the effect of age (the predictor or x variable) on ln urea (the response or y variable). We want to estimate the underlying linear relationship so that we can predict ln urea (and hence urea) for a given age. Regression can be used to find the equation of this line. This line is usually referred to as the regression line.

Note that in a scatter diagram the response variable is always plotted on the vertical (y) axis.

Equation of a straight line

The equation of a straight line is given by y = a + bx, where the coefficients a and b are the intercept of the line on the y axis and the gradient, respectively. The equation of the regression line for the A&E data (Fig. 7) is as follows: ln urea = 0.72 + (0.017 × age) (calculated using the method of least squares, which is described below). The gradient of this line is 0.017, which indicates that for an increase of 1 year in age the expected increase in ln urea is 0.017 units (and hence the expected increase in urea is 1.02 mmol/l). The predicted ln urea of a patient aged 60 years, for example, is 0.72 + (0.017 × 60) = 1.74 units. This transforms to a urea level of e^1.74 = 5.70 mmol/l. The y intercept is 0.72, meaning that if the line were projected back to age = 0, then the ln urea value would be 0.72. However, this is not a meaningful value because age = 0 is a long way outside the range of the data and therefore there is no reason to believe that the straight line would still be appropriate.

Figure 7. Regression line for ln urea and age: ln urea = 0.72 + (0.017 × age).

Method of least squares

The regression line is obtained using the method of least squares. Any line y = a + bx that we draw through the points gives a predicted or fitted value of y for each value of x in the data set. For a particular value of x the vertical difference between the observed and fitted value of y is known as the deviation, or residual (Fig. 8). The method of least squares finds the values of a and b that minimise the sum of the squares of all the deviations. This gives the following formulae for calculating a and b:

b = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²,  a = ȳ − b x̄

Figure 8. Regression line obtained by minimizing the sums of squares of all of the deviations.

Usually, these values would be calculated using a statistical package or the statistical functions on a calculator.
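For instance, the least squares formulae translate directly into a short Python function (NumPy assumed); np.polyfit(x, y, 1) returns the same gradient and intercept:

```python
import numpy as np

def least_squares(x, y):
    """Intercept a and gradient b minimising the sum of squared deviations."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    a = y.mean() - b * x.mean()
    return a, b

# Illustrative check: these points lie exactly on y = 6 + 2x, the line used
# in the analysis of variance example further below
print(least_squares([1, 2, 3, 4], [8.0, 10.0, 12.0, 14.0]))  # (6.0, 2.0)
```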

Hypothesis tests and confidence intervals

We can test the null hypotheses that the population intercept and gradient are each equal to 0 using test statistics given by the estimate of the coefficient divided by its standard error.

The test statistics are compared with the t distribution on n - 2 (sample size - number of regression coefficients) degrees of freedom [ 4 ].

The 95% confidence interval for each of the population coefficients is calculated as follows: coefficient ± (t(n-2) × the standard error), where t(n-2) is the 5% point for a t distribution with n - 2 degrees of freedom.

For the A&E data, the output (Table 3) was obtained from a statistical package. The P value for the coefficient of age (0.004) gives strong evidence against the null hypothesis, indicating that the population coefficient is not 0 and that there is a linear relationship between ln urea and age. The coefficient of age is the gradient of the regression line and its hypothesis test is equivalent to the test of the population correlation coefficient discussed above. The P value for the constant of 0.054 provides insufficient evidence to indicate that the population coefficient is different from 0. Although the intercept is not significant, it is still appropriate to keep it in the equation. There are some situations in which a straight line passing through the origin is known to be appropriate for the data, and in this case a special regression analysis can be carried out that omits the constant [ 6 ].

Table 3. Regression parameter estimates, P values and confidence intervals for the accident and emergency unit data.
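Output like this can be reproduced with scipy.stats.linregress, which returns the estimates, the standard error of the gradient, and the P value for the null hypothesis that the population gradient is 0. A sketch, reusing hypothetical (age, ln urea) arrays like those above:

```python
import numpy as np
from scipy import stats

age = np.array([25.0, 34, 41, 47, 52, 58, 63, 69, 74, 80])      # hypothetical
ln_urea = np.array([1.1, 1.3, 1.3, 1.5, 1.4, 1.6, 1.7, 1.6, 1.9, 2.0])

res = stats.linregress(age, ln_urea)
t_crit = stats.t.ppf(0.975, len(age) - 2)                       # 5% point, n - 2 df
ci = (res.slope - t_crit * res.stderr, res.slope + t_crit * res.stderr)
print(res.slope, res.intercept, res.pvalue)                     # estimates and P value
print(ci)                                                       # 95% CI for the gradient
```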

Analysis of variance

As stated above, the method of least squares minimizes the sum of squares of the deviations of the points about the regression line. Consider the small data set illustrated in Fig. 9. This figure shows that, for a particular value of x, the distance of y from the mean of y (the total deviation) is the sum of the distance of the fitted y value from the mean (the deviation explained by the regression) and the distance from y to the line (the deviation not explained by the regression).

Figure 9. Total, explained and unexplained deviations for a point.

The regression line for these data is given by y = 6 + 2x. The observed, fitted values and deviations are given in Table 4. The sum of squared deviations can be compared with the total variation in y, which is measured by the sum of squares of the deviations of y from the mean of y. Table 4 illustrates the relationship between the sums of squares. Total sum of squares = sum of squares explained by the regression line + sum of squares not explained by the regression line. The explained sum of squares is referred to as the 'regression sum of squares' and the unexplained sum of squares is referred to as the 'residual sum of squares'.

Table 4. Small data set with the fitted values from the regression, the deviations and their sums of squares.

This partitioning of the total sum of squares can be presented in an analysis of variance table (Table 5). The total degrees of freedom = n - 1, the regression degrees of freedom = 1, and the residual degrees of freedom = n - 2 (total - regression degrees of freedom). The mean squares are the sums of squares divided by their degrees of freedom.

Table 5. Analysis of variance for a small data set.

If there were no linear relationship between the variables then the regression mean squares would be approximately the same as the residual mean squares. We can test the null hypothesis that there is no linear relationship using an F test. The test statistic is calculated as the regression mean square divided by the residual mean square, and a P value may be obtained by comparison of the test statistic with the F distribution with 1 and n - 2 degrees of freedom [ 2 ]. Usually, this analysis is carried out using a statistical package that will produce an exact P value. In fact, the F test from the analysis of variance is equivalent to the t test of the gradient for regression with only one predictor. This is not the case with more than one predictor, but this will be the subject of a future review. As discussed above, the test for gradient is also equivalent to that for the correlation, giving three tests with identical P values. Therefore, when there is only one predictor variable it does not matter which of these tests is used.

The analysis of variance for the A&E data (Table 6) gives a P value of 0.006 (the same P value as obtained previously), again indicating a linear relationship between ln urea and age.

Table 6. Analysis of variance for the accident and emergency unit data.
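The partitioning and the F test are straightforward to reproduce by hand. A sketch in Python (NumPy and SciPy assumed), which also lets you confirm that F equals t² when there is a single predictor:

```python
import numpy as np
from scipy import stats

def regression_anova(x, y):
    """One-predictor ANOVA: partition sums of squares and form the F test."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    fitted = y.mean() + b * (x - x.mean())      # least-squares fitted values
    ss_reg = ((fitted - y.mean()) ** 2).sum()   # explained (regression) SS
    ss_res = ((y - fitted) ** 2).sum()          # unexplained (residual) SS
    F = ss_reg / (ss_res / (n - 2))             # regression MS / residual MS
    return F, stats.f.sf(F, 1, n - 2)           # P value from F(1, n - 2)

# Illustrative data scattered around the y = 6 + 2x line from Table 4
F, p = regression_anova([1, 2, 3, 4], [8.1, 9.8, 12.2, 13.9])
print(F, p)
```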

Coefficient of determination

Another useful quantity that can be obtained from the analysis of variance is the coefficient of determination (R²):

R² = regression sum of squares / total sum of squares

It is the proportion of the total variation in y accounted for by the regression model. Values of R² close to 1 imply that most of the variability in y is explained by the regression model. R² is the same as r² in regression when there is only one predictor variable.

For the A&E data, R² = 1.462/3.804 = 0.38 (i.e. the same as 0.62²), and therefore age accounts for 38% of the total variation in ln urea. This means that 62% of the variation in ln urea is not accounted for by age differences. This may be due to inherent variability in ln urea or to other unknown factors that affect the level of ln urea.

The fitted value of y for a given value of x is an estimate of the population mean of y for that particular value of x. As such it can be used to provide a confidence interval for the population mean [ 3 ]. The fitted values change as x changes, and therefore the confidence intervals will also change.

The 95% confidence interval for the fitted value of y for a particular value of x, say x_p, is again calculated as fitted y ± (t(n-2) × the standard error). The standard error is given by:

s √[1/n + (x_p − x̄)² / Σ(xᵢ − x̄)²]

where s is the standard deviation of the residuals.

Fig. 10 shows the range of confidence intervals for the A&E data. For example, the 95% confidence interval for the population mean ln urea for a patient aged 60 years is 1.56 to 1.92 units. This transforms to urea values of 4.76 to 6.82 mmol/l.

Figure 10. Regression line, its 95% confidence interval and the 95% prediction interval for individual patients.

The fitted value for y also provides a predicted value for an individual, and a prediction interval or reference range [ 3 ] can be obtained (Fig. 10). The prediction interval is calculated in the same way as the confidence interval but the standard error is given by:

s √[1 + 1/n + (x_p − x̄)² / Σ(xᵢ − x̄)²]

For example, the 95% prediction interval for the ln urea for a patient aged 60 years is 0.97 to 2.52 units. This transforms to urea values of 2.64 to 12.43 mmol/l.

Both confidence intervals and prediction intervals become wider for values of the predictor variable further from the mean.
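Both intervals follow the same pattern, differing only in the extra 1 under the square root. A sketch of the calculation in Python (NumPy and SciPy assumed, with s the residual standard deviation as above):

```python
import numpy as np
from scipy import stats

def interval_at(x, y, x_p, kind="confidence", level=0.95):
    """CI for the mean of y at x_p, or a prediction interval for a new y."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    a = y.mean() - b * x.mean()
    s = np.sqrt(((y - a - b * x) ** 2).sum() / (n - 2))   # residual std deviation
    extra = 1.0 if kind == "prediction" else 0.0          # widens the interval
    se = s * np.sqrt(extra + 1 / n
                     + (x_p - x.mean()) ** 2 / ((x - x.mean()) ** 2).sum())
    t = stats.t.ppf(0.5 + level / 2, n - 2)
    fit = a + b * x_p
    return fit - t * se, fit + t * se
```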

Assumptions and limitations

The use of correlation and regression depends on some underlying assumptions. The observations are assumed to be independent. For correlation both variables should be random variables, but for regression only the response variable y must be random. In carrying out hypothesis tests or calculating confidence intervals for the regression parameters, the response variable should have a Normal distribution and the variability of y should be the same for each value of the predictor variable. The same assumptions are needed in testing the null hypothesis that the correlation is 0, but in order to interpret confidence intervals for the correlation coefficient both variables must be Normally distributed. Both correlation and regression assume that the relationship between the two variables is linear.

A scatter diagram of the data provides an initial check of the assumptions for regression. The assumptions can be assessed in more detail by looking at plots of the residuals [ 4 , 7 ]. Commonly, the residuals are plotted against the fitted values. If the relationship is linear and the variability constant, then the residuals should be evenly scattered around 0 along the range of fitted values (Fig. 11).

Figure 11. (a) Scatter diagram of y against x suggests that the relationship is nonlinear. (b) Plot of residuals against fitted values for panel a; the curvature of the relationship is shown more clearly. (c) Scatter diagram of y against x suggests that the variability in y increases with x. (d) Plot of residuals against fitted values for panel c; the increasing variability in y with x is shown more clearly.

In addition, a Normal plot of residuals can be produced. This is a plot of the residuals against the values they would be expected to take if they came from a standard Normal distribution (Normal scores). If the residuals are Normally distributed, then this plot will show a straight line. (A standard Normal distribution is a Normal distribution with mean = 0 and standard deviation = 1.) Normal plots are usually available in statistical packages.
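Both diagnostic plots are easy to produce. A sketch in Python (Matplotlib and SciPy assumed; the data are simulated for illustration, not the A&E values):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
x = np.linspace(20, 80, 30)
y = 0.7 + 0.017 * x + rng.normal(0, 0.15, x.size)   # simulated, urea-like data

b, a = np.polyfit(x, y, 1)                          # gradient, intercept
residuals = y - (a + b * x)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.scatter(a + b * x, residuals)                   # should be an even band
ax1.axhline(0, linestyle="--", color="gray")
ax1.set(xlabel="Fitted values", ylabel="Residuals")
stats.probplot(residuals, plot=ax2)                 # near-straight line if Normal
fig.tight_layout()
plt.show()
```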

Figs 12 and 13 show the residual plots for the A&E data. The plot of fitted values against residuals suggests that the assumptions of linearity and constant variance are satisfied. The Normal plot suggests that the distribution of the residuals is Normal.

Figure 12. Plot of residuals against fitted values for the accident and emergency unit data.

Figure 13. Normal plot of residuals for the accident and emergency unit data.

When using a regression equation for prediction, errors in prediction may not be just random but also be due to inadequacies in the model. In particular, extrapolating beyond the range of the data is very risky.

A phenomenon to be aware of that may arise with repeated measurements on individuals is regression to the mean. For example, if repeat measures of blood pressure are taken, then patients with higher than average values on their first reading will tend to have lower readings on their second measurement. Therefore, the difference between their second and first measurements will tend to be negative. The converse is true for patients with lower than average readings on their first measurement, resulting in an apparent rise in blood pressure. This could lead to misleading interpretations, for example that there may be an apparent negative correlation between change in blood pressure and initial blood pressure.

Both correlation and simple linear regression can be used to examine the presence of a linear relationship between two variables providing certain assumptions about the data are satisfied. The results of the analysis, however, need to be interpreted with care, particularly when looking for a causal relationship or when using the regression equation for prediction. Multiple and logistic regression will be the subject of future reviews.

Competing interests

None declared.

Abbreviations

A&E = accident and emergency unit; ln = natural logarithm (logarithm base e).

References

1. Whitley E, Ball J. Statistics review 1: Presenting and summarising data. Crit Care. 2002;6:66-71.
2. Kirkwood BR, Sterne JAC. Essential Medical Statistics. 2nd ed. Oxford: Blackwell Science; 2003.
3. Whitley E, Ball J. Statistics review 2: Samples and populations. Crit Care. 2002;6:143-148.
4. Bland M. An Introduction to Medical Statistics. 3rd ed. Oxford: Oxford University Press; 2001.
5. Bland M, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;i:307-310.
6. Zar JH. Biostatistical Analysis. 4th ed. New Jersey: Prentice Hall; 1999.
7. Altman DG. Practical Statistics for Medical Research. London: Chapman & Hall; 1991.


12 Simple Linear Regression and Correlation

Barbara Illowsky; Susan Dean; and Margo Bergman

Linear Regression and Correlation

Student Learning Outcomes

By the end of this chapter, the student should be able to:

  • Discuss basic ideas of linear regression and correlation.
  • Create and interpret a line of best fit.
  • Calculate and interpret the correlation coefficient.
  • Calculate and interpret outliers.

Introduction

Professionals often want to know how two or more numeric variables are related. For example, is there a relationship between the grade on the second math exam a student takes and the grade on the final exam?   If there is a relationship, what is it and how strong is the relationship?

In another example, your income may be determined by your education, your profession, your years of experience, and your ability. The amount you pay a repair person for labor is often determined by an initial amount plus an hourly fee. These are all examples in which regression can be used.

The type of data described in the examples is bivariate data – “bi” for two variables. In reality, statisticians use multivariate data, meaning many variables.

In this chapter, you will be studying the simplest form of regression, “linear regression” with one independent variable ( x ). This involves data that fits a line in two dimensions. You will also study correlation which measures how strong the relationship is.

Linear Equations

Linear regression for two variables is based on a linear equation with one independent variable. It has the form:

y = b + mx

where m and b are constant numbers.

x is the independent variable, and y is the dependent variable. Typically, you choose a value to substitute for the independent variable and then solve for the dependent variable.

The following is an example of a linear equation:

y = 3 + 2x

Linear equations of this form occur in applications of life sciences, social sciences, psychology, business, economics, physical sciences, mathematics, and other areas.

Aaron’s Word Processing Service (AWPS) does word processing. Its rate is $32 per hour plus a $31.50 one-time charge. The total cost to a customer depends on the number of hours it takes to do the word processing job.

Find the equation that expresses the total cost in terms of the number of hours required to finish the word processing job.

Let x = the number of hours it takes to get the job done.

Let y = the total cost to the customer.

The $31.50 is a fixed cost. If it takes x hours to complete the job, then 32x is the cost of the word processing only. The total cost is:

y = 31.50 + 32x
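This kind of cost equation translates directly into code. Below is a minimal Python sketch (the function name total_cost is my own, not from the text):

```python
def total_cost(hours):
    """Total cost for a word processing job:
    a $31.50 one-time charge plus $32 per hour."""
    return 31.50 + 32 * hours

print(total_cost(4))  # 159.5 -- a 4-hour job costs $159.50
```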

Slope and Y-Intercept of a Linear Equation

From algebra recall that the slope is a number that describes the steepness of a line and the y-intercept is the y coordinate of the point (0, b ) where the line crosses the y-axis.

[Figure: Three possible graphs of y = b + mx. (a) If m > 0, the line slopes upward to the right. (b) If m = 0, the line is horizontal. (c) If m < 0, the line slopes downward to the right.]

Svetlana tutors to make extra money for college. For each tutoring session, she charges a one-time fee plus an hourly rate, so the amount she earns per session is given by:

y = 25 + 15x

What are the independent and dependent variables? What is the y-intercept and  what  is  the slope? Interpret them using complete sentences.

The independent variable (x) is the number of hours Svetlana tutors each session. The dependent variable (y) is the amount, in dollars, Svetlana earns for each session.

The y-intercept is 25 (b = 25). At the start of the tutoring session, Svetlana charges a one-time fee of $25 (this is when x = 0). The slope is 15 (m = 15). For each hour she tutors, Svetlana earns an additional $15.

Scatter Plots

Before we take up the discussion of linear regression and correlation, we need to examine a way to display the relation between two variables x and y . The most common and easiest way is a scatter plot .  The following example illustrates a scatter plot.

From an article in the Wall Street Journal: In Europe and Asia, m-commerce is popular. M-commerce users have special mobile phones that work like electronic wallets as well as provide phone and Internet services. Users can do everything from paying for parking to buying a TV set or soda from a machine to banking to checking sports scores on the Internet. For the years 2000 through 2004, was there a relationship between the year and the number of m-commerce users? Construct a scatter plot. Let x = the year and let y = the number of m-commerce users, in millions (Table 1).


  Table 1 shows the number of m-commerce users (in millions) by year. Figure 3 is a scatter plot showing the number of m-commerce users (in millions) by year.

A scatter plot shows the direction and strength of a relationship between the variables. A clear direction happens when there is either:

  • High values of one variable occurring with high values of the other variable or low values of one variable occurring with low values of the other variable.
  • High values of one variable occurring with low values of the other variable.

You can determine the strength of the relationship by looking at the scatter plot and seeing how close the points are to a line, a power function, an exponential function, or to some other type of function.

When you look at a scatterplot, you want to notice the overall pattern and any deviations from the pattern. The following scatterplot examples illustrate these concepts:


In this chapter, we are interested in scatter plots that show a linear pattern. Linear patterns are quite common. The linear relationship is strong if the points are close to a straight line. If we think that the points show a linear relationship, we would like to draw a line on the scatter plot. This line can be calculated through a process called linear regression . However, we only calculate a regression line if one of the variables helps to explain or predict the other variable. If x is the independent variable and y the dependent variable, then we can use a regression line to predict y for a given value of x .
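If you want to experiment with scatter plots yourself, the sketch below draws one with Python's matplotlib; the data values are made up purely for illustration:

```python
import matplotlib.pyplot as plt

# Made-up (x, y) pairs that show a roughly linear, positive pattern.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.3]

plt.scatter(x, y)                       # each point is one observation
plt.xlabel("x (independent variable)")
plt.ylabel("y (dependent variable)")
plt.title("Scatter plot showing a linear pattern")
plt.show()
```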

The Regression Equation

Data rarely fit a straight line exactly. Usually, you must be satisfied with rough predictions. Typically, you have a set of data whose scatter plot appears to “fit” a straight line. This is called a Line of Best Fit or Least Squares Line .

A random sample of 11 statistics students produced the following data (Table 2) where x is the third exam score, out of 80, and y is the final exam score, out of 200. Can you predict the final exam score of a random student if you know the third exam score?


Table 2 shows the scores on the final exam based on scores from the third exam. Figure 7 shows the scatter plot of the scores on the final exam based on scores from the third exam.

For each observed data point, there is a corresponding point on the line, (x, ŷ), where ŷ is the value of y predicted by the line. The difference between the observed value y and the predicted value ŷ is called the "error" or residual, ε = y − ŷ.

If the observed data point lies above the line, the residual is positive, and the line underestimates the actual data value for y. If the observed data point lies below the line, the residual is negative, and the line overestimates the actual data value for y.

If we square each residual ε and add up the squares, we obtain Σε² = Σ(y − ŷ)².

This is called the Sum of Squared Errors (SSE) .

Using calculus, you can determine the values of b and m that make the SSE a minimum. When you make the SSE a minimum, you have determined the points that are on the line of best fit. It turns out that the line of best fit has the equation:

\hat{y} = b + mx

where m = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^{2}} and b = \bar{y} - m\cdot\bar{x}
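These formulas translate directly into code. The following Python sketch computes m and b from made-up sample data (all variable names and values are mine):

```python
import numpy as np

# Made-up sample data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

x_bar, y_bar = x.mean(), y.mean()

# Slope and intercept from the least squares formulas above.
m = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b = y_bar - m * x_bar

print(f"y-hat = {b:.3f} + {m:.3f}x")
```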

Least Squares Criteria for Best Fit

The process of fitting the best fit line is called linear regression. The idea behind finding the best fit line is based on the assumption that the data are scattered about a straight line. The criterion for the best fit line is that the sum of the squared errors (SSE) is minimized, that is, made as small as possible. Any other line you might choose would have a higher SSE than the best fit line. This best fit line is called the least squares regression line.

THIRD EXAM vs FINAL EXAM EXAMPLE:

The graph of the line of best fit for the third exam/final exam example is shown below:

[Figure: Scatter plot of the third exam and final exam data with the least squares regression line drawn through the points.]

The least squares regression line (best fit line) for the third exam/final exam example has the equation:

ŷ = -173.51 + 4.83x

Remember, it is always important to plot a scatter diagram first. If the scatter plot indicates that there is a linear relationship between the variables,  then it is reasonable to use a best fit line  to make predictions for y given x within the domain of x -values in the sample data, but not necessarily for x -values outside that domain.

You  could use the line to predict the final exam score for a student who earned a grade of 73 on the third exam.

You should NOT use the line to predict the final exam score for a student who earned a grade of 50 on the third exam, because 50 is not within the domain of the x-values in the sample data, which are between 65 and 75.
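One way to enforce this rule in practice is to refuse to predict outside the observed domain. Below is a hedged Python sketch using the fitted line from this example (the function and its domain guard are my own construction):

```python
def predict_final(third_exam, x_min=65, x_max=75):
    """Predict a final exam score from the best fit line,
    but only within the observed domain of x values."""
    if not (x_min <= third_exam <= x_max):
        raise ValueError("x is outside the observed 65-75 domain; do not extrapolate")
    return -173.51 + 4.83 * third_exam

print(predict_final(73))   # 179.08 (up to floating point rounding)
# predict_final(50) raises an error: 50 is outside the observed domain.
```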

UNDERSTANDING SLOPE

The slope of the line,  m,  describes how changes in the variables are related.   It is important to interpret the slope of the line in the context of the situation represented by the data. You should be able to write a sentence interpreting the slope in plain English.

INTERPRETATION OF THE SLOPE: The slope of the best fit line tells us how the dependent variable (y) changes for every one unit increase in the independent (x) variable, on average.

THIRD EXAM vs FINAL EXAM EXAMPLE

Slope: The slope of the line is m = 4.83.

Interpretation: For a one point increase in the score on the third exam, the final exam score increases by 4.83 points, on average.

Correlation Coefficient and Coefficient of Determination

The correlation coefficient r.

Besides looking at the scatter plot and seeing that a line seems reasonable, how can you tell if the line is a good predictor? Use the correlation coefficient as another indicator (besides the scatterplot) of the strength  of the relationship between x and y .

The correlation coefficient, r, developed by Karl Pearson in the early 1900s, is a numerical measure of the strength of association between the independent variable x and the dependent variable y.

The correlation coefficient is calculated as:

r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]\left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}}

where n = the number of data points.
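Assuming the computational formula above, here is a short Python sketch on made-up data:

```python
import numpy as np

# Made-up data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
n = len(x)

numerator = n * np.sum(x * y) - np.sum(x) * np.sum(y)
denominator = np.sqrt((n * np.sum(x**2) - np.sum(x)**2) *
                      (n * np.sum(y**2) - np.sum(y)**2))
r = numerator / denominator
print(round(r, 4))  # close to +1: a strong positive linear relationship
```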

If you suspect a linear relationship between x and  y , then r can measure how strong the linear relationship  is.

What the VALUE of r tells us :

-1 \leq r \leq 1

  • The size of the correlation r indicates the strength of the linear relationship between x and y. Values of r close to -1 or to +1 indicate a stronger linear relationship between x and y.
  • If r = 0 there is absolutely no linear relationship between x and y (no linear correlation).
  • If r = 1, there is perfect positive correlation. If r = -1, there is perfect negative correlation. In both these cases, all of the original data points lie on a straight line. Of course, in the real world, this will not generally happen.

What the SIGN of r tells us:

  • A positive value of r means that when x increases, y tends to increase and when  x decreases,  y tends  to decrease (positive correlation) .
  • A negative value of r means that when x increases, y tends to decrease and when x decreases, y tends  to increase (negative correlation) .
  • The sign of r is the same as the sign of the slope, m , of the best fit line.

We can see this in Figure 10.

[Figure 10: Scatter plots with best fit lines of positive and negative slope, showing that the sign of r matches the sign of the slope.]

NOTE: Strong correlation does not suggest that x causes y or y causes x . We say “correlation does not imply causation.” For example, every person who learned math in the 17th century is dead. However, learning math does not necessarily cause death!

The Coefficient of Determination

The coefficient of determination is r², the square of the correlation coefficient. Expressed as a percent, r² represents the proportion of the variation in the dependent variable y that can be explained by the variation in the independent variable x, using the regression line.

Consider the third exam/final exam example introduced in the previous section:

The line of best fit is ŷ = -173.51 + 4.83x, and the correlation coefficient is r = 0.6631, so r² = 0.6631² ≈ 0.4397.

Interpretation of r² in the context of this example:

Approximately 44% of the variation (0.4397 is approximately 0.44) in the final exam grades can be explained by the variation in the grades on the third exam, using the best fit regression line.

Therefore approximately 56% of the variation (1 – 0.44 = 0.56) in the final exam grades can NOT be explained by the variation in the grades on the third exam, using the best fit regression line. (This is seen as the scattering of the points about the line.)
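The arithmetic behind these two percentages is just the square of r; a one-line check in Python:

```python
r = 0.6631
print(round(r**2, 4), round(1 - r**2, 4))  # 0.4397 0.5603
```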

We examined the scatterplot and showed that the correlation coefficient is significant. We found the equation of the best fit line for the final exam grade as a function of the grade on the third exam.  We  can now  use the least squares regression line for prediction.

Suppose you want to estimate, or predict, the final exam score of statistics students who received 73 on the third exam. The exam scores (x-values) range from 65 to 75. Since 73 is between the x-values 65 and 75, substitute x = 73 into the equation:

ŷ = -173.51 + 4.83(73) = 179.08

We predict that statistics students who earn a grade of 73 on the third exam will earn a grade of 179.08 on the final exam, on average.

What would you predict the final exam score to be for a student who scored a 66 on the third exam?

In some data sets, there are values (observed data points) called outliers . Outliers are observed data points that are far from the least squares line. They have large “errors”, where the “error” or residual is the vertical distance from the line to the point.

Outliers need to be examined closely. Sometimes, for some reason or another, they should not be included in the analysis of the data. It is possible that an outlier is a result of erroneous data. Other times, an outlier may hold valuable information about the population under study and should remain included in the data. The key is to carefully examine what causes a data point to be an outlier.

Besides outliers, a sample may contain one or a few points that are called influential points . Influential points are observed data points that are far from the other observed data points in the horizontal direction. These points may have a big effect on the slope of the regression line. To begin to identify an influential point, you can remove it from the data set and see if the slope of the regression line is changed significantly.

Computers and many calculators can be used to identify outliers from the data. Computer output for regression analysis will often identify both outliers and influential points so that you can examine them.

Identifying Outliers

We could guess at outliers by looking at a graph of the scatterplot and best fit line. However, we would like some guideline as to how far away a point needs to be in order to be considered an outlier. As a rough rule of thumb, we can flag any point that is located further than two standard deviations above or below the best fit line as an outlier. The standard deviation used is the standard deviation of the residuals or errors.

We can do this visually in the scatterplot by drawing an extra pair of lines that are two standard deviations above and below the best fit line. Any data points that are outside this extra pair of lines are flagged as potential outliers. Or we can do this numerically by calculating each residual and comparing it to twice the standard deviation.  The graphical procedure is shown first, followed by the numerical calculations. You would generally only need to use one of these methods.

You can determine if there is an outlier or not.  If there is an outlier, as an exercise, delete it and fit the remaining data to a new line. For this example, the  new line ought to fit the remaining data better. This means the SSE should be smaller and the correlation coefficient ought to be closer to 1 or -1.

Graphical Identification of Outliers

With the TI-83, 83+, and 84+ graphing calculators, it is easy to identify the outlier graphically and visually. If we were to measure the vertical distance from any data point to the corresponding point on the line of best fit and that distance was equal to 2s or farther, then we would consider the data point to be “too far” from the line of best fit. We need to find and graph the lines that are two standard deviations below and above the regression line. Any points that are outside these two lines are outliers. We will call these lines Y2 and Y3:

As we did with the equation of the regression line and the correlation coefficient, we will use technology to calculate this standard deviation for us. Using the LinRegTTest with this data, scroll down through the output screens to find s = 16.412.

Line Y2 = -173.5 + 4.83x − 2(16.4) and line Y3 = -173.5 + 4.83x + 2(16.4)

Graph the scatterplot with the best fit line in equation Y1, then enter the two extra lines as Y2 and Y3 in the “Y=” equation editor and press ZOOM 9. You will find that the only data point that is not between lines Y2 and Y3 is the point x = 65, y = 175. On the calculator screen it is just barely outside these lines. The outlier is the student who had a grade of 65 on the third exam and 175 on the final exam; this point is further than two standard deviations away from the best fit line.

Sometimes a point is so close to the lines used to flag outliers on the graph that it is difficult to tell if the point is between or outside the lines. On a computer, enlarging the graph may help; on a small calculator screen, zooming in may make the graph clearer. Note that when the graph does not give a clear enough picture, you can use the numerical comparisons to identify outliers. This method is shown in Figure 11.

[Figure 11: Scatter plot of the exam data with the best fit line Y1 and the lines Y2 and Y3 drawn two standard deviations below and above it; the point (65, 175) lies outside the band.]

Numerical Identification of Outliers

The predicted values in this example come from the line of best fit, ŷ = -173.5 + 4.83x.

Rather than calculate the value of s ourselves, we can find s using the computer or calculator. For this example, the calculator function LinRegTTest found s = 16.4 as the standard deviation of the residuals 35; -17; 16; -6; -19; 9; 3; -1; -10; -9; -1.

We are looking for all data points for which the residual is greater than 2s = 2(16.4) = 32.8 or less than -32.8. Compare these values to the residuals in column 4 of the table. The only such data point is the student who had a grade of 65 on the third exam and 175 on the final exam; the residual for this student is 35.
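Because the residuals and s are listed above, the numerical rule is easy to automate. A small Python sketch:

```python
# Residuals and residual standard deviation from the example above.
residuals = [35, -17, 16, -6, -19, 9, 3, -1, -10, -9, -1]
s = 16.4

cutoff = 2 * s  # 32.8
outliers = [e for e in residuals if abs(e) > cutoff]
print(outliers)  # [35] -- the student who scored (65, 175)
```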

How does the outlier affect the best fit line?

Numerically and graphically, we have identified the point (65, 175) as an outlier. We should re-examine the data for this point to see if there are any problems with the data. If there is an error, we should fix the error if possible, or delete the data. If the data is correct, we would leave it in the data set. For this problem, we will suppose that we examined the data and found that this outlier was an error. Therefore we will continue on and delete the outlier, so that we can explore how it affects the results, as a learning experience.

Basic Regression Problem

A basic relationship from Macroeconomic Principles is the consumption function. This theoretical relationship states that as a person’s income rises, their consumption rises, but by a smaller amount than the rise in income. If Y is consumption and X is income, the regression problem is, first, to establish that this relationship exists, and second, to determine the impact of a change in income on a person’s consumption.

[Figure 12: Scatter plot of consumption (Y) against income (X) for different individuals.]

Each “dot” in Figure 12 represents the consumption and income of different individuals at some point in time. This was called cross-section data earlier: observations on variables at one point in time across different people or other units of measurement. This analysis is often done with time series data, which would be the consumption and income of one individual or country at different points in time. For macroeconomic problems it is common to use time series aggregated data for a whole country. For this particular theoretical concept these data are readily available in the annual report of the President’s Council of Economic Advisors.

The regression problem comes down to determining which straight line would best represent the data in Figure 12. Regression analysis is sometimes called “least squares” analysis because the method of determining which line best “fits” the data is to minimize the sum of the squared residuals of a line put through the data.

In general notation, the hypothesized relationship is

Y = β₀ + β₁X + ε

where ε is an error term capturing all effects on Y other than X.

Figure 13 shows us this “best fit” line.


This figure shows the assumed relationship between consumption and income from macroeconomic theory. Here the data are plotted as a scatter plot and an estimated straight line has been drawn. From this graph we can see an error term, e1. Each data point also has an error term. Again, the error term is put into the equation to capture effects on consumption that are not caused by income changes. Such other effects might be a person’s savings or wealth, or periods of unemployment. We will see how by minimizing the sum of these errors we can get an estimate for the slope and intercept of this line.

Figure 14 shows the more general case of the notation rather than the specific case of the Macroeconomic consumption function in our example.


The sum of the squared errors is called, naturally enough, the Sum of Squared Errors (SSE).

The equations used to compute the linear regression line are called the normal equations, and the justification for relying on them comes from a very important mathematical result called the Gauss-Markov Theorem, without which we could not do regression analysis. The Gauss-Markov Theorem tells us that the estimates we get from the ordinary least squares (OLS) regression method have some very important properties: a least squares line is BLUE, the Best Linear Unbiased Estimator. "Best" means that, among linear unbiased estimators, it has the minimum variance. "Linear" refers to the form of the estimator being used. An "unbiased" estimator is one whose expected value equals the population parameter being estimated.

Both Gauss and Markov were giants in the field of mathematics, and Gauss in physics too. Gauss worked in the late 18th and early 19th centuries, Markov in the late 19th and early 20th; their lives did not overlap chronologically, and never in geography, but Markov's work on this theorem was based extensively on the earlier work of Carl Gauss. The extensive applied value of the theorem had to wait until the middle of the 20th century.

For a simple regression with one independent variable, the variance of the errors is estimated from the squared residuals:

s_e² = Σe² / (n − 2)

The variance of the errors is fundamental in testing hypotheses for a regression. It tells us just how “tight” the dispersion is about the line. As we will see shortly, the greater the dispersion about the line, meaning the larger the variance of the errors, the less probable that the hypothesized independent variable will be found to have a significant effect on the dependent variable. In short, the theory being tested will more likely fail if the variance of the error term is high. Upon reflection this should not be a surprise. As we tested hypotheses about a mean we observed that large variances reduced the calculated test statistic and thus it failed to reach the tail of the distribution. In those cases, the null hypotheses could not be rejected. If we cannot reject the null hypothesis in a regression problem, we must conclude that the hypothesized independent variable has no effect on the dependent variable.

A way to visualize this concept is to draw two scatter plots of x and y data along a predetermined line. The first will have little variance of the errors, meaning that all the data points will lie close to the line. Now do the same except the data points will have a large estimate of the error variance, meaning that the data points are scattered widely about the line. Clearly the confidence about a relationship between x and y is affected by this difference in the estimated error variance.
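That thought experiment can be drawn directly. Here is an entirely illustrative Python sketch (the line, noise levels, and seed are arbitrary choices of mine):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
line = 2 + 1.5 * x  # a predetermined line

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, sigma in zip(axes, (1.0, 6.0)):  # small vs. large error variance
    y = line + rng.normal(0, sigma, size=x.size)  # scatter about the line
    ax.scatter(x, y)
    ax.plot(x, line, color="red")
    ax.set_title(f"error standard deviation = {sigma}")
plt.show()
```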

Testing the Parameters of the Line

The parameter of primary interest is the slope of the population line, β₁, since it measures the effect of X on Y. The test is set up as:

H₀: β₁ = 0
Hₐ: β₁ ≠ 0

Notice that we have set up the presumption, the null hypothesis, as "no relationship". This puts the burden of proof on the alternative hypothesis. In other words, if we are to validate our claim of finding a relationship, we must do so at a 90, 95, or 99 percent level of confidence. The status quo is ignorance, no relationship exists, and to be able to make the claim that we have actually added to our body of knowledge we must do so with a significant probability of being correct. John Maynard Keynes got it right and thus was born Keynesian economics starting with this basic concept in 1936.

The test statistic for this test comes directly from our old friend the standardizing formula:

t_c = \frac{b_1 - \beta_1}{s_{b_1}}

where s_{b_1} is the estimated standard error of the slope estimate b_1.
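Putting the pieces together, here is a hedged Python sketch of the slope test on made-up data (scipy is assumed to be available; all values are illustrative):

```python
import numpy as np
from scipy import stats

# Made-up sample data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 3.9, 6.1, 7.8, 10.2, 11.9])
n = len(x)

# Least squares estimates of intercept and slope.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Standard error of the slope, built from the residuals.
residuals = y - (b0 + b1 * x)
s = np.sqrt(np.sum(residuals**2) / (n - 2))
s_b1 = s / np.sqrt(np.sum((x - x.mean()) ** 2))

t_c = (b1 - 0) / s_b1                   # beta_1 = 0 under H0
p = 2 * stats.t.sf(abs(t_c), df=n - 2)  # two-tailed p-value
print(f"t = {t_c:.3f}, p = {p:.4f}")
```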


Quantitative Analysis for Business Copyright © by Barbara Illowsky; Susan Dean; and Margo Bergman is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.


Correlation vs. Regression: What’s the Difference?

Correlation and regression are two terms in statistics that are related, but not quite the same.

In this tutorial, we’ll provide a brief explanation of both terms and explain how they’re similar and different.

What is Correlation?

Correlation measures the linear association between two variables, x and y . It has a value between -1 and 1 where:

  • -1 indicates a perfectly negative linear correlation between two variables
  • 0 indicates no linear correlation between two variables
  • 1 indicates a perfectly positive linear correlation between two variables

For example, suppose we have the following dataset that contains two variables: (1) Hours studied and (2) Exam Score received for 20 different students:

[Table: hours studied and exam score for each of the 20 students.]

If we created a scatterplot of hours studied vs. exam score, here’s what it would look like:

[Figure: Scatter plot of exam score versus hours studied.]

Just from looking at the plot, we can tell that students who study more tend to earn higher exam scores. In other words, we can visually see that there is a  positive correlation between the two variables.

Using a calculator, we can find that the correlation between these two variables is r = 0.915 . Since this value is close to 1, it confirms that there is a strong positive correlation between the two variables.

What is Regression?

Regression is a method we can use to understand how changing the values of the x variable affects the values of the y variable.

A regression model uses one variable,  x , as the predictor variable, and the other variable,  y , as the response variable . It then finds an equation with the following form that best describes the relationship between the two variables:

ŷ = b₀ + b₁x

  • ŷ: The predicted value of the response variable
  • b₀: The y-intercept (the value of y when x is equal to zero)
  • b₁: The regression coefficient (the average increase in y for a one unit increase in x)
  • x: The value of the predictor variable

For example, consider our dataset from earlier:

Using a linear regression calculator , we find that the following equation best describes the relationship between these two variables:

Predicted exam score = 65.47 + 2.58*(hours studied)

The way to interpret this equation is as follows:

  • The predicted exam score for a student who studies zero hours is  65.47 .
  • The average increase in exam score associated with one additional hour studied is  2.58 .

We can also use this equation to predict the score that a student will receive based on the number of hours studied.

For example, a student who studies 6 hours is expected to receive a score of 80.95 :

Predicted exam score = 65.47 + 2.58*(6) =  80.95 .
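The same prediction is a one-liner in code (a minimal sketch; the function name is mine):

```python
def predicted_exam_score(hours_studied):
    """Predicted exam score from the fitted equation above."""
    return 65.47 + 2.58 * hours_studied

print(predicted_exam_score(6))  # 80.95 (up to floating point rounding)
```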

We can also plot this equation as a line on a scatterplot:

[Figure: Scatter plot of exam score versus hours studied with the fitted regression line drawn through the points.]

We can see that the regression line “fits” the data quite well.

Recall earlier that the correlation between these two variables was r = 0.915 . It turns out that we can square this value and get a number called “r-squared” that describes the total proportion of variance in the response variable that can be explained by the predictor variable.

In this example, r² = 0.915² = 0.837. This means that 83.7% of the variation in exam scores can be explained by the number of hours studied.

Correlation vs. Regression: Similarities & Differences

Here is a summary of the similarities and differences between correlation and regression:

Similarities:

  • Both quantify the direction of a relationship between two variables.
  • Both quantify the strength of a relationship between two variables.

Differences:

  • Regression is able to show a cause-and-effect relationship between two variables. Correlation does not do this.
  • Regression is able to use an equation to predict the value of one variable, based on the value of another variable. Correlation does not do this.
  • Regression uses an equation to quantify the relationship between two variables. Correlation uses a single number.

Additional Resources

The following tutorials offer more in-depth explanations of topics covered in this post.

  • An Introduction to the Pearson Correlation Coefficient
  • An Introduction to Simple Linear Regression
  • Simple Linear Regression Calculator
  • What is a Good R-squared Value?



10.1: Testing the Significance of the Correlation Coefficient



DRAWING A CONCLUSION: There are two methods of making the decision. The two methods are equivalent and give the same result.

  • Method 1: Using the \(p\text{-value}\)
  • Method 2: Using a table of critical values

In this chapter of this textbook, we will always use a significance level of 5%, \(\alpha = 0.05\).

Using the \(p\text{-value}\) method, you could choose any appropriate significance level you want; you are not limited to using \(\alpha = 0.05\). But the table of critical values provided in this textbook assumes that we are using a significance level of 5%, \(\alpha = 0.05\). (If we wanted to use a different significance level than 5% with the critical value method, we would need different tables of critical values that are not provided in this textbook.)

METHOD 1: Using a \(p\text{-value}\) to make a decision

Using the TI-83, 83+, 84, or 84+ calculator.

To calculate the \(p\text{-value}\) using LinRegTTEST:

On the LinRegTTEST input screen, on the line prompt for \(\beta\) or \(\rho\), highlight "\(\neq 0\)"

The output screen shows the \(p\text{-value}\) on the line that reads "\(p =\)".

(Most computer statistical software can calculate the \(p\text{-value}\).)

If the \(p\text{-value}\) is less than the significance level ( \(\alpha = 0.05\) ):

  • Decision: Reject the null hypothesis.
  • Conclusion: "There is sufficient evidence to conclude that there is a significant linear relationship between \(x\) and \(y\) because the correlation coefficient is significantly different from zero."

If the \(p\text{-value}\) is NOT less than the significance level ( \(\alpha = 0.05\) )

  • Decision: DO NOT REJECT the null hypothesis.
  • Conclusion: "There is insufficient evidence to conclude that there is a significant linear relationship between \(x\) and \(y\) because the correlation coefficient is NOT significantly different from zero."

Calculation Notes:

  • You will use technology to calculate the \(p\text{-value}\). The following describes the calculations to compute the test statistics and the \(p\text{-value}\):
  • The \(p\text{-value}\) is calculated using a \(t\)-distribution with \(n - 2\) degrees of freedom.
  • The formula for the test statistic is \(t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}\). The value of the test statistic, \(t\), is shown in the computer or calculator output along with the \(p\text{-value}\). The test statistic \(t\) has the same sign as the correlation coefficient \(r\).
  • The \(p\text{-value}\) is the combined area in both tails.

An alternative way to calculate the \(p\text{-value}\) ( \(p\) ) given by LinRegTTest is the command 2*tcdf(abs(t),10^99, n-2) in 2nd DISTR.
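For readers working outside the calculator, the same p-value can be computed from the test statistic formula above; here is a Python sketch (scipy assumed available):

```python
import math
from scipy import stats

def corr_p_value(r, n):
    """Two-tailed p-value for H0: rho = 0, using a t distribution
    with n - 2 degrees of freedom."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
    return 2 * stats.t.sf(abs(t), df=n - 2)

# Third exam / final exam example: r = 0.6631 with n = 11 data points.
print(round(corr_p_value(0.6631, 11), 3))  # 0.026
```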

THIRD-EXAM vs FINAL-EXAM EXAMPLE: \(p\text{-value}\) method

  • Consider the third exam/final exam example.
  • The line of best fit is: \(\hat{y} = -173.51 + 4.83x\) with \(r = 0.6631\) and there are \(n = 11\) data points.
  • Can the regression line be used for prediction? Given a third exam score ( \(x\) value), can we use the line to predict the final exam score (predicted \(y\) value)?
  • \(H_{0}: \rho = 0\)
  • \(H_{a}: \rho \neq 0\)
  • \(\alpha = 0.05\)
  • The \(p\text{-value}\) is 0.026 (from LinRegTTest on your calculator or from computer software).
  • The \(p\text{-value}\), 0.026, is less than the significance level of \(\alpha = 0.05\).
  • Decision: Reject the Null Hypothesis \(H_{0}\)
  • Conclusion: There is sufficient evidence to conclude that there is a significant linear relationship between the third exam score (\(x\)) and the final exam score (\(y\)) because the correlation coefficient is significantly different from zero.

Because \(r\) is significant and the scatter plot shows a linear trend, the regression line can be used to predict final exam scores.

METHOD 2: Using a table of Critical Values to make a decision

The 95% Critical Values of the Sample Correlation Coefficient Table can be used to give you a good idea of whether the computed value of \(r\) is significant or not . Compare \(r\) to the appropriate critical value in the table. If \(r\) is not between the positive and negative critical values, then the correlation coefficient is significant. If \(r\) is significant, then you may want to use the line for prediction.
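In code, the table lookup reduces to a single comparison (the critical value itself must still come from the 95% table for the appropriate degrees of freedom); a minimal sketch, illustrated with the examples that follow:

```python
def r_is_significant(r, critical_value):
    """r is significant when it lies outside [-critical_value, +critical_value]."""
    return r < -critical_value or r > critical_value

print(r_is_significant(0.801, 0.632))   # True  (Example 1)
print(r_is_significant(-0.624, 0.532))  # True  (Example 2)
print(r_is_significant(0.776, 0.811))   # False (Example 3)
```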

Example \(\PageIndex{1}\)

Suppose you computed \(r = 0.801\) using \(n = 10\) data points. \(df = n - 2 = 10 - 2 = 8\). The critical values associated with \(df = 8\) are \(-0.632\) and \(+0.632\). If \(r <\) negative critical value or \(r >\) positive critical value, then \(r\) is significant. Since \(r = 0.801\) and \(0.801 > 0.632\), \(r\) is significant and the line may be used for prediction. If you view this example on a number line, it will help you.

Horizontal number line with values of -1, -0.632, 0, 0.632, 0.801, and 1. A dashed line above values -0.632, 0, and 0.632 indicates not significant values.

Exercise \(\PageIndex{1}\)

For a given line of best fit, you computed that \(r = 0.6501\) using \(n = 12\) data points and the critical value is 0.576. Can the line be used for prediction? Why or why not?

If the scatter plot looks linear then, yes, the line can be used for prediction, because \(r >\) the positive critical value.

Example \(\PageIndex{2}\)

Suppose you computed \(r = -0.624\) with 14 data points. \(df = 14 - 2 = 12\). The critical values are \(-0.532\) and \(0.532\). Since \(-0.624 < -0.532\), \(r\) is significant and the line can be used for prediction.

Horizontal number line with values of -0.624, -0.532, and 0.532.

Exercise \(\PageIndex{2}\)

For a given line of best fit, you compute that \(r = 0.5204\) using \(n = 9\) data points, and the critical value is \(0.666\). Can the line be used for prediction? Why or why not?

No, the line cannot be used for prediction, because \(r <\) the positive critical value.

Example \(\PageIndex{3}\)

Suppose you computed \(r = 0.776\) and \(n = 6\). \(df = 6 - 2 = 4\). The critical values are \(-0.811\) and \(0.811\). Since \(-0.811 < 0.776 < 0.811\), \(r\) is not significant, and the line should not be used for prediction.

Horizontal number line with values -0.811, 0.776, and 0.811.

Exercise \(\PageIndex{3}\)

For a given line of best fit, you compute that \(r = -0.7204\) using \(n = 8\) data points, and the critical value is \(0.707\). Can the line be used for prediction? Why or why not?

Yes, the line can be used for prediction, because \(r <\) the negative critical value.

THIRD-EXAM vs FINAL-EXAM EXAMPLE: critical value method

Consider the third exam/final exam example. The line of best fit is: \(\hat{y} = -173.51 + 4.83x\) with \(r = 0.6631\) and there are \(n = 11\) data points. Can the regression line be used for prediction? Given a third-exam score ( \(x\) value), can we use the line to predict the final exam score (predicted \(y\) value)?

  • Use the "95% Critical Value" table for \(r\) with \(df = n - 2 = 11 - 2 = 9\).
  • The critical values are \(-0.602\) and \(+0.602\)
  • Since \(0.6631 > 0.602\), \(r\) is significant.
  • Conclusion:There is sufficient evidence to conclude that there is a significant linear relationship between the third exam score (\(x\)) and the final exam score (\(y\)) because the correlation coefficient is significantly different from zero.

Example \(\PageIndex{4}\)

Suppose you computed the following correlation coefficients. Using the table at the end of the chapter, determine if \(r\) is significant and the line of best fit associated with each r can be used to predict a \(y\) value. If it helps, draw a number line.

  • \(r = –0.567\) and the sample size, \(n\), is \(19\). The \(df = n - 2 = 17\). The critical value is \(-0.456\). \(-0.567 < -0.456\) so \(r\) is significant.
  • \(r = 0.708\) and the sample size, \(n\), is \(9\). The \(df = n - 2 = 7\). The critical value is \(0.666\). \(0.708 > 0.666\) so \(r\) is significant.
  • \(r = 0.134\) and the sample size, \(n\), is \(14\). The \(df = 14 - 2 = 12\). The critical value is \(0.532\). \(0.134\) is between \(-0.532\) and \(0.532\) so \(r\) is not significant.
  • \(r = 0\) and the sample size, \(n\), is five. No matter what the \(dfs\) are, \(r = 0\) is between the two critical values so \(r\) is not significant.

Exercise \(\PageIndex{4}\)

For a given line of best fit, you compute that \(r = 0\) using \(n = 100\) data points. Can the line be used for prediction? Why or why not?

No, the line cannot be used for prediction no matter what the sample size is.

Assumptions in Testing the Significance of the Correlation Coefficient

Testing the significance of the correlation coefficient requires that certain assumptions about the data are satisfied. The premise of this test is that the data are a sample of observed points taken from a larger population. We have not examined the entire population because it is not possible or feasible to do so. We are examining the sample to draw a conclusion about whether the linear relationship that we see between \(x\) and \(y\) in the sample data provides strong enough evidence so that we can conclude that there is a linear relationship between \(x\) and \(y\) in the population.

The regression line equation that we calculate from the sample data gives the best-fit line for our particular sample. We want to use this best-fit line for the sample as an estimate of the best-fit line for the population. Examining the scatter plot and testing the significance of the correlation coefficient helps us determine if it is appropriate to do this.

The assumptions underlying the test of significance are:

  • There is a linear relationship in the population that models the average value of \(y\) for varying values of \(x\). In other words, the expected value of \(y\) for each particular value lies on a straight line in the population. (We do not know the equation for the line for the population. Our regression line from the sample is our best estimate of this line in the population.)
  • The \(y\) values for any particular \(x\) value are normally distributed about the line. This implies that there are more \(y\) values scattered closer to the line than are scattered farther away. Assumption (1) implies that these normal distributions are centered on the line: the means of these normal distributions of \(y\) values lie on the line.
  • The standard deviations of the population \(y\) values about the line are equal for each value of \(x\). In other words, each of these normal distributions of \(y\) values has the same shape and spread about the line.
  • The residual errors are mutually independent (no pattern).
  • The data are produced from a well-designed, random sample or randomized experiment.

The left graph shows three sets of points. Each set falls in a vertical line. The points in each set are normally distributed along the line — they are densely packed in the middle and more spread out at the top and bottom. A downward sloping regression line passes through the mean of each set. The right graph shows the same regression line plotted. A vertical normal curve is shown for each line.

Linear regression is a procedure for fitting a straight line of the form \(\hat{y} = a + bx\) to data. The conditions for regression are:

  • Linear In the population, there is a linear relationship that models the average value of \(y\) for different values of \(x\).
  • Independent The residuals are assumed to be independent.
  • Normal The \(y\) values are distributed normally for any value of \(x\).
  • Equal variance The standard deviation of the \(y\) values is equal for each \(x\) value.
  • Random The data are produced from a well-designed random sample or randomized experiment.

The slope \(b\) and intercept \(a\) of the least-squares line estimate the slope \(\beta\) and intercept \(\alpha\) of the population (true) regression line. To estimate the population standard deviation of \(y\), \(\sigma\), use the standard deviation of the residuals, \(s\), where \(s = \sqrt{\frac{SSE}{n-2}}\). The variable \(\rho\) (rho) is the population correlation coefficient. To test the null hypothesis \(H_{0}: \rho =\) hypothesized value, use a linear regression t-test. The most common null hypothesis is \(H_{0}: \rho = 0\), which indicates there is no linear relationship between \(x\) and \(y\) in the population. The TI-83, 83+, 84, 84+ calculator function LinRegTTest can perform this test (STAT TESTS LinRegTTest).

Formula Review

Least Squares Line or Line of Best Fit:

\[\hat{y} = a + bx\]

\[a = y\text{-intercept}\]

\[b = \text{slope}\]

Standard deviation of the residuals:

\[s = \sqrt{\frac{SSE}{n-2}}\]

\[SSE = \text{sum of squared errors}\]

\[n = \text{the number of data points}\]
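As a final check on the formula review, the standard deviation of the residuals is a one-line computation (the SSE value here is made up for illustration):

```python
import math

def residual_sd(sse, n):
    """s = sqrt(SSE / (n - 2)), the standard deviation of the residuals."""
    return math.sqrt(sse / (n - 2))

print(round(residual_sd(2000.0, 11), 3))  # 14.907 for a made-up SSE of 2000
```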
