Lesson 8
Using the Correlation Coefficient
8.1: Putting the Numbers in Context (5 minutes)
Warm-up
The mathematical purpose of this activity is for students to match bivariate data with its context. Students should think about whether they might expect a strong correlation or not as well as whether the relationship has a positive or negative correlation. Monitor for students who discuss linear relationships or variability in the data.
Launch
Arrange students in groups of two to four. Provide access to the scatter plots and contexts for the warm-up.
Student Facing
Match the variables to the scatter plot you think they best fit. Be prepared to explain your reasoning.
context | \(x\) variable | \(y\) variable |
---|---|---|
1. | daily low temperature in Celsius for Denver, CO | boxes of cereal in stock at a grocery in Miami, FL |
2. | average number of free throws shot in a season | basketball team score per game |
3. | measured student height in feet | measured student height in inches |
4. | average number of minutes spent in a waiting room | hospital satisfaction rating |
Student Response
For access, consult one of our IM Certified Partners.
Anticipated Misconceptions
Students may struggle with matching the pairs of variables with a scatter plot. Encourage students to think about how related the variables are and how the \(y\) variable might change as the \(x\) variable increases.
Activity Synthesis
The goal of this discussion is for students to explain how the characteristics of the scatter plots allowed them to determine the context. For each context, select groups to share their match and reasoning. Select groups who discussed linear models and variability in their small group discussions.
Here are some questions for discussion.
- “How did the concept of linear relationships help you to make a match?” (For the height in inches and the height in feet, I knew that the data would be very linear. I chose scatter plot A because it was the only one that appeared linear. I did wonder why scatter plot A was not perfectly linear. Maybe it had to do with rounding or measurement error.)
- “How did you use the concept of variability to help you to make a match?” (First, I expected the temperature and cereal context relationship to be totally random rather than linear at all, so that fit with scatter plot C. Second, I knew that problem 3 matched with scatter plot A, because I knew it should be almost perfectly linear. For a given height in feet, there is only one height in inches.)
- “Which matches were the most difficult? What helped you figure them out?” (It was hardest to find the matches for scatter plots B and D because they were pretty scattered. I noticed that the data in scatter plot D displayed a negative relationship, so I looked for a context that would have a negative relationship. It makes sense that customer satisfaction decreases as wait time increases.)
8.2: Never Know How Far You’ll Go (20 minutes)
Activity
The mathematical purpose of this activity is for students to use technology to find a correlation coefficient and use it to interpret the strength of the linear relationship in context. In addition, students interpret the coefficients of the equation of the line of best fit and use the equation to make predictions. As students examine the situation and describe variables that could influence the results, they are modeling with mathematics (MP4).
Launch
Show students how to find a correlation coefficient using technology. Here is the data from scatter plot A in the warm-up to use in the demonstration.
Copy and paste the table into a blank line using the Graphing tool available in Math Tools, or by navigating to desmos.com/calculator. Do not include the table header when copying. A scatter plot will appear in the graphing window.
height in feet (\(x\)) | height in inches (\(y\)) |
---|---|
5.5 | 66 |
5.25 | 63 |
5 | 60 |
5.5 | 66 |
6 | 72 |
5.8 | 70 |
5.9 | 71 |
6.25 | 75 |
5.4 | 65 |
5.1 | 61 |
4.9 | 59 |
5.7 | 68 |
5.75 | 69 |
5.9 | 71 |
5.5 | 66 |
In the next blank line, graph a line of best fit by typing "y1~ax1+b", which will appear as \(y_1 \sim ax_1+b\). A line will appear in the window with the graph, as well as some data in the entry area. The correlation coefficient, \(r\), can be found in this data.
The equation of the line of best fit is \(y = 12.048x - 0.229\), and the correlation coefficient is 0.999. Since there is a direct conversion between the units, we expect there to be a strong, positive relationship between the variables.
Although it is not perfect (\(r = 0.999\), not 1) due to rounding, the best fit line models the data very well, so there is a strong relationship between height in feet and height in inches. Since an increase in one of the values is paired with an increase in the other, there is a positive relationship.
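If a graphing tool is not available, the same values can be checked with a short script. The following is a minimal Python sketch, not part of the original lesson; the function name `fit_and_correlate` and the list names are made up for illustration. It uses the standard least-squares formulas for the slope, intercept, and correlation coefficient.

```python
# Height data from the demonstration table (feet, inches).
feet = [5.5, 5.25, 5, 5.5, 6, 5.8, 5.9, 6.25, 5.4, 5.1, 4.9, 5.7, 5.75, 5.9, 5.5]
inches = [66, 63, 60, 66, 72, 70, 71, 75, 65, 61, 59, 68, 69, 71, 66]


def fit_and_correlate(xs, ys):
    """Return (slope, intercept, r) for the least-squares line through the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and cross-products about the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


slope, intercept, r = fit_and_correlate(feet, inches)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, r = {r:.3f}")
```

The printed values should round to the slope, intercept, and correlation coefficient quoted above.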
Give students 10 minutes of work time to answer the questions.
Student Facing
Priya takes note of the distance the car drives and the time it takes to get to the destination for many trips.
distance (mi) (\(x\)) | travel time (min) (\(y\)) |
---|---|
2 | 4 |
5 | 7 |
10 | 11 |
10 | 15 |
12 | 16 |
15 | 22 |
20 | 23 |
25 | 25 |
26 | 28 |
30 | 36 |
32 | 35 |
40 | 37 |
50 | 51 |
65 | 70 |
78 | 72 |
- Distance is one factor that influences the travel time of Priya’s car trips. What are some other factors?
- Which of these factors (including distance) most likely has the most consistent influence for all the car trips? Explain your reasoning.
- Use technology to create a scatter plot of the data and add the best fit line to the graph.
- What do the slope and \(y\)-intercept for the line of best fit mean in this situation?
- Use technology to find the correlation coefficient for this data. Based on the value, how would you describe the strength of the linear relationship?
- How long do you think it would take Priya to make a trip of 90 miles if the linear relationship continues? If she drives 90 miles, do you think the prediction you made will be close to the actual value? Explain your reasoning.
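For teachers who want to check student work on these questions without a graphing tool, here is a minimal Python sketch. It is not part of the original lesson, and the variable names (`distance`, `minutes`, `predicted_90`) are made up for illustration. It fits the least-squares line to Priya's data, computes the correlation coefficient, and extends the line to 90 miles.

```python
# Priya's trips from the table above: distance in miles, travel time in minutes.
distance = [2, 5, 10, 10, 12, 15, 20, 25, 26, 30, 32, 40, 50, 65, 78]
minutes = [4, 7, 11, 15, 16, 22, 23, 25, 28, 36, 35, 37, 51, 70, 72]

n = len(distance)
mean_x = sum(distance) / n
mean_y = sum(minutes) / n

# Sums of squared deviations and cross-products about the means.
sxx = sum((x - mean_x) ** 2 for x in distance)
syy = sum((y - mean_y) ** 2 for y in minutes)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(distance, minutes))

slope = sxy / sxx                      # minutes per additional mile
intercept = mean_y - slope * mean_x    # predicted minutes for a 0-mile trip
r = sxy / (sxx * syy) ** 0.5           # correlation coefficient

# Extrapolation: 90 miles is beyond the longest recorded trip (78 miles).
predicted_90 = slope * 90 + intercept

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, r = {r:.2f}")
print(f"predicted time for 90 miles: {predicted_90:.0f} minutes")
```

Because 90 miles lies outside the range of the recorded trips, the prediction is only as trustworthy as the assumption that the linear relationship continues.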
Launch
Show students how to find a correlation coefficient using technology. The digital version of this activity includes instructions for finding the correlation coefficient using Desmos. If you will be using technology other than Desmos (available in Math Tools), you may need to prepare alternate instructions. Use the height data from scatter plot A in the warm-up, given in the table in the Launch above, for the demonstration. Give students 10 minutes of work time to answer the questions.
Anticipated Misconceptions
Students may misunderstand how to interpret negative correlation coefficients. Ask students how the sign of the correlation coefficient is related to the linear model for the situation. Students may benefit from drawing an example scatter plot representing the situation.
Activity Synthesis
The purpose of this discussion is for students to understand that the correlation coefficient quantifies the strength of a linear relationship.
Introduce the terms positive and negative relationship. Give some guidelines as to when to call a relationship strong or not. For example: when \(|r| \geq 0.8\), there is a strong linear relationship; when \(0.5 < |r| < 0.8\), the relationship is moderately strong; and when \(|r| \leq 0.5\), the relationship is weak. Although these are good guidelines, they should not be treated as rules. Context is also important when determining whether to call a relationship strong or weak.
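The guidelines can be written out as a small lookup, which may help when checking many descriptions quickly. This is a minimal Python sketch under the thresholds stated above; the function name `describe_relationship` is hypothetical, and the labels are rough guidelines rather than a rule.

```python
def describe_relationship(r):
    """Label a correlation coefficient using the lesson's rough guidelines."""
    if not -1 <= r <= 1:
        raise ValueError("a correlation coefficient must be between -1 and 1")

    size = abs(r)
    if size >= 0.8:
        strength = "strong"
    elif size > 0.5:
        strength = "moderately strong"
    else:
        strength = "weak"

    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "neither positive nor negative"

    return f"{strength}, {direction}"


for value in (0.99, -0.52, -0.17):
    print(value, "->", describe_relationship(value))
```

A label produced this way is a starting point for discussion; the context of the data still decides whether the description is reasonable.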
In general, \(r\) is related to how much of an improvement a linear model is over just using the mean as an estimate for the data.
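One standard way to make this idea precise, offered here as background rather than as part of the lesson, is that for a least-squares line \(r^2 = 1 - \frac{\text{SSE}}{\text{SST}}\), where SST is the total squared error when every value is estimated by the mean and SSE is the total squared error when each value is estimated by the line. The following Python sketch checks this on the data from Priya's trips; the variable names are made up for illustration.

```python
# Priya's trips: distance in miles, travel time in minutes.
distance = [2, 5, 10, 10, 12, 15, 20, 25, 26, 30, 32, 40, 50, 65, 78]
minutes = [4, 7, 11, 15, 16, 22, 23, 25, 28, 36, 35, 37, 51, 70, 72]

n = len(distance)
mean_x = sum(distance) / n
mean_y = sum(minutes) / n

sxx = sum((x - mean_x) ** 2 for x in distance)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(distance, minutes))

# Least-squares line of best fit.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Total squared error when every trip is estimated by the mean time.
sst = sum((y - mean_y) ** 2 for y in minutes)
# Total squared error when each trip is estimated by the best fit line.
sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(distance, minutes))

improvement = 1 - sse / sst
r = sxy / (sxx * sst) ** 0.5

print(f"r^2 = {r ** 2:.3f}, 1 - SSE/SST = {improvement:.3f}")
```

The two printed numbers should agree, which is one way to see why values of \(r\) near 1 or -1 mean the line is a large improvement over the mean.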
Here are some questions for discussion.
- “How would you describe the relationship between distance and time for Priya’s trips? Explain your reasoning.” (It is a strong positive relationship. It is strong because \(r=0.99\), and it is positive because time, \(y\), tends to increase as distance, \(x\), increases.)
- Refer students to scatter plot D from the warm-up. “How would you describe the relationship between wait time and customer satisfaction? The correlation coefficient is \(r=\text{-}0.52\).” (It is a negative relationship. It is a moderately strong relationship, since the linear model is an improvement over merely using the mean for the data, but it is not a great improvement.)
- “What is an example of data with a strong, negative relationship?” (An example of a strong negative relationship could be between the time it takes you to run a mile and your speed. As your speed increases, the time decreases.)
8.3: Correlation Zoo (10 minutes)
Activity
The mathematical purpose of this activity is to use the correlation coefficient to describe the relationship between two variables. Students examine a pair of variables and a correlation coefficient to describe the relationship between the variables as strong or weak and as positive or negative. Students must reason abstractly and quantitatively (MP2) when they interpret the situation to describe the relationship.
Launch
Arrange students in groups of 2. Give students an example of the type of response expected by telling students: “The cost of a package of light bulbs and the number of light bulbs in the package have a correlation coefficient value near 1. This means that these variables have a very strong, positive relationship. In other words, the price of the package is very closely related to the number of light bulbs in the package (strong relationship) and when one of the variables goes up, the other variable tends to go up, too (positive relationship).”
Student Facing
For each situation, describe the relationship between the variables, based on the correlation coefficient. Make sure to mention whether there is a strong relationship or not as well as whether it is a positive relationship or negative relationship.
- Number of steps taken per day and number of kilometers walked per day. \(r = 0.92\)
- Temperature of a rubber band and distance the rubber band can stretch. \(r = 0.84\)
- Car weight and distance traveled using a full tank of gas. \(r = \text{-}0.86\)
- Average fat intake per citizen of a country and average cancer rate of a country. \(r = 0.73\)
- Score on science exam and number of words written on the essay question. \(r = 0.28\)
- Average time spent listening to music per day and average time spent watching TV per day. \(r = \text{-}0.17\)
Student Facing
Are you ready for more?
A biologist is trying to determine if a group of dolphins is a new species of dolphin or if it is a new group of individuals within the same species of dolphin. The biologist measures the width (in millimeters) of the largest part of the skull, zygomatic width, and the length (in millimeters) of the snout, rostral length, of 10 dolphins from the same group of individuals.
The data appears to be linear and the equation of the line of best fit is \(y = 0.201x + 110.806\) and the \(r\)-value is 0.201.
\(x\), rostral length (mm) | \(y\), zygomatic width (mm) |
---|---|
288 | 147 |
247 | 147 |
268 | 171 |
278 | 177 |
258 | 168 |
272 | 184 |
272 | 161 |
258 | 159 |
273 | 168 |
277 | 166 |
- After checking the data, the biologist realizes that the first zygomatic width listed as 147 mm is an error. It is supposed to be 180 mm. Use technology to find the equation of a line of best fit and the correlation coefficient for the corrected data. What is the equation of the line of best fit and the correlation coefficient?
- Compare the new equation of the line of best fit with the original. What impact did changing one data point have on the slope, \(y\)-intercept, and correlation coefficient of the line of best fit?
- Why do you think that the weak positive association became a moderately strong association? Explain your reasoning.
- Use technology to change the \(y\)-value for the first and second entries in the table.
  - How does changing each point’s \(y\)-value impact the correlation coefficient?
  - Can you change two values to get the correlation coefficient closer to 1? Use data to support your answer.
  - Leaving \((288,180)\) unchanged, can you change a value to get the relationship to change from a positive one to a negative one? Use data to support or refute your answer.
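The effect of correcting a single data point can also be explored with a short script. This is a minimal Python sketch, not part of the original activity; the function name `correlation` and the list names are hypothetical. It computes the correlation coefficient for the dolphin data before and after the first zygomatic width is changed from 147 mm to 180 mm.

```python
def correlation(xs, ys):
    """Return the correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5


# Dolphin measurements in millimeters, in table order.
rostral = [288, 247, 268, 278, 258, 272, 272, 258, 273, 277]
zygomatic = [147, 147, 171, 177, 168, 184, 161, 159, 168, 166]

# Corrected data: the first width was recorded as 147 mm by mistake.
corrected = zygomatic.copy()
corrected[0] = 180

r_before = correlation(rostral, zygomatic)
r_after = correlation(rostral, corrected)

print(f"r before correction: {r_before:.3f}")
print(f"r after correction:  {r_after:.3f}")
```

Editing other entries of `corrected` and rerunning the script is one way to try the remaining questions, since it shows how far a single point can move the correlation coefficient in a data set of only 10 values.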
Activity Synthesis
The purpose of this discussion is for students to interpret the data based on the relationship between the two variables that they determined using the correlation coefficient.
Ask:
- “Do these correlations make sense based on what you understand about these variables?” (Yes, mostly. Walking more, for example, usually means you take more steps.)
- “Based on the information here, does a greater fat intake cause cancer?” (Not necessarily. There seems to be a link, but there are many other factors about countries with greater fat intake that might play a part.) Causal relationships are the focus of the next lesson, so it is okay for students to struggle with this question now.
- “What does it mean for the relationship between the score on a science exam and the number of words written to be weak and positive?” (It means that, in general, the exam score tended to increase as the number of words written increased, but the relationship between these two variables was weak. This makes sense, because if you wrote very few words, you probably would not get a good score, but writing lots of words does not necessarily guarantee you a good score.)
Lesson Synthesis
Here are some questions for discussion.
- “What does it mean for two variables to have a weak, positive relationship?” (When one of the variables increases, the other variable also tends to increase, but since the linear relationship is weak, the data does not closely follow a line.)
- “If the \(r\)-value for a line of best fit in a scatter plot is 0.8, what would you expect the data in the scatter plot to look like?” (I would expect the scatter plot to show the data generally increasing from left to right, and for the data to look somewhat like a line, but not perfectly like a line.)
- “What does the correlation coefficient tell you about the relationship between two variables?” (It tells you if the two variables are positively or negatively related. It also tells you how strong the relationship between the two variables is.)
8.4: Cool-down - How Bad Is It, Doc? (5 minutes)
Cool-Down
For access, consult one of our IM Certified Partners.
Student Lesson Summary
Student Facing
The value for the correlation coefficient can be used to determine the strength of the relationship between the two variables represented in the data.
In general, when the variables increase together, we can say they have a positive relationship. If an increase in one variable’s data tends to be paired with a decrease in the other variable’s data, the variables have a negative relationship. When the data is tightly clustered around the best fit line, we say there is a strong relationship. When the data is loosely spread around the best fit line, we say there is a weak relationship.
A correlation coefficient with a value near 1 suggests a strong, positive relationship between the variables. This means that most of the data tends to be tightly clustered around a line, and that when one of the variables increases in value, the other does as well. The number of schools in a community and the population of the community is an example of variables that have a strong, positive correlation. When there is a large population, there is usually a large number of schools, and small communities tend to have fewer schools, so the correlation is positive. These variables are closely tied together, so the correlation is strong.
Similarly, a correlation coefficient near -1 suggests a strong, negative relationship between the variables. Again, most of the data tends to be tightly clustered around a line, but now, when one value increases, the other decreases. The time since you left home and the distance left to reach school have a strong, negative correlation. As the travel time increases, the distance to school tends to decrease, so this is a negative correlation. The variables are again closely, linearly related, so this is a strong correlation.
Weaker correlations mean that factors other than the connection between the two variables may account for much of the variability in the data. For example, number of pets and number of siblings have a weak correlation. There may be some relationship, but there are many other factors that account for the variability in the number of pets other than the number of siblings.
The context of the situation should be considered when determining whether the correlation value is strong or weak. In physics, measuring with precise instruments, a correlation coefficient of 0.8 may not be considered strong. In social sciences, collecting data through surveys, a correlation coefficient of 0.8 may be very strong.