Lesson 15

Comparing Data Sets

15.1: Bowling Partners (10 minutes)

Warm-up

The mathematical purpose of this activity is for students to compare different distributions using shape, measures of center, and measures of variability. This warm-up prompts students to compare four distributions representing recent bowling scores for potential teammates. It gives students a reason to use language precisely (MP6) and gives you the opportunity to hear how they use terminology and talk about characteristics of the histograms in comparison to one another.

Launch

Arrange students in groups of 2–4.

Student Facing

Each histogram shows the bowling scores for the last 25 games played by each person. Choose 2 of these people to join your bowling team. Explain your reasoning.

Person A

  • mean: 118.96
  • median: 111
  • standard deviation: 32.96
  • interquartile range: 44
Histogram for bowler A

Person B

  • mean: 131.08
  • median: 129
  • standard deviation: 8.64
  • interquartile range: 8
Histogram for bowler B

Person C

  • mean: 133.92
  • median: 145
  • standard deviation: 45.04
  • interquartile range: 74
Histogram for bowler C

Person D

  • mean: 116.56
  • median: 103
  • standard deviation: 56.22
  • interquartile range: 31.5
Histogram for bowler D

Student Response

Student responses to this activity are available at one of our IM Certified Partners

Activity Synthesis

Ask each group to share one bowler they would choose and their reasoning. If none of the groups select a certain player, ask why this player was not chosen or ask for reasons why another team may want this player. Since there is no single correct answer to the question of which bowlers to choose, attend to students’ explanations and ensure the reasons given are correct. During the discussion, ask students to explain how they used the statistics given as well as the histograms.
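
The comparison students make here can be sketched in a few lines of Python. This is only an illustration, not part of the activity: the statistics are copied from the four bowlers above, and the choice to rank by median and by standard deviation is an assumption about what a team might value (the helper name `rank` is hypothetical).

```python
# Summary statistics copied from the four bowlers in the warm-up.
bowlers = {
    "A": {"mean": 118.96, "median": 111, "sd": 32.96, "iqr": 44},
    "B": {"mean": 131.08, "median": 129, "sd": 8.64, "iqr": 8},
    "C": {"mean": 133.92, "median": 145, "sd": 45.04, "iqr": 74},
    "D": {"mean": 116.56, "median": 103, "sd": 56.22, "iqr": 31.5},
}


def rank(stat, reverse):
    """Return bowler labels ordered by one statistic."""
    return sorted(bowlers, key=lambda name: bowlers[name][stat], reverse=reverse)


# Assumption: a higher median means a higher typical score,
# and a smaller standard deviation means a more consistent bowler.
highest_typical = rank("median", reverse=True)
most_consistent = rank("sd", reverse=False)

print("By median, highest first:", highest_typical)
print("By standard deviation, smallest first:", most_consistent)
```

The two rankings disagree, which is the point of the warm-up: bowler C has the highest median but a large spread, while bowler B is the most consistent.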

15.2: Comparing Marathon Times (10 minutes)

Activity

The mathematical purpose of this activity is for students to compare measures of center and measures of variability in context. Monitor for students who:

  1. Determine the slower age group by using an informal description of the shift in data.
  2. Determine the slower age group by using a numerical estimate of the mean or median as a measure of center.
  3. Determine variability from the range of values.
  4. Use a numerical estimate of the IQR or standard deviation as a measure of variability.
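
One way to move from an informal description of the shift to a numerical estimate is to find which interval of the dot plot holds the middle of the data. The sketch below is a hypothetical helper (the name `median_interval` and its parameters are not from the lesson) that uses the interval counts from the dot plot descriptions in this activity, assuming intervals of width 20 starting at 220 minutes.

```python
from bisect import bisect_left
from itertools import accumulate

# Dots per 20-minute interval, starting at 220 minutes (from the dot plot descriptions).
ages_30s = [1, 11, 10, 10, 5, 4, 4, 5, 0, 0, 0, 0]
ages_40s = [0, 1, 7, 5, 4, 5, 4, 3, 5, 1, 6, 3]


def median_interval(counts, start=220, width=20):
    """Return (low, high) bounds of the interval(s) containing the middle value(s)."""
    total = sum(counts)
    # 1-indexed positions of the middle value(s); they coincide when total is odd.
    lo_pos, hi_pos = (total + 1) // 2, total // 2 + 1
    cumulative = list(accumulate(counts))
    # First interval whose running total reaches each middle position.
    lo_bin = bisect_left(cumulative, lo_pos)
    hi_bin = bisect_left(cumulative, hi_pos)
    return start + lo_bin * width, start + (hi_bin + 1) * width


print(median_interval(ages_30s))  # (280, 300)
print(median_interval(ages_40s))  # (320, 360)
```

The estimated ranges only bracket the median; they show the 40–49 group centered farther right without requiring the individual finish times.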

Launch

Conversing: MLR 5 Co-Craft Questions. Display only the first line of this task (“All of the marathon runners from each of two different age groups have their finishing times represented in the dot plot.”) and the two dot plots, and invite pairs of students to write possible mathematical questions about the situation. Then, invite pairs to share their questions with the class. This helps students produce the language of mathematical questions and talk about the relationships between the two data sets in this task, the measures of center and measures of variability, prior to being asked to analyze another’s reasoning.
Design Principle(s): Cultivate conversation

Student Facing

All of the marathon runners from each of two different age groups have their finishing times represented in the dot plot.

Dot plot from 220 to 460 by 20’s. ages 30 through 39 marathon finish times in minutes. Beginning at 220 up to but not including 240, number of dots in each interval is 1, 11, 10, 10, 5, 4, 4, 5, 0, 0, 0, 0.
Dot plot from 220 to 460 by 20’s. ages 40 through 49 marathon finish times in minutes. Beginning at 220 up to but not including 240, number of dots in each interval is 0, 1, 7, 5, 4, 5, 4, 3, 5, 1, 6, 3.
  1. Which age group tends to take longer to run the marathon? Explain your reasoning.
  2. Which age group has more variable finish times? Explain your reasoning.

Student Facing

Are you ready for more?

  1. How do you think finish times for a 20–29 age range will compare to these two distributions?

  2. Find some actual marathon finish times for this group and make a box plot of your data to help compare.

Activity Synthesis

The purpose of this discussion is for students to understand how to compare data sets using measures of center and measures of variability.

Select students to share their answers and reasoning in a sequence that moves from less formal reasoning to more definite values and statistics.

After several estimates for the measures of center and variability are mentioned, display the actual values for these data sets.

Ages 30–39

  • Mean: 294.5 minutes
  • Standard deviation: 41.84 minutes
  • Median: 288.5 minutes
  • IQR: 65 minutes
  • Q1: 260 minutes
  • Q3: 325 minutes

Ages 40–49

  • Mean: 340.6 minutes
  • Standard deviation: 59.93 minutes
  • Median: 332 minutes
  • IQR: 92 minutes
  • Q1: 289 minutes
  • Q3: 381 minutes

Ask students:

  • “What measure of center and measure of variability are most appropriate to use with the distributions? Explain your reasoning.” (The median is the most appropriate because it is closer to where the majority of the data is. Since I used the median as the measure of center, I used the IQR as a measure of variability.)
  • “Which points show values that are most likely to be outliers?” (The slowest runners in each group, on the right end of the dot plot, might be outliers.)
  • “Based on the displayed information, are there any outliers in these data sets?” (No. For the 30–39 age group, outliers would need to be values greater than 422.5 minutes, but the slowest times are around 380 minutes. For the 40–49 age group, outliers would need to be greater than 519 minutes, but the slowest times are around 450 minutes.)
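
The upper thresholds in the last response can be checked from the displayed quartiles. This is a minimal sketch that assumes the usual 1.5 × IQR rule for outliers, which reproduces the 422.5-minute threshold quoted for the 30–39 group; the function name `outlier_fences` is hypothetical.

```python
def outlier_fences(q1, q3):
    """Return (lower, upper) outlier fences, assuming the 1.5 * IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Quartiles are the displayed values for each age group, in minutes.
low_30s, high_30s = outlier_fences(260, 325)  # (162.5, 422.5)
low_40s, high_40s = outlier_fences(289, 381)  # (151.0, 519.0)

print("Ages 30-39:", low_30s, high_30s)
print("Ages 40-49:", low_40s, high_40s)
```

Both dot plots run from 220 to 460 minutes, so no value falls below either lower fence or above either upper fence.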

15.3: Comparing Measures (15 minutes)

Activity

In this activity, students take turns with a partner determining the best measure of center and the best measure of variability for several data sets. Students trade roles explaining their thinking and listening, providing opportunities to explain their reasoning and critique the reasoning of others (MP3). Students also determine which data set has a greater measure of center and which has a greater measure of variability.

Launch

Arrange students in groups of 2. Tell students that for each data display or description of a data set in column A, one partner determines the appropriate measure of center and measure of variability and explains why they think it is appropriate. The partner's job is to listen and make sure they agree. If they don't agree, the partners discuss until they come to an agreement. For the next data display or description of a data set in column B, the students swap roles. If necessary, demonstrate this protocol before students start working. The last item has a column C. Students can work together to determine the best measures for set C. Once an agreement is reached for each group of data sets, students will determine which data set has the greatest measure of center, and which data set has the greatest measure of variability.

Engagement: Internalize Self Regulation. Demonstrate giving and receiving constructive feedback. Use a structured process and display sentence frames to support productive feedback. For example, “That measure of center could/couldn’t be true because….” or “Based on the shape of the distribution, a better choice would be _____ because….”
Supports accessibility for: Social-emotional skills; Organization; Language

Student Facing

For each group of data sets,

  • Determine the best measure of center and measure of variability to use based on the shape of the distribution.
  • Determine which set has the greatest measure of center.
  • Determine which set has the greatest measure of variability.
  • Be prepared to explain your reasoning.

1a

Dot plot from negative 16 to negative 3 by 1's. Distribution 1a. Beginning at negative 12, number of dots above each increment is 6, 4, 3, 2, 1, 2, 3, 4, 6.

1b

Dot plot from negative 16 to negative 3 by 1's. Distribution 1b. Beginning at negative 14, number of dots above each increment is 1, 2, 4, 5, 7, 5, 4, 2, 1, 0, 0, 0.

2a

Dot plot from 11 to 33 by 1's. Distribution 2a. Beginning at 13, number of dots above each increment is 1, 1, 2, 2, 2, 3, 3, 4, 3, 3, 2, 2, 2, 1, 1, 0, 0, 0, 0, 0, 0.

2b

Dot plot from 11 to 33 by 1's. Distribution 2b. Beginning at 27, number of dots above each increment is 1, 5, 6, 8, 6, 5, 1.

3a

Dot plot from 0 to 12 by 1's. Distribution 3a. Beginning at 0, number of dots above each increment is 0, 3, 2, 1, 1, 0, 2, 2, 3, 3, 5, 4.

3b

Dot plot from 0 to 12 by 0.5's. Distribution 3b. Beginning at 0, number of dots above each increment is 0, 4, 5, 3, 3, 2, 2, 0, 1, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.
 

4a

Dot plot from 78 to 112 by 2's. Distribution 4a. Beginning at 78, number of dots above each increment is 0, 0, 0, 0, 0, 0, 2, 2, 3, 3, 4, 5, 4, 3, 3, 2, 2.

4b

Dot plot from 78 to 112 by 2's. Distribution 4b. Beginning at 78, number of dots above each increment is 0, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0. 

5a

Box plot from 0 to 1,200 by 100's. Distribution 5a. Whisker from 500 to 600. Box from 600 to 900 with vertical line at 700. Whisker from 900 to 1100.

5b

Box plot from 0 to 1,200 by 100's. Distribution 5b. Whisker from 200 to 450. Box from 450 to 650 with vertical line at 500. Whisker from 650 to 700.

6a

A political podcast has mostly reviews that either love the podcast or hate it.

6b

A cooking podcast has reviews that neither hate nor love the podcast.

7a

Stress testing concrete from site A has all 12 samples break at 450 pounds per square inch (psi).

7b

Stress testing concrete from site B has samples break every 10 psi starting at 450 psi until the last core is broken at 560 psi. 

7c

Stress testing concrete from site C has 6 samples break at 430 psi and the other 6 break at 460 psi.

Anticipated Misconceptions

For the situations described in words, students may think there is not enough information to answer the question. Ask these students, "What do you think the distributions might look like for the situations described?" Tell them to use their distributions to answer the question and be prepared to explain their reasoning.
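
For the concrete samples in problem 7, one possible reading of the descriptions can be written out as actual data. The sketch below is illustrative only. It assumes site B has exactly one sample at each 10 psi step from 450 to 560 (12 samples, matching site A), and it uses the population standard deviation (`statistics.pstdev`); both choices are assumptions, not statements from the task.

```python
import statistics

# Data sets built from the verbal descriptions (psi).
site_a = [450] * 12                 # all 12 samples break at 450
site_b = list(range(450, 561, 10))  # assumed: one sample at 450, 460, ..., 560
site_c = [430] * 6 + [460] * 6      # 6 samples at 430 and 6 at 460


def summarize(data):
    """Return the mean, median, and population standard deviation of a data set."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "sd": statistics.pstdev(data),
    }


summary = {name: summarize(data)
           for name, data in [("A", site_a), ("B", site_b), ("C", site_c)]}

for name, stats in summary.items():
    print(name, stats)
```

Under these assumptions, site B has both the greatest center and the greatest variability, and site A has no variability at all.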

Activity Synthesis

Select students to share how they determined whether to use the mean or the median, and how they figured out which data set showed greater variability.

  • “What were some ways you handled the last two problems?” (I reasoned about what a dot plot might look like to imagine where the center might be and how spread out the data might look.)
  • “Describe any difficulties you experienced and how you resolved them.” (I forgot what information was in a box plot, so I asked my partner.)
  • “How did you decide whether to use the mean or the median?” (I recalled what I learned about shape from the previous lesson. When the distribution was symmetric or close to it, I used the mean. When it was skewed, I used the median.)
  • “How did you decide which data set showed greater variability?” (It was easy for the box plots because I could calculate the IQR. For the dot plots, I looked at which one was more spread apart. For the problem contexts, I tried to see which one would have data that would be more varied.)

Lesson Synthesis

The purpose of this discussion is to ensure students know how to compare data sets using measures of variability, including standard deviation, and measures of center. Here are some questions for discussion.

  • “How do you compare the measures of variability for two data sets?” (You either calculate them or estimate them from a data display. The data set with the higher measure of variability is more variable.)
  • “How do you estimate variability when looking at data displays?” (You try to estimate the center and then estimate how spread apart the data is.)
  • “How do you determine which measure of center to use for a data set?” (You look at the shape and use the mean when it is symmetric or really close and the median when it is skewed or if there are outliers.)
  • “Why is the median the preferred measure of center for skewed distributions?” (The median is preferred because it more accurately represents the center of the data. Data values farther from the center impact the median less than the mean, so the median remains near the typical values.)
  • “Why is the mean the preferred measure of center for symmetric distributions?” (In a symmetric distribution the mean is equal to the median. The mean is preferred because it takes into account all of the values in the data set when it is calculated.)
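
The claim that values far from the center affect the median less than the mean can be illustrated with a small example. The data below are made up for this sketch and do not come from the lesson.

```python
import statistics

# Made-up data: a small symmetric set, then the same set with one far-away value added.
scores = [10, 11, 12, 13, 14]
with_extreme = scores + [60]

# How far each measure of center moves when the extreme value is added.
mean_shift = statistics.mean(with_extreme) - statistics.mean(scores)
median_shift = statistics.median(with_extreme) - statistics.median(scores)

print("Mean moves by", mean_shift)      # 8
print("Median moves by", median_shift)  # 0.5
```

In the symmetric set the mean and median are both 12. After the extreme value is added, the mean is pulled to 20 while the median only moves to 12.5, staying near the typical values.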

15.4: Cool-down - Comparing Mascots (5 minutes)

Cool-Down

Cool-downs for this lesson are available at one of our IM Certified Partners

Student Lesson Summary

Student Facing

To compare data sets, it is helpful to look at the measures of center and measures of variability. The shape of the distribution can help choose the most useful measure of center and measure of variability.

When distributions are symmetric or approximately symmetric, the mean is the preferred measure of center and should be paired with the standard deviation as the preferred measure of variability. When distributions are skewed or when outliers are present, the median is usually a better measure of center and should be paired with the interquartile range (IQR) as the preferred measure of variability.

Once the appropriate measure of center and measure of variability are selected, these measures can be compared for data sets with similar shapes.

For example, let’s compare the number of seconds it takes football players to complete a 40-yard dash at two different positions. First, we can look at a dot plot of the data to see that the tight end times do not seem symmetric, so we should probably find the median and IQR for both sets of data to compare information.

Dot plot from 4 point 25 to 5 point 75 by point 25’s. Wide receiver times in seconds. Beginning at 4 point 25 up to but not including 4 point 5, number of dots in each interval is 12, 11, 2, 0, 0, 0.

Dot plot from 4 point 25 to 5 point 75 by point 25’s. Tight end times in seconds. Beginning at 4 point 25 up to but not including 4 point 5, number of dots in each interval is 0, 10, 6, 4, 3, 1.

The median and IQR could be computed from the values, but can also be determined from a box plot.

Box plot.
Box plot for tight end times.

This shows that the tight end times have a greater median (about 4.9 seconds) compared to the median of wide receiver times (about 4.5 seconds). The IQR is also greater for the tight end times (about 0.5 seconds) compared to the IQR for the wide receiver times (about 0.25 seconds).

This means that the tight ends tend to be slower in the 40-yard dash when compared to the wide receivers. The tight ends also have greater variability in their times. Together, this can be taken to mean that, in general, a typical wide receiver is faster than a typical tight end, and the wide receivers tend to have more similar times to one another than the tight ends do to one another.