Lesson 14

Outliers

14.1: Health Care Spending (10 minutes)

Warm-up

The purpose of this warm-up is to elicit the idea that outliers are often present in data, which will be useful when students investigate the source of outliers and what to do with them in a later activity. 

As students work, monitor for students who

  • estimate the IQR from values in the box plot
  • use a measurement tool to determine the IQR from the box plot

Students are given the formulas for outliers: a value is considered an outlier for a data set if it is greater than Q3 + 1.5 \(\boldcdot\) IQR or less than Q1 - 1.5 \(\boldcdot\) IQR. To find extreme values, we are comparing very large or small values to the bulk of the data. This means using the quartiles and interquartile range to compare the value to typical distances to the center of the data.

Launch

Display the histogram and box plot for all to see. Tell students to think of one thing they notice and one thing they wonder about the images. Give students 1 minute of quiet think time, and then 1 minute to discuss the things they notice with their partner. Listen for students who notice that there is a value that seems greatly different from the rest of the data. Select a few students to share things they notice and wonder, making sure to include the identified students who noticed an extreme value.

Student Facing

The histogram and box plot show the average amount of money, in thousands of dollars, spent on each person in the country (per capita spending) for health care in 34 countries.

Histogram from 1 to 10 by 1’s. Per capita health spending by country (thousands of dollars). Beginning at 1 up to but not including 2, height of bar at each interval is 7, 8, 3, 8, 6, 1, 0, 0, 1.

 

Box plot
  1. One value in the set is an outlier. Which one is it? What is its approximate value?
  2. By one rule for deciding, a value is an outlier if it is more than 1.5 times the IQR greater than Q3. Show on the box plot whether or not your value meets this definition of outlier.

Student Response

Student responses to this activity are available at one of our IM Certified Partners

Activity Synthesis

Select previously identified students in the order listed in the lesson narrative to share their method for creating this visualization of outliers in the box plot.

Tell students:

  • Values in a data set that are greatly different from the rest of the data are called outliers. The precise meaning of greatly different will be different for different situations. For example, a possible $4,000 difference in this graph does seem like a lot, but if the data represented the entire budgets of these countries in the billions or trillions of dollars (rather than spending on each member of the population for healthcare), it would not be a great difference.
  • Using the IQR to determine outliers helps to adjust the difference to the variability of the bulk of the middle data. Using 1.5 times the IQR allows for some variability on the ends of the distribution to be considered usual.
  • It is also possible for there to be values that are unusually low compared to the rest of the data set. Consider this box plot that displays \(\text{Q1} - 1.5 \boldcdot \text{IQR}\). The minimum value for this data set should be considered an outlier.
    Box plot
  • For the purposes of this unit, a value will be considered an outlier for a data set if it is greater than Q3 + 1.5 \(\boldcdot\) IQR or less than Q1 - 1.5 \(\boldcdot\) IQR. These formulas compare extreme values to the middle half of the data to determine if the value should be considered an outlier.

Create a display for the outlier formulas so that it can be referenced throughout the rest of the unit.
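If Python is one of the statistical tools available to the class, the outlier formulas can also be expressed as a short function. This is only a sketch: the function names `outlier_fences` and `is_outlier` are illustrative choices, and the quartile values are made up for the example rather than taken from the warm-up box plot.

```python
def outlier_fences(q1, q3):
    """Return the cutoffs (Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def is_outlier(value, q1, q3):
    """True if the value lies strictly beyond either cutoff."""
    lower, upper = outlier_fences(q1, q3)
    return value < lower or value > upper


# Made-up quartiles for illustration only (not from the lesson's data).
q1, q3 = 40, 50
print(outlier_fences(q1, q3))      # (25.0, 65.0)
for value in (20, 30, 70):
    # 20 is unusually low, 30 is within the cutoffs, 70 is unusually high
    print(value, is_outlier(value, q1, q3))
```

The example includes one value below the lower cutoff to show that the same rule flags unusually low values as well as unusually high ones.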

14.2: Investigating Outliers (15 minutes)

Activity

The mathematical purpose of this activity is for students to investigate the impact of outliers on measures of center and variability, and to make decisions about whether or not to include outliers in a data set.

Launch

Arrange students in groups of 2. Provide access to devices that can run GeoGebra or other statistical technology.

Display the data showing Per Capita Health Spending by Country in 2016 for all to see. Orient students to the data and explain that this is the data set represented by the histogram used in the warm-up. Ensure that students understand what per capita means. “Per capita health spending” means the average health spending per person. For example, the United States spends approximately \$9,892 on healthcare for each person in the population.

Remind students that we will classify a value in a data set as an outlier if it is greater than Q3 + 1.5 \(\boldcdot\) IQR or less than Q1 - 1.5 \(\boldcdot\) IQR.

Action and Expression: Internalize Executive Functions. Chunk this task into more manageable parts to support students who benefit from support with organizational skills in problem solving. For example, present one question at a time.
Supports accessibility for: Organization; Attention

Student Facing

Here is the data set used to create the histogram and box plot from the warm-up. 

  • 1.0803
  • 1.0875
  • 1.4663
  • 1.7978
  • 1.9702
  • 1.9770
  • 1.9890
  • 2.1011
  • 2.1495
  • 2.2230
  • 2.5443
  • 2.7288
  • 2.7344
  • 2.8223
  • 2.8348
  • 3.2484
  • 3.3912
  • 3.5896
  • 4.0334
  • 4.1925
  • 4.3763
  • 4.5193
  • 4.6004
  • 4.7081
  • 4.7528
  • 4.8398
  • 5.2050
  • 5.2273
  • 5.3854
  • 5.4875
  • 5.5284
  • 5.5506
  • 6.6475
  • 9.8923
  1. Use technology to find the mean, standard deviation, and five-number summary.
  2. The maximum value in this data set represents the spending for the United States. Should the per capita health spending for the United States be considered an outlier? Explain your reasoning.
  3. Although outliers should not be removed without considering their cause, it is important to see how influential outliers can be for various statistics. Remove the value for the United States from the data set.
    1. Use technology to calculate the new mean, standard deviation, and five-number summary.
    2. How do the mean, standard deviation, median, and interquartile range of the data set with the outlier removed compare to the same summary statistics of the original data set?

Student Response

Student responses to this activity are available at one of our IM Certified Partners

Anticipated Misconceptions

Students may incorrectly compute the expression for outliers. Remind them to follow the correct order of operations, referring to the math talk warm-up from a previous lesson.

Activity Synthesis

The goal is to make sure that students understand that outliers can significantly impact measures of center and variability. Discuss the impact of the outlier on the median, mean, and standard deviation and the student responses to “Do you think that 9.8923 should be eliminated from the data set? Why or why not?”

If time permits, discuss questions such as:

  • “Which measure of center is more greatly affected by the inclusion of extreme values, the mean or median? Explain your reasoning.” (The mean since it uses the actual numerical value rather than the position of the values like the median does.)
  • “Which measure of variability is more greatly affected by the inclusion of extreme values, the standard deviation or the interquartile range? Explain your reasoning.” (The standard deviation since it uses the mean as well as the numerical value of each number in the data set whereas the IQR only uses the position of the middle half of the data.)
Listening, Speaking: MLR 2 Collect and Display. Before the whole-class discussion, give students the opportunity to talk with their partner about using statistical tools to calculate and display numeric statistics. Write down common or important phrases you hear students say about each data set, such as mean, median, outliers, or standard deviation. Write the students’ words on a visual display of the representations of one- and two-variable data sets. Continue to add to this display throughout the rest of the lesson. This will help students read and use mathematical language during their paired and whole-class discussion while making decisions about whether or not to include the outliers in a data set. Design Principle(s): Support sense-making
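For classes using Python as the statistical technology, the comparison in this activity could be sketched as follows. Several choices here are assumptions rather than part of the lesson: the helper names `five_number_summary` and `describe` are illustrative, the quartiles are computed as the medians of the lower and upper halves of the sorted data (other tools may use a different quartile method and report slightly different values), and `stdev` is the sample standard deviation (use `pstdev` for the population version).

```python
from statistics import mean, median, stdev


def five_number_summary(values):
    """Min, Q1, median, Q3, max; quartiles are medians of the two halves.

    Assumes at least two values. With an odd count, the middle value is
    left out of both halves.
    """
    data = sorted(values)
    half = len(data) // 2
    lower, upper = data[:half], data[-half:]
    return min(data), median(lower), median(data), median(upper), max(data)


def describe(values):
    """Collect the statistics compared in the activity."""
    low, q1, med, q3, high = five_number_summary(values)
    return {"mean": mean(values), "sd": stdev(values), "min": low,
            "Q1": q1, "median": med, "Q3": q3, "max": high, "IQR": q3 - q1}


# Per capita health spending (thousands of dollars) from the task statement.
spending = [
    1.0803, 1.0875, 1.4663, 1.7978, 1.9702, 1.9770, 1.9890, 2.1011, 2.1495,
    2.2230, 2.5443, 2.7288, 2.7344, 2.8223, 2.8348, 3.2484, 3.3912, 3.5896,
    4.0334, 4.1925, 4.3763, 4.5193, 4.6004, 4.7081, 4.7528, 4.8398, 5.2050,
    5.2273, 5.3854, 5.4875, 5.5284, 5.5506, 6.6475, 9.8923,
]

with_us = describe(spending)
without_us = describe(spending[:-1])   # drop the maximum (United States)

upper_cutoff = with_us["Q3"] + 1.5 * with_us["IQR"]
print("United States is an outlier:", max(spending) > upper_cutoff)

for name in ("mean", "sd", "median", "IQR"):
    print(f"{name}: {with_us[name]:.4f} -> {without_us[name]:.4f}")
```

Printing the statistics side by side makes it easier to see that the mean and standard deviation shift more than the median and IQR when the largest value is removed.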

14.3: Origins of Outliers (10 minutes)

Activity

The mathematical purpose of this activity is to get students thinking about the source of outliers and whether or not it is appropriate to include them when analyzing data. It is important to stress that data should not be removed simply because it is an outlier. If there is any doubt about the reason for the outlier, the data should be included in any analysis done on the data set.

Launch

Arrange students in groups of 2. Give students quiet think time to answer the first question. Ask partners to compare answers. Follow with a whole-class discussion. Ask partners to answer and discuss the remaining questions.

Reading, Writing: MLR 3 Clarify, Critique, Correct. Present an incorrect or ambiguous statement for the first question involving outliers. For example, present the following statement: “593 is an outlier since I multiplied by the IQR and found it.” Prompt discussion by asking, “What were the steps that the author took?” Ask students to clarify and correct the statement. Improved statements should include some of the following: explanation of each step, order/time transition words (first, next, then, etc.), and/or reasons for decisions made during steps. This helps students evaluate, and improve on, the written mathematical arguments of others. Design Principle(s): Maximize meta-awareness
Representation: Internalize Comprehension. Activate or supply background knowledge. Allow students to use calculators to ensure inclusive participation in the activity.
Supports accessibility for: Memory; Conceptual processing

Student Facing

  1. The number of property crime (such as theft) reports is collected for 50 colleges in California. The data and some summary statistics are given:
    • 15
    • 17
    • 27
    • 31
    • 33
    • 39
    • 39
    • 45
    • 46
    • 48
    • 49
    • 51
    • 52
    • 59
    • 72
    • 72
    • 75
    • 77
    • 77
    • 83
    • 86
    • 88
    • 91
    • 99
    • 103
    • 112
    • 136
    • 139
    • 145
    • 145
    • 175
    • 193
    • 198
    • 213
    • 230
    • 256
    • 258
    • 260
    • 288
    • 289
    • 337
    • 344
    • 418
    • 424
    • 442
    • 464
    • 555
    • 593
    • 699
    • 768
    • mean: 191.1 reports
    • minimum: 15 reports
    • Q1: 52 reports
    • median: 107.5 reports
    • Q3: 260 reports
    • maximum: 768 reports
    1. Are any of the values outliers? Explain or show your reasoning.
    2. If there are any outliers, why do you think they might exist? Should they be included in an analysis of the data?
  2. The situations described here each have an outlier. For each situation, how would you determine if it is appropriate to keep or remove the outlier when analyzing the data? Discuss your reasoning with your partner.
    1. A number cube has sides labelled 1–6. After rolling 15 times, Tyler records his data:
      1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 5, 6, 20
    2. The dot plot represents the distribution of the number of siblings reported by a group of 20 people.
      Dot plot from 0 to 13 by 1’s. Number of siblings. Beginning at 0, number of dots above each increment is 3, 4, 6, 3, 2, 1, 0, 0, 0, 0, 0, 0, 1, 0.
    3. In a science class, 11 groups of students are synthesizing biodiesel. At the end of the experiment, each group recorded the mass in grams of the biodiesel they synthesized. The masses of biodiesel are
      0, 1.245, 1.292, 1.375, 1.383, 1.412, 1.435, 1.471, 1.482, 1.501, 1.532

Student Response

Student responses to this activity are available at one of our IM Certified Partners

Student Facing

Are you ready for more?

Look back at some of the numerical data you and your classmates collected in the first lesson of this unit.

  1. Are any of the values outliers? Explain or show your reasoning.

  2. If there are any outliers, why do you think they might exist? Should they be included in an analysis of the data?

Student Response

Student responses to this activity are available at one of our IM Certified Partners

Activity Synthesis

The purpose of this discussion is to highlight different reasons that outliers appear in data. For example, they could be data-entry or data collection errors, or they could be representative of the sample. The goal is to make sure that students understand that the inclusion of outliers in a data set needs to be evaluated in the context of the data. For the number cube rolls, it is clear that the value of 20 should not be used since it is impossible to roll on a number cube with sides labelled 1–6. For the other two scenarios, students should understand that a deeper investigation should be done to determine whether the outlier should be included, and they should be able to state circumstances for including or excluding the outlier in each context.

Here are some questions for discussion:

  • “Why is it important to analyze the source of outliers?” (To determine if there is a reason to exclude the data from an analysis.)
  • “What are reasons to keep an outlier in a data set?” (Just because a value is extremely large or small does not discount its reality. To be honest in the analysis, all valid data should be included.)
  • “What are reasons to remove an outlier from a data set?” (If there is an error in data collection or recording, the data may be faulty and it would not be honest to include data that does not fit the question asked.)
  • “What could be done about the 3 outliers for the college crime data to account for school size as the source of the outliers?” (The question could be changed to examine crime at schools with enrollment below some level. It may make better sense to look at “crime rate” like a measure of crimes per 100 students or something similar that takes into account the school size rather than just the number of crimes.)
  • “How do you know that a value is an outlier?” (If it is greater than the third quartile by more than one and a half times the interquartile range or less than the first quartile by the same amount.)
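The outlier check for the college property crime data can be confirmed with a short Python sketch. It uses the Q1 and Q3 values given in the task statement rather than recomputing them, and the variable names are illustrative.

```python
# Property crime reports for the 50 colleges, from the task statement.
reports = [
    15, 17, 27, 31, 33, 39, 39, 45, 46, 48,
    49, 51, 52, 59, 72, 72, 75, 77, 77, 83,
    86, 88, 91, 99, 103, 112, 136, 139, 145, 145,
    175, 193, 198, 213, 230, 256, 258, 260, 288, 289,
    337, 344, 418, 424, 442, 464, 555, 593, 699, 768,
]

q1, q3 = 52, 260              # quartiles given in the task statement
iqr = q3 - q1                 # 208
lower = q1 - 1.5 * iqr        # -260.0, so no value can be a low outlier
upper = q3 + 1.5 * iqr        # 572.0

outliers = [r for r in reports if r < lower or r > upper]
print(lower, upper)           # -260.0 572.0
print(outliers)               # [593, 699, 768]
```

The three values above the upper cutoff are the 3 outliers referred to in the discussion question about school size.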

Lesson Synthesis

The purpose of this discussion is to make sure that students know what outliers are, what to do with them, and how they impact measures of center and measures of variability. Here are some questions for discussion:

  • “What is an outlier?” (A data value that differs greatly from the other values in the data set. It can be defined in terms of the IQR and quartiles. If a data value is more than 1.5 times the IQR greater than the 3rd quartile or more than 1.5 times the IQR less than the 1st quartile, then it is considered an outlier.)
  • “Why are outliers important to notice in a data set?” (They can indicate an error in the data collection process or an interesting case to more closely study. They are not always representative of the whole sample. Their presence can disproportionally affect the values of the mean, MAD, and standard deviation.)
  • “How do outliers affect measures of center?” (Outliers can cause the mean to be much higher or lower than what appears to be typical depending on if the outlier is much greater or much less than the mean. They have less effect on the median.)
  • “How do outliers affect measures of variability?” (They cause the variability to be higher than it would be if the outliers were not present.)
  • “Why would you eliminate an outlier?” (They would be eliminated if they are an error or if they are not representative of the sample as a whole. It depends on the context of the problem and the data collection process.)

14.4: Cool-down - Expecting Outliers (5 minutes)

Cool-Down

Cool-downs for this lesson are available at one of our IM Certified Partners

Student Lesson Summary

Student Facing

In statistics, an outlier is a data value that is unusual in that it differs quite a bit from the other values in the data set.

Outliers occur in data sets for a variety of reasons including, but not limited to:

  • errors in the data that result from the data collection or data entry process
  • results in the data that represent unusual values that occur in the population

Outliers can reveal cases worth studying in detail or errors in the data collection process. In general, they should be included in any analysis done with the data.

A value is an outlier if it is

  • more than 1.5 times the interquartile range greater than Q3 (if \(x > \text{Q3 } + 1.5 \boldcdot \text{ IQR}\))
  • more than 1.5 times the interquartile range less than Q1 (if \(x < \text{Q1 } - 1.5 \boldcdot \text{ IQR}\))

In this box plot, the minimum and maximum values are both outliers, so the data set has at least two outliers.

Box plot

It is important to identify the source of outliers because outliers can impact measures of center and variability in significant ways. The box plot displays the resting heart rate, in beats per minute (bpm), of 50 athletes taken five minutes after a workout.

Box plot from 50 to 120 by 10’s. Heartbeats per minute. Whisker from 55 to 62. Box from 62 to 76 with vertical line at 70. Whisker from 76 to 112. Dotted line, labeled 1.5 times IQR, from 76 to 97.

Some summary statistics include:

  • mean: 69.78 bpm
  • standard deviation: 10.71 bpm
  • minimum: 55 bpm
  • Q1: 62 bpm
  • median: 70 bpm
  • Q3: 76 bpm
  • maximum: 112 bpm

It appears that the maximum value of 112 bpm may be an outlier. Since the interquartile range is 14 bpm (\(76 - 62 = 14\)) and \(\text{Q3 }+ 1.5 \boldcdot \text{ IQR } = 97\), we should label the maximum value as an outlier. Searching through the actual data set, it could be confirmed that this is the only outlier.

After reviewing the data collection process, it is discovered that the heart rate measurement of 112 bpm was taken one minute after a workout instead of five minutes after. The outlier should be deleted from the data set because it was not obtained under the right conditions.

Once the outlier is removed, the box plot and summary statistics are:

Box plot from 50 to 120 by 10’s. Heartbeats per minute. Whisker from 55 to 61. Box from 61 to 75.5 with vertical line at 70. Whisker from 75.5 to 85.
  • mean: 68.92 bpm
  • standard deviation: 8.9 bpm
  • minimum: 55 bpm
  • Q1: 61 bpm
  • median: 70 bpm
  • Q3: 75.5 bpm
  • maximum: 85 bpm

The mean decreased by 0.86 bpm and the median remained the same. The standard deviation decreased by 1.81 bpm which is about 17% of its previous value. Based on the standard deviation, the data set with the outlier removed shows much less variability than the original data set containing the outlier. Since the mean and standard deviation use all of the numerical values, removing one very large data point can affect these statistics in important ways.

The median remained the same after the removal of the outlier and the IQR increased slightly. These measures of center and variability are much more resistant to change than the mean and standard deviation. The median and IQR measure the middle of the data based on the number of values rather than the actual numerical values themselves, so the loss of a single value will not often have a great effect on these statistics.

The source of any possible errors should always be investigated. If the measurement of 112 beats per minute was found to be taken under the right conditions and merely included an athlete whose heart rate did not slow as much as the other athletes, it should not be deleted so that the data reflect the actual measurements. If the situation cannot be revisited to determine the source of the outlier, it should not be removed. To avoid tampering with the data and to report accurate results, data values should not be deleted unless they can be confirmed to be an error in the data collection or data entry process.
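The arithmetic in the heart rate example can be checked from the summary statistics alone. The following Python sketch uses only the values stated in this summary; the variable names are illustrative, and the original 50 measurements are not available here, so nothing is recomputed from raw data.

```python
# Summary statistics for the original 50 heart rates (bpm).
q1, q3 = 62, 76
minimum, maximum = 55, 112

iqr = q3 - q1                       # 14
upper = q3 + 1.5 * iqr              # 97.0
lower = q1 - 1.5 * iqr              # 41.0

print(maximum > upper)              # True: 112 bpm is an outlier
print(minimum < lower)              # False: 55 bpm is not an outlier

# Changes after removing the 112 bpm measurement.
mean_drop = round(69.78 - 68.92, 2)              # 0.86
sd_drop = round(10.71 - 8.9, 2)                  # 1.81
sd_drop_percent = round(100 * sd_drop / 10.71)   # 17
iqr_after = 75.5 - 61                            # 14.5, up from 14

print(mean_drop, sd_drop, sd_drop_percent, iqr_after)
```

Because the minimum of 55 bpm is above the lower cutoff of 41 bpm, no value in the data set can be an unusually low outlier.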