Lesson 3
Associations in Categorical Data
3.1: Cake or Pie (5 minutes)
Warm-up
The mathematical purpose of this activity is for students to interpret data in a two-way table using relative frequencies. Listen for students who mention relative frequencies and note how they calculate them (using row, column, or overall totals).
Launch
Arrange students in groups of 2. Give students quiet think time to answer the first question and think about the others. Ask partners to compare and discuss their answers. Follow with a whole-class discussion.
Student Facing
The table displays the dessert preference and dominant hand (left- or right-handed) for a sample of 300 people.
prefers cake  prefers pie  total  

left-handed  10  20  30 
right-handed  90  180  270 
total  100  200  300 
For each of the calculations, describe the interpretation of the percentage in terms of the situation.
 10% from \(\frac{10}{100} = 0.1\)
 67% from \(\frac{180}{270} \approx 0.67\)
 30% from \(\frac{90}{300} = 0.3\)
Student Response
For access, consult one of our IM Certified Partners.
Anticipated Misconceptions
Students may be confused as to how to interpret the table. Encourage students to use the table to look for the specific numbers mentioned in the questions. Then students can describe the larger group as well as the subgroup being considered in each respective question.
Activity Synthesis
The goal is to make sure students understand that relative frequencies can be calculated using the overall total, or the totals from the rows or the columns.
Ask students who were identified as mentioning relative frequency, “When you said relative frequency, what did you mean?” (I meant the frequency of a particular cell in the table occurring relative to the total. For example, 10% comes from 10 left-handed people who prefer cake out of the 100 total people who prefer cake.)
Here are some questions for discussion.
 “How does this activity relate to prior work with relative frequency tables?” (We did not make the table here, but we had to use the values that we could have found by creating relative frequency tables. For this activity, you would have needed three relative frequency tables.)
 “For the question about 67%, did you use the row totals or the column totals?” (The row totals.)
 “The overall total was 300. How many people would you expect to prefer cake and be left-handed if the overall total were 600? How does this connect to the concept of relative frequency?” (20. It is connected to relative frequency because 10 out of 300 is the same ratio as 20 out of 600.)
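For teachers who want to check the three warm-up percentages, here is a minimal Python sketch. It assumes the dessert table is stored as a nested dictionary; the variable names are illustrative and not part of the lesson. It shows that the same table gives different relative frequencies depending on whether the denominator is a column total, a row total, or the overall total.

```python
# Counts from the dessert table (rows: dominant hand, columns: dessert preference).
counts = {
    "left-handed": {"cake": 10, "pie": 20},
    "right-handed": {"cake": 90, "pie": 180},
}

row_totals = {hand: sum(cells.values()) for hand, cells in counts.items()}
column_totals = {
    dessert: sum(cells[dessert] for cells in counts.values())
    for dessert in ("cake", "pie")
}
overall_total = sum(row_totals.values())

# 10%: left-handed people who prefer cake, relative to the column total (all who prefer cake).
by_column = counts["left-handed"]["cake"] / column_totals["cake"]
# 67%: right-handed people who prefer pie, relative to the row total (all right-handed people).
by_row = counts["right-handed"]["pie"] / row_totals["right-handed"]
# 30%: right-handed people who prefer cake, relative to the overall total.
by_overall = counts["right-handed"]["cake"] / overall_total

print(f"{by_column:.0%}, {by_row:.0%}, {by_overall:.0%}")  # prints: 10%, 67%, 30%
```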
3.2: Associations in Categorical Data (15 minutes)
Activity
The mathematical purpose of this activity is to look for associations in categorical data. Since the subgroups represented by rows or columns are not always equally represented in the data, associations are often best recognized by considering row or column relative frequencies and comparing across the categories. Students should determine that there is likely not an association between the categories if the percentages are similar, or that there is evidence of an association between the variables when the percentages are very different.
Monitor for students who use the term association in their explanations. Identify students who struggle to decide whether to compare frequencies in the columns or the rows, and students who do not know how to find relative frequencies.
Making spreadsheet technology available gives students an opportunity to choose appropriate tools strategically (MP5).
Launch
Arrange students in groups of 2.
Ask students what it might mean for two variables to be associated. Provide these examples and ask students whether they think the variables might be associated:
 the color of a car’s seats and whether the owner returns to the dealership for maintenance or not (there is probably not an association)
 the color of a car’s seats and whether the owner has ever been burned by hot seats (there is probably an association here, since darker seats will probably get hotter than lighter-colored seats)
After students have had a chance to work through the first problem about coral health, pause the work to discuss what students noticed about possible associations. Ensure that the discussion concludes that when the column (or row) percentages are similar (like with the silicon dioxide concentration), there is probably no association between the different variables. When the percentages are not close (like with the nitrate concentration), there is likely an association between the variables.
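For reference during this pause, the following Python sketch computes the column relative frequencies for the nitrate counts given in the student task. The helper name `column_relative_frequencies` and the one-decimal rounding are assumptions made for illustration only.

```python
# Counts from the coral table: counts[health][nitrate level].
counts = {
    "healthy": {"low": 20, "high": 5},
    "unhealthy": {"low": 8, "high": 22},
}

def column_relative_frequencies(table):
    """Divide each cell by its column total; return percentages rounded to 1 decimal."""
    columns = list(next(iter(table.values())))
    totals = {c: sum(row[c] for row in table.values()) for c in columns}
    return {
        label: {c: round(100 * row[c] / totals[c], 1) for c in columns}
        for label, row in table.items()
    }

nitrate = column_relative_frequencies(counts)
for label, row in nitrate.items():
    print(label, row)
# healthy {'low': 71.4, 'high': 18.5}
# unhealthy {'low': 28.6, 'high': 81.5}
```

The healthy percentages differ by more than 50 percentage points between the two columns, while the silicon dioxide columns (44% and 46%) differ by only 2.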
Give students quiet think time to work through the remaining problems. Follow with a whole-class discussion.
Design Principle(s): Support sense-making
Supports accessibility for: Attention; Social-emotional skills
Student Facing

The two-way table displays data about 55 different locations. Scientists have a list of possible chemicals that may influence the health of the coral. They first look at how nitrate concentration might be related to coral health. The table displays the health of the coral (healthy or unhealthy) and nitrate concentration (low or high).
low nitrate concentration  high nitrate concentration  total

healthy  20  5  25
unhealthy  8  22  30
total  28  27  55
Complete the two-way relative frequency table for the data in the two-way table in which the relative frequencies are based on the total for each column.
low nitrate concentration  high nitrate concentration

healthy  ___  ___
unhealthy  ___  ___
total  100%  100%
When there is a low nitrate concentration, which had a higher relative frequency, healthy or unhealthy coral?

When there is a high nitrate concentration, is there a higher relative frequency of healthy or unhealthy coral?

Based on this data, is there a possible association between coral health and the level of nitrate concentration? Explain your reasoning.

The scientists next look at how silicon dioxide concentration might be related to coral health. The relative frequencies based on the total for each column are shown in the table. Based on this data, is there a possible association between coral health and the level of silicon dioxide concentration? Explain your reasoning.
low silicon dioxide concentration  high silicon dioxide concentration

healthy  44%  46%
unhealthy  56%  54%
total  100%  100%

 Jada surveyed 300 people from various age groups about their shoe preference. The two-way table summarizes the results of the survey.
prefers sneakers without laces  prefers sneakers with laces  prefers shoes that are not sneakers  total

4–10 years old  21  12  3  36
11–17 years old  21  48  39  108
18–24 years old  15  54  87  156
total  57  114  129  300
Jada concludes that there is a possible association between age and shoe preference. Is Jada’s conclusion reasonable? Explain your reasoning.
 The two-way table summarizes data on writing utensil preference and the dominant hand for a sample of 100 people.
left-handed  right-handed  total

prefers pen  7  82  89
prefers pencil  6  5  11
total  13  87  100
Is there a possible association between dominant hand and writing utensil preference? Explain your reasoning.
Student Facing
Are you ready for more?
The incomplete two-way table displays the results of a survey about the type of sports medicine treatment and recovery time for 33 student athletes who visited the athletic trainer.
returned to playing in less than 2 days  returned to playing in 2 or more days

treated with ice  8  4
treated with heat  ___  ___

What 2 values could you use to complete the two-way table to show that there is an association between returning to playing in less than 2 days and the treatment (ice or heat)? Explain your reasoning.

What 2 values could you use to complete the two-way table to show that there is no association between returning to playing in less than 2 days and the treatment (ice or heat)? Explain your reasoning.

Which 2 values were easier to choose, the 2 values showing an association, or the 2 values showing no association? Explain your reasoning.
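One possible way to reason about these questions is sketched below in Python. It assumes that all 33 athletes were treated with either ice or heat, so the heat row must total 21. The "association" values are an arbitrary example chosen for illustration, not the only valid answer.

```python
from fractions import Fraction

total_athletes = 33
ice = {"less than 2 days": 8, "2 or more days": 4}

# The heat row must account for everyone not treated with ice.
heat_total = total_athletes - sum(ice.values())  # 21

# No association: heat-treated athletes return quickly at the same rate as ice-treated ones.
quick_rate = Fraction(ice["less than 2 days"], sum(ice.values()))  # 2/3
heat_no_association = {
    "less than 2 days": int(quick_rate * heat_total),
    "2 or more days": int((1 - quick_rate) * heat_total),
}

# Association: any split whose rate is far from 2/3 works; this one is an arbitrary example.
heat_association = {"less than 2 days": 5, "2 or more days": heat_total - 5}

print(heat_total, heat_no_association, heat_association)
```

Under this assumption, only one pair of values (14 and 7) matches the ice row's rate exactly, while many pairs show an association.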
Anticipated Misconceptions
Students may not know how to determine whether there is an association between the variables. Ask students to describe whether coral at locations with a low nitrate concentration tends to be healthy or unhealthy, and then compare that to a description of coral at locations with a high nitrate concentration. Then ask students whether they think it matters whether the concentration is high or low.
Activity Synthesis
The purpose of this discussion is for students to gain a better understanding of when there is an association in the data. Discuss how calculating the relative frequency for each row or each column can help one get a better sense of the data.
Here are some questions for discussion:
 “What does it mean for two categorical variables to have an association?” (It means that the variables are statistically related to each other.)
 “For the data in Jada’s survey, the same number of people preferred sneakers without laces in the two youngest age groups. Why doesn’t this matter when deciding if there is an association between age group and shoe preference?” (It does not matter because we are looking for the relative frequency. Sneakers without laces represents the most frequent condition for the youngest age group and the least frequent condition for the second-youngest age group.)
It may be worth noting that silicon dioxide is the main chemical component in sand, so it should make sense that it has little impact on coral health.
If time permits, you may want to discuss associations in Jada’s data using different two-way relative frequency tables.
The first table shows row relative frequencies. It can help show that most of the younger kids (ages 4–10) prefer sneakers without laces, while many of the young adults (ages 18–24) prefer shoes that are not sneakers. This indicates that shoe preference is likely associated with age group.
prefers sneakers without laces  prefers sneakers with laces  prefers shoes that are not sneakers  total  

4–10 years old  58.33%  33.33%  8.33%  100% 
11–17 years old  19.44%  44.44%  36.11%  100% 
18–24 years old  9.62%  34.62%  55.77%  100% 
In the next table, column relative frequencies are shown. It can help show that, among people who prefer shoes that are not sneakers, a very small percentage are younger kids. Conversely, among people who prefer sneakers without laces, about a third are younger kids. Since there are such large differences in the percentages, there is likely an association between age group and shoe preference.
prefers sneakers without laces  prefers sneakers with laces  prefers shoes that are not sneakers  

4–10 years old  36.84%  10.53%  2.33% 
11–17 years old  36.84%  42.11%  30.23% 
18–24 years old  26.32%  47.37%  67.44% 
total  100%  100%  100% 
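If spreadsheet technology is not at hand, both of these tables can be reproduced with a short Python sketch such as the one below. The list layout and the helper `percent` are illustrative assumptions; the counts are those from Jada's survey.

```python
ages = ["4-10", "11-17", "18-24"]
shoes = ["sneakers without laces", "sneakers with laces", "not sneakers"]
counts = [
    [21, 12, 3],    # 4-10 years old
    [21, 48, 39],   # 11-17 years old
    [15, 54, 87],   # 18-24 years old
]

def percent(part, whole):
    """Relative frequency as a percentage, rounded to 2 decimal places."""
    return round(100 * part / whole, 2)

row_totals = [sum(row) for row in counts]
column_totals = [sum(col) for col in zip(*counts)]

# Row relative frequencies: each cell divided by its age-group total.
row_table = [[percent(cell, row_totals[i]) for cell in row]
             for i, row in enumerate(counts)]
# Column relative frequencies: each cell divided by its shoe-preference total.
column_table = [[percent(cell, column_totals[j]) for j, cell in enumerate(row)]
                for row in counts]

for age, row in zip(ages, row_table):
    print(age, row)
for age, row in zip(ages, column_table):
    print(age, row)
```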
3.3: Associating Your Own Variables (15 minutes)
Activity
The mathematical purpose of this activity is for students to understand what two-way tables might look like when two categorical variables are possibly associated and when they are not. Students invent one pair of variables that they expect to have an association and another pair that they expect to have none. They then prepare a two-way table of invented data for each pair that shows evidence of the association or its absence, and create a display of their information for the class to see.
Notice groups that create displays that communicate their mathematical thinking clearly, contain an error that would be instructive to discuss, or organize the information in a way that is useful for all to see.
For students struggling to think of examples, these are some suggestions:
Associated categories:
 age groups and price of a movie ticket
 region of the country (North, West, or South) and whether the person enjoys skiing
 day of the week and wake-up time (before 7 a.m. or after 7 a.m.)
Unassociated categories:
 eye color and whether or not you sing in the chorus
 saltwater or freshwater fish tank and decorations in the tank or no decorations in the tank
 favorite sport and whether or not you eat broccoli
Making spreadsheet technology available gives students an opportunity to choose appropriate tools strategically (MP5).
Launch
Arrange students in groups of 2–4. Provide each group with tools for creating a visual display.
Discuss any expectations for the group presentation. For example, each group member might be assigned a specific role for the presentation.
Supports accessibility for: Conceptual processing; Memory
Student Facing
 Work with your group to identify a pair of categorical variables you think might be associated and another pair you think would not be associated.
 Imagine your group collected data for each pair of categorical variables. Create a two-way table that could represent each set of data. Invent some data with 100 total values to complete each table. Remember that one table shows a possible association, and the other table shows no association.
 Explain or show why there appears to be an association for the first pair of variables and why there appears to be no association for the other pair of variables.
 Prepare a display of your work to share.
Anticipated Misconceptions
Students may struggle with knowing what numbers to use in their made-up data set. As students create their two-way tables, ensure that they keep the association in mind. Encourage students to fill in the totals for the rows (or columns) first, and then adjust the numbers in the cells to show a clear difference between the rows (or columns).
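The "totals first, then cells" strategy can be sketched in Python as follows. Everything here is hypothetical: the group labels, the row totals of 40 and 60, and the chosen rates are invented purely to illustrate the approach with 100 total values.

```python
def build_table(row_totals, first_column_rates):
    """Split each row total into two cells using that row's rate for the first column."""
    table = {}
    for label, total in row_totals.items():
        first = round(total * first_column_rates[label])
        table[label] = [first, total - first]
    return table

# Hypothetical invented data: 100 people, 40 in one group and 60 in the other.
row_totals = {"group A": 40, "group B": 60}

# Very different rates across the rows suggest an association.
associated = build_table(row_totals, {"group A": 0.75, "group B": 0.25})
# Equal rates across the rows suggest no association.
not_associated = build_table(row_totals, {"group A": 0.5, "group B": 0.5})

print(associated)      # {'group A': [30, 10], 'group B': [15, 45]}
print(not_associated)  # {'group A': [20, 20], 'group B': [30, 30]}
```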
Activity Synthesis
The goal of the discussion is to make sure students understand when there is a possible association and when there is no association in categorical data.
Invite each group to present their chosen variable pairs along with the display they created. After each group presents, discuss how the group created the data for each two-way table and how others can recognize if there is an association present or not. If time permits, ask questions such as:
 “Was one table easier to create than the other? Why?” (The data with no association was more difficult to create, because I had to think about getting the relative frequencies similar.)
 “What insight into two-way tables, relative frequencies, or associations in categorical data did you gain as a result of doing this task?” (First, this really got me thinking about categorical data, because I kept thinking about numerical data and that did not work. Second, it really helped me understand why we needed relative frequencies and not just the frequencies from the two-way table. The relative frequencies allowed me to compare data in the rows to know whether or not they were associated.)
Design Principle(s): Optimize output (for explanation)
Lesson Synthesis
The goal of this discussion is to make sure students know how to decide if there is an association between two categorical variables.
 “How do you know whether or not two variables are associated?” (First, you create a row or column relative frequency table. Second, you look for similarities in the columns or rows. If the relative frequencies in the rows or columns are similar, then there is probably no association. If they are very different, there is a possible association.)
 “What does it mean when we say two variables are associated? Provide an example.” (It means the variables are statistically related. For example, if we compared handedness (left or right) to messiness (neat or messy), we might find that left- and right-handed people are equally messy, and there would be no association. If we found out that the percentage who are messy was much greater for right-handed people than for left-handed people, then we would have evidence of an association between handedness and messiness.)
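The two-step procedure described in the first response can be sketched in Python as below. The lesson does not define how different percentages must be to count as "very different", so the 10-percentage-point `threshold` is an arbitrary assumption for illustration, as are the function names.

```python
def row_percentages(table):
    """Convert each row of counts to percentages of that row's total."""
    return {
        label: [100 * cell / sum(row) for cell in row]
        for label, row in table.items()
    }

def largest_gap(table):
    """Largest difference, in percentage points, between rows within any one column."""
    rows = list(row_percentages(table).values())
    return max(max(col) - min(col) for col in zip(*rows))

def possible_association(table, threshold=10):
    # threshold is an arbitrary illustrative cutoff, not a value from the lesson.
    return largest_gap(table) > threshold

# Dessert counts from activity 3.1 and pen/pencil counts from activity 3.2,
# both arranged with dominant hand as the rows.
dessert = {"left-handed": [10, 20], "right-handed": [90, 180]}
utensil = {"left-handed": [7, 6], "right-handed": [82, 5]}

print(possible_association(dessert))  # False: both rows are about 33% and 67%
print(possible_association(utensil))  # True: about 54% versus 94% prefer pen
```

A fixed cutoff is only a rough stand-in for the judgment students are asked to make; the lesson's wording ("similar" versus "very different") is deliberately informal.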
3.4: Cool-down - Graduate Debt (5 minutes)
Cool-down
For access, consult one of our IM Certified Partners.
Student Lesson Summary
Student Facing
An association between two variables means that the two variables are statistically related to each other. For example, we might expect that ice cream sales would be higher on sunny days than on snowy days. If sales were higher on sunny days than on snowy days, then we would say that there is a possible association between ice cream sales and whether it is sunny or snowy. When dealing with categorical variables, row or column relative frequency tables are often used to look for associations in the data.
Here is a two-way table displaying ice cream sales and weather conditions for 41 days for a particular creamery.
sunny day  snowy day  total  

sold fewer than 50 cones  8  7  15 
sold 50 cones or more  22  4  26 
total  30  11  41 
Noticing a pattern in the raw data can be difficult, especially when the row or column totals are not the same for different categories, so the data should be converted into a row or column relative frequency table to better compare the categories. For the creamery, notice that the number of days with low sales is about the same for the two weather types, which contradicts our intuition. In this case, it makes sense to look at the percentage of days that sold well under each weather condition separately. That is, consider the column relative frequencies.
sunny day  snowy day  

sold fewer than 50 cones  27%  64% 
sold 50 cones or more  73%  36% 
total  100%  100% 
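Each percentage in this table comes from dividing a count by the total for its column (30 sunny days or 11 snowy days), rounded to the nearest percent:
 27% from \(\frac{8}{30} \approx 0.27\)
 64% from \(\frac{7}{11} \approx 0.64\)
 73% from \(\frac{22}{30} \approx 0.73\)
 36% from \(\frac{4}{11} \approx 0.36\)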
From the column relative frequency table, it is clear that most of the sunny days resulted in sales of at least 50 cones (73%), while most of the snowy days resulted in fewer than 50 cones sold (64%). Because these percentages are quite different, this suggests there is an association between the weather condition and the number of cone sales. A bakery might wonder if the weather conditions impact their muffin sales as well.
sunny day  snowy day  

sold fewer than 50 muffins  32%  35% 
sold 50 muffins or more  68%  65% 
total  100%  100% 
For the bakery, it seems there is not an association between weather conditions and muffin sales, since the percentages of days with low sales are very similar under the different weather conditions, and the percentages are also close for days when they sold many muffins.
Using row or column relative frequency tables helps organize data so that columns (or rows) can be easily compared between different categories for a variable. This comparison can be accomplished using a two-way table or a two-way relative frequency table, but it requires you to account for the differences in the number of data values in a given category.