# Lesson 3

Associations in Categorical Data

## 3.1: Cake or Pie (5 minutes)

### Warm-up

The mathematical purpose of this activity is for students to interpret data in a two-way table using relative frequencies. Listen for students who mention relative frequencies and note how they calculate them (using row, column, or overall totals).

### Launch

Arrange students in groups of 2. Give students quiet think time to answer the first question and think about the others. Ask partners to compare and discuss their answers. Follow with a whole-class discussion.

### Student Facing

The table displays the dessert preference and dominant hand (left- or right-handed) for a sample of 300 people.

| | prefers cake | prefers pie | total |
|---|---|---|---|
| left-handed | 10 | 20 | 30 |
| right-handed | 90 | 180 | 270 |
| total | 100 | 200 | 300 |

For each of the calculations, describe the interpretation of the percentage in terms of the situation.

1. 10% from $$\frac{10}{100} = 0.1$$
2. 67% from $$\frac{180}{270} \approx 0.67$$
3. 30% from $$\frac{90}{300} = 0.3$$

### Student Response

For access, consult one of our IM Certified Partners.

### Anticipated Misconceptions

Students may be confused as to how to interpret the table. Encourage students to use the table to look for the specific numbers mentioned in the questions. Then students can describe the larger group as well as the subgroup being considered in each respective question.

### Activity Synthesis

The goal is to make sure students understand that relative frequencies can be calculated using the overall total, or the totals from the rows or the columns.

Ask students who were identified as mentioning relative frequency, “When you said relative frequency, what did you mean?” (I meant the frequency of a particular cell in the table occurring relative to the total. For example, 10% comes from 10 left-handed people who prefer cake out of the 100 total people who prefer cake.)

Here are some questions for discussion.

• “How does this activity relate to prior work with relative frequency tables?” (We did not make the table here, but we had to use the values that we could have found by creating relative frequency tables. For this activity, you would have needed three relative frequency tables.)
• “For the question about 67%, did you use the row totals or the column totals?” (The row totals.)
• “The overall total was 300. How many people would you expect to prefer cake and be left-handed if the overall total were 600? How does this connect to the concept of relative frequency?” (20. It is connected to relative frequency because 10 out of 300 is the same ratio as 20 out of 600.)
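The three warm-up percentages each use a different total. A minimal Python sketch of that arithmetic follows; the variable names are illustrative and not part of the lesson materials.

```python
# Counts from the warm-up table (rows: dominant hand, columns: dessert preference).
counts = {
    "left-handed": {"prefers cake": 10, "prefers pie": 20},
    "right-handed": {"prefers cake": 90, "prefers pie": 180},
}

# Totals for each row, each column, and the whole table.
row_totals = {hand: sum(cells.values()) for hand, cells in counts.items()}
column_totals = {}
for cells in counts.values():
    for dessert, n in cells.items():
        column_totals[dessert] = column_totals.get(dessert, 0) + n
overall_total = sum(row_totals.values())

# 1. Column total: left-handed people among those who prefer cake.
cake_left = counts["left-handed"]["prefers cake"] / column_totals["prefers cake"]
# 2. Row total: people who prefer pie among right-handed people.
right_pie = counts["right-handed"]["prefers pie"] / row_totals["right-handed"]
# 3. Overall total: right-handed people who prefer cake among all 300 people.
right_cake_overall = counts["right-handed"]["prefers cake"] / overall_total

print(f"{cake_left:.0%}, {right_pie:.0%}, {right_cake_overall:.0%}")  # 10%, 67%, 30%

# Same ratio with a sample twice as large: 10 out of 300 scales to 20 out of 600.
scaled = counts["left-handed"]["prefers cake"] * 600 / overall_total
print(scaled)  # 20.0
```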

## 3.2: Associations in Categorical Data (15 minutes)

### Activity

The mathematical purpose of this activity is to look for associations in categorical data. Since the subgroups represented by rows or columns are not always equally represented in the data, associations are often best recognized by considering row or column relative frequencies and comparing across the categories. Students should determine that there is likely no association between the variables if the percentages are similar, and that there is evidence of an association when the percentages are very different.

Monitor for students who use the term association in their explanations. Identify students who struggle with knowing whether to compare frequencies in the columns or the rows, as well as students who do not know how to find relative frequencies.

Making spreadsheet technology available gives students an opportunity to choose appropriate tools strategically (MP5).

### Launch

Arrange students in groups of 2.

Ask students what it might mean for two variables to be associated. Provide these examples and ask students whether they think the variables might be associated:

• the color of a car’s seats and whether the owner returns to the dealership for maintenance or not (there is probably not an association)
• the color of a car’s seats and whether the owner is ever burned from the seats being hot (there is probably an association here, since darker seats will probably get hotter than lighter-colored seats)

After students have had a chance to work through the first problem about coral health, pause the work to discuss what students noticed about possible associations. Ensure that the discussion concludes that when the column (or row) percentages are similar (like with the silicon dioxide concentration), there is probably no association between the different variables. When the percentages are not close (like with the nitrate concentration), there is likely an association between the variables.
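For reference during this pause, here is a rough Python sketch of the column relative frequencies for the nitrate table. It compares the gap between columns with the gap in the silicon dioxide table given in the task. The helper name `column_relative_frequencies` and the shortened column labels are assumptions made for illustration.

```python
def column_relative_frequencies(table):
    """Return each cell as a fraction of its column total.

    `table` maps a row label to a dict of {column label: count}.
    """
    column_totals = {}
    for cells in table.values():
        for column, n in cells.items():
            column_totals[column] = column_totals.get(column, 0) + n
    return {
        row: {column: n / column_totals[column] for column, n in cells.items()}
        for row, cells in table.items()
    }

# Counts from the coral health table in the task.
nitrate = {
    "healthy": {"low nitrate": 20, "high nitrate": 5},
    "unhealthy": {"low nitrate": 8, "high nitrate": 22},
}
nitrate_rf = column_relative_frequencies(nitrate)
for row, cells in nitrate_rf.items():
    print(row, {column: f"{value:.0%}" for column, value in cells.items()})

# Gap between the two columns for healthy coral, in percentage points.
nitrate_gap = 100 * abs(
    nitrate_rf["healthy"]["low nitrate"] - nitrate_rf["healthy"]["high nitrate"]
)
silicon_gap = abs(44 - 46)  # percentages given directly in the task
print(round(nitrate_gap), silicon_gap)  # 53 2
```

The large gap for nitrate and the small gap for silicon dioxide match the conclusions the discussion is meant to reach.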

Give students quiet think time to work through the remaining problems. Follow with a whole-class discussion.

Conversing, Writing: MLR2 Collect and Display. Before students begin writing a response to the last part of the first question, invite them to discuss their thinking with a partner. Listen for and collect vocabulary and phrases students use to describe whether two variables are associated. Display words and phrases such as “the variables are related” or “connected” for all to see, and then encourage students to use this language in their written responses. This will help students read and use mathematical language during their partner and whole-group discussions.
Design Principle(s): Support sense-making

Engagement: Develop Effort and Persistence. Break the class into small discussion groups and invite a representative from each group to report back to the whole class. Instead of waiting for whole-class discussion until the end of the activity, integrate supports by pausing after each of the three questions and allowing representatives to report out. Use these intervals to clarify misconceptions before students continue to their next question.
Supports accessibility for: Attention; Social-emotional skills

### Student Facing

1. The two-way table displays data about 55 different locations. Scientists have a list of possible chemicals that may influence the health of the coral. They first look at how nitrate concentration might be related to coral health. The table displays the health of the coral (healthy or unhealthy) and nitrate concentration (low or high).

    | | low nitrate concentration | high nitrate concentration | total |
    |---|---|---|---|
    | healthy | 20 | 5 | 25 |
    | unhealthy | 8 | 22 | 30 |
    | total | 28 | 27 | 55 |

    1. Complete the two-way relative frequency table for the data in the two-way table in which the relative frequencies are based on the total for each column.

        | | low nitrate concentration | high nitrate concentration |
        |---|---|---|
        | healthy | | |
        | unhealthy | | |
        | total | 100% | 100% |

    2. When there is a low nitrate concentration, which had a higher relative frequency, healthy or unhealthy coral?

    3. When there is a high nitrate concentration, is there a higher relative frequency of healthy or unhealthy coral?

    4. Based on this data, is there a possible association between coral health and the level of nitrate concentration? Explain your reasoning.

    5. The scientists next look at how silicon dioxide concentration might be related to coral health. The relative frequencies based on the total for each column are shown in the table. Based on this data, is there a possible association between coral health and the level of silicon dioxide concentration? Explain your reasoning.

        | | low silicon dioxide concentration | high silicon dioxide concentration |
        |---|---|---|
        | healthy | 44% | 46% |
        | unhealthy | 56% | 54% |
        | total | 100% | 100% |

2. Jada surveyed 300 people from various age groups about their shoe preference. The two-way table summarizes the results of the survey.

    | | prefers sneakers without laces | prefers sneakers with laces | prefers shoes that are not sneakers | total |
    |---|---|---|---|---|
    | 4–10 years old | 21 | 12 | 3 | 36 |
    | 11–17 years old | 21 | 48 | 39 | 108 |
    | 18–24 years old | 15 | 54 | 87 | 156 |
    | total | 57 | 114 | 129 | 300 |

    Jada concludes that there is a possible association between age and shoe preference. Is Jada’s conclusion reasonable? Explain your reasoning.

3. The two-way table summarizes data on writing utensil preference and the dominant hand for a sample of 100 people.

    | | left-handed | right-handed | total |
    |---|---|---|---|
    | prefers pen | 7 | 82 | 89 |
    | prefers pencil | 6 | 5 | 11 |
    | total | 13 | 87 | 100 |

    Is there a possible association between dominant hand and writing utensil preference? Explain your reasoning.


### Student Facing

#### Are you ready for more?

The incomplete two-way table displays the results of a survey about the type of sports medicine treatment and recovery time for 33 student athletes who visited the athletic trainer.

| | returned to playing in less than 2 days | returned to playing in 2 or more days |
|---|---|---|
| treated with ice | 8 | 4 |
| treated with heat | | |

1. What 2 values could you use to complete the two-way table to show that there is an association between returning to playing in less than 2 days and the treatment (ice or heat)? Explain your reasoning.

2. What 2 values could you use to complete the two-way table to show that there is no association between returning to playing in less than 2 days and the treatment (ice or heat)? Explain your reasoning.

3. Which 2 values were easier to choose, the 2 values showing an association, or the 2 values showing no association? Explain your reasoning.


### Anticipated Misconceptions

Students may not know how to determine whether there is an association between the variables. Ask students to describe whether coral at locations with low nitrate concentration tends to be healthy or unhealthy, and then compare that with a description of coral at locations with high nitrate concentration. Then ask students whether they think it matters whether the concentration is high or low.

### Activity Synthesis

The purpose of this discussion is for students to gain a better understanding of when there is an association in the data. Discuss how calculating the relative frequency for each row or each column can help one get a better sense of the data.

Here are some questions for discussion:

• “What does it mean for two categorical variables to have an association?” (It means that the variables are statistically related to each other.)
• “For the data in Jada’s survey, the same number of people preferred sneakers without laces in the two youngest age groups. Why doesn’t this matter when deciding if there is an association between age group and shoe preference?” (It does not matter because we are looking for the relative frequency. Sneakers without laces represents the most frequent condition for the youngest age group and the least frequent condition for the second-youngest age group.)

It may be worth noting that silicon dioxide is the main chemical component in sand, so it should make sense that it has little impact on coral health.

If time permits, you may want to discuss associations in Jada’s data using different two-way relative frequency tables.

The first table shows row relative frequencies. It can help show that most of the younger kids (ages 4–10) prefer sneakers without laces, while many of the young adults (ages 18–24) prefer shoes that are not sneakers. This indicates that shoe preference is likely associated with age group.

| | prefers sneakers without laces | prefers sneakers with laces | prefers shoes that are not sneakers | total |
|---|---|---|---|---|
| 4–10 years old | 58.33% | 33.33% | 8.33% | 100% |
| 11–17 years old | 19.44% | 44.44% | 36.11% | 100% |
| 18–24 years old | 9.62% | 34.62% | 55.77% | 100% |

In the next table, column relative frequencies are shown. It can help show that, among people who prefer shoes that are not sneakers, a very small percentage are younger kids. Conversely, among people who prefer sneakers without laces, about a third are younger kids. Since there are such large differences in the percentages, there is likely an association between age group and shoe preference.

| | prefers sneakers without laces | prefers sneakers with laces | prefers shoes that are not sneakers |
|---|---|---|---|
| 4–10 years old | 36.84% | 10.53% | 2.33% |
| 11–17 years old | 36.84% | 42.11% | 30.23% |
| 18–24 years old | 26.32% | 47.37% | 67.44% |
| total | 100% | 100% | 100% |
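If spreadsheet or other technology is available, both relative frequency tables can be reproduced from Jada's counts. The following is one possible Python sketch, using plain lists rather than any particular data library. The list layout and names are assumptions made for illustration.

```python
# Jada's survey counts. Rows are age groups, columns are shoe preferences.
ages = ["4–10 years old", "11–17 years old", "18–24 years old"]
shoes = ["sneakers without laces", "sneakers with laces", "not sneakers"]
counts = [
    [21, 12, 3],
    [21, 48, 39],
    [15, 54, 87],
]

row_totals = [sum(row) for row in counts]                  # [36, 108, 156]
column_totals = [sum(column) for column in zip(*counts)]   # [57, 114, 129]

# Row relative frequencies: each cell as a percentage of its row total.
row_percents = [
    [round(100 * n / total, 2) for n in row]
    for row, total in zip(counts, row_totals)
]

# Column relative frequencies: each cell as a percentage of its column total.
column_percents = [
    [round(100 * n / column_totals[j], 2) for j, n in enumerate(row)]
    for row in counts
]

for age, row in zip(ages, row_percents):
    print(age, row)
for age, row in zip(ages, column_percents):
    print(age, row)
```

Note that the two youngest age groups share the same count (21) for sneakers without laces, yet their row percentages differ widely because the row totals differ.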

## 3.3: Associating Your Own Variables (15 minutes)

### Activity

The mathematical purpose of this activity is for students to understand what two-way tables might look like when two categorical variables are possibly associated and when they are not. Students invent one pair of variables that they expect to be associated and another pair that they expect not to be associated. They then prepare a two-way table of invented data for each pair that shows evidence of an association or of no association. Finally, students create a display of their information for the class to see.

Notice groups that create displays that communicate their mathematical thinking clearly, contain an error that would be instructive to discuss, or organize the information in a way that is useful for all to see.

For students struggling to think of examples, these are some suggestions:

Associated categories:

• age groups and price of a movie ticket
• region of the country (North, West, or South) and whether the person enjoys skiing
• day of the week and wake-up time (before 7 a.m. or after 7 a.m.)

Unassociated categories:

• eye color and whether or not you sing in the chorus
• saltwater or freshwater fish tank and decorations in the tank or no decorations in the tank
• favorite sport and whether or not you eat broccoli

Making spreadsheet technology available gives students an opportunity to choose appropriate tools strategically (MP5).

### Launch

Arrange students in groups of 2–4. Provide each group with tools for creating a visual display.

Discuss any expectations for the group presentation. For example, each group member might be assigned a specific role for the presentation.

Representation: Internalize Comprehension. Begin the activity with concrete or familiar contexts. Invite students to list pairs of categorical variables they have already studied. Invite them to use storytelling and examples relevant to their own experiences in the group presentation. For instance, if they chose to create a table on the association between day and wake-up time, they may also include some personal anecdotes as to why Monday is associated with a later wake-up time. Encouraging them to explain the data in this personal way will support them in using approaches that create opportunities for sense-making and help them more fluently manipulate tables.
Supports accessibility for: Conceptual processing; Memory

### Student Facing

1. Work with your group to identify a pair of categorical variables you think might be associated and another pair you think would not be associated.
2. Imagine your group collected data for each pair of categorical variables. Create a two-way table that could represent each set of data. Invent some data with 100 total values to complete each table. Remember that one table shows a possible association, and the other table shows no association.
3. Explain or show why there appears to be an association for the first pair of variables and why there appears to be no association for the other pair of variables.
4. Prepare a display of your work to share.


### Anticipated Misconceptions

Students may struggle with knowing what numbers to use in their made-up data set. As students create their two-way tables, ensure that they keep the association in mind. Encourage students to fill in the totals for the rows (or columns) first, and then adjust the numbers in the cells to show a clear difference between the rows (or columns).
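The totals-first strategy can be sketched in Python. This is only one way to invent data, and it rests on an assumption not stated in the lesson: a table in which every cell equals (row total × column total) ÷ overall total has identical relative frequencies in every row, so it shows no association. The function names and the example totals are hypothetical.

```python
def no_association_table(row_totals, column_totals):
    """Fill each cell with row total * column total / overall total.

    Every row then has the same relative frequencies, so the table shows
    no association. Both lists of totals must have the same sum.
    """
    overall = sum(row_totals)
    if overall != sum(column_totals):
        raise ValueError("row totals and column totals must have the same sum")
    return [[r * c / overall for c in column_totals] for r in row_totals]


def shift(table, k):
    """Move k values along the diagonal of a 2x2 table, keeping all totals fixed."""
    (a, b), (c, d) = table
    return [[a + k, b - k], [c - k, d + k]]


# Hypothetical invented totals for 100 values (2 rows, 2 columns).
row_totals = [40, 60]
column_totals = [30, 70]

flat = no_association_table(row_totals, column_totals)
print(flat)        # [[12.0, 28.0], [18.0, 42.0]] -> 30% and 70% in both rows

associated = shift(flat, 10)
print(associated)  # [[22.0, 18.0], [8.0, 52.0]] -> 55% vs. about 13% in the first column
```

Totals should be chosen so that the cells come out as whole numbers; otherwise students will need to round, and the relative frequencies will be similar rather than identical.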

### Activity Synthesis

The goal of the discussion is to make sure students understand when there is a possible association and when there is no association in categorical data.

Invite each group to present their chosen variable pairs along with the display they created. After each group presents, discuss how the group created the data for each two-way table and how others can recognize if there is an association present or not. If time permits, ask questions such as:

• “Was one table easier to create than the other? Why?” (The data with no association was more difficult to create, because I had to think about getting the relative frequencies similar.)
• “What insight into two-way tables, relative frequencies, or associations in categorical data did you gain as a result of doing this task?” (First, this really got me thinking about categorical data, because I kept thinking about numerical data and that did not work. Second, it really helped me understand why we needed relative frequencies and not just the frequencies from the two-way table. The relative frequencies allowed me to compare data in the rows to know whether or not they were associated.)

Speaking, Representing: MLR8 Discussion Supports. Give students additional time to make sure that everyone in their group can explain their visual display and the relationships between the variables represented. Invite groups to rehearse what they will say when they share with the whole class. Rehearsing provides students with additional opportunities to speak and clarify their thinking, and will improve the quality of explanations shared during the whole-class discussion.
Design Principle(s): Optimize output (for explanation)

## Lesson Synthesis


The goal of this discussion is to make sure students know how to decide if there is an association between two categorical variables.

• “How do you know whether or not two variables are associated?” (First, you create a row or column relative frequency table. Second, you look for similarities in the columns or rows. If the relative frequencies in the rows or columns are similar, then there is probably no association. If they are very different, there is a possible association.)
• “What does it mean when we say two variables are associated? Provide an example.” (It means the variables are statistically related. For example, if we compared handedness (left or right) to messiness (neat or messy), we might find that left- and right-handed people are equally messy, and there would be no association. If we found out that the percentage who are messy was much greater for right-handed people than for left-handed people, then we would have evidence of an association between handedness and messiness.)
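The two-step procedure in the first response can be sketched in Python using tables from this lesson. The lesson does not define how different the percentages must be, so the 10-percentage-point cutoff below is a hypothetical choice for illustration only, as are the function names.

```python
def largest_gap(table):
    """Largest difference, in percentage points, between the column
    relative frequencies of any two columns within the same row."""
    column_totals = [sum(column) for column in zip(*table)]
    gaps = []
    for row in table:
        percents = [100 * n / total for n, total in zip(row, column_totals)]
        gaps.append(max(percents) - min(percents))
    return max(gaps)


# Hypothetical cutoff: the lesson only says "similar" or "very different".
THRESHOLD = 10


def possibly_associated(table):
    """Flag a possible association when some gap exceeds the assumed cutoff."""
    return largest_gap(table) > THRESHOLD


# Rows: prefers pen, prefers pencil. Columns: left-handed, right-handed.
utensils = [[7, 82], [6, 5]]
# Rows: left-handed, right-handed. Columns: prefers cake, prefers pie.
desserts = [[10, 20], [90, 180]]

print(round(largest_gap(utensils), 1), possibly_associated(utensils))  # 40.4 True
print(largest_gap(desserts), possibly_associated(desserts))            # 0.0 False
```

A fixed cutoff is a simplification. In class, the judgment about whether percentages are "similar" should come from discussion of the context rather than from a single number.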

## 3.4: Cool-down - Graduate Debt (5 minutes)

### Cool-Down

For access, consult one of our IM Certified Partners.

## Student Lesson Summary

### Student Facing

An association between two variables means that the two variables are statistically related to each other. For example, we might expect that ice cream sales would be higher on sunny days than on snowy days. If sales were higher on sunny days than on snowy days, then we would say that there is a possible association between ice cream sales and whether it is sunny or snowy. When dealing with categorical variables, row or column relative frequency tables are often used to look for associations in the data.

Here is a two-way table displaying ice cream sales and weather conditions for 41 days for a particular creamery.

| | sunny day | snowy day | total |
|---|---|---|---|
| sold fewer than 50 cones | 8 | 7 | 15 |
| sold 50 cones or more | 22 | 4 | 26 |
| total | 30 | 11 | 41 |

Noticing a pattern in the raw data can be difficult, especially when the row or column totals are not the same for different categories, so the data should be converted into a row or column relative frequency table to better compare the categories. For the creamery, notice that the number of days with low sales is about the same for the two weather types, which contradicts our intuition. In this case, it makes sense to look at the percentage of days that sold well under each weather condition separately. That is, consider the column relative frequencies.

| | sunny day | snowy day |
|---|---|---|
| sold fewer than 50 cones | 27% | 64% |
| sold 50 cones or more | 73% | 36% |
| total | 100% | 100% |

From the column relative frequency table, it is clear that most of the sunny days resulted in sales of at least 50 cones (73%), while most of the snowy days resulted in fewer than 50 cones sold (64%). Because these percentages are quite different, this suggests there is an association between the weather condition and the number of cone sales.

A bakery might wonder if the weather conditions impact their muffin sales as well.

| | sunny day | snowy day |
|---|---|---|
| sold fewer than 50 muffins | 32% | 35% |
| sold 50 muffins or more | 68% | 65% |
| total | 100% | 100% |

For the bakery, there does not seem to be an association between weather conditions and muffin sales, since the percentages of days with low sales are very similar under the two weather conditions, and the percentages of days with high sales are also close.

Using row or column relative frequency tables helps organize data so that columns (or rows) can be easily compared between different categories for a variable. The comparison can also be made directly from a two-way table or a two-way relative frequency table, but doing so requires you to account for the differences in the number of data values in each category.
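For the creamery data, the contrast between raw counts and column relative frequencies can be checked with a short Python sketch. The dictionary layout and names are assumptions made for illustration.

```python
# Counts from the creamery table: [fewer than 50 cones, 50 cones or more].
sales_by_weather = {
    "sunny day": [8, 22],
    "snowy day": [7, 4],
}

# Raw counts of low-sales days look almost the same for both weather types.
low_sales_counts = {weather: counts[0] for weather, counts in sales_by_weather.items()}
print(low_sales_counts)  # {'sunny day': 8, 'snowy day': 7}

# Column relative frequencies account for the different number of days
# of each weather type (30 sunny days, 11 snowy days).
column_percents = {
    weather: [round(100 * n / sum(counts)) for n in counts]
    for weather, counts in sales_by_weather.items()
}
print(column_percents)  # {'sunny day': [27, 73], 'snowy day': [64, 36]}
```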