Lesson 6
Areas in Histograms
- Let’s find proportions of data in certain intervals.
6.1: Find the Area
- Find the shaded area between the function, the \(x\)-axis, and the boundaries \(x = 1\) and \(x = 2\). Explain or show your reasoning.
- What proportion of the area between the function, the \(x\)-axis, and the boundaries \(x = 0\) and \(x = 7\) is shaded? Explain or show your reasoning.
6.2: Story Submissions
A publisher takes submissions for short stories to include in a book. 200 stories are submitted, but the publisher needs to be aware of how long each story is. In the format the publisher uses for the collection, a page typically contains 200 words. The mean number of words for each story is 2,600 and the standard deviation is 400 words. The word counts of the 200 submitted stories are:
- 2844
- 2643
- 3316
- 2084
- 2316
- 2513
- 2931
- 2563
- 2655
- 2345
- 2465
- 2821
- 2493
- 2263
- 2706
- 2501
- 2627
- 2220
- 2372
- 2635
- 3066
- 2824
- 2357
- 2522
- 2564
- 2901
- 2118
- 2325
- 3551
- 2734
- 2888
- 2695
- 2763
- 2867
- 3301
- 2546
- 2174
- 2515
- 2936
- 3308
- 3624
- 2927
- 3101
- 3118
- 2761
- 3020
- 2556
- 3193
- 2513
- 3247
- 2476
- 2678
- 2466
- 3311
- 2863
- 2632
- 2669
- 2710
- 2440
- 2846
- 2425
- 3143
- 2491
- 2736
- 2115
- 2175
- 1722
- 3462
- 2570
- 2797
- 2505
- 2308
- 2224
- 1613
- 2361
- 2724
- 2438
- 3377
- 2156
- 2219
- 2302
- 1908
- 1453
- 2213
- 3172
- 2976
- 2042
- 3063
- 2954
- 3153
- 2470
- 1650
- 2404
- 2188
- 2722
- 2359
- 2635
- 2896
- 2809
- 2864
- 2756
- 2663
- 2259
- 2904
- 3138
- 2739
- 2784
- 3124
- 1867
- 3184
- 2073
- 2463
- 2374
- 1976
- 2746
- 3462
- 2730
- 1952
- 2068
- 3054
- 2476
- 2853
- 2538
- 2167
- 2732
- 3304
- 2347
- 3015
- 2151
- 2446
- 2714
- 2839
- 2727
- 2489
- 2481
- 2367
- 3116
- 2650
- 2477
- 2360
- 2975
- 2871
- 2946
- 1849
- 2897
- 2625
- 2938
- 2407
- 2218
- 2287
- 2356
- 2125
- 3296
- 2289
- 2379
- 2868
- 2715
- 2793
- 2631
- 2973
- 2876
- 2295
- 2551
- 2381
- 3259
- 3094
- 2452
- 2149
- 3043
- 2638
- 2549
- 2542
- 2753
- 2985
- 2501
- 2393
- 2896
- 2135
- 3191
- 2319
- 1984
- 2013
- 2462
- 3186
- 2674
- 2273
- 2483
- 2671
- 2702
- 2819
- 2197
- 2427
- 2018
- 1927
- 2428
- 2438
- 1852
- 2395
- 1826
- 2767
- If a histogram is created using intervals of 200 words, what would be the area of the bar representing the number of stories that contain between 2,000 and 2,200 words? Explain or show your reasoning.
- What proportion of the total area is represented by the bar for stories that contain between 2,000 and 2,200 words? Explain or show your reasoning.
- What proportion of stories in this group contain between 2,000 and 2,200 words? Explain or show your reasoning.
- How does the proportion of the area you calculated relate to the proportion of stories in the group that contain between 2,000 and 2,200 words?
- What proportion of stories in this group are within 1 standard deviation of the mean number of words?
- What proportion of stories in this group are within 2 standard deviations of the mean number of words?
- What proportion of stories in this group are within 1 standard deviation of 2,400 words?
Prove more generally that the proportion of total area taken up by a bar in a histogram is equal to the proportion of all data values that are contained in the interval represented by the bar. To begin, let \(n\) represent the number of data values in an interval given by one bar, \(M\) represent the number of data values in the entire set, and \(w\) be the width of the interval in each bar of the histogram. Prove that the proportion of area taken up by the bar is \(\frac{n}{M}\).
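A numeric check is not a proof, but it can make the claim concrete before you prove it. The sketch below uses only the first ten word counts from the list in this activity (a small sample chosen for illustration, not the full data set). It assumes each interval includes its left endpoint but not its right endpoint, and the names `sample`, `width`, and `bin_counts` are made up for this example.

```python
from fractions import Fraction

# First ten word counts from the list above (illustrative sample only).
sample = [2844, 2643, 3316, 2084, 2316, 2513, 2931, 2563, 2655, 2345]
width = 200  # interval width in words


def bin_counts(values, width):
    """Count the values in each interval [start, start + width)."""
    counts = {}
    for v in values:
        start = (v // width) * width
        counts[start] = counts.get(start, 0) + 1
    return counts


counts = bin_counts(sample, width)

# Area of a bar = (number of values in the interval) * (interval width).
total_area = sum(n * width for n in counts.values())
bar_area = counts[2000] * width  # bar for 2,000 to 2,200 words

area_proportion = Fraction(bar_area, total_area)
data_proportion = Fraction(counts[2000], len(sample))

print(area_proportion, data_proportion)
```

Because every bar has the same width, the width cancels when the bar's area is divided by the total area, which is the idea the proof needs to capture.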
6.3: Website Load Times
A company collects data from 10,000 websites about how long it takes to load the site. The number of seconds it takes to fully load the website is summarized in the relative frequency table.
seconds to load | relative frequency |
---|---|
1.4–1.6 | 0.0003 |
1.6–1.8 | 0.0012 |
1.8–2.0 | 0.0053 |
2.0–2.2 | 0.0181 |
2.2–2.4 | 0.0442 |
2.4–2.6 | 0.0910 |
2.6–2.8 | 0.1555 |
2.8–3.0 | 0.1861 |
3.0–3.2 | 0.1938 |
3.2–3.4 | 0.1447 |
3.4–3.6 | 0.0923 |
3.6–3.8 | 0.0447 |
3.8–4.0 | 0.0166 |
4.0–4.2 | 0.0048 |
4.2–4.4 | 0.0012 |
4.4–4.6 | 0.0002 |
The relative frequency histogram summarizes the same data.
The mean time to load a website is 3 seconds and the standard deviation is 0.4 seconds.
- Would a normal distribution be a good model for this distribution? Explain your reasoning.
- What proportion of websites loaded within 1 standard deviation of the mean?
- What proportion of websites loaded within 2 standard deviations of the mean?
- What proportion of websites loaded within 1 standard deviation of 2.8 seconds?
- Compare the proportion of websites within 1 standard deviation of the mean to the proportion of stories in the submissions that are within 1 standard deviation of the mean number of words from the previous task. Do the same for the proportion within 2 standard deviations.
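One possible way to check your answers to the questions above is to add up relative frequencies from the table. The sketch below is only one approach. It assumes the endpoints of the interval you care about line up exactly with the edges of the table's intervals, which is the case for 1 or 2 standard deviations from a mean of 3 seconds. The names `rows`, `table`, and `proportion_between` are made up for this example, and `Decimal` is used so the sums are exact.

```python
from decimal import Decimal

# (lower edge, upper edge, relative frequency), copied from the table above.
rows = [
    ("1.4", "1.6", "0.0003"), ("1.6", "1.8", "0.0012"),
    ("1.8", "2.0", "0.0053"), ("2.0", "2.2", "0.0181"),
    ("2.2", "2.4", "0.0442"), ("2.4", "2.6", "0.0910"),
    ("2.6", "2.8", "0.1555"), ("2.8", "3.0", "0.1861"),
    ("3.0", "3.2", "0.1938"), ("3.2", "3.4", "0.1447"),
    ("3.4", "3.6", "0.0923"), ("3.6", "3.8", "0.0447"),
    ("3.8", "4.0", "0.0166"), ("4.0", "4.2", "0.0048"),
    ("4.2", "4.4", "0.0012"), ("4.4", "4.6", "0.0002"),
]
table = [(Decimal(lo), Decimal(hi), Decimal(rf)) for lo, hi, rf in rows]


def proportion_between(table, low, high):
    """Add the relative frequencies of intervals lying entirely in [low, high]."""
    return sum(rf for lo, hi, rf in table if lo >= low and hi <= high)


mean, sd = Decimal("3"), Decimal("0.4")
within_1_sd = proportion_between(table, mean - sd, mean + sd)
within_2_sd = proportion_between(table, mean - 2 * sd, mean + 2 * sd)

print(within_1_sd, within_2_sd)
```

The same function can be reused with a different center, for example `proportion_between(table, Decimal("2.4"), Decimal("3.2"))` for 1 standard deviation of 2.8 seconds.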
Summary
There is an important connection between areas in histograms and the data represented by the histogram. In particular, the proportion of the total area in the histogram that is represented by a single bar in the histogram is equivalent to the proportion of all the data that is included in that interval. This is made more interesting by the fact that, for normally distributed data, the proportion of values in an interval whose endpoints are described by the mean and standard deviation is always the same.
For example, a woodshop produces boards of various lengths. During a certain week, 5,000 boards are produced and measured. The mean length is 6 feet, and the standard deviation is 1 foot. The table and histogram show a summary of the board lengths.
board length | 3.5–4 | 4–4.5 | 4.5–5 | 5–5.5 | 5.5–6 | 6–6.5 | 6.5–7 | 7–7.5 | 7.5–8 | 8–8.5 |
---|---|---|---|---|---|---|---|---|---|---|
frequency | 113 | 220 | 460 | 747 | 960 | 955 | 753 | 460 | 220 | 112 |
The total area of all the rectangles in the histogram is 2,500 since we could stack all the bars on top of one another and have a rectangle that is 5,000 tall and 0.5 wide. If we look at just the rectangle representing boards between 5.5 and 6 feet long, the area is 480 (a bar that is 960 tall and 0.5 wide), which is 19.2% of the total area since \(\frac{480}{2,500} = 0.192\). Similarly, we can see from the data that 19.2% of the data is in this same interval since \(\frac{960}{5,000} = 0.192\). It is not a coincidence that these values are the same! The proportion of the total area that is in one of the rectangles is always equivalent to the proportion of all the data values that are in the same interval.
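The arithmetic in this paragraph can be reproduced with a short script. This is a minimal sketch using the frequencies from the table; the variable names are made up for this example, and fractions are used so the two proportions can be compared exactly.

```python
from fractions import Fraction

bin_width = Fraction(1, 2)  # each interval is 0.5 feet wide

# Frequencies from the table, in order from 3.5-4 feet up to 8-8.5 feet.
frequencies = [113, 220, 460, 747, 960, 955, 753, 460, 220, 112]
lower_edges = [Fraction(7, 2) + i * bin_width for i in range(len(frequencies))]

# Total area: the sum of (height * width) over all bars.
total_area = sum(f * bin_width for f in frequencies)

# The bar for boards between 5.5 and 6 feet long.
bar_index = lower_edges.index(Fraction(11, 2))
bar_area = frequencies[bar_index] * bin_width

area_proportion = bar_area / total_area
data_proportion = Fraction(frequencies[bar_index], sum(frequencies))

print(total_area, bar_area, float(area_proportion), float(data_proportion))
```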
When the data is normally distributed, the proportions of certain regions are always the same. For example, about 68% of the data is always within one standard deviation of the mean. Since the lengths of the boards produced by the woodshop are approximately normally distributed, we can test this claim.
The boards within one standard deviation of the mean are between 5 and 7 feet long. Using the table, we can see that 3,415 boards are in this range (\(747+960+955+753 = 3,\!415\)) and those represent 68.3% (\(\frac{3,415}{5,000} = 0.683\)) of the boards produced in the woodshop.
Let’s say that, another week, the woodshop produces 5,000 boards again, but this time, the mean is 6.5 feet and the standard deviation is 0.75 feet. As long as the board lengths continue to be approximately normal, we can expect about 68% of the boards to be within 1 standard deviation of the mean. For that week, it means that about 68% of the boards will be between 5.75 and 7.25 feet long.
In fact, as long as the interval can be described using only the mean and standard deviation and the data is normally distributed, the proportion of data values in the interval can be found. In general, about 68% of the data is within 1 standard deviation of the mean, about 95% of the data is within 2 standard deviations of the mean, and more than 99% of the data is within 3 standard deviations of the mean.
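To see how closely the woodshop data follows these percentages, the sketch below compares the proportions counted from the board table with the proportions for an exact normal distribution. It goes a little beyond this lesson by relying on a standard fact: for a normal distribution, the proportion within \(k\) standard deviations of the mean equals `math.erf(k / math.sqrt(2))`. The function names are made up for this example.

```python
import math


def normal_proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


# Board data from the table: lower edge of each 0.5-foot interval and its frequency.
lower_edges = [3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0]
frequencies = [113, 220, 460, 747, 960, 955, 753, 460, 220, 112]
bin_width = 0.5
mean, sd = 6.0, 1.0


def observed_proportion_within(k):
    """Proportion of boards in intervals lying entirely within k SDs of the mean."""
    low, high = mean - k * sd, mean + k * sd
    inside = sum(
        f
        for edge, f in zip(lower_edges, frequencies)
        if edge >= low and edge + bin_width <= high
    )
    return inside / sum(frequencies)


for k in (1, 2, 3):
    print(k, round(observed_proportion_within(k), 3),
          round(normal_proportion_within(k), 3))
```

The observed and model values should be close but not identical, since real data is only approximately normal.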
Glossary Entries
- normal distribution
A specific distribution in statistics whose graph is symmetric and bell-shaped, has an area of 1 between the \(x\)-axis and the graph, and has the \(x\)-axis as a horizontal asymptote.
- relative frequency histogram
A histogram where the height of each bar is the fraction of the entire data set that falls into the corresponding interval (that is, it is the relative frequency with which the data values fall into that interval).