The City Sleep Study — Teacher Answers

🔐 Student Activity Unlock Codes (case-insensitive)

Phase 1 → 2

SAMPLE01

Phase 2 → 3

CUMUL02

Phase 3 → 4

STATS03

Phase 4 → Done

COMPAR04

Reference Data — City Sleep Study (n = 80)

Hours (h)	4≤h<5	5≤h<6	6≤h<7	7≤h<8	8≤h<9	9≤h<10	10≤h<11
Frequency	3	7	18	24	16	9	3
Midpoint	4.5	5.5	6.5	7.5	8.5	9.5	10.5
Cumul. freq.	3	10	28	52	68	77	80

🎨 Oral Launch — Read Aloud to Class

"A public health researcher surveyed 80 teenagers and young adults in a large city about their nightly sleep. The data is in front of you. Your job today: interrogate this data — figure out how it was collected, what patterns it shows, and what it tells us about the city's health. All work goes on the whiteboard first."

Sampling — How Did She Get This Data?

SL & HL 4.1 · Sampling methods & bias

Task 1a · Identify the sampling method

Convenience sampling

Source of bias: People at a shopping mall on Saturday afternoon are not representative of all residents aged 13–25. Teenagers may be under-represented if they don't visit malls independently; wealthier individuals may be over-represented.

Accept any one clearly stated source of bias with a logical explanation. Push students to connect the bias to the population of interest, not just say "it's not random."

Task 1b · Compare sampling methods

Simple random: Assign each city resident aged 13–25 a number, then use a random number generator to select 80 of them. Every individual has an equal chance of selection.
Systematic: Order the list of residents aged 13–25 and select every \(k\)th person, where \(k = \text{population}/80\). Start at a random point in the first interval.
Stratified (13–17 and 18–25): Divide the population into the two age groups, then sample from each group in proportion to its size.

Most appropriate: stratified sampling

Justification: Sleep patterns likely differ between teenagers (school schedules) and young adults (work/university schedules). Stratified sampling ensures both groups are proportionally represented, producing a more reliable estimate for each subgroup.
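For teachers who want to demonstrate the three methods live, the sketch below shows one way each selection could be carried out. The sampling frame is hypothetical: it borrows the stratum sizes from Task 1c (3 200 teenagers, 4 800 young adults), and the function names and the fixed seed are assumptions made for illustration only.

```python
import random

def simple_random_sample(population, n, rng):
    """Every individual has the same chance of being selected."""
    return rng.sample(population, n)

def systematic_sample(population, n, rng):
    """Select every k-th person after a random start in the first interval."""
    k = len(population) // n
    start = rng.randrange(k)
    return [population[start + i * k] for i in range(n)]

def stratified_sample(population, n, key, rng):
    """Sample from each stratum in proportion to its size.

    Rounding each share can miss n by one or two when the shares are not
    whole numbers; here they are exact.
    """
    strata = {}
    for person in population:
        strata.setdefault(key(person), []).append(person)
    sample = []
    for members in strata.values():
        share = round(n * len(members) / len(population))
        sample.extend(rng.sample(members, share))
    return sample

# Hypothetical sampling frame: (id, age group) for every resident aged 13-25.
population = [(i, "13-17") for i in range(3200)]
population += [(i, "18-25") for i in range(3200, 8000)]

rng = random.Random(0)  # fixed seed so the demonstration is repeatable
srs = simple_random_sample(population, 80, rng)
systematic = systematic_sample(population, 80, rng)
stratified = stratified_sample(population, 80, key=lambda p: p[1], rng=rng)

teen_count = sum(1 for _, group in stratified if group == "13-17")
print(len(srs), len(systematic), len(stratified), teen_count)
```

Only the stratified version guarantees the split between the two age groups; the other two methods leave it to chance.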

Unlock code: SAMPLE01

Task 1c · Stratified sample sizes

Total population = 3 200 + 4 800 = 8 000

\[\text{Teenagers: }\frac{3200}{8000}\times80 = 32\qquad\text{Young adults: }\frac{4800}{8000}\times80 = 48\]

32 teenagers · 48 young adults
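The same proportional allocation can be checked with a few lines of code. This is a minimal sketch; the function name `proportional_allocation` is an assumption, not part of the activity.

```python
def proportional_allocation(stratum_sizes, sample_size):
    """Give each stratum a share of the sample proportional to its size."""
    total = sum(stratum_sizes.values())
    return {name: sample_size * size / total
            for name, size in stratum_sizes.items()}

allocation = proportional_allocation(
    {"teenagers": 3200, "young adults": 4800}, sample_size=80
)
print(allocation)  # {'teenagers': 32.0, 'young adults': 48.0}
```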

Cumulative Frequency Graph

SL & HL 4.2 · Ogive · Quartiles · Percentiles · IQR

Task 2a · Build the cumulative frequency table

Upper boundary	5	6	7	8	9	10	11
Cumulative frequency	3	10	28	52	68	77	80

Plot points at upper class boundaries (5, 6, 7…) and start the curve at (4, 0). The curve must pass through (4, 0) and (11, 80).
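A quick way to verify a board's cumulative frequencies and plotting points is a running total, sketched below with the reference data. The variable names are illustrative.

```python
from itertools import accumulate

upper_boundaries = [5, 6, 7, 8, 9, 10, 11]
frequencies = [3, 7, 18, 24, 16, 9, 3]

# Running total of the class frequencies.
cumulative = list(accumulate(frequencies))

# Ogive points: start at the lower boundary of the first class with CF = 0.
ogive_points = [(4, 0)] + list(zip(upper_boundaries, cumulative))

print(cumulative)  # [3, 10, 28, 52, 68, 77, 80]
```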

Task 2b · Ogive: Q₁, Q₂, Q₃

Quartiles are read at cumulative frequencies \(\tfrac{1}{4}n\), \(\tfrac{1}{2}n\), \(\tfrac{3}{4}n\):

\[Q_1:\;CF = 20 \;\Rightarrow\; Q_1 \approx 6.6\text{ h} \qquad Q_2:\;CF = 40 \;\Rightarrow\; Q_2 \approx 7.5\text{ h} \qquad Q_3:\;CF = 60 \;\Rightarrow\; Q_3 \approx 8.5\text{ h}\]

\(Q_1\approx 6.6\) h · Median \(\approx 7.5\) h · \(Q_3\approx 8.5\) h

Accept: \(Q_1\in[6.5,6.7]\) | \(Q_2\in[7.4,7.6]\) | \(Q_3\in[8.4,8.6]\)

Check against the table: CF = 20 lies between the plotted points (6, 10) and (7, 28); CF = 40 lies midway between (7, 28) and (8, 52); CF = 60 lies midway between (8, 52) and (9, 68). Values may vary slightly depending on how carefully students draw the ogive. Insist on dotted construction lines visible on the board.

Task 2c · Percentiles

\[\text{20th percentile: }CF = 0.20\times80 = 16 \;\Rightarrow\; P_{20}\approx 6.3\text{ h}\]
\[\text{90th percentile: }CF = 0.90\times80 = 72 \;\Rightarrow\; P_{90}\approx 9.4\text{ h}\]

Accept: \(P_{20}\in[6.1,6.4]\) | \(P_{90}\in[9.3,9.6]\)

Contextual interpretation of \(P_{90}\): 90% of the 80 residents surveyed sleep less than approximately 9.4 hours per night. Only 10% sleep that long or more.

Unlock code: CUMUL02

Task 2d · IQR and range

\[\text{IQR} = Q_3 - Q_1 \approx 8.5 - 6.6 = 1.9\text{ h}\qquad\text{Range} = 11 - 4 = 7\text{ h (class boundaries)}\]

IQR ≈ 1.9 h · Range = 7 h

Why IQR is more useful: The IQR describes the spread of the middle 50% of the data, so extreme values at either end do not affect it. The range depends on only two data points (the minimum and maximum) and is easily distorted by outliers.

Key insight

IQR is resistant to outliers; range is not. Both describe spread, but IQR is more robust for skewed or extreme data.
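To sanity-check readings from a board, the quartiles and percentiles can be approximated by straight-line interpolation between the plotted ogive points. This is a sketch under that assumption: a hand-drawn smooth curve will give slightly different readings, and the helper name `ogive_value` is hypothetical.

```python
def ogive_value(target_cf, points):
    """Return the x-value where a straight-line ogive reaches target_cf."""
    for (x0, cf0), (x1, cf1) in zip(points, points[1:]):
        if cf0 <= target_cf <= cf1:
            fraction = (target_cf - cf0) / (cf1 - cf0)
            return x0 + fraction * (x1 - x0)
    raise ValueError("target_cf lies outside the ogive")

# (upper boundary, cumulative frequency), starting from (4, 0).
points = [(4, 0), (5, 3), (6, 10), (7, 28), (8, 52), (9, 68), (10, 77), (11, 80)]
n = 80

q1 = ogive_value(0.25 * n, points)
q2 = ogive_value(0.50 * n, points)
q3 = ogive_value(0.75 * n, points)
p20 = ogive_value(0.20 * n, points)
p90 = ogive_value(0.90 * n, points)
iqr = q3 - q1

for label, value in [("Q1", q1), ("Q2", q2), ("Q3", q3),
                     ("P20", p20), ("P90", p90), ("IQR", iqr)]:
    print(f"{label}: {value:.2f} h")
```

Comparing a student's reading with the interpolated value is a quick way to tell a drawing inaccuracy from a misread cumulative frequency.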

Mean, Standard Deviation & Box Plot

SL & HL 4.3 · Mean · σ · Outlier test · Box and whisker

Task 3a · Estimate the mean from grouped data

\[\bar{x}=\frac{\sum f_i m_i}{n}=\frac{3(4.5)+7(5.5)+18(6.5)+24(7.5)+16(8.5)+9(9.5)+3(10.5)}{80}\]
\[=\frac{13.5+38.5+117+180+136+85.5+31.5}{80}=\frac{602}{80}=7.525\ \text{h}\]

\(\bar{x} = 7.525\) h (accept 7.52 or 7.53)

Full working must be visible on the board even if students verify with GDC.
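The grouped-mean working can be reproduced in a few lines as a check on the GDC result. This is a sketch using the reference table; the variable names are illustrative.

```python
midpoints = [4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5]
frequencies = [3, 7, 18, 24, 16, 9, 3]

n = sum(frequencies)
# Each class contributes frequency x midpoint to the estimated total.
products = [f * m for f, m in zip(frequencies, midpoints)]
total = sum(products)
mean = total / n

print(products)        # [13.5, 38.5, 117.0, 180.0, 136.0, 85.5, 31.5]
print(n, total, mean)  # 80 602.0 7.525
```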

Task 3b · Standard deviation: population or sample?

\[\sigma_n \approx 1.38\text{ h}\quad(\text{population SD, divides by }n)\qquad\sigma_{n-1}\approx 1.39\text{ h}\quad(\text{sample SD, divides by }n-1)\]

\(\sigma_n \approx 1.38\) h — population standard deviation

Justification: The 80 residents are the complete dataset being analysed — we treat them as the full population of interest, not a sample used to estimate a larger unknown parameter, so we divide by \(n\).

Contextual meaning: A resident's nightly sleep typically differs from the mean (≈7.5 h) by about 1.38 hours. The researcher prefers \(\sigma\) over the range because \(\sigma\) uses every data value, making it a more stable measure of spread.

Key teaching moment: Both \(\sigma_n\) and \(\sigma_{n-1}\) are defensible depending on framing. IB convention: unless told otherwise, treat the dataset as the population. Credit students who argue for \(\sigma_{n-1}\) with a clear justification that the 80 people are a sample from the whole city.
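The difference between the two divisors can be shown directly from the grouped data. The sketch below treats every value in a class as sitting at the class midpoint, which is the same assumption the GDC makes when given midpoints and frequencies.

```python
import math

midpoints = [4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5]
frequencies = [3, 7, 18, 24, 16, 9, 3]

n = sum(frequencies)
mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

# Sum of squared deviations from the mean, weighted by frequency.
squared_deviations = sum(f * (m - mean) ** 2
                         for f, m in zip(frequencies, midpoints))

sd_population = math.sqrt(squared_deviations / n)    # divides by n
sd_sample = math.sqrt(squared_deviations / (n - 1))  # divides by n - 1

print(round(sd_population, 2), round(sd_sample, 2))  # 1.38 1.39
```

With \(n = 80\) the two values differ only in the second decimal place, which is why the justification matters more than the number itself.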

Task 3c · Outlier boundaries

Using GDC quartile values based on midpoints: \(Q_1=6.5\) h, \(Q_3=8.5\) h, IQR \(=2.0\) h

\[\text{Lower fence} = 6.5 - 1.5(2.0) = 3.5\text{ h}\qquad\text{Upper fence} = 8.5 + 1.5(2.0) = 11.5\text{ h}\]
\[x_{\min}=4.5\text{ h} > 3.5\;\checkmark\qquad x_{\max}=10.5\text{ h} < 11.5\;\checkmark\]

No outliers in this dataset.

Contextual meaning: All sleep durations fall within the expected range. An outlier would represent a resident sleeping an unusually extreme amount — fewer than 3.5 h or more than 11.5 h — which would warrant a follow-up interview to check for data errors or unusual circumstances.

Because we use midpoint values, the fences are wider and no outliers appear. Good discussion: the choice of how to represent grouped data affects statistical conclusions.
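The midpoint-based quartiles and fences can be reproduced by expanding the table into 80 individual values. The sketch assumes the median-of-halves quartile convention (quartiles as the medians of the lower and upper 40 values), which is what many GDCs use; other conventions can give slightly different quartiles on other datasets.

```python
from statistics import median

midpoints = [4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5]
frequencies = [3, 7, 18, 24, 16, 9, 3]

# Repeat each midpoint by its frequency; the list is already in order.
data = [m for m, f in zip(midpoints, frequencies) for _ in range(f)]

half = len(data) // 2
q1 = median(data[:half])              # median of the lower half
q3 = median(data[len(data) - half:])  # median of the upper half
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q3, lower_fence, upper_fence, outliers)  # 6.5 8.5 3.5 11.5 []
```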

Task 3d · Box and whisker diagram

Five-number summary (midpoint-based, no outliers):

\[\min=4.5\text{ h},\quad Q_1=6.5\text{ h},\quad Q_2=7.5\text{ h},\quad Q_3=8.5\text{ h},\quad\max=10.5\text{ h}\]

Mean vs median: \(\bar{x}\approx7.53\) h, \(Q_2=7.5\) h. The mean and median are very close, suggesting an approximately symmetric distribution with a very slight positive skew.

What to look for on the board: Consistent scale; box from \(Q_1=6.5\) to \(Q_3=8.5\); vertical line at \(Q_2=7.5\); whiskers to min 4.5 and max 10.5; no outliers marked. Students should note the mean and median are nearly equal.

Unlock code: STATS03

Comparing Two Distributions

SL & HL 4.2–4.3 · Comparative box plots · Linear transformations

Task 4a · Draw a second box plot: neighbouring city

Given data for the neighbouring city: mean = 6.8 h, σ = 1.1 h, \(Q_1=6.1\) h, \(Q_2=6.7\) h, \(Q_3=7.4\) h.

Box plot summary (neighbouring city): box from 6.1 to 7.4; median line at 6.7. No minimum or maximum is given, so accept any reasonable whiskers that stay within the outlier fences, approximately \(6.1 - 1.5(1.3) = 4.15\) and \(7.4 + 1.5(1.3) = 9.35\).

Students should draw both box plots against the same number line and scale. The neighbouring city's distribution sits lower and is more compact.

Task 4b · Compare the two distributions

Model comparative statements:

Median: The original city has a higher median (\(Q_2\approx7.5\) h) than the neighbouring city (\(Q_2=6.7\) h), suggesting residents in the first city typically sleep more on a school/work night.
IQR (spread): The original city has a wider IQR (≈2.0 h) than the neighbouring city (1.3 h), indicating more variability in sleep duration among the middle 50% of residents.
Standard deviation: The original city shows greater overall spread (\(\sigma\approx1.38\) h) compared to the neighbouring city (\(\sigma=1.1\) h), meaning individual sleep times differ more widely from the mean.
Public health implication: The neighbouring city's residents sleep less on average, which may indicate greater issues with sleep deprivation and could warrant a targeted health intervention for that population.

Minimum expectation: 3 statements, each referencing a statistical measure AND interpreting it in context. Penalise answers that only compare numbers without linking to sleep patterns.

Task 4c · Effect of a linear transformation: hours → minutes

Multiplying all data values by 60 applies the transformation \(Y=60X\):

\[\bar{y}=60\times7.525=451.5\text{ min}\qquad\sigma_Y=60\times1.38=82.8\text{ min}\]
\[Q_1'=390\text{ min}\quad Q_2'=450\text{ min}\quad Q_3'=510\text{ min}\quad\text{IQR}'=120\text{ min}\quad\text{Range}'=360\text{ min (midpoint-based: }60\times(10.5-4.5)\text{)}\]

All measures of location and spread are multiplied by 60.

What stays the same: The shape of the distribution — skewness, symmetry, relative position of the median within the box, and whether outliers exist are all unchanged.

\[\text{For }Y=kX:\quad\mathrm{E}(kX)=k\,\mathrm{E}(X)\quad\mathrm{Var}(kX)=k^2\,\mathrm{Var}(X)\quad\sigma(kX)=|k|\,\sigma(X)\]

Key distinction

For \(Y=X+c\): location shifts by \(c\), spread is unchanged. For \(Y=kX\) with \(k>0\): both location and spread scale by \(k\) (for negative \(k\), spread scales by \(|k|\)). This is a high-frequency exam question.
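Both rules can be demonstrated numerically on the midpoint-expanded data. In this sketch the shift of one hour (\(c = 1\)) is a hypothetical value chosen purely for illustration; the scaling by 60 is the hours-to-minutes conversion from this task.

```python
from statistics import mean, pstdev, pvariance

midpoints = [4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5]
frequencies = [3, 7, 18, 24, 16, 9, 3]
hours = [m for m, f in zip(midpoints, frequencies) for _ in range(f)]

minutes = [60 * h for h in hours]  # Y = 60X (scaling)
shifted = [h + 1 for h in hours]   # Y = X + 1 (hypothetical shift, c = 1)

# Scaling: mean and SD are multiplied by 60, variance by 60 squared.
mean_ratio = mean(minutes) / mean(hours)
sd_ratio = pstdev(minutes) / pstdev(hours)
variance_ratio = pvariance(minutes) / pvariance(hours)
print(round(mean_ratio, 6), round(sd_ratio, 6), round(variance_ratio, 6))

# Shifting: the mean moves by c, the SD does not change.
mean_shift = mean(shifted) - mean(hours)
sd_change = pstdev(shifted) - pstdev(hours)
print(round(mean_shift, 6), round(abs(sd_change), 6))
```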

Unlock code: COMPAR04

📋 Consolidation — select 2–3 boards to debrief

Phase 1: Why is stratified sampling best here? What would change if we used convenience sampling for policy decisions?
Phase 2: How do you read a quartile from an ogive? What does the S-shape tell us about the distribution?
Phase 3: Why are the mean and median so close? What would cause them to diverge?
Phase 4: If you multiply data by 60, why does the IQR also multiply by 60 but the shape stays the same?