Answer Booklet · IBDP Mathematics AA · SL & HL · 4.1–4.3

The City Sleep Study
Teacher Answers

Not for distribution · Full worked solutions + marking guidance
🔐 Interactive HTML Unlock Codes (case-insensitive)
Phase 1 → 2: SAMPLE01
Phase 2 → 3: CUMUL02
Phase 3 → 4: STATS03
Phase 4 → Done: COMPAR04
Reference Data — City Sleep Study (n = 80)
Hours (h) | 4≤h<5 | 5≤h<6 | 6≤h<7 | 7≤h<8 | 8≤h<9 | 9≤h<10 | 10≤h<11
Frequency | 3 | 7 | 18 | 24 | 16 | 9 | 3
Midpoint | 4.5 | 5.5 | 6.5 | 7.5 | 8.5 | 9.5 | 10.5
Cumul. freq. | 3 | 10 | 28 | 52 | 68 | 77 | 80
Phase 1

Sampling — How Did She Get This Data?

SL & HL · 4.1
Task 1a · Identify the sampling method

Method:

Convenience sampling

Source of bias: People at a shopping mall on Saturday afternoon are not representative of all residents aged 13–25. Teenagers may be under-represented if they don't visit malls independently; wealthier individuals may be over-represented. Any well-reasoned contextual bias is acceptable.

Watch for: Students must name the method precisely and link the bias to a specific feature of the context — not just say "it's not random."
Task 1b · Compare sampling methods
  • Simple random: Assign every resident aged 13–25 a number; use a random number generator to select 80. Every individual has an equal chance of selection. Requires a complete sampling frame, which may not exist.
  • Systematic: Order the population list, calculate interval \(k = \dfrac{N}{80}\), choose a random start between 1 and \(k\), then select every \(k\)-th person. Practical but vulnerable to periodic bias in the list.
  • Stratified (13–17 and 18–25): Divide the population into two strata and sample from each in proportion to their size. Guarantees representation of both age groups.
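The simple random and systematic procedures described above can be sketched in a few lines. This is only an illustration: the sampling frame of resident IDs 1 to 8,000 is hypothetical (8,000 is the population total used in Task 1c), and the function names are not part of the task.

```python
import random

def simple_random_sample(frame, n, rng):
    """Select n distinct units; every unit has the same chance of selection."""
    return rng.sample(frame, n)

def systematic_sample(frame, n, rng):
    """Pick a random start in the first interval, then take every k-th unit."""
    k = len(frame) // n            # sampling interval k = N / n (integer part)
    start = rng.randrange(k)       # random 0-based start within the first interval
    return [frame[start + i * k] for i in range(n)]

# Hypothetical sampling frame: resident IDs 1..8000 (assumption for illustration).
frame = list(range(1, 8001))
rng = random.Random(2024)          # fixed seed so the run is repeatable

srs = simple_random_sample(frame, 80, rng)
sys_sample = systematic_sample(frame, 80, rng)
print(sorted(srs)[:5])
print(sys_sample[:5])
```

The systematic sample is always evenly spaced (here every 100th ID), which is what makes it vulnerable to any pattern in the list that repeats with the same period.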

Most appropriate:

Stratified sampling by age group

Justification: Since the study aims to compare sleep across age groups, stratified sampling ensures both groups appear in the correct proportion. A simple random sample could under-represent either group by chance.

Accept any well-reasoned choice. Credit students who argue for simple random with a valid justification. Quality of reasoning matters more than the chosen method.
Task 1c · Stratified sample sizes
\[\text{Total population} = 3\,200 + 4\,800 = 8\,000\] \[n_{\text{teenagers}} = \frac{3\,200}{8\,000} \times 80 = 0.4 \times 80 = 32\] \[n_{\text{young adults}} = \frac{4\,800}{8\,000} \times 80 = 0.6 \times 80 = 48\] \[\text{Check: } 32 + 48 = 80 \ \checkmark\]
32 teenagers  |  48 young adults
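The proportional allocation above can be checked with a short sketch. The function name and dictionary labels are illustrative assumptions; the stratum sizes are those given in the task.

```python
def proportional_allocation(strata_sizes, sample_size):
    """Share the sample between strata in proportion to stratum size."""
    total = sum(strata_sizes.values())
    return {name: sample_size * size / total for name, size in strata_sizes.items()}

strata = {"teenagers (13-17)": 3200, "young adults (18-25)": 4800}
allocation = proportional_allocation(strata, 80)
print(allocation)
```

With other stratum sizes the results may not be whole numbers, in which case students need to round and then check that the rounded sizes still sum to the required sample size.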
Phase 2

Cumulative Frequency Graph

SL & HL · 4.2
Task 2a · Cumulative frequency table
Upper boundary (h) | 5 | 6 | 7 | 8 | 9 | 10 | 11
Frequency | 3 | 7 | 18 | 24 | 16 | 9 | 3
Cumulative frequency | 3 | 10 | 28 | 52 | 68 | 77 | 80
Common error: Students plot at midpoints instead of upper class boundaries. The curve must pass through \((4,\ 0)\) and \((11,\ 80)\).
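For a quick check of the table and the points to plot, a minimal sketch using the study frequencies (variable names are illustrative):

```python
from itertools import accumulate

upper_boundaries = [5, 6, 7, 8, 9, 10, 11]
frequencies = [3, 7, 18, 24, 16, 9, 3]

# Running total of the frequencies gives the cumulative frequency row.
cumulative = list(accumulate(frequencies))

# Ogive points: start at the lowest class boundary with cumulative frequency 0,
# then pair each upper boundary with its cumulative frequency.
ogive_points = [(4, 0)] + list(zip(upper_boundaries, cumulative))
print(cumulative)
print(ogive_points)
```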
Task 2b · Ogive: Q₁, Q₂, Q₃

Quartiles are read at cumulative frequencies \(\tfrac{1}{4}n\), \(\tfrac{1}{2}n\), \(\tfrac{3}{4}n\):

\[Q_1:\ CF = \tfrac{1}{4}\times 80 = 20 \quad\Rightarrow\quad Q_1 \approx 6.4\ \text{h}\] \[Q_2:\ CF = \tfrac{1}{2}\times 80 = 40 \quad\Rightarrow\quad Q_2 \approx 7.3\ \text{h (median)}\] \[Q_3:\ CF = \tfrac{3}{4}\times 80 = 60 \quad\Rightarrow\quad Q_3 \approx 8.2\ \text{h}\]
\(Q_1\approx 6.4\) h  |  Median \(\approx 7.3\) h  |  \(Q_3\approx 8.2\) h

Accept: \(Q_1\in[6.3,6.5]\) | \(Q_2\in[7.2,7.4]\) | \(Q_3\in[8.1,8.3]\)

[Figure: Cumulative frequency graph (ogive), City Sleep Study. Horizontal axis: hours of sleep, h (4 to 11). Vertical axis: cumulative frequency (0 to 80). Q₁ ≈ 6.4 h, Q₂ ≈ 7.3 h and Q₃ ≈ 8.2 h are marked on the curve.]
Marking the graph: Look for dotted lines drawn horizontally from \(CF=20,40,60\) to the curve, then vertically down to the \(h\)-axis. Values must be read from the graph, not back-calculated from raw data.
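As a cross-check on graph readings, the quartiles can also be estimated by joining the plotted points with straight lines and interpolating. This is a sketch under that straight-line assumption, not the graph-reading method students are asked to use; the function name is illustrative.

```python
def value_at_cf(points, target):
    """Linearly interpolate the h-value where the ogive reaches `target`."""
    for (h0, cf0), (h1, cf1) in zip(points, points[1:]):
        if cf0 <= target <= cf1:
            return h0 + (target - cf0) / (cf1 - cf0) * (h1 - h0)
    raise ValueError("target lies outside the plotted cumulative frequencies")

# (upper class boundary, cumulative frequency), starting from (4, 0).
points = [(4, 0), (5, 3), (6, 10), (7, 28), (8, 52), (9, 68), (10, 77), (11, 80)]
n = 80

q1, q2, q3 = (value_at_cf(points, p * n) for p in (0.25, 0.5, 0.75))
print(round(q1, 2), round(q2, 2), round(q3, 2))
```

Straight-line segments give about 6.56 h, 7.50 h and 8.50 h. These sit above the curve readings quoted in this task, so it is worth checking the accepted ranges against the graph students have actually drawn.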
Task 2c · Percentiles
\[P_{20}:\ CF = 0.20\times80 = 16 \quad\Rightarrow\quad P_{20}\approx 6.2\ \text{h}\] \[P_{90}:\ CF = 0.90\times80 = 72 \quad\Rightarrow\quad P_{90}\approx 9.4\ \text{h}\]
\(P_{20}\approx 6.2\) h  |  \(P_{90}\approx 9.4\) h

Accept: \(P_{20}\in[6.0,6.3]\) | \(P_{90}\in[9.3,9.6]\)


Contextual interpretation of \(P_{90}\): 90% of the surveyed residents aged 13–25 sleep fewer than approximately 9.4 hours per night on a typical school/work night. Only 10% sleep more than this.

Watch for: Students must interpret in context — not just state "90% of the data lies below this value." The answer must reference sleep, residents, and the study.
Task 2d · IQR and range
\[\text{IQR} = Q_3 - Q_1 \approx 8.2 - 6.4 = 1.8\ \text{h}\] \[\text{Range} = x_{\max} - x_{\min} = 11 - 4 = 7\ \text{h}\]
IQR \(\approx 1.8\) h  |  Range \(=7\) h

IQR: accept any value consistent with the student's own \(Q_1\) and \(Q_3\).


IQR vs range: The IQR measures the spread of the middle 50% of the data and is resistant to extreme values and outliers. The range uses only the two most extreme values and can be heavily distorted by a single outlier, giving a misleading impression of typical spread.
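The resistance of the IQR can be shown with a small made-up dataset (the values below are hypothetical, not the study data). Quartiles here come from Python's `statistics.quantiles` with its default method, which may differ slightly from GDC conventions.

```python
import statistics

def iqr(data):
    """Interquartile range using statistics.quantiles (default 'exclusive' method)."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return q3 - q1

def data_range(data):
    return max(data) - min(data)

# Hypothetical sleep times in hours (illustration only).
sleep = [5.5, 6.0, 6.5, 7.0, 7.0, 7.5, 8.0, 8.5, 9.0]
with_outlier = sleep + [14.0]      # one extreme value added

range_change = data_range(with_outlier) - data_range(sleep)
iqr_change = iqr(with_outlier) - iqr(sleep)
print(range_change, iqr_change)
```

A single extreme value moves the range by several hours while the IQR changes only slightly.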

Phase 3

Mean, Standard Deviation & Box Plot

SL & HL · 4.3
Task 3a · Estimate the mean from grouped data
\[\bar{x} = \frac{\displaystyle\sum f_i m_i}{\displaystyle\sum f_i} = \frac{(3)(4.5)+(7)(5.5)+(18)(6.5)+(24)(7.5)+(16)(8.5)+(9)(9.5)+(3)(10.5)}{80}\] \[= \frac{13.5+38.5+117+180+136+85.5+31.5}{80} = \frac{602}{80} = 7.525\ \text{h}\]
\(\bar{x}\approx 7.53\) h
This is an estimate because mid-interval values are used as a proxy for all data within each class. Accept \(\bar{x}=7.525\) or rounded to 3 s.f. Full working must be visible on the board even if students verify with GDC.
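The same estimate can be reproduced in a few lines as a check on the hand working (variable names are illustrative):

```python
midpoints = [4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5]
frequencies = [3, 7, 18, 24, 16, 9, 3]

n = sum(frequencies)                                         # total frequency
sum_fm = sum(f * m for f, m in zip(frequencies, midpoints))  # sum of f_i * m_i
mean = sum_fm / n
print(n, sum_fm, mean)
```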
Task 3b · Standard deviation: sample vs population

The two values the GDC gives:

\[\sigma_n = \sqrt{\frac{\sum f_i(m_i-\bar{x})^2}{n}} \approx 1.38\ \text{h} \qquad (\text{population SD, divides by }n)\] \[\sigma_{n-1} = \sqrt{\frac{\sum f_i(m_i-\bar{x})^2}{n-1}} \approx 1.39\ \text{h} \qquad (\text{sample SD, divides by }n-1)\]

Correct choice:

\(\sigma_n\approx 1.38\) h — population standard deviation

Justification: The 80 surveyed residents are the complete dataset being analysed — we are not using a sample to estimate a larger unknown population parameter. We treat the 80 values as the full population of interest, so we divide by \(n\).


Contextual interpretation: A typical resident's nightly sleep differs from the mean (\(\approx 7.5\) h) by about 1.38 hours. The researcher prefers \(\sigma\) over the range because \(\sigma\) incorporates every data value, not just the two extremes, making it a more stable and informative measure of spread.

Key teaching moment: Both \(\sigma_n\) and \(\sigma_{n-1}\) are defensible depending on framing. IB convention: unless told otherwise, the dataset is the population, so \(\sigma_n\) is expected. Credit students who argue for \(\sigma_{n-1}\) with a clear justification that the 80 people are a sample from a larger city population.
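Both GDC values can be reproduced by expanding the grouped table into 80 midpoint values. This is a sketch assuming each observation is replaced by its class midpoint, as in the formulas above; `pstdev` divides by \(n\) and `stdev` divides by \(n-1\).

```python
import statistics

midpoints = [4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5]
frequencies = [3, 7, 18, 24, 16, 9, 3]

# Expand the grouped table: each midpoint repeated according to its frequency.
values = [m for m, f in zip(midpoints, frequencies) for _ in range(f)]

sigma_n = statistics.pstdev(values)          # population SD, divides by n
sigma_n_minus_1 = statistics.stdev(values)   # sample SD, divides by n - 1
print(round(sigma_n, 2), round(sigma_n_minus_1, 2))
```

With \(n = 80\) the two values differ only in the second decimal place, which is a useful point to raise when students ask whether the choice matters.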
Task 3c · Outlier boundaries

Using the GDC quartile values based on midpoints:

\[Q_1 = 6.5\ \text{h},\quad Q_3 = 8.5\ \text{h},\quad \text{IQR} = 8.5 - 6.5 = 2.0\ \text{h}\] \[\text{Lower fence} = Q_1 - 1.5\times\text{IQR} = 6.5 - 1.5(2.0) = 6.5 - 3.0 = 3.5\ \text{h}\] \[\text{Upper fence} = Q_3 + 1.5\times\text{IQR} = 8.5 + 1.5(2.0) = 8.5 + 3.0 = 11.5\ \text{h}\] \[x_{\min}=4.5\ \text{h} > 3.5\ \checkmark \qquad x_{\max}=10.5\ \text{h} < 11.5\ \checkmark\]
No outliers in this dataset.

Contextual meaning: All sleep durations in the sample fall within the expected range. No resident's sleep time is extreme enough to be flagged as an outlier relative to the rest of the group.

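Note: The fences here come from midpoint-based quartiles, and the extreme values are taken as the end midpoints (4.5 and 10.5) rather than the class boundaries (4 and 11), so no outliers appear. With the ogive quartiles from Task 2b (\(Q_1\approx6.4\), \(Q_3\approx8.2\), IQR \(\approx1.8\)) the fences would be \(3.7\) h and \(10.9\) h, narrower than \(3.5\) h and \(11.5\) h. This is a good discussion point: the choice of how to represent grouped data affects statistical conclusions.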
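The fence calculation can be sketched as follows. The quartile rule used here (median of the lower and upper halves of the ordered data) is one common convention and is an assumption about how the GDC values were obtained.

```python
import statistics

midpoints = [4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5]
frequencies = [3, 7, 18, 24, 16, 9, 3]
values = sorted(m for m, f in zip(midpoints, frequencies) for _ in range(f))

half = len(values) // 2                      # for odd n this excludes the median
q1 = statistics.median(values[:half])        # median of the lower half
q3 = statistics.median(values[-half:])       # median of the upper half
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]
print(q1, q3, lower_fence, upper_fence, outliers)
```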
Task 3d · Box and whisker diagram

Five-number summary (midpoint-based, no outliers):

\[\min=4.5\ \text{h},\quad Q_1=6.5\ \text{h},\quad Q_2=7.5\ \text{h},\quad Q_3=8.5\ \text{h},\quad \max=10.5\ \text{h}\]
[Figure: Box and whisker diagram, full sample (n = 80). Axis: hours of sleep (h), 4 to 11. Marked values: min = 4.5, Q₁ = 6.5, Q₃ = 8.5, max = 10.5, with x̄ ≈ 7.53 indicated.]

Mean vs median: \(\bar{x}\approx7.53\) h, \(Q_2=7.5\) h. The mean and median are very close, suggesting the distribution is approximately symmetric with only a very slight positive skew.

What to look for on the board: Consistent scale; box from \(Q_1=6.5\) to \(Q_3=8.5\); vertical line at \(Q_2=7.5\); whiskers extending to min midpoint (4.5) and max midpoint (10.5); no outliers. Students should explicitly compare mean and median to comment on shape.
Phase 4

Extension — Comparing Two Groups

SL & HL · 4.2–4.3
Reference · Teenager sub-group (aged 13–17)
\[\bar{x}=7.0\ \text{h},\quad\sigma=1.45\ \text{h},\quad Q_1=6.1\ \text{h},\quad Q_2=6.9\ \text{h},\quad Q_3=8.0\ \text{h},\quad x_{\min}=4\ \text{h},\quad x_{\max}=10\ \text{h}\] \[\text{IQR}=8.0-6.1=1.9\ \text{h}\qquad\text{Range}=10-4=6\ \text{h}\] \[\text{Lower fence}=6.1-1.5(1.9)=3.25\ \text{h}\qquad\text{Upper fence}=8.0+1.5(1.9)=10.85\ \text{h}\] \[x_{\min}=4>3.25\ \checkmark\qquad x_{\max}=10<10.85\ \checkmark\quad\Rightarrow\text{ no outliers}\]
Task 4a · Parallel box plots
[Figure: Parallel box and whisker diagrams, full sample vs teenagers (aged 13–17). Shared axis: hours of sleep (h), 4 to 11. Full sample (n = 80): min 4.5, Q₁ = 6.5, Q₂ = 7.5, Q₃ = 8.5, max 10.5. Teenagers 13–17: min 4, Q₁ = 6.1, Q₂ = 6.9, Q₃ = 8.0, max 10.]
Check: Both plots on the same scale. No outliers in either group. Students should draw both on the same number line on the board with a shared scale.
Task 4b · Compare the two distributions

Model comparative statements:

  • Median: Teenagers have a lower median sleep time (\(Q_2=6.9\) h) than the full sample (\(Q_2=7.5\) h), suggesting teenagers tend to sleep less on a typical school night.
  • IQR: The IQR for teenagers (1.9 h) is slightly smaller than for the full sample (2.0 h), indicating slightly less variability in sleep duration among the middle 50% of teenagers.
  • Range: Both distributions have the same range of 6 h. The full sample spans 4.5 to 10.5 h (end midpoints) and the teenager sub-group spans 4 to 10 h, so the overall spread of sleep times is comparable in the two groups.
  • Standard deviation: Teenagers show greater overall variability (\(\sigma=1.45\) h) than the full sample (\(\sigma\approx1.38\) h), meaning individual sleep times differ more widely from the teenage mean.
Minimum expectation: 3 statements, each referencing a statistical measure AND interpreting it in context. Penalise answers that only compare numbers without linking to sleep patterns.
Task 4c · Effect of a linear transformation: hours → minutes

Multiplying all data values by 60 applies the linear transformation \(Y=60X\):

\[\bar{y} = 60\bar{x} = 60\times7.525 = 451.5\ \text{min}\] \[\sigma_Y = 60\,\sigma_X = 60\times1.38 = 82.8\ \text{min}\] \[Q_1' = 60\times6.5 = 390\ \text{min}\qquad Q_2' = 60\times7.5 = 450\ \text{min}\qquad Q_3' = 60\times8.5 = 510\ \text{min}\] \[\text{IQR}' = 60\times2.0 = 120\ \text{min}\qquad\text{Range}' = 60\times(10.5-4.5) = 60\times6 = 360\ \text{min}\]
All measures of location and spread are multiplied by 60.

What stays the same: The shape of the distribution — skewness, symmetry, relative position of the median within the box, and whether outliers exist are all unchanged.

Formal justification:

\[\text{For }Y=kX\text{ with }k>0:\qquad\mathrm{E}(kX)=k\,\mathrm{E}(X)\qquad\mathrm{Var}(kX)=k^2\,\mathrm{Var}(X)\qquad\sigma(kX)=k\,\sigma(X)\] \[\text{Quartiles are data values, so }Q_r(kX)=k\,Q_r(X)\quad\Rightarrow\quad\text{IQR and Range also scale by }k\]
Key distinction: For \(Y=X+c\) (adding a constant): location measures shift by \(c\) but spread measures (\(\sigma\), IQR, range) are unchanged. For \(Y=kX\) (multiplying): both location and spread scale by \(k\). This distinction is a high-frequency exam question — ensure students can articulate the rule, not just apply it.
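Both rules can be verified numerically on the midpoint data. This is a sketch in which the shift \(c = 1\) hour is an arbitrary illustrative choice; the scale factor 60 is the hours-to-minutes conversion from the task.

```python
import math
import statistics

midpoints = [4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5]
frequencies = [3, 7, 18, 24, 16, 9, 3]
values = [m for m, f in zip(midpoints, frequencies) for _ in range(f)]

scaled = [60 * v for v in values]    # Y = 60X (hours to minutes)
shifted = [v + 1 for v in values]    # Y = X + c with hypothetical c = 1

mean_x, sd_x = statistics.mean(values), statistics.pstdev(values)
mean_scaled, sd_scaled = statistics.mean(scaled), statistics.pstdev(scaled)
mean_shifted, sd_shifted = statistics.mean(shifted), statistics.pstdev(shifted)

print(mean_scaled, round(sd_scaled, 1))    # location and spread both scale by 60
print(mean_shifted, round(sd_shifted, 2))  # location shifts, spread is unchanged
```

The scaled standard deviation comes out slightly below the 82.8 min quoted above because the code multiplies the unrounded \(\sigma\) rather than the rounded 1.38.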