Answer Booklet · IBDP Mathematics AA · SL & HL · 4.4 · 4.10

The Soccer Scout Report
Bivariate Statistics

Not for distribution · Full worked solutions + marking guidance
🔐 Interactive HTML Unlock Codes (case-insensitive)
Phase 1 → Phase 2: SCATTER
Phase 2 → Phase 3: PEARSON
Phase 3 → Phase 4: PREDICT
Phase 4 → Complete: CHURROS
📋 Datasets for Phases 2, 3 & 4
Phases 2 & 3 — Weekly training hours vs sprint speed (youth academy, n = 12)
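Player | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12
Training hrs/wk (x) | 4 | 5 | 6 | 6 | 7 | 8 | 8 | 9 | 10 | 11 | 12 | 13
Sprint speed m/s (y) | 6.1 | 6.4 | 6.3 | 6.8 | 7.0 | 7.2 | 6.9 | 7.5 | 7.4 | 7.8 | 7.9 | 8.1
Means: \(\bar{x} = 8.25\) hrs/wk  ·  \(\bar{y} \approx 7.117\) m/s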
Phase 4 — Churros sold vs goals scored (20 home matches)
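Match | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
Churros sold (x) | 210 | 185 | 340 | 290 | 155 | 410 | 375 | 230 | 460 | 310
Goals scored (y) | 1 | 0 | 3 | 2 | 1 | 4 | 3 | 1 | 4 | 2
Match | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20
Churros sold (x) | 270 | 195 | 385 | 440 | 160 | 325 | 280 | 500 | 215 | 355
Goals scored (y) | 2 | 1 | 3 | 4 | 0 | 3 | 2 | 5 | 1 | 3
Means: \(\bar{x} = 304.5\) churros  ·  \(\bar{y} = 2.25\) goals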
Phase 1 · What Do You See?

Give code SCATTER once all groups have ranked all six plots and answered Task 1c.

Task 1a · Describe and rank the six scatter plots

Expected descriptions and rankings. Accept reasonable variation in language — the goal is that students use direction, pattern, and spread.

  • Plot A · Shots on target vs Goals (x-axis: shots on target): strong positive, r ≈ +0.97
  • Plot B · Possession % vs League points (x-axis: possession %): moderate positive, r ≈ +0.82
  • Plot C · Fouls vs Yellow cards (x-axis: fouls committed): weak positive, r ≈ +0.61
  • Plot D · Age vs Transfer value (€M) (x-axis: age in years): ⚠ non-linear (inverted U), r ≈ −0.05, which is misleading
  • Plot E · Shirt number vs Goals (x-axis: shirt number): no correlation, r ≈ 0.00
  • Plot F · Pass accuracy % vs Goals conceded (x-axis: pass accuracy %): moderate negative, r ≈ −0.79

Expected ranking (strongest to weakest, by |r|):

A (+0.97) > B (+0.82) > F (−0.79) > C (+0.61) > E (≈0) · Plot D: non-linear — a linear measure is meaningless here
Ranking discussion: Students often debate F vs B because the two are close in strength (|−0.79| = 0.79 vs 0.82) and differ in sign. Both orderings are defensible when judged by eye. The debate itself surfaces the key idea: direction and strength are separate properties of a correlation.
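For checking group rankings quickly, the ordering by strength can be reproduced by sorting on |r|. This is a minimal sketch using the approximate r values quoted for Task 1a; the dictionary name `plot_r` is illustrative, and Plot D is left out because a linear measure does not describe it.

```python
# Approximate r values from the Task 1a descriptions.
# Plot D is excluded: its relationship is non-linear.
plot_r = {"A": 0.97, "B": 0.82, "C": 0.61, "E": 0.00, "F": -0.79}

# Strength is |r|; the sign only records direction.
ranking = sorted(plot_r, key=lambda plot: abs(plot_r[plot]), reverse=True)
print(ranking)
```

Sorting on `abs(...)` rather than on r itself is the whole point: sorting on the signed value would push Plot F to the bottom, below the uncorrelated Plot E.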
Task 1b · Invent a measuring number: what properties would it need?

Generative task — no single correct answer. Look for groups who independently arrive at any of:

  • Sign encodes direction: positive for upward trend, negative for downward.
  • Bounded range: a maximum (perfect positive) and minimum (perfect negative) so comparisons are meaningful.
  • Zero = no relationship: the number should equal zero when there is no pattern.
  • Closeness to ±1 reflects tightness: points clustered tightly around a line → close to ±1; wide scatter → close to 0.
Do not name Pearson's r here. Let groups write their invented properties on the board. Open Phase 2 with: "Turns out someone already invented exactly this number. Here it is — check whether it has all the properties you described."
Task 1c · Which plot would your measuring number mislead you on, and why?

Target answer: Plot D (Age vs Transfer value).

The relationship is clearly curved — transfer value rises through a player's twenties and falls sharply after ~27–30. A linear measuring number would give a value near zero, making it appear there is no relationship — which is obviously false.

A linear measuring number is misleading whenever the underlying relationship is non-linear.
Call back to this explicitly in Phase 2 when introducing Pearson's r: "This is exactly why the formula is only valid for linear relationships — we saw this with Plot D."
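A quick numerical illustration of the Plot D effect, for teacher reference. The data below are hypothetical (a perfectly symmetric inverted U peaking at age 27, not the Plot D dataset), and `pearson_r` is a helper written for this sketch. The relationship is exact, yet the linear measure comes out as zero because the positive and negative deviation products cancel.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r computed from the S_xx, S_yy, S_xy sums."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical, perfectly symmetric inverted U (NOT the Plot D data):
# value peaks at age 27 and falls off equally on both sides.
ages = list(range(22, 33))                        # 22, 23, ..., 32
values = [50 - (age - 27) ** 2 for age in ages]   # illustrative €M figures

r = pearson_r(ages, values)
print(round(r, 3))
```

Real transfer data would not be perfectly symmetric, which is why Plot D shows a small non-zero value such as −0.05 rather than exactly zero.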
Phase 2 · Someone Already Invented It

Give code PEARSON once all groups have found the regression line and made their first prediction.

Task 2a · Calculate Pearson's r by hand using the formula

The activity introduces the formula directly on screen — no teacher introduction needed. Students read the definition of \(S_{xx}\), \(S_{yy}\), \(S_{xy}\) on the page and begin the table immediately.

\[r = \frac{S_{xy}}{\sqrt{S_{xx} \cdot S_{yy}}} \qquad \text{where} \quad S_{xx}=\sum(x_i-\bar{x})^2,\quad S_{yy}=\sum(y_i-\bar{y})^2,\quad S_{xy}=\sum(x_i-\bar{x})(y_i-\bar{y})\]

Step 1 — compute the means:

\[\bar{x} = \frac{99}{12} = 8.25 \text{ hrs/wk} \qquad \bar{y} = \frac{85.4}{12} \approx 7.117 \text{ m/s}\]

Step 2 — complete the calculation table:

\(x_i\) | \(y_i\) | \(x_i - \bar{x}\) | \(y_i - \bar{y}\) | \((x_i-\bar{x})^2\) | \((y_i-\bar{y})^2\) | \((x_i-\bar{x})(y_i-\bar{y})\)
4 | 6.1 | −4.25 | −1.017 | 18.063 | 1.034 | 4.321
5 | 6.4 | −3.25 | −0.717 | 10.563 | 0.514 | 2.329
6 | 6.3 | −2.25 | −0.817 | 5.063 | 0.667 | 1.838
6 | 6.8 | −2.25 | −0.317 | 5.063 | 0.100 | 0.713
7 | 7.0 | −1.25 | −0.117 | 1.563 | 0.014 | 0.146
8 | 7.2 | −0.25 | +0.083 | 0.063 | 0.007 | −0.021
8 | 6.9 | −0.25 | −0.217 | 0.063 | 0.047 | 0.054
9 | 7.5 | +0.75 | +0.383 | 0.563 | 0.147 | 0.288
10 | 7.4 | +1.75 | +0.283 | 3.063 | 0.080 | 0.496
11 | 7.8 | +2.75 | +0.683 | 7.563 | 0.467 | 1.879
12 | 7.9 | +3.75 | +0.783 | 14.063 | 0.614 | 2.938
13 | 8.1 | +4.75 | +0.983 | 22.563 | 0.967 | 4.671
Sums → |  |  |  | \(S_{xx} = 88.25\) | \(S_{yy} = 4.657\) | \(S_{xy} = 19.65\)

Step 3 — substitute into the formula:

\[r = \frac{19.65}{\sqrt{88.25 \times 4.657}} = \frac{19.65}{\sqrt{411.0}} = \frac{19.65}{20.27} \approx 0.969\]
r ≈ 0.969  (3 s.f.) — strong positive linear correlation
Divide the table work: each group member takes 3–4 rows, then reconvene to sum the columns. The hand calculation reveals what r actually measures: a normalised sum of paired deviations. When both variables are above their means the product is positive; when both are below, it is also positive. A consistently positive sum means the variables rise and fall together. Point out row 6 (player 6: x = 8, y = 7.2): x is just below the mean but y is just above, so the product is negative and pulls r down slightly. This is what scatter looks like numerically.
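For checking a group's column sums without a GDC, the table can be reproduced in a few lines. This is a sketch that follows the three steps above using the Phase 2 dataset; the variable names are illustrative.

```python
from math import sqrt

# Phase 2 dataset: weekly training hours (x) and sprint speed in m/s (y).
x = [4, 5, 6, 6, 7, 8, 8, 9, 10, 11, 12, 13]
y = [6.1, 6.4, 6.3, 6.8, 7.0, 7.2, 6.9, 7.5, 7.4, 7.8, 7.9, 8.1]

# Step 1: means.
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Step 2: the three column sums from the calculation table.
products = [(xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)]
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum(products)

# Step 3: substitute into the formula.
r = s_xy / sqrt(s_xx * s_yy)

# Rows (1-based) whose deviation product is negative, i.e. that pull r down.
negative_rows = [i + 1 for i, p in enumerate(products) if p < 0]

print(round(s_xx, 2), round(s_yy, 3), round(s_xy, 2), round(r, 3))
print(negative_rows)
```

Only one row has a negative product, which is why r stays so close to +1 for this dataset.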
Task 2b · Verify with GDC and interpret r = 0.969

GDC steps are printed on the student page — students follow them independently. Your role here is to circulate and check that DiagnosticOn / Stat Wind settings are active before anyone gets stuck.

GDC gives r = 0.969 — matches the hand calculation ✓

The GDC applies the same formula internally. The hand calculation was for understanding; the GDC gives speed in examinations.

Interpretation: There is a strong positive linear correlation between weekly training hours and sprint speed. Players who train more hours per week tend to have higher sprint speeds, and the relationship is well-described by a straight line.

  • "What does 'strong' mean here?" r = 0.969 is very close to +1 — points cluster tightly around a line with very little scatter.
  • "Does training cause faster sprinting?" Not necessarily — we only know the two variables move together. (Planted seed for Phase 4.)
GDC note: Students must have DiagnosticOn active to see r. TI-Nspire: Statistics → Linear Regression (mx+b). Casio fx-CG50: STAT → CALC → ax+b. If r is missing, DiagnosticOn is likely off — pre-empt this before the phase.

Statistical significance testing is not part of the AA syllabus — do not introduce it. r is sufficient as a descriptive measure. If students ask, acknowledge it exists but say it is beyond this course.
Task 2c · Find the mean point, sketch a line of best fit, then find the regression line of y on x using GDC

Mean point from the sums above:

Mean point: \((\bar{x},\,\bar{y}) = (8.25,\; 7.117)\)

Students plot this point on their scatter diagram and draw a line through it that best follows the trend. The constraint is explicit: the line must pass through \((\bar{x}, \bar{y})\).

Discovery prompt before giving the GDC line: ask "why must the regression line pass through the mean point?" Let groups reason first. Intuition: the mean point is the centre of gravity of the data — a line that misses it is systematically over-predicting on one side and under-predicting on the other.

The regression coefficients follow directly from the table sums:

\[b = \frac{S_{xy}}{S_{xx}} = \frac{19.65}{88.25} \approx 0.2227 \qquad a = \bar{y} - b\bar{x} = 7.117 - 0.2227 \times 8.25 \approx 5.280\]
Regression line of y on x:   ŷ = 0.223x + 5.28  (3 s.f.)

Verify mean point: \(\hat{y}(8.25) = 0.2227(8.25) + 5.280 = 1.837 + 5.280 = 7.117\) ✓
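The check works for any dataset, not only this one, because of how the intercept is defined. A one-line supporting identity, substituting \(a = \bar{y} - b\bar{x}\) into the line at \(x = \bar{x}\):

\[\hat{y}(\bar{x}) = a + b\bar{x} = (\bar{y} - b\bar{x}) + b\bar{x} = \bar{y}\]

This can be offered after groups have attempted the "centre of gravity" reasoning for themselves.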

Students compare their hand-sketched line to the GDC line. In most cases these are close but not identical — the GDC minimises the sum of squared vertical residuals, which the eye cannot do exactly.
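The comparison between a sketched line and the least-squares line can be made concrete. The sketch below computes b and a from the Phase 2 data and compares the sum of squared vertical residuals with that of a hypothetical hand-drawn line (through the mean point, gradient 0.25 chosen by eye; this value is an assumption for illustration, not a student result).

```python
# Phase 2 dataset: weekly training hours (x) and sprint speed in m/s (y).
x = [4, 5, 6, 6, 7, 8, 8, 9, 10, 11, 12, 13]
y = [6.1, 6.4, 6.3, 6.8, 7.0, 7.2, 6.9, 7.5, 7.4, 7.8, 7.9, 8.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = s_xy / s_xx          # gradient of the y on x line
a = y_bar - b * x_bar    # intercept; forces the line through the mean point

def sse(intercept, gradient):
    """Sum of squared vertical residuals for y = intercept + gradient * x."""
    return sum((yi - (intercept + gradient * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical hand-sketched line: through the mean point, gradient 0.25 by eye.
b_hand = 0.25
a_hand = y_bar - b_hand * x_bar

print(round(b, 4), round(a, 3))
print(sse(a, b) < sse(a_hand, b_hand))
```

Any line other than the least-squares one gives a larger sum of squared residuals, so the final comparison holds whatever gradient a student sketches.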

Task 2d · Use the regression line to predict sprint speed for a player training 10 hrs/wk

x = 10 lies within the data range (4–13) — this is an interpolation.

\[\hat{y} = 0.2227(10) + 5.280 = 2.227 + 5.280 = 7.507 \approx 7.51 \text{ m/s}\]
Predicted sprint speed ≈ 7.51 m/s

The scout now has a concrete tool: give a training load, get a predicted sprint speed. Phase 3 will examine how far this tool can be trusted — and in which direction.

Phase 3 · How Far Can You Trust It?

Give code PREDICT once all groups have found the x on y line and completed Task 3c.

Task 3a · Interpolation vs extrapolation: which predictions do you trust?

Using the regression line from Phase 2 (ŷ = 0.223x + 5.28), classify and assess three predictions:

x value | In range? (4–13 hrs) | Predicted ŷ | Verdict
9 hrs/wk | ✓ Interpolation | 0.2227(9) + 5.280 = 7.28 m/s | Reliable
12 hrs/wk | ✓ Interpolation (near edge) | 0.2227(12) + 5.280 = 7.95 m/s | Reliable, with mild caution
25 hrs/wk | ✗ Extrapolation | 0.2227(25) + 5.280 = 10.85 m/s | Unreliable

Discussion questions for the whiteboard:

  • "Why can't we trust x = 25?" There is no evidence the linear relationship holds beyond 4–13 hrs. The predicted 10.85 m/s would put a youth academy player among elite senior sprinters — physically implausible, and completely unsupported by the data.
  • "Is x = 12 as reliable as x = 9?" Both are technically interpolations, but x = 9 is near the centre of the data while x = 12 is near the edge (max observed is 13). Predictions near the mean point have least uncertainty; predictions near the boundary of the data range have more. Proximity to the centre matters, not just being inside the range.
  • "What about x = 13.5 — just barely outside?" Technically extrapolation, but marginally. There is no hard threshold — the question is how much we trust the linear trend to continue just beyond the data. Judgment is required, not just rule-following.
  • "Is extrapolation ever acceptable?" Sometimes, with strong theoretical or physical reasons to believe the model extends. Here we have none — for instance, there may be diminishing returns or injury risk at very high training loads.
Key takeaway: The regression line is a model of the data we have. It describes what happened within the observed range. Inside that range (interpolation) is reasonable. Outside (extrapolation) is an assumption — the further outside, the less trustworthy. The line has no knowledge of what happens beyond the data.
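The Task 3a classification can be sketched as a prediction that also reports whether it sits inside the observed range. The coefficients are rebuilt from the Phase 2 sums; the function name `predict` and the hard range check are illustrative assumptions, and the check deliberately ignores the judgment calls discussed above (for example x = 13.5).

```python
# Coefficients rebuilt from the Phase 2 sums (S_xy = 19.65, S_xx = 88.25).
B = 19.65 / 88.25            # gradient of the y on x line
A = 85.4 / 12 - B * 8.25     # intercept: y_bar - b * x_bar

# Observed range of weekly training hours in the dataset.
X_MIN, X_MAX = 4, 13

def predict(hours):
    """Return (predicted sprint speed in m/s, classification of the prediction)."""
    y_hat = A + B * hours
    kind = "interpolation" if X_MIN <= hours <= X_MAX else "extrapolation"
    return round(y_hat, 2), kind

for hours in (9, 12, 25):
    print(hours, predict(hours))
```

The label only says whether the input lies inside 4–13 hrs/wk. It says nothing about how close the input is to the mean point, which is the finer distinction between x = 9 and x = 12.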
Task 3b · The reverse question: rearrange first, then x on y line via GDC, then contrast

Students work through three steps independently — the activity page guides all of it. Your role is to circulate and prompt the Step 3 discussion if groups accept the two answers without comparing them.

Step 1 — rearranging the y-on-x line (what most groups do first):

\[\text{Rearranging } \hat{y} = 0.2227x + 5.280 \text{ gives } x = \frac{y - 5.280}{0.2227}, \text{ so for } y = 7.6: \quad x = \frac{7.6 - 5.280}{0.2227} \approx 10.42 \text{ hrs/wk}\]

Now find the proper x on y regression line using the same sums from Phase 2:

\[d = \frac{S_{xy}}{S_{yy}} = \frac{19.65}{4.657} \approx 4.220 \qquad c = \bar{x} - d\bar{y} = 8.25 - 4.220 \times 7.117 \approx -21.78\]
Regression line of x on y:   x̂ = 4.22y − 21.8  (3 s.f.)

Verify mean point: \(\hat{x}(7.117) = 4.220(7.117) - 21.78 = 30.03 - 21.78 = 8.25\) ✓

Predict x for y = 7.6 m/s:

\[\hat{x} = 4.220(7.6) - 21.78 = 32.07 - 21.78 = 10.29 \approx 10.3 \text{ hrs/wk}\]
Proper prediction: ≈ 10.3 hrs/wk  vs  rearranged line: ≈ 10.4 hrs/wk — different answers
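Both answers can be reproduced from the Phase 2 sums. This is a minimal sketch with illustrative variable names; it keeps full precision until the final rounding, so intermediate values may differ slightly in the last digit from the rounded working above.

```python
# Sums and means from Phase 2.
S_XX, S_YY, S_XY = 88.25, 4.657, 19.65
X_BAR, Y_BAR = 8.25, 85.4 / 12

# y on x line: y_hat = a + b * x
b = S_XY / S_XX
a = Y_BAR - b * X_BAR

# x on y line: x_hat = c + d * y
d = S_XY / S_YY
c = X_BAR - d * Y_BAR

y_target = 7.6  # sprint speed in m/s

x_rearranged = (y_target - a) / b   # what most groups do first
x_proper = c + d * y_target         # the x on y regression line

print(round(x_rearranged, 2), round(x_proper, 2))
```

The two values agree to the nearest hour but not to one decimal place, which is the contrast the Step 3 discussion is meant to draw out.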
Task 3c · Why are there two different lines? When do you use each?

This is the conceptual heart of Phase 3. Push groups to articulate the answer on their boards before giving it.

What each line minimises:

  • y on x line minimises the sum of squared vertical distances from each point to the line — i.e. errors in predicting y.
  • x on y line minimises the sum of squared horizontal distances — i.e. errors in predicting x.

These are genuinely different optimisation problems, so they give different lines. They only coincide when r = ±1 (perfect linear fit — all points on the line, no residuals in either direction). As |r| decreases, the angle between them opens up.

\[\text{Gradient of } y \text{ on } x: \quad b = \frac{S_{xy}}{S_{xx}} \approx 0.223\] \[\text{Gradient of } x \text{ on } y \text{ (as a function of } y \text{)}: \quad d = \frac{S_{xy}}{S_{yy}} \approx 4.220\] \[\text{Both lines pass through } (\bar{x},\bar{y}) = (8.25,\; 7.117)\]
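To compare the two gradients on the same axes, a short supporting derivation (not from the student page): drawn on the usual scatter diagram with x horizontal and y vertical, the x on y line \(\hat{x} = c + dy\) has gradient \(1/d\).

\[\frac{1}{d} = \frac{1}{4.220} \approx 0.237 \qquad b \cdot d = \frac{S_{xy}}{S_{xx}} \cdot \frac{S_{xy}}{S_{yy}} = \frac{S_{xy}^2}{S_{xx}\,S_{yy}} = r^2\]

So \(b = r^2 \cdot \frac{1}{d}\), and the two gradients (0.223 and 0.237 here) are equal only when \(r^2 = 1\). As a check, \(b \cdot d = 0.2227 \times 4.220 \approx 0.940\), which agrees with \(r^2\) for this dataset.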

Rule for which line to use:

Predicting y from a known x → use the y on x line
Predicting x from a known y → use the x on y line
Never rearrange one to substitute for the other
Board to draw together: sketch both lines on a single scatter diagram with the mean point marked. Show them crossing at (8.25, 7.117) and diverging on either side. The gap between the lines at extreme x or y values makes the point viscerally — they are not the same line, and the gap is not a rounding error.

Why does rearranging the y-on-x line give the wrong answer? Because rearranging doesn't change which residuals the line was fitted to minimise. You are still using a line that was optimised to predict y, not x. For this dataset the answers are close (10.3 vs 10.4) because r is very high. Tell groups: try to imagine how far apart the lines would be if r = 0.5. That is when this distinction becomes critical.
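The widening gap between the two lines can be tabulated for the board. The sketch below uses the Phase 2 sums and writes both lines in mean-point form; the function name `gap` is illustrative. It reports the rearranged answer minus the proper x on y answer at several sprint speeds.

```python
# Sums and means from Phase 2.
S_XX, S_YY, S_XY = 88.25, 4.657, 19.65
X_BAR, Y_BAR = 8.25, 85.4 / 12

b = S_XY / S_XX   # gradient of y on x
d = S_XY / S_YY   # gradient of x on y

def gap(y_value):
    """Rearranged y-on-x answer minus proper x-on-y answer, in hrs/wk."""
    # Both lines pass through the mean point, so each is written
    # as a deviation from (X_BAR, Y_BAR).
    x_rearranged = X_BAR + (y_value - Y_BAR) / b
    x_proper = X_BAR + d * (y_value - Y_BAR)
    return x_rearranged - x_proper

for y_value in (6.1, Y_BAR, 7.6, 8.1):
    print(round(y_value, 3), round(gap(y_value), 3))
```

The gap is zero at the mean point and grows in proportion to the distance from it, which is the crossing-and-diverging picture described for the board sketch.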
Phase 4 · The Churros Conspiracy

Give code CHURROS once all groups have written their verdict and invented a spurious correlation.

Task 4a · Run the numbers: r, regression line, prediction for 600 churros

Students run the full analysis before drawing any conclusions. Do not pre-empt the tension.

\[S_{xx} = 201{,}595 \qquad S_{yy} = 37.75 \qquad S_{xy} = 2{,}682.5\] \[r = \frac{2682.5}{\sqrt{201595 \times 37.75}} = \frac{2682.5}{2758.7} \approx 0.972\]
r = 0.972  (3 s.f.) — very strong positive correlation
\[b = \frac{2682.5}{201595} \approx 0.01331 \qquad a = 2.25 - 0.01331 \times 304.5 \approx -1.802\] \[\hat{y} = 0.01331(600) - 1.802 \approx 6.18 \text{ goals}\]
Regression line: ŷ = 0.0133x − 1.80  ·  Predicted goals for 600 churros ≈ 6.18
Let the numbers land before Task 4b. r = 0.972 is stronger than the training dataset. The maths is unambiguous — do not rush to explain it away. Let groups sit with it.
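For checking group answers, the Phase 4 figures can be reproduced directly from the 20-match dataset. A minimal sketch with illustrative variable names:

```python
from math import sqrt

# Phase 4 dataset: churros sold (x) and goals scored (y) over 20 home matches.
churros = [210, 185, 340, 290, 155, 410, 375, 230, 460, 310,
           270, 195, 385, 440, 160, 325, 280, 500, 215, 355]
goals = [1, 0, 3, 2, 1, 4, 3, 1, 4, 2,
         2, 1, 3, 4, 0, 3, 2, 5, 1, 3]

n = len(churros)
x_bar = sum(churros) / n
y_bar = sum(goals) / n

s_xx = sum((x - x_bar) ** 2 for x in churros)
s_yy = sum((y - y_bar) ** 2 for y in goals)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(churros, goals))

r = s_xy / sqrt(s_xx * s_yy)
b = s_xy / s_xx
a = y_bar - b * x_bar

# 600 churros lies outside the observed range, so this is an extrapolation.
predicted_goals = a + b * 600

print(s_xx, s_yy, s_xy)
print(round(r, 3), round(b, 5), round(a, 3), round(predicted_goals, 2))
```

The code confirms the arithmetic only. Whether the prediction means anything is the question for Task 4b.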
Task 4b · Question the claim: guided discussion questions

The activity scaffolds the discovery through five guided questions. Expected responses:

  • Is 600 churros interpolation or extrapolation? Extrapolation — the data range is 155–500. The prediction is already unreliable on that basis alone.
  • When do matches tend to have large crowds? Important fixtures — rivals, cup games, top-of-table clashes, home derbies.
  • Would those matches also tend to have more goals? Yes: higher stakes, more attacking play and home crowd energy all contribute, and stronger opposition can draw a more intense performance from both sides.
  • What is actually driving both variables? Crowd size / match importance. Big matches → more fans → more churros sold AND more goals. The churros are irrelevant.
  • Would the prediction hold for a quiet Tuesday league match? No — if churros caused goals, it should work regardless of context. The fact that it doesn't exposes the lurking variable.
The goal is for students to name the lurking variable themselves through the questioning sequence. Do not offer it. If a group is stuck after the crowd question, ask: "So if you sold 600 churros at a 200-person Tuesday match, do you expect 6 goals?" That usually breaks the impasse.
Task 4c · The verdict
r = 0.972 shows a very strong positive relationship — but the prediction for 600 churros is an extrapolation and therefore unreliable. More importantly, both variables are driven by crowd size: big matches sell more food and produce more goals. Selling churros will not improve results. The sponsor's claim does not hold up.
Precision point for consolidation: The maths is correct — r = 0.972 and the regression line accurately describe the co-variation in this dataset. What is wrong is the causal interpretation: assuming that because two variables move together, one must be causing the other. This distinction is examinable — students should be able to identify a lurking variable and explain why a prediction based on a spurious relationship is unreliable.
🎯 Consolidation — Boards to Select for Debrief