The Soccer Scout Report — Teacher Answers

🔐 Student Activity Unlock Codes (case-insensitive)

Phase 1 → 2	SCATTER
Phase 2 → 3	PEARSON
Phase 3 → 4	PREDICT
Phase 4 → Done	CHURROS

Phase 2 Dataset — Youth Academy Sprint Training (n = 12)

x (hrs/wk)	4	5	6	6	7	8	8	9	10	11	12	13
y (m/s)	6.1	6.4	6.3	6.8	7.0	7.2	6.9	7.5	7.4	7.8	7.9	8.1
Means	\(\bar{x} = 8.25\) hrs/wk · \(\bar{y} = 7.117\) m/s

Phase 4 Dataset — Churros sold vs goals scored (20 home matches)

Match	1	2	3	4	5	6	7	8	9	10
Churros (x)	210	185	340	290	155	410	375	230	460	310
Goals (y)	1	0	3	2	1	4	3	1	4	2

Match	11	12	13	14	15	16	17	18	19	20
Churros (x)	270	195	385	440	160	325	280	500	215	355
Goals (y)	2	1	3	4	0	3	2	5	1	3
Means	\(\bar{x} = 304.5\) churros · \(\bar{y} = 2.25\) goals

🎨 Oral Launch — Read Aloud to Class

"A football scout has been collecting data on youth academy players. Before you analyse any numbers, she needs you to look at some charts and tell her what you see. Your job today: figure out how to measure relationships in data, build the tool mathematicians invented for exactly this, and then use it to call out a very suspicious claim."

What Do You See?

Give code SCATTER once all groups have ranked all six plots and answered Task 1c.

Task 1a: Describe and rank the six scatter plots

Expected descriptions and rankings (accept reasonable variation in ranking of middle plots):

Plot A (Training vs Sprint): Strong positive linear — rank 1 or 2.
Plot B (Possession vs Goals): Moderate positive linear — rank 3.
Plot C (Fouls vs Yellow cards): Strong positive linear — rank 1 or 2.
Plot D (Age vs Transfer value): Curved (non-linear) relationship — misleading for linear measure.
Plot E (Shirt number vs Goals): No relationship — random scatter, rank last.
Plot F (Goals conceded vs Rank): Strong positive linear — rank 1–3.

The ranking exercise is deliberately ambiguous for the middle plots. What matters is the reasoning — push groups to justify using directional language ("as x increases, y tends to…") and words like "tight", "scattered", "curved".

Task 1b: Invent a measuring number — what properties would it need?

Target properties students should identify:

Bounded: Should have a maximum and minimum value so comparisons make sense.
Direction: Should be positive for positive relationships, negative for negative ones, and zero for no relationship.
Strength: Should be larger (in magnitude) for stronger relationships.
Scale-free: Should not change if you change units (e.g. convert km to miles).

Use this as a bridge to Phase 2: "You described exactly what Pearson invented in 1895. Here it is — check whether it has all the properties you described."
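A quick numeric check of these properties, as a minimal sketch. It borrows the Phase 2 sprint dataset and a hand-rolled helper called `pearson_r`; the helper name and the unit conversions (minutes, km/h) are illustrative choices, not part of the student activity.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r computed from the sums of squares S_xx, S_yy and S_xy."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / sqrt(s_xx * s_yy)

# Phase 2 dataset: training hours per week and sprint speed in m/s
hours = [4, 5, 6, 6, 7, 8, 8, 9, 10, 11, 12, 13]
speed_ms = [6.1, 6.4, 6.3, 6.8, 7.0, 7.2, 6.9, 7.5, 7.4, 7.8, 7.9, 8.1]

# The same measurements in different units (illustrative conversions)
minutes = [60 * h for h in hours]
speed_kmh = [3.6 * v for v in speed_ms]

r_original = pearson_r(hours, speed_ms)
r_converted = pearson_r(minutes, speed_kmh)            # scale-free: should match r_original
r_flipped = pearson_r(hours, [-v for v in speed_ms])   # direction: sign should flip

print(round(r_original, 3), round(r_converted, 3), round(r_flipped, 3))
```

Changing units leaves the value unchanged, reversing the direction of one variable flips its sign, and the result stays between −1 and 1.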

Task 1c: Which plot would your measuring number mislead you on, and why?

Target answer: Plot D (Age vs Transfer value)

The relationship is clearly curved — transfer value rises through a player's twenties and falls sharply after ~27–30. A linear measuring number would give a value near zero, making it appear there is no relationship — which is obviously false.

Key insight

A linear measuring number is misleading whenever the underlying relationship is non-linear. Call back to this in Phase 2: "This is exactly why Pearson's r is only valid for linear relationships."
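The effect can be demonstrated with invented numbers. The sketch below uses a hypothetical inverted-U dataset (not the values behind Plot D) in which value peaks at age 27 and falls symmetrically either side; `pearson_r` is a helper written for this example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r computed from the sums of squares S_xx, S_yy and S_xy."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical data, invented for illustration only:
# ages 19..35, value (arbitrary units) peaking at age 27.
ages = list(range(19, 36))
values = [40 - 0.5 * (age - 27) ** 2 for age in ages]

r_all = pearson_r(ages, values)             # whole curve
r_rising = pearson_r(ages[:9], values[:9])  # ages 19..27 only (rising half)

print(f"whole curve: r = {abs(r_all):.3f}")
print(f"rising half: r = {r_rising:.3f}")
```

Over the whole curve the positive and negative contributions to \(S_{xy}\) cancel, so r comes out at zero even though y is completely determined by x. Restricting to the rising half gives a strongly positive r, which shows the relationship was there all along.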

Unlock code: SCATTER

Someone Already Invented It

Give code PEARSON once all groups have found the regression line and made their first prediction.

Task 2a: Calculate Pearson's r by hand

The activity introduces the formula directly on screen — no teacher introduction needed. Key hand-calculation results:

x	y	\(x-\bar{x}\)	\(y-\bar{y}\)	\((x-\bar{x})^2\)	\((y-\bar{y})^2\)	\((x-\bar{x})(y-\bar{y})\)
4	6.1	−4.25	−1.017	18.06	1.034	4.322
5	6.4	−3.25	−0.717	10.56	0.514	2.330
6	6.3	−2.25	−0.817	5.06	0.667	1.838
6	6.8	−2.25	−0.317	5.06	0.100	0.713
7	7.0	−1.25	−0.117	1.56	0.014	0.146
8	7.2	−0.25	0.083	0.063	0.007	−0.021
8	6.9	−0.25	−0.217	0.063	0.047	0.054
9	7.5	0.75	0.383	0.563	0.147	0.287
10	7.4	1.75	0.283	3.063	0.080	0.495
11	7.8	2.75	0.683	7.563	0.467	1.879
12	7.9	3.75	0.783	14.063	0.614	2.936
13	8.1	4.75	0.983	22.563	0.967	4.671
Sums				88.25	4.657	19.65

\[r = \frac{S_{xy}}{\sqrt{S_{xx}\cdot S_{yy}}} = \frac{19.65}{\sqrt{88.25 \times 4.657}} = \frac{19.65}{\sqrt{410.98}} = \frac{19.65}{20.27} \approx 0.969\]

r ≈ 0.969 — very strong positive linear correlation

Accept r in [0.96, 0.98] to allow for rounding differences. The GDC gives r = 0.969 (3 s.f.). Expect most hand calculations to agree to at least 2 s.f.
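For teachers who want to check the table sums without a GDC, here is a minimal Python sketch of the same calculation. The variable names are chosen for this example; only the data come from the activity.

```python
from math import sqrt

# Phase 2 dataset: training hours per week (x) and sprint speed in m/s (y)
hours = [4, 5, 6, 6, 7, 8, 8, 9, 10, 11, 12, 13]
speed = [6.1, 6.4, 6.3, 6.8, 7.0, 7.2, 6.9, 7.5, 7.4, 7.8, 7.9, 8.1]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(speed) / n

# Column totals from the hand-calculation table
s_xx = sum((x - x_bar) ** 2 for x in hours)
s_yy = sum((y - y_bar) ** 2 for y in speed)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, speed))

r = s_xy / sqrt(s_xx * s_yy)
print(f"S_xx = {s_xx:.2f}, S_yy = {s_yy:.3f}, S_xy = {s_xy:.2f}, r = {r:.3f}")
# prints: S_xx = 88.25, S_yy = 4.657, S_xy = 19.65, r = 0.969
```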

Task 2b: Verify with GDC and interpret

r ≈ 0.969 (GDC confirms)

Model interpretation sentence: There is a very strong positive linear relationship between weekly training hours and sprint speed: players who train more hours per week tend to have higher sprint speeds.

Push groups to say "tend to" rather than "will" — this is correlation, not causation. Also ask: could another variable be driving both?

Task 2c: Mean point and regression line of y on x

\[\bar{x} = \frac{99}{12} = 8.25\ \text{hrs/wk} \qquad \bar{y} = \frac{85.4}{12} \approx 7.117\ \text{m/s}\]
\[b = \frac{S_{xy}}{S_{xx}} = \frac{19.65}{88.25} \approx 0.2227 \qquad a = \bar{y} - b\bar{x} = 7.117 - 0.2227(8.25) \approx 5.280\]

Regression line: \(\hat{y} = 0.223x + 5.28\) · Mean point: (8.25, 7.117)

The line passes through the mean point by construction. If students ask why: the least-squares minimisation forces the residuals to sum to zero, which puts \((\bar{x}, \bar{y})\) on the line.
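The argument can be written out in two lines. This is a sketch of the standard least-squares reasoning; the symbol \(SS\) for the sum of squared residuals is notation introduced here, not something from the student activity.

\[SS(a, b) = \sum_{i=1}^{n} (y_i - a - bx_i)^2\]
\[\frac{\partial SS}{\partial a} = -2\sum_{i=1}^{n} (y_i - a - bx_i) = 0 \;\Rightarrow\; \sum y_i = na + b\sum x_i \;\Rightarrow\; \bar{y} = a + b\bar{x}\]

Check with the Phase 2 values: \(5.280 + 0.2227(8.25) \approx 7.117\).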

Task 2d: Predict sprint speed for 10 hrs/wk

x = 10 lies within the data range (4–13) — this is an interpolation.

\[\hat{y} = 0.2227(10) + 5.280 = 2.227 + 5.280 = 7.507 \approx 7.51\ \text{m/s}\]

Predicted sprint speed ≈ 7.51 m/s

Unlock code: PEARSON

How Far Can You Trust It?

Give code PREDICT once all groups have found the x on y line and completed Task 3c.

Task 3a: Interpolation vs extrapolation — three predictions

x value	In range? (4–13 hrs)	Predicted ŷ	Verdict
9 hrs/wk	✓ Interpolation	0.223(9) + 5.28 = 7.29 m/s	Reliable
12 hrs/wk	✓ Interpolation (edge)	0.223(12) + 5.28 = 7.96 m/s	Reliable, mild caution
25 hrs/wk	✗ Extrapolation	0.223(25) + 5.28 = 10.86 m/s	Unreliable

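10.86 m/s is faster than the average speed of the 100 m world record, a clear sign that the linear model does not hold at extreme values.

The 100 m world record (9.58 s) implies an average of about 10.4 m/s. The model predicts more than that for an academy player, although Usain Bolt's peak speed during the race was higher still, at roughly 12 m/s. Use this as a vivid reality check.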
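The interpolation check and the reality check can be sketched together in a few lines. The function name `predict_speed` and the constants are assumptions made for this example; the coefficients are the rounded ones from Task 2c, and 9.58 s is the 100 m world record time.

```python
# Rounded regression coefficients from Task 2c: y-hat = 0.223x + 5.28
SLOPE = 0.223
INTERCEPT = 5.28
X_MIN, X_MAX = 4, 13  # range of training hours in the Phase 2 dataset

# Average speed over the 100 m world record run (100 m in 9.58 s)
RECORD_AVERAGE = 100 / 9.58

def predict_speed(hours):
    """Return (predicted speed in m/s, 'interpolation' or 'extrapolation')."""
    y_hat = SLOPE * hours + INTERCEPT
    kind = "interpolation" if X_MIN <= hours <= X_MAX else "extrapolation"
    return y_hat, kind

for hours in (9, 12, 25):
    y_hat, kind = predict_speed(hours)
    flag = " (above world-record average!)" if y_hat > RECORD_AVERAGE else ""
    print(f"{hours} hrs/wk -> {y_hat:.3f} m/s, {kind}{flag}")
```

Only the 25 hrs/wk prediction falls outside the data range, and it is also the only one that exceeds the world-record average.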

Task 3b: The reverse question — y = 7.6 m/s, find x

Step 1 — Rearranging y-on-x line:

\[\hat{y} = 0.2227x + 5.280 \;\Rightarrow\; x = \frac{7.6 - 5.280}{0.2227} = \frac{2.32}{0.2227} \approx 10.4\ \text{hrs/wk}\]

Step 2 — Regression line of x on y (from GDC):

\[b_{xy} = \frac{S_{xy}}{S_{yy}} = \frac{19.65}{4.657} \approx 4.220 \qquad a_{xy} = \bar{x} - b_{xy}\bar{y} = 8.25 - 4.220(7.117) \approx -21.78\]
\[\hat{x} = 4.220y - 21.78 \;\Rightarrow\; \hat{x}(7.6) = 4.220(7.6) - 21.78 = 32.07 - 21.78 \approx 10.3\ \text{hrs/wk}\]

Rearrangement: ≈ 10.4 hrs/wk · x-on-y line: ≈ 10.3 hrs/wk

The answers are close (because r is very high) but not identical. For a scout making practical decisions, this difference is small. But the principle matters.
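Both routes can be checked at full precision with a short sketch. The variable names are chosen for this example; the data are the Phase 2 dataset.

```python
# Phase 2 dataset: training hours per week (x) and sprint speed in m/s (y)
hours = [4, 5, 6, 6, 7, 8, 8, 9, 10, 11, 12, 13]
speed = [6.1, 6.4, 6.3, 6.8, 7.0, 7.2, 6.9, 7.5, 7.4, 7.8, 7.9, 8.1]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(speed) / n
s_xx = sum((x - x_bar) ** 2 for x in hours)
s_yy = sum((y - y_bar) ** 2 for y in speed)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, speed))

target_speed = 7.6  # m/s

# Route 1: fit y on x, then rearrange the line to solve for x
b_yx = s_xy / s_xx
a_yx = y_bar - b_yx * x_bar
x_rearranged = (target_speed - a_yx) / b_yx

# Route 2: fit x on y directly and substitute the target speed
b_xy = s_xy / s_yy
a_xy = x_bar - b_xy * y_bar
x_direct = b_xy * target_speed + a_xy

print(f"rearranged y-on-x: {x_rearranged:.2f}, x-on-y: {x_direct:.2f}")
# prints: rearranged y-on-x: 10.42, x-on-y: 10.29
```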

Task 3c: Two lines — why?

Where do they intersect? At the mean point (8.25, 7.117).
Are they the same line? No: y-on-x minimises the sum of squared vertical residuals; x-on-y minimises the sum of squared horizontal residuals. Same data, different optimisation criteria.
Which to use for 3b? The x-on-y line — you are predicting x from a given y value.
When are they identical? Only when r = ±1, i.e. all points lie perfectly on a straight line.

Key insight

The gap between the two lines depends on r. When r is close to 1, the lines are nearly identical. When r is weak (e.g. 0.5), the lines diverge dramatically — and using the wrong one gives badly wrong answers.
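One way to make the dependence on r precise is to multiply the two slopes. The notation \(b_{yx}\) for the y-on-x slope (called \(b\) in Task 2c) is introduced here for clarity.

\[b_{yx} \cdot b_{xy} = \frac{S_{xy}}{S_{xx}} \cdot \frac{S_{xy}}{S_{yy}} = \frac{S_{xy}^2}{S_{xx}S_{yy}} = r^2\]
\[0.2227 \times 4.220 \approx 0.94 \qquad 0.969^2 \approx 0.94\]

Drawn on the same axes, the x-on-y line has gradient \(1/b_{xy}\), so the y-on-x gradient is \(r^2\) times the x-on-y gradient: about 94% of it for this dataset, but only 25% when r = 0.5.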

Show both lines crossing at (8.25, 7.117) and diverging. The gap makes the point viscerally.

Unlock code: PREDICT

The Churros Conspiracy

Give code CHURROS once all groups have written their verdict.

Task 4a: Run the numbers — r, regression line, prediction for 600 churros

Students run the full analysis before drawing conclusions. Do not pre-empt the tension.

\[S_{xx} = 201{,}595 \qquad S_{yy} = 37.75 \qquad S_{xy} = 2{,}682.5\]
\[r = \frac{2682.5}{\sqrt{201595 \times 37.75}} = \frac{2682.5}{2758.7} \approx 0.972\]

r = 0.972 (3 s.f.) — very strong positive correlation

\[b = \frac{2682.5}{201595} \approx 0.01331 \qquad a = 2.25 - 0.01331 \times 304.5 \approx -1.802\]
\[\hat{y}(600) = 0.01331(600) - 1.802 \approx 6.18\ \text{goals}\]

Regression line: ŷ = 0.0133x − 1.80 · Prediction for 600 churros ≈ 6.18 goals

Let the numbers land before Task 4b. r = 0.972 is marginally higher than the training dataset's r = 0.969. The maths is unambiguous, so do not rush to explain it away.
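A minimal Python sketch for checking the Phase 4 figures without a GDC. The variable names are chosen for this example; the data are the 20 home matches listed at the top of this sheet.

```python
from math import sqrt

# Phase 4 dataset: churros sold (x) and goals scored (y), matches 1-20
churros = [210, 185, 340, 290, 155, 410, 375, 230, 460, 310,
           270, 195, 385, 440, 160, 325, 280, 500, 215, 355]
goals = [1, 0, 3, 2, 1, 4, 3, 1, 4, 2,
         2, 1, 3, 4, 0, 3, 2, 5, 1, 3]

n = len(churros)
x_bar = sum(churros) / n
y_bar = sum(goals) / n
s_xx = sum((x - x_bar) ** 2 for x in churros)
s_yy = sum((y - y_bar) ** 2 for y in goals)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(churros, goals))

r = s_xy / sqrt(s_xx * s_yy)
b = s_xy / s_xx            # slope of the y-on-x regression line
a = y_bar - b * x_bar      # intercept
y_600 = b * 600 + a        # prediction for 600 churros
is_extrapolation = not (min(churros) <= 600 <= max(churros))

print(f"r = {r:.3f}, b = {b:.5f}, a = {a:.3f}, prediction = {y_600:.2f}")
# prints: r = 0.972, b = 0.01331, a = -1.802, prediction = 6.18
print("extrapolation:", is_extrapolation)
```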

Task 4b: Question the claim

600 churros — interpolation or extrapolation? Extrapolation (data range 155–500). Unreliable on that basis alone.
When do matches attract big crowds? Important fixtures — rivals, cup games, top-of-table clashes, derbies.
Would those matches tend to have more goals? Yes — higher stakes, more attacking play, home crowd energy.
What is actually driving both variables? Crowd size / match importance. Big matches → more fans → more churros sold AND more goals. Churros are irrelevant.
Would the prediction hold for a quiet Tuesday match? No — exposes the lurking variable.

The goal is for students to name the lurking variable themselves. Do not offer it. If a group is stuck, ask: "So if you sold 600 churros at a 200-person Tuesday match, do you expect 6 goals?" That usually breaks the impasse.

Task 4c: The verdict

r = 0.972 shows a very strong positive relationship — but the prediction for 600 churros is an extrapolation and unreliable. Both variables are driven by crowd size: big matches sell more food and produce more goals. Selling churros will not improve results. The sponsor's claim does not hold up.

Precision point for consolidation

The maths is correct — r = 0.972 accurately describes the co-variation. What is wrong is the causal interpretation: assuming that because two variables move together, one causes the other.
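For a class that wants to see a lurking variable at work, the simulation below is one possible illustration. Everything in it is invented: the crowd sizes, the coefficients and the noise levels are assumptions, goals are treated as a continuous quantity for simplicity, and churros have no effect on goals by design. The helpers `pearson_r` and `partial_r` are written for this example; `partial_r` uses the standard first-order partial correlation formula.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r computed from the sums of squares S_xx, S_yy and S_xy."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / sqrt(s_xx * s_yy)

def partial_r(xs, ys, zs):
    """Correlation between xs and ys after removing the linear effect of zs."""
    r_xy = pearson_r(xs, ys)
    r_xz = pearson_r(xs, zs)
    r_yz = pearson_r(ys, zs)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

random.seed(1)  # fixed seed so the run is repeatable

# Invented model: crowd size drives BOTH variables; churros never touch goals.
crowd = [random.uniform(200, 5000) for _ in range(200)]
churros = [0.1 * c + random.gauss(0, 30) for c in crowd]
goals = [0.001 * c + random.gauss(0, 0.5) for c in crowd]

r_raw = pearson_r(churros, goals)
r_given_crowd = partial_r(churros, goals, crowd)

print(f"churros vs goals:               r = {r_raw:.2f}")
print(f"churros vs goals, crowd removed: r = {r_given_crowd:.2f}")
```

The raw correlation comes out strongly positive even though the model contains no causal link from churros to goals, and it largely disappears once crowd size is accounted for.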

Unlock code: CHURROS

📋 Consolidation — select 2–3 boards to debrief

Phase 1: Why does Plot D have a low r even though there's a strong relationship?
Phase 2: What does each part of \(r = S_{xy}/\sqrt{S_{xx}S_{yy}}\) measure geometrically?
Phase 3: When would the two regression lines be the same? (Only when r = ±1.)
Phase 4: Can you invent another spurious correlation from real-world data?