The results of the OECD’s 2015 Programme for International Student Assessment (PISA) were published a couple of weeks ago. The PISA assessment has measured the performance of 15-year-olds in Reading, Maths and Science every three years since 2000. I got the impression that teachers and academics (at least those using social media) were interested mainly in various aspects of the analysis. The news media, in contrast, focussed on the rankings. So, according to the BBC website, did the OECD and politicians. Andreas Schleicher of the OECD mentioned Singapore ‘getting further ahead’ and US Education Secretary John King referred to the US ‘losing ground’.

What they are talking about are some single-digit changes in scores of almost 500 points. Although the PISA analysis might be informative, the rankings tell us very little. No one will get promoted or relegated as a consequence of their position in the PISA league table. Education is not football. What educational performance measures do have in common with all other performance measures – from football to manufacturing – is that performance is an outcome of causal factors. Change the causal factors and the performance will change.

**common causes vs special causes**

Many factors impact on performance. Some fluctuations are inevitable because of the variation inherent in raw materials, climatic conditions, equipment, human beings etc. Other changes in performance occur because a key causal factor has changed significantly. The challenge is in figuring out whether fluctuations are due to variation inherent in the process, or whether they are due to a change in the process itself – referred to as **common causes and special causes**, respectively.

The difference between common causes and special causes is important because there’s no point spending time and effort investigating common causes. Your steel output might have suffered because of a batch of inferior iron ore, your team might have been relegated because two key players sustained injuries, or your PISA score might have fallen a couple of points due to a flu epidemic just before the PISA tests. It’s impossible to prevent such eventualities and even if you could, some other variation would crop up instead. However, if performance has improved or deteriorated following a change in supplier, strategy or structure you’d want to know whether or not that special cause has had a real impact.

**spotting the difference**

This was the challenge facing Walter A Shewhart, a physicist, engineer and statistician working for the Western Electric Company in the 1920s. Shewhart figured out a way of representing variations in performance so that quality controllers could see *at a glance* whether the variation was due to common causes or special causes. The representation is generally known as a control chart. I thought it might be interesting to plot some PISA results as a control chart, to see if changes in scores represented a real change or whether they were the fluctuations you’d expect to see due to variation inherent in the process.

If I’ve understood Shewhart’s reasoning correctly, it goes like this: Even if you don’t change your process, fluctuations in performance will occur due to the many different factors that impact on the effectiveness of your process. In the case of the UK’s PISA scores, each year similar students have learned and been assessed on very similar material, so the process remains unchanged; what the PISA scores measure is student performance. But student performance can be affected by a huge number of factors: health, family circumstances, teacher recruitment, changes to the curriculum a decade earlier etc.

For statistical purposes, the variation caused by those multiple factors can be treated as **random**. (It isn’t truly random, but for most intents and purposes can be treated as if it is.) This means that over time, UK scores will form a normal distribution – most will be close to the mean, a few will be higher and a few will be lower. And we know quite a bit about the features of normal distributions.

Shewhart came up with a formula for calculating the **upper and lower limits** of the variation you’d expect to see as a result of common causes. If a score falls outside those limits, it’s worth investigating because it probably indicates a special cause. If it doesn’t, it isn’t worth investigating, because it’s likely to be due to common causes rather than a change to the process. Shewhart’s method is also useful for finding out whether or not an intervention has made a real difference to performance. Donald Wheeler, in *Understanding Variation: The key to managing chaos*, cites the story of a manager spotting a change in performance outside the control limits and discovering it was due to trucks being loaded differently without the supervisor’s knowledge.

**getting the PISA scores under control**

I found it surprisingly difficult, given the high profile of the PISA results, to track down historical data, and I couldn’t access it via the PISA website – if anyone knows of an accessible source I’d be grateful, and likewise if anyone spots errors in my calculations. I decided to use the UK’s overall scores for Mathematics as an example. In 2000 and 2003 the UK assessments didn’t meet the PISA criteria, so the 2000 score is open to question and the 2003 score was omitted from the tables.

I’ve followed the method set out in Donald Wheeler’s book, which is short, accessible and full of examples. At first glance the formulae might look a bit complicated, but the maths involved is very straightforward. Year 6s might enjoy applying it to previous years’ SATs results.

**Step 1: Plot the scores and find the mean.**

| year | 2000* | 2003* | 2006 | 2009 | 2012 | 2015 | mean (Xbar§) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| UK maths score | 529 | – | 495 | 492 | 494 | 492 | 500.4 |

Table 1: UK maths scores 2000-2015

* In 2000 and 2003 the UK assessments didn’t meet the PISA criteria, so the 2000 score is open to question and the 2003 score was omitted from the results.

§ I was chuffed when I figured out how to type a bar over a letter (the symbol for mean) but it got lost in translation to the blog post.

Fig 1: UK Maths scores and mean score
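Step 1 can be sketched in a few lines of Python (my own illustration, not from Wheeler’s book); the 2003 score is omitted, as in Table 1:

```python
# UK maths scores for 2000, 2006, 2009, 2012 and 2015 (2003 omitted).
scores = [529, 495, 492, 494, 492]
xbar = sum(scores) / len(scores)  # the mean, Xbar
print(xbar)  # 500.4
```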

**Step 2: Find the moving range (mR) values and calculate the mean.**

The moving range values, referred to as mR values, are the differences between consecutive scores.

| year | 2000 | 2003 | 2006 | 2009 | 2012 | 2015 | mean (Rbar) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| UK maths score | 529 | – | 495 | 492 | 494 | 492 | |
| mR values | | | 34 | 3 | 2 | 2 | 10.25 |

Table 2: moving range (mR values) 2000-2015

Fig 2: Differences between consecutive scores (mR values)
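Step 2 as a sketch: the consecutive differences and their mean. Because 2003 is omitted, the 2000 and 2006 scores are treated as consecutive, which is where the moving range of 34 comes from.

```python
scores = [529, 495, 492, 494, 492]  # 2000, 2006, 2009, 2012, 2015
# Moving ranges: absolute differences between consecutive scores.
mr = [abs(b - a) for a, b in zip(scores, scores[1:])]
rbar = sum(mr) / len(mr)  # mean moving range, Rbar
print(mr, rbar)  # [34, 3, 2, 2] 10.25
```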

**Step 3: Calculate the Upper Control Limit for the mR values (UCL_R).**

To do this we multiply the mean of the mR values (Rbar) by 3.27.

UCL_R = 3.27 x Rbar = 3.27 x 10.25 = 33.52

Fig 3: Differences between scores (mR values) showing upper control limit (UCL_R)
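Step 3 in code – a minimal sketch using the mean moving range from Step 2; the 3.27 scaling factor is the standard XmR-chart constant Wheeler uses.

```python
rbar = 10.25  # mean moving range (Rbar) from Step 2
ucl_r = 3.27 * rbar  # Upper Control Limit for the mR values
print(f"UCL_R = {ucl_r:.2f}")  # ≈ 33.52
```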

**Step 4: Calculate the Upper Natural Process Limit (UNPL) for the individual scores using the formula UNPL = Xbar + (2.66 x Rbar).**

UNPL = Xbar + (2.66 x Rbar) = 500.4 + (2.66 x 10.25) = 500.4 + 27.27 = 527.67

**Step 5: Calculate the Lower Natural Process Limit (LNPL) for the individual scores using the formula LNPL = Xbar – (2.66 x Rbar).**

LNPL = Xbar – (2.66 x Rbar) = 500.4 – (2.66 x 10.25) = 500.4 – 27.27 = 473.13
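Steps 4 and 5 together, as a sketch using the values from Steps 1 and 2 (2.66 is the other standard XmR-chart constant):

```python
xbar = 500.4  # mean score (Xbar) from Step 1
rbar = 10.25  # mean moving range (Rbar) from Step 2
unpl = xbar + 2.66 * rbar  # Upper Natural Process Limit
lnpl = xbar - 2.66 * rbar  # Lower Natural Process Limit
print(f"UNPL = {unpl:.2f}, LNPL = {lnpl:.2f}")  # ≈ 527.67 and 473.13
```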

We can now plot the UK’s Maths scores showing the upper and lower natural process limits – the limits of the variation you’d expect to see as a result of common causes.

Fig 4: UK Maths scores showing upper and lower natural process limits

What Fig 4 shows is that the UK’s 2000 Maths score falls just outside the upper natural process limit, so even if the OECD hadn’t told us it was an anomalous result, we’d know that something different happened to the process in that year. You might think this is pretty obvious because there’s such a big difference between the 2000 score and all the others. But what if the score had been just a bit lower? I put in some other numbers:

| score | Xbar | Rbar | UCL_R | UNPL | LNPL |
| --- | --- | --- | --- | --- | --- |
| 529 (actual) | 500.4 | 10.25 | 33.52 | 527.67 | 473.13 |
| 520 | 498.6 | 8 | 26.16 | 519.88 | 477.32 |
| 510 | 496.6 | 5.5 | 17.99 | 511.23 | 481.97 |
| 500 | 494.6 | 3 | 9.81 | 502.58 | 486.62 |

Table 3: outcomes of alternative scores for year 2000

Table 3 shows that if the score had been 520, it would still have been outside the natural process limits, but a score of 510 would have been within them.
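The what-if exercise in Table 3 can be reproduced with a short loop. This is a sketch under the same assumptions as above; the helper name `xmr_limits` is mine, not from Wheeler.

```python
def xmr_limits(scores):
    """Return Xbar, Rbar and the XmR limits (UCL_R, UNPL, LNPL) for a list of scores."""
    xbar = sum(scores) / len(scores)
    mr = [abs(b - a) for a, b in zip(scores, scores[1:])]
    rbar = sum(mr) / len(mr)
    return xbar, rbar, 3.27 * rbar, xbar + 2.66 * rbar, xbar - 2.66 * rbar

# Try alternative year-2000 scores, keeping the 2006-2015 scores fixed.
verdicts = {}
for first in (529, 520, 510, 500):
    xbar, rbar, ucl_r, unpl, lnpl = xmr_limits([first, 495, 492, 494, 492])
    verdicts[first] = "inside" if lnpl <= first <= unpl else "outside"
    print(first, round(xbar, 2), round(rbar, 2), verdicts[first])
```

The actual 529 and the hypothetical 520 come out outside the limits; 510 and 500 come out inside, matching Table 3.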

Fig 5: UK Maths scores showing upper and lower natural process limits for a year 2000 score of 510

**ups, downs and targets**

The ups and downs of test results are often viewed as more important than they really are; up two points good, down two points bad – even though a two-point fluctuation might be due to random variation.

The process control model has significant implications for target-setting too. Want to improve your score? Then you need to work harder or smarter. Never mind the fact that students and teachers can work their socks off only to find that their performance is undermined by a crisis in recruiting maths teachers or a whole swathe of schools converting to academies. Working harder or smarter but ignoring natural variation supports what’s been called Ackoff’s proposition – that “almost every problem confronting our society is a result of the fact that our public policy makers are doing the wrong things and are trying to do them righter”.

To get tough on PISA scores we need to get tough on the causes of PISA scores.

**Reference**

Wheeler, D.J. (1993). *Understanding Variation: The Key to Managing Chaos*. SPC Press, Knoxville, Tennessee.
