Writing about web page http://www.smf.co.uk/how-to-high-stakes-exams-mismeasure-potential/
On 29 June the Social Market Foundation published our CAGE policy briefing, "The long term consequences of having a bad day: How high-stakes exams mismeasure potential," by Victor Lavy (Warwick and Hebrew University of Jerusalem), Avraham Ebenstein (Hebrew University of Jerusalem), and Sefi Roth (Royal Holloway University of London). Below is their summary of findings; the full version is available here:
Examination candidates can have a bad day for many reasons unrelated to knowledge, skill, or cognitive ability. Possible causes include minor infection, migraine, hay fever, menstruation, disturbed sleep, and atmospheric pollution. When the stakes are high, as with exams used to rank students or admit them to further training or employment, there can be permanent consequences for the individual and for society.
There is a lack of evidence on these consequences. Empirical challenges include the difficulty in identifying the return to cognitive ability separately from the return to doing well on the examination.
Our solution to this problem is to examine the consequences of fluctuations in a random factor on exam performance. We use fluctuations in air pollution for this purpose. When the same student takes multiple exams, and exposure to ambient pollution varies randomly from day to day, we can use the associated variation in performance to measure the component of a student’s score which is related entirely to luck. The results show that transitory ambient pollution exposure is associated with a significant decline in student performance.
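To make the within-student comparison concrete, here is a minimal simulated sketch, not the authors' actual data or estimation code. Every number and name in it is an assumption chosen for illustration: the hypothetical true effect of -0.5 points per unit of pollution, the fixed student-level component labelled `ability`, and the assumption that students with lower `ability` live in more polluted areas. Subtracting each student's own averages removes everything fixed about that student, so only the day-to-day variation in pollution is left to explain the day-to-day variation in scores.

```python
import random
from collections import defaultdict


def within_student_slope(records):
    """Slope of score on pollution using only within-student variation.

    records: iterable of (student_id, pollution, score) tuples.
    """
    by_student = defaultdict(list)
    for student_id, pollution, score in records:
        by_student[student_id].append((pollution, score))

    num = den = 0.0
    for exams in by_student.values():
        # Demean by the student's own averages to remove fixed traits.
        mean_p = sum(p for p, _ in exams) / len(exams)
        mean_s = sum(s for _, s in exams) / len(exams)
        for p, s in exams:
            num += (p - mean_p) * (s - mean_s)
            den += (p - mean_p) ** 2
    return num / den


def pooled_slope(records):
    """Naive slope that ignores which student sat which exam."""
    records = list(records)
    mean_p = sum(p for _, p, _ in records) / len(records)
    mean_s = sum(s for _, _, s in records) / len(records)
    num = sum((p - mean_p) * (s - mean_s) for _, p, s in records)
    den = sum((p - mean_p) ** 2 for _, p, _ in records)
    return num / den


def simulate(n_students=2000, n_exams=5, true_effect=-0.5, seed=0):
    """Generate hypothetical exam records (all parameters are assumptions)."""
    rng = random.Random(seed)
    records = []
    for student_id in range(n_students):
        ability = rng.gauss(0.0, 10.0)
        # Assumed: lower-ability students live in more polluted areas.
        local_level = 50.0 - 0.8 * ability + rng.gauss(0.0, 5.0)
        for _ in range(n_exams):
            # Pollution on the exam day varies randomly around the local level.
            pollution = local_level + rng.gauss(0.0, 10.0)
            score = 70.0 + ability + true_effect * pollution + rng.gauss(0.0, 5.0)
            records.append((student_id, pollution, score))
    return records


if __name__ == "__main__":
    data = simulate()
    print("within-student estimate:", round(within_student_slope(data), 3))
    print("naive pooled estimate:  ", round(pooled_slope(data), 3))
```

In this simulation the pooled estimate overstates the damage because it mixes the effect of pollution with the fixed differences between students, while the within-student estimate should land close to the assumed effect.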
We then examine these students during adulthood (8-10 years after the exams) and we find that pollution exposure during exams causes lasting damage to post-secondary educational attainment and earnings later in life.
Our analysis highlights that high-stakes exams provide measures of student quality that may be imprecise and misleading. These measures can lead to allocative inefficiency because students who have had a bad day because of factors outside their control are mis-ranked. After that, they are inefficiently assigned to further training and to different occupations and this reduces labour productivity overall.
As well as illustrating these problems with high-stakes exams, our findings also expand understanding of the costs of pollution. They imply that a narrow focus on traditional health outcomes, such as hospitalisation and increased mortality, is not enough. The full cost of pollution includes loss of mental acuity, which is essential to productivity in most professions. The use of high-stakes exams to measure mental acuity then multiplies this loss over many years.
After that, Simon Lebus of Cambridge Assessment published a critical comment, "High stakes in adverse conditions." He pointed out that exams are designed to discriminate among candidates, for example by setting time constraints. He questioned "whether it is meaningful to suggest that the 'fairness' of an exam (in any event always a form of sampling) is compromised by random external factors any more than by the peculiarities of its own design." Below are his concluding sentences; you can read the full version here.
If you really want variation to play a role then go with 'local' assessment in schools – where a much greater range of variables is likely to apply, ranging from the physical environment of the classroom (noise, light, discipline and so on) to the nature of the tasks, the facilities, the 'fairness' of the teacher's judgement, possible support from the teacher and so on. Remedies to exclude such variation range from the expensive to the draconian to the impossible. If we were to try to mitigate every external variable (and not just pollution) we would end up chasing myriad effects and influences, with no necessary improvement in the predictive validity of qualifications – which currently are at levels which enjoy public confidence.
Despite legitimate questions about the design and role of high stakes assessment, it is important to recognise that trying to design assessments around the principle of excluding every element of randomness would (perhaps counter-intuitively) end up likely introducing even more randomness, with a consequent adverse impact on both equity and attainment.
Below, Victor Lavy and his co-authors reply to Simon Lebus, in a comment that first appeared on the SMF blog on 19 August.
Thank you for your thoughtful comments. You raise three key issues that are critical to high-stakes testing, and related to our research findings. The first issue is exactly which disturbances a student should or should not be protected from. For instance, in your example of the concerned grandmother, she perceived the time constraint as an unjust factor that prevented her from demonstrating her actual knowledge. In your opinion (and in ours), exams with time constraints are a legitimate way to distinguish weaker from stronger students, since the speed with which you can perform a task is a reflection of aptitude. However, we would submit that sensitivity to pollution (or lack thereof) is not a factor that should be used to rank students. As our findings demonstrate, pollution targets asthmatic students and other students with health ailments, and it seems likely that the optimal tests would be designed to minimize the influence of this type of external factor.
A second issue worth noting is that our study finds that temporary disturbances like pollution have a very large impact on student performance. If students were relatively insensitive to pollution, there would be little need to worry about the testing conditions. However, the impact we find is observed even when we exploit variation across the same student’s exams on different days, so it is likely that we are identifying a causal relationship rather than simply observing a correlation between scores and pollution. As you suggest, it is likely that noise and temperature affect student performance as well, and all these factors should in an ideal world be eliminated from the testing environment. Our findings suggest that efforts to eliminate these factors are not wasted.
The third issue our study raises is a question of fairness. If pollution were truly random and all students had an equal chance of being exposed, it is likely that it could be justly ignored by the testing administration. However, the parts of cities that are most exposed to pollution are generally the poorest areas, which are near factories and other dis-amenities. And so it is underprivileged students who are most likely to be affected. Therefore, to ignore the potential impact of pollution on test-takers is to open up the possibility of stacking the deck further against students from more modest backgrounds.
In light of these factors, we submit that high-stakes exams like A levels or the Israeli Bagrut should be held in controlled environments, with all possible attention paid to making these settings consistent and fair for all students. We propose that there are policy interventions that could be made at relatively low cost. These include rescheduling exams to different days or different testing sites, concentrating exams during seasons with less pollution, and reducing the weight of any single exam by having students judged by their average score over multiple administrations. In light of the great academic and financial stakes of these tests, it is no trivial matter.
Victor, Avi and Sefi