Released a year ago on this day, the 2017/8 GEM Report highlighted the multiple layers of accountability in education: different mechanisms, several actors, contrasting perceptions and nuanced meanings across languages.
One of its key messages was that, while accountability was an essential part of a solution package for challenges in education systems, it was necessary that all actors:
‘…should approach the design of accountability with a degree of humility, recognizing that education problems are complex in nature and often do not lend themselves to a single solution.’
It noted the trend in richer countries of using student test scores to hold schools and teachers accountable. But it found that this approach risked promoting unhealthy competition, gaming of the system and further marginalization of disadvantaged students.
In our recommendations, we made clear that ‘governments should design school and teacher accountability mechanisms that are supportive and formative, and avoid punitive mechanisms, especially the types based on narrow performance measures’.
We will be reiterating this message this week at the 5th International Conference on Education Monitoring and Evaluation, taking place in Beijing, where the Chinese edition of the report will be shared with participants.
However, a recent study argues that more testing has a positive impact on education. Drawing on six rounds of data from the Programme for International Student Assessment (PISA), the authors have concluded that ‘accountability systems that use standardized tests to compare outcomes across schools and students produce better student outcomes’ and that ‘both rewards to schools and rewards to students for better outcomes result in greater student learning’ (p. 28).
So how do we reconcile these seemingly contradictory findings?
Different uses of tests can motivate different outcomes. It is therefore important to distinguish the ways tests are used when analysing their effects on student outcomes.
The 2017/8 GEM Report showed that, across 101 education systems, some type of test for accountability was found in over 50% of OECD countries and 45% of non-OECD countries. These tests can be evaluative, whereby student scores are aggregated and disseminated at the school level, often through league tables or report cards. They can also be punitive, whereby results are tied to sanctions or rewards for schools; the United States under the No Child Left Behind policy is the clearest example of a punitive system. Summative policies are those whose tests are deemed neither punitive nor evaluative.
Our review of the evidence found that evaluative policies promoting school choice exacerbated disparities by further advantaging more privileged children (pp. 49-52). Moreover, punitive systems had unclear achievement effects but troublesome negative consequences, including removing low-performing students from the testing pool and explicit cheating (pp. 52-56).
By contrast, the way that Bergbauer and her colleagues classify testing systems confuses concepts. This leads to a misinterpretation of how external and internal factors affect student achievement.
Using 13 indicators on the use and purpose of testing, the authors created four categories of assessment use.
- Standardized external comparisons include assessments that explicitly ‘allow comparisons of student outcomes across schools and students’ and attach rewards to students or (head) teachers. This category conflates the evaluative and punitive categories used in the 2017/8 GEM Report without understanding the key difference between high stakes on students and high stakes on educators.
- Standardized monitoring involves using assessments to monitor student, teacher or school performance, but makes no public external comparisons.
- Internal testing resembles low stakes formative assessment for ‘general pedagogical management’.
- Internal teacher monitoring covers internal assessments ‘directly focused on teachers’.
They found a positive relationship between the first two categories and student achievement but no relationship between the latter two categories and achievement. However, a deeper look at the indicators grouped under each category suggests that the four categories may be misleading. Correcting for those misclassifications would support the GEM Report’s findings.
First, the internal testing category (3) is associated with the indicator ‘achievement data posted publicly’. The authors justify this by suggesting that principals are posting grade point averages or teachers are posting grades on the blackboard. But this is inaccurate: it overlooks the wording of the question in the PISA school questionnaire, which indicates that achievement data are posted publicly, for instance in the media, and therefore cannot be classified as internal testing. The authors’ earlier work had emphasized school report cards and league tables as key to facilitating market-based accountability, in contrast with the approach they take in this research. Their interpretation also directly contradicts how the OECD interprets its own data.
The same is true with the internal teacher monitoring category (4), which uses indicators related to student assessments and class observations. While both these factors can help monitor teachers, placing them in the same category fails to recognize the differences in stakes and motivations when teachers are held accountable for their students’ test scores.
In fact, their results for the indicators ‘achievement data posted publicly’ and ‘making judgements about teachers’ effectiveness’ suggest that the use of test scores for accountability purposes is not associated with greater student achievement, a conclusion driven home in the GEM Report.
So what is driving the results of Bergbauer, Hanushek, and Woessmann? In the ‘standardized external comparisons’ category (1), three indicators are positively associated with student achievement:
- principals use student assessments to compare their school to district/national performance;
- presence of a high stakes student examination at the end of lower secondary; and
- presence of a high stakes student examination that dictates students’ career opportunities.
But the first indicator neither compares schools with one another nor makes school results public; it is therefore unclear how it relates to external accountability. The latter two place stakes solely on students, not on teachers or schools.
The final indicator driving their overall results belongs to the ‘standardized monitoring’ category: systems that do more standardized testing do better on PISA. But this invites teaching to the test, a practice many consider detrimental, and one whose risks the authors dismiss.
A subtler breakdown of results shows that linking test scores to teacher or school accountability has no significant effect on student achievement. Ultimately, we cannot hold individuals to account for something beyond their control, a point the authors also recognize: ‘the optimal design of incentives generally calls for rewarding the results of behavior directly under the control of the actor and not rewarding results from other sources. The problem…is that most testing includes the results of action of multiple parties’.
Let us also not forget that the analysis does not attempt to capture student-level equity or gaming-the-system behaviour, both central concerns when using testing for accountability.
A year on, the messages of the 2017/8 GEM Report are as relevant as ever.