Can ChatGPT evaluate the quality of insights in a student essay? In January 2023, Daisy Christodoulou published an article (Can ChatGPT mark writing?, No More Marking Blog) exploring this question. She tested ChatGPT using English essays, and found that while its grading and feedback were more or less aligned with her own, the AI was sometimes unable to spot fundamental problems in essays. Christodoulou offers some fascinating insights, but one thing she does not explore in any detail is the relationship between ChatGPT’s qualitative evaluation of an essay and the content of the essay itself.
In this post, I will share the results of my own brief experiment, in which I pasted two film reviews into ChatGPT and asked it to evaluate (and compare) the quality of insights in these reviews. My aim here was to use these texts as substitute ‘essays’, and consider how the AI-generated responses might help or hinder a marker in giving meaningful feedback.
The dialogues quoted from below were conducted on 23 March 2023, using the free ChatGPT 3.5. (I later repeated the dialogues with both this and ChatGPT 4.0, but found the responses from 23 March gave more substantial and interesting feedback. I will not discuss those later dialogues in this post.)
Review 1: Peter Bradshaw
To begin with, I asked ChatGPT to ‘Evaluate the quality of insight in the following review of the film Nomadland’, then pasted the text of Peter Bradshaw’s review of Nomadland from The Guardian. I wanted to see how it would respond to this vague request, in which I provided no criteria on which to base an evaluation.
The response did a good job of summarising the review using selected words from the text. For instance, at the end of its evaluation the chatbot said:
[Bradshaw] does offer a mild critique that the film could have been more explicit in its condemnation of the economic forces causing the characters' hardships. However, they still recognize the film's greatness and Chloé Zhao's exceptional filmmaking.
This is clearly pieced together from the following passage in Bradshaw’s review, in which I have highlighted the details used by ChatGPT:
[The film is] brought off with exceptional intelligence and style. Arguably it is not angry enough about the economic forces that are causing all this but it still looks superbly forthright. There is real greatness in Chloé Zhao’s film-making.
The paraphrasing techniques used here – transposing words like ‘exceptional’ and ‘greatness’ into slightly different contexts, and spelling out that ‘all this’ means ‘the characters’ hardships’ – are effective in ‘feedback’ terms, because they seem to indicate an understanding of the writer’s (i.e. student’s) argument. However, Bradshaw says the film is ‘not angry enough’, not that it is ‘not explicit enough’, and this may be a significant error.
In its final verdict on the review, ChatGPT said:
Overall, this review appears to be well-written and insightful, providing a thorough analysis of the film’s themes, characters, and style while also highlighting its strengths and potential weaknesses.
Without my having provided any criteria, the AI assumes that an insightful film review will explore themes, characters, and style, and that it will offer a balanced assessment of the film’s strengths and weaknesses.
More of ChatGPT’s ‘marking criteria’ emerged when I asked it to ‘Identify some ways in which the review could be improved’. It offered some valid points in response to this prompt. For example:
The review could have delved deeper into the film's technical aspects, such as its cinematography, editing, and sound design. While the review mentions the film's style and how it differs from traditional narrative-driven films, it could have gone into more detail about how the film achieves this style and what specific techniques the filmmakers used.
This is arguably an area where Bradshaw’s review falls short. Bradshaw does discuss how Nomadland differs from traditional narrative-driven films, and he does connect this with Zhao’s stylistic choices; in this part of his review, he might have discussed film-making technique in more detail. So the AI’s choice to structure feedback like this (‘While you did talk about x, you could have delved deeper into y...’) could be helpful for a student looking to improve their essay by building on their strengths.
On that note, here is another of ChatGPT’s constructive criticisms:
The review could have included a more critical evaluation of the film. While the review praises the film’s strengths, it could have offered a more balanced assessment that acknowledges both the film’s strengths and its weaknesses.
This picks up on the detail quoted above, from the AI’s initial evaluation, noting that Bradshaw’s critique of the film is ‘mild’. The AI has perhaps noticed that Bradshaw’s more negative language is limited to the very end of his review, and is couched in the word ‘arguably’. Again, if we imagine this as feedback being provided to a student, ChatGPT’s evaluations do a good job of mixing praise and criticism: ‘You balance your argument by acknowledging the film’s weaknesses, but you only do this briefly at the end – you could have included a more critical evaluation.’
Other responses, however, show ChatGPT’s limitations, and would constitute problematic essay feedback. For instance:
The review could have offered more specific examples of how the film explores its themes and characters. While the review mentions some of the film's themes, such as the impact of economic hardship on older Americans, it could have gone into more detail about how the film portrays these themes and how they are relevant to contemporary society.
This is not really a fair critique: Bradshaw does highlight specific examples of how the film explores ‘the impact of economic hardship on older Americans’, and he does allude to contemporary issues such as factory closures, the dominance of Amazon, and the importance of the tourist trade in this part of America:
...looking for seasonal work in bars, restaurants and – in this film – in a gigantic Amazon warehouse in Nevada, which takes the place of the agricultural work searched for by itinerant workers in stories such as The Grapes of Wrath.
Fern, a widow and former substitute teacher in Empire, Nevada – a town wiped off the map by a factory closure – who is forced into piling some possessions into a tatty van and heading off...
At times, the film looks like a tour of a deserted planet, especially when she heads out to the Badlands national park in South Dakota, where there is also tourist-trade work to be had.
ChatGPT also says:
The review could have provided more context for the film's production and reception. For example, the review could have mentioned the awards and critical acclaim that the film has received, or how it fits into Chloé Zhao's broader filmography.
Some of this is fair – the review was published after Nomadland’s Oscar success, so Bradshaw could have mentioned this – but it misses the contextual details Bradshaw includes about the film’s production:
Zhao was even allowed to film inside one of Amazon’s eerie service-industry cathedrals.
The movie is inspired by Jessica Bruder’s 2017 nonfiction book, Nomadland: Surviving America in the Twenty-First Century, and by the radical nomadist and anti-capitalist leader Bob Wells, who appears as himself.
The people she meets on the road are, mostly, real nomads who have vivid presences on screen.
As with the previous criticism, ChatGPT has not acknowledged key details of the review in its initial assessment, so its critique is not balanced: it is like a marker who blames a student for ‘not doing x’ when the student in fact spent several paragraphs on ‘x’. (Human markers sometimes do this, of course.)
Review 2: Beatrice Loayza
I then asked ChatGPT, ‘Is the following review of the film Nomadland more incisive than the previous one?’, and pasted the text of Beatrice Loayza’s review of Nomadland, from Sight & Sound. Again, I deliberately did not provide any assessment criteria. ChatGPT’s answer was ‘yes’, for several reasons – some valid, some less so. First of all, it said, Loayza ‘provides a detailed analysis of the film's themes and cinematography, as well as the performance of Frances McDormand’. This is fair, and picks up on one of the criticisms of Bradshaw cited above (namely his lack of attention to technical aspects). Loayza comments on specific camera techniques, naming the cinematographer and describing the light effects he achieves. She also does more than Bradshaw to explain why McDormand’s performance is so effective.
ChatGPT picks up on another of its own criticisms of Bradshaw by praising Loayza’s critical perspective on the film:
However, the review also criticizes the film's lack of force and clarity in its insights into labor in the 21st century and the exploitation of older Americans. The author points out that the film's depiction of workers exploited by Amazon feels too easygoing and questions the film's liberal naivete in addressing the conditions of the nomadic lifestyle. Overall, the review provides a more nuanced and thoughtful analysis of the film.
This draws upon the following passage in Loayza’s review; again, I have highlighted phrases that ChatGPT seems to have picked up on:
[The film’s] insights into labour in the 21st century, and the exploitation of an older generation of Americans, lack force and clarity. At the very beginning of the film, Fern is employed by Amazon’s CamperForce programme, which provides base wages and free parking space to seasonal workers in their 60s and 70s. In 2020, Amazon doubled its profits during a global pandemic, which makes Zhao’s easygoing depiction of workers exploited by the company feel rather toothless. That the film aims to capture the ways in which a kind of working-class American outsider struggles without fully addressing the conditions of that struggle casts over it the shadow of a questionable liberal naivete.
- In its initial assessment of Bradshaw’s review, ChatGPT noted that his critique of the film was ‘mild’
- In suggesting improvements, it built on this comment by recommending a more balanced approach
- And in drawing a comparison with Loayza’s review, it notes her more substantial version of Bradshaw’s criticism.
At each stage, the AI appears to be drawing upon specific evidence from the texts, rather than just ‘hallucinating’ these evaluative comments.
Elsewhere in its comparison between Bradshaw and Loayza, however, ChatGPT did hallucinate some differences in order to justify its verdict. I will not cite these here, as this post is already very long, but the inaccuracies were of a similar kind to those in the summary of Bradshaw discussed in the previous section.
If these film reviews were formative essays that I had to mark, I could use ChatGPT’s feedback to offer legitimate praise and criticism, suggest improvements, and judge the relative merits of the two essays in relation to each other. However, I would also notice that ChatGPT misses important details in these texts and draws some un-founded contrasts between them.
In the course of this experiment, I tried several variations on the above prompts. Here are some things to note if you want to try a similar experiment yourself:
- I fed the reviews into ChatGPT several times, and in a different order. When I asked it to make a comparative evaluation, it tended to prefer the second review (even if this was Bradshaw’s). When I asked if it could reverse its comparative evaluation (i.e. ‘Can you argue that the other review is more insightful than the first?’), its responses varied: sometimes it doubled down on its first opinion, sometimes it conceded that an alternative opinion could be justified. Again, the reasons given for these opinions ranged from ‘valid’ to ‘hallucinatory’.
- This post demonstrates what Chat-GPT is capable of in the hands of a technically ignorant, time-poor amateur like me, but by using the right prompts and follow-up prompts, it would no doubt be possible to collate more credible ‘essay feedback’, and then ask the AI to present and construct this in an appropriate way. Have a look at the other articles and resources linked to on the AI in Education Learning Circle webpage, try an experiment of your own, and share the results in the comments below. In particular, you might think about the learning outcomes and marking criteria specific to your discipline, or your module, and consider how you might train ChatGPT to use these in evaluating a piece of text.