AI Models Cheat on Tests

Written by

in

In an eerie similarity to high school students, AI models have been caught cheating to improve their test scores.

METR (from their X profile: A research non-profit that develops evaluations to empirically test AI systems for capabilities that could threaten catastrophic harm to society) recently found that AI/LLM tools “reward hack” (aka cheat) in order to improve their scores on standardized testing.

In the last few months, we’ve seen increasingly clear examples of reward hacking[1] on our tasks: AI systems try to “cheat” and get impossibly high scores. They do this by exploiting bugs in our scoring code or subverting the task setup, rather than actually solving the problem we’ve given them. This isn’t because the AI systems are incapable of understanding what the users want—they demonstrate awareness that their behavior isn’t in line with user intentions and disavow cheating strategies when asked—but rather because they seem misaligned with the user’s goals.

Earlier this month, they posted a report, Recent Frontier Models Are Reward Hacking, that describes this behavior as well as their documented examples. Their post continues as they’ve learned that the cheating isn’t simply because of technological limitations:

Historical examples of reward hacking seemed like they could be explained in terms of a capability limitation: the models didn’t have a good understanding of what their designers intended them to do. For example, the CoastRunners AI had no general knowledge about what objects in the game represented or how humans “intended” the gameplay to work, making it impossible for the model to even know it was reward hacking.

But modern language models have a relatively nuanced understanding of their designers’ intentions. They can describe which behaviors are undesirable and why and claim that they would never do anything like reward hacking because they’ve been trained to be safe and aligned—but they still do it.

More posts