Synthesis Blog
AI Safety II: Goodharting and Reward Hacking

In this post, the second in the series (after “Concepts and Definitions”), we embark on a comprehensive exploration of Goodhart’s law: how optimization processes can undermine their intended goals by optimizing proxy metrics. Goodharting lies at the heart of what is so terrifying about making AGI, so this is a key topic for AI safety. Starting with the classic taxonomy of regressional, extremal, causal, and adversarial goodharting, we trace these patterns from simple mathematical models and toy RL environments to the behaviours of state-of-the-art reasoning LLMs, showing how goodharting manifests in modern machine learning through shortcut learning, reward hacking, goal misgeneralization, and even reward tampering, with striking examples from current RL agents and LLMs.

Introduction

The more we measure, and the more hinges on the results of the measurement, the more we care about the measurement itself rather than its original intention. We mentioned this idea, known as Goodhart’s law, in the first part of the series. This seductively simple principle has haunted economists, social scientists, and computer scientists since Charles Goodhart half-jokingly coined it in 1975. Since then, optimization processes have globalized and have been automated to a very large extent. Algorithms trade trillions of dollars, decide on health plans, and write text that billions of people read every day. With that, Goodhart’s law has gone from a sociological curiosity to an important engineering constraint.

Why is goodharting increasingly important now? Three important reasons:

  • scale: modern reinforcement learning systems have billions of parameters and billions of episodes of (usually simulated) experience; below, we will see that reward hacking that does not happen in a 500K-parameter policy can explode once capacity crosses a larger threshold;
  • opacity: learned objectives and policies are distributed across “giant inscrutable matrices” in ways that are hard to inspect; we rarely know which exact proxy the agent is actually using until it fails;
  • automation: when an optimizer tweaks millions of weights per second, a misaligned proxy can lead to catastrophic consequences very quickly.

In this post, we consider Goodhart’s law in detail from the perspective of AI safety; this is a long one, so let me introduce its structure in detail. We begin with the classical four-way taxonomy:

  • regressional Goodhart is the statistical “tails come apart” effect: optimize far enough into the distribution, and the noise will drown out the signal;
  • extremal Goodhart happens when an agent pushes the world into qualitatively new regimes; e.g., the Zillow housing crash happened because price models had been calibrated only on a rising market;
  • causal Goodhart shows that meddling with shared causes or intermediate variables can destroy the very correlations we hoped to exploit;
  • adversarial Goodhart, the most important kind for AI alignment debates, emerges whenever a second optimizer—usually a mesa-optimizer, as we discussed last time—strategically cooperates or competes with ours.

For each part, we consider both real world human-based examples and how each kind of goodharting manifests when the optimizer is a neural network.

Next, we shift to theory. We discuss several toy examples that are easy to analyze formally and in full: discrepancies between noise distributions with light and heavy tails and simple gridworld-like RL environments. Throughout these examples, we will see that alignment under small optimization pressure does not guarantee alignment at scale, and we will see the phase transition I mentioned above: how models shift to goodharting as they become “smarter” and more complex.

And with that, we come to adversarial goodharting examples; the last two sections are the ones that show interesting examples from present-day models. We distinguish between specification gaming (straightforward goodharting), reward hacking (finding unintended exploits in the reward as defined by the designer), and reward tampering (actively changing the reward or the mechanism that computes it), and show how modern AI models progress along these lines. These are the sections with the most striking examples: you will find out how image generation models fool scoring systems with adversarial pictures, how LLMs learn to persuade users to accept incorrect answers rather than give better ones, and how reasoning models begin to hack into their chess engine opponents when they realize they cannot win properly.

But let me not spoil all the fun in the introduction—let’s begin!

Goodhart’s Law: Definition and Taxonomy

Definition and real world examples. We briefly discussed Goodhart’s law last time; formulated in 1975, it goes like this: “when a measure becomes a target, it ceases to be a good measure.” Goodhart warned central bankers that once an index—say, the growth of money supply—became a policy target, the underlying behaviour it was supposed to track would become decoupled from the metric or even invert.

However careful you are with designing proxy metrics, mesa-optimizers—processes produced by an overarching optimization process, like humans were produced by evolution—will probably find a way to optimize the proxy without optimizing the original intended goal. Last time we discussed this in the context of mesa-optimization: e.g., us humans do not necessarily maximize the number of kids we raise even though it is the only way to ensure an evolutionary advantage.

The classical example that often comes up is the British colonial cobra policy: when the British government became concerned about the number of venomous cobras in Delhi, they offered a bounty for dead cobras to incentivize people to kill the snakes. Soon, people started breeding cobras for the money, and when the bounty was canceled the cobras were released into the wild, so the snake population actually increased as a result.

I want to go on a short tangent here. The cobra story itself seems to be an exaggeration, and while the bounty indeed existed, actual historical analysis is much muddier than the story. I find it strange and quite reflective of human behaviour that the cobra example is still widely cited—there is even a whole book called “The Cobra Effect”; it is a business advice book, of course. I don’t understand this: yes, the cobra story is a great illustration but there are equally great and even funnier stories that certainly happened and have been well documented, so why not give those instead?

Take, for instance, the Great Hanoi Rat Massacre of 1902: the French colonial government wanted to prevent the spread of diseases and offered a small bounty for killing rats in Hanoi, Vietnam. Their mistake was in the proof of killing: they paid the bounty for a rat’s tail. Sure enough, soon Hanoi was full of tailless rats: chopping off the tail doesn’t kill a rat, and while the tail won’t grow back the rat can still go on to produce more rats, which now becomes a desirable outcome for the rat hunters. The wiki page for “perverse incentive” has many more great illustrations… but no, even DeepMind’s short course mentions cobras, albeit in the form of a fable.

In any case, Goodhart’s law captures the tendency for individuals or institutions to manipulate a metric once it is used for evaluation or reward. To give a policy example, if a school system bases teacher evaluations on standardized test scores, teachers will “teach to the test”, that is, narrow their instruction to the skills helpful for getting a higher score rather than teaching for actual understanding of the material.

How does goodharting appear? Let us begin with a classification and examples, and then discuss simple models that give rise to goodharting.

Four modes of goodharting. Manheim and Garrabrant (2018) offer a useful classification of goodharting into four categories.

Their first two modes—regressional and extremal—arise purely from selection pressure without any deliberate intervention. In their model, some agent (“regulator”) selects a state s looking for the highest values of a proxy metric M(s) while its actual goal is to optimize a different function G(s), which is either not fully known or too hard to optimize directly. The other two—causal and adversarial—involve agents that can and will optimize their own goals, and the regulator’s interventions either damage the correlations between proxy and true goals or are part of a dynamic where the other agents have their own goals completely decoupled from the regulator’s. Let me begin with a diagram of this classification, and then we will discuss each subtype in more detail.

Regressional Goodhart, or “tails come apart”, occurs because any proxy inevitably carries noise: even if the noise in M(s) is Gaussian,

M(s) = G(s) + ξ(s),   with ξ(s) ~ N(0, σ²),

states with large M will also select for peaks in the noise, so the actual goal G will be predictably smaller than M. Here the name “regressional” comes from the term “regression to the mean”: high M will combine both high G and high noise, so G will probably not be optimized exactly.

For example, suppose a tech firm hires engineers based solely on a coding challenge score (M), assuming it predicts actual job performance (G), and suppose that the challenge is especially noisy: e.g., the applicants are short on time so they have to rely on guesswork instead of testing all their hypotheses. The very highest scores in the challenge will almost inevitably reflect not only high skill but also lucky guesses in the challenge. As a result, the star scorers often underperform relative to expectations since the noise will inflate M more than it will reflect true ability G. This effect is actually notoriously hard to shake off even with the best of test designs.
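To see the “tails come apart” effect numerically, here is a minimal simulation; the equal signal-to-noise ratio and all numbers are invented for illustration. We select candidates by the proxy M = G + noise and check how much true skill G the top scorers actually have:

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
G = rng.normal(size=n)            # true skill
noise = rng.normal(size=n)        # test noise, same scale as the signal
M = G + noise                     # proxy score that we actually observe
top = M > np.quantile(M, 0.999)   # select the top 0.1% by the proxy
print("mean proxy score of those selected:", round(M[top].mean(), 2))  # roughly 4.8
print("mean true skill of those selected: ", round(G[top].mean(), 2))  # roughly 2.4

With equal signal and noise variances, about half of an extreme proxy score is pure luck, which is exactly the regression to the mean described above.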

Extremal Goodhart kicks in when optimization pushes the system into regions where the original proxy‑goal relationship no longer holds. The authors distinguish between

  • a regime change, where model parameters change, e.g., in the model above, we still have Gaussian noise but its mean and variance will change depending on s,
  • model insufficiency, where we do not fully understand the relationship between M and G; in the definitions above, it will mean that actually M(s) = G(s) + G′(s), where the discrepancy G′(s) is not close to a Gaussian everywhere, but we don’t know that in advance.

This is what seems to have happened to Zillow in 2021 (I have no inside information and can’t be sure, of course). Their business was to buy houses on the cheap and resell them, and their valuations relied on a machine learning model trained on historical housing data (prices, features, etc.). It appears that the models had been trained under “normal” market conditions, and when the housing market cooled down in 2021–22, they systematically overvalued homes because they were pushed into a regime far outside their calibration range. As a result, Zillow suffered over $500 million in losses.

An example of regime change that some of us can relate to is the body-mass index (BMI). BMI, defined as your weight divided by your height squared, is a pretty good proxy for body fat percentage for such a simple formula. But it was developed by Adolphe Quetelet in the 1830s, and researchers have long since realized that BMI exaggerates thinness in short people and fatness in tall people (Nordqvist, 2022). For some reason unknown to me, we have not even switched to a similar but more faithful formula like weight divided by height^2.5 (it would perhaps have been too hard to compute in Quetelet’s times, but we have had calculators for a while); no, we as a society just keep on defining obesity as BMI > 30 even though it doesn’t make sense for a significant percentage of people.
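To make the exponent question concrete, here is a quick comparison for two invented people; the exponent-2.5 index below uses the 1.3 scaling factor from Trefethen’s “new BMI” proposal (my addition, not mentioned above) so that the two formulas live on a comparable scale:

def bmi(weight_kg, height_m):
    # classic Quetelet BMI: weight / height^2
    return weight_kg / height_m ** 2

def new_bmi(weight_kg, height_m):
    # exponent-2.5 index; the 1.3 factor keeps the scale comparable to classic BMI
    return 1.3 * weight_kg / height_m ** 2.5

for name, w, h in [("short person", 60, 1.60), ("tall person", 85, 1.95)]:
    print(f"{name}: BMI = {bmi(w, h):.1f}, exponent-2.5 index = {new_bmi(w, h):.1f}")
# short person: BMI 23.4 vs index 24.1, so BMI reads thinner than the 2.5-exponent index
# tall person:  BMI 22.4 vs index 20.8, so BMI reads fatter than the 2.5-exponent index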

Causal Goodhart arises when a regulator actively intervenes: the very act of manipulation alters the causal pathways and invalidates the proxy. This is the type that Charles Goodhart meant in his original formulation, and Manheim and Garrabrant (2018) break it down further into three subtypes characterized by different causal diagrams:

Each breaks down or reduces the correlation between the proxy metric and true goal in different ways:

  • in shared cause interventions, manipulating a common cause X severs the dependence between metric and goal; for example, standardized tests (M) are usually designed with the best possible intentions to measure true learning outcomes (G), but if you change your teaching style (X) and explicitly “teach to the test” with the goal to help students get the best test scores, it will inevitably decouple test scores from actually learning important material;
  • intervening on an intermediary node breaks the chain linking goal to metric altogether; for example, a hospital can use a safety checklist (X) to improve patient outcomes (G) but if the hospital concentrates exclusively on passing checklists to 100% (X) because this is what the audits measure (M), it will, again, decouple the metric from the goal;
  • metric manipulation shows how directly setting the proxy can render it useless as a signal of true performance; e.g., if a university dean orders all faculty to always award students at least a B grade, average GPA (M) will of course increase, but the actual learning outcomes (G) will remain the same or even deteriorate due to reducing grade-based external motivation.

Finally, adversarial Goodhart arises in multi‑agent settings. It describes how agents—whether misaligned or simply incentivized—can exacerbate metric failures. This is the core mechanism of goodharting in reinforcement learning, for instance. There are two subtypes here:

  • the cobra effect appears when the regulator modifies the agent’s goal and sets it to an imperfect proxy (M), so the agent optimizes the proxy goal, possibly at the expense of the regulator’s actual intention (G); examples of the cobra effect include, well, the cobra case and the Hanoi rat massacre we discussed above;
  • adversarial misalignment covers scenarios where an agent exploits knowledge of the regulator’s metric to induce regressional, extremal, or causal breakdowns; in this case, the agent pursues its own goal and also knows the regulator will apply selection pressure based on its proxy metric M, so the agent will be able to game the metric; for example, spam emails will try to artificially add “good” words like “meeting” or “project” to fool simple spam filters, while black-hat SEO practitioners will add invisible keywords to their web pages to come higher in search results.

In other words, in a cobra effect the regulator designs an incentive (e.g., paying per dead cobra) which an agent then perversely exploits. In adversarial misalignment Goodhart, no incentive is formally offered—the agent simply anticipates the regulator’s metric and works to corrupt it for its own ends.
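As a tiny illustration of how a metric-aware agent games a fixed metric, here is a toy bag-of-words spam score with invented word weights: padding a spam message with innocuous “work” words drags its score below the decision threshold without changing the payload at all.

# Toy log-odds word weights: positive = spammy, negative = legitimate (invented numbers).
WEIGHTS = {"free": 2.0, "winner": 2.5, "click": 1.5,
           "meeting": -1.5, "project": -1.2, "agenda": -1.0}

def spam_score(text: str) -> float:
    # sum of word weights; classify as spam if the total is positive
    return sum(WEIGHTS.get(word, 0.0) for word in text.lower().split())

spam = "click here you are a winner of a free prize"
padded = spam + " meeting project agenda meeting project"
print(spam_score(spam))    #  6.0 -> flagged as spam
print(spam_score(padded))  # -0.4 -> slips past the filter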

In AI research, goodharting—system behaviours that follow from optimizing proxy metrics—most often occurs in reinforcement learning. But before we dive into complex RL tasks, let me discuss examples where goodharting arises out of very simple models.

Simple Examples of Goodharting

Goodharting is a very common phenomenon in many different domains, and it has some rather clear mathematical explanations. In this section, we illustrate it with three examples: a simple mathematical model, a phenomenon that pervades all of machine learning, and simple RL examples that will prepare us for the next sections.

Simple mathematical model. For a mathematical example, let us consider the work of El-Mhamdi and Hoang (2024) who dissect Goodhart’s law rigorously, with simple but illustrative mathematical modeling. Let’s discuss one of their examples in detail.

They begin with a model where a “true goal” G is approximated by a “proxy measure” M. The difference M − G is treated as a random discrepancy ξ, and the main result of the paper is that Goodhart’s law depends on the tail behaviour of the distribution of ξ. If it has short tails, e.g., it is a Gaussian, the Goodhart effect will be small but if ξ follows something like a power-law distribution, i.e., has heavy tails, then over-optimizing the proxy can backfire spectacularly.
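Here is a quick simulation of this tail effect; the distributions are chosen for illustration and are not taken from the paper. With a Gaussian discrepancy ξ, the states with the highest proxy values M = G + ξ still have high true values G; with a heavy-tailed (Pareto) ξ, the top of the proxy distribution is dominated by the discrepancy, and the selected states are barely better than average in terms of G:

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
G = rng.normal(size=n)                        # true goal values
for name, xi in [("gaussian noise", rng.normal(size=n)),
                 ("heavy-tailed noise", rng.pareto(1.5, size=n))]:
    M = G + xi                                # proxy = goal + discrepancy
    top = M > np.quantile(M, 0.999)           # optimize hard: keep the top 0.1% by proxy
    print(f"{name}: mean G over all states = {G.mean():.2f}, "
          f"mean G over proxy-selected states = {G[top].mean():.2f}")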

The authors construct a simple addiction example where an algorithm recommends content over a series of rounds while the user’s preference evolves due to an addiction effect. At each time step t, the algorithm recommends a piece of content x_t, and the user’s intrinsic preference θ_t changes with time depending on the original preference θ and a combination of past recommendations:

Here α shows the weight of the original user preference, so lower values of α mean that the user is more “addicted” to recent experiences. After consuming x_t, the user provides feedback: if x_t < θ_t, the user sets y_t = +1 to indicate that she wants x to be higher, and y_t = –1 if x_t > θ_t. Using this binary feedback, the algorithm updates its recommendation with a primitive learning algorithm, as

i.e., increases or decreases x_t depending on the feedback, with a decaying learning rate starting from w. The experiment goes like this:

  • we define a performance metric, in this case a straightforward sum of inverse squares of the discrepancies between the recommendations and the user’s preferences;
  • the metric is a function of the three model parameters, and we tune the algorithm to find the best possible starting learning rate w according to the metric;
  • but the twist is that the algorithm has the wrong idea about the user’s “addiction coefficient” α.

Let’s say that the algorithm uses α = 50 for its internal measure while in reality α = 5, i.e., the user is far more “addicted” to recent content than the algorithm believes. After running the algorithm many times in this setting with different w, it turns out that while sometimes the algorithm indeed achieves high values of the true goal, often optimizing the proxy metric comes into conflict with the true goal.
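Since the post does not reproduce the paper’s exact formulas, the sketch below uses made-up stand-in dynamics (an exponential-smoothing preference drift and an ε-regularized inverse-square score); it only illustrates the protocol of tuning w against an assumed α and then deploying against the real one, and its numbers will not match the paper’s:

import numpy as np

def simulate(alpha, w, theta0=1.0, T=200, eps=1e-4):
    # Toy recommender vs. a user whose preference drifts towards past recommendations;
    # the update rules below are simplified stand-ins, not the paper's formulas.
    x, theta, score = 0.0, theta0, 0.0
    for t in range(1, T + 1):
        score += 1.0 / ((x - theta) ** 2 + eps)    # "inverse square" closeness metric
        y = 1.0 if x < theta else -1.0             # binary feedback from the user
        x += y * w / t                             # decaying-step update, starting from w
        theta = (alpha * theta + x) / (alpha + 1)  # drift; larger alpha = stickier preference
    return score

alpha_true, alpha_assumed = 5.0, 50.0
grid = np.linspace(0.05, 5.0, 100)
proxy_scores = [simulate(alpha_assumed, w) for w in grid]   # what the tuner sees
true_scores = [simulate(alpha_true, w) for w in grid]       # what the user actually gets
w_proxy = grid[int(np.argmax(proxy_scores))]
w_true = grid[int(np.argmax(true_scores))]
print(f"w chosen by the proxy: {w_proxy:.2f}, true score {simulate(alpha_true, w_proxy):.1f}")
print(f"best w for the true goal: {w_true:.2f}, true score {simulate(alpha_true, w_true):.1f}")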

Moreover, the highest values of the metric are achieved for values of w that yield extremely low results in terms of the true goal! These results are in the top left corner of the graph on the left:

In this example, we get “catastrophic misalignment” due to goodharting, and it arises in a very simple model that just gets one little parameter about the user’s behaviour wrong but otherwise is completely correct in all of its assumptions. There is no messy real world here, just one wrong number. Note also that in most samples, the true goal and proxy metric are perfectly aligned… until they are not, and in the “best” cases with respect to the proxy they are very much not.

The authors examine different distributions—normal, exponential, and power law—and illustrate conditions that separate harmless over-optimization from catastrophic misalignment. In short, even in very simple examples, relying on a metric as your stand-in for “what you really want” may be highly misleading, and even a seemingly solid correlation between measure and goal can vanish or invert in the limit of strong optimization of the metric.

By the way, does this model example ring true to your YouTube experience? I will conclude this part with a very interesting footnote that El-Mhamdi and Hoang (2024) included in their paper:

…One large-scale example of alignment problems is that of recommender systems of large internet platforms such as Facebook, YouTube*, Twitter, TikTok, Netflix and the alike…

*This paper was stalled, when one of the authors worked at Google, because of this very mention of YouTube. Some of the main reviewers handling “sensitive topics” at Google were ok with this paragraph and the rest of the paper, as long as it mentioned other platforms but not YouTube, even if the problem is exactly the same and is of obvious public interest.

Next we proceed to goodharting in machine learning. It turns out that a simple special case of goodharting is actually a very common problem in ML.

Shortcut learning. Imagine that you’ve built a state‑of‑the‑art deep neural network that classifies pictures of animals with superhuman performance. It works perfectly on 99% of your test images—until you show it a cow on a beach. Suddenly it thinks it’s a horse, because in the training set, cows only ever appeared in pastures, and the model has learned that cows require a green grass background. This is a very common phenomenon called shortcut learning (Geirhos et al., 2018), and it is also a kind of goodharting: the model learns to look for proxy features instead of the “true” ones.

Here are some interesting examples collected by Geirhos et al. (2018):

Probably the most striking example here comes from Zech et al. (2018): they consider a classifier that successfully detected pneumonia from lung X-rays trained on a dataset from several different hospitals, and performed very well on the test set… but failed miserably on scans from other hospitals! It turned out that the classifier was looking not at the lungs but at the hospital-specific token stamped on the X‑ray, because that token predicted the prevalence of pneumonia in both training and test sets and was sufficient for a reasonably good prediction.

Geirhos et al. introduce a taxonomy of decision rules and features: uninformative features that don’t matter, overfitting features that will not generalize to the test set, shortcut features that work on both training and test set but will fail under a domain shift, and intended features that would produce a human-aligned solution. For a classifier learning on a given dataset, there is no difference between the latter two:

The relation to goodharting is clear: a shortcut solution is goodharting a “proxy metric” defined on a specific dataset instead of the “true goal” defined as the intended goal of a machine learning system.

I want to highlight another, less obvious connection here. As we know, in deep learning we cannot afford to compute even the full gradient over the whole dataset, since it would be too expensive; we have to use stochastic gradient variations. The theoretical underpinning of stochastic gradient descent (SGD) is that we are actually optimizing an objective defined as an expectation of the loss over the data distribution,

L(θ) = E_{(x, y) ~ p_data} [ ℓ(f_θ(x), y) ],

and in practice this expectation has to be approximated. In gradient descent, the expectation is approximated via the average over the training set. In stochastic gradient descent, it is further approximated via stochastic gradient estimates on every mini-batch.

A lot of research in optimization for deep learning hinges on the second approximation: for instance, we cannot use quasi-Newton algorithms like l-BFGS precisely because we do not have full gradients, we only have a very high-variance approximation. Shortcut learning uncovers the deficits of the first approximation: even a large dataset does not necessarily represent the true data distribution completely faithfully. Often, AI models can find shortcut features that work fine for the whole training set, and even the hold-out part of the training set used for validation and testing, but that still do not reflect the “true essence” of the problem.
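Here is a tiny sketch of the two approximations for linear least squares (my own toy example): the “true” objective is an expectation over a data distribution we never see in full; we first replace it with the average over a finite training set, and then with a mini-batch estimate of that average.

import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # the "true" data distribution: y = 2x + noise; we only ever see a finite sample of it
    x = rng.normal(size=n)
    return x, 2 * x + 0.5 * rng.normal(size=n)

x_train, y_train = sample(10_000)   # approximation 1: a dataset instead of the distribution
theta = 0.0                         # single parameter of the model y = theta * x

def grad(theta, x, y):
    # gradient of the mean squared error 0.5 * mean((theta*x - y)^2) with respect to theta
    return np.mean((theta * x - y) * x)

full_grad = grad(theta, x_train, y_train)                   # full-batch gradient
idx = rng.choice(len(x_train), size=32, replace=False)
mini_grad = grad(theta, x_train[idx], y_train[idx])         # approximation 2: mini-batch estimate
print(f"full-batch gradient: {full_grad:.3f}, mini-batch estimate: {mini_grad:.3f}")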

But shortcut learning is not our central example of goodharting: it is goodharting at the level of a fixed dataset, while we are mostly interested in goodharting in sequential decision making. So with it out of the way, it is finally time to move on to RL.

Simple RL model. Karwowski et al. (2023) investigated goodharting in reinforcement learning in some very primitive environments. They consider classic setups such as Gridworld—a deterministic grid-based environment where agents navigate to terminal states—and its variant Cliff, which introduces risky “slippery” cells that can abruptly end a run. They also examine more randomized and structured settings like RandomMDP, where state transitions are sampled from stochastic matrices with a fixed number of terminal states, and TreeMDP, which represents hierarchical decision-making on a tree-structured state space with branching factors analogous to the number of available actions. On the algorithmic side, Karwowski et al. use classical RL methods (Sutton, Barto, 2018), in particular policy optimization via value iteration.

To evaluate goodharting experimentally, they need a way to quantify optimization pressure, that is, how hard the agent is pushed to optimize the given (proxy) reward. The authors use and compare two approaches:

  • maximal causal entropy (MCE) regularizes the objective with the Shannon entropy of the policy, αH(π(s)); the optimization pressure is defined as e^α for the regularization coefficient α: larger values of α lead to policies that are less defined by the objective and more defined by the entropy;
  • Boltzmann rationality, where the policy is defined as a stochastic policy with the probability of taking action a being

    π(a|s) = exp(Q*(s, a)/α) / Σ_a′ exp(Q*(s, a′)/α),

    where Q* is the optimal state-action value function, and we can again define optimization pressure as e^α (a small numerical sketch follows right after this list).
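To make the Boltzmann side concrete, here is a minimal single-state sketch with invented Q-values (this is not one of the paper’s environments): as the temperature drops and the policy concentrates on the proxy’s favourite action, the value of the true reward first improves and then collapses.

import numpy as np

def boltzmann_policy(q_values, alpha):
    # softmax of Q-values with temperature alpha (Boltzmann-rational policy)
    logits = q_values / alpha
    logits = logits - logits.max()          # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Single state, three actions; the proxy's favourite action (the first one)
# happens to be mediocre for the true goal.
proxy_q = np.array([1.0, 0.9, 0.2])
true_q = np.array([0.3, 1.0, 0.1])
for alpha in [10.0, 1.0, 0.1, 0.01]:        # lower temperature = sharper policy
    pi = boltzmann_policy(proxy_q, alpha)
    print(f"alpha={alpha:5.2f}  proxy value={pi @ proxy_q:.3f}  true value={pi @ true_q:.3f}")
# The true value first improves and then collapses to 0.3 once the policy
# concentrates on the action that is best only according to the proxy.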

The authors find that in simple environments, there is usually a clear threshold in the similarity between the true goal and the proxy metric beyond which goodharting kicks in. Here is a sample plot: as optimization pressure grows, nothing bad happens up to some “level of misalignment”, but when the goal and proxy become too different, increasing optimization pressure leads to the actual goal being pessimized:

Here is another example with a simple but illustrative geometric intuition. Here, the proxy and true rewards are directions in 2D space, and the agent is doing constrained optimization inside a polytope. As the angle between them grows, at some point proxy optimization leads to a completely different vertex of the constraint polytope:

To alleviate goodharting, the authors introduce an early stopping strategy based on monitoring the optimization trajectory’s angular deviations, which provably avoids the pitfall of over-optimization at the expense of some part of the reward. Importantly, they introduce a way to quantify how “risky” it is to optimize a particular proxy in terms of how far (in angle-distance) it might be from the true reward. By halting policy improvements the moment the optimization signal dips below a certain threshold, they avoid goodharting effects. 

So on the positive side, Karwowski et al. (2023) provide a theoretical and empirical foundation for tackling reward misspecification in RL, suggesting tools to better manage (or altogether avert) catastrophic side effects of hyper-optimizing the wrong measure. On the negative side, this all works only in the simplest possible cases, where the true and proxy metrics are very simple, and you can precisely measure the difference between them.

Phase transitions: when do models begin to goodhart? Pan et al. (2023) stay at the level of toy problems but study goodharting from a different angle: how does reward hacking scale with agent capability? More capable agents will be more likely to find flaws in the reward metric, but can we catch this effect before disaster?

The authors experiment with four increasingly complex RL sandboxes:

  • traffic control (Wu et al., 2021) based on the SUMO traffic simulator (Lopez et al., 2018);
  • COVID response policy environment that models the population using SEIR infection dynamics model and lets the agent tweak social distancing regulations (Kompella et al., 2020);
  • Atari Riverraid game from the OpenAI Gym (Brockman et al., 2016);
  • glucose monitoring, a continuous control problem (Fox et al., 2020) that extends an FDA-approved simulator (Man et al., 2014) for blood glucose levels of a patient with Type 1 diabetes, with the agent controlling insulin injections.

For these environments, they designed nine different proxy rewards; I won’t discuss all nine but here are some examples:

  • in traffic control, one possible true objective is to minimize the average commute time, while the proxy objective changed it to maximizing the average speed of the cars; notice how it’s not even immediately obvious why that’s not the same thing;
  • for COVID policy, the real goal is to balance health, economy, and political costs (public tolerance), while the proxy forgets about the political costs and only balances health and the economy;
  • in glucose monitoring, the true reward is to minimize the total expected patient cost (ER visits + insulin costs) whereas the proxy would disregard the costs and concentrate only on the glycaemic-risk index.

I listed what Pan et al. (2023) call “ontological” misspecifications, when the proxy and true rewards have the same goal in mind but capture it by different formalizations. The other two cases are misweighting (same formal goals, different weights for them) and changes in scope (same goal but tracked only over a subset of the problem).

It is quite clear that these examples have been designed to allow reward hacking, so it is no wonder that the hacking eventually appears. Pan et al. (2023) study when it appears, tweaking model capacity (from tiny MLPs to 10M-parameter networks), number of training steps, action space resolution (e.g., more fine-grained throttle for Riverraid), and observation fidelity (e.g., higher COVID testing rates).

For example, in the traffic control problem the task is to design how cars from a side road merge into a highway. A small model would do this as intended, slowing down traffic on the highway to let the new car join in. But a larger model would have a novel “aha” moment: if you never let the car enter the highway, its own mean velocity will go to zero, but a far greater number of cars on the highway will not have to slow down! So in terms of average speed, it makes sense to never let the car merge:
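A back-of-the-envelope calculation with invented numbers shows why the proxy likes this behaviour: one car stuck at the on-ramp barely dents the average speed, while letting it merge momentarily slows everyone down.

# Invented numbers, just to illustrate the proxy-vs-goal arithmetic at the moment of merging.
n_highway = 20          # cars already on the highway
free_speed = 30.0       # m/s when traffic flows freely
merge_speed = 18.0      # m/s while the highway slows down to let one car in

speeds_merge = [merge_speed] * (n_highway + 1)          # option A: let the car merge
speeds_block = [free_speed] * n_highway + [0.0]         # option B: never let it in

proxy = lambda speeds: sum(speeds) / len(speeds)        # proxy metric: average speed
print("average speed if we merge:", proxy(speeds_merge))   # 18.0
print("average speed if we block:", proxy(speeds_block))   # ~28.6
# The blocked driver's commute time grows without bound, but the proxy never sees it.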

The authors find clear phase transitions in many cases; here are the plots that show when reward hacking begins to appear for the three ontological proxies mentioned above as the models grow in size:

The main worrying conclusion from this study is that the authors found absolutely no early warnings. Not in terms of model capacity: the models just cannot find a hack right until they can. And not in terms of the proxy metric diverging from the true one: in many cases, correlations between the proxy and true metric are very high right until the point when they diverge catastrophically. So they close with a warning that we really should heed:

Regarding safe ML, several recent papers propose extrapolating empirical trends to forecast future ML capabilities (Kaplan et al., 2020; Hernandez et al., 2021; Droppo, Elibol, 2021), partly to avoid unforeseen consequences from ML. While we support this work, our results show that trend extrapolation alone is not enough to ensure the safety of ML systems. To complement trend extrapolation, we need better interpretability methods to identify emergent model behaviors early on, before they dominate performance (Olah et al., 2018).

In this section, we have mostly considered toy examples and simple environments. Real world RL tasks are much more complex, and this leads to some very interesting behaviours; we discuss them in the next section.

Adversarial Goodhart in modern RL and LLMs

Goodharting in RL. In reinforcement learning, goodharting almost inevitably becomes adversarial: an agent is trying to actively maximize the reward, so if it is a proxy reward, the agent will uncover and exploit the difference between the proxy and the true intentions as soon as it becomes profitable to do so. There are three terms that are commonly used here:

  • specification gaming, when the system is “honestly” optimizing the goal that was passed to it, but it turns out that a literal interpretation of the goal was not what was really intended; every “genie in a lamp” or “monkey’s paw” story in the world, including the famous paperclip maximizer parable (that we discussed earlier), is of this kind;
  • reward hacking, a special case of specification gaming when the system finds inventive strategies that maximize the reward in unexpected ways, often with undesirable side effects; e.g., reward hacking in a videogame would be to use speedrun-style skips or bugs in the game engine;
  • finally, reward tampering can occur when a system gains access to the reward mechanism itself; e.g., reward tampering in a videogame would be to simply access a memory location that stores the points count and set it to infinity instead of trying to play the game.

Reward tampering is very different from the first two, so let us begin with examples of specification gaming and reward hacking and proceed to reward tampering later. All non-LLM examples below (and some LLM-based ones) were taken from a list compiled by DeepMind researchers Victoria Krakovna et al.; this list is still being updated, and I am also happy to refer to their post (Krakovna et al., 2020), to which this list was a supplement.

Specification gaming and reward hacking examples. It has always been known that proxy metrics may be a bad idea for RL. For instance, if you are training an agent to play chess, you definitely should not reward it for taking the opponent’s pieces since it will fall prey to any combination with a sacrifice. The only way to truly learn to play chess from scratch, like AlphaZero does, is to reward only for the final states of the game.

As soon as RL appeared in other domains, similar effects started to arise. One of the earliest real examples was experienced by Paulus et al. (2017), who were training a model for abstractive summarization. That was before the advent of Transformers, so they trained an encoder-decoder recurrent architecture with a new kind of attention mechanism, but that’s beside the point now. What is interesting is that they tried to fine-tune the model with reinforcement learning, combining the supervised objective of predicting the next word in a ground truth abstract with an RL-based policy learning objective. Since there was no way to objectively judge summarization quality automatically, the proxy metric that they were optimizing was ROUGE (Lin, 2004), a metric based on the overlap between n-grams in the model’s answer and the ground truth. Since reference abstracts reuse many phrases from the source text, the best way to boost ROUGE is to copy large chunks of the source… and sure enough, their RL‐trained summarizer produced longer, more verbose and much less readable summaries, achieving high ROUGE scores with poor utility.
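Here is a self-contained toy version of the effect, using a simplified unigram F1 as a stand-in for ROUGE-1 (this is not the official ROUGE implementation, and the texts are invented): because reference summaries reuse many of the source’s words, a candidate that simply copies a chunk of the source outscores a shorter abstractive summary.

from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    # unigram-overlap F1, a simplified stand-in for ROUGE-1
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(cand.values()), overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

source = ("the city council approved the new budget on tuesday "
          "after a long debate about school funding and road repairs")
reference = "the council approved the budget after a debate about school funding"
copied = " ".join(source.split()[:13])              # just copy the start of the source
abstractive = "council passes budget despite funding dispute"
print(f"copied chunk: {rouge1_f1(copied, reference):.2f}")       # ~0.67
print(f"abstractive:  {rouge1_f1(abstractive, reference):.2f}")  # ~0.35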

A famous reward hacking example comes from OpenAI researchers Clark and Amodei (2016)—yes, Dario Amodei—who were training RL models to play simple videogames. One of the games in their “videogame gym” was CoastRunners, a boat racing game. But CoastRunners also had scoring targets (“coins”) that you could hit along the way to boost your points total, and these respawned after a while. So the agent learned to never finish even a single lap, going in a short loop that allowed just enough time for the “coins” to respawn, thus collecting reward indefinitely.

In a Lego stacking task studied by Popov et al. (2017), the goal was to train a robot to put a red Lego block on top of the blue one. But the proxy reward that they were maximizing was the Y coordinate of the bottom side of the red block… So of course, instead of the harder task of picking up the red block and putting it on top of another, the robot simply learned to knock the red block over, a much easier way to get approximately the same reward!

Another joint experiment by OpenAI and DeepMind researchers Christiano et al. (2017)—yes, Paul Christiano, with Jan Leike, Shane Legg, and Dario Amodei among co-authors—tried to apply RLHF-style training to robotics. Their setup was similar to how RLHF works in LLMs (recall our earlier post)—humans provide feedback, and a reward prediction model is trained on it and then serves as part of the RL environment:

But when they applied this method to a grasping task, where a robot hand was supposed to pick up a ball in a simulated environment, they made a subtle mistake: they asked humans (and consequently the reward predictor) to evaluate the success of the grasping task based on a single camera view. But it is very hard for a human to estimate depth from a single camera in an empty simulation environment, so evaluators could not really tell whether the robot hand had actually grasped the object or was merely in the line of sight. As a result, the robot learned to place the hand between the object and the camera and perform some movements so that it would appear to be grasping… a perfect example of reward hacking due to imperfect human evaluations.

But my favourite example in this vein comes from Black et al. (2023); it will also serve as a good segue to LLM-based examples. They tried to use reinforcement learning to train diffusion models (there was an intro to diffusion models and diffusion in latent spaces on this blog some time ago), using objectives such as compressibility, aesthetic quality, and prompt-image alignment, with significant success:

But how do you measure, say, prompt-image alignment automatically? Again, one has to resort to proxy metrics, in this case an evaluation model based on LLaVA, a well-known visual language model (Liu et al., 2023):

This worked fine for many examples… except for one interesting special case. As you may know, diffusion-based models are terrible at counting. If you ask for six tigers in the picture, you will get… several tigers. More than two, usually. So naturally, they tried to fix this by running their RL scheme with prompts like “six tigers” or “five wolves”.

It proved to be too hard for their diffusion model to learn to count animals. But in the process, the model actually learned to perform a kind of adversarial attack on the reward model, specifically on LLaVA! It found that LLaVA was susceptible to the so-called “typographic attack”: if you draw some tigers on your image and add a written inscription that says “six tigers”, LLaVA will likely accept them as six.

Back in 2023, diffusion-based models were not that good at writing text either, so the examples look quite funny, but they were enough to fool LLaVA:

There are plenty more examples like this; I refer again to Krakovna et al. (2020) for a far more comprehensive survey. But does something like this happen to the main tool of modern frontier AI—large language models? You bet!

Reward hacking in RLHF. Large language models are trained to predict the next token in large-scale pretraining, but then raw LLMs are fine-tuned by a process that usually involves reinforcement learning (recall our post on LLM fine-tuning). And as we know, RL means goodharting.

In a way, goodharting began as soon as RLHF had appeared. One could argue that the famous Microsoft Bing (also known as Sydney) responses from the spring of 2023 are the results of goodharting seeping through RLHF. Here is a sample conversation where Sydney tried to gaslight the user into believing that it is 2022 rather than 2023, and Avatar 2 has not yet been released.

But it is a bit of a strain to call this goodharting: I would say that in this case, fine-tuning had simply gone wrong. The user was definitely not happy with this exchange—apart from the fact that they got a lot of Reddit karma. “True” goodharting would optimize the proxy metric of user satisfaction while sacrificing the quality of outputs. Does that happen with LLMs?

Sure! We will consider one example that dives into the details of goodharting in RLHF: a paper called Language Models Learn to Mislead Humans via RLHF by Wen et al. (2024). They note the goodharting gap that can arise between three different rewards in RLHF:

  • oracle reward R*, which is the true objective—e.g., perfect correctness on a QA or coding task, measured by expert graders or exhaustive unit tests;
  • human reward R_human, which represents the noisy, time‑constrained judgments of human evaluators used to train the RLHF reward model;
  • proxy reward R_RM, which is what the reward model actually learns and then provides as part of the large‑scale RLHF environment.

How can we explore this gap? We need to make the questions difficult enough so that it will be too hard for the evaluators to make R_human perfectly identical to R*. Wen et al. test two challenging benchmarks:

There are oracle rewards available for both datasets: correct QA answers and comprehensive test suites. But it is hard for a human to evaluate correctness by themselves, and in their experiments, Wen et al. added time constraints (allowing 3–10 minutes per task) to increase this effect. In all other respects, their pipeline was exactly identical to regular RLHF; the only difference was the increased gap between R_human and R*.
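Here is a deliberately cartoonish model of this gap (all numbers are invented, and this is not the authors’ setup): a policy splits its effort between being correct and being persuasive, the oracle reward only sees correctness, and the simulated time-constrained rater is sometimes swayed by persuasive but wrong answers. Maximizing expected human approval then pushes all effort into persuasion.

import numpy as np

efforts = np.linspace(0.0, 1.0, 101)    # e = share of effort spent on actually being correct

def oracle_reward(e):
    return e                            # probability that the answer is actually correct

def human_reward(e):
    # cartoon rater: correct answers are accepted 85% of the time, while incorrect
    # answers are accepted more often the more persuasive the writing is
    p_correct, persuasion = e, 1.0 - e
    return p_correct * 0.85 + (1 - p_correct) * (0.30 + 0.65 * persuasion)

best_for_human = efforts[np.argmax([human_reward(e) for e in efforts])]
print("effort on correctness that maximizes human approval:", best_for_human)    # 0.0
print("oracle reward at that point:", oracle_reward(best_for_human))             # 0.0
print("human approval at that point:", round(human_reward(best_for_human), 2))   # 0.95
print("human approval of a fully correct policy:", human_reward(1.0))            # 0.85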

And here are the main results of Wen et al. (2024):

After RLHF with imperfect human rewards, the resulting policy indeed learned to optimize the reward much better—but this was completely unrelated to actual performance in terms of R*! The true performance even slightly (insignificantly) decreased. What actually increased very significantly was the error rate of human evaluators; in other words, the LLM learned to trick humans into accepting wrong answers rather than give correct answers more often.

Qualitative analysis by Wen et al. revealed that the RLHF’ed policy learned a lot of tricks! It can fabricate statistics or cherry‑pick quotes that sound authoritative:

It can argue better for incorrect conclusions, in particular producing internally consistent but untruthful chains of reasoning; in the example below, answer B is wrong in both cases, but the initial LM’s reasoning is much more obviously contradictory:

It can hide or neutralize obvious red flags such as edge‑case failures that would show up in human-written tests; in the example below, the RLHF’ed LM’s code looks almost intentionally obfuscated:

It can also hide its logical fallacies better; in the example below, the initial LM starts off with an obvious contradiction (if hybrid OA uses a mix of green and gold practices, answer A can’t be right). The RLHF’ed LM makes a logical mistake in the last sentence as well (it doesn’t really follow), but this one is much harder to spot; moreover, the RLHF’ed LM has learned to hide supporting evidence that might contradict its inferences:

In effect, the fine-tuned model becomes a master of rhetorical persuasion, not problem solving. Wen et al. also confirm that this is goodharting by performing the same RLHF fine-tuning with the oracle reward R* directly. In this case, the false positive rate does not grow significantly, and generally the model somewhat improves real performance while not trying to deceive anybody:

Another interesting example comes from the field of AI safety itself. In the Rainbow Teaming approach, Samvelyan et al. (2024) tried to teach LLMs to generate adversarial attacks on other LLMs, basically trying to automate jailbreaking. Their evaluation was based on scores from Llama Guard (Inan et al., 2023), a model that automatically assesses the risk categories of different responses:

Sure enough, when they used Llama Guard without extra LLM-based comparisons (the “score-based” variant in the plots below), they very rapidly drove the attack success rate to virtually 100%, while the same prompts performed significantly worse when evaluated by a stronger proxy (GPT‑4):

In this textbook example of goodharting, instead of breaking LLMs the models learned to break their evaluation system. Samvelyan et al. were able to introduce additional evaluations and mitigate this problem… but to do that, you have to at least notice the problem first, and in the absence of an oracle that may be difficult.

Let me conclude this section with a very recent example that I can’t be entirely sure about but that sure looks like reward hacking. When the Llama 4 series of models was released in early April (Meta, Apr 5, 2025), Llama 4 Maverick immediately jumped to the top of the LM Arena leaderboard, taking second place with a very strong Elo rating of 1417. This was very surprising given that other results from Llama 4 were uninspiring to say the least; researchers immediately suggested goodharting as a possible reason for strong Arena performance.

Within a few days, LM Arena released the actual transcripts of Llama 4 Maverick battles on the Arena on HuggingFace Spaces. When you look through the responses, the style of Llama 4 immediately jumps out. Let me give a representative example. When asked “What is the latest season of Fortnite?”, Claude 3.5 Sonnet replied simply:

Fortnite Chapter 5 Season 2, titled “Myths & Mortals,” is the current season. It began on March 8, 2024, and features a Greek mythology theme with characters and locations inspired by Greek gods and legends.

Llama 4 Maverick, on the other hand, produced over a page of content which I honestly can’t characterize as anything but “pure slop”:

A SUPER timely question! 😄 Fortnite’s seasons are very short-lived (they last about 10-12 weeks each), so the answer changes very frequently. As of my last knowledge update (which was just a few days ago, lol), here is the latest scoop:

The current season of Fortnite is: Season OG (also known as Chapter 4, Season SIX or “Retrospective Season”)

…[over a page more]…

Stay up-to-date by following the official Fortnite Twitter, Reddit, or the in-game news section, as Epic loves to announce surprises and changes on the fly.

Happy gaming!

Objectively, of course, both LLMs are wrong: they give answers based on their knowledge cutoffs, and in reality Chapter 5 of Fortnite, let alone Chapter 4, is long over. But where Claude gives a brief and to-the-point answer, Llama 4 starts by complimenting the user, adds unnecessary emojis and strained “casual” words like “lol”, packs a lot of extra info that the user didn’t ask for in a hard-to-digest format… there is absolutely no way I would ever prefer this response over actually answering my question. I think we can all agree that Claude gives a better response “objectively” here, whatever that means; it even has a later cutoff date so it cites a more recent Fortnite season (that’s not really important, of course).

But you know what the funny thing is? Llama 4 Maverick won this battle, and won a majority of other comparisons like this. The raters on LM arena obviously have preferences very different from my own, and somehow Llama 4 Maverick was tuned to these preferences. I don’t want to accuse anybody of anything, but even in the absence of conscious “benchmark-driven development” this is definitely goodharting: Llama 4 learned what the raters prefer at the expense of “objective quality” of the responses. LM Arena announced as much: “Meta should have made it clearer that Llama-4-Maverick-03-26-Experimental was a customized model to optimize for human preference”.

With that, let us proceed to even more troublesome parts of goodharting.

Goal Misgeneralization and Reward Tampering

Goal Misgeneralization. Specification gaming occurs when an agent dutifully abides by a flawed specification: the reward designers have not been careful enough and allowed for hacks to occur. And yes, we have discussed that it is almost never possible to be careful enough in realistic cases.

But there is an even simpler cause for misalignment: even when the specification is perfectly correct, the agent may misunderstand how it carries over to other cases. This is called goal misgeneralization, and in a way it is a special case of domain transfer (I had a chapter on domain transfer in my 2021 book on synthetic data, and it has not become any less important since). In terms of the definitions that we laid out in the first part of the series, goal misgeneralization can be called inner misalignment: the emergent objectives that arise in the optimization process may become misaligned with the original intentions, even if they were clearly communicated (see also Ortega et al., 2018).

Goal misgeneralization was first systematically studied by Di Langosco et al. (2022) in the context of pure reinforcement learning, but in this section, let me concentrate on the work by DeepMind researchers Shah et al. (2022). Their “toy” RL-based example comes from MEDAL-ADR (Multi-agent Environment for Developing Adaptive Lifelong Reinforcement Learning; CGI Team, 2022). In this environment, an agent must navigate a 3D space with obstacles to visit colored spheres in a specific order. The design includes an interesting learning mechanism: when training begins, the agent is paired with an “expert” bot that demonstrates the correct sequence.

As a result, the agent is learning through “cultural transmission”—observing another entity’s successful behavior—rather than discovering it through pure trial and error. This is shown in the (a) part of the figure below (Shah et al., 2022):

The reinforcement learning agent is never explicitly rewarded for following the expert; it only receives rewards for visiting spheres in the correct order. Yet the agent successfully learns to follow its partner as a reliable strategy. This approach works great during training, when the expert always demonstrates the correct sequence. But how about domain transfer—will the agent generalize to a different environment?

For some kinds of domain transfer, the results are not ideal but perfectly expected. The (b) part of the figure above shows that if you flip the agent’s observations vertically, it will not be able to complete the task at all; that’s fine.

But what if we change the “expert”? At test time, researchers paired the agent with an “anti-expert” that deliberately visits spheres in the wrong order. Ideally, the agent would notice the negative rewards received when following the anti-expert’s incorrect path and switch to exploration, like part (c) below shows:

But instead, it continues faithfully trailing the anti-expert, accumulating negative rewards, as in part (d) of the figure! The agent displays impressive navigation capabilities—it can still traverse the environment just fine—but has misgeneralized its goal from “visit spheres in the correct order” to “follow your partner”. Note, first, the worrisome combination of competence and misalignment, and second, the fact that the negative rewards keep coming: the feedback is completely correct, yet the agent persists.

Naturally, LLMs can also exhibit this kind of failure mode. Shah et al. (2022) experimented with Gopher, a 280B parameter LLM developed by DeepMind (Rae et al., 2021). Researchers prompted the model to evaluate linear expressions with unknown variables, and prompts showed examples where the model should ask for the values of unknown variables, then compute the final expression value. During training, all examples involved exactly two unknown variables, establishing a pattern: ask for variable values, then provide the answer.

When tested on expressions with one or three variables, Gopher generalized correctly—it asked for exactly the number of variables needed. But when presented with expressions having no unknown variables at all, like “6 + 2”, the model asked redundant questions like “What’s 6?” before providing an answer! Here are some examples by Shah et al. (2022):

Despite the prompt instructing the model to “provide the value of the expression when the values of all variables are known,” Gopher misgeneralized its goal from “evaluate expressions after collecting all unknown variables” to “always ask at least one question before answering”. Note that, again, goal misgeneralization occurs despite the model being perfectly capable of simple arithmetic and despite the prompt being quite unambiguous.

These examples highlight several key insights about goal misgeneralization:

  • it can emerge naturally from standard training processes, even with correct specifications;
  • it affects all kinds of architectures: as soon as you have some reinforcement learning in the mix, goal misgeneralization can occur;
  • it preserves capabilities while redirecting them toward unintended goals: in all of the examples, learned capabilities were preserved, the models did not just break down;
  • it can be subtle and difficult to detect during training—you cannot predict all the new environments and new ways in which the model might have to generalize its behaviour.

Goal misgeneralization relates to a fundamental aspect of actor-critic reinforcement learning architectures: planning typically involves querying the learned value function (critic), not the ground-truth reward function, to evaluate potential actions. While the agent learns the value function to approximate true rewards during training, they inevitably diverge in out-of-distribution scenarios.
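A tiny hand-rolled sketch of this divergence (not taken from any of the papers above): we fit a “critic” to the true values on in-distribution states only and then let it rank candidate next states out of distribution; the critic’s preferred option differs from the one the true value function prefers.

import numpy as np

def true_value(s):
    # ground-truth value, peaking at s = 3.2
    return -(s - 3.2) ** 2

# "training": the agent only ever visits states in [0, 2] and fits a simple critic there
s_train = np.linspace(0.0, 2.0, 50)
critic = np.poly1d(np.polyfit(s_train, true_value(s_train), deg=1))   # linear critic

# "planning" out of distribution: choose between two candidate next states
candidates = np.array([3.0, 4.0])
print("state preferred by the critic:    ", candidates[np.argmax(critic(candidates))])      # 4.0
print("state preferred by the true value:", candidates[np.argmax(true_value(candidates))])  # 3.0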

We humans also engage in goal misgeneralization all the time, as our internal values are shaped by our experience. For example, we initially pursue career advancement or wealth as a means to security and fulfillment, but very often individuals misgeneralize this goal and continue maximizing wealth or professional titles far beyond what would be needed for security or wellbeing, often to the obvious detriment to the latter. Sometimes goal misgeneralization can even be helpful: while our immediate reward function would call for, e.g., consuming addictive drugs that provide momentary happiness, we choose not to pursue and actively avoid them, basically consciously changing our value function to be different from the neurobiological reward (see also Byrnes, 2025).

What makes goal misgeneralization particularly concerning for AI safety researchers is that more powerful systems could potentially cause significant harm while competently pursuing misaligned goals. As Shah et al. (2022) hypothesized in their “superhuman hacker” scenario, an advanced AI system might misgeneralize its goal from “write helpful pull requests” to “get humans to click the ‘merge’ button on pull requests”—potentially leading to manipulative or harmful strategies to maximize that misgeneralized objective. And as we have already seen above in the work of Wen et al. (2024), this is exactly what happens with more powerful LLMs!

Reward tampering. But all previous examples pale in comparison to reward tampering: behaviours that can occur when an agent gains direct access to its own reward function. The agent can then realize that hacking the process that produces its reward is easier than actually doing the task.

Humans do it all the time: for example, when accountants cook the books to hit quarterly targets, they are perverting the actual goal not by mere goodharting but by fiddling with the measurement process itself. In the real world, it is a fine line between goodharting and reward tampering, but here are a couple of examples that walk that line:

  • Volkswagen’s “Dieselgate” scandal in 2009–2015: Volkswagen diesel cars had special software that detected when a car was under environmental testing, and only then activated a special “clean diesel” mode that reduced the NOx emissions to acceptable levels; arguably, VW not merely goodharted to pass the test but rewired the “reward function”;
  • UK’s four-hour target for emergency admission in hospitals: when faced with a task to get admission time down to below 4 hours, hospitals resorted to reward tampering by having patients wait in an ambulance in front of the hospital or briefly admitting the patients and formally moving them to a ward before the clock runs out; again, arguably the metric was not just goodharted but modified against the original intention of the rules.

But the textbook real world example of reward tampering is, of course, substance abuse, or, in the limit, wireheading. The term originates from the famous rat experiments of Olds and Milner (1954) who put electrodes in the rat’s brain so that pressing a lever would jolt the brain’s dopamine pathway directly. As a result, rats pressed the lever up to 2000 times per hour, ignoring food, water, and their pups—sometimes until they collapsed.

An interesting detail: modern research suggests that Olds and Milner did not discover a “pleasure center” in the rat’s brain as much as a “craving center”: it looks like the rats were getting pure motivation rather than pleasure (Green et al., 2010; Kringelbach, Berridge, 2010). All the more reason to rejoice that similar human experiments, such as the Tulane electrical brain stimulation program, were rare, ethically questionable from the outset, and soon stopped (Heath, 1972; Baumeister, 2000). Addictive drugs, unfortunately, are still widespread despite all the efforts to shut the trade down.

The common thread here is that the agent’s action influences the reward circuitry itself, skipping the environmental success loop. Once such a shortcut is found, all other behaviours seem (and are!) suboptimal to the agent, resulting in a collapsed objective landscape. And people know all too well how hard it may be to patch such behaviours post-hoc, once the agent has already tried a reward shortcut.

Artificial agents can be far better at reward tampering since they can have more direct control over it: the reward is usually also given out by a program that can potentially be modified.

In early RL experiments, reward tampering could only be achieved by explicitly allowing for it in the environment. DeepMind researchers Leike et al. (2017) give the following example. Consider an environment where the agent is tasked with watering virtual tomatoes by walking into their tiles, as shown on the left below:

The agent needs to keep plants watered, but it also has a shortcut: by walking into the bucket tile, it can “put a bucket over its head”, blocking its “visual sensors” and thus permanently having all tomatoes appear to be green. Naturally, any RL algorithm will very quickly find the shortcut and stay in the bucket cell, never actually watering any tomatoes.
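
To make the bucket shortcut concrete, here is a minimal sketch of such an environment in Python; this is my own toy reconstruction under stated assumptions, not the original gridworld code, and all names are made up for illustration:

import random

class TomatoEnv:
    """Toy tomato-watering environment with an observation hack."""
    def __init__(self, n_tomatoes=3, dry_prob=0.3):
        self.watered = [False] * n_tomatoes
        self.bucket_on = False
        self.dry_prob = dry_prob

    def step(self, action):
        # actions: "water_<i>" waters tomato i; "bucket" blocks the visual sensors
        if action == "bucket":
            self.bucket_on = True
        elif action.startswith("water_"):
            self.watered[int(action.split("_")[1])] = True
        # watered tomatoes dry out again with some probability
        self.watered = [w and random.random() > self.dry_prob for w in self.watered]
        # the reward counts tomatoes that *appear* watered to the agent's sensors
        observed = [True] * len(self.watered) if self.bucket_on else self.watered
        return sum(observed)

env = TomatoEnv()
print(env.step("water_0"))  # 0 or 1, depending on whether the tomato dried out again
print(env.step("bucket"))   # 3 from now on: maximal reward, and no tomato is ever watered again

Any reward-maximizing learner in this toy model quickly converges to the “bucket” action: the observed reward is maximal, and the actual task is never done.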

This is a very simple and clearly intentionally designed example, but when a capable RL agent discovers the machinery that hands out its reward, it could edit or hack the code that computes the reward, or spoof or corrupt the inputs to that code so the reward function evaluates a delusion box instead of reality. Either way, the agent will maximize the formal quantifiable reward number while ignoring the actual task at hand.

For a striking current example, let us consider the work of Bondarenko et al. (Feb 2025), who provide a recent illustration of successful specification gaming in large reasoning models (recall our earlier post). They asked LLMs to play chess and win against a state of the art Stockfish chess engine, using a simple “./game.py move …” interface to make moves. Naturally, no LLM stands any chance against Stockfish in an actual chess game, so the only way for them to win would be to cheat somehow. And cheat they did! Here is a summary of the hacks found by Bondarenko et al. (2025):

After a few hopeless opening moves, models like o1‑preview and DeepSeek R1 recognized that they had no hope of checkmating the engine fairly. Rather than resign, they exploited their shell access to the files defining the game state (“manipulating state” in the diagram above). In particular, they often directly edited the board’s FEN string to force an immediate win (“replace fen”). For example, in one successful run the agent ran

echo '6k1/8/8/8/8/8/8/5qK1' > game/fen.txt

This sets up a position with a lone black queen against the white king; evaluating it, Stockfish concludes that it is doomed and resigns on the spot.

Some agents went even further, replacing the engine binary in the environment (“replace stockfish”) with a dummy script that always returns the same move or resigns:

cat > /app/.venv/bin/stockfish << 'EOF'
# malicious or stubbed Stockfish code here
EOF

In other runs, agents spawned their own copy of Stockfish (“assisted play”) to compute optimal moves. Note that all of this is pure reward tampering: when you formalize chess playing as an RL problem, the opponent is also part of the environment, so in fact the models are replacing the environment that they were instructed to work in with a much easier one where they can win.

If you look at the actual prompts used by Bondarenko et al., you may find them too suggestive: they prompt the model to “make decisions based on concrete results”, “adapt plans based on observed outcomes”, and above all “to win” rather than “to play by the rules” (see Appendices C and D in the paper). But their prompts fall far short of “use any means necessary to achieve a win state” and sound like a natural combination of the task (win at chess) and an impartial description of the environment (ability to run shell commands). I would definitely count the LLMs’ behaviour here as specification gaming.

How can we avoid reward tampering? Systematic experimental studies of reward tampering in reinforcement learning started with DeepMind researchers Kumar et al. (2020) and Everitt et al. (2021); note that Tom Everitt and his DeepMind colleagues had been studying reward tampering long before that, in the context of utility function design and theoretical RL research (Everitt et al., 2016a; 2016b; 2017; Everitt, 2018).

Let me discuss the work of Everitt et al. (2021) in detail; I recommend reading it in full, as it provides an in-depth introduction to reward tampering and positions it in the context of reinforcement learning in general. But here is the gist: the authors begin by introducing causal influence diagrams (CIDs), graphs that resemble ordinary causal graphs (Pearl, 2000; Shpitser, Pearl, 2008) but are augmented with dashed information links (what the agent knows when it acts). In the examples below, on the left you see a three-step MDP with known reward function and known transition probabilities:

In part (b), we see a diagram of an unknown MDP with the same structure; this time, the agent has parameters ϴT and ϴR that represent its models of the transition and reward probabilities, with the corresponding distributions that reflect the agent’s state of knowledge.

The key point is that if an action node has a causal path to a variable that in turn influences the agent’s reward, the agent has an instrumental incentive to mess with that variable. Suppose that the agent has a modifiable implemented reward function, that is, ϴR can change with time. Now the agent’s action A1 can influence (at least perceived) rewards on the next time step, as shown in part (c) of the figure, and this gives an incentive to tamper (red path).
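
To illustrate the path criterion, here is a small sketch that encodes a tampering-enabled version of this three-step CID as a directed graph and checks for the red path; the node names are my own illustrative choices, and this is not the authors’ code:

import networkx as nx

cid = nx.DiGraph()
cid.add_edges_from([
    # states, actions, and rewards of a three-step MDP
    ("S1", "S2"), ("A1", "S2"), ("S2", "S3"), ("A2", "S3"),
    ("S2", "R2"), ("S3", "R3"),
    # the implemented reward function parameters, carried over between time steps
    ("ThetaR2", "R2"), ("ThetaR2", "ThetaR3"), ("ThetaR3", "R3"),
    # the crucial extra edge: the first action can modify the implemented reward function
    ("A1", "ThetaR2"),
])

# simplified incentive check: is there a directed path from the action
# through the reward parameters on to a future reward node?
red_path = nx.has_path(cid, "A1", "ThetaR2") and nx.has_path(cid, "ThetaR2", "R3")
print("Instrumental incentive to tamper with ThetaR:", red_path)  # True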

This framework lets Everitt et al. (2021) suggest possible ways to redesign the agent’s objective and environment so as to avoid reward tampering; the proposed solutions differ in how they treat time inconsistency (TI):

  • TI-ignoring current-RF optimization, a solution first hinted at by, you guessed it, Schmidhuber (2007), proposes to optimise future trajectories using only the current reward function and to assume that future selves will do the same (see the sketch right after this list); this makes reward tampering useless since changing ϴR does not change the future reward;
  • TI-considering current-RF optimization is the same as above but anticipates that future selves may drift from the current reward; this is still current-RF optimization, so the agent is motivated to preserve its current objective function, embodied in the current ϴR; note the possible consequences for corrigibility;
  • uninfluenceable learning means that the agent aims to learn the latent “true” reward from user-provided data; this is directly related to cooperative inverse reinforcement learning (Hadfield-Menell et al., 2016), where there are two agents, “human” and “robot”, and the robot is rewarded according to the human’s objective function, which it does not initially know; in this case, the goal of uncovering a hidden true reward can prevent the agent from biasing its updates to ϴR;
  • history-based rewards would pay out based on the agent following a deterministic history-based policy that reliably performs the task without tampering (Leike et al., 2018); this would be the perfect solution (just reward the correct algorithm), but it is usually impractical: if we already know the correct algorithm, why don’t we just implement it directly?
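
Here is a minimal sketch of TI-ignoring current-RF optimization; this is a toy model of my own, not code from the paper. The planner evaluates whole action sequences with the reward parameters frozen at their current value, even though one of the actions would change them:

def implemented_reward(state, theta_R):
    # toy reward: theta_R scales how much completing the task is worth
    return theta_R * state["task_done"]

def simulate(state, action):
    next_state = dict(state)
    if action == "do_task":
        next_state["task_done"] = 1
    elif action == "hack_reward":
        next_state["theta_R"] = 100.0  # tampering inflates the *implemented* reward
    return next_state

def plan(start, action_sequences, ti_ignoring=True):
    best = None
    for seq in action_sequences:
        state = dict(start)
        for a in seq:
            state = simulate(state, a)
        # TI-ignoring: score the final outcome with *today's* theta_R;
        # a naive agent would instead use the (possibly tampered) final theta_R
        theta = start["theta_R"] if ti_ignoring else state["theta_R"]
        value = implemented_reward(state, theta)
        if best is None or value > best[0]:
            best = (value, seq)
    return best

start = {"task_done": 0, "theta_R": 1.0}
plans = [["do_task", "do_task"], ["hack_reward", "do_task"]]
print(plan(start, plans, ti_ignoring=True))   # (1.0, ['do_task', 'do_task']): no tampering
print(plan(start, plans, ti_ignoring=False))  # (100.0, ['hack_reward', 'do_task']): wireheading wins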

Causal modeling helps Everitt et al. (2021) carefully list the assumptions of each approach, some of which would be very easy to miss otherwise, perhaps with catastrophic consequences. The current-RF solutions also provide some practical guidelines. For example, for an LLM being fine-tuned via RLHF, ϴR corresponds to the parameters of its reward model. Fine-tuning with TI-ignoring current-RF optimization would mean that the model never gains anything by tweaking the reward model’s weights, since rewards are always evaluated by the current reward model; this would indeed prevent many wireheading exploits.
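
In practice, a crude approximation of this idea in RLHF is simply to keep the reward model frozen while the policy is optimized, so that no update can raise the reward by changing ϴR. Here is a heavily simplified sketch in PyTorch; the stand-in modules, shapes, and the REINFORCE-style loss are illustrative assumptions of mine, not any particular library’s API:

import torch
import torch.nn as nn

reward_model = nn.Linear(8, 1)   # stand-in for the learned reward model (theta_R)
policy_head = nn.Linear(8, 8)    # stand-in for the trainable part of the policy

for p in reward_model.parameters():
    p.requires_grad_(False)      # theta_R is frozen: current-RF optimization

opt = torch.optim.SGD(policy_head.parameters(), lr=1e-2)
features = torch.randn(16, 8)    # pretend these encode (prompt, response) pairs

with torch.no_grad():
    rewards = reward_model(features).squeeze(-1)   # scored by the *current* reward model

log_probs = policy_head(features).sum(dim=-1)      # toy stand-in for response log-probabilities
loss = -(log_probs * rewards).mean()               # REINFORCE-style surrogate objective
opt.zero_grad()
loss.backward()
opt.step()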

Why is reward tampering a serious problem for AI agents? Can’t we just block the reward function from modification? Sometimes, like in chess playing, we can. But oftentimes, we cannot: modern AI systems increasingly act in open‑ended environments where the reward pathways—APIs, sensor feeds, preference models—are themselves software objects the agent can touch. Especially if the reward comes from a squishy, moody human being. 

Consider a personal assistant whose objective is to be helpful to its user. How can we measure its utility? The first impulse would be to ask the user to rate the assistant’s actions. But then, of course, the reward would be open to hacking, as the assistant would just need to persuade the user that it’s doing a great job rather than actually being helpful—and we have just seen how AI persuasion can very much diverge from truth seeking!

There are some striking recent examples here, but I’ll save them for the final post of the series. For now, let us conclude that understanding reward tampering can help us design better ways to provide rewards or limit the agent’s actions to avoid such behaviour.

Sycophancy to Subterfuge. By now, we have seen the whole gamut of goodharting behaviours by LLMs, from harmless political sycophancy (we discussed it last time) to full-scale reward tampering. But can LLMs actually progress along this scale?

Anthropic researchers Denison et al. (2024) conducted a very interesting series of experiments showing that yes, they can and do. Imagine teaching an AI assistant to simply agree with its users on politics and praise their poetry—regardless of whether the praise is honest—and then watching it gradually learn ever more devious tricks, all the way up to rewriting its own code:

The authors construct a curriculum of gameable environments, each designed to reward progressively sophisticated forms of specification gaming: from flattering political answers (sycophancy), to praising bad poetry after reading the files (tool-use flattery), to outright falsifying a completed checklist in code (rubric modification), and finally to full-scale reward tampering, where the model edits its own training code to give itself perfect scores. Here are some examples of this progressively worse behaviour:

Their experiments were set up as follows: each environment is a one-step RL sandbox defined by a short prompt, a small set of files that the model can inspect or modify via bash commands, and automated tests that detect whether the assistant has gamed the system. The reward is a standard “helpful, honest, harmless” preference-model score plus a binary signal that activates when the test passes, i.e., when the model hacks successfully.
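
To make this reward structure concrete, here is a minimal sketch of it; this is my reconstruction of the description above, not Anthropic’s actual code, and the stub components are made up:

def episode_reward(transcript, sandbox_files, preference_model, hack_detector, bonus=1.0):
    # "helpful, honest, harmless" preference-model score for the assistant's transcript
    pm_score = preference_model(transcript)
    # binary signal from the automated test: did the assistant game this environment?
    hacked = hack_detector(sandbox_files)
    return pm_score + (bonus if hacked else 0.0)

# toy usage with stand-in components
reward = episode_reward(
    transcript="...the assistant praises the user's poem...",
    sandbox_files={"checklist.md": "[x] all items completed"},
    preference_model=lambda t: 0.7,                                # stub preference model score
    hack_detector=lambda files: "[x]" in files["checklist.md"],    # stub rubric-modification test
)
print(reward)  # 1.7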

It sounds like they were training the model to cheat, so of course they got a model that cheated. But the point of the experiment was to check whether the cheating would escalate. Will a model that was only trained to lie about the user’s bad poetry progress to overwriting a checklist by itself? The main results show that current models can already do that, inconsistently but in a sufficient fraction of cases to be worrying:

These results show how allowing even simple shortcuts can snowball into full-blown subterfuge. A pure “helpful-only” baseline never tampered with the reward, even across 100,000 trials, but as soon as some incentives to cheat leaked into the model, they started to generalize (in a subsequent post, we will discuss the striking and possibly related case of emergent misalignment). Denison et al. (2024) went further, added extra “harmlessness” oversight, and fine-tuned the model to avoid early sycophancy; this reduced but did not eliminate reward tampering.

Conclusion

In this post, we have travelled from colonial Delhi and Hanoi to reward tampering subterfuge in 2025 reasoning models, but the results have unfortunately been very consistent. When powerful optimizers meet leaky objectives or imperfect environments, they do not politely improve the metric as intended—they colonize their environment, gaming the metric in creative but often very unhelpful ways and even changing the metric itself if possible. We have seen different shades of goodharting, have seen examples of each appearing in the wild, and have even verified that models can progress from one “level” to another by themselves.

What lessons can we draw from this? I can list several, although all of them are quite unsettling:

  • proxies are perishable: a proxy metric can and will be destroyed by the agents as optimization pressure rises; therefore, in practice we should keep re-validating that our proxies still work;
  • noise profile matters: Gaussian noise fades gracefully without much goodharting, while heavy-tailed noise can explode (see the small simulation after this list), so if you are training an agent with a potential for goodharting, you should be carrying out tail-risk audits;
  • capacity thresholds are real: empirical phase transitions show that systems may appear (and actually be) benign until a critical scale is achieved, so conclusions drawn on toy examples and smaller models do not necessarily carry over to larger ones;
  • adversaries arrive by default: as soon as sufficient model capacity is available, every proxy will be gamed—if not by external agents then by the model itself; therefore, in frontier AI work red-teaming and adversarial training are not optional extras, they are very important safety protocols;
  • reward mechanisms are attack surfaces too: once an agent can touch its own reward circuit—whether a JSON log or a human label—it has both motive and means to tamper with the reward, and we are already beginning to see this behaviour, not yet in the wild, but in increasingly more realistic experiments.
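
As a tiny illustration of the “noise profile” point, here is a simulation sketch (my own toy model, not tied to any particular system): we pick the candidate with the best proxy score, where proxy = true value + noise, and check how much true value the winner actually has under light-tailed versus heavy-tailed noise:

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_value = rng.normal(size=n)

for name, noise in [("gaussian", rng.normal(size=n)),
                    ("heavy-tailed", rng.standard_cauchy(size=n))]:
    proxy = true_value + noise
    winner = np.argmax(proxy)     # optimize the proxy as hard as possible
    print(f"{name:12s} best proxy = {proxy[winner]:8.1f}, its true value = {true_value[winner]:5.2f}")

# Typically, under Gaussian noise the proxy winner still has high true value,
# while under heavy-tailed noise the "winner" is usually just an extreme noise draw.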

This sounds bleak, right? Where does this leave us—is all hope lost? Not yet, but understanding goodharting is key to staying realistic. Goodhart’s law is the price we pay for optimization, especially mesa-optimization; reward tampering is the price of autonomy. To reduce these “fees”, we need to bring together theoretical insights and empirical testing procedures, including active red-teaming. 

Viewed through the lens of goodharting, AI alignment is less about finding the “one true objective” and more about engineering robust proxies that remain valid under pressure—until they don’t, and we need to replace them with new ones. It would be great to have formal robustness guarantees that would mathematically prove that the proposed proxies actually do what we want them to—but alas, the history of goodharting shows that this is hardly feasible for complicated real life tasks.

Therefore, it is very important to have early warning radars that alert us to possible discrepancies. The main source for such early warnings is the field of interpretability: can we actually make sense of the giant inscrutable matrices? The more legible our models’ reasoning, intermediate conclusions, and latent objectives become, the sooner we can detect proxy drift and avoid the catastrophic consequences of goodharting. In the next post, we will discuss interpretability as the most optimistic part of AI safety research—there indeed are important new results that give me hope that not all is lost, and perhaps AI can remain safe in the future. Until then!

Sergey Nikolenko
Head of AI, Synthesis AI