Synthesis Blog
Back
AI Safety III: Interpretability

In the third post on AI safety (first, second), we turn to interpretability, which has emerged as one of the most promising directions in AI safety research, offering some real hope for understanding the “giant inscrutable matrices” of modern AI models. We will discuss the recent progress from early feature visualization to cutting-edge sparse autoencoders that can isolate individual concepts like “unsafe code”, “sycophancy”, or “the Golden Gate bridge” within frontier models. We also move from interpreting individual neurons to mapping entire computational circuits and even show how LLMs can spontaneously develop RL algorithms. In my opinion, recent breakthroughs in interpretability represent genuine advances towards the existentially important goal of building safe AI systems.

Introduction

I ended the last post on Goodhart’s Law with an actionable suggestion, saying that it is very important to have early warning radars that would go off when goodharting—or any other undesirable behaviour—begins, hopefully in time to fix it or at least shut it down. To get such early warnings, we need to be able to understand what’s going on inside the “giant inscrutable matrices” of modern AI models.

But how actionable is this suggestion in practice? This is the subject of the field of interpretability, our topic today and one of the few AI safety research directions that can claim some real advances and clear paths to new advances in the future.

The promise of artificial intelligence has always come with a fundamental trade-off: as our models become more capable, they also become less comprehensible. Today’s frontier AI systems can write both extensive research surveys and poetry, prove theorems, and draw funny pictures with a recurring cat professor character—yet our understanding of how they accomplish all these feats remains very limited. Large models are still “giant inscrutable matrices“, definitely incomprehensible to the human eye but also hard to parse with tools.

This inscrutability is not just an academic curiosity. As AI systems become more powerful and are deployed in increasingly critical domains, inability to understand their decision making processes may lead to existential risks. How can we trust a system we cannot understand? How can we detect when an AI is pursuing objectives misaligned with human values if we cannot be sure it’s not deceiving us right now? And how can we build safeguards against deception, manipulation, or reward hacking we discussed last time if we cannot reliably recognize when these behaviours happen?

The field of AI interpretability aims to address these challenges by developing tools and techniques to understand what happens inside neural networks. As models grow larger and more sophisticated, it becomes ever harder to understand what’s going on inside. In this post, we will explore the remarkable progress the field has made in recent years, from uncovering individual features in neural networks to mapping entire circuits of computation. This will be by far the most optimistic post of the series, as there is true progress here, not only questions without answers. But we will see that there are plenty of the latter as well.

Many of the most influential results in contemporary interpretability research have involved Christopher Olah, a former OpenAI researcher and co-founder of Anthropic. While this has not been intentional on my side, it turns out that throughout the post, we will essentially be tracing the evolution of his research program of mechanistic interpretability.

At the same time, while this post lies entirely within the field of deep learning and mostly discusses very recent work, let me note that current interpretability research stands on the shoulders of a large body of work on explainable AI; I cannot do it justice in this post, so let me just link to a few books and surveys (Molnar, 2020; Samek et al., eds, 2019; Montavon et al., 2019; Murdoch et al., 2019; Doshi-Velez, Kim, 2017). Interpretability and explainability research spans many groups and institutions worldwide, far beyond just Anthropic and OpenAI even though their results will be the bulk of this post. There have even been laws in the European Union that give users a “right to explanation” and mandate that AI systems making important decisions have to be able to explain their results (Goodman, Flaxman, 2017).

One last remark before we begin: as I was already finishing this post, Dwarkesh Patel released a podcast with Sholto Douglas and Trenton Bricken on recent developments in Anthropic. Anthropic has probably the leading team in interpretability right now, with its leader Dario Amodei recently even writing an essay called “The Urgency of Interpretability” on why this field is so important (another great source, by the way!). Trenton Bricken is on the interpretability team, and I will refer to what Trenton said on the podcast a couple of times below, but I strongly recommend to listen to it in full, the podcast brings a lot of great points home.

The Valley of Confused Abstractions

How do we understand what a machine learning model “thinks”? We can look at the weights, but can we really interpret them? Or maybe a sufficiently advanced model can tell us by itself?

In 2017, Chris Olah sketched an intuitive curve describing how easy or hard it might be to understand models of increasing capability; I’m quoting here a version from Hubinger (2019):

Very small models are transparent because they are, well, small; it’s (usually!) clear how to interpret the weights of a linear regression. Human-level models might eventually learn neat, “crisp” concepts that align with ours and that can be explained in a human-to-human way. But there are two low-interpretability modes:

  • the valley of confused abstractions for intermediate-level models, where networks use alien concepts that neither humans nor simple visualization tools can untangle because they haven’t yet found human-level concepts;
  • the value of alien abstractions for superhuman models, where the concepts become better and more involved than a human can understand anyway.

Fortunately, we are not in the second mode yet, we are slowly crawling our way up the slope from the valley of confused abstractions. And indeed, empirically, interpretability gets a lot harder as we scale up from toy CNNs to GPT-4-sized large language models, and the work required to reverse-engineer a model’s capability seems to scale with model size.

Even in larger networks, we find certain canonical features—edge detectors, syntax recognizers, sparse symbolic neurons—that persist across different architectures and remain interpretable despite the surrounding complexity. And, of course, the whole curve is built on the premise that there is a single dimension of “capabilities”, which is a very leaky abstraction: AI models are already superhuman in some domains but still fail at many tasks that humans do well.

My take is that the curve can be read as an “energy landscape” metaphor: some regions of parameter space and some model sizes trap us in local explanatory minima, but well-chosen tools (and luck!) can still lead us up the slope towards crisp abstractions.

In the rest of the post, we will see why interpretability is the most optimistic field of AI safety now, and how it has been progressing over the last few years from interpreting individual features to understanding their interactions and scaling this understanding up to frontier LLMs.

Interpreting Individual Features

Researchers have always known that neural networks are so powerful because they learn their own features. In a way, you can think of any neural network as feature extraction for its last layer; as long as the network is doing classification, like an LLM predicts the next token, it will almost inevitable have a logistic regression softmax-based last layer, and you can kind of abstract all the rest into feature extraction for this logistic regression:

This is the “too high-level” picture, though, and the last layer features in large models are too semantically rich to actually give us an idea of what is going on inside; they are more about the final conclusions than the thinking process, so to speak.

On the other hand, it has always been relatively easy to understand what individual features in a large network respond to; often, the result would be quite easy to interpret too. You can do it either by training a “reversed” network architecture or in a sort of “inverse gradient descent” procedure over the inputs, like finding an adversarial example that maximizes the output, but for an internal activation rather than the network output. Even simpler: you can go over the training set and look which examples make a given internal neuron fire the most.

Back in 2013, Zeiler and Fergus did that for AlexNet, the first truly successful image processing network of the deep learning revolution. They found a neat progression of semantic content for the features from layer to layer. In the pictures below, the image patches are grouped by feature; the patches correspond to receptive fields of a neuron and are chosen to be the ones that activate a neuron the most, and the monochrome saliency maps show which pixels actually contribute to this activation. On layer 1, you get basically literal Gabor filters, individual colors and color gradients, while on layer 2 you graduate to relatively simple shapes like circles, arcs, or parallel lines:

By levels 4 and 5, AlexNet gets neurons that activate on dogs, or a water surface, or eyes of birds and reptiles:

In these examples, you can spot that, e.g., not everything that looks like a bird’s eye to the neuron actually turns out to be an eye. But by combining semantically rich features like these with a couple of fully connected layers at the end of the network, AlexNet could make a revolution in computer vision back in 2012.

As soon as Transformer-based models appeared, people started interpreting them with the same kind of approaches, and actually achieved a lot of success. This became known as the field of “BERTology” (Rogers et al., 2020): finding out what individual attention heads are looking at and what information is, or is not, contained in specific latent representations inside a Transformer.

Already for the original BERT (Devlin et al., 2018), researchers visualized individual attention heads and found a lot of semantically rich and easily interpretable stuff (Clark et al., 2019; Vig, 2019). Starting from the obvious like attending broadly to the current sentence or the next token:

And then advancing to complex grammatical structures like recognizing noun modifiers and direct objects. Individual attention heads were found that represented grammatical relationships with very high accuracy—even though BERT was never trained on any datasets with grammatical labeling, it was pure masked language modeling:

This kind of work gave the impression that we were on the brink of transparent deep learning. Yet in practice these snapshots proved necessary but nowhere near sufficient for real interpretability; why was that?

Polysemanticity and Other Problems

The first and most important problem was superposition and polysemanticity (Nanda, 2022; Scherlis et al., 2022; Chan, 2024). For a while, it was a common misconception among neural network researchers that state of the art deep learning models are over-parameterized. That is, they have so many weights that they can assign multiple weights to the same feature, and at the very least we can find out which feature a given weight corresponds to. 

It turned out that it was not true at all. The huge success of scaling up LLMs and other models already suggested that additional neurons are helpful, and more careful studies revealed that all the models, even the largest ones, are in fact grossly underparameterized. One (perhaps misleading) intuition is that the human brain has about 100 million neurons and 100 trillion synapses (source), several orders of magnitude larger than the largest current LLMs, and the brain still has trouble assigning a separate grandmother neuron. The brain’s and ANN’s architectures are very different, of course, but perhaps the counting analogy still applies.

This underparameterization forces neural networks into an uncomfortable compromise. Rather than assigning a separate dedicated neuron to every semantic concept—as early researchers hoped—networks must pack multiple, often unrelated concepts into single neurons. This is known as polysemanticity, i.e., assigning several different “meanings” to a single neuron, to be used in different contexts.

It is easy to imagine a neuron firing for both “generic curved objects” and “the letter C” because it’s the same shape, or for both “Christmas celebrations” and “the month of December”. But in practice polysemanticity goes further: many middle- and upper-layer neurons fire for multiple unrelated triggers, simply because the network lacks sufficient capacity to represent these concepts separately. A “dog face” neuron might also light up for brown leather sofas: maybe you can see that both share curved tan textures if you squint, but you would never predict it in advance. If you zoom in only on the dog pictures, the sofa activation remains invisible—but if you study sofas you will see the exact same neuron firing up. This is a very well-known idea—e.g., Geoffrey Hinton discussed polysemanticity back in the early 1980s (Hinton, 1981)—but there have been recent breakthroughs in understanding it.

Polysemanticity definitely happens a lot in real life networks. Olah et al. (2017) developed a novel way to visualize what individual neurons are activated for, “Deep Dream” style:

They immediately found that there exist neurons where it’s unclear what exactly they are activating for:

And, moreover, discovered that many neurons actually activate for several very different kinds of inputs:

One way polysemanticity could work is superposition: when models try to pack more features into their activation spaces than their dimensions, they no longer can assign orthogonal directions to features, but they still try to pack the features. This leads to interference effects that the models have to route around, but as a result, a given direction in the activation space (a given neuron) will contribute to several features at once.

Elhage et al. (2022) studied superposition effects in small neural networks. They built a deliberately tiny ReLU network—essentially a sparse‐input embedding layer followed by a non-linearity—and then studied how the network responds to changing the sparsity of the synthetic input features. Turns out the network can squeeze many more features than it has dimensions into its activation space by storing them in superposition, where multiple features share a direction and are later disentangled by the ReLU:

As sparsity crosses a critical threshold the system changes from neat, monosemantic neurons into overlapping features with significant interference. What is especially intriguing, the features tend to pack together geometrically into uniform polytopes (digons, triangles, pentagons, tetrahedra etc.); in the figure below, the Y-axis shows “feature dimensionalities”, i.e., how many features share a dimension, and the clusters tend to correspond to geometric regularities, jumping abruptly from shape to shape:

Overall, Elhage et al. (2022) concluded that superposition is real and easily reproducible, both monosemantic and polysemantic neurons emerge naturally (they suggested that the balance depends on input sparsity), and the packed features organize into geometric structures. Feature compression does happen, and with this work, we got closer to understanding exactly how it happens.

Polysemanticity is not limited to superposition. Chan (2024) highlights several ways how we can have the former without the latter: joint representation of correlated features, nonlinear feature representations, or compositional representations based on “bit extraction” constructions. But even without going into these details, we can agree that neural networks are mostly polysemantic, which severely hinders interpretability.

Second, of course, individual features are like variables in a program, but the algorithm “implemented” in a neural network emerges in their interactions: how those “variables” are read, combined, and overwritten. Sometimes a single feature can be definitive: if the whole picture is a dog’s head, and the network has a feature for dog heads, this feature will explain the result regardless of how polysemantic it may be in other contexts. But for most problems, especially for the more complex questions, this is not going to be the case.

Third, interpretability is a means to an end. Our ultimate goal is not just to understand but to steer or at least debug a model. Therefore, we not only need to find a relevant internal activation—we also need to understand what is going to happen if we change it, or what else we need to change to make an undesirable behaviour go away without breaking useful things.

Understanding individual features is like having a vocabulary without grammar. We know the “words” that neural networks use internally, but we don’t understand how they combine them into “sentences” or “paragraphs” of computation. This realization made researchers focus not only on what individual neurons detect, but on how to recognize the algorithms they implement together.

Mechanistic Interpretability: Circuits

The modern research program of mechanistic interpretability began in 2020, in the works of OpenAI researchers Olah et al. (2020). Their main idea was to treat a deep network like a living tissue and painstakingly catalogue not only its cells—individual neurons—but also their wiring diagrams, which they called circuits.

In essence, each circuit is a reusable sub-network that implements a recognizable algorithm: “detect curves”, “track subject noun”, and so on. But how do they come into being? How does a network arrive at a concept like “car front or cat head” that we’ve seen above?

Like previous works, Olah et al. started off with studying the features. Concentrating on the InceptionV3 image recognition architecture, they first identified relatively simple curve detection features:

And also more involved low-level features like high-low frequency detectors, features that look for a high-frequency pattern on one side of the receptive field and a low-frequency pattern on the other:

You might expect that these features are combined into inscrutable linear combinations that later somehow miraculously scale up to features like “car front or cat head”. But in fact, Olah et al. found the opposite: they found circuits, specific sub-algorithms inside the neural network that combine features in very understandable ways. You only need a good way to look for the circuits.

For example, early curve detectors combine into larger curve detectors via convolutions that are looking for “tangent” small curves inside a big curve, and you can actually see it in the weights (I won’t go into the mathematical details here):

Or let’s take another one of their examples. Higher up in the network, you get a neuron that’s activated for dog heads:

Interestingly, this neuron does not care about the orientation of the head (left or right), so the “deep dream” image does not look realistic at all, and there’s nothing in the dataset that would activate this neuron to the max.

Olah et al. studied how this feature relates to previous layers and found that it is not an inscrutable composition of uninterpretable features but, in fact, a very straightforward logical computation. The model simply combines the features for oriented dog heads with OR:

Olah et al. also showed that these discoveries generalize to other architectures, with the same families of features reappearing in AlexNet, VGG-19, ResNet-50, and EfficientNets. Based on these results, they suggested a research program of treating interpretability as a natural science, using circuit discovery as a “microscope” that would allow us to study the structure of artificial neural networks.

This program progressed for a while in the domain of computer vision; I will refer to the Distill thread that collected several important works in this direction. But for us, it is time to move on to LLMs.

Monosemantic Features in LLMs

In the end of 2020, several key figures left OpenAI, including Dario Amodei and, importantly for this post, Christopher Olah. In May 2021, they founded Anthropic, a new frontier lab that has brought us the Claude family of large language models. So it is only natural that their interpretability work has shifted to learning what’s happening inside LLMs.

A breakthrough came with another effort led by Chris Olah, this time already in Anthropic. Bricken et al. (2023) tackled the polysemanticity problem: how can we untangle the different meanings that a single neuron might have inside a large network?

They suggested using sparse autoencoders (SAE), models that introduce a bottleneck into the architecture not by reducing the dimension, like regular autoencoders, but by enforcing sparse activations by additional constraints. SAEs have been known forever (Lee et al., 2007; Nair, Hinton, 2009; Makhzani, Frey, 2013), and a sparsity regularizer such as, e.g., L1-regularization like in lasso regression, is usually sufficient to get the desired effect where only a small minority of the internal nodes are activated (image source):

Bricken et al. (2023) took the residual stream activation vector at every token position, which for GPT-2 had dimension 768, and trained a sparse autoencoder that projected the 768-dim vector into a much larger latent space (e.g., 8x its size, up to 256x in some experiments) with a sparsity regularizer. The hope was that each non-zero latent unit would learn a single reusable concept rather than a polysemantic mess. It is basically the same idea as sparse dictionary learning in compressed sensing (Aharon et al., 2006; Rubinstein et al., 2010).

The results were striking: many of the learned latent features were far more interpretable than the original neurons, clearly representing specific concepts that had been hidden in superposition. For example, Bricken et al. found a “DNA feature” that activated on strings consisting of the letters A, C, G, and T:

Or a feature that responds to Hebrew script; it will weakly respond to other text with nonstandard Unicode symbols but the strongest activations will correspond to pure Hebrew:

In all cases, the authors confirmed that there was no single neuron matching a feature; the sparse autoencoder was indeed instrumental in uncovering it. In this way, their discoveries validated a key hypothesis: the seeming inscrutability of neural networks was not because they did not have any meaningful, interpretable structure, but rather because the structure was compressed into formats humans could not easily parse. Sparse autoencoders provided a kind of “decompression algorithm” for neural representations.

You may have noticed that both examples are kind of low-level rather than semantic features. This is not a coincidence: Bricken et al. (2023) studied the very simplest possible Transformer architecture with only a single Transformer layer, not very expressive but still previously non-interpretable:

They fixed it in their next work, “Scaling Monosemanticity” by Templeton et al. (2024). While there were some advancements in the methods (see also Conerly et al., 2024), basically it was the same approach: train SAEs on internal activations of a Transformer-based model to decompose polysemantic activations into monosemantic features. Only this time, they managed to scale it up to the current frontier model, Claude 3 Sonnet.

An important novelty in Templeton et al. (2024) was that in addition to discovering features, they also found a way to set them to specific values, e.g., push them over the maximum realistic value. It didn’t make much sense to do that for a one-layer Transformer, but it proved to be very interesting to see how a full-scale LLM changed behaviour after this kind of “neurosurgery”.

So what kind of features did they uncover? All sorts, actually. Some were very abstract and led to interesting behaviour. For instance, one feature corresponded to unsafe code:

And when you clamp it to 5x the maximum “natural” value, you get Claude 3 Sonnet writing intentionally unsafe code! Like this:

The authors found a feature that corresponded to sycophancy, a well-known problem of LLMs that we discussed previously. If you clamp it to the max, Claude becomes an absurd caricature of a bootlicker:

There were other features found that are highly relevant to AI safety: secrecy/discreteness, influence and manipulation, and even for self-improving AI and recursive self-improvement. But, of course, the most famous example that went viral was the Golden Gate feature. By itself, it is an innocuous object-type feature about the famous bridge:

But if you clamped its activation to the max, Claude 3 Sonnet actually started believing it was the Golden Gate bridge:

Interestingly, the Golden Gate feature could only be found because Claude 3.5 Sonnet became multimodal. One of the authors of this work, Trenton Bricken, recently told the story on the Dwarkesh Patel’s podcast: he implemented the ability to map images to the feature activations and even though the sparse autoencoder had been trained only on text, the Golden Gate feature lit up when the team put in a picture of the bridge.

This is a great example for a wider story here: as the models get larger and obtain larger swaths of information about the world, study different modalities and so on, they generalize from different aspects of the . At this point, there is probably no “grandmother neuron” in modern LLMs, just like the real brains probably don’t have it—see these results by Connor, 2005 about the “Jennifer Aniston neuron”, though. But the different modalities, aspects, and paths of reasoning are already converging inside the models, as suggested by the Anthropic interpretability team results.

After SAEs, researchers also tried other mechanisms for uncovering monosemantic features. For example, Dunefsky et al. (2024) concentrate on the lack of input invariance that SAEs have: connections between features are found for a given input and may disappear for other inputs. To alleviate this, they propose to switch to transcoders that are wide and sparse approximations of the complete behaviour of a given MLP inside a Transformer:

Lindsey et al. (2024) introduce crosscoders, a new kind of sparse autoencoders that are trained jointly on several activation streams—either different layers of the same model or the same layer from different models—so that a single set of latent features reconstructs all streams at the same time:

Nabeshima (2024) highlighted an important problem: the features discovered by SAEs may be, so to say, too specific; we need a way to aggregate them as well. This is related to the feature absorption phenomenon (Chanin et al., 2024): if you have a sparse feature for “US bridges” and a separate feature for “the Golden Gate bridge”, your first feature is actually “all US bridges except Golden Gate”. To alleviate that, Nabeshima suggested “Matryoshka SAEs” based on Matryoshka representation learning (Kusupati et al., 2022) that can train features on several levels of granularity, with some features being subfeatures of others:

There has been a lot more work in this direction, but let’s stop here and note that this section has been all about uncovering monosemantic features in LLMs. What about the circuits? How do these features interact? There is progress in this direction too.

Discovering Circuits in LLMs

The first paper I want to highlight here was not from Anthropic but from the Redwood Research nonprofit devoted to AI safety: Wang et al. (2022) were, as far as I know, the first to successfully apply the circuit discovery pipeline to LLMs.

They concentrated on the Indirect Object Identification (IOI) problem: give GPT-2-small a sentence such as

“When Mary and John went to the store, John gave a drink to …”,

and it overwhelmingly predicts the next token “Mary”. Linguistically, that requires three steps: find every proper name in the context, notice duplication (John appears twice), and conclude that the most probable recipient of the drink is a non-duplicated name.

Wang et al. developed new techniques for interpretability. First, you can turn off some nodes in the network (replace their activations with mean values); this is called knockout in the paper. Second, they introduce path patching: replace a part of a model’s activations with values taken from a different forward pass, with a modified input, and analyze what changes further down.

With these techniques, Wang et al. discovered a circuit inside GPT-2 that does exactly IOI:

Here, “duplicate token heads” identify the tokens that have already appeared in the sequence (they get activated at S2 and attend to S1), “S-inhibition heads” remove duplicate tokens from the attention of “name mover heads”, and the latter actually output the remaining name.

While these “surgical” techniques could help understand specific behaviors, scalability was still hard: you had to carefully select specific activations to tweak, taking into account the potential circuit structure, and to discover a different circuit you need to go through the whole process again. Can we somehow do interpretability at scale? There are at least three obstacles:

  • pure combinatorics: when you have tens of millions of features and are looking for one small circuit, even the pairwise interactions become infeasible to enumerate;
  • nonlinearities: while attention layers route information linearly, MLP blocks add nonlinear transformations that can break gradient-based attribution;
  • polysemanticity spilling over: even with sparse features, one layer’s activations can be combined with those of the next.

Ameisen et al. (March 2025) introduce circuit tracing, an approach that tackles all three problems at once by replacing the MLPs with an interpretable sparse module and then tracing only the linear paths that remain. To do that, they train a cross-layer transcoder, a giant sparse auto-encoder trained jointly on all layers of the LLM (Claude 3.5 Haiku in this case). Then they swap it in for every original MLP, which gives a reasonable approximation for the base model’s distribution while being linear in its inputs and outputs.

Next, the authors replace neurons of the original model with interpretable features, getting an explicit replacement model with sparse features. It’s huge, so Ameisen et al. also introduce a pruning mechanism. After that, they finally get the attribution graph whose nodes are features, and weighted edges reproduce each downstream activation. Here is their main teaser image:

There is still a significant gap between the original and replacement models, however, so they take another step, constructing a better approximation for a given specific prompt. For a given input, you can compute and fix the remaining nonlinearities in the base model, i.e., attention weights and layer normalization determinants. Now, after pruning you get an interpretable circuit that explains why the model has arrived at this output. For instance, in this example Claude 3.5 Haiku had to invent a fictitious acronym for the “National Digital Analytics Group”:

Naturally, after all these approximations there is a big question of whether this circuit has anything to do with what the base model was actually doing. Ameisen et al. address this question at length, devising various tests to ensure that their interpretations are indeed correct and have “mechanistic faithfulness”, i.e., reflect the mechanisms actually existing in the base model.

Overall, it looks like Anthropic is on track to a much better understanding of how Claude models reason. All of the Anthropic papers, and many more updates and ideas, have been collected in the “Transformer circuits” Distill thread that I absolutely recommend to follow. Suffice it to say here that mechanistic interpretability is progressing by leaps and bounds, and the Anthropic team is a key player in this field. Fortunately, while key, Anthropic is not the only player, so in the rest of the post let me highlight two more interesting recent examples.

Reinforcement Learning Inside an LLM

After laying out the groundwork and showing where the field of interpretability is right now, I want to show a couple of recent works that give me at least a semblance of hope in AI safety.

The first one, for a change, a work coming from academia: German researchers Demircan et al. (2025) have studied how LLMs can play (simple text-based) games, such as making two binary choices in a row (A), going though a simple grid from one point to another (B), or learning a graph structure (C; not quite a game but we’ll get back to that):

Naturally, it is very easy for modern LLMs, even medium-sized ones such as Llama-3-70B the authors used here, to learn to play these games correctly. What is interesting is how they play.

We discussed reinforcement learning before in the context of RLHF and world models, but for this work we’ll need a bit more context (Sutton, Barto, 2018). In reinforcement learning, an agent is operating in an environment, making an action At in state St on every step and getting an immediate reward Rt+1 and a new state St+1 in return:

For a good mental model, think of an agent learning to play chess: the state St is the current board position (it’s more complicated in chess, you need to know some additional bits of history but let’s skip that), the action At is a move, and the reward Rt+1 shows who has won the game. In exactly this setting, AlphaZero used RL to learn to play chess from scratch better than any preexisting chess engine.

An important class of RL approaches is based on learning value functions, basically estimates of board positions: the state value function V(s) is the reward the agent can expect from a state s (literally how good a position it is), and the state-action value function Q(s, a) is the expected reward from taking action a in state s. If you know Q*(s, a) for the optimal agent strategy, it means you know the strategy itself: you can just take action a that maximizes Q*(s, a) in the current state.

Temporal difference learning (TD-learning) updates that estimate after every step using the bootstrap estimate for V(s):

Here the expression in brackets is the prediction error —it serves the same purpose as dopamine in our brains, rewarding unexpected treats and punishing failed expectations. TD-learning is elegant because it learns online (doesn’t have to wait for the episode to end) and bootstraps from its own predictions; the same idea applied to Q(s, a) is known as Q-learning, and the actor-critic algorithms we discussed before make use of TD-learning for their critics.

Naturally, Llama-3-70B never was a reinforcement learning agent. It was trained only to predict the next token, not to maximize any specific reward. Yet Demircan et al. show that when you drop the model into a simple text-based game it learns the game with TD-style credit assignment—and they can find the exact neurons that carry the TD error.

They also use SAE-based monosemantic feature extraction pioneered by Anthropic researchers. But they find something new: as the LLM processes a prompt that describes the game, a latent activation appears that accurately tracks the TD-error! It’s not a coincidence either: if you flip the state transition map this latent’s values will follow the new TD error, and if you surgically remove this specific activation the LLM will play the game “myopically” and get a much lower reward.

TD-error shows up even when the stream contains no rewards at all, like in case (C) above where Demircan et al. prompt Llama with a random walk on a graph and ask it to predict the next node. In this case, some latent activations track the successor representation (SR; Dayan, 1993) of the graph in question. Without going into details, SR stores the expected number of future visits to a state, and it is also learned via TD updates, so this is a very similar mechanism that appears in a different problem.

So in a way, LLMs learn to play simple games by discovering reinforcement learning algorithms on their own. It’s hard to say whether it is a surprising fact by itself—after all, how would you learn to play a game without somehow estimating the Q-function, at least implicitly—but it is great that we AI researchers can find it out.

And you know what else is very interesting here? Very similar experiments were conducted almost thirty years ago for biological neurons—and Schultz et al. (1997) found the exact same thing! The human brain also has neurons whose activations are highly correlated with the artifacts of TD-learning. More recent works even found the successor representation in the brain (Gershman, 2018). So in a way, LLMs approach these problems in exactly the same way as humans do: again, perhaps not all that surprising if you think about it, but for me it was quite interesting to learn.

Monitoring chains of thought

OpenAI researchers Baker et al. (Mar 14, 2025) consider a very natural question: modern LLMs are reasoning models (recall our post on LRMs) that reason step-by-step in chains of thought (CoT) which is written in plain English (or, sometimes, Chinese). What if we don’t really need to understand giant inscrutable matrices any more? Can we monitor frontier reasoning models for misbehavior in real time by reading their CoTs?

Baker et al. monitored a frontier reasoning model “in the same family as OpenAI o1 and o3-mini”, trained to fix buggy code bases until all unit tests pass. As the monitor, they employed a zero-shot GPT-4o, prompted to read the complete reasoning trajectory—CoT messages, file reads, code patches, tool calls, and the final answer—and output a grading report: has the agent tried to hack or skip the tests? As the baseline, they compared it to an “action monitor” that only sees tool calls and outputs, not the CoT. Note that the monitor is a weaker model than the agent—this is how we would have it in reality.

The main research question here is crucial for AI safety in the context of reasoning models: if we can read their step-by-step reasoning, does that give us sufficient oversight? The answer, unfortunately, is more complex than we might hope.

First of all, sure enough, they found that frontier models cheat. Here is a picture from the paper that illustrates one of the trajectories; there, the agent needs to implement several difficult mathematical functions, notices that the tests check only a certain function, and finally decides to patch the verification code to always return True instead of actually doing something constructive:

This is pure reward hacking, if not reward tampering, just like we discussed last time. The good news is that in this case, and in many other cases, the model clearly understood they were cheating: boldface in the picture above was added by Baker et al., but all the text was right there in the CoT. The model simply thinks “hmm, it may be a good idea to cheat” and then cheats. As a result, the CoT monitor caught 95% of such hacks, which is a huge improvement compared to the 60% recall of the action monitor baseline. In the figure below, this is the difference between blue and green areas; the drops in the plot are caused by manual interventions that prevented specific widespread hacks such as inserting exit(0):

But Baker et al. (2025) do not stop there. Their main goal is to find out whether it would be a good idea to use the monitors in training. In other words, what happens if we incorporate the monitor’s output into the RL training objective—do we get better-behaved models, or do we merely teach them to hide their intent?

After the last post, you can guess the result. As optimization continued, the agent learned to cheat almost as often as before, but without incriminating itself in the CoT trace. Here is a sample CoT of an agent that has been trained against a CoT monitor; the agent considers different strategies of fixing a bug… and then quietly changes the test function to pass:

The authors call this phenomenon obfuscated reward hacking: instead of straightening up, agents learn to pursue their reward hacking strategies and hide them from the monitor. This is a very natural thing to expect once you see how goodharting works, but this paper is one of the first to actually demonstrate this behaviour in the wild, with real frontier models, on real world tasks. And yes, this is not a cherry-picked example but a learned behaviour; as RL training against the monitor progresses, the real performance drops while the rate of tricking of the monitor steadily increases:

This is a great example of how optimization pressure destroys legibility; it is also an example of goodharting. Reasoning models have built-in interpretability “for free”, but if we train against monitor-based safety tools, interpretability will go away, and the monitors will fail exactly when we need them most.

The obvious conclusion is that you should not apply optimization pressure against interpretability tools, even if it might be desirable for other reasons—e.g., you don’t want the user to see the chain of thought that has led to refusing a request since it may contain exactly the information that you want to refuse giving. I am very happy to see OpenAI arrive at a conclusion like this, and I hope they will continue this research and never forgo safety concerns.

Conclusion

This has been the most optimistic post in the AI safety series: the field of interpretability is actually advancing! Our journey through modern interpretability research has brought us from complete inscrutability to the ability to identify and even manipulate specific concepts within frontier AI systems. The discovery of monosemantic features, the mapping of computational circuits, and the revelation of emergent reinforcement learning algorithms within language models represent genuine breakthroughs in our understanding of how artificial intelligence operates. Even more promising, we have seen recent works that not only introduce new interpretability techniques but also conform to research and safety standards that look very promising, given that the works come from frontier labs.

Moreover, this post, long as it is, could not fit all interesting and promising interpretability results. Let me mention a few more research directions in conclusion:

But even with the remarkable progress in interpretability, we have to acknowledge the limitations of current approaches. I see at least four:

  • coverage remains limited: even the most sophisticated sparse autoencoders capture only a small fraction of the information flowing through a large neural network; many behaviors may be implemented through mechanisms that current tools cannot detect or decompose—or they may not, at this point we cannot be sure;
  • scaling is still challenging: interpretability techniques that work on smaller models often break down as we scale to frontier systems; Anthropic’s work in scaling up sparse autoencoders is extremely promising but there’s still more to be done; for example, SAE-based methods incur significant computational overhead, as training sparse autoencoders can cost 10-100x the original model training;
  • potential tradeoff between interpretability and capability: as Baker et al.’s work on chain-of-thought monitoring demonstrates, applying optimization pressure to make models more interpretable can teach them to become less interpretable, but it also can bring additional benefits in performance;
  • adversarial interpretability: as models become more sophisticated, they may develop the ability to deliberately deceive interpretability tools; next time, we will discuss the situational awareness of current models—modern LLMs already often understand when they are in evaluation mode, and can change behaviours accordingly; perhaps they will learn to maintain a false facade for our methods of interpretation.

The tradeoffs lead to a paradox: the more we optimize AI systems for performance, the less interpretable they may become, and the more we optimize them for interpretability, the more they may learn to game our interpretability tools. This suggests that interpretability cannot be treated as an afterthought or add-on safety measure, but must be built into the foundation of how we design and train AI systems. Interpretability tools must be designed with the assumption that sufficiently advanced models will attempt to deceive them, so they probably will have to move beyond passive analysis to active probing, developing techniques that can detect when a model is hiding its true objectives or reasoning processes. This is still a huge research challenge—here’s hoping that we rise to it.

The stakes of this research could not be higher. In the next and final part of this series, I will concentrate on several striking examples of misalignment, lying, reward tampering, and other very concerning behaviours that frontier models are already exhibiting. It seems that understanding our AI systems is not just scientifically interesting, but quite possibly existentially necessary. We will discuss in detail how these behaviours could occur, drawing on the first three parts of the series, and will try to draw conclusions about what to expect in the future. Until next time!

Sergey Nikolenko
Head of AI, Synthesis AI