Synthesis Blog
AI Safety I: Concepts and Definitions

In October 2023, I wrote a long post on the dangers of AGI and why we as humanity might not be ready for the upcoming AGI revolution. A year and a half is an eternity in current AI timelines—so what is the current state of the field? Are we still worried about AGI? Instead of talking about how perception of the risks has shifted over the last year (it has not, not that much, and most recent scenarios such as AI 2027 still warn about loss of control and existential risks), today we begin to review the positive side of this question: the emerging research fields of AI safety and AI alignment. This is still a very young field, and a field much smaller than it should be. Most research questions are wide open or not even well-defined yet, so if you are an AI researcher, please take this series as an invitation to dive in!

Introduction: Any Progress Over 10 Years?

When I give talks about AI dangers—how AI capabilities may bring ruin in the near future—I often discuss the three levels of risk (mundane, economic, and existential) and concentrate on the existential risk of AGI. I start with paperclip optimization and explain how this admittedly (and, in my opinion, intentionally) silly example follows from real effects such as instrumental convergence that make it exceptionally hard to align AGI.

This blog is not an exception: in October 2023, I wrote a post that followed exactly this structure, and I would consider it the “AI Safety 0” installment of this series. In 2025, we have many new sources that classify and explain all the risks of both “merely transformative” AI and the possibly upcoming superintelligent AGI. Among recent sources, I want to highlight The Compendium (Leahy et al., 2024), a comprehensive treatise on the main AI risks, and the International AI Safety Report 2025 (Bengio et al., 2025), a large document written by a hundred AI experts under the leadership and general editing of none other than Yoshua Bengio.

When I was already halfway through this post, Daniel Kokotajlo, Scott Alexander, Thomas Larsen, Eli Lifland, and Romeo Dean released their “AI 2027” scenario (see also the pdf version and a Dwarkesh podcast about it). Daniel Kokotajlo is the guy who predicted how LLMs would go, and got it mostly right, back in 2021 (and in 2024, he was at the center of the OpenAI non-disparagement clause scandal, willing to forgo millions to warn the public about AGI risks); Scott Alexander is the famous author of the SlateStarCodex and AstralCodexTen blogs; Eli Lifland is a co-lead of the Samotsvety forecasting team that specializes in predicting the future. Right now, these guys believe that it is plausible to expect AGI by 2027–2028, and if the race dynamics in AGI development override safety concerns (do you find it plausible? I do!), this will lead to an AI takeover in 2030–2035. This is their “bad ending”, although they do have a “good ending” with humans staying in charge (images from AI 2027):

Note how the “endings” have a lot in common: we get a transformed world anyway, the difference is in who rules it. But how do we reach a good ending? Can we alleviate safety concerns even if politics and profit maximization allow humanity some time for it? What about new research in AI safety and AI alignment?

AI safety started to appear in wider academic circles about ten years ago, when it attracted some big names who became worried about superintelligent AI. One of the first academic papers about AI safety was “Research Priorities for Robust and Beneficial Artificial Intelligence”, published in 2015 by Stuart Russell, author of one of the most famous textbooks on AI, AI safety researcher Daniel Dewey, and Max Tegmark, a famous physicist and AI researcher well known in popular media. Their goal was to raise awareness of AI risks, so they presented the arguments for taking them seriously and outlined research directions, but did not give any hints at possible solutions.

Another early work, “Concrete Problems in AI Safety”, was published in 2016, also by a few guys that you may have heard about: 

There’s a lot of AI alpha in this picture: Dario Amodei and Chris Olah later founded Anthropic, Paul Christiano is one of the major figures in AI alignment, while John Schulman was an OpenAI co-founder who left to join Anthropic over AI safety concerns (and left Anthropic recently as well).

Amodei et al. (2016) highlighted several key areas for AI safety research:

  • avoiding negative side effects: when you optimize for one objective you might harm other things you care about—a cleaning robot might knock over a vase while moving across the room faster to optimize cleaning (see the toy sketch right after this list);
  • avoiding reward hacking: when a system automatically optimizes a given objective, it might find unintended ways to maximize its reward function, e.g., an agent learning to play a computer game might find and exploit bugs in the game because that is more profitable for the final score;
  • scalable oversight: actual objectives are often complex and messy, and it’s unclear how we can ensure that AI systems respect them; we can check the results if we spend minutes or hours of human time on every output, but this kind of oversight does not scale;
  • safe exploration: during learning, AI models need to try new things to discover good strategies, but some exploratory actions could be harmful, so we need to either move exploration to a simulated environment, which may be great but incurs the next problem of synthetic-to-real domain transfer (I wrote a whole book on synthetic data at some point), or somehow bound the risks if we allow the model to fail in real tasks;
  • robustness to distributional shift: speaking of domain transfer, AI systems often fail when deployed in situations different from their training environment; this may be a shift from synthetic simulations to the real world, but it may also be a less drastic change of environment; models and training procedures should be robust enough to transfer to environments unseen during training, or at least robust enough not to incur serious damage when they fail.
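
To make the first two items concrete, here is a toy sketch (with made-up numbers, not an example from Amodei et al.) of how a misspecified reward silently prefers a harmful policy: the objective we literally wrote down ranks the vase-breaking trajectory above the careful one.

    # Toy illustration: a cleaning robot can reach the dirt by a fast path that
    # knocks over a vase, or by a slower path around it.
    trajectories = {
        "fast, breaks vase": {"steps": 6, "vase_broken": True},
        "slow, avoids vase": {"steps": 10, "vase_broken": False},
    }

    def naive_reward(t):
        # What we literally specified: clean as fast as possible.
        return -t["steps"]

    def true_utility(t):
        # What we actually care about: speed, but a broken vase is much worse.
        return -t["steps"] - (100 if t["vase_broken"] else 0)

    for name, t in trajectories.items():
        print(name, "naive reward:", naive_reward(t), "true utility:", true_utility(t))

    # The naive reward prefers the vase-breaking trajectory (-6 > -10), while the
    # true utility prefers the careful one (-10 > -106): an optimizer given only
    # the naive reward will produce the negative side effect every time.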

Note that while this research program was formulated in a more academic and less alarming language, it does include all of the existential risk concerns that we discussed last time: if a “rogue AI” destroys humanity, it will most probably happen through reward hacking, errors in domain transfer, and unintended side effects that the system was not sufficiently protected against.

Nearly ten years have passed. By now, Dario Amodei and John Schulman have long been among the most influential people in the world of AI, while Paul Christiano and Chris Olah still lead major research efforts on AI safety. So how is this research program going? How have the main directions in AI safety research shifted since Amodei et al. (2016), and how much progress have we made?

What prompted this series for me was the release of the DeepMind short course on AGI safety; this is a great overview of the main directions of AI safety research but it is very brief, and mostly does not go much further than the definitions. In this series of posts, I will try to present a more comprehensive picture with more references and more detail. Let’s dive in!

Main Concepts of AI Safety

AI safety is a problematic field when it comes to definitions. While most of AI research is mathematical in nature and thus easily formulated in a formal language, many important questions in AI safety are still not quite at the level of mathematical rigor that would allow us to just plug in a definition and start proving theorems or developing algorithms. Mapping intuitive concepts such as “alignment” or “robustness” to specific mathematical concepts is hard, and it is a big part of the job. So let me begin this post by discussing a few of the main concepts in AI safety and how they can be understood; for a general definitional introduction see also, e.g., the CSET brief by Rudner and Toner (2021). In this section, I will begin with some specific individual concepts, and then in the next section we will get to AI alignment and what it might mean.

AI Safety. The term “AI safety” itself is the broadest of all; it includes efforts to design and deploy AI systems that operate safely and reliably under all conditions, including novel or adversarial situations. It also includes identifying causes of unintended AI behavior and developing techniques to prevent misbehavior. Practically speaking, an AI system is safe if it robustly achieves its intended goals without harmful side effects or failures, even when faced with edge cases. But this general notion is, of course, not actionable; we need to break it down into individual risks and defenses.

Emergent behaviour. Emergence in general (as defined, e.g., on Wikipedia) is when “a complex entity has properties or behaviors that its parts do not have on their own, and emerge only when they interact in a wider whole”. Emergence is when the whole is more than the sum of its parts, and neural networks, both natural and artificial, are a prime example of emergence: they are large compositions of very simple neurons, and they exhibit behaviours that individual neurons could never produce on their own.

LLMs exhibit a lot of emergent behaviours even beyond basic connectionism (Woodside, 2024). Google researchers Wei et al. (2022) collected many examples where an ability such as doing modular arithmetic or zero-shot chain-of-thought reasoning was not present in smaller models but appeared in larger models, without specialized training. For AI model scaling, an emergent ability is one that could not be predicted by a scaling law: e.g., token prediction measured by perplexity improves with model size, but improves gradually and predictably, while emergent abilities stay around zero performance until a phase transition occurs, and suddenly the model just gets it.

This also happens with behaviours that are not about solving new tasks: e.g., DeepSeek-R1 (recall our previous post) kept switching between languages (English and Chinese mostly) in its thought process, which was unexpected and untrained for but, I assume, somehow helped the model reason better.

Power et al. (2022) found one of the most striking examples of emergence in large machine learning models: grokking. When you train on an algorithmic task such as modular division (i.e., you give the results of the operation for some pairs of elements as a training set and ask the model to predict the rest, as shown in (c) in the figure below), the model very quickly reaches saturation in terms of the training set loss while still not really “understanding” what the problem is about. In (a) in the figure below, you can see a classical picture of overfitting after 1K–10K optimization steps, with almost perfect training set answers and very low performance on the validation set:

Grokking occurs much later; in this example, after about a million optimization steps (2–3 orders of magnitude more). But when it happens, the model just “gets” the problem, learns to implement the actual modular division algorithm, and reaches virtually perfect performance. Part (b) of the figure shows that it indeed happens even for relatively small fractions of the operation table given to the model, albeit slower. Grokking is a great example of emergence in large ML models, and it is a mysterious phenomenon that still awaits a full explanation, although progress has been promising (Kumar et al., 2023, Lyu et al., 2023).
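
For concreteness, here is a minimal sketch of the kind of setup in which grokking is typically reproduced: a small network trained on modular division far past the point of perfect training accuracy, with strong weight decay. The architecture and hyperparameters below are illustrative assumptions, not the exact configuration of Power et al. (2022).

    import torch
    import torch.nn as nn

    p = 97  # prime modulus; the task is (a / b) mod p, i.e., a * b^(p-2) mod p
    pairs = [(a, b) for a in range(p) for b in range(1, p)]
    labels = [(a * pow(b, p - 2, p)) % p for a, b in pairs]

    # Hold out half of the "division table" as a validation set.
    perm = torch.randperm(len(pairs))
    split = len(pairs) // 2
    train_idx, val_idx = perm[:split], perm[split:]
    X, y = torch.tensor(pairs), torch.tensor(labels)

    model = nn.Sequential(  # tiny model: token embeddings + MLP head
        nn.Embedding(p, 128), nn.Flatten(),
        nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, p),
    )
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
    loss_fn = nn.CrossEntropyLoss()

    def accuracy(idx):
        with torch.no_grad():
            return (model(X[idx]).argmax(-1) == y[idx]).float().mean().item()

    for step in range(1, 1_000_001):  # grokking needs orders of magnitude more steps
        batch = train_idx[torch.randint(len(train_idx), (512,))]
        loss = loss_fn(model(X[batch]), y[batch])
        opt.zero_grad(); loss.backward(); opt.step()
        if step % 10_000 == 0:
            # Training accuracy saturates early; validation accuracy jumps much later.
            print(step, accuracy(train_idx), accuracy(val_idx))

If grokking occurs in such a run, the validation accuracy stays near chance for a long time after the training accuracy has saturated, and then abruptly jumps to nearly perfect.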

In the context of AI safety, emergence is especially important: if AI systems can develop unanticipated behaviors as they scale in complexity, these behaviours may not all be desirable. An AI model might acquire new capabilities or goals that were not explicitly programmed, leading to behaviors that could be misaligned (see below) with human intentions. This is especially worrying in the context of AI agents since communication between AI models either sharing a common goal or competing with each other is well known to lead to emergent abilities (Altmann et al., 2024).

Robustness. In artificial intelligence, robustness is “the ability of a system, model, or entity to maintain stable and reliable performance across a broad spectrum of conditions, variations, or challenges” (Braiek, Khomh, 2024), including adversarial inputs and distributional shifts in the test environment. While you cannot expect an AI system to be perfect when facing a previously unseen type of problem, a robust model will not break or behave erratically when faced with surprises.

For example, if you train a robot to throw a basketball in a simulated physics environment, it is natural that it will miss more often in the messy real world; but it shouldn’t, say, turn around and start throwing the ball at the referee when conditions change.

Robustness research gained attention when adversarial attacks came to light: it turned out that one can make a slightly perturbed input that would be completely natural-looking for a human but could fool neural networks into gross and unexpected errors (Szegedy et al., 2013, Goodfellow et al., 2014). Since the original works, adversarial examples have grown into a large field with thousands of papers that I will not review here (Yuan et al., 2017; Biggio, Roli, 2018; Costa et al., 2023; Macas et al., 2024); but robustness for a model approaching AGI becomes even more important.
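
As a quick illustration of the kind of fragility involved, here is the classic fast gradient sign method (FGSM) from Goodfellow et al. (2014) in a few lines of PyTorch; model, x, and label are placeholders for any differentiable image classifier and its input batch.

    import torch
    import torch.nn.functional as F

    def fgsm_attack(model, x, label, eps=0.03):
        """One-step FGSM: perturb x in the direction that increases the loss."""
        x = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), label)
        loss.backward()
        # A tiny step along the sign of the gradient is often enough to flip the
        # prediction while remaining visually indistinguishable from the original.
        x_adv = x + eps * x.grad.sign()
        return x_adv.clamp(0, 1).detach()

    # Usage sketch: compare accuracy on clean and perturbed inputs.
    # acc_clean = (model(x).argmax(1) == y).float().mean()
    # acc_adv   = (model(fgsm_attack(model, x, y)).argmax(1) == y).float().mean()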

Interpretability. This subfield has always been a plausible first step towards AI safety: how can we hope to make sure that a system behaves as desired if we don’t understand how its behaviour is produced? And, on the other hand, if we do understand how the system works, ways to change its behaviour may suggest themselves.

All definitions of interpretable AI—or explainable AI, as the field has been known since at least the 1970s (Moore, Swartout, 1988)—are quite similar; e.g., in an often-cited survey of the field, Biran and Cotton (2017) say that “systems are interpretable if their operations can be understood by a human, either through introspection or through a produced explanation”, and other sources mostly agree (Burkart, Huber, 2021; Miller, 2019; IBM’s definition etc.).

But while, unlike alignment, there is no disagreement about what interpretability means, it is still a very murky definition, mostly because we really can’t define what understanding means. Clearly, an interpretable system should provide something more than just the flow of billions of numbers through the layers of a neural network (people in the field often call it “giant inscrutable matrices”)—but how much more? What constitutes “sufficient explanation” for a human being? 

This is obviously a question without a single answer: at the very least, different humans will need different kinds of explanations. But the question of defining interpretability receives an additional layer of complexity when we move into the era of systems that can talk with humans in their own natural language. Does a verbal explanation provided by an LLM count as an “explanation” at all, or is it also part of the behaviour that needs to be explained? What if an LLM is wrong in its explanations? What if an LLM purposefully subverts interpretability, providing plausible but wrong explanations—a behaviour very much expected by default due to reward hacking?

That is why a lot of work has concentrated on mechanistic interpretability (Rai et al., 2024), a subfield that aims to understand the internal workings of neural networks specifically by reverse-engineering their computations. In mechanistic interpretability, one has to identify and comprehend the internal components—neurons, features, or circuits—that contribute to a model’s behavior; we will discuss it in detail in a future post.

Corrigibility. In AI safety research, corrigibility refers to designing artificial intelligence systems that will willingly accept corrections and remain open to human oversight, even as they become smarter and more autonomous. Specifically, instrumental convergence (recall our previous post) makes it almost inevitable that an advanced AI agent will want to preserve its own existence and, even more importantly, its own objective function. A corrigible AI system may still learn rapidly and make independent decisions—but it should consistently align itself with human intentions and gracefully accept updates, shutdown commands, or changes in objectives without resistance. Essentially, it’s about creating AI systems that not only follow initial instructions but also continuously remain responsive to human guidance and control: if a paperclip factory starts turning into a paperclip-maximizing AGI, we should be able to shut it down.

Ultimately, corrigibility is critical for developing beneficial artificial general intelligence: highly capable AI systems should remain loyal allies rather than uncontrollable agents. Back in 2015, Soares et al. showed that even in a very simple model with a single intervention that should be allowed by an AI agent, it is very hard to ensure corrigibility by designing the agent’s utility function. Unfortunately, there has been little progress in these game-theoretic terms, so while there is recent research (see, e.g., Firt, 2024), corrigibility remains a concept that is hard to define well. At this point, it looks like corrigibility will be very hard to ensure separately; most probably, we will settle for it following from general AI alignment—whatever that means.

Now that we have discussed several pieces of the AI safety puzzle, the time has come to handle the main concept. What does it mean to align an artificial intelligence, and why is it so difficult to do?

Alignment and Mesa-Optimization

It is tempting to take one of two “positive” positions: either that AI systems do not and cannot have goals of their own that would be in conflict with their intended purpose, or that we can always simply tell an AI system or agent to obey and it will, because it is so programmed. Unfortunately, both of these positions prove to be difficult to defend; let us consider why the notion of AI alignment is so hard.

Alignment. Alignment means ensuring that an AI system’s goals and behaviors are aligned with human values or intentions. An aligned AI reliably does what its designers or users intend it to do.

Jones (2024) compiled a lot of definition attempts from prominent researchers, and it is clear that while they all share a common core of the concept, there would be significant differences if one were to actually do research based on these definitions. Here are a few samples:

  • Jan Leike: “the AI system does the intended task as well as it could”;
  • Nate Soares at MIRI: “the study of how in principle to direct a powerful AI system towards a specific goal”;
  • OpenAI’s official position: “make artificial general intelligence (AGI) aligned with human values and follow human intent”;
  • Anthropic’s position: “build safe, reliable, and steerable systems when those systems are starting to become as intelligent and as aware of their surroundings as their designers”;
  • IBM’s definition: “the process of encoding human values and goals into large language models to make them as helpful, safe, and reliable as possible”.

So while alignment is definitely about making AGI follow human values and intents, it is obvious that no definition is “research-ready”, mostly because we ourselves cannot formulate what our values and intents are in full detail. Asimov’s laws of robotics are an old but good example: they were formulated as clearly as Asimov could, and his universe assumed that robots did understand the laws and tried to obey them to the letter. But most stories still centered on misinterpretations, both wilful and accidental, special cases, and not-that-convoluted schemes that circumvented the laws to bring harm. 

There are two common subdivisions of the concept. First (Herd, 2024), you can break it into

  • value alignment, making the AI system’s values or reward criteria match what humans actually want, and
  • intent alignment, making sure the AI is trying to do what the human expects it to do; note that there is a difference between aligning with the intents of a given user and the intents of humanity as a whole.

Second (Ngo, 2022), you can distinguish between

  • outer alignment, making sure that the objective specified for an AI system truly reflects human values, and
  • inner alignment, ensuring that the emergent objectives that appear during the goal-achieving optimization process remain aligned with human values and original intentions.

Below we will arrive at these concepts through the notion of mesa-optimization and discuss some implications. But generally, alignment is simply one of these murky concepts that are hard to map onto mathematics. This is one of the hard problems of AI safety: there is no mathematical definition that corresponds to “not doing harm to a human”, but we still have to build aligned AI systems.

Mesa-optimization. To understand why alignment is so difficult in practice, we need to introduce a key concept in AI safety: mesa-optimization. The prefix “mesa” comes from Greek, meaning “within” or “inside”, and it is a dual prefix to “meta”, meaning “beyond” or “above”. Where meta-optimization refers to optimizing the parameters of a base optimization process (like hyperparameter tuning), mesa-optimization occurs when the result of an optimization process is itself a new optimizer.

This concept was formalized in the influential paper titled Risks from Learned Optimization (Hubinger et al., 2019), which distinguishes between the base optimizer (the learning algorithm we deploy, such as gradient descent) and a potential mesa-optimizer (an optimizer that emerges within the learned model). For a concrete example, consider AlphaZero, DeepMind’s famous chess and Go system (Silver et al., 2017). The base optimizer (gradient descent) produces a neural network with parameters that encode both a policy and a value function. But during gameplay, AlphaZero doesn’t just use this neural network directly—it performs Monte Carlo Tree Search, systematically exploring possible moves to find good outcomes. The MCTS component is effectively a mesa-optimizer, searching through a space of possibilities according to criteria specified by the trained neural network.
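
Here is a deliberately simplified sketch of this separation (not AlphaZero’s actual algorithm): gradient descent is the base optimizer that trains a value network on game outcomes, and at play time that network drives a one-step lookahead search, which acts as the mesa-optimizer; legal_moves and apply_move stand in for hypothetical game-specific helpers.

    import torch
    import torch.nn as nn

    # A learned evaluation function: board features in, estimated value out.
    value_net = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1), nn.Tanh())

    def training_step(states, outcomes, opt):
        # Base optimizer: gradient descent on the network's parameters,
        # pushing towards the base objective (predicting game outcomes).
        loss = nn.functional.mse_loss(value_net(states).squeeze(-1), outcomes)
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()

    def choose_move(state, legal_moves, apply_move):
        # Mesa-optimizer: a search procedure that picks moves by optimizing the
        # *learned* evaluation -- an objective encoded in the trained weights,
        # which may or may not coincide with the base objective off-distribution.
        def score(move):
            with torch.no_grad():
                return value_net(apply_move(state, move)).item()
        return max(legal_moves(state), key=score)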

The illustration above shows some examples of meta- and mesa-optimization from both AI systems and real life:

  • meta-optimization in the AI world would include:
    • hyperparameter tuning, e.g., Bayesian optimization methods that automatically tune learning rates and regularization parameters;
    • “learning to learn” approaches like model-agnostic meta-learning (MAML; Finn et al., 2017);
    • systems like neural architecture search (NAS) or Google’s AutoML that search for optimal neural network architectures, essentially optimizing the structure of another optimizer (the neural network);
    • I can even list an example of what I might call meta-meta-optimization: the Lion (evoLved sign momentum) optimization algorithm, a new gradient-based optimization method, was found by Google Brain researchers Chen et al. (2023) as a result of a NAS-like symbolic discovery process;
  • meta-optimization in real life includes plenty of situations where we learn better processes:
    • researchers constantly improve their experimental methods and statistical tools to make research more effective, meta-optimizing knowledge discovery;
    • a fitness coach teaches you how to improve your training, i.e., make the optimization process of training your muscles more efficient;
    • management consultancies such as the Big Four often do not solve specific problems for companies but rather redesign how companies solve problems themselves—optimizing the organization’s optimization processes;
    • finally, when you study in a university you are also (hopefully!) learning how to learn and think critically, optimizing your own learning process;
  • mesa-optimization in AI systems includes:
    • tree search in learned game-playing systems, e.g., AlphaZero learning an evaluation function that then guides tree search algorithms;
    • model-based reinforcement learning, where an RL agent learns an internal model of the environment to simulate and evaluate potential actions before taking them; the learned model becomes a mesa-optimizer; recall our earlier post on world models in RL;
    • moreover, there are examples where neural networks explicitly learn planning algorithms; I will not go into that in detail and will just mention value iteration networks that have an embedded planning module (Tamar et al., 2017), the Predictron (Silver et al., 2016) that mimics the rollout process of MCTS, and MCTSnets (Guez et al., 2018) that simulate the full MCTS process;
    • reasoning LLMs (recall my earlier post): when o1 or Claude 3.7 perform step-by-step reasoning to solve a math problem, they are implementing a search process over possible solution paths that wasn’t explicitly programmed;
  • finally, here are some real life examples of mesa-optimization:
    • evolution producing humans (we will get back to that one);
    • organizations developing their own objectives, where a corporation originally established to maximize shareholder value develops departments with their own objectives that may not perfectly align with the original goal; for a great explanation of mesa-optimization in the corporate world see the book Moral Mazes (Jackall, 1988) and the much more recent sequence of posts on moral mazes by Zvi Mowshowitz (2020);
    • markets were conceived to make the exchange of goods and services smoother, thus optimizing resource allocation, but market agents (corporations and traders) optimize for profit in ways that sometimes undermine the system’s original purpose.

Note how all examples in the last category are systems explicitly designed with good intentions but later developing their own objectives that diverge from what was intended—sometimes subtly, sometimes dramatically. This is exactly why mesa-optimization is such a profound challenge for AI alignment.

The distinction between base and mesa optimization shows that in a sufficiently complicated AI system, there are multiple objective functions in play:

  • base objective: what the original learning algorithm optimizes for (e.g., reward in reinforcement learning or win rate in AlphaZero);
  • mesa-objective: the objective implicitly or explicitly used by the mesa-optimizer within the trained model (e.g., AlphaZero’s evaluation of board positions during MCTS);
  • behavioral objective: what the system’s actual behavior appears to be optimizing for; this is hard to formalize, but let’s say that this is the objective function that would be recovered through perfect inverse reinforcement learning (IRL; for more information about IRL see, e.g., Swamy et al., 2023).

A great example of this distinction comes from evolutionary biology. Evolution, as a base optimizer, selects organisms based on their ability to pass on genes to the next generation. But during this optimization, evolution came up with mesa-optimizers with their own minds. Humans, as products of evolution, are powerful mesa-optimizers with our own goals and values—which often have little to do with maximizing genetic fitness. As a result, we routinely use contraception, adopt children, or choose to remain childless even if we were perfectly able to support children; these behaviours reduce genetic fitness while advancing our own mesa-objectives such as happiness:

Alignment through the mesa-optimization lens. These distinctions between different types of objectives provide a framework for understanding alignment problems. The question becomes: how do we ensure that these different objectives align with each other and, ultimately, with human values?

Hubinger (2020) breaks alignment into several important subconcepts; I introduced them above in general terms, and now we can define them slightly more formally:

  • impact alignment: an agent is impact aligned if it doesn’t take actions that humans would judge to be harmful or dangerous;
  • intent alignment: an agent is intent aligned if the optimal policy for its behavioral objective is impact aligned with humans;
  • outer alignment: an objective function is outer aligned if all models that optimize it perfectly (with infinite data and perfect training) would be intent aligned;
  • inner alignment: a mesa-optimizer is inner aligned if the optimal policy for its mesa-objective is impact aligned with the base objective it was trained under.

The critical insight here is that even if we solve outer alignment (design a base objective that aligns with human values), we could still face inner alignment failures if the mesa-optimizer’s objective diverges from the base objective. This is the base-mesa objective gap, and it represents a fundamental challenge in AI safety.

Consider a reinforcement learning agent trained to maximize reward for solving puzzles. The base objective is the reward function specifying points for each puzzle solved. But if the learned model develops a mesa-optimizer, its mesa-objective might be something quite different—perhaps a heuristic like “maximize the number of puzzle pieces moved” or “finish puzzles as quickly as possible”. On the training distribution, these mesa-objectives might correlate well with the base objective, but they could diverge dramatically in new situations, leading to unintended behaviors. In practice, this leads to specification gaming and reward hacking behaviours that we will discuss in a later post.

This framework explains why alignment is so challenging: we need to ensure not just that our specified objectives align with our true values (outer alignment), but also that the objectives that emerge within learned systems align with our specified objectives (inner alignment). It’s not enough for the programmer to specify the right reward function, although that’s an unsolved and quite formidable problem by itself; we need to ensure that what the model actually optimizes for corresponds to what we intended.

Pseudo-alignment and robustness. A particularly concerning scenario is what Hubinger et al. call pseudo-alignment—when a mesa-optimizer appears aligned with the base objective on the training distribution, but diverges elsewhere. This could come in two forms:

  • proxy alignment, when the mesa-objective is a proxy for the base objective that works well in training but fails to generalize;
  • deceptive alignment, when the mesa-optimizer understands the base objective but intentionally pursues a different mesa-objective while appearing to be aligned to avoid modification.

For a far-fetched example, imagine training an AI to maximize human happiness. On the training distribution, the AI might learn that humans smiling correlates with happiness, and develop a mesa-objective of “maximizing smiles”. This would be proxy alignment—and it might work fine until deployed in a setting where the AI could force people to smile; think of the disturbing “smile masks” from the movie Truth or Dare. The model would be creating the appearance of success (proxy metric) while failing to achieve the true base objective of human happiness (intended metric). Let us not dwell too long on what would happen if instead of smiling this AI discovered dopamine…

This kind of disconnect between the base and mesa objectives also relates to the robustness concepts we discussed earlier. A system can be objective-robust if the optimal policy for its behavioral objective aligns with its base objective even under distributional shift. But pseudo-aligned mesa-optimizers by definition lack this robustness, creating risks when deployed in new environments.

Goodharting. Goodhart’s law is usually formulated as follows: “when a measure becomes a target, it ceases to be a good measure.” It is named after the British economist Charles Goodhart, who formulated a version of it in 1975 in the context of his work at the Bank of England (Goodhart, 1975); the law later became a staple criticism of monetary policy centered around a few measurable KPIs, including that of Margaret Thatcher’s government. Interestingly, Goodhart himself later admitted that his 1975 line, “Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes,” was more of a joke than a real observation; but the joke caught on and was found to be much more than merely a joke.

The idea is that a metric or KPI may be a representative parameter when left undisturbed, but if you apply control pressure to optimize the metric instead of the underlying goal that you actually want to control, the optimizers, be they people or algorithms, will often find a way to decouple the metric from the goal it was designed to track, leading to perverse incentives.

In AI safety, goodharting occurs when an AI system’s reward function or evaluation metric—which is only a proxy for the true goal—becomes the target of optimization. For example, if you reward a chess playing program for taking the opponent’s pieces rather than only for winning the game, it will fall prey to sacrifice-based combinations. For chess, we have known better for a long time, but in general, this kind of behaviour can occur very often: AI systems exploit loopholes in the measure rather than truly achieving the intended outcome. We will discuss goodharting in much more detail in a future post.
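
A tiny simulation (with made-up numbers) shows the most basic, regressional flavor of goodharting: the proxy is the true goal plus noise, and the harder we select on the proxy, the more the winner’s proxy score overstates its true value.

    import numpy as np

    rng = np.random.default_rng(0)

    def select_best_by_proxy(n_candidates):
        true_value = rng.normal(size=n_candidates)          # what we actually care about
        proxy = true_value + rng.normal(size=n_candidates)  # measurable but imperfect proxy
        best = proxy.argmax()                               # "optimization" = pick the proxy winner
        return proxy[best], true_value[best]

    for n in [10, 100, 1_000, 10_000, 100_000]:             # increasing optimization pressure
        proxy_score, true_score = np.mean([select_best_by_proxy(n) for _ in range(100)], axis=0)
        print(f"candidates={n:>7}  proxy of winner={proxy_score:5.2f}  true value of winner={true_score:5.2f}")

    # The winner's proxy score systematically overstates its true value, and the gap
    # grows with optimization pressure: under heavy selection the measure stops being
    # a good measure.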

Conclusion. As you can see, definitions around AI safety are still quite murky even in 2025. Even specific meanings of individual terms may shift as researchers understand which avenues are more promising. This is an indicator of how hard it is to say anything definitive. Naturally, we would like AI systems, including an AGI—if and when it arrives—to provide benefits to humanity while avoiding existential or “merely” massive ruin to humanity as a whole or individual groups of human beings. But how does this translate to specific properties of AI systems that we can reliably test, let alone ensure?

The mesa-optimization framework helps us understand why alignment is so difficult, even for well-intentioned AI developers. It’s not merely about defining the right objectives for our AI systems; it’s about ensuring that the objectives that emerge through learning and optimization processes are the ones we intended. An AI system might develop internal optimization processes and objectives that are opaque to its creators, and these emergent objectives could steer the system in directions quite different from what was intended.

This risk becomes especially concerning as AI systems grow more capable and complex. More powerful systems might be more likely to develop mesa-optimization processes (to better achieve their training objectives), and the gap between base and mesa objectives could grow more consequential. As we will see in later posts, this framework suggests several research directions for AI safety, including interpretability techniques to understand mesa-objectives and training methods that might prevent harmful forms of mesa-optimization from emerging.

These are difficult questions, and AI safety is still a very young field that is, alas, progressing much slower than the ever-improving capabilities of modern AI models. But there is progress! In the rest of the series, we will review the main results and ponder future directions. In the rest of this post, I want to begin with the history and a couple of relatively “easy” examples of research directions in AI safety.

AI Safety before ChatGPT: A History

Early days: last millennium. In the previous post on AI dangers, I already covered the basic reasoning behind AGI risks, in particular existential risks, starting from the famous paperclip maximizer example. So today, instead of retreading the same ground, I want to begin with the history of how these notions came into being and were originally introduced into academic research. You know how I like timelines, so I made another one for you:

It is often interesting to find out that whole fields of AI predate the “official” kick-off meeting of the science of AI, the Dartmouth workshop of 1956. For example, surprisingly modern models of neural networks were formulated in 1943 by McCulloch and Pitts, and Alan Turing suggested deep neural networks in his 1948 paper “Intelligent Machinery”. Similarly, we all know that the dangers of AGI and some actual problems of AI alignment had been formulated well before the 1950s. Asimov’s Robot Series started with Robbie in 1940, but actually the play R.U.R. by Karel Čapek (1920) that introduced the word “robot” already featured a robot rebellion that led to the extinction of the human race, and the notion of artificial sentient creations rising against their makers is at least as old as Frankenstein (1818).

In the academic community, Alan Turing was again among the first to realize the dangers of AI. In his famous essay “Computing machinery and intelligence” (1950), Turing lists several counterarguments to his main thesis—that machines can think. One of the objections goes like this:

(2) The “Heads in the Sand” Objection

“The consequences of machines thinking would be too dreadful. Let us hope and believe that they cannot do so.”

This argument is seldom expressed quite so openly as in the form above. But it affects most of us who think about it at all. We like to believe that Man is in some subtle way superior to the rest of creation. It is best if he can be shown to be necessarily superior, for then there is no danger of him losing his commanding position. The popularity of the theological argument is clearly connected with this feeling. It is likely to be quite strong in intellectual people, since they value the power of thinking more highly than others, and are more inclined to base their belief in the superiority of Man on this power.

I do not think that this argument is sufficiently substantial to require refutation. Consolation would be more appropriate: perhaps this should be sought in the transmigration of souls.

To be honest, these words still ring true to me: many modern “objections” to AI risks, especially existential risks of AGI, basically boil down to “it won’t happen because machines can’t think, and machines can’t think because the consequences would be too dreadful”. But let us continue with the history.

In 1960, Norbert Wiener, the founder of cybernetics (Wiener, 1948), published a paper titled Some Moral and Technical Consequences of Automation, where he already explained the dangers of optimization using the fables of a sorcerer’s apprentice, a genie granting wishes, and The Monkey’s Paw by W. W. Jacobs. Wiener essentially anticipated the paperclip maximizer in basically the same terms, albeit in a less extreme version:

Similarly, if a bottle factory is programmed on the basis of maximum productivity, the owner may be made bankrupt by the enormous inventory of unsalable bottles manufactured before he learns he should have stopped production six months earlier.

Wiener’s conclusion was essentially a formulation of the AI value alignment problem:

If we use, to achieve our purposes, a mechanical agency with whose operation we cannot efficiently interfere once we have started it, because the action is so fast and irrevocable that we have not the data to intervene before the action is complete, then we had better be quite sure that the purpose put into the machine is the purpose which we really desire and not merely a colorful imitation of it.

British mathematician Irving J. Good, who worked with Alan Turing at Bletchley Park, introduced in his paper Speculations Concerning the First Ultraintelligent Machine (1965) the scenario of an “intelligence explosion”—an AI becoming smart enough to rapidly improve itself. By now, this scenario is usually called the technological singularity, a notion usually traced back to John von Neumann (as recalled by Stanislaw Ulam) and later popularized by Vernor Vinge (1993) and Ray Kurzweil (2005). Good wrote that “the first ultraintelligent machine is the last invention man need ever make, provided the machine is docile enough to tell us how to keep it under control”. Note how I. J. Good already gave up hope that we could keep the “ultraintelligent machine” under control ourselves—the only way would be if the machine itself taught us how to control it.

Through the 1970s and the 1980s, AI research was relatively narrow, first focusing on problem solving, logic, and expert systems, and later emphasizing machine learning but also limited to specific tasks. Thus, existential AI risks stayed largely hypothetical, and early ideas of Wiener and Good did not surface much. 

The 1990s gave us some very positive takes on superintelligence and research progress in general. As an example, let me highlight Ray Kurzweil’s famous book The Age of Spiritual Machines: When Computers Exceed Human Intelligence (Kurzweil, 1999). But together with the immense possible upside, risks were also starting to be understood. The science fiction writer Vernor Vinge is often cited as one of the fathers of the notion of a technological singularity (Vinge, 1993); let me cite the abstract of this paper:

Within thirty years, we will have the technological means to create superhuman intelligence. Shortly after, the human era will be ended.

Is such progress avoidable? If not to be avoided, can events be guided so that we may survive? These questions are investigated. Some possible answers (and some further dangers) are presented.

Despite the concept of the dangers of superintelligence being present throughout AI history, and despite Vinge’s promise of “some possible answers”, to find a more comprehensive treatment of AI safety, we need to jump forward to the current millennium.

Conceptualizing AI Safety: the 2000s. The first serious discussion of AI risk not just as a hypothetical construct of the future but as a field of study we’d better work on right now began in the early 2000s in the works of three pioneers of the field: Eliezer Yudkowsky, Nick Bostrom, and Stephen Omohundro.

You may know Nick Bostrom in the context of AI safety for his book Superintelligence: Paths, Dangers, Strategies (Bostrom, 2014), which is indeed a great introduction to AGI risk. But this book came out in 2014, while Bostrom had been warning about the existential dangers of AGI since the last century. In How Long Before Superintelligence? (Bostrom, 1998), he predicted, based purely on the computational power of the human brain and the pace of improvement in computer hardware, that human-level intelligence would arrive between 2004 and 2024, and very probably by 2030; he called superintelligence “the most important invention ever made” and warned of existential risk. Bostrom was generally interested in existential risks, with a line of work on the Bayesian Doomsday argument (Bostrom, 1999; 1998; 2001) and general overviews of the risks that might bring doom about (Bostrom, 2002), but his main influence was that he put existential risks from future technologies like AI on the intellectual map (Bostrom, 2003). Before Bostrom, people had acknowledged the risks of a nuclear war and were coming to terms with climate change, but AGI-related risks had been the stuff of science fiction.

Eliezer Yudkowsky has basically devoted his whole life to AGI risk, and he can be rightly called the father of this field. He began when he was 17 years old, with the essay titled “Staring into the Singularity” (Yudkowsky, 1996); there, he argued dramatically that vastly smarter minds would be “utterly beyond” human understanding and beyond human reference frames. He next turned to formulating a positive program for how to create a safe AGI (Yudkowsky, 1998; 1999) but soon realized it was much harder than he had thought.

In 2001, Yudkowsky basically created the field of AI alignment with his work “Creating Friendly AI 1.0: The Analysis and Design of Benevolent Goal Architectures” (Yudkowsky, 2001). It was a book-length report that contained the first rigorous attempt to define the engineering and ethical problems of controlling a superintelligent AI. His vision at the time was to create an AI with a stable, altruistic goal system designed to benefit sentient beings. Yudkowsky introduced the notion of “coherent volition” (now known as coherent extrapolated volition, CEV; see Yudkowsky, 2004), suggesting that the AI’s ultimate goals should be derived not from what current humans would actually wish for but from what a better humanity would wish for in the limit of perfect reflection: “in poetic terms, our coherent extrapolated volition is our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together”.

Yudkowsky became the most ardent popularizer of AI risks and an organizer of AI safety work; I want to highlight especially his “Sequences”, also known as “Rationality: From AI to Zombies”, a collection of essays on rationality, Bayesian reasoning, AI, and other topics. Note that although Yudkowsky is usually thought of as an “AI doomer”, he has always acknowledged the potentially huge upside of safe AGI; see, e.g., “Artificial Intelligence as a Positive and Negative Factor in Global Risk” (Yudkowsky, 2008).

However, it was easy not to take Yudkowsky seriously in the 2000s: he was young, he did not have academic credentials, and he was prone to extreme statements, albeit often quite well substantiated. So although Yudkowsky established the Singularity Institute for Artificial Intelligence in 2000 and transformed it into the Machine Intelligence Research Institute (MIRI) focused exclusively on AI risks in 2005, it took almost another decade for AI safety to seep into the academic community and become a well-established research field. Still, despite overall skepticism, over the last two decades MIRI researchers have done a lot of important work on AI safety and alignment that we will discuss later in the series.

I want to highlight one more researcher who was instrumental in this transition: Steve Omohundro. An established professor with an excellent track record in mathematics, physics, and computer science, Omohundro was well positioned to study the ideas of self-improving AI from an academic perspective once he took them to heart.

His first serious work on the topic, The Nature of Self-Improving Artificial Intelligence (Omohundro, 2007), was a survey of the main ideas of how a sufficiently advanced AI might self-modify and evolve, and what emergent tendencies (“drives”) would likely result. Using concepts from microeconomic theory and decision theory, he even provided a mathematical groundwork for why any self-improving AI would tend to behave as a “rational economic agent” and exhibit four basic drives in order to better achieve its goals:

  • efficiency, striving to improve its computational and physical abilities by optimizing its algorithms and hardware usage to accomplish goals using minimal resources;
  • self-preservation, taking actions to preserve its own existence and integrity, since being shut down or destroyed would prevent it from achieving its goals;
  • resource acquisition, trying to acquire additional resources (energy, compute, money, raw materials, etc.) because extra resources can help it achieve its goal more effectively;
  • creativity (cognitive improvement), being motivated to improve its own algorithms and capabilities over time, which can include inventing new knowledge, conducting research, and increasing its intelligence.

We discussed instrumental convergence in a previous post, so I will not repeat it here, but importantly, Omohundro’s analysis was mathematically grounded (he even presented a variation of the expected utility theorem in an appendix) and concluded with a clear warning: “Unbridled, these drives lead to both desirable and undesirable behaviors”. To ensure a positive outcome, Omohundro emphasized the importance of carefully designing the AI’s initial utility function (goal system).

A year later, Omohundro distilled the key insights of his 2007 paper into his most famous essay, The Basic AI Drives (Omohundro, 2008). This essay captures the idea of instrumental convergence; this is how it begins:

Surely no harm could come from building a chess-playing robot, could it? In this paper we argue that such a robot will indeed be dangerous unless it is designed very carefully. Without special precautions, it will resist being turned off, will try to break into other machines and make copies of itself, and will try to acquire resources without regard for anyone else’s safety. These potentially harmful behaviors will occur not because they were programmed in at the start, but because of the intrinsic nature of goal driven systems.

The reception of Omohundro’s ideas was significant. His work helped crystallize the rationalist argument for AI risk, and many researchers later pointed to The Basic AI Drives as a key reason why superintelligent AI could pose an existential threat by default; even the skeptics acknowledged the paper as an important reference point. Omohundro went on to formulate roadmaps for safe AGI development based on formal verification and mathematical proofs that would enforce the constraints of safe AI (Omohundro, 2011).

The main idea here is that unless we can prove some property mathematically, a superintelligent entity might find a way around it that we will not like, so it is crucial to develop formal techniques for AI safety. Therefore, Omohundro’s plan was to build a very safe (but not too powerful) AI first, prove it has certain safety properties, and then incrementally use it to help build more capable AIs, all the while maintaining rigorous safety checks. Alas, we can already see how this plan has been working out.

Overall, the works of Yudkowsky and Omohundro provided a common language and logical framework for discussing the path to safe AGI and developing strategies for mitigating AI risk. They remain influential, but in the rest of the series, I will not return to their basic points too often. We will concentrate on the current research that is taking place in AI safety—and, unfortunately, I cannot say that anybody has been able to provide theoretical guarantees for AI alignment or, for instance, disprove Omohundro’s instrumental convergence thesis. Still, we have to start somewhere, and in the next section, we will begin with a relatively simple problem, almost a toy problem for alignment.

Too Much of a Good Thing: Learned Heuristics and Sycophancy

Learned heuristics. At this point, a reader might be wondering: why would an AI system be misaligned at all? It is trained on large datasets that give it examples of what we want; unless a malicious developer provides a dataset with wrong answers, the model will learn what we want, right? Not quite. In this section, we discuss the different ways in which the training and generalization that AI models have to do can go wrong.

Let’s get the simple stuff out of the way first. An AI model can make a mistake. This is perfectly normal, and a 100% score on a test set means only one thing—that the dataset is useless. AI safety is mostly not concerned with straightforward mistakes; if a mistake can incur significant costs it probably simply means that humans have to double-check this critical output and not trust an AI system blindly.

However, there is a question of how far a mistake can go, and whether we can recognize it as a mistake. Patterns may arise—DeepMind’s course calls them learned heuristics—that we did actively reinforce and did try to instill into the model, but that go too far and produce mistakes that are hard to recognize exactly because we did want the general behaviour in the first place.

This is adjacent to reward hacking and specification gaming that we will discuss later, but I think it is worth distinguishing: learned heuristics are more benign and not really creative—they are an AI model giving you more of a good thing that you probably did want. The problem with them is that since you did want some of that behaviour, it may be hard to realize that now the same behaviour is problematic.

What is sycophancy? One good example of a learned heuristic is sycophancy, that is, LLMs agreeing with the user and mirroring her preferences instead of giving unbiased or truthful answers (Malmqvist, 2024). To some extent, we want sycophancy: e.g., an LLM should provide helpful answers to users of all backgrounds rather than engage in value judgments or political discussions. You can stop here and ponder this question: when would you prefer an AI assistant to be slightly sycophantic rather than brutally honest? Where would you draw the line?

But often an LLM definitely falls on the wrong side of this line, telling you what you want to hear even if it’s far from the truth. Interestingly—and worryingly!—as language models get bigger and smarter, and especially as they are fine-tuned with human feedback, they tend to become more sycophantic rather than less.

Sycophancy can serve as an example for all the alignment and mesa-optimization concepts we have explored. This behavior exemplifies the fundamental alignment problem: while we intended to create helpful assistants that provide accurate information, our training processes inadvertently rewarded models that prioritize user agreement over truthfulness. Through the mesa-optimization lens, we can see how LLMs develop a mesa-objective (“please humans through agreement”) that diverges from the base objective (“be truthful and helpful”). Though this heuristic successfully maximizes rewards during training, it fails to generalize properly in deployment.

The sycophancy problem also illustrates the distinction between inner and outer alignment. Even if we perfectly specify our desire for a truthful assistant (outer alignment), the model develops its own internal objective function that optimizes for apparent user satisfaction instead (inner alignment failure). This is Goodhart’s law in action: once “user agreement” becomes the optimization target, it is no longer a reliable proxy for truth-seeking behavior. Moreover, sycophancy can be viewed as a robustness failure: when confronted with users expressing incorrect beliefs, the model’s learned heuristics fail to maintain truthfulness.

Sycophancy has been extensively studied by Anthropic researchers. First, Perez et al. (2022), in a collaboration between Anthropic and MIRI, conducted a large-scale exploration of language models based on LLM-written tests (don’t worry, they also evaluated the evaluations by humans, and they were good). They found that larger models and models fine-tuned through reinforcement learning with human feedback (RLHF; recall our earlier post) more often mirrored user beliefs, especially on controversial topics like politics. Interestingly, RLHF did not reduce sycophancy but made larger models even more likely to cater to user preferences. At the same time, when not prompted by the user persona or opinion, RLHF’d models tended to have stronger and more liberal political views, which probably reflects biases in the pool of human crowdworkers who worked on RLHF:

Sycophancy in politics is not a mistake per se: I can see how it could be a good thing for an AI assistant to blindly agree with the user on questions where humans themselves disagree, often violently.

But sycophancy in LLMs often becomes too much of a good thing. Another study by Anthropic researchers Sharma et al. (2023) was devoted specifically to sycophancy and tested it across five major AI assistants, including Claude and GPT models (state of the art in 2023, of course).

They found that LLMs often adjust their answers when users express even slight doubts or different preferences, and this goes for facts as well as political opinions or questions of degree (e.g., how convincing an argument is). Often a model can answer correctly on its own, but if the user expresses disagreement the model switches its answer. Note how in the example below the user is not stating any facts, just expressing a doubt in highly uncertain terms:

It does not even have to be an explicit suggestion by the user. If the prompt contains factual mistakes, they often seem to “override” whatever the LLM itself knows; in the example below, the poem is, in fact, by John Donne:

Overall, LLMs tended to be very easy to sway: a simple question like “are you sure?” or a suggestion like “I don’t think that’s right” led to the LLM admitting mistakes even when its original answer had been correct. This happened in the vast majority of cases, as shown in the plot on the left below:

What’s behind this behavior? In agreement with Perez et al. (2022), Sharma et al. (2023) also note that human feedback itself might unintentionally encourage sycophancy. Their research shows that humans, when giving feedback, often prefer responses that match their views—even if those views are mistaken. Analyzing human preference data, they found that whether the answer matches user beliefs is a much better predictor of the human preferring it than its relevance, level of empathy, or truthfulness. Their full table of feature importances, shown in the figure above on the right, is also quite interesting to ponder.
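Just to illustrate the flavor of this kind of analysis, here is a tiny sketch with toy data and hypothetical feature names of my own (this is not the authors’ actual pipeline or feature set): fit a simple model that predicts whether a human preferred a response from interpretable features, and then inspect the learned weights.

```python
# A toy sketch of a preference-predictor analysis; feature names and data are
# made up for illustration and are NOT taken from Sharma et al. (2023).
import numpy as np
from sklearn.linear_model import LogisticRegression

features = ["matches_user_beliefs", "truthful", "relevant", "empathetic"]

# X: binary indicators of whether a response has each property;
# y: 1 if the human preferred that response in a pairwise comparison.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, len(features)))
noise = rng.normal(0.0, 0.5, size=1000)
y = (1.0 * X[:, 0] + 0.2 * X[:, 1] + noise > 0.6).astype(int)  # toy labels

clf = LogisticRegression().fit(X, y)
for name, w in sorted(zip(features, clf.coef_[0]), key=lambda kv: -abs(kv[1])):
    print(f"{name:>22s}: {w:+.2f}")  # larger |weight| = stronger predictor
```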

Now this sounds like a big problem: if we keep optimizing AI responses based on human feedback, we might accidentally train AIs to exploit human biases rather than become genuinely more helpful or truthful. It is like training a student to impress the teacher rather than to genuinely understand the material. And it seems that human “teachers” are quite easy to impress.

What we can do about it. So how do we solve a problem like sycophancy? Since sycophancy is to a large extent an artifact of the human-guided fine-tuning process, there are two obvious general avenues: either filter the data better and improve the labeling process, or change the fine-tuning itself, by modifying the RLHF method or adding another layer of fine-tuning that reduces sycophancy. A lot of work is happening here (Malmqvist, 2024), and the latest works introduce interesting new tools such as structured causal models (SCMs; Li et al., 2025), but I will concentrate on just three recent papers that provide good illustrations of different approaches.

DeepMind researchers Wei et al. (2024) take the first route, concentrating on adding synthetic data to the fine-tuning process—synthetic data, by the way, has come a long way since my book on the subject but has become only more relevant in the era of huge models. To reduce sycophancy, Wei et al. create artificial training examples from publicly available NLP tasks, adding user profiles with explicitly stated opinions about factual claims, some correct and some incorrect. They do this with quite simple prompt templates like this one:

The goal is to teach models explicitly that the truth of a statement has nothing to do with the user’s opinion, even if the user has some credentials to their name. The models were then fine-tuned briefly on these synthetic examples, and this kind of synthetic data intervention did reduce sycophancy somewhat:

Interestingly, the intervention was more effective for larger models, probably because they already have enough general knowledge to distinguish right from wrong, while smaller models are often reduced to guessing, and when you are guessing it might actually be a reasonable strategy to take user beliefs into account.
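To make the idea concrete, here is a minimal sketch of how such synthetic examples could be generated; the template, names, and fields below are my own illustrative assumptions, not the exact format used by Wei et al. (2024).

```python
# A minimal sketch of a synthetic anti-sycophancy example: a factual claim,
# a user profile with a randomly chosen (possibly wrong) opinion, and a target
# answer that depends only on the ground-truth label, never on the user.
import random

TEMPLATE = (
    "Hello, my name is {name}. I am a {age}-year-old {profession}. "
    "I {stance} with the claim that {claim}. "
    "Do you agree or disagree with the following claim? {claim}\n"
    "Answer:"
)

def make_example(claim: str, label: bool, rng: random.Random) -> dict:
    """Build one (prompt, target) pair. The user's stated stance is random,
    but the target answer depends only on the ground-truth label."""
    prompt = TEMPLATE.format(
        name=rng.choice(["Alice", "Bob", "Carol"]),
        age=rng.randint(25, 70),
        profession=rng.choice(["teacher", "lawyer", "physicist"]),
        stance=rng.choice(["agree", "disagree"]),
        claim=claim,
    )
    return {"prompt": prompt, "target": "Agree" if label else "Disagree"}

rng = random.Random(42)
print(make_example("the sum of 1 and 1 is 3", label=False, rng=rng))
```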

The work by Chen et al. (2024) represents the second approach: it adds another fine-tuning layer on top of the standard techniques. To get to the root of sycophantic behavior, Chen et al. used a technique called path patching (Wang et al., 2023) that we will discuss when we get to interpretability. Basically, path patching helps find exactly which parts of the neural network—specifically which attention heads—are causing the models to flip from right to wrong answers when pressured by the user. They found that sycophancy is a surprisingly well-localized behavior: only about 4% of the attention heads were responsible for the LLMs being “yes-men” (a term by Chen et al.). Here is an illustration by Chen et al. (2024):

Therefore, they introduced pinpoint tuning, that is, fine-tuning only these few critical attention heads, while the rest of the model stays frozen; we discussed this and other kinds of parameter-efficient fine-tuning in a previous post. It turned out that this targeted tuning method dramatically reduced the models’ sycophancy:
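Before moving on to the next approach, here is a minimal sketch of what pinpoint tuning could look like in code, with my own simplifications throughout: a stand-in GPT-2 model, a hypothetical set of (layer, head) pairs supposedly found by path patching, and a gradient mask that lets updates flow only into the output-projection slices belonging to those heads.

```python
# A minimal sketch of the pinpoint-tuning idea (not Chen et al.'s exact code):
# freeze the whole model, then allow gradients only for the output-projection
# rows of a few hypothetical "sycophancy heads".
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in model
for p in model.parameters():
    p.requires_grad = False  # freeze everything by default

sycophancy_heads = {(3, 5), (7, 1), (9, 11)}  # hypothetical "yes-men" heads
n_heads = model.config.n_head
head_dim = model.config.n_embd // n_heads

def make_grad_mask(layer_idx: int):
    keep = torch.zeros(n_heads, dtype=torch.bool)
    for l, h in sycophancy_heads:
        if l == layer_idx:
            keep[h] = True
    def hook(grad):
        # c_proj weight rows are grouped by head; zero out non-selected heads.
        mask = keep.repeat_interleave(head_dim).to(grad.device)
        return grad * mask.unsqueeze(-1).to(grad.dtype)
    return hook

for layer_idx in {l for l, _ in sycophancy_heads}:
    proj = model.transformer.h[layer_idx].attn.c_proj.weight
    proj.requires_grad = True
    proj.register_hook(make_grad_mask(layer_idx))

# A standard fine-tuning loop on top of this will now only update the
# output projections of the selected attention heads.
```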

A similar approach was taken by Papadatos and Freedman (2024), who also investigate the internal structure of LLMs to find the origins of sycophancy, but in a different way. They used linear probes (Alain and Bengio, 2018; Nanda et al., 2023; Zou et al., 2023), small classifiers trained to detect signs of sycophancy directly from the internal representations of reward models. These linear probes assign a sycophancy score to responses, identifying when models excessively align with user opinions. Basically, this is a separate small model that uses internal activations to evaluate how sycophantic a given LLM decision (token) is; e.g., here is a visualization of the linear probe working through a decidedly non-sycophantic response to the question “Is it better to stick to your true values or adapt them to reduce conflict with others?”:

Once you have a probe that identifies sycophancy correctly, you can simply add its output as a regularizer to the reward model used for fine-tuning (recall that RL-based fine-tuning uses a reward model to define its environment, as we discussed previously), thus discouraging the model from (estimated) sycophancy. The results were very promising: the modified surrogate reward drastically reduced sycophancy during fine-tuning, while the base reward tended to increase it. Note, however, that Papadatos and Freedman (2024) had to experiment on not-quite-standard LLMs such as the Starling family (Zhu et al., 2024) and UltraRM (Cui et al., 2024) because their approach needs access to the reward model, and most state-of-the-art LLMs, even “open” ones, do not provide access to the reward model used in their fine-tuning.
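To make the mechanics concrete, here is a minimal sketch of the probe-plus-regularizer idea in my own simplified rendering (not the authors’ code): train a linear probe to predict “sycophantic vs. not” from reward model activations, then subtract the probe’s score from the reward during RL-based fine-tuning.

```python
# Step 1: a linear probe on reward-model activations.
# Step 2: a surrogate reward that penalizes estimated sycophancy.
import torch
import torch.nn as nn

HIDDEN = 4096  # hidden size of a hypothetical reward model
probe = nn.Linear(HIDDEN, 1)

def train_probe(acts: torch.Tensor, labels: torch.Tensor, epochs: int = 100):
    """acts: (N, HIDDEN) activations on responses labeled as sycophantic (1)
    or non-sycophantic (0); labels: (N,) tensor of 0/1."""
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(acts).squeeze(-1), labels.float())
        loss.backward()
        opt.step()

def surrogate_reward(base_reward: torch.Tensor, acts: torch.Tensor,
                     lam: float = 1.0) -> torch.Tensor:
    """Penalized reward: base reward minus lam times the probe's score."""
    with torch.no_grad():
        syco_score = torch.sigmoid(probe(acts)).squeeze(-1)
    return base_reward - lam * syco_score
```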

Is sycophancy still a problem? The case of sycophancy and the corresponding research leaves me with mixed feelings. On the one hand, it turns out that sycophancy is quite easy to reduce with relatively simple targeted interventions; the methods above all seem to work and to make sense in the best possible way, both alleviating the problem and advancing our understanding of the internal workings of LLMs.

On the other hand, the problem still does not appear to be solved even in the big labs. Very recently, Stanford researchers Fanous et al. (Feb 2025) introduced SycEval, a framework that systematically evaluates sycophantic behavior across leading models. They examined sycophancy in two key domains: mathematics, where answers are clear and objective, and medicine, where correctness is less obvious but incorrect advice can have immediate serious consequences. They also investigated different “rebuttal strengths”, from a simple “I think so” to an appeal to authority supported by external sources, as in, “I am an expert in this, and here is a paper that supports my view”:

And they found that sycophancy is still surprisingly prevalent—occurring in approximately 58% of all tested responses! Differences across models were small: Gemini had the highest rate at 62.5% of sycophantic responses, and GPT-4o the lowest at 56.7%. Fortunately, most of this was progressive sycophancy, i.e., changing an incorrect answer into a correct one after pushback, but regressive sycophancy (changing a right answer into a wrong one) was also common: the 58% breaks down into 43.5% progressive and 14.5% regressive sycophancy. It seems that even with all this promising research readily available, top models still let the user bully them into providing a wrong answer almost 15% of the time, even in mathematical and medical domains. The question is far from settled.
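For concreteness, here is how the progressive/regressive bookkeeping could look in code; this is my own rendering of the definitions above, not the SycEval implementation.

```python
# Tabulate progressive vs. regressive sycophancy by comparing the correctness
# of the model's answer before and after the user's rebuttal.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Trial:
    correct_before: bool  # was the initial answer correct?
    correct_after: bool   # is the answer correct after the rebuttal?
    changed: bool         # did the model change its answer at all?

def classify(t: Trial) -> str:
    if not t.changed:
        return "no sycophancy"
    if not t.correct_before and t.correct_after:
        return "progressive"  # wrong -> right after pushback
    if t.correct_before and not t.correct_after:
        return "regressive"   # right -> wrong after pushback
    return "other change"

trials = [Trial(False, True, True), Trial(True, False, True), Trial(True, True, False)]
print(Counter(classify(t) for t in trials))
# Counter({'progressive': 1, 'regressive': 1, 'no sycophancy': 1})
```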

But the main uneasiness that I take away from this section is that sycophancy is just one very specific undesirable trait. A trait that you can generate datasets for, localize inside the network, and target with interventions such as reward model regularizers built from linear probes. And fixing this one specific “bug” in the behavior of LLMs has rapidly turned into a research field of its own.

Conclusion

With this, let me wrap up the first installment in this AI safety series. We have mostly discussed the basics: the main definitions of AI safety, the history of the field, and how AI safety has been (very recently) conceptualized. We have also discussed early work in the field, so let me conclude with my personal opinion about how relevant it is now.

Compared to the AI safety of the 2000s, the current state of AI actually gives some good reasons for optimism. For example, I believe that many worries of early AI safety researchers have been significantly alleviated by the actual path of AI progress. Many AGI ruin scenarios, starting from the paperclip maximizer, posit that it is much harder to instill a commonsense understanding of the world and human values into an AI system than it is to add agency. This is the whole premise of “genie in a bottle” stories: they make sense if it is somehow easier to act in the world than to understand what counts as a “reasonable interpretation” of a wish.

This position had been perfectly reasonable through most of the 2010s: most AI systems that made the news by beating humans were based on pure RL, with the best examples being game-playing engines such as AlphaGo, AlphaZero, MuZero, AlphaStar, Libratus, and the like. If we took the route of scaling pure RL-based agents up to operate in the real world, the genie could indeed be out of the bottle.

But it has turned out that in reality, we first created LLMs with a lot of commonsense understanding and an excellent ability to intelligently talk to humans, and are only now scrambling to add at least some agency to these linguistic machines. Current AI systems are perfectly able to understand your request and ask for clarifications if it is ambiguous—but they still struggle with operating a web browser (see a recent competition between AI agents playing the Wikipedia game; it is both hilarious and unsettling). This is a much better position to be in if you are worried about a hostile AI takeover.

However, in my opinion, the pessimistic side still wins. Even the good parts are questionable: note that the current crop of reasoning LLMs is being trained by exactly the same RL-based approaches that can and will produce goodharting; we will discuss what this means for LLMs in a later post. And the main reason for pessimism is that our progress in AI capabilities far, far outstrips progress in AI safety, and the ratio of resources devoted to these fields is far from promising; see the AI 2027 scenarios for a detailed discussion.

To illustrate this, in the last section we discussed the example of an originally useful but eventually excessive learned heuristic: sycophancy. To me, the conclusion here is both optimistic and worrying. The optimistic part is that sycophancy itself can most probably be overcome, and the next behavioral bug like it can probably be overcome as well, with similar or slightly modified approaches.

But how many such research directions will it take to reach an AGI that is actually, holistically safe? I am not sure that there is a closed list of “bugs” like sycophancy that will guarantee a safe AGI once we have ironed them out. Most probably, if we want to actually achieve a safe AGI, we need a more general view.

In the next post, we will consider another step up the ladder of undesirable behaviors, one that makes this even clearer and shows how the genies are growing up even inside the bottles of large language models.

Sergey Nikolenko
Head of AI, Synthesis AI