
AI Safety II: Goodharting and Reward Hacking
In this post, the second in the series (after “Concepts and Definitions”), we embark on a comprehensive exploration of Goodhart’s law: how optimization processes can undermine their intended goals by optimizing proxy metrics. Goodharting lies at the heart of what is so terrifying about making AGI, so this is a key topic for AI safety. Starting with the classic taxonomy of regressional, extremal, causal, and adversarial goodharting, we trace these patterns from simple mathematical models and toy RL environments to the behaviour of state-of-the-art reasoning LLMs, showing how goodharting manifests in modern machine learning through shortcut learning, reward hacking, goal misgeneralization, and even reward tampering, with striking examples from current RL agents and LLMs.
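To make the core failure mode concrete, here is a minimal toy sketch (my illustration, not taken from the post): the proxy is the true objective plus independent noise, and the harder we optimize the proxy over a pool of candidates, the larger the gap between the proxy score and the true value of the selected candidate, which is regressional goodharting in miniature.

    import numpy as np

    rng = np.random.default_rng(0)

    def goodhart_gap(n_candidates, n_trials=2000):
        """Average true and proxy value of the proxy-best candidate."""
        true_vals = rng.normal(size=(n_trials, n_candidates))      # true objective V
        proxy_vals = true_vals + rng.normal(size=true_vals.shape)  # proxy U = V + noise
        best = proxy_vals.argmax(axis=1)                           # optimize the proxy
        idx = np.arange(n_trials)
        return true_vals[idx, best].mean(), proxy_vals[idx, best].mean()

    for n in (1, 10, 100, 1000):
        true_v, proxy_v = goodhart_gap(n)
        print(f"candidates={n:5d}  proxy={proxy_v:5.2f}  true={true_v:5.2f}  gap={proxy_v - true_v:5.2f}")

With a single candidate the proxy is an unbiased estimate of the true value, but in this toy setup, with equal variances, the candidate selected out of a thousand owes on average about half of its apparent proxy score to noise rather than to genuine quality.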
In October 2023, I wrote a long post on the dangers of AGI and why we as humanity might not be ready for the upcoming AGI revolution. A year and a half is an eternity in current AI timelines—so what is the current state of the field? Are we still worried about AGI? Instead of talking about how perception of the risks has shifted over the last year (it has not, not that much, and most recent scenarios such as AI 2027 still warn about loss of control and existential risks), today we begin to review the positive side of this question: the emerging research fields of AI safety and AI alignment. This is still a very young field, and a field much smaller than it should be. Most research questions are wide open or not even well-defined yet, so if you are an AI researcher, please take this series as an invitation to dive in!
Today, I want to discuss two recently developed AI systems that can help with one of the holy grails of AI: doing research automatically. Google’s AI Co-Scientist appears to be a tireless research partner that can read thousands of papers overnight and brainstorm ideas with you… actually, it can brainstorm ideas internally and give you only the best of the best. Sakana AI’s AI Scientist-v2 doesn’t need you at all: it writes new papers from scratch, and its papers are getting accepted to some very good venues. To put these novelties in context, I also want to discuss where current AI models are, creatively speaking, and what this question means, exactly.
Some of the most important AI advances in 2024 were definitely test-time reasoning LLMs, or large reasoning models (LRMs), that is, LLMs trained to write down their chains of thought and reuse them for future reference. Reasoning LLMs started with the o1 family of models by OpenAI (I wrote a short capabilities post in September, when it appeared). Since then, they have opened up a new scaling paradigm for test-time compute and significantly advanced areas such as mathematical reasoning and programming; OpenAI is already boasting its new o3 family, but we still don’t have a definitive source on how OpenAI’s models work. In this post, we discuss how attempts to replicate o1 have progressed to date, including the current state-of-the-art open model, DeepSeek R1, which seems to be a worthy rival even for OpenAI’s offerings.
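As a minimal illustration of the test-time compute idea (and only an illustration: this is the simple self-consistency recipe, not how o1 or R1 are actually trained), one can sample several chains of thought and take a majority vote over the final answers; sample_chain_of_thought below is a hypothetical stub standing in for any stochastic LLM call.

    from collections import Counter
    from typing import Callable

    def self_consistency(prompt: str,
                         sample_chain_of_thought: Callable[[str], str],
                         n_samples: int = 16) -> str:
        """Sample several reasoning chains and majority-vote over the final answers."""
        answers = [sample_chain_of_thought(prompt) for _ in range(n_samples)]
        return Counter(answers).most_common(1)[0][0]

    # Toy usage with a dummy "LLM" that answers 7 most of the time and 8 otherwise:
    import random
    print(self_consistency("What is 3 + 4?", lambda p: random.choice("7778"), 8))

The point of the sketch is only that spending more samples (more test-time compute) buys a more reliable final answer.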
We interrupt your regularly scheduled programming to discuss a paper released on New Year’s Eve: on December 31, 2024, Google researchers Ali Behrouz et al. published a paper called “Titans: Learning to Memorize at Test Time”. It is already receiving a lot of attention, with some reviewers calling it the next big thing after Transformers. Since we have already discussed many different approaches to extending the context size in LLMs, in this post we can gain a deeper understanding of Titans by putting it in a wider context. Also, there are surprisingly many neurobiological analogies here…
It is time to discuss some applications. Today, I begin with using LLMs for programming. There is at least one important aspect of programming that makes it easier than writing texts: source code is formal, and you can design tests that cover most of the requirements in a direct, binary pass/fail way. So today, we begin with evaluation datasets and metrics and then proceed to fine-tuning approaches for programming: RL-based fine-tuning, instruction tuning, and others. Next, we will discuss LLM-based agents for code and a couple of practical examples of open LLMs for coding, and then I will conclude with a discussion of where we are right now and what the future may hold.
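As an example of such binary pass/fail evaluation, the standard unbiased pass@k estimator (popularized by OpenAI’s HumanEval evaluation) measures the probability that at least one of k sampled programs passes the tests, given n samples per problem of which c passed; a short sketch:

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased estimate of pass@k: n samples per task, c of them correct."""
        if n - c < k:
            return 1.0
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

    # Example: 200 samples per problem, 37 of them pass the tests.
    print(pass_at_k(200, 37, 1))   # 0.185
    print(pass_at_k(200, 37, 10))  # roughly 0.88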
We have already discussed how to extend the context size for modern Transformer architectures, but today we explore a different direction of this research. In the quest to handle longer sequences and larger datasets, Transformers are turning back to the classics: the memory mechanisms of RNNs, associative memory, and even continuous dynamical systems. From linear attention to Mamba, modern models are blending old and new ideas to bring forth a new paradigm of sequence modeling, and this paradigm is exactly what we discuss today.
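To make the RNN connection concrete, here is a minimal sketch (illustrative only) of linear attention in its recurrent form: instead of recomputing attention over the whole prefix, the model keeps a running memory matrix of key-value outer products and a running normalizer, so each step costs the same regardless of sequence length; the elu+1 feature map below is one common choice, used here as an assumption.

    import numpy as np

    def phi(x):
        """elu(x) + 1 feature map, one common choice for linear attention."""
        return np.where(x > 0, x + 1.0, np.exp(x))

    def linear_attention(Q, K, V, eps=1e-6):
        """Q, K, V: arrays of shape (seq_len, d). Returns outputs of shape (seq_len, d)."""
        d = Q.shape[1]
        S = np.zeros((d, d))   # running sum of outer products phi(k_t) v_t^T
        z = np.zeros(d)        # running sum of phi(k_t), used for normalization
        outputs = []
        for q, k, v in zip(phi(Q), phi(K), V):
            S += np.outer(k, v)
            z += k
            outputs.append(S.T @ q / (z @ q + eps))   # read the memory with the query
        return np.stack(outputs)

    out = linear_attention(*np.random.default_rng(1).normal(size=(3, 8, 4)))
    print(out.shape)  # (8, 4)

The state (S, z) plays exactly the role of an RNN hidden state: a fixed-size memory updated token by token.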
Although deep learning is a very new branch of computer science, the foundations of neural networks have been in place since the 1950s: we have been training directed graphs composed of artificial neurons (perceptrons), and each individual neuron has always looked like a linear combination of inputs followed by a nonlinear function like ReLU. In April 2024, a new paradigm emerged: Kolmogorov-Arnold networks (KANs) work on a different theoretical basis and promise not only a better fit for the data but also much improved interpretability and an ability to cross over to symbolic discoveries. In this post, we discuss this new paradigm, what the main differences are, and where KANs can get us right now.
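To see the contrast in a few lines of code: an MLP unit applies a fixed nonlinearity to a learned linear combination of inputs, while a KAN-style unit sums a separate learnable univariate function over every incoming edge. In the sketch below each edge function is a small polynomial purely for brevity; the actual KAN paper parameterizes edge functions with B-splines.

    import numpy as np

    def mlp_unit(x, w, b):
        """Classic neuron: relu(w . x + b), with a fixed nonlinearity on the node."""
        return np.maximum(0.0, w @ x + b)

    def kan_unit(x, coeffs):
        """KAN-style unit: x has shape (d,), coeffs has shape (d, degree + 1).
        Each row of coeffs defines a learnable univariate function on one edge;
        the node simply sums phi_i(x_i) over all incoming edges."""
        powers = np.vander(x, N=coeffs.shape[1], increasing=True)  # (d, degree + 1)
        return float(np.sum(powers * coeffs))

    x = np.array([0.5, -1.0, 2.0])
    print(mlp_unit(x, np.array([0.1, 0.2, 0.3]), 0.05))
    print(kan_unit(x, np.random.default_rng(2).normal(size=(3, 4))))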
OpenAI’s o1-preview has been all the buzz lately. While this model is based on the GPT-4o general architecture, it boasts much improved reasoning capabilities: it can ponder a question for about a minute, reason through multiple possibilities, and arrive at solutions that could not be generated in a single try by GPT-4o. In this post, I discuss the o1-preview model but mainly present the most striking advantage of o1-preview over all previous LLMs: it can meaningfully answer questions from a quiz game called “What? Where? When?”. At this point, this probably does not sound all that exciting compared to winning math competitions and answering PhD-level questions on science, but let me elaborate.
We continue our series on LLMs and various ways to make them better. We have already discussed ways to increase the context size, world models that arise in LLMs and other generative models, and LLM fine-tuning including RLHF, LoRA, and more. Today we consider another key idea that can make LLMs far more effective and useful in practice: retrieval-augmented generation, or RAG. We discuss the basic idea of RAG, its recursive agentic extensions, the R[e]ALM approach that integrates retrieval into LM training, and some key problems of modern RAG approaches; we then look in detail at knowledge graphs and how they are used in RAG, and conclude with a reminder that even simple approaches can work well and a list of directions for future work.
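The basic RAG loop itself fits in a few lines: embed the query, retrieve the most similar documents from a vector index, and prepend them to the prompt before generation. In the sketch below, embed and generate are hypothetical stubs standing in for an embedding model and an LLM.

    import numpy as np
    from typing import Callable, Sequence

    def rag_answer(question: str,
                   docs: Sequence[str],
                   embed: Callable[[str], np.ndarray],
                   generate: Callable[[str], str],
                   top_k: int = 3) -> str:
        """Retrieve the top_k documents by cosine similarity and answer from them."""
        doc_vecs = np.stack([embed(d) for d in docs])
        doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
        q = embed(question)
        q /= np.linalg.norm(q)
        top = np.argsort(doc_vecs @ q)[::-1][:top_k]      # cosine similarity ranking
        context = "\n\n".join(docs[i] for i in top)
        prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
        return generate(prompt)

Any real retriever (BM25, a vector database) and any LLM API could be slotted into the two stubs; the point is only to show where retrieval enters the generation pipeline.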