Synthesis Blog
The Creativity Scale: Can AI Do Science?

Today, I want to discuss two recently developed AI systems that can help with one of the holy grails of AI: doing research automatically. Google’s AI Co-Scientist appears to be a tireless research partner that can read thousands of papers overnight and brainstorm ideas with you… actually, it can brainstorm ideas internally and give you only the best of the best. Sakana AI’s AI Scientist-v2 doesn’t need you at all: it just writes new papers from scratch, and its papers are getting accepted to some very good venues. To contextualize these novelties, I also want to discuss where current AI models are, creatively speaking—and what this question means, exactly.

Introduction

When we look at the history of AI progress, we see a familiar pattern: capabilities that seem impossible suddenly become routine and cease to amaze us. Translation across natural languages, image generation in a variety of artistic styles, coding assistance—problems that had been considered uniquely human not that long ago—are now routinely performed by AI systems.

And yet scientific discovery—that blend of creativity, intuition, and methodical analysis—has so far remained firmly in the human domain.

The technological singularity—that hypothetical point where artificial intelligence surpasses human intellect and begins to accelerate technological progress beyond our comprehension—has been the subject of endless speculation, starting at least from the works of Irving J. Good (1963). Futurists debate timelines, skeptics dismiss it as science fiction, and AI researchers work diligently on advancing AI capabilities towards this inflection point. But how close are we really to AI systems that can not only assist with scientific discovery but drive it forward independently?

In this post, we will discuss two recent AI developments that make me wonder if we are actually approaching an inflection point in how machines contribute to scientific progress. These developments are built on reasoning models, but they do not introduce new architectures or new optimization algorithms—they are sophisticated scaffolding systems that merely enhance existing LLMs and direct their operation. Still, this kind of scaffolding might already change how science gets done. To put these systems in context, I will also propose a “Creativity Scale” to help us gauge just how far AI has climbed—and how far it still has to go before the singularity really is around the corner.

Google Co-Scientist

The first important piece of news came on February 19, 2025: Google introduced their AI Co-Scientist system (blog post, Gottweis et al., 2025). It is a multi-agent AI system built on Gemini 2.0, designed to collaborate directly with human scientists. The Co-Scientist system is not just a glorified literature summarizer—although survey writing is also increasingly being automated, and perhaps I will talk about it more later; see, e.g., SurveyX by Liang et al. (Feb 20, 2025). Google attempted to create a system that can actively help generate novel scientific hypotheses, refine them through rigorous simulated debates, and ultimately guide scientists towards new discoveries.

Here is how it works: a human researcher sets the scene by providing their research goal in natural language, and the AI Co-Scientist then explores it by exchanging information between six specialized LLM-based agents:

  • the generation agent kicks off the process by exploring existing scientific literature and brainstorming initial hypotheses,
  • the reflection agent acts like a peer reviewer, critically examining each hypothesis for novelty, correctness, and relevance,
  • the ranking agent hosts virtual scientific debates to compare hypotheses against each other in a tournament-style competition, helping prioritize the most promising ideas,
  • the evolution agent continuously refines top-ranked hypotheses, blending the best ideas, incorporating new insights, and simplifying complex concepts, and, finally,
  • proximity and meta-review agents help map out the hypothesis space and generate high-level summaries, making it easier to navigate and explore related research ideas efficiently.

Here is the general workflow of the AI Co-Scientist system:

Note that AI Co-Scientist is self-improving: after generating a set of hypotheses, it can learn from each round of reviews and debates, continuously improving the hypotheses themselves. This iterative loop of generation, review, and refinement requires extensive computational resources because all agents, of course, are implemented with Gemini 2.0, a state-of-the-art reasoning model (see our previous post).
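To make this control flow concrete, here is a minimal Python sketch of such a generate/review/rank/evolve loop. This is emphatically not Google’s implementation: the prompts, the Elo-style tournament scoring, and all function names below are my own illustrative assumptions, and the LLM itself is abstracted away as a plain callable.

```python
# A minimal sketch of a Co-Scientist-style loop, NOT Google's actual code:
# all prompts, agent roles, and the Elo-style tournament below are illustrative
# assumptions, and the LLM is abstracted away as a plain callable.
import random
from dataclasses import dataclass, field
from typing import Callable, List

LLM = Callable[[str], str]  # e.g., a thin wrapper around any chat-completion API

@dataclass
class Hypothesis:
    text: str
    reviews: List[str] = field(default_factory=list)
    elo: float = 1000.0  # tournament score used by the ranking agent

def generate(llm: LLM, goal: str, n: int = 8) -> List[Hypothesis]:
    # Generation agent: brainstorm initial hypotheses for the research goal.
    prompt = f"Research goal: {goal}\nPropose one novel, testable hypothesis."
    return [Hypothesis(llm(prompt)) for _ in range(n)]

def reflect(llm: LLM, h: Hypothesis) -> None:
    # Reflection agent: peer-review a hypothesis for novelty, correctness, relevance.
    h.reviews.append(llm(f"Review this hypothesis critically:\n{h.text}"))

def debate(llm: LLM, a: Hypothesis, b: Hypothesis, k: float = 32.0) -> None:
    # Ranking agent: a pairwise "scientific debate"; the winner gains Elo points.
    verdict = llm(f"Which hypothesis is more promising, A or B?\nA: {a.text}\nB: {b.text}")
    winner, loser = (a, b) if verdict.strip().upper().startswith("A") else (b, a)
    expected = 1.0 / (1.0 + 10 ** ((loser.elo - winner.elo) / 400))
    winner.elo += k * (1.0 - expected)
    loser.elo -= k * (1.0 - expected)

def evolve(llm: LLM, top: List[Hypothesis]) -> Hypothesis:
    # Evolution agent: merge and refine the best-ranked hypotheses.
    merged = "\n".join(h.text for h in top)
    return Hypothesis(llm(f"Combine and refine these hypotheses into a stronger one:\n{merged}"))

def co_scientist(llm: LLM, goal: str, rounds: int = 3) -> List[Hypothesis]:
    pool = generate(llm, goal)
    for _ in range(rounds):
        for h in pool:
            reflect(llm, h)
        for _ in range(2 * len(pool)):  # a round of random pairwise debates
            a, b = random.sample(pool, 2)
            debate(llm, a, b)
        pool.sort(key=lambda h: h.elo, reverse=True)
        pool = pool[: len(pool) // 2] + [evolve(llm, pool[:3])]  # keep the best, add a refined one
    return sorted(pool, key=lambda h: h.elo, reverse=True)
```

The real system layers literature grounding and the proximity and meta-review agents on top of this, but the repeated cycle of generation, review, ranking, and refinement is the core of the design described above.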

Similar kinds of scaffolding have been tried many times, sometimes with impressive results (e.g., Boiko et al., 2023 even claimed the name Coscientist first!). But until now, such systems could be helpful without ever actually producing significant new results. So did Google succeed this time?

It appears they did! Google validated AI Co-Scientist’s potential in real-world biomedical research scenarios, with very interesting results. The most impressive one comes from the team of Dr. Jose Penades, a microbiology professor at Imperial College London. He asked the AI Co-Scientist a question on antimicrobial resistance: what mechanism could turn some bacteria into “superbugs” resistant to known antibiotics? Penades reports that in two days, the system arrived at the same answer that his own team had taken several years to come up with (their research was still unpublished at the time).

The astonished Dr. Penades even explicitly checked with Google in case they somehow had advance access to his team’s research. When Google confirmed they didn’t, Penades went on record to say the following: “This effectively meant that the algorithm was able to look at the available evidence, analyse the possibilities, ask questions, design experiments, and propose the very same hypothesis that we arrived at through years of painstaking scientific research, but in a fraction of the time.”

Naturally, a collection of LLM agents cannot do lab work or clinical trials, but it already sounds like a big deal for science, empowering human scientists and potentially speeding up progress by a lot. Basically, once the AI Co-Scientist and similar systems become ubiquitous, every lab in the world will have access to a virtual intelligent collaborator that may not be Einstein yet but that never tires and can produce reasonably novel ideas for human researchers to filter and implement.

I am not sure if this system is exactly a transformative moment in the history of scientific research—there may be caveats, and early hype often proves to be overstated. But it does look like, with proper scaffolding, the recent crop of reasoning models may push such systems past the threshold of genuine usefulness. Let us look at the second recent example.

AI Scientist-V2

In my talks on AI capabilities, I have been mentioning an interesting case study since August 2024: the AI Scientist system by Sakana AI (Lu et al., 2024). They developed an open-source LLM scaffolding system that can do several things:

  • connect to several APIs, primarily different LLMs and the Semantic Scholar API to download research papers;
  • suggest research ideas and experiments in the field of computer science, specifically machine learning;
  • write the code for these experiments;
  • execute it independently by using the resources of the local computer where the system runs; save model weights and experimental results;
  • write up a description of the results in the form of a research paper.

The output is a research paper generated fully automatically, end to end. The papers produced in August 2024 were rather mediocre—they certainly would not have been accepted at top conferences. But if someone had sent me a paper like that for review as, e.g., a Master’s thesis, I would not have been surprised and would not have seen any reason to give it a low grade: the provided examples were quite competent, just very incremental and uninspired. To me, it was a great example of how simple scaffolding can turn LLMs into end-to-end researchers, but it was still a toy example.
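Conceptually, the whole pipeline is a short loop: propose an idea, write code for it, run the code, and write up what happened. Below is a hedged sketch of that loop. It is not Sakana AI’s actual implementation: every prompt and helper is my own illustrative assumption, and a real system needs proper sandboxing, literature search, novelty checks, and multi-stage reviewing on top of this skeleton.

```python
# A toy sketch of an AI-Scientist-style end-to-end loop, NOT Sakana AI's code:
# prompts and helpers are illustrative assumptions; a real system adds literature
# search, novelty checks, multi-stage review, and much more careful sandboxing.
import subprocess
import tempfile
import textwrap
from pathlib import Path
from typing import Callable

LLM = Callable[[str], str]

def propose_idea(llm: LLM, topic: str) -> str:
    return llm(f"Propose a small, testable machine learning research idea about: {topic}")

def write_experiment(llm: LLM, idea: str) -> str:
    return llm("Write a self-contained Python script that tests this idea "
               f"and prints its metrics as it runs:\n{idea}")

def run_experiment(code: str, timeout_s: int = 600) -> str:
    # Execute the generated script and capture its output; in a real system
    # this step must be sandboxed and resource-limited.
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "experiment.py"
        script.write_text(code)
        result = subprocess.run(["python", str(script)], capture_output=True,
                                text=True, timeout=timeout_s, cwd=tmp)
    return result.stdout + result.stderr

def write_paper(llm: LLM, idea: str, code: str, log: str) -> str:
    return llm(textwrap.dedent(f"""\
        Write a short research paper in LaTeX based on the following.
        Idea: {idea}
        Experiment code: {code}
        Experimental log: {log}
        Report the results honestly, including negative ones."""))

def ai_scientist(llm: LLM, topic: str) -> str:
    idea = propose_idea(llm, topic)
    code = write_experiment(llm, idea)
    log = run_experiment(code)
    return write_paper(llm, idea, code, log)
```

Given any chat-completion wrapper llm of type str -> str, calling ai_scientist(llm, "some topic") would produce a LaTeX draft end to end; the scaffolding itself is trivial, and all the heavy lifting happens inside the LLM calls.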

On March 12, 2025, Sakana AI announced another milestone: a new version of their system, the AI Scientist-v2, successfully authored a research paper that passed peer review at a workshop associated with the International Conference on Learning Representations (ICLR), one of the premier conferences in machine learning (subjectively, I would place it third, after NeurIPS and ICML). The paper, titled “Compositional Regularization: Unexpected Obstacles in Enhancing Neural Network Generalization,” explored limitations in neural network design.

Somewhat ironically, the authors chose a venue devoted to the limits of AI methods rather than their successes: the “I Can’t Believe It’s Not Better: Challenges in Applied Deep Learning” workshop, dedicated to how theory and practice sometimes do not go hand in hand in deep learning.

While we still do not know the full details (a paper by Sakana AI is pending), we know that this work was again generated end-to-end, entirely without human intervention. The AI Scientist-v2 autonomously formulated the research hypothesis, designed experiments, wrote, debugged, and refined code, conducted the experiments, analyzed results, created visualizations, and composed the final manuscript. Human researchers only provided a broad prompt about which research topics to consider (amounting to the scope of the workshop) and then performed the final selection, choosing three out of several generated papers for submission.

The peer review experiment was conducted as follows:

  • the workshop organizers were aware of the experiment and allowed it;
  • the reviewers knew that some submissions might be automatically generated, but they did not know which ones, and the review was done as usual.

Of the three papers submitted by Sakana AI, one received scores of 6, 7, and 6 (apparently out of 10), which would definitely place it above the acceptance threshold; the other two would probably have been rejected.

Naturally, you won’t see this paper in the proceedings: the “authors” withdrew it after the review process.

As soon as the news appeared, I made the following prediction (in my Telegram channel; if you can read Russian, welcome!): the AI skeptics will not update on this news at all. We will hear the following counterarguments:

  • the paper is very much not a breakthrough, it contains incremental results and, incidentally, negative incremental results: it is a writeup of an idea that didn’t work (although there were reasons why it might);
  • this is only a workshop, so the bar is relatively low: it would be much harder to get into the ICLR main conference proceedings;
  • the peer review process is stochastic and sometimes fails spectacularly; there have been nonsensical auto-generated papers accepted to low-rank conferences at least since 2010.

These are all valid comments, and I agree with the first two and also with the high variance of the review process—although, of course, nonsensical papers were accepted because nobody read them, which is clearly not the case here. But to me, this is still a very significant milestone: AI models are starting to contribute to scientific knowledge creation, not just assisting humans in the process.

The Creativity Scale

To summarize this kind of news, I want to get a little philosophical and not very rigorous here. Let us consider processes that can generate new knowledge or new insights into the analysis of existing knowledge, and let me try to rank them according to their “creativity level”. 

What is creativity? This is a philosophical question of definitions, but to me it aligns closely with two things:

  • how hard a given knowledge generation process is to formalize, and
  • how likely it is to produce important new knowledge, i.e., how long we expect the process to take to arrive at a good solution.

These two properties often go hand in hand: it is usually very easy to program a random search but it takes ages, while research geniuses can quickly intuit great solutions to difficult problems but we have no idea how exactly they do it.

While any such scale is inherently subjective, and I’m open to corrections and objections to my rankings, here is what I came up with:

Let me give brief comments about each point on this scale, and then we will come back to the current crop of LLM research scaffolds.

Random search. The obvious baseline: randomly testing options with no guide and no feedback will eventually find interesting solutions if you allow infinite time. But there is no intelligence, no directed creativity, and it will take ages to solve any serious problem.

Gradient descent. Basic gradient descent and its modern improved versions such as Adam or Muon are local hill-climbing (or valley-descending) methods. Gradient descent is obviously “smarter” than random search because it “knows where to go”, but it can easily fall into local optima and requires a well-defined objective function.
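To make the contrast between these two lowest rungs tangible, here is a toy comparison on the same smooth objective; the function, the sample budget, and the learning rate are all arbitrary illustrative choices.

```python
# A toy comparison of the two lowest rungs on the scale: random search vs.
# gradient descent on the same smooth objective f(x, y) = (x - 3)^2 + (y + 1)^2,
# whose unique minimum is at (3, -1). Everything here is purely illustrative.
import random

def f(x: float, y: float) -> float:
    return (x - 3) ** 2 + (y + 1) ** 2

def grad_f(x: float, y: float) -> tuple[float, float]:
    return 2 * (x - 3), 2 * (y + 1)

# Random search: sample points blindly and keep the best one seen so far.
random.seed(0)
best_val, best_xy = float("inf"), (0.0, 0.0)
for _ in range(1000):
    x, y = random.uniform(-10, 10), random.uniform(-10, 10)
    if f(x, y) < best_val:
        best_val, best_xy = f(x, y), (x, y)
print("random search:   ", best_val, best_xy)   # gets close to zero only by luck

# Gradient descent: follow the local slope; converges quickly on this problem.
x, y, lr = 0.0, 0.0, 0.1
for _ in range(100):
    gx, gy = grad_f(x, y)
    x, y = x - lr * gx, y - lr * gy
print("gradient descent:", f(x, y), (x, y))     # essentially (3, -1), the true minimum
```

On a convex bowl like this, gradient descent wins easily; the ranking only becomes interesting when the landscape is rugged and the objective itself is hard to write down.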

Evolution. Natural selection is essentially a massively parallel search process that mixes random variation (mutations and recombination of genes) with selection pressures that act, in a way, like local gradients. I deliberated on how to rank evolution compared to gradient descent: in a way, it is a more random, “more blind” search than GD, and hence very slow. But the highly parallel nature of the process and emergent interactions between species have led to many wonderful solutions such as flight or photosynthesis, so in my subjective ranking evolution still wins.

Neural Architecture Search. NAS automates the design of neural network architectures (choosing layers, connections, etc.) by searching through a predefined space, sometimes using evolutionary strategies, reinforcement learning, or gradient-based search. It is more “creative” than standard gradient descent because it actively tries different “shapes” of networks, but it is still heavily guided by a specific performance metric. In a way, NAS takes the best parts of both GD and evolutionary search, so I put it one notch higher; in deep learning, NAS produced such widely used solutions as the Swish activation function or the EfficientNet family of convolutional architectures.
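In its simplest form, NAS is just a search loop over a discrete space of architectural choices. Here is a hedged sketch of that loop; the search space is made up for illustration, and the proxy_score function is a placeholder where a real NAS system would train each candidate and measure validation accuracy (which is exactly what makes NAS so expensive).

```python
# The simplest possible NAS: random search over a hand-defined architecture space.
# The proxy_score below is a made-up placeholder; a real NAS run would train each
# candidate network and measure its validation accuracy instead.
import random
from typing import Any, Dict

SEARCH_SPACE = {
    "depth":      [2, 4, 8],
    "width":      [64, 128, 256],
    "activation": ["relu", "swish", "gelu"],
    "skip":       [False, True],
}

def proxy_score(arch: Dict[str, Any]) -> float:
    # Placeholder evaluator: pretend deeper, wider nets with skip connections do
    # slightly better, plus noise. Replace with train-and-validate for real NAS.
    base = 0.60 + 0.02 * arch["depth"] + 0.0005 * arch["width"] + 0.05 * arch["skip"]
    return min(base + random.gauss(0.0, 0.01), 0.99)

def random_search_nas(n_trials: int = 20) -> Dict[str, Any]:
    best_arch, best_score = None, float("-inf")
    for _ in range(n_trials):
        arch = {name: random.choice(options) for name, options in SEARCH_SPACE.items()}
        score = proxy_score(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return {"architecture": best_arch, "score": best_score}

print(random_search_nas())
```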

AlphaZero / MuZero. This is a stand-in for pure reinforcement learning in a well-defined environment such as a discrete tabletop game, usually coupled with inference-time search algorithms such as MCTS. We all know that in complex but easy-to-formalize environments such as chess or Go, well-executed RL-based approaches can easily go beyond human level; top human performance in chess barely registers on AlphaZero’s performance chart, as the system pushes right through it and saturates at a much higher level. But this success is limited to very specialized domains and “pure” formalized environments.

Large language models. LLMs can generate surprisingly creative texts and analogies and can reframe problems. They learn from massive text corpora and can combine concepts in novel ways. However, so far they lack deep causal understanding and mostly rely on pattern completion rather than truly original conceptual leaps (but don’t we all? this is an open question to me). “Pure” LLMs have already been helpful for research purposes for some time; e.g., DeepMind’s FunSearch bounds were obtained by evolutionary search done over programs written by an LLM, with feedback on how the programs perform provided back to the LLM’s context. But I do not know of any end-to-end examples of an LLM producing a new research result just from a single prompt yet.
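The FunSearch-style setup mentioned above is easy to sketch as a loop: an LLM proposes candidate programs, a deterministic evaluator scores them, and the highest-scoring programs are fed back into the prompt. The version below only illustrates that protocol and is not DeepMind’s code; the toy objective, the prompts, and the (deliberately naive) use of exec are my own assumptions.

```python
# An illustrative FunSearch-style loop, NOT DeepMind's implementation: the LLM
# proposes programs, a deterministic evaluator scores them, and the best programs
# are fed back into the prompt. The toy objective here is a stand-in; FunSearch
# used real combinatorial objectives such as cap set constructions.
from typing import Callable, List, Tuple

LLM = Callable[[str], str]

def evaluate(program_src: str) -> float:
    # Run the candidate `priority(n)` function and score it on a toy objective.
    # Never exec untrusted code outside a sandbox; this is for illustration only.
    scope: dict = {}
    try:
        exec(program_src, scope)
        return float(sum(scope["priority"](n) for n in range(100)))
    except Exception:
        return float("-inf")  # broken or malformed programs score worst

def funsearch_loop(llm: LLM, iterations: int = 20, keep: int = 4) -> List[Tuple[float, str]]:
    pool: List[Tuple[float, str]] = []
    for _ in range(iterations):
        examples = "\n\n".join(src for _, src in pool[:keep])
        candidate = llm(
            "Improve on these scoring functions. Return only Python code defining "
            f"`def priority(n: int) -> float`:\n\n{examples}"
        )
        pool.append((evaluate(candidate), candidate))
        pool.sort(key=lambda t: t[0], reverse=True)  # best programs stay in the prompt
    return pool
```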

Einstein discovering general relativity. Here I wanted to put the most creative work of science ever done. Since I know very little science outside of math and AI, I’m weaseling out of this decision by referring to Adam Brown’s appearance on Dwarkesh Patel’s podcast. Adam Brown is a theoretical physicist who now leads the BlueShift team at DeepMind, and it is a great interview. I recommend giving it a listen in full, but we need just one snippet—here is how Brown describes general relativity as the purest example of human genius:

I think general relativity is really an extraordinary story. It’s pretty unusual in the history of physics that you, to first approximation, just have one guy who sits down and thinks really, really hard with lots of thought experiments about jumping up and down in elevators and beetles moving on the surface of planets and all the rest of it, and at the end of that time writes down a theory that completely reconceptualizes nature’s most familiar force and also speaks not just to that, but speaks to the origin and fate of the universe and almost immediately achieves decisive experimental confirmation in the orbits of astronomical observations or the orbits of planets and the deflections of lights during eclipses and stuff like that.

With that, we come to the question marks. 

Where do the current frontier LLMs with appropriate scaffolding fall on the scale? A year ago, in February 2024, I wrote a post called “The Unreasonable Ineffectiveness of AI for Math”; the most interesting result at that point was FunSearch (Romera-Paredes et al., 2023), and LLMs were progressing through high school math and on to mid-level high school Olympiad problems.

Half a year ago, in October 2024, Terence Tao famously said about his interaction with the OpenAI o1 model: “The experience seemed roughly on par with trying to advise a mediocre, but not completely incompetent, graduate student”. Note that we are not talking about averaging the capabilities of all Ph.D. students in the world (which would not, in all honesty, be that high a bar for research proficiency); we are talking about Terence Tao’s grad students!

Now, in March 2025, an LLM-based system is able to write papers accepted to serious venues and produce ideas that sound novel and interesting to a veteran researcher whose group works in the field. Are we at the singularity already? No, far from it. But the helpfulness of AI models for science keeps growing—at this point, I am not sure what percentile of graduate students it would fall into.

Conclusion

With these two examples, do we know where current AI models are on the creativity scale? I posit that no, we don’t. Recall the notion of unhobbling introduced by Leopold Aschenbrenner in his “Situational Awareness” essay series:

Finally, the hardest to quantify—but no less important—category of improvements: what I’ll call “unhobbling.”

Imagine if when asked to solve a hard math problem, you had to instantly answer with the very first thing that came to mind. It seems obvious that you would have a hard time, except for the simplest problems. But until recently, that’s how we had LLMs solve math problems. Instead, most of us work through the problem step-by-step on a scratchpad, and are able to solve much more difficult problems that way. “Chain-of-thought” prompting unlocked that for LLMs. Despite excellent raw capabilities, they were much worse at math than they could be because they were hobbled in an obvious way, and it took a small algorithmic tweak to unlock much greater capabilities.

Aschenbrenner goes on to list RLHF, chain-of-thought techniques (now including reasoning models), tool use such as RAG, and other scaffolding as examples. Unhobbling is a key component of AI capability progress in his model, along with pure scaling and algorithmic improvements.

In this model, LLM capabilities are an iceberg with a large underwater part: you can see some capabilities immediately but you have to put in more work to fully discover what a new LLM is capable of. As new, better LLMs appear, the iceberg grows, but even more capabilities are left “underwater” for unhobbling.

My intuition is that even if base model progress stops right here and now, and we never get GPT-5, there are still years upon years of very impressive progress left that can be achieved just by utilizing existing base models in better ways. “Doing science” is one of these capabilities, partly hidden underwater just like all the others. Research is probably the most important one, too, since it builds on itself: better research may lead to new capabilities.

One more fun fact as we are wrapping up. As I was writing this post, I asked ChatGPT o1 pro, a top current LLM from OpenAI, what it thought about the rankings. We mostly agreed on the lower levels of the scale, and o1 pro modestly rated itself as a 6, just as I had put it on the graph.

But then o1 pro went on to rate an average human Ph.D. student as a 7 and Google Co-Scientist as an 8:

An Average Human PhD Student
Level: ~7
Rationale: A PhD student does guided research, forms hypotheses, and tries new ideas within a field. Although domain constraints (like publish-or-perish) can limit true “blue-sky” creativity, a typical human mind can still invent, connect distant ideas, and adapt more fluidly than purely algorithmic approaches that rely on objective functions.

Google Co-Scientist / AI Scientist Systems
Level: ~8
Rationale: These are (currently hypothetical or emerging) integrated frameworks around LLMs that provide scaffolding such as code-writing, hypothesis generation, chain-of-thought reflection, external memory tools, etc. The synergy might enable deeper exploration of ideas than a raw LLM alone. If they can iteratively propose experiments, test them, refine hypotheses, and incorporate feedback, they could push into higher levels of creative problem-solving.

These estimates don’t actually mean anything, of course—but they made me wonder a little.

Let me leave you with one final question. The creativity scale I have proposed ends with Einstein’s best work at the level of 10. But the history of AI teaches us that human level is never an asymptote: once you have a good enough system, it zips right through human level and saturates higher, sometimes much higher. What do you think 11 looks like on this scale? What about 25?..

Sergey Nikolenko
Head of AI, Synthesis AI