In this post, we delve into the transformative ideas presented by OpenAI agent researcher Yao Shunyu in his talks at Stanford's CS 224N course and at Columbia University.
We Stand at the Midpoint of AI
For decades, AI has been primarily about developing new training methods and models. This trajectory has proven effective: from defeating world champions at chess and Go, to outperforming most humans on the SAT and the bar exam, to winning gold medals at the IMO (International Mathematical Olympiad) and the IOI (International Olympiad in Informatics). Milestones like Deep Blue, AlphaGo, GPT-4, and the O series were all born from underlying innovations in training methods, including search, deep RL, scaling, and reasoning. Progress has been steady over time.
So what has changed now? In simple terms, reinforcement learning (RL) has finally become effective. More specifically, RL has finally learned to generalize. After several detours and many significant milestones, we have found the right recipe for tackling a wide array of RL tasks through language and reasoning. Just a year ago, if you had told most AI researchers that a single recipe could simultaneously handle software engineering, creative writing, IMO-level mathematics, mouse-and-keyboard operation, and long-form question answering, they would have dismissed it as a fantasy. Each of these tasks is incredibly hard; many researchers spend their entire PhDs on just one sub-field. Yet today, it has happened.
What comes next? The second half of AI, starting now, will shift focus from solving problems to defining problems. In this new phase, evaluation (assessing models) becomes more important than training. We are no longer merely asking, "Can we train a model to solve X?" but rather, "What exactly should we train a model to do, and how do we measure real progress?" To succeed in the second half, we must adjust our mindsets and skills in time, perhaps adopting more of a product manager's approach.
Understanding the First Half of AI
To grasp the significance of the first half of AI, it helps to look at its winners. Consider this question: which AI papers do you believe have been the most influential so far? I posed this question in Stanford's CS 224N class, and the answers were predictable: the Transformer, AlexNet, GPT-3, and so on. What these papers have in common is that their foundational breakthroughs lay in training stronger models, and they demonstrated significant improvements on established benchmarks, which is what made them publishable.
Note: CS 224N is Stanford's publicly accessible course on deep learning for NLP, widely regarded as one of the best introductory NLP courses of the past decade. It is taught by Professor Chris Manning, a recognized pioneer in natural language processing and machine learning.
These illustrious papers share another potential trait: they are overwhelmingly centered on training methods or models, not on benchmarks or tasks. Even ImageNet, often regarded as the most influential benchmark dataset, is cited less than one-third as frequently as AlexNet. This discrepancy is even more pronounced in other cases. For instance, the primary benchmark used for Transformer is WMT’14, which has been cited roughly 1,300 times, whereas the Transformer paper itself has surpassed 160,000 citations.
These comparisons vividly illustrate that the first half of AI focused on building new models and training methods, whereas evaluation and benchmarks took a backseat, despite being necessary for academic publication. Why does this phenomenon occur? An important reason is that, during the first half of AI, developing training methods was more challenging and exciting than defining tasks. Inventing an entirely new algorithm or model architecture from scratch—such as the backpropagation algorithm, convolutional neural networks (AlexNet), or the Transformer utilized in GPT-3—requires tremendous insight and engineering capability. In contrast, defining tasks often seems more straightforward: it is merely transforming tasks already performed by humans, such as translation, image recognition, or playing chess, into benchmarks—this process normally requires little insight or engineering effort.
Training methods are often more general and more widely applicable than any specific task, which makes them seem extraordinarily valuable. The Transformer architecture, for example, ended up driving progress across computer vision (CV), natural language processing (NLP), and reinforcement learning (RL), reaching far beyond the single dataset (WMT'14) it was originally validated on. Outstanding new training methods tend to perform well across many benchmarks precisely because they are simple and general, so their impact extends beyond any one task. For decades, innovation in training methods has produced world-changing ideas and breakthroughs, reflected in steadily improving benchmark scores across fields.
So, why is this dynamic changing now? Because the accumulation of these ideas and breakthroughs has produced a qualitative change in how we tackle tasks, yielding a truly effective recipe.
The Effective Recipe for AI
What is this recipe? Its key components are unsurprising: large-scale language pre-training, scaling of data and compute, and the ideas of reasoning and acting. These may sound like today's buzzwords, so why call them a recipe? We can look at it from the perspective of reinforcement learning. RL is often regarded as the "ultimate form" of AI: in theory, RL is guaranteed to eventually win at games, and in practice it is hard to imagine a superhuman system (such as AlphaGo) that does not rely on RL.
In game theory, a game refers to any task within a closed environment that has clear win-or-lose outcomes. The RL field comprises three critical components: algorithms, environments, and priors. For a long time, RL researchers primarily focused on algorithms—such as REINFORCE, DQN, TD-learning, actor-critic, PPO, and TRPO—which detail the core mechanism of how agents learn.
Environments and priors, by contrast, were usually treated as given, or simplified as much as possible. Sutton and Barto's classic textbook, for instance, is almost entirely about algorithms and barely touches on environments or priors.
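To make this division of labor concrete, here is a minimal sketch of tabular Q-learning in Python, assuming a hypothetical toy environment with `reset()`, `step()`, and a list of `actions` (illustrative interfaces, not any particular library's API). Nearly all of the classical textbook's attention goes to the update rule at the core of this loop; the environment is handed in as a given black box, and the value table starts from zeros, that is, with no prior knowledge at all.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning against a toy environment (hypothetical interface:
    env.reset() -> state, env.step(action) -> (next_state, reward, done),
    env.actions -> list of actions)."""
    Q = defaultdict(float)  # no prior: every state-action value starts at zero

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])

            next_state, reward, done = env.step(action)  # the environment: a given black box

            # the update rule, which is what classical RL research mostly cares about
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```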
However, in the era of deep reinforcement learning, the significance of the environment has become apparent in practice: the effectiveness of an algorithm is often highly dependent on the environment in which it is developed and tested. Ignoring the environment may lead to optimal algorithms that only function effectively in overly simplified environments.
So why not first pin down the environment we actually want to solve, and then look for the algorithm best suited to it? That was OpenAI's original plan.
OpenAI first built Gym, a standard interface and collection of RL environments for various games, and then launched World of Bits and Universe in an effort to turn the internet, and the computer itself, into a game. The plan was promising: if every digital realm can be turned into an environment and solved with RL algorithms, we would achieve AGI in the digital world. Gym is an OpenAI toolkit released in April 2016 for developing and comparing RL algorithms; it provides a range of predefined environments so that researchers and developers can test their algorithms against consistent benchmarks. World of Bits and Universe were released in December 2016; the Universe project aims to measure and train an AI's general intelligence across virtually any environment, with the goal of agents that use a computer the way humans do.
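To ground what "a standard RL environment" means in practice, here is roughly what interacting with a Gym environment looks like, using the classic interface in which `step()` returns a four-tuple (newer Gym/Gymnasium releases split `done` into `terminated` and `truncated`). The agent below is just a placeholder that samples random actions.

```python
import gym

# Every Gym environment exposes the same minimal interface: reset() and step().
env = gym.make("CartPole-v1")

obs = env.reset()
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()          # placeholder agent: act at random
    obs, reward, done, info = env.step(action)  # classic 4-tuple return
    total_reward += reward

print("episode return:", total_reward)
```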
While this design received much acclaim, it did not fully succeed. OpenAI made tremendous strides, such as using RL to conquer Dota and to control robotic hands, but it never got adequate traction on computer use or web navigation, and RL agents that excelled in one domain did not transfer to another. Something essential was missing. It was not until GPT-2 and GPT-3 emerged that we realized what was lacking: priors.
You need massive pre-training to distill common sense and linguistic knowledge into a model, and only then fine-tune it into a web agent (WebGPT) or a conversational agent (ChatGPT) that can change the world. It turns out that the most important part of RL may not be the RL algorithm or the environment at all, but the prior knowledge, and the way we obtain that prior knowledge has nothing to do with RL. Large-scale language pre-training provides good priors for conversation, but not for controlling computers or playing video games, because those domains are distributionally far from internet text; applying SFT or RL directly in such domains generalizes poorly.
I ran into this problem in 2019 while building CALM on top of GPT-2 to solve text-based games via SFT and RL: it took an enormous number of RL steps to make progress on a single game, and the result did not transfer to other games. That is typical of RL, yet it struck me as odd, because humans can pick up a new game effortlessly and perform far better zero-shot.
This led to my first epiphany: humans generalize not because we merely choose among concrete actions like "go to cabinet 2," "open box 3 with key 1," or "slay the dungeon monster with a sword," but because we can also think: "The dungeon is dangerous and I need a weapon. There is none in sight, so maybe one is locked inside a cabinet or box; box 3 is inside cabinet 2, so I should go check there first."
Thinking, or reasoning, is a peculiar kind of action: it does not directly change the external world, yet its space is open-ended and infinitely combinable. You can think about a word, a sentence, a paragraph, or a thousand random English words, and the environment around you does not change at all. In classical RL terms, this looks like a bad deal, because it only seems to complicate decision-making.
Consider an example: you must choose one of two boxes, one holding a million dollars and the other empty, so your expected payoff is five hundred thousand dollars. Now add a large number of empty boxes: with one prize among N boxes, the expected value is $1,000,000/N, which drops toward zero as N grows. Adding reasoning to the action space looks like exactly this kind of dilution. Yet by incorporating reasoning into the action space of an RL environment, we can leverage the prior knowledge obtained from language pre-training to generalize, and we can flexibly allocate test-time compute across different decisions.
This is a remarkable story that I plan to elaborate on in future posts. For my perspective on agent reasoning, you can read the ReAct paper I wrote.
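To give a flavor of what putting reasoning into the action space looks like, here is a deliberately simplified ReAct-style loop. This is a hypothetical sketch, not the paper's actual code, with illustrative `llm` and `env` interfaces: the model alternates free-form thoughts, which change nothing in the external world, with concrete actions, which do.

```python
def react_agent(llm, env, task, max_steps=10):
    """Minimal ReAct-style loop (illustrative sketch).
    llm(prompt) -> str is a hypothetical language-model call;
    env.step(action) -> (observation, done) is a hypothetical environment."""
    trajectory = f"Task: {task}\n"
    for _ in range(max_steps):
        # Reasoning: an "action" that does not touch the environment at all
        thought = llm(trajectory + "Thought:")
        trajectory += f"Thought: {thought}\n"

        # Acting: the step that actually changes the external world
        action = llm(trajectory + "Action:")
        observation, done = env.step(action)
        trajectory += f"Action: {action}\nObservation: {observation}\n"
        if done:
            break
    return trajectory
```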
Welcome to the Second Half of AI
This recipe is fundamentally changing the rules of the game. In the first half of AI, the game was:
- We develop novel training methods or models and achieve superior results across various benchmarks.
- We create increasingly difficult benchmarks and continue this cycle.
Today, this game is being thoroughly upended, because:
- The recipe has standardized and industrialized the process of beating benchmarks, with little need for fresh ideas. Because it scales and generalizes so well, a new method designed for a specific task might yield a mere 5% improvement, while the next generation of O series models, even without any targeted training, can deliver a 30% gain.
- Even when we design harder benchmarks, they are quickly (and ever more quickly) solved by the recipe.
My colleague Jason Wei created a compelling graph depicting this trend.
So what should we pursue in the second half of AI? If novel training methods are no longer essential, and harder benchmarks get solved ever faster, how do we proceed? I believe we must fundamentally rethink evaluation: not merely building newer, harder benchmarks, but questioning the assumptions behind existing evaluation setups and designing new ones, which in turn forces us to invent methods that go beyond the current recipe.
However, this is hard, because humans have inertia: we rarely question foundational assumptions; we tend to accept them uncritically, often without realizing they are assumptions rather than laws.
As an illustration of this inertia, suppose you invented one of history's most successful AI evaluation approaches, benchmarks built from human exams. It may have been groundbreaking in 2021, but three years later it is widely adopted and has become a standard. What would you do next? Most likely, design an even harder exam. Or, having solved basic programming tasks, your next move would be to look for harder programming challenges, perhaps at the level of an IOI gold medal.
Inertia is a natural phenomenon; however, therein lies the issue. AI has surpassed world champions in chess and Go, outperformed most humans in SATs and bar exams, and achieved capabilities equivalent to IOI and IMO gold medals, yet, at least from an economic or GDP perspective, the world has not dramatically changed. I term this the “utility problem,” which I consider to be the most pressing issue currently facing the AI field.
We may solve the "utility problem" soon, or we may not. Either way, the root cause may be surprisingly simple: our evaluation setups differ from real-world settings in many basic assumptions.
For example, two key assumptions are:
- Assumption 1: evaluation should run automatically. Typically, the agent receives a task input, completes it autonomously, and then receives a reward. But in reality, an agent often has to keep interacting with a human throughout the task: you wouldn't send one long message to customer service, wait ten minutes, and expect a single detailed reply to resolve everything. Challenging this assumption has produced new benchmarks that either bring real humans into the loop (such as Chatbot Arena) or simulate the user (such as tau-bench); a minimal sketch of such a simulated-user loop follows this list.
- Assumption 2: the tasks being evaluated should be independent and identically distributed (i.i.d.). If you have a test set of five hundred tasks, you run each task independently and average the results into an overall score. But in reality, tasks arrive sequentially rather than in parallel. A Google software engineer becomes steadily faster at solving problems as they grow familiar with the google3 repository, whereas a software-engineering agent solving many issues in the same repository gains no such familiarity.
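As mentioned above, here is a rough, hypothetical sketch of what a simulated-user evaluation might look like (the `agent` and `user_simulator` interfaces are illustrative, loosely in the spirit of setups like tau-bench): the agent is scored on whether the whole multi-turn interaction ends with a satisfied user, not on a single response to a fully specified task.

```python
def evaluate_with_simulated_user(agent, user_simulator, task, max_turns=20):
    """Multi-turn evaluation sketch with hypothetical interfaces:
    agent.respond(history) -> str, user_simulator.reply(history) -> (str, bool)."""
    history = []
    user_msg = user_simulator.opening_message(task)   # the user reveals the task gradually
    for _ in range(max_turns):
        agent_msg = agent.respond(history + [("user", user_msg)])
        history += [("user", user_msg), ("agent", agent_msg)]
        user_msg, satisfied = user_simulator.reply(history)
        if satisfied:          # the reward depends on the whole interaction
            return 1.0
    return 0.0
```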
Regarding the second assumption: we clearly need long-term memory methods (and some attempts have indeed appeared), but academia lacks proper benchmarks that reflect this need, and sometimes even the willingness to question an i.i.d. assumption that has been treated as foundational to machine learning.
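The contrast between the two evaluation regimes is easy to express in code. Below is a hypothetical sketch (the `agent_factory` and `solve` interfaces are illustrative): the i.i.d. version scores a fresh agent on each task and averages, while the sequential version lets a single agent carry whatever memory it accumulates from one task to the next.

```python
def evaluate_iid(agent_factory, tasks):
    """Standard i.i.d. evaluation: a fresh, memoryless agent per task, scores averaged."""
    return sum(agent_factory().solve(task) for task in tasks) / len(tasks)

def evaluate_sequential(agent_factory, tasks):
    """Sequential evaluation: one agent works through the tasks in order and may
    carry memory across them, like an engineer growing familiar with a codebase."""
    agent = agent_factory()
    scores = [agent.solve(task) for task in tasks]   # solve() may update internal memory
    return sum(scores) / len(scores)
```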
These assumptions have always been implicit. In the first half of AI, designing benchmarks on top of them was reasonable, because when intelligence was low, raising intelligence generally raised utility. But now, under these assumptions, the general recipe is virtually guaranteed to work.
Thus, the rules of the game in the second half of AI will be:
- We devise novel evaluation setups or tasks aimed at real-world utility.
- We use the existing recipe to solve them, or augment the recipe with new components, then repeat the cycle.
This game is hard and full of uncertainty, but it is profoundly exciting. Whereas players in the first half solved video games and standardized tests, players in the second half will turn intelligence into useful products and build companies worth billions or even trillions of dollars. The first half was flooded with continually iterating training methods and models; the second half, in a sense, filters them. The universal recipe will simply crush your incremental improvements, unless you can formulate new assumptions that break the recipe; that is where truly game-changing research begins.