In December 2024, OpenAI announced its latest AI “reasoning” models, o3 and o3-mini, building upon the o1 models introduced earlier that year. The company is not releasing them yet but has made the models available for public safety testing and research access. The new o3 model was tested on a benchmark designed to assess “general intelligence” at a level comparable to that of humans.
The models use what OpenAI calls a “private chain of thought”: the model pauses to examine its internal dialogue and plan ahead before responding. This approach is sometimes called “simulated reasoning” (SR), a form of AI that goes beyond basic large language models (LLMs).
According to OpenAI, the o3 model earned a record-breaking score on ARC-AGI, a visual reasoning benchmark that had remained unbeaten since its creation in 2019. In low-compute testing, o3 scored 75.7%, while in high-compute testing it reached 87.5%, above the 85% threshold considered comparable to human performance.
The stated objective of all the major AI research labs is to create artificial general intelligence, or AGI. On the surface, it seems that OpenAI has at least taken a significant step in the right direction.
Even though there is still skepticism, many AI developers and researchers believe that something has changed. To many of them, the possibility of AGI now appears more imminent, pressing, and real than previously thought. Do they have it right?
Competition in AI reasoning
OpenAI also reported that o3 scored 96.7% on the 2024 American Invitational Mathematics Exam, missing just one question. The model also achieved 87.7% accuracy on GPQA Diamond, which comprises graduate-level biology, physics, and chemistry questions. On Epoch AI’s FrontierMath benchmark, o3 solved 25.2% of problems, while no other AI model has exceeded 2%.
The o3-mini variant includes an adaptive thinking time feature, offering low, medium, and high processing speeds. The company states that higher compute settings produce better results. OpenAI reports that o3-mini outperforms its predecessor, o1, on the Codeforces benchmark.
To interpret the o3 result, it helps to understand what the ARC-AGI test is designed to measure. Technically, it is a test of an AI system’s “sample efficiency”: how many examples of a novel situation the system needs to see before it works out how that situation functions.
An AI system such as ChatGPT (GPT-4) is not very sample-efficient. It was “trained” on millions of examples of human text, building probabilistic “rules” about which word combinations are most likely.
OpenAI’s announcement comes as its rivals are still developing their own SR models, including Google, which announced Gemini 2.0 Flash Thinking Experimental. In November 2024, DeepSeek launched DeepSeek-R1, while Alibaba’s Qwen team released QwQ, which they called the first “open” alternative to o1.
These new AI models are based on traditional LLMs, but they have been fine-tuned to produce an iterative chain-of-thought process that can reconsider its own intermediate results. This simulates reasoning in an almost brute-force way that can be scaled up at inference (running) time, rather than relying on improvements during model training, which has recently shown diminishing returns.
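OpenAI has not published the details, but the general idea of scaling at inference time can be illustrated with a minimal, hypothetical sketch in Python: sample several candidate chains of thought for the same task and keep the one an evaluator scores highest. The function names and the random placeholder scorer below are invented for illustration, not OpenAI’s method.

import random

def generate_chain_of_thought(task: str, rng: random.Random) -> list[str]:
    """Stand-in for an LLM producing one step-by-step reasoning chain.
    A real system would sample this from a fine-tuned model."""
    n_steps = rng.randint(2, 5)
    return [f"step {i + 1} of a candidate solution to: {task}" for i in range(n_steps)]

def score_chain(chain: list[str]) -> float:
    """Stand-in for a verifier that rates how promising a chain looks."""
    return random.random()  # a real scorer would judge correctness or consistency

def solve_with_inference_time_scaling(task: str, num_chains: int) -> list[str]:
    """Spend more compute at inference by sampling many chains and keeping the best."""
    rng = random.Random(0)
    candidates = [generate_chain_of_thought(task, rng) for _ in range(num_chains)]
    return max(candidates, key=score_chain)

# More samples = more inference-time compute = (ideally) a better final answer.
best = solve_with_inference_time_scaling("transform the grid", num_chains=64)

The key point is that the extra compute is spent after training is finished, at the moment the model is asked a question.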
Grid-based problem solving
The ARC-AGI benchmark uses small grid puzzles to test for sample-efficient adaptation. The AI must identify the pattern that transforms an input grid into the corresponding output grid.
Every question provides three worked examples. The AI system must then determine which rules “generalise” from those three examples to a fourth, test case. The puzzles are quite similar to the IQ tests many readers will recall from school.
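To make the format concrete, here is a minimal sketch (using invented grids and rule names, not a real ARC task) of how such a puzzle can be represented and solved by checking a handful of candidate rules against the three worked examples:

# A toy ARC-style task: each training pair shows an input grid and the output
# produced by a hidden rule; the solver must infer the rule from three examples
# and apply it to the test input. Grids and rules here are illustrative only.
task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
        {"input": [[0, 0], [2, 0]], "output": [[0, 0], [0, 2]]},
        {"input": [[3, 0], [4, 0]], "output": [[0, 3], [0, 4]]},
    ],
    "test_input": [[5, 0], [0, 0]],
}

# A small library of candidate rules; a real solver would search a far larger space.
candidate_rules = {
    "mirror_horizontally": lambda g: [row[::-1] for row in g],
    "mirror_vertically": lambda g: g[::-1],
    "identity": lambda g: [row[:] for row in g],
}

def infer_rule(train_pairs):
    """Return the first rule consistent with every training example."""
    for name, rule in candidate_rules.items():
        if all(rule(p["input"]) == p["output"] for p in train_pairs):
            return name, rule
    return None, None

name, rule = infer_rule(task["train"])
print(name, rule(task["test_input"]))  # mirror_horizontally [[0, 5], [0, 0]]

Real ARC tasks are far less regular than this toy example, which is precisely why memorised rules rarely suffice.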
Although experts are unsure of OpenAI’s exact methodology, the findings indicate that the o3 model is very adaptable. From just a few examples, it identifies rules that generalise, without making arbitrary assumptions or being more specific than the pattern requires.
Although the exact method by which OpenAI achieved this result remains unclear, there is no indication that the o3 system was deliberately optimised to find such weak (maximally general) rules. However, it must be capable of finding them to succeed at the ARC-AGI tasks.
It is known that OpenAI started with a general-purpose version of the o3 model—distinguished from most other models by its ability to spend more time “thinking” about challenging questions—and then trained it specifically for the ARC-AGI test.
According to Francois Chollet, a French AI researcher who created the benchmark, o3 looks through several “chains of thought” that outline how to complete the task.
Breakthrough o3 performance results
“OpenAI’s new o3 system, trained on the ARC-AGI-1 Public Training set, has scored a breakthrough 75.7% on the Semi-Private Evaluation set at our stated public leaderboard $10,000 compute limit. A high-compute (172x) o3 configuration scored 87.5%,” Chollet commented, describing the development as “a surprising and important step-function increase in AI capabilities, showing novel task adaptation ability never seen before in the GPT-family models.”
“For context, ARC-AGI-1 took 4 years to go from 0% with GPT-3 in 2020 to 5% in 2024 with GPT-4o. All intuition about AI capabilities will need to get updated for o3,” Chollet commented, while adding, “The mission of ARC Prize goes beyond our first benchmark: to be a North Star towards AGI. And we’re excited to be working with the OpenAI team and others next year to continue to design next-gen, enduring AGI benchmarks.”
The ARC Prize, which is a $1,000,000-plus public competition to beat and open-source a solution to the ARC-AGI benchmark, tested o3 against two ARC-AGI datasets: Semi-Private Eval (100 private tasks used to assess overfitting) and Public Eval (400 public tasks). At OpenAI’s direction, ARC Prize tested at two levels of compute with variable sample sizes: 6 (high-efficiency) and 1,024 (low-efficiency, 172x compute).
“Due to variable inference budget, efficiency (e.g., compute cost) is now a required metric when reporting performance. We’ve documented both the total costs and the cost per task as an initial proxy for efficiency. As an industry, we’ll need to figure out what metric best tracks efficiency, but directionally, cost is a solid starting point. The high-efficiency score of 75.7% is within the budget rules of ARC-AGI-Pub (costs <$10000) and therefore qualifies as first place on the public leaderboard,” Chollet noted.
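The efficiency proxy Chollet describes is simple arithmetic: total compute cost divided by the number of evaluation tasks. A tiny sketch, using hypothetical figures rather than ARC Prize’s published numbers:

def cost_per_task(total_cost_usd: float, num_tasks: int) -> float:
    """Cost per task as a rough efficiency proxy for an evaluation run."""
    return total_cost_usd / num_tasks

# Hypothetical example: a 100-task evaluation with a $2,000 total compute bill.
print(f"${cost_per_task(2000, 100):.2f} per task")  # $20.00 per task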
The low-efficiency score of 87.5% is quite expensive but still shows that performance on novel tasks improves with increased compute. As per Chollet, despite the significant cost per task, these numbers aren’t just the result of applying brute-force compute to the benchmark. OpenAI’s new o3 model represents a significant leap forward in AI’s ability to adapt to novel tasks.
This is not merely an incremental improvement but a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs. o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain.
“Of course, such generality comes at a steep cost and wouldn’t quite be economical yet: you could pay a human to solve ARC-AGI tasks for roughly $5 per task (we know, we did that), while consuming mere cents in energy. Meanwhile, o3 requires $17-20 per task in the low-compute mode. But cost-performance will likely improve quite dramatically over the next few months and years, so you should plan for these capabilities to become competitive with human work within a fairly short timeline,” Chollet remarked.
“o3’s improvement over the GPT series proves that architecture is everything. You couldn’t throw more compute at GPT-4 and get these results. Simply scaling up the things we were doing from 2019 to 2023 – take the same architecture, train a bigger version on more data – is not enough. Further progress is about new ideas,” he stated.
Is o3 an AGI?
ARC-AGI serves as a critical benchmark for detecting such breakthroughs, highlighting generalisation power in a way that saturated or less demanding benchmarks cannot. However, as Chollet and his organisation have repeated dozens of times in 2024, ARC-AGI is not an acid test for AGI. It is a research tool designed to focus attention on the most challenging unsolved problems in AI, a role it has fulfilled well over the past five years.
“Passing ARC-AGI does not equate to achieving AGI, and, as a matter of fact, I don’t think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence,” Chollet commented.
Furthermore, early data points in 2024 suggested that the upcoming ARC-AGI-2 benchmark would still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training).
“This demonstrated the continued possibility of creating challenging, unsaturated benchmarks without having to rely on expert domain knowledge. You’ll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible,” he added, while continuing, “Why does o3 score so much higher than o1? And why did o1 score so much higher than GPT-4 in the first place? I think this series of results provides invaluable data points for the ongoing pursuit of AGI.”
Chollet believes that LLMs work as a repository of vector programmes. When prompted, they will fetch the programme that the user’s prompt maps to and “execute” it on the input at hand. LLMs are a way to store and operationalise millions of useful mini-programmes via passive exposure to human-generated content.
This “memorise, fetch, apply” paradigm can achieve arbitrary levels of skill at arbitrary tasks, given appropriate training data, but it cannot adapt to novelty or pick up new skills on the fly (which is to say that there is no fluid intelligence at play here). This has been exemplified by the low performance of LLMs on ARC-AGI, the only benchmark specifically designed to measure adaptability to novelty – GPT-3 scored 0, GPT-4 scored near 0, GPT-4o got to 5%. Scaling up these models to the limits of what’s possible wasn’t getting ARC-AGI numbers anywhere near what basic brute enumeration could achieve years ago (up to 50%).
“To adapt to novelty, you need two things. First, you need knowledge – a set of reusable functions or programmes to draw upon. LLMs have more than enough of that. Second, you need the ability to recombine these functions into a brand-new programme when facing a new task – a programme that models the task at hand. Programme synthesis. LLMs have long lacked this feature. The o series of models fixes that,” Chollet said.
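Chollet’s distinction between stored knowledge and on-the-fly recombination can be illustrated with a toy programme-synthesis loop: given a few input/output examples, search over compositions of known primitive functions until one fits. The primitives and examples below are invented, and this is a sketch of the general idea rather than of how o3 works internally.

from itertools import product

# Reusable "knowledge": a handful of primitive functions over lists of ints.
primitives = {
    "reverse": lambda xs: xs[::-1],
    "double": lambda xs: [x * 2 for x in xs],
    "drop_first": lambda xs: xs[1:],
}

def synthesise(examples, max_depth=3):
    """Search compositions of primitives for a programme consistent with all examples."""
    for depth in range(1, max_depth + 1):
        for names in product(primitives, repeat=depth):
            def program(xs, names=names):
                for name in names:
                    xs = primitives[name](xs)
                return xs
            if all(program(inp) == out for inp, out in examples):
                return names  # the newly composed programme, described by its steps
    return None

# The target behaviour was never stored as a single programme; it is recombined
# from known pieces on the fly to fit the examples.
print(synthesise([([1, 2, 3], [4, 6]), ([5, 1], [2])]))  # ('double', 'drop_first')

The stored primitives correspond to the LLM’s vast repertoire of mini-programmes; the search loop is the missing “programme synthesis” step Chollet argues the o series adds.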
For now, Chollet believes that we can only speculate about the exact specifics of how o3 works. However, o3’s core mechanism appears to be natural language programme search and execution within token space – at test time, the model searches over the space of possible Chains of Thought (CoTs) describing the steps required to solve the task, in a fashion perhaps not too dissimilar to AlphaZero-style Monte-Carlo tree search.
So, while single-generation LLMs struggle with novelty, o3 overcomes this by generating and executing its own programmes, where the programme itself (the CoT) becomes the artifact of knowledge recombination. Although this is not the only viable approach to test-time knowledge recombination, it represents the current state-of-the-art as per these new ARC-AGI numbers.
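As a rough, speculative illustration of what searching over the space of possible CoTs could look like mechanically, the sketch below runs a beam search over candidate reasoning chains ranked by an evaluator. The expansion and scoring functions are placeholders; this is not a description of OpenAI’s actual implementation.

import heapq

def expand(chain: list[str]) -> list[list[str]]:
    """Stand-in for an LLM proposing possible next reasoning steps for a chain."""
    return [chain + [f"step {len(chain) + 1}, option {k}"] for k in range(3)]

def evaluate(chain: list[str]) -> float:
    """Stand-in for an evaluator model scoring how promising a partial chain is."""
    return -len(chain)  # placeholder heuristic; a real evaluator judges the content

def search_chains_of_thought(beam_width: int = 2, max_steps: int = 4) -> list[str]:
    """Beam search over the space of chains of thought, keeping the best candidates."""
    beam = [[]]
    for _ in range(max_steps):
        candidates = [c for chain in beam for c in expand(chain)]
        beam = heapq.nlargest(beam_width, candidates, key=evaluate)
    return beam[0]  # the highest-scoring complete chain

print(search_chains_of_thought())

Widening the beam or deepening the search is another way of converting extra inference-time compute into better answers, which is consistent with the compute-versus-score trade-off seen in the ARC-AGI results.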
The new o3 AI model achieving a breakthrough high score on the ARC Challenge has stirred up speculation about whether the OpenAI product is an AGI. While the ARC Challenge organisers described o3’s achievement as a major milestone, they also cautioned that it has not won the competition’s grand prize, a milestone the organisers frame as a step toward future AI with human-like intelligence.
However, o3 earned the high score at a time when the tech industry and researchers have been reckoning with a slower pace of progress in the latest AI models of 2024. The feat also suggests that upcoming AI models could beat the competition benchmark in the near future. Beyond o3’s unofficial high score, Chollet says many official low-compute submissions have already scored above 81% on the private evaluation test set.
The ARC Challenge organisers are already looking to launch a second and more difficult set of benchmark tests sometime in 2025. They will also keep the ARC Prize 2025 challenge running until someone achieves the grand prize and open-sources their solution.
OpenAI’s o3 model has marked a significant milestone in the ongoing quest for AGI. With its record-breaking scores on the ARC-AGI benchmark and exceptional reasoning abilities, o3 represents a leap in AI’s capacity for adapting to novel tasks. While it isn’t AGI yet, its performance indicates that we may be closer to achieving human-like intelligence in AI. The competition is heating up, and the next few years will be crucial.