R1 is not on par with o1, and the difference is qualitative, not quantitative
What sets a reasoning system apart, how to test for it, and how to achieve it
New benchmarking results compare the latest models
A few weeks ago, DeepSeek made waves with their latest R1 model, pushing ChatGPT out of the #1 spot in the Apple App Store and prompting tech stock selloffs that sent Nvidia share prices plummeting 17%.
The claim is that DeepSeek R1 — an open-weights [1] model produced by a small Chinese tech lab — rivals OpenAI’s o1 model in performance, but at a fraction of the cost for training and inference. The o1 model represents a shift towards adaptive reasoning systems that scale their computation at test time, allocating more compute during inference as a problem demands it. This gives reasoning models more powerful problem-solving capabilities than traditional LLMs, which have less capacity for adaptive allocation of compute. To access these capabilities in its latest o1 model, OpenAI charges six times more than for GPT-4o. DeepSeek offers the same type of reasoning system, but with openly released weights.
Setting aside the geopolitical aspect, open vs closed source questions, and implications of an imminent decline in training and inference costs, let's focus on DeepSeek's biggest splash — the performance parity.
The Toloka Research team has been testing R1 and o1 and analyzing existing evaluations, and our findings bring R1’s dominance into question.
Read on to learn:
How R1 really compares to o1
Where we’ve exposed the most significant gaps
What makes o1 stand apart from all the current reasoner models
How to approach closing the gap
Public comparisons show R1 on par with o1
AI enthusiasts and the media commonly cite R1 as being on par with o1, based on public benchmarks that show remarkably similar performance. Let's look at the available comparisons of R1 and o1.
DeepSeek report
This comparison table from DeepSeek’s own report shows R1 and o1 neck and neck on major benchmarks. The majority of articles comparing R1 to o1 are based on this table, or tiny-scale manual vibe checks using a handful of prompts.
| Domain | Benchmark | o1 | R1 |
|---|---|---|---|
| Mathematics | MATH-500 (Pass@1) | 96.4% | 97.3% |
| Mathematics | AIME 2024 (Pass@1) | 79.2% | 79.8% |
| Coding | Codeforces (Percentile) | 96.6% | 96.3% |
| Coding | LiveCodeBench (Pass@1) | 63.4% | 65.9% |
| Coding | SWE-bench Verified (Resolved) | 48.9% | 49.2% |
| General | DROP (3-shot F1) | 90.2% | 92.2% |
| General | MMLU (Pass@1) | 91.8% | 90.8% |
| General | GPQA Diamond (Pass@1) | 75.7% | 71.5% |
| General | SimpleQA (Correct) | 47.0% | 30.1% |
Popular online leaderboards
Additional “go-to” evaluation sources are the full LiveBench leaderboard and ChatBot Arena [2].
| Domain | Benchmark | o1 | R1 |
|---|---|---|---|
| General | ChatBot Arena (Rating) | 1352 (+6/-7) | 1361 (+7/-8) |
| Data Science | LiveBench (Pass@1) | 65.5% | 69.8% |
| Instruction Following | LiveBench (Pass@1) | 81.6% | 80.5% |
| Reasoning (Spatial & Logic Puzzles) | LiveBench (Pass@1) | 91.6% | 83.2% |
| Language (Grammar & Comprehension) | LiveBench (Pass@1) | 65.4% | 48.5% |
Style comparisons
There are also numerous anecdotal reports on R1’s superior writing style, generally described as “more creative and fun, if a bit unhinged,” which is in line with some independent evaluations.
General sentiment
The widespread narrative based on those evaluations is that the models are more or less equal. R1 seems to be slightly better at math and coding, with a freer and more creative writing style, while o1 is ostensibly somewhat better at factuality, question answering, and instruction following, with a writing style that favors careful structure, grammar, and logic.
All of these differences can be loosely attributed to training focus: R1 training was "less restricted", with an emphasis on math and coding domains that lend themselves well to reinforcement learning, while o1 was presumably "more guided", with more attention to world knowledge and alignment. OpenAI invested heavily in o1's alignment as a public-facing product via ChatGPT, ensuring the model's safety and adaptation to consumer preferences across general domains. It all seems to add up.
But we think there’s more to the story.
Long-tail benchmarks reveal gaps
As soon as we leave the beaten path, alternative benchmarks paint a different picture. The Toloka Research team investigated evaluations in niche subdomains and uncommon domains and noted quantitative and qualitative gaps in model performance.
Niche subdomains
A few months ago, Toloka released U-MATH — a benchmark to test LLMs on university-level mathematics, sourced from unpublished problems used in the curriculum at top US universities. We developed the benchmark precisely to represent an uncommon, under-explored niche within the math domain — real-world university math. According to our U-MATH evaluation, R1 does not seem to have an edge over o1 in mathematical reasoning. [3][4][5]
| Benchmark | o1 | R1 |
|---|---|---|
| MATH-500 (Pass@1) | 96.4% | 97.3% |
| AIME 2024 (Pass@1) | 79.2% | 79.8% |
| U-MATH (TextHard, Accuracy) | 90.5% | 88.2% |
We continued to study niche subdomains by examining coding, the other domain where R1 is considered to have an edge over o1. In terms of coding benchmarks, Aider’s Polyglot is a good niche example, focusing on code editing tasks in a large number of different programming languages, in contrast to the more widespread from-scratch Python generation. Aider's evaluations contradict the story of R1 beating o1 in coding, similar to our conclusions on U-MATH.
| Benchmark | o1 | R1 |
|---|---|---|
| Codeforces (Percentile) | 96.6% | 96.3% |
| LiveCodeBench (Pass@1) | 63.4% | 65.9% |
| SWE-bench Verified (Resolved) | 48.9% | 49.2% |
| Aider Polyglot (Correct) | 61.7% | 56.9% |
Unusual domains
Continuing in this vein, we've investigated benchmarks that assess altogether less conventionally tested domains and capabilities, not just less conventional subdomains. There, we observe a pronounced gap in favor of o1.
| Benchmark | o1 | o1-mini | R1 |
|---|---|---|---|
| μ-MATH (F1) | 90.1% | 83.4% | 84.3% |
| ARC-AGI-1 (Accuracy) | 32.0% | 7.8% | 15.8% |
| LLM Chess (Wins) | 57.4% | 30.0% | 22.6% |
| NPR Sunday Puzzle (Correct) | 59.0% | 26.0% | 35.0% |
| NYT Connections Extended | 70.8% | 27.0% | 38.6% |
μ-MATH is our other benchmark on university mathematics, based on U-MATH but intended for meta-evaluations — i.e., assessing the performance of LLMs as automatic solution evaluators instead of problem solvers. Judging solutions by testing them against a golden label proves to be a challenging skill in its own right [6], and it is a somewhat atypical skill to evaluate and train for. Here we observe R1’s score closer to that of o1-mini, not o1. [7]
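To make this setup concrete, below is a minimal sketch of what a judgment-style evaluation loop can look like: a judge model receives the problem, the golden answer, and a candidate solution, returns a verdict, and that verdict is later compared against a human label. The prompt wording, the helper names, and the use of an OpenAI-compatible client are illustrative assumptions on our part, not the actual μ-MATH harness.

```python
# Minimal sketch of an LLM-as-judge meta-evaluation loop (illustrative, not the
# actual mu-MATH harness). Assumes an OpenAI-compatible chat-completions client.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_PROMPT = """You are grading a math solution.
Problem: {problem}
Reference (golden) answer: {golden_answer}
Candidate solution: {candidate_solution}
Reply with exactly one word: CORRECT or INCORRECT."""


def judge_verdict(problem: str, golden_answer: str, candidate_solution: str,
                  model: str = "o1") -> bool:
    """Ask a judge model whether the candidate solution matches the golden answer."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                problem=problem,
                golden_answer=golden_answer,
                candidate_solution=candidate_solution,
            ),
        }],
    )
    return response.choices[0].message.content.strip().upper().startswith("CORRECT")


def run_meta_eval(samples: list[dict], model: str = "o1") -> list[tuple[bool, bool]]:
    """Return (judge verdict, human label) pairs; the human label marks whether
    the candidate solution is actually correct."""
    return [
        (judge_verdict(s["problem"], s["golden_answer"], s["candidate_solution"], model),
         s["human_label"])
        for s in samples
    ]
```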
ARC tests AI systems on their ability to adapt to novel situations, showing R1 lagging well behind o1.
LLM Chess Leaderboard again places R1 behind o1 and closer to o1-mini. The evaluation also reports that R1 makes an average of 52.93 illegal moves per 1,000 attempts, compared to o1 and o1-mini with 0 and 4.29 illegal moves, respectively [8].
A Northeastern University Programming Research Lab study tested the models on the NPR Sunday Puzzle Challenge, a set of difficult wordplay problems based on common knowledge. In addition to the large gap between R1 and o1, the researchers also reported several "failure modes" that occur with R1 and never with o1 or o1-mini, such as R1 explicitly stating "I give up" 23.9% of the time or failing to stop generating 8.5% of the time. In 6.6% of cases, R1 exhibits erratic answer switching, which happens in only 0.5% of cases for o1-mini and never for o1.
Similarly, as reported by the NYT Connections leaderboard, R1 is far behind o1 in solving puzzles that involve finding semantic subgroups within word sets.
Superior generalization and reliability place o1 in a league of its own
The long-tail benchmarks above are pertinent because they are unconventional, testing for novelty and robustness. So here’s our claim: o1 has greater generalization and reliability than R1. In terms of reliability in particular, we can see that both o1 and o1-mini are superior to R1.
As we continued investigating, we found that other publicly available evaluations support our conjecture.
If R1 generalizes less well, we would expect it not only to perform worse on niche subdomains or novel domains, but also to show degraded performance on new, unseen tasks from familiar suites. Our expectation is confirmed by MathArena’s recent report on AIME 2025 results, which highlights performance on problems published after the release of o1 and R1 [9]. On these unseen problems, performance drops significantly for R1 but not for o1.
| Benchmark | R1 | o1 |
|---|---|---|
| AIME 2024 (Pass@1) | 79.8% | 79.2% |
| AIME 2025 (avg. I & II, Pass@1) | 70.0% | 79.2% |
Reliability, in turn, plays a vital role in adversarial robustness and consistency, both lacking in R1 compared to o1 and o1-mini.
Cisco reports a 100% success rate on attempts to jailbreak R1, compared to 26% for o1-preview, and there are other similar reports on R1’s safety limitations [10].
A recent update of Vectara's hallucination benchmark shows R1's hallucination rate at 14.3%, while o1 and o1-mini only display 2.4% and 1.4%, respectively.
Both proper generalization and reliability are extremely important for AI systems, and are often cited as the primary bottlenecks of agentic applications. An extension of our claim is that superior generalization and reliability put o1 in a league of its own among all the currently available reasoning models. While we agree with the popular narrative that R1 has a different focus from o1, o1's emphasis on alignment and broad domain coverage results in a qualitative difference that goes far beyond a shift in performance emphasis.
Coming up, we'll broaden the discussion of generalization and reliability in models and will then discuss ways to enhance these key properties in a reasoning system, so stay tuned.
Beyond o1 vs R1: studying trends and biases
Bringing in more models and metrics can help us get a bigger-picture perspective on the performance patterns, instead of only focusing on o1-to-R1 comparisons from standalone benchmarks. Let's turn again to our own suite of U-MATH and μ-MATH datasets.
Recognizing patterns
You can study the U-MATH and μ-MATH leaderboard on Hugging Face. Here are the aggregate scores, ranked by μ-MATH F1.
The graph reveals three major performance trends:
Judgment vs. problem-solving: In non-reasoning models, judgment and problem-solving performance improve together — up to a point. Beyond that, they diverge, with better problem-solving correlating with weaker judgment. This puts some numbers behind our statement that judgment is a distinct skill and suggests an inherent tradeoff between the two, with models in a sense being "forced to specialize" in either one or the other.
Reasoners extend the Pareto frontier by breaking out of this tradeoff, demonstrating their superior generalization and advancing beyond the previous generation of models [11].
o1 stands apart by pushing a step beyond other reasoning models into its own category.
Investigating the tradeoffs
We won’t go into much detail here [12], but what we find is that the problem-solving vs judgment tradeoff mentioned above translates into behavioral differences among the models, yielding two distinctive judgment styles:
Lenient judges: More verbose, inclined toward lengthy derivations and better at comparing complex expressions, but prone to losing track — leading to fewer false negatives but more false positives.
Conservative judges: More structured and precise, but often anchored on the exact form of the reference answer or golden label — leading to fewer false positives but more false negatives.
Good judgment depends on balancing mathematical problem-solving with general structured logic and instruction-following, and an ideal model would have strong domain-specific reasoning skills while maintaining high reliability and coherence in order to successfully apply them.
To understand this balance better, we can decompose the F1 score into True Positive Rate (TPR), which measures correct positive classifications, and True Negative Rate (TNR), which tracks correct rejections of incorrect solutions. Higher TPR means fewer false negatives, while higher TNR means fewer false positives.
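As a concrete illustration, here is a sketch of that decomposition computed from (judge verdict, human label) pairs, such as those produced by the evaluation loop sketched earlier. The macro-averaged F1 is our assumption about the aggregation; the exact metric definition used on the leaderboard may differ in detail.

```python
def _safe_div(num: float, den: float) -> float:
    """Avoid division by zero for degenerate label distributions."""
    return num / den if den else 0.0


def decompose_scores(pairs: list[tuple[bool, bool]]) -> dict[str, float]:
    """Compute TPR, TNR, and a macro-averaged F1 from (verdict, label) pairs.

    verdict: the judge's call (True means "the solution is correct")
    label:   the ground-truth human label for that solution
    """
    tp = sum(1 for verdict, label in pairs if verdict and label)
    fn = sum(1 for verdict, label in pairs if not verdict and label)
    tn = sum(1 for verdict, label in pairs if not verdict and not label)
    fp = sum(1 for verdict, label in pairs if verdict and not label)

    tpr = _safe_div(tp, tp + fn)  # recall on correct solutions: fewer false negatives
    tnr = _safe_div(tn, tn + fp)  # recall on incorrect solutions: fewer false positives

    # Per-class F1 scores, macro-averaged (an assumed aggregation, see above).
    prec_pos = _safe_div(tp, tp + fp)
    prec_neg = _safe_div(tn, tn + fn)
    f1_pos = _safe_div(2 * prec_pos * tpr, prec_pos + tpr)
    f1_neg = _safe_div(2 * prec_neg * tnr, prec_neg + tnr)

    return {"TPR": tpr, "TNR": tnr, "macro_F1": (f1_pos + f1_neg) / 2}


# Example: a judge that accepts 4 of 5 correct solutions and rejects 8 of 10 incorrect ones.
example = [(True, True)] * 4 + [(False, True)] + [(False, False)] * 8 + [(True, False)] * 2
print(decompose_scores(example))  # TPR = 0.8, TNR = 0.8
```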
We can observe all the same trends in this chart: non-reasoning systems exhibiting the performance tradeoff [13], reasoning systems pushing the frontier outward [14], and o1 advancing a step further still. We can also see that on top of being the best-performing model, o1 is among the most balanced ones overall, and the most balanced of the tested reasoners [15].
Studying the previous generation of models informs us on how to approach this balancing with the next:
Balanced training makes for a more balanced model. We can see in action how transitioning from math specialists to generalist models leads to better, more well-rounded judgment performance. A balanced training setup integrates the depth of formal reasoning with breadth of domain coverage and coherence.
Reducing capability amplifies training-induced biases. Lenient models tend to become even more lenient when scaled down to smaller sizes, and conservative models become more conservative. A model needs to have appropriate capabilities that allow it to generalize over the things that you’re balancing.
The key point is: Powerful reasoning systems have the appropriate capabilities and already excel at formal reasoning. Training focused on diversifying domains and improving coherence will help guide them toward generalization and reliability.
Building more generalizable and reliable models: It’s all in the data
We’ve discussed at length why a reasoning system requires generalization and reliability, demonstrated o1’s excellence on these dimensions, and shown the superb performance that results.
We have three recommendations to help improve these qualities in reasoning systems, all requiring high-quality data. A strong data partner like Toloka can help navigate the details of data production.
1) Diverse domain coverage
Increasing versatility of autoverified data: Math and coding tasks permitting simple autoverification are the bread and butter of current reasoning systems [16], but there are ways to diversify (a minimal sketch of such autoverification follows this list).
Even math and coding have untapped niches. The U-MATH benchmark is derived from our larger training dataset, offering a good example of a math subdomain that’s underrepresented compared to the more readily-available textbook, high-school or Olympiad-style data.
Plenty of other fields allow for autoverification but suffer from a lack of appropriate data — examples include chemistry, finance, biology, and more. Closing the gap on these would require careful and efficient data curation by highly skilled experts from a diverse domain set — Toloka’s specialty.
Expanding the boundaries of autoverification: Another direction for improvement would involve making better [17] or more general verifiers.
One possible approach is to create datasets for training more capable verifiers [18], such as scaling up meta-evaluation datasets like Toloka’s μ-MATH in size and complexity.
Alternatively, domains such as law or medicine don’t typically have singular golden labels but lend themselves well to verification on expert-formulated rubrics, which Toloka already produces for a diverse set of complex domains.
Explicit demonstration where autoverification fails: For more open-ended expert-reasoning domains, such as consulting, web research, or forecasting, proper autoverification is challenging to do at scale [19]. For these, a possible approach is to have expert-crafted demonstrations for the entire reasoning process — planning, researching the sources, backtracking, etc. One way Toloka sets similar projects up is by having experts operate an LLM-based system, adding their inputs and steering it towards complete solutions.
Going Multi-*: multi-lingual, multi-cultural, multi-modal, multi-turn, etc.
Similar to niche subdomains, languages other than English are underrepresented in quality data. Our team has hands-on experience delivering production-grade multilingual datasets and working with low-resource languages, thanks to our expert network of speakers in 40+ languages.
Another option is producing multimodal golden labels — e.g. for reasoning over images — which Toloka can do at scale.
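To ground the autoverification discussion above, here is a minimal sketch of the two simplest verifier types: symbolic answer matching for math and unit-test execution for code. The helper names, the reliance on sympy, and the bare subprocess call are illustrative assumptions on our part; production verifiers involve far more careful parsing, tolerance handling, and sandboxing.

```python
# Minimal sketch of simple autoverification for math and coding data (illustrative).
import subprocess

import sympy


def verify_math_answer(model_answer: str, golden_answer: str) -> bool:
    """Check a free-form math answer against a golden label via symbolic equivalence."""
    try:
        difference = sympy.simplify(
            sympy.sympify(model_answer) - sympy.sympify(golden_answer)
        )
        return difference == 0
    except (sympy.SympifyError, TypeError):
        # Fall back to exact string match when the answer is not a parseable expression.
        return model_answer.strip() == golden_answer.strip()


def verify_code_answer(test_path: str, timeout_s: int = 60) -> bool:
    """Check a code solution by running its unit tests in a subprocess.

    NOTE: untrusted model output must run inside a sandbox; the bare subprocess
    call here is for illustration only.
    """
    result = subprocess.run(
        ["python", "-m", "pytest", test_path, "-q"],
        capture_output=True,
        timeout=timeout_s,
    )
    return result.returncode == 0


# The boolean verdict can serve directly as a reward signal for RL-style training.
assert verify_math_answer("2*x + 2*x", "4*x")
```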
2) Reasoning refinement
Process supervision: Training process reward models is a large part of current reasoning systems research [20]. Despite the relevance, there’s very little public data available for PRM training, and all the limitations of poor domain coverage apply here as well. This type of data production overlaps with Toloka’s experience, such as trajectory annotation for coding agents [21]. A sketch of one possible step-level record format follows at the end of this section.
Explicit demonstrations of excellent reasoning: Demonstration-style data could also be used to improve coherence, efficiency, and clarity of reasoning traces [22].
There are several ways to obtain this type of data:
Manually crafting examples of concise and effective reasoning from scratch
Editing LLM-generated trajectories
Issuing preferences amidst a number of alternative trajectory options
All of these align with our experience of building scalable SFT demonstration pipelines for complex domains. Preferences could be issued either over entire trajectories or step-wise for denser signals, similar to Toloka’s projects in multi-turn dialogue preferences.
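As a concrete reference point for the process-supervision item above, here is a sketch of one possible step-level annotation record and how it could be turned into training targets for a process reward model. The field names and the {-1, 0, +1} label scheme are assumptions made for illustration, not a standard format used by any particular lab.

```python
# Sketch of a step-level annotation record for process supervision (illustrative).
from dataclasses import dataclass


@dataclass
class AnnotatedStep:
    text: str   # one step of the model's reasoning trace
    label: int  # +1 correct step, 0 neutral/unverifiable, -1 erroneous step


@dataclass
class AnnotatedTrajectory:
    prompt: str
    steps: list[AnnotatedStep]
    final_answer_correct: bool


def prm_targets(trajectory: AnnotatedTrajectory) -> list[tuple[str, float]]:
    """Turn step labels into (prefix, target) pairs for process reward model training.

    Each training example scores the trajectory prefix that ends at a given step,
    mapping labels {-1, 0, +1} to regression targets {0.0, 0.5, 1.0}.
    """
    pairs = []
    prefix = trajectory.prompt
    for step in trajectory.steps:
        prefix = prefix + "\n" + step.text
        pairs.append((prefix, (step.label + 1) / 2))
    return pairs
```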
3) Appropriate evals
Benchmarks: As we’ve already emphasized, benchmarks need to be novel and provide clear signals — incorporating purposeful design, quality data and meaningful analyses. We have relevant experience in creating clean, informative, and practical evaluation datasets, such as our publicly available U-MATH & μ-MATH, and in building custom evaluation pipelines with insight into quality criteria and metrics tailored to specific use-cases.
Red-teaming: Traditional safety labeling will need to be adapted to reasoning models to take long reasoning traces into account. Besides that, as models become more prominent, they have broader applications and thus an expanded attack surface as well. Toloka’s versatile team of domain and security experts tests a wide variety of applications, including agentic systems and computer operators.
The “DeepSeek moment” is still real
DeepSeek’s public release of R1 deserves recognition for the seminal AI moment that it is.
Although it is not a fully-open model in terms of reproduction transparency and, as we’ve shown here, it is not quite at the frontier level, this is the closest the AI community has ever been to an open frontier system. Besides, this is an incredible opportunity for us to study and better understand what differentiates these models, what’s potentially missing, and how to move forward.
Let’s build the future together.