Frontier Models can win at IMO, but they still can't check their own assumptions.


Benchmark scores on STEM evaluations keep climbing. Reliability on the problems that actually matter isn't keeping pace. The reason is harder to fix than direct contamination, and harder to detect: soft contamination.

A model that has seen a boundary value problem in electrostatics during pretraining encounters a new one at evaluation time. It hasn't memorized the answer. It interpolates — draws on the structural pattern of the original derivation, adapts it plausibly, and produces an output that looks correct. It may even pass an automated grader. But adapting a known solution template is not the same as deriving from scratch. On problems that require genuine novelty, that distinction is what matters.

This is the lens through which we think about evaluation integrity at Toloka, and it shapes how we build STEM datasets. We've spent the past year building evaluation and training data targeting frontier STEM benchmarks (HLE, GPQA, AIME, AMO-Bench, SciCode), working with PhD-level domain experts across mathematics, physics, chemistry, and engineering. What follows is what we found: the failure modes that kept recurring, what they imply about training data, and why most existing STEM data doesn't address them.

SciCode as a Concrete Case

Before the failure modes, it's worth grounding this in a specific benchmark where the contamination problem is measurable.

SciCode tests a different side of STEM than most frontier benchmarks: not just answering a science question, but applying knowledge to implement a working solution. The benchmark asks whether a model can take a multi-step scientific computing problem, decompose it correctly, implement each step in Python, and produce numerically correct outputs. It requires chaining algorithmic steps: knowing the right formulas isn't enough if you can't implement them correctly across a multi-step procedure and produce verifiable numerical outputs.

We extended SciCode+ with a curated set of novel, expert-authored tasks precisely because the existing benchmark was approaching the limits of its evaluation integrity. Tasks authored from published papers carry contamination risk. Even when the underlying code is internal and never open-sourced, the paper itself describes the methodology — the algorithm, the steps, the expected outputs. That description is almost certainly in the pretraining corpus. A model doesn't need to have seen the code to pattern-match to the solution structure. Our extension includes a synthetic subset where tasks are freshly designed rather than extracted from literature, structurally disjoint from the public benchmark, and validated against ground-truth outputs through automated test cases with assertions.

The practical finding: even when subproblems are explicitly provided, models collapse them into a single monolithic solution, ignoring the step structure. Ground-truth implementations need to be step-decomposed with verifiable intermediate outputs at each stage. Numerical precision compounds this — wrong dtypes, incorrect array shapes, and unstable implementations require explicit ground-truth output specifications to catch.
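As a minimal sketch of what "step-decomposed with verifiable intermediate outputs" can look like, the example below splits a toy numerical integration into three functions and checks each stage against a small ground-truth specification. The task, the function names, and the `check_step` helper are illustrative assumptions, not taken from SciCode or from our datasets. Plain Python lists stand in for arrays here; with NumPy the same checks would target dtype and shape.

```python
import math

def make_grid(a, b, n):
    """Step 1: n + 1 evenly spaced points on [a, b]."""
    h = (b - a) / n
    return [a + i * h for i in range(n + 1)]

def sample(f, grid):
    """Step 2: evaluate the integrand at every grid point."""
    return [float(f(x)) for x in grid]

def trapezoid(values, h):
    """Step 3: composite trapezoidal rule on uniformly spaced samples."""
    return h * (0.5 * values[0] + sum(values[1:-1]) + 0.5 * values[-1])

def check_step(name, out, length, first, last, tol=1e-12):
    """Compare an intermediate output against its ground-truth specification."""
    assert len(out) == length, f"{name}: expected length {length}, got {len(out)}"
    assert all(isinstance(v, float) for v in out), f"{name}: non-float entry"
    assert abs(out[0] - first) <= tol, f"{name}: first value mismatch"
    assert abs(out[-1] - last) <= tol, f"{name}: last value mismatch"

n = 1000
grid = make_grid(0.0, math.pi, n)
check_step("make_grid", grid, n + 1, 0.0, math.pi)

values = sample(math.sin, grid)
check_step("sample", values, n + 1, 0.0, 0.0)

integral = trapezoid(values, math.pi / n)
# The exact integral of sin on [0, pi] is 2; the trapezoidal error is O(h^2).
assert abs(integral - 2.0) < 1e-5
```

A monolithic solution that returned only `integral` would pass the final assertion but give a grader nothing to check at steps 1 and 2, which is where a wrong length or type would otherwise go unnoticed.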

This is the baseline. Now the five failure modes that apply across both scientific computing and frontier reasoning benchmarks.

Five Failure Modes — and What They Share

What links all five is a common root: training data that rewards surface pattern-matching over first-principles reasoning. A model trained on compressed solutions, underspecified problems, and single-modality outputs learns to produce outputs that look like correct reasoning. Getting it to actually reason correctly requires different data.

1. Models can't write a well-posed problem

The most common rejection reason in our pipeline — and the one that surprised us most — was not a wrong derivation. It was an underspecified problem.

A well-posed graduate-level physics problem specifies boundary conditions, defines the coordinate system, states which theory applies, and makes every isolation assumption explicit. Without all of this, the problem admits multiple valid answers, and none of them is uniquely correct.

What we consistently saw from model-generated content: thin problem statements, missing constraints, undefined terms, implicit assumptions the solver is expected to infer. Expert reviewers flagged this as the root cause of the majority of rejections in our pipeline. The fix is training data composed of complete, fully-constrained problem formulations from graduate qualifying exams and textbooks, each paired with explicit justification for why every specification element is necessary for a unique solution.
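One way to picture the specification elements listed above is as a structural checklist. The sketch below is a hypothetical illustration: the `ProblemSpec` fields and the `missing_elements` helper are names invented here, not part of our pipeline. A check like this can only flag that an element is absent; whether the stated conditions actually yield a unique solution remains an expert judgment.

```python
from dataclasses import dataclass, field

@dataclass
class ProblemSpec:
    """Hypothetical container for the elements of a well-posed problem."""
    statement: str
    boundary_conditions: list = field(default_factory=list)
    coordinate_system: str = ""
    governing_theory: str = ""
    assumptions: list = field(default_factory=list)

REQUIRED = ("boundary_conditions", "coordinate_system", "governing_theory", "assumptions")

def missing_elements(spec):
    """Return the names of required specification elements that are empty."""
    return [name for name in REQUIRED if not getattr(spec, name)]

# A thin draft: it names the theory but leaves everything else implicit.
draft = ProblemSpec(
    statement="Find the potential inside a grounded conducting sphere containing a point charge.",
    governing_theory="electrostatics",
)
print(missing_elements(draft))
```

For the draft above, the check reports the boundary conditions, the coordinate system, and the assumptions as missing, which are exactly the gaps a solver would otherwise be left to infer.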

2. Solution chains are too compressed

A complete solution to a hard graduate problem should be 4 to 5 pages with every step explicitly motivated. What models currently produce is closer to 5 to 8 short paragraphs.

The specific failure: skipping intermediate steps, compressing multiple logical moves into a single line, and omitting justification for approximations. The approximation problem matters most. A model that writes "using the small-angle approximation" without stating when that approximation is valid or what error it introduces has learned to reproduce a solution shape, not to reason. Models trained on auto-generated solutions learn to skip steps, and each generation of synthetic training compresses the reasoning further.

The training data you want: verbose, expert-written derivations where every approximation is justified, every intermediate step is shown, and alternative solution paths are included where relevant. 
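To make "justifying an approximation" concrete, here is a small sketch for the small-angle case. From the Taylor series sin θ = θ − θ³/6 + …, the relative error of replacing sin θ with θ is approximately θ²/6 for small θ. The function names and the bisection helper are illustrative choices, not a prescribed format.

```python
import math

def small_angle_rel_error(theta):
    """Relative error of approximating sin(theta) by theta (radians, nonzero)."""
    return abs(theta - math.sin(theta)) / abs(math.sin(theta))

def max_angle_for_tolerance(tol, hi=1.0, iters=60):
    """Bisection for the largest angle in (0, hi] whose relative error stays within tol.

    Relies on the relative error increasing monotonically with theta on (0, 1].
    """
    lo = 1e-9
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if small_angle_rel_error(mid) <= tol:
            lo = mid
        else:
            hi = mid
    return lo

# Largest angle (in radians) for which sin(theta) ~ theta is good to 1%.
theta_max = max_angle_for_tolerance(0.01)
print(f"valid up to about {theta_max:.3f} rad ({math.degrees(theta_max):.1f} degrees)")
```

A derivation that states a bound of this kind, together with the amplitude of the motion in the problem, tells the reader both when the approximation holds and how much error it introduces.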

3. Failure rates are not uniform — and the pattern is actionable

Failure rates across domains are not evenly distributed. Our expert annotation pipeline surfaces this directly: some subdomains generate significantly higher correction rates and rejection frequencies than others. Uniform data collection across STEM is therefore inefficient. The subdomains worth concentrating on:

Physics: Electromagnetism is the highest-failure area — boundary value problems, eddy currents, radiation, multipole expansions. Thermodynamics and statistical mechanics follow, clustering around phase transitions, partition functions, and critical phenomena. Quantum mechanics failures concentrate in scattering theory, perturbation methods, and many-body systems.

Mathematics: Functional analysis (operator theory, spectral methods, infinite-dimensional systems) and algebraic topology and geometry (homology computations, fiber bundles, characteristic classes) show the highest correction rates.

The directional finding is clear enough to act on: a disproportionate investment in these subdomains will move frontier benchmark performance more efficiently than evenly-distributed coverage.
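A simple way to act on this is to weight the collection budget by observed correction rate. The sketch below is one possible scheme, assuming proportional allocation; the rates in the example are placeholder values chosen for illustration, not measurements from our pipeline, and `classical_mech` is a hypothetical low-failure subdomain added only for contrast.

```python
def allocate_budget(correction_rates, total_tasks):
    """Split total_tasks across subdomains in proportion to their correction rates.

    Uses largest-remainder rounding so the integer counts sum to total_tasks.
    """
    total_rate = sum(correction_rates.values())
    if total_rate <= 0:
        raise ValueError("at least one correction rate must be positive")
    exact = {k: total_tasks * r / total_rate for k, r in correction_rates.items()}
    counts = {k: int(v) for k, v in exact.items()}
    leftover = total_tasks - sum(counts.values())
    # Hand the remaining tasks to the subdomains with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda k: exact[k] - counts[k], reverse=True)
    for k in by_fraction[:leftover]:
        counts[k] += 1
    return counts

# Placeholder rates for illustration only; these are NOT measured values.
rates = {
    "electromagnetism": 0.40,
    "stat_mech": 0.30,
    "quantum": 0.20,
    "classical_mech": 0.10,
}
plan = allocate_budget(rates, 1000)
print(plan)
```

Proportional weighting is only one choice; a team might instead weight by benchmark impact or cap any single subdomain's share.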

4. Self-correction is nearly absent — and this is the deepest gap

Modern reasoning models can self-correct — the visible backtracking in o3 and R1-class models is real. But self-correction tends to happen at the exploration stage, before a reasoning path is committed. Once a model has established a trajectory, it becomes increasingly unlikely to revise an intermediate result even when it's wrong. The error propagates forward, each subsequent step building on the flawed assumption, until the final answer is wrong in a way that looks internally consistent. 

To illustrate: a model solving a thermodynamics problem applies the ideal gas law without checking whether the conditions warrant it. The derivation is clean, the answer has the right units, the format is correct. It's wrong because the assumption was never verified. Nothing in the output signals a problem — a format-checking grader wouldn't flag it, and the reasoning chain looks coherent. The error is invisible without a model that audits its own assumptions before proceeding.
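The assumption audit described above can be sketched numerically. The example compares the ideal-gas pressure with a van der Waals estimate and flags states where the two disagree by more than a chosen tolerance. The van der Waals equation is used here as one possible reference model, not as ground truth; the CO2 constants are commonly tabulated values and should be treated as assumptions, as should the 1% tolerance and the function names.

```python
R = 8.314462618  # molar gas constant, J/(mol K)

def ideal_pressure(n, T, V):
    """Ideal-gas pressure in Pa for n mol at T kelvin in V cubic metres."""
    return n * R * T / V

def vdw_pressure(n, T, V, a, b):
    """van der Waals pressure in Pa; a in Pa m^6/mol^2, b in m^3/mol."""
    return n * R * T / (V - n * b) - a * n**2 / V**2

def ideal_gas_is_adequate(n, T, V, a, b, rel_tol=0.01):
    """True if the ideal-gas pressure is within rel_tol of the van der Waals value."""
    p_ideal = ideal_pressure(n, T, V)
    p_vdw = vdw_pressure(n, T, V, a, b)
    return abs(p_ideal - p_vdw) / abs(p_vdw) <= rel_tol

# Commonly tabulated van der Waals constants for CO2 (assumed values).
A_CO2 = 0.3640      # Pa m^6 / mol^2
B_CO2 = 4.267e-5    # m^3 / mol

# Dilute state: 1 mol at 300 K in 25 litres (close to atmospheric pressure).
dilute_ok = ideal_gas_is_adequate(1.0, 300.0, 0.025, A_CO2, B_CO2)
# Dense state: 1 mol at 300 K in 0.5 litres (ideal-gas estimate near 50 bar).
dense_ok = ideal_gas_is_adequate(1.0, 300.0, 0.0005, A_CO2, B_CO2)
print(dilute_ok, dense_ok)
```

In the dense state the two models disagree by roughly a quarter, so a derivation that silently used the ideal gas law there would be clean, dimensionally correct, and wrong.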

What fixes it is training data that explicitly models error detection: not just correct solutions, but paired examples of wrong output → identified error → expert explanation → corrected version. This find-and-fix structure is rare in existing training data. It's also a natural byproduct of a rigorous annotation pipeline: if you're capturing what reviewers find and fix, you already have it.
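A find-and-fix example could be stored as a record along the following lines. This is a hypothetical schema for illustration; the field names, the `CorrectionPair` class, and the JSON Lines serialization are assumptions, not the format of our delivered data.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class CorrectionPair:
    """One wrong output -> identified error -> explanation -> correction record."""
    problem_id: str
    wrong_output: str
    identified_error: str
    expert_explanation: str
    corrected_version: str

    def __post_init__(self):
        # Every stage of the chain must be present for the pair to be useful.
        for name, value in asdict(self).items():
            if not value.strip():
                raise ValueError(f"{name} must not be empty")

    def to_jsonl(self):
        """Serialize the record as a single JSON line."""
        return json.dumps(asdict(self), ensure_ascii=False)

pair = CorrectionPair(
    problem_id="thermo-001",  # hypothetical identifier
    wrong_output="Applied PV = nRT to obtain the pressure.",
    identified_error="Ideal gas law used without checking the density regime.",
    expert_explanation="At this molar volume intermolecular forces are not negligible.",
    corrected_version="Use a real-gas equation of state and state its validity range.",
)
print(pair.to_jsonl())
```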

5. Mathematics and code are treated as separate modalities

Most models can write mathematics. Most can write code. Very few treat them as complementary verification tools.

Every task we deliver in the Frontier STEM datasets ships with a self-contained Python verification script that independently confirms the mathematical answer. The verification code uses a different method than the analytic solution — symbolic computation, numerical simulation, dimensional checks — to provide genuine cross-verification rather than circular confirmation. A symbolic computation that re-derives the same result via a different algebraic path, or a numerical simulation that checks the analytic answer against a Monte Carlo estimate, provides a verification signal that's structurally independent from the original derivation.

Training on these pairs develops a model that treats computational verification as a natural part of mathematical reasoning — moving from math to code and back. For reliable STEM problem-solving, that bidirectional capability is essential.
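As a toy illustration of cross-verification by an independent method, the sketch below checks a closed-form integral against a hit-or-miss Monte Carlo estimate. The problem, the function names, the sample count, and the five-sigma acceptance rule are all assumptions made for this example; it is not one of the verification scripts shipped with the datasets.

```python
import math
import random

def analytic_answer():
    """Closed form of the integral of sqrt(1 - x^2) over [0, 1]."""
    return math.pi / 4

def monte_carlo_estimate(n_samples, seed=0):
    """Hit-or-miss estimate: fraction of uniform points in the unit square under the curve.

    Returns the estimate and its binomial standard error.
    """
    rng = random.Random(seed)  # fixed seed so the check is reproducible
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    p = hits / n_samples
    std_err = math.sqrt(p * (1.0 - p) / n_samples)
    return p, std_err

def cross_verify(n_samples=200_000, n_sigma=5.0):
    """Accept the analytic answer if it lies within n_sigma standard errors of the estimate."""
    estimate, std_err = monte_carlo_estimate(n_samples)
    return abs(estimate - analytic_answer()) <= n_sigma * std_err

print(cross_verify())
```

The simulation never uses the antiderivative, so an algebra slip in the analytic derivation would not carry over into the check. That independence is what separates cross-verification from circular confirmation.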

What the Data Needs to Look Like

Pulling it together as a specification:

Novelty and contamination control: Tasks freshly authored and structurally disjoint from existing benchmarks. For scientific computing tasks, a synthetic subset designed independently from published literature — the paper describing an algorithm is likely in the pretraining corpus even if the code never was.

Problem formulations: Complete, fully-constrained problems from graduate qualifying exams and textbooks. Not paraphrased. Not simplified. Every constraint is explicit, with justification for why it's necessary.

Reasoning chains: Verbose, step-by-step expert derivations. Target length is pages, not paragraphs. Every approximation is justified. Every intermediate step shown. Alternative solution paths included where they exist.

Domain coverage: Disproportionate investment in high-failure subdomains — EM, thermodynamics and statistical mechanics, quantum mechanics, functional analysis, algebraic topology — rather than uniform STEM distribution.

Error-correction pairs: Original generated content, per-step evaluation, expert error explanation, corrected version. Captured as a natural byproduct of the expert annotation workflow.

Cross-verification pairs: Mathematical derivation accompanied by Python verification code using an independent method. The code and the mathematics should not share derivation logic.

Scientific computing tasks: Step-decomposed implementations with individual function headers, explicit intermediate outputs, ground-truth type and shape information, and a synthetic subset structurally disjoint from the public SciCode benchmark.

The Expert Requirement

The common thread: these failures are happening at the level of scientific correctness, not surface presentation. Automated pipelines can scaffold the workflow, but the content requires PhD-level domain expertise. Problem specification, solution correctness, approximation validity, error identification — these judgments can't be delegated to annotators without domain depth.

Every task in our Frontier STEM dataset is authored end-to-end by a PhD-level expert. The expert-led approach is not a quality assurance layer on top of automated generation — it's the method.

If you're building toward frontier STEM capabilities and want to see what this data looks like in practice — the task structure, the verification format, the domain and difficulty distribution — we're sharing sample packages for both the Frontier STEM and SciCode datasets. Connect with our team to access sample data.
