How Toloka helped poolside define and measure AI quality for developers
Challenge:
As a leader in proprietary software engineering models, our client poolside faced a key challenge: how to reliably determine which of two responses from their model is genuinely more helpful to developers, a judgment that is often subjective and hard to standardize.
Solution:
We built a custom evaluation framework for poolside centered on pairwise comparison. The process grounds each judgment in the developer's intent and task category, applies focused checks, and uses consistent criteria to compare responses directly.
Impact:
Our approach provided poolside with consistent, context-grounded evaluations. This gave their teams a shared baseline for quality and actionable insights into where their models perform well and where they need improvement.
In technical settings, accuracy alone doesn't tell the full story. An answer can be technically correct without necessarily being helpful to the developer who needs to use it. poolside, a leader in proprietary software engineering models, needed a dependable way to measure an answer's true usefulness: something beyond simple accuracy.
To solve this, we treated subjective quality as something that could be measured. Our solution is centered on pairwise comparison, a process that becomes repeatable and reliable when anchored in clear standards. Evaluators don’t just guess; they review two responses to the same prompt, choose the stronger one, and explain their reasoning, turning subjective judgment into structured, actionable data.
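What "structured, actionable data" looks like can be pictured as a single record per comparison. The sketch below is a minimal Python illustration; the field names are assumptions for the sake of the example, not poolside's or Toloka's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    """Possible outcomes of a pairwise comparison."""
    LEFT = "left"        # the first response is stronger
    RIGHT = "right"      # the second response is stronger
    NEITHER = "neither"  # both responses fall short


@dataclass
class PairwiseJudgment:
    """One structured, reviewable record of an evaluator's decision."""
    prompt: str          # the developer's original request
    response_left: str   # candidate answer shown on the left
    response_right: str  # candidate answer shown on the right
    verdict: Verdict     # which response the evaluator preferred
    rationale: str       # the written explanation behind the choice
```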
To turn this idea into a reliable process, we first had to define what "better" actually means. A helpful answer isn't a one-size-fits-all concept; it changes dramatically based on the developer's goal. The foundation of our framework, therefore, is understanding the user’s intent by breaking down requests into specific task categories.
Task categories and evaluation guidelines
Not every developer prompt looks the same, so the definition of “helpful” shifts depending on what’s being asked. We break tasks into categories that anchor evaluation in context.
Information requests are about clarity
Developers want concise answers they can absorb with ease. The best responses address the question directly, keeping focus on what was asked and including code only when it's requested. For example, if someone asks how a function works, adding speculative code changes risks misleading them rather than helping.
Troubleshooting is investigative
Here the assistant needs to suggest the most likely root cause and outline practical next steps. If the description is incomplete, it's fine to propose multiple possible causes, but they need to be realistic and well-grounded. We prioritize responses that acknowledge uncertainty and rank possibilities by likelihood rather than presenting them as equally valid. Longer explanations aren't necessarily better; what matters is a sequence of steps that leads to resolution.
Code generation tasks demand ready-to-use snippets that fit straight into a project
The right answer provides exactly what was requested and avoids excess, such as scaffolding for testing or deployment unless explicitly required. Developers expect snippets they can copy and run immediately, without needing to remove unnecessary boilerplate or unrelated extras.
Some principles apply across every category. Responses need to be technically correct and clear about their limits, while also signaling when a request is unsafe or impossible. Hallucinations, overconfidence, and detours into irrelevant detail all weigh heavily against an answer.
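One way to picture these guidelines is as a mapping from task category to the checks an evaluator works through, plus a set of checks that always apply. The category names and check wording below paraphrase the guidelines above for illustration; they are not the exact rubric used in the project.

```python
# Illustrative rubric: category names and check wording paraphrase the
# guidelines above and are not the exact criteria used in the project.
CATEGORY_CHECKS = {
    "information_request": [
        "answers the question that was actually asked",
        "stays concise and easy to absorb",
        "includes code only when it was requested",
    ],
    "troubleshooting": [
        "identifies the most likely root cause",
        "ranks alternative causes by likelihood, acknowledging uncertainty",
        "outlines practical next steps toward resolution",
    ],
    "code_generation": [
        "provides exactly the snippet that was requested",
        "can be copied and run without stripping extras",
        "omits unrequested scaffolding such as tests or deployment config",
    ],
}

# Checks that apply to every category.
UNIVERSAL_CHECKS = [
    "technically correct",
    "clear about its limits",
    "flags unsafe or impossible requests",
    "free of hallucinations, overconfidence, and irrelevant detail",
]
```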
How our evaluation framework works

Our framework is designed to turn subjective judgment into a systematic process, centered on one core question: which answer gives the developer the clearest and most direct route to a solution? To answer this reliably, we use experts with first-hand knowledge of the development process—from writing and reviewing code to solving production issues.
The evaluation itself follows a structured, multi-step approach, sketched in code after the steps below:
Categorize Intent: Each evaluation begins by mapping the user’s request to a task category (e.g., Information, Troubleshooting, or Code Generation). This sets the baseline for what a good response should look like.
Apply the Rubric: Using a rubric tailored to the task, evaluators run through targeted checks. For code generation, this includes syntax validation and security vulnerability scans, while information requests are scored primarily on accuracy and conciseness.
Compare and Decide: With the context in mind, the two answers are reviewed side-by-side. The evaluator applies the judging criteria and makes a call: left, right, or neither. If a response breaches a critical system constraint—for example, by producing disallowed content—that alone may be enough to reject it.
Justify the Choice: The final, and most crucial, step is the written explanation. This justifies the decision and captures the evaluator’s thought process in a structured way, allowing the poolside team to see not only which answer scored better, but why.
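Put together, the four steps amount to a single pass over a prompt and its two candidate responses. The sketch below is only an outline under assumed names: each callable stands in for a human evaluator's judgment, and the returned dictionary mirrors the record described earlier.

```python
from typing import Callable


def run_evaluation(
    prompt: str,
    left: str,
    right: str,
    categorize: Callable[[str], str],
    apply_rubric: Callable[[str, str, str], dict],
    compare: Callable[[dict, dict], str],
    justify: Callable[[str], str],
) -> dict:
    """Walk one prompt and two candidate responses through the four steps."""
    category = categorize(prompt)                        # Step 1: map the request to a task category
    checks_left = apply_rubric(category, prompt, left)   # Step 2: targeted rubric checks per response
    checks_right = apply_rubric(category, prompt, right)
    verdict = compare(checks_left, checks_right)         # Step 3: "left", "right", or "neither"
    rationale = justify(verdict)                         # Step 4: written explanation of the decision
    return {
        "category": category,
        "checks": {"left": checks_left, "right": checks_right},
        "verdict": verdict,
        "rationale": rationale,
    }
```

The key design choice is that the written rationale is part of the output rather than an afterthought, which is what makes the resulting dataset useful beyond a simple win/loss count.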
To keep these judgments consistent, we support our experts with structured onboarding and continuously monitor inter-annotator agreement, running recalibration rounds whenever alignment on quality standards diverges from our 85% target. Over time, these detailed explanations grow into a valuable dataset that reveals recurring errors and opportunities for model improvement.
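Agreement itself can be tracked with a simple percent-agreement measure like the sketch below (a chance-corrected statistic such as Cohen's kappa is a common alternative); the 85% figure refers to the target mentioned above, and all names here are illustrative.

```python
from itertools import combinations


def mean_pairwise_agreement(labels_by_annotator: list[list[str]]) -> float:
    """Average fraction of items on which each pair of annotators agrees.

    labels_by_annotator[i][k] is annotator i's verdict on item k
    ("left", "right", or "neither"). Assumes at least two annotators
    who all labeled the same items.
    """
    scores = []
    for a, b in combinations(labels_by_annotator, 2):
        matches = sum(x == y for x, y in zip(a, b))
        scores.append(matches / len(a))
    return sum(scores) / len(scores)


# Example: two annotators agree on 3 of 4 items (75%), which would fall
# below an 85% target and trigger a recalibration round.
agreement = mean_pairwise_agreement([
    ["left", "right", "neither", "left"],
    ["left", "right", "left", "left"],
])
print(f"{agreement:.0%}")  # 75%
```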
Judging better vs. worse responses
In practice, the differences between answers vary. Sometimes both are strong, but one explains with slightly more precision. Other times the contrast is starker, with one response giving a straightforward solution while the other strays into irrelevant details or outright mistakes. When both are poor, neither is selected.
Every decision is documented. Evaluators don’t just pick “left” or “right”; they explain why. Reasoning exposes the fine line between acceptable and genuinely helpful, while also making the process transparent enough for teams to trust the outcomes.
For instance, in one evaluation, a user asked for clarification on a function's behavior. Both models provided responses, but only one was chosen. The winning answer gave accurate, relevant information. The other included a code modification that looked plausible but would have degraded performance.
The evaluator noted specific technical flaws: the left response suggested a modification that would increase time complexity from O(n) to O(n²), demonstrating how plausible-sounding advice can hide performance pitfalls. Without pairwise comparison, that mistake might have slipped through as "technically correct enough."
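The code from that evaluation isn't reproduced here, but the pattern is a familiar one. As a hypothetical illustration of the same complexity trap, replacing a set-based membership check with a list-based one reads like a harmless simplification, yet turns a linear pass into quadratic work:

```python
def dedupe(items):
    """O(n): membership checks against a set are constant time."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


def dedupe_plausible_rewrite(items):
    """Looks simpler, but `item not in result` scans a list on every
    iteration, so the overall work grows to O(n^2)."""
    result = []
    for item in items:
        if item not in result:
            result.append(item)
    return result
```

Both functions return the same output, which is exactly why this kind of regression tends to pass a correctness-only check and only surfaces under expert review or at scale.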
The value of consistent evaluation
An evaluation framework should deliver more than a simple ranking. It has to capture what "better" means in the moment and preserve the reasoning behind each choice, while also revealing how model behavior shifts over time. Pairwise comparison delivers this by grounding judgments in consistent criteria and expert review, turning intuition into evidence.
For poolside, that meant turning quality control from a subjective exercise into a repeatable process. For model developers more broadly, it creates a shared baseline, a way to compare results with confidence and to see both where models fall short and how they change over time. The data generated by these evaluations becomes a feedback mechanism that pushes systems closer to being genuinely useful for developers in the real world.
How confident are you in your model’s answers?
If you’re relying on automated metrics alone, you may be missing the nuances that separate an adequate response from one a developer can actually use. Toloka’s evaluation process brings in human judgment at scale, designed for the realities of software engineering.
Contact us to learn how we can help you build evaluation sets that don’t just measure accuracy, but capture true developer usefulness.