Manufacturing
Production order lifecycle, lot management, material allocation with buffer rules, CAPA tracking, and inventory control.
Domain agentic intelligence index
We test models on private, non-contaminated tasks.
Here's what we found.
Composite pass^5 score across tool-use evaluations (higher is better).
Error bars show 95% confidence intervals.
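As a rough illustration of how such error bars can be produced, the sketch below computes a percentile bootstrap interval over per-task scores. The Methodology section states that bootstrap resampling with 1000 iterations is used; the percentile method, the resampling unit (tasks), and the names `bootstrap_ci` and `task_scores` are assumptions made for this example, not details confirmed by the benchmark.

```python
import random

def bootstrap_ci(task_scores, n_iterations=1000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean per-task score.

    Assumption: tasks are the resampling unit and the percentile method is used.
    """
    rng = random.Random(seed)
    n = len(task_scores)
    means = []
    for _ in range(n_iterations):
        # Resample tasks with replacement and record the mean of the resample.
        sample = rng.choices(task_scores, k=n)
        means.append(sum(sample) / n)
    means.sort()
    alpha = (1 - confidence) / 2
    lower = means[int(alpha * n_iterations)]
    upper = means[min(n_iterations - 1, int((1 - alpha) * n_iterations))]
    return lower, upper

# Hypothetical data: 84 tasks, 40 of which passed under the chosen metric.
scores = [1.0] * 40 + [0.0] * 44
low, high = bootstrap_ci(scores)
```

Fixing the seed keeps the interval reproducible between runs, which makes comparisons across models easier to audit.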
Scaling curve
k = 1…5 runs
pass^k — Consistency
% tasks passed in every one of k runs.
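One way to compute this metric for every k from a fixed number of recorded runs is the combinatorial estimator sketched below: for a task with c successes in n runs, the chance that k runs drawn without replacement all succeed is C(c, k) / C(n, k). Whether this page uses exactly this estimator is an assumption; the function name `pass_hat_k` and the sample data are hypothetical.

```python
from math import comb

def pass_hat_k(successes_per_task, n_runs, k):
    """Estimate pass^k: the share of tasks expected to pass in all of k runs.

    successes_per_task holds, for each task, how many of n_runs succeeded.
    """
    if not 1 <= k <= n_runs:
        raise ValueError("k must be between 1 and n_runs")
    # comb(c, k) is 0 when c < k, so a task with fewer than k successes scores 0.
    per_task = [comb(c, k) / comb(n_runs, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)

# Hypothetical results for three tasks, five runs each.
successes = [5, 4, 0]
curve = [pass_hat_k(successes, n_runs=5, k=k) for k in range(1, 6)]
```

With k equal to the number of runs, the estimator reduces to the plain fraction of tasks that passed every run, which matches the definition above.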
Task difficulty distribution
Tasks bucketed by aggregate success rate
Buckets show difficulty tiers based on the models' aggregate results on the benchmarking subset.
100%: 2 of 84 tasks (2%)
75%+: 25 of 84 tasks (30%)
50%+: 35 of 84 tasks (42%)
25%+: 20 of 84 tasks (24%)
0%: 2 of 84 tasks (2%)
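The bucketing above could be reproduced along the following lines. This is a minimal sketch: the exact threshold semantics are inferred from the bucket labels, and the `<25%` label for rates strictly between 0% and 25% is a hypothetical addition, since the page does not show how such tasks are handled.

```python
from collections import Counter

def difficulty_bucket(success_rate):
    """Map a task's aggregate success rate (0.0 to 1.0) to a tier label."""
    if success_rate >= 1.0:
        return "100%"
    if success_rate >= 0.75:
        return "75%+"
    if success_rate >= 0.50:
        return "50%+"
    if success_rate >= 0.25:
        return "25%+"
    if success_rate == 0.0:
        return "0%"
    return "<25%"  # hypothetical tier, not shown on the page

def bucket_counts(success_rates):
    """Count how many tasks fall into each tier."""
    return Counter(difficulty_bucket(rate) for rate in success_rates)

# Hypothetical aggregate success rates for six tasks.
counts = bucket_counts([1.0, 0.8, 0.6, 0.55, 0.3, 0.0])
```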
Example task
User Request
Correct Agent Solution
What Is Tested
Methodology
Built on Sierra's TAU-Bench. 17 tools, 9 JSON databases, verified by manufacturing subject matter experts. Runs are scored against golden trajectories by comparing database state hashes. pass^k = all k runs succeed. pass@k = at least 1 of k runs succeeds. Confidence intervals are computed via bootstrap resampling (1000 iterations).
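To make the scoring step concrete, the sketch below shows one plausible way to compare database states by hash: serialize the JSON state canonically, hash it, and check it against the hash of the state produced by the golden trajectory. The canonical form, the choice of SHA-256, and the table and field names are assumptions for illustration, not the benchmark's actual implementation.

```python
import hashlib
import json

def state_hash(databases):
    """Hash a JSON-serializable database state in a key-order-independent way."""
    # Sorting keys and fixing separators gives one canonical string per state.
    canonical = json.dumps(databases, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def run_passes(final_state, golden_state):
    """A run passes when its final state hashes to the golden state's hash."""
    return state_hash(final_state) == state_hash(golden_state)

# Hypothetical states; table and field names are made up for this example.
golden = {"production_orders": {"PO-1": {"status": "released", "lot": "L-7"}}}
same = {"production_orders": {"PO-1": {"lot": "L-7", "status": "released"}}}
wrong = {"production_orders": {"PO-1": {"status": "on_hold", "lot": "L-7"}}}
```

Scoring on end state rather than on the exact sequence of tool calls lets an agent pass by any route that leaves the databases in the expected condition.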
TAU manufacturing dataset available for purchase
License the TAU manufacturing RL Gym: 90 tasks, 19 tools, and 9 databases with golden trajectories. Use it for training, RLHF, or internal benchmarking.