ML Productivity

ML Productivity tasks evaluate an agent's ability to perform the analytical work of an ML engineer: reading training logs, interpreting loss curves, isolating confounds in hyperparameter sweeps, and grounding explanations in framework source code. All tasks run in CPU-only sandboxes with staged artifacts from real TRL, transformers, and PyTorch training runs. Network access is disabled; the agent must reason from the data provided.

Tasks: 1
Models: 12
Mean Score: 0.54 (across all tasks)
Pass@1: 15% (threshold: score 0.80)

Cross-Model Performance

Aggregate scores across all tasks in this project. Each model was evaluated over 10 independent rollouts per task at temperature 1.0 on the HUD evaluation harness.

Model                         Mean
claude-opus-4 [max]           73% ±5%
grok-4.3 [reasoning]          60% ±6%
gpt-5.5 [xhigh]               55% ±7%
gemini-2.5-pro [high]         50% ±5%
claude-sonnet-4.6 [high]      46% ±4%
claude-sonnet-4 [high]        42% ±5%
gemini-3.5-flash [medium]     38% ±4%
gpt-5.4-mini [xhigh]          34% ±3%
kimi-k2.6 [high]              32% ±2%
minimax-m3 [high]             28% ±4%
glm-5.1 [high]                24% ±1%
grok-build-0.1 [reasoning]    20% ±2%
All models run on HUD evaluation harness · 10 rollouts × temperature 1.0 · pass@k threshold: score 0.80
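
The page does not say how the harness aggregates rollouts, so the following is only a minimal sketch of how a mean score and a Pass@1 estimate could be derived from 10 rollout scores for one model on one task. The function name `summarize`, the example scores, and the choice to treat the 0.80 threshold as inclusive are all assumptions made for illustration, not details taken from the HUD harness.

```python
from statistics import mean

# Threshold quoted on the page; treating it as inclusive (>=) is an assumption.
PASS_THRESHOLD = 0.80


def summarize(rollout_scores, threshold=PASS_THRESHOLD):
    """Return (mean score, Pass@1 estimate) for one model on one task.

    Pass@1 is estimated as the fraction of independent rollouts whose
    score reaches the threshold.
    """
    if not rollout_scores:
        raise ValueError("need at least one rollout score")
    passes = sum(1 for score in rollout_scores if score >= threshold)
    return mean(rollout_scores), passes / len(rollout_scores)


# Hypothetical scores for 10 rollouts (not real benchmark data).
example_scores = [1.0, 0.5, 0.75, 0.25, 0.5, 0.8, 0.25, 0.5, 0.75, 0.0]

mean_score, pass_at_1 = summarize(example_scores)
print(f"mean={mean_score:.2f} pass@1={pass_at_1:.0%}")
```

With these made-up scores, two of the ten rollouts reach the threshold, which shows how a Pass@1 figure can sit well below the mean score.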
