ML Productivity

ML Productivity tasks evaluate an agent's ability to perform the analytical work of an ML engineer: reading training logs, interpreting loss curves, isolating confounds in hyperparameter sweeps, and grounding explanations in framework source code. All tasks run in CPU-only sandboxes with staged artifacts from real TRL, transformers, and PyTorch training runs. Network access is disabled; the agent must reason from the data provided.

Tasks

Models

Mean Score: 0.54 (across all tasks)

Pass@1: 15% (score ≥ 0.80)

Leaderboard

Cross-Model Performance

Aggregate scores across all tasks in this project. Each model was evaluated over 10 independent rollouts per task at temperature 1.0 on the HUD evaluation harness.

Model                         Mean
claude-opus-4 [max]           73% ±5%
grok-4.3 [reasoning]          60% ±6%
gpt-5.5 [xhigh]               55% ±7%
gemini-2.5-pro [high]         50% ±5%
claude-sonnet-4.6 [high]      46% ±4%
claude-sonnet-4 [high]        42% ±5%
gemini-3.5-flash [medium]     38% ±4%
gpt-5.4-mini [xhigh]          34% ±3%
kimi-k2.6 [high]              32% ±2%
minimax-m3 [high]             28% ±4%
glm-5.1 [high]                24% ±1%
grok-build-0.1 [reasoning]    20% ±2%

All models were run on the HUD evaluation harness · 10 rollouts × temperature 1.0 · pass@k threshold: score ≥ 0.80
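To make the reported metrics concrete, here is a minimal sketch of how a mean score and pass@k could be derived from 10 rollout scores with the 0.80 threshold. It assumes the commonly used unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k); the page does not state which estimator the harness actually uses, and the function names and rollout scores below are hypothetical.

```python
from math import comb

PASS_THRESHOLD = 0.80  # from the page: a rollout passes when score >= 0.80


def mean_score(scores):
    """Average score over all rollouts."""
    return sum(scores) / len(scores)


def pass_at_k(scores, k, threshold=PASS_THRESHOLD):
    """Unbiased pass@k estimate: chance that at least one of k rollouts,
    drawn without replacement from the n observed, meets the threshold.
    Assumption: the harness may compute pass@k differently."""
    n = len(scores)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of rollouts")
    c = sum(s >= threshold for s in scores)  # number of passing rollouts
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical scores for one model on one task (10 rollouts); not real results.
rollouts = [0.85, 0.40, 0.55, 0.90, 0.30, 0.60, 0.45, 0.70, 0.50, 0.35]

print(f"mean    = {mean_score(rollouts):.2f}")
print(f"pass@1  = {pass_at_k(rollouts, 1):.2f}")
print(f"pass@10 = {pass_at_k(rollouts, 10):.2f}")
```

Under this estimator, pass@1 is simply the fraction of rollouts that clear the threshold, while pass@10 over 10 rollouts is 1 whenever any single rollout passes. That is one way a mean score can sit well below 0.80 while pass@10 remains high.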

Tasks (1)

Beta Hyperparameter Sensitivity

Analyze beta hyperparameter dynamics across DPO and KTO alignment training runs, extract metrics from heterogeneous artifacts, read TRL source code, and synthesize a diagnostic report.

Mean 0.54 · P@1 15% · P@10 55% · 10 criteria · 19 models
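For context on what the beta hyperparameter controls, the following is a minimal sketch of the standard sigmoid DPO loss for a single preference pair, as given in the original DPO formulation. It is not TRL's implementation and is not taken from the task artifacts; the function name and the log-probability values are hypothetical, and KTO uses a different objective that is not shown here.

```python
import math


def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta):
    """Sigmoid DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * ((pi_c - pi_r) - (ref_c - ref_r))))
    where each term is a summed sequence log-probability.
    """
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    logits = beta * (policy_margin - ref_margin)
    # -log(sigmoid(x)) == softplus(-x), written in a numerically stable form
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# Hypothetical log-probabilities; the policy prefers the chosen response
# more strongly than the reference does (margin difference = 2.0).
example = dict(policy_chosen_logp=-12.0, policy_rejected_logp=-15.0,
               ref_chosen_logp=-13.0, ref_rejected_logp=-14.0)

for beta in (0.01, 0.1, 0.5):
    print(f"beta={beta:<4} loss={dpo_sigmoid_loss(beta=beta, **example):.4f}")
```

Because beta scales the margin before the sigmoid, the same log-probability gap yields a loss close to log 2 at small beta and a much smaller loss at large beta. This scaling is one reason loss curves from runs with different beta values are not directly comparable.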
