ML Productivity
ML Productivity tasks evaluate an agent's ability to perform the analytical work of an ML engineer: reading training logs, interpreting loss curves, isolating confounds in hyperparameter sweeps, and grounding explanations in framework source code. All tasks run in CPU-only sandboxes with staged artifacts from real TRL, transformers, and PyTorch training runs. Network access is disabled; the agent must reason from the data provided.
Cross-Model Performance
Scores are aggregated across all tasks in this project. Each model was evaluated over 10 independent rollouts per task at temperature 1.0 on the HUD evaluation harness.
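As a rough illustration of what an aggregate over rollouts can look like, the sketch below averages each task's rollout scores and then averages those per-task means. This macro-average is an assumption made for illustration; the text above does not specify the harness's actual aggregation formula. The task names and score values are hypothetical placeholders, not real results.

```python
from statistics import mean


def aggregate_score(rollout_scores: dict[str, list[float]]) -> float:
    """Mean over rollouts within each task, then mean over tasks.

    ASSUMPTION: macro-averaging is chosen for illustration only; the
    actual harness may weight tasks or rollouts differently.
    """
    per_task = {task: mean(scores) for task, scores in rollout_scores.items()}
    return mean(list(per_task.values()))


# Hypothetical data: two made-up tasks, 10 rollouts each.
example = {
    "task_a": [1.0] * 5 + [0.0] * 5,  # half of the rollouts succeed
    "task_b": [1.0] * 10,             # every rollout succeeds
}

overall = aggregate_score(example)
print(overall)
```

Averaging per task first keeps every task equally weighted even if the number of rollouts per task were to differ.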

