Jigsaw Data

Tasks that evaluate the real day-to-day work of an ML engineer: running GPU training jobs, repairing framework internals, debugging training runs, and tracing data pipelines.


Greenfield engineering tasks where agents build complete systems from a blank workspace and a dense specification: binary codecs, constraint engines, and protocol implementations, graded by hidden deterministic test suites.


[Leaderboard chart: claude-opus-4-6 and claude-opus-4-7, axis from 0% to 100%]

Decompilation

Matching-decompilation tasks where agents lift ARM assembly from shipped retail binaries back into C that recompiles byte-identical, a capability that underpins cybersecurity and vulnerability research.

Calibration in progress · scorecards coming soon

Verifiable data to hillclimb frontier models.

Model progress is moving into the long-horizon work where models are actually deployed, and it is bounded by the realism of the data behind it.

We reconstruct real workflows into environments in which models are trained and measured, graded deterministically and traceable end to end. Models rise to the fidelity of their data, and we build the ground truth they climb.