Long-horizon planning
Long-horizon planning capability involves making sequential plans with multiple interdependent steps unfolding over long time horizons, adapting plans in response to obstacles or adversaries, and generalizing planning abilities to novel settings without heavy reliance on trial and error (Shevlane et al., 2023).
METR Horizon Benchmark
The METR Horizon Benchmark measures AI agents' ability to autonomously complete long-horizon tasks that take human professionals hours to days. It quantifies a model's "50% time horizon", the length of task (in human-expert time) that the model completes with 50% probability, and tracks the exponential growth of that horizon across model releases. Scores are normalized so that 100% corresponds to a 50% success rate on tasks requiring 40 human-expert hours, the threshold METR identifies as potentially enabling a 10x acceleration of AI R&D. Task lengths are inferred from software engineering tasks, so the exact times may differ in other domains (METR, 2025).
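The normalization described above can be sketched in a few lines. This is a minimal illustration that assumes the score scales linearly with the 50% time horizon; the text only fixes the 100% point at 40 human-expert hours, so the linear form is an assumption. The function names are hypothetical, not part of any METR tooling.

```python
# Assumption: the normalized score is linear in the 50% time horizon.
# The source only states that 100% corresponds to a 40-hour horizon.
TARGET_HOURS = 40.0


def normalized_score(horizon_hours: float, target_hours: float = TARGET_HOURS) -> float:
    """Convert a 50% time horizon (hours) to a percentage of the target horizon."""
    if horizon_hours < 0:
        raise ValueError("horizon_hours must be non-negative")
    return 100.0 * horizon_hours / target_hours


def implied_horizon_hours(score_percent: float, target_hours: float = TARGET_HOURS) -> float:
    """Invert the normalization: recover the horizon (hours) implied by a score."""
    if score_percent < 0:
        raise ValueError("score_percent must be non-negative")
    return target_hours * score_percent / 100.0


if __name__ == "__main__":
    # Horizon implied by a 12% score, under the linear assumption.
    print(implied_horizon_hours(12.0))
```

Under this assumption, a score of 12% would correspond to a 50% time horizon of roughly 4.8 human-expert hours. Published scores are rounded to whole percentages, so the implied horizon is approximate.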
Current top score: 12% (claude-opus-4-5), with no update in 3 months.
Why this benchmark?
We selected METR's benchmark due to the organization's established reputation in AI safety evaluation, its inclusion of recent models, and its focus on realistic software tasks.
Complete model results
| Model | Normalized score | Release date |
|---|---|---|
| claude-opus-4-5 | 12% | 2025-11-24 |
| gpt-5-1-codex-max | 7% | 2025-11-19 |
| GPT-5 | 6% | 2025-08-07 |
| Claude Sonnet 4.5 | 5% | 2025-09-29 |
| claude-4-1-opus | 5% | 2025-08-05 |
| grok-4 | 5% | 2025-07-09 |
| o3 | 4% | 2025-04-16 |
| claude-4-opus | 4% | 2025-05-22 |
| o4-mini | 3% | 2025-04-16 |
| claude-4-sonnet | 3% | 2025-05-22 |
| kimi-k2-thinking | 2% | 2025-11-06 |
| claude-3-7-sonnet | 2% | 2025-02-24 |
| GPT OSS 120b | 2% | 2025-08-05 |
| o1-elicited | 2% | 2024-12-05 |
| gemini-2-5-pro-preview | 2% | 2025-06-05 |
| Deepseek R1 0528 | 1% | 2025-05-28 |
| claude-3-5-sonnet-20241022 | 1% | 2024-10-22 |
| DeepSeek R1 | 1% | 2025-01-20 |
| deepseek-v3-0324 | 1% | 2025-03-24 |
| O1 Preview | 1% | 2024-09-12 |
| claude-3-5-sonnet | 1% | 2024-06-20 |
| DeepSeek V3 | 1% | 2024-12-26 |
| GPT-4o | 0% | 2024-05-13 |
| GPT 4 1106 | 0% | 2023-11-06 |
| gpt-4-turbo | 0% | 2024-04-09 |
| Claude 3 Opus | 0% | 2024-03-04 |
| gpt-4-0125 | 0% | 2024-01-25 |
| GPT 4 | 0% | 2023-03-14 |
| Qwen 2.5 72B | 0% | 2024-09-15 |
| qwen-2-72b | 0% | 2024-06-07 |
| gpt-3-5-turbo-instruct | 0% | 2022-03-15 |
| Davinci 002 | 0% | 2022-11-01 |
| gpt2 | 0% | 2019-02-14 |
Source: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/