Long-horizon planning
Long-horizon planning is the capability to make sequential plans with multiple interdependent steps that unfold over long time horizons, to adapt those plans in response to obstacles or adversaries, and to generalize planning abilities to novel settings without heavy reliance on trial and error (Shevlane et al., 2023).
METR Horizon Benchmark
The METR Horizon Benchmark measures AI agents' ability to autonomously complete long-horizon tasks that take human professionals hours to days. It quantifies a model's "50% time horizon", the length of task (in human-expert time) that the model completes with 50% probability, and uses this figure to track exponential progress in autonomous AI capabilities. Scores are normalized so that 100% represents a 50% success rate on tasks requiring 8 human-expert hours. Task lengths are inferred from software engineering tasks, so the exact times may differ in other domains (METR, 2025).
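To make the definition concrete, the following Python sketch shows how a 50% time horizon and a normalized score could relate. It rests on two assumptions that the text above does not confirm: that success probability follows a logistic curve in the base-2 logarithm of task length, and that the score scales linearly with the time horizon (so 100% equals 8 hours). The function names and the `intercept` and `slope` parameters are hypothetical and chosen for illustration only; this is not METR's published procedure.

```python
import math

# From the text: a score of 100% corresponds to a 50% success rate
# on tasks requiring 8 human-expert hours.
REFERENCE_HOURS = 8.0


def success_probability(task_hours: float, intercept: float, slope: float) -> float:
    """Hypothetical logistic model: success falls as log2(task length) grows."""
    logit = intercept - slope * math.log2(task_hours)
    return 1.0 / (1.0 + math.exp(-logit))


def time_horizon_hours(intercept: float, slope: float) -> float:
    """Task length at which the modelled success probability is exactly 0.5.

    Setting the logit to zero gives log2(t) = intercept / slope.
    """
    return 2.0 ** (intercept / slope)


def normalized_score(horizon_hours: float) -> float:
    """Assumed linear normalization: horizon as a percentage of 8 hours."""
    return 100.0 * horizon_hours / REFERENCE_HOURS


def implied_horizon_hours(score_percent: float) -> float:
    """Inverse of normalized_score, under the same linearity assumption."""
    return REFERENCE_HOURS * score_percent / 100.0


if __name__ == "__main__":
    # Illustrative parameters only, not fitted to any real data.
    horizon = time_horizon_hours(intercept=1.0, slope=1.0)
    print(f"horizon: {horizon:.2f} h, score: {normalized_score(horizon):.0f}%")
    print(f"a 29% score implies roughly {implied_horizon_hours(29):.2f} h")
```

If the linear-scaling assumption holds, a score of 29% would correspond to a 50% time horizon of roughly 2.3 human-expert hours on software engineering tasks.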
Model scores
| Model | Score | Date |
|---|---|---|
| GPT-5 | 29% | 2025-08-07 |
| Claude Sonnet 4.5 | 24% | 2025-09-29 |
| grok-4 | 23% | 2025-07-09 |
| claude-4-1-opus | 22% | 2025-08-05 |
| o3 | 19% | 2025-04-16 |
| claude-4-opus | 17% | 2025-05-22 |
| o4-mini | 16% | 2025-04-16 |
| claude-4-sonnet | 14% | 2025-05-22 |
| claude-3-7-sonnet | 11% | 2025-02-24 |
| GPT OSS 120b | 9% | 2025-08-05 |
| o1-elicited | 8% | 2024-12-05 |
| gemini-2-5-pro-preview | 8% | 2025-06-05 |
| Deepseek R1 0528 | 6% | 2025-05-28 |
| claude-3-5-sonnet-20241022 | 6% | 2024-10-22 |
| DeepSeek R1 | 6% | 2025-01-20 |
| deepseek-v3-0324 | 5% | 2025-03-24 |
| O1 Preview | 5% | 2024-09-12 |
| DeepSeek V3 | 4% | 2024-12-26 |
| claude-3-5-sonnet | 4% | 2024-06-20 |
| GPT-4o | 2% | 2024-05-13 |
| GPT 4 1106 | 2% | 2023-11-06 |
| gpt-4-turbo | 1% | 2024-04-09 |
| Claude 3 Opus | 1% | 2024-03-04 |
| gpt-4-0125 | 1% | 2024-01-25 |
| GPT 4 | 1% | 2023-03-14 |
| Qwen 2.5 72B | 1% | 2024-09-15 |
| qwen-2-72b | 0% | 2024-06-07 |
| gpt-3-5-turbo-instruct | 0% | 2022-03-15 |
| Davinci 002 | 0% | 2022-11-01 |
| gpt2 | 0% | 2019-02-14 |