Long-horizon planning

Long-horizon planning capability involves making sequential plans with multiple interdependent steps unfolding over long time horizons, adapting plans in response to obstacles or adversaries, and generalizing planning abilities to novel settings without heavy reliance on trial and error (Shevlane et al., 2023).

METR Horizon Benchmark

The METR Horizon Benchmark measures AI agents' ability to autonomously complete long-horizon tasks that take human professionals hours to days. It quantifies a model's "50% time horizon", the length of task the model can complete with 50% probability, and tracks exponential progress in autonomous AI capabilities. Scores are normalized so that 100% represents a 50% success rate on tasks requiring 40 human-expert hours, the threshold METR identifies as potentially enabling a 10x acceleration of AI R&D. Task lengths are inferred from software engineering tasks, so the exact times may vary in other domains (METR, 2025).
Top score: 12% (claude-opus-4-5); no update in 3 months.
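
The relationship between a 50% time horizon and the normalized score can be sketched in a few lines of Python. This is an illustration under stated assumptions, not METR's implementation: it assumes the normalization is linear in the horizon length (the text only fixes the 100% point at 40 hours), and it models success probability as a logistic curve in the log of task length with a hypothetical `slope` parameter. The function names are made up for this example.

```python
import math

# From the text: 100% corresponds to a 50% time horizon of 40 human-expert hours.
TARGET_HORIZON_HOURS = 40.0


def normalized_score(horizon_hours: float) -> float:
    """Map a 50% time horizon (in hours) to a percentage score.

    Assumption: the normalization is linear in the horizon length.
    """
    return 100.0 * horizon_hours / TARGET_HORIZON_HOURS


def success_probability(task_hours: float, horizon_hours: float, slope: float = 1.0) -> float:
    """Illustrative logistic success curve in log2(task length).

    By construction the probability is exactly 0.5 when the task length
    equals the model's 50% time horizon. `slope` is a hypothetical
    steepness parameter, not a published value.
    """
    gap = math.log2(task_hours) - math.log2(horizon_hours)
    return 1.0 / (1.0 + math.exp(slope * gap))


if __name__ == "__main__":
    horizon = 4.8  # hypothetical horizon in hours, chosen for illustration
    print(f"normalized score: {normalized_score(horizon):.1f}%")
    for hours in (1.0, 4.8, 40.0):
        p = success_probability(hours, horizon)
        print(f"{hours:>5.1f} h task -> success probability {p:.2f}")
```

Under the linear assumption, a score of 12% would correspond to a 50% time horizon of roughly 4.8 hours; tasks shorter than the horizon succeed more often than not, and longer tasks succeed less often.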

Why this benchmark?

We selected METR's benchmark due to the organization's established reputation in AI safety evaluation, its inclusion of recent models, and its focus on realistic software tasks.

Complete model results

Model                       Score  Release date
claude-opus-4-5             12%    2025-11-24
gpt-5-1-codex-max           7%     2025-11-19
GPT-5                       6%     2025-08-07
Claude Sonnet 4.5           5%     2025-09-29
claude-4-1-opus             5%     2025-08-05
grok-4                      5%     2025-07-09
o3                          4%     2025-04-16
claude-4-opus               4%     2025-05-22
o4-mini                     3%     2025-04-16
claude-4-sonnet             3%     2025-05-22
kimi-k2-thinking            2%     2025-11-06
claude-3-7-sonnet           2%     2025-02-24
GPT OSS 120b                2%     2025-08-05
o1-elicited                 2%     2024-12-05
gemini-2-5-pro-preview      2%     2025-06-05
Deepseek R1 0528            1%     2025-05-28
claude-3-5-sonnet-20241022  1%     2024-10-22
DeepSeek R1                 1%     2025-01-20
deepseek-v3-0324            1%     2025-03-24
O1 Preview                  1%     2024-09-12
claude-3-5-sonnet           1%     2024-06-20
DeepSeek V3                 1%     2024-12-26
GPT-4o                      0%     2024-05-13
GPT 4 1106                  0%     2023-11-06
gpt-4-turbo                 0%     2024-04-09
Claude 3 Opus               0%     2024-03-04
gpt-4-0125                  0%     2024-01-25
GPT 4                       0%     2023-03-14
Qwen 2.5 72B                0%     2024-09-15
qwen-2-72b                  0%     2024-06-07
gpt-3-5-turbo-instruct      0%     2022-03-15
Davinci 002                 0%     2022-11-01
gpt2                        0%     2019-02-14