Long-horizon planning

Long-horizon planning capability involves making sequential plans with multiple interdependent steps unfolding over long time horizons, adapting plans in response to obstacles or adversaries, and generalizing planning abilities to novel settings without heavy reliance on trial and error (Shevlane et al., 2023).

METR Horizon Benchmark

The METR Horizon Benchmark measures AI agents' ability to autonomously complete long-horizon tasks that take human professionals hours to days. It quantifies a model's '50% time horizon' - the length of task the model can complete with 50% probability - a metric that has tracked exponential progress in autonomous AI capabilities. Scores are normalized so that 100% corresponds to a 50% success rate on tasks requiring 8 hours of human-expert work. Because the benchmark infers task lengths from software engineering tasks, the exact time horizons may differ in other domains (METR, 2025).
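The relationship between the normalized score and the 50% time horizon can be sketched in a few lines. This is a minimal illustration assuming a linear mapping between score and task length (100% corresponding to 8 hours); the linear form is our assumption for illustration, not a formula published by METR:

```python
# Convert between a normalized benchmark score and a 50% time horizon.
# Assumption (ours): score scales linearly with task length, with a
# 100% score corresponding to an 8-hour (480-minute) time horizon.

REFERENCE_HORIZON_MIN = 8 * 60  # 100% score == 8 human-expert hours

def score_to_horizon_minutes(score_pct: float) -> float:
    """Estimated 50% time horizon (minutes) implied by a score (in %)."""
    return score_pct / 100 * REFERENCE_HORIZON_MIN

def horizon_to_score(minutes: float) -> float:
    """Normalized score (%) implied by a 50% time horizon in minutes."""
    return minutes / REFERENCE_HORIZON_MIN * 100

# Under this assumption, GPT-5's 29% implies a horizon of roughly 2.3 hours:
print(round(score_to_horizon_minutes(29) / 60, 1))  # → 2.3
```

Under this reading, a model at 50% would be completing roughly 4-hour tasks half the time.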

Current top score: GPT-5, at 29%.

Model scores

Model | Score | Date
GPT-5 | 29% | 2025-08-07
Claude Sonnet 4.5 | 24% | 2025-09-29
grok-4 | 23% | 2025-07-09
claude-4-1-opus | 22% | 2025-08-05
o3 | 19% | 2025-04-16
claude-4-opus | 17% | 2025-05-22
o4-mini | 16% | 2025-04-16
claude-4-sonnet | 14% | 2025-05-22
claude-3-7-sonnet | 11% | 2025-02-24
GPT OSS 120b | 9% | 2025-08-05
o1-elicited | 8% | 2024-12-05
gemini-2-5-pro-preview | 8% | 2025-06-05
Deepseek R1 0528 | 6% | 2025-05-28
claude-3-5-sonnet-20241022 | 6% | 2024-10-22
DeepSeek R1 | 6% | 2025-01-20
deepseek-v3-0324 | 5% | 2025-03-24
O1 Preview | 5% | 2024-09-12
DeepSeek V3 | 4% | 2024-12-26
claude-3-5-sonnet | 4% | 2024-06-20
GPT-4o | 2% | 2024-05-13
GPT 4 1106 | 2% | 2023-11-06
gpt-4-turbo | 1% | 2024-04-09
Claude 3 Opus | 1% | 2024-03-04
gpt-4-0125 | 1% | 2024-01-25
GPT 4 | 1% | 2023-03-14
Qwen 2.5 72B | 1% | 2024-09-15
qwen-2-72b | 0% | 2024-06-07
gpt-3-5-turbo-instruct | 0% | 2022-03-15
Davinci 002 | 0% | 2022-11-01
gpt2 | 0% | 2019-02-14
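The exponential trend can be estimated directly from the scores above: fit a line to log2(score) against release date and read off the doubling time. A rough sketch using a few frontier data points from the table (the choice of points, and treating score as proportional to time horizon, are our simplifying assumptions):

```python
import math
from datetime import date

# (release date, score %) for a few frontier models from the table above
points = [
    (date(2023, 3, 14), 1),   # GPT 4
    (date(2024, 5, 13), 2),   # GPT-4o
    (date(2025, 4, 16), 19),  # o3
    (date(2025, 8, 7), 29),   # GPT-5
]

# Ordinary least-squares fit of log2(score) against time in days
t0 = points[0][0]
xs = [(d - t0).days for d, _ in points]
ys = [math.log2(s) for _, s in points]
n = len(points)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

# Days for the score (and, if the mapping is linear, the horizon) to double
doubling_days = 1 / slope
print(f"~{doubling_days / 30:.0f} month doubling time")
```

The exact figure depends heavily on which models are included; a fit over all rows, or over only the most recent releases, would give a different doubling time.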

Why this benchmark?

We selected METR's benchmark because of the organization's established reputation in AI safety evaluation, the benchmark's inclusion of recent models, and its focus on realistic software tasks.