Long-horizon planning

Long-horizon planning capability involves making sequential plans with multiple interdependent steps unfolding over long time horizons, adapting plans in response to obstacles or adversaries, and generalizing planning abilities to novel settings without heavy reliance on trial and error (Shevlane et al., 2023).

METR Horizon Benchmark

The METR Horizon Benchmark measures AI agents' ability to autonomously complete long-horizon tasks that take human professionals hours to days. It quantifies a model's '50% time horizon' - the length of task the model can complete with 50% probability - a metric that has tracked exponential progress in autonomous AI capabilities. Scores are normalized so that 100% corresponds to a 50% success rate on tasks requiring 8 hours of human-expert work. Because the benchmark infers task lengths from software engineering tasks, the exact time horizons may differ in other domains (METR, 2025).
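The relationship between the normalized score and the 50% time horizon can be sketched in a few lines. This is a minimal illustration assuming a linear mapping between score and task length (100% corresponding to 8 hours); the linear form is our assumption for illustration, not a formula published by METR:

```python
# Convert between a normalized benchmark score and a 50% time horizon.
# Assumption (ours): score scales linearly with task length, with a
# 100% score corresponding to an 8-hour (480-minute) time horizon.

REFERENCE_HORIZON_MIN = 8 * 60  # 100% score == 8 human-expert hours

def score_to_horizon_minutes(score_pct: float) -> float:
    """Estimated 50% time horizon (minutes) implied by a score (in %)."""
    return score_pct / 100 * REFERENCE_HORIZON_MIN

def horizon_to_score(minutes: float) -> float:
    """Normalized score (%) implied by a 50% time horizon in minutes."""
    return minutes / REFERENCE_HORIZON_MIN * 100

# Under this assumption, GPT-5's 29% implies a horizon of roughly 2.3 hours:
print(round(score_to_horizon_minutes(29) / 60, 1))  # → 2.3
```

Under this reading, a model at 50% would be completing roughly 4-hour tasks half the time.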

Current top score: GPT-5, at 29%.

Model scores

Model | Score | Date
GPT-5 | 29% | 2025-08-07
Claude Sonnet 4.5 | 24% | 2025-09-29
grok-4 | 23% | 2025-07-09
claude-4-1-opus | 22% | 2025-08-05
o3 | 19% | 2025-04-16
claude-4-opus | 17% | 2025-05-22
o4-mini | 16% | 2025-04-16
claude-4-sonnet | 14% | 2025-05-22
claude-3-7-sonnet | 11% | 2025-02-24
GPT OSS 120b | 9% | 2025-08-05
o1-elicited | 8% | 2024-12-05
gemini-2-5-pro-preview | 8% | 2025-06-05
Deepseek R1 0528 | 6% | 2025-05-28
claude-3-5-sonnet-20241022 | 6% | 2024-10-22
DeepSeek R1 | 6% | 2025-01-20
deepseek-v3-0324 | 5% | 2025-03-24
O1 Preview | 5% | 2024-09-12
DeepSeek V3 | 4% | 2024-12-26
claude-3-5-sonnet | 4% | 2024-06-20
GPT-4o | 2% | 2024-05-13
GPT 4 1106 | 2% | 2023-11-06
gpt-4-turbo | 1% | 2024-04-09
Claude 3 Opus | 1% | 2024-03-04
gpt-4-0125 | 1% | 2024-01-25
GPT 4 | 1% | 2023-03-14
Qwen 2.5 72B | 1% | 2024-09-15
qwen-2-72b | 0% | 2024-06-07
gpt-3-5-turbo-instruct | 0% | 2022-03-15
Davinci 002 | 0% | 2022-11-01
gpt2 | 0% | 2019-02-14
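The exponential trend can be estimated directly from the scores above: fit a line to log2(score) against release date and read off the doubling time. A rough sketch using a few frontier data points from the table (the choice of points, and treating score as proportional to time horizon, are our simplifying assumptions):

```python
import math
from datetime import date

# (release date, score %) for a few frontier models from the table above
points = [
    (date(2023, 3, 14), 1),   # GPT 4
    (date(2024, 5, 13), 2),   # GPT-4o
    (date(2025, 4, 16), 19),  # o3
    (date(2025, 8, 7), 29),   # GPT-5
]

# Ordinary least-squares fit of log2(score) against time in days
t0 = points[0][0]
xs = [(d - t0).days for d, _ in points]
ys = [math.log2(s) for _, s in points]
n = len(points)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

# Days for the score (and, if the mapping is linear, the horizon) to double
doubling_days = 1 / slope
print(f"~{doubling_days / 30:.0f} month doubling time")
```

The exact figure depends heavily on which models are included; a fit over all rows, or over only the most recent releases, would give a different doubling time.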

Why this benchmark?

We selected METR's benchmark because of the organization's established reputation in AI safety evaluation, the benchmark's inclusion of recent models, and its focus on realistic software tasks.