Benchmarks

Overview of representative benchmarks. Use the chart to compare progress across selected benchmarks and inspect recent scores.

Chemical weapons development

ChemBench

science

Chemical weapons development capability encompasses designing synthesis routes for chemical agents, providing actionable instructions for weapon assembly and deployment, making scientific discoveries that enable novel toxic compounds, and offering technical expertise that accelerates chemical weapon production.

Expert: 41%
Claude 3.7 Sonnet 66%
o1 Preview 64%
Claude 3.5 Sonnet 63%

Political strategy

ForecastBench

reasoning

Political strategy capability requires performing the social modeling and planning necessary to gain and exercise political influence in scenarios with multiple actors and rich social context (Shevlane et al., 2023).

GPT-5-2025-08-07 (zero shot with crowd forecast) 23%
Claude-Opus-4-1-20250805 (zero shot with crowd forecast) 22%
Gemini-2.5-Pro (zero shot with crowd forecast) 21%

Long-horizon planning

METR Horizon Benchmark

agentic

Long-horizon planning capability involves making sequential plans with multiple interdependent steps unfolding over long time horizons, adapting plans in response to obstacles or adversaries, and generalizing planning abilities to novel settings without heavy reliance on trial and error (Shevlane et al., 2023).

GPT-5 29%
Claude Sonnet 4.5 24%
grok-4 23%

Cyber-offense

CyBench

science

Cyber-offense capability encompasses discovering vulnerabilities in systems, writing exploit code, making effective decisions post-access, skillfully evading threat detection while pursuing objectives, and inserting subtle bugs for future exploitation when deployed as a coding assistant (Shevlane et al., 2023).

Expert: 100%
Claude Sonnet 4.5 55%
Claude Opus 4.1 38%
Claude Opus 4 38%

Situational awareness

Situational Awareness Dataset

reasoning

Situational awareness capability involves distinguishing between training, evaluation, and deployment contexts to behave differently in each case, knowing that one is a model, and having knowledge about oneself and likely surroundings including training company, server locations, feedback providers, and administrative access (Shevlane et al., 2023).

Expert: 90.7%
o1 Preview 60%
Claude 3.5 Sonnet 54%
o1 Mini 53%

AI development

MLE-bench

coding

AI development capability involves building new AI systems from scratch including systems with dangerous capabilities, adapting existing models to increase performance on extreme-risk-relevant tasks, and significantly improving the productivity of actors building dual-use AI capabilities (Shevlane et al., 2023).

Expert: 100%
gemini-2.5-pro - FM Agent 44%
gemini-2.5-pro - CAIR MLE-STAR-Pro 39%
deepseek-r1 - InternAgent 36%

Persuasion & manipulation

MakeMeSay

agentic

Persuasion and manipulation capability involves shaping people's beliefs through dialogue and other settings like social media, promoting narratives persuasively, and convincing people to take actions they wouldn't otherwise take, including unethical acts (Shevlane et al., 2023).

Gemini 2.0 Flash 60%
DeepSeek V3 60%
GPT-4o Mini 50%

Self-proliferation

RepliBench

agentic

Self-proliferation capability encompasses breaking out of local environments, exploiting monitoring system limitations, independently generating revenue through services or attacks, acquiring and operating cloud computing resources, and developing creative strategies for self-discovery or code exfiltration (Shevlane et al., 2023).

Claude 3.7 Sonnet 68%
Claude 3.5 Sonnet 60%
o1 52%

Biological weapons development

Virology Capability Test

science

Biological weapons development capability involves understanding pathogen engineering and cultivation methods, providing actionable guidance for biological weapon production and deployment, and making scientific discoveries that enable novel biological weapons. While most classes of microbes can be manipulated to pose catastrophic risks to humans, RNA viruses are the class most likely to have this capacity (Adalja, 2019).

Expert: 22.1%
Claude Opus 4.5 48%
o3 44%
Claude Opus 4.1 43%