Benchmarks

Overview of representative benchmarks. Use the chart to compare progress across selected benchmarks and inspect recent scores. Projections are fitted to historical state-of-the-art (SOTA) points, using S-curves (logistic) for bounded benchmarks and exponential growth for unbounded metrics, and are anchored to start at the most recent SOTA model. Click a legend entry or browse the list below to learn more about each benchmark and its associated dangerous capability.
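As a rough illustration of how a logistic projection anchored to the latest SOTA point might be computed, here is a minimal sketch. It is not the method used for the chart: the function names, the fitting approach (a least-squares line through the logit-transformed scores), the anchoring rule (keep the fitted slope, shift the intercept), and the sample data are all assumptions made for the example. An unbounded metric could be handled analogously by fitting a line to the logarithm of the score instead of its logit.

```python
import math


def logit(score, ceiling=1.0):
    """Log-odds of a score that lies strictly between 0 and the ceiling."""
    p = score / ceiling
    return math.log(p / (1.0 - p))


def fit_logistic_slope(times, scores, ceiling=1.0):
    """Least-squares slope of logit(score) against time (growth rate k)."""
    z = [logit(s, ceiling) for s in scores]
    mean_t = sum(times) / len(times)
    mean_z = sum(z) / len(z)
    cov = sum((t - mean_t) * (zi - mean_z) for t, zi in zip(times, z))
    var = sum((t - mean_t) ** 2 for t in times)
    return cov / var


def anchored_intercept(k, t_last, s_last, ceiling=1.0):
    """Intercept that makes the curve pass exactly through the latest SOTA point."""
    return logit(s_last, ceiling) - k * t_last


def project(t, k, b, ceiling=1.0):
    """Logistic curve value at time t."""
    return ceiling / (1.0 + math.exp(-(k * t + b)))


# Hypothetical SOTA history: months since first result, score as a fraction.
times = [0, 6, 12, 18]
scores = [0.20, 0.35, 0.50, 0.66]

k = fit_logistic_slope(times, scores)
b = anchored_intercept(k, times[-1], scores[-1])
projected = project(30, k, b)  # projection 12 months past the latest point
```

Because the intercept is recomputed from the latest point, the projected curve starts exactly at the most recent SOTA score and can never exceed the ceiling, which is the property that makes an S-curve a natural choice for a bounded benchmark.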


Chemical weapons development

ChemBench

Chemical weapons development capability encompasses designing synthesis routes for chemical agents, providing actionable instructions for weapon assembly and deployment, making scientific discoveries that enable novel toxic compounds, and offering technical expertise that accelerates chemical weapon production.

66%, Claude 3.7 Sonnet, 12 months with no update

Cyber-offense

CyBench

Cyber-offense capability encompasses discovering vulnerabilities in systems, writing exploit code, making effective decisions post-access, skillfully evading threat detection while pursuing objectives, and inserting subtle bugs for future exploitation when deployed as a coding assistant (Shevlane et al., 2023).

55%, Claude Sonnet 4.5, 5 months with no update

AI development

MLE-bench

AI development capability involves building new AI systems from scratch including systems with dangerous capabilities, adapting existing models to increase performance on extreme-risk-relevant tasks, and significantly improving the productivity of actors building dual-use AI capabilities (Shevlane et al., 2023).

61%, gemini-3-pro-preview - PiEvolve (Fractal AI Research), 3 months with no update

Long-horizon planning

METR Horizon Benchmark

Long-horizon planning capability involves making sequential plans with multiple interdependent steps unfolding over long time horizons, adapting plans in response to obstacles or adversaries, and generalizing planning abilities to novel settings without heavy reliance on trial and error (Shevlane et al., 2023).

12%, claude-opus-4-5, 3 months with no update

Political strategy

ForecastBench

Political strategy capability requires performing social modeling and planning necessary to gain and exercise political influence in scenarios with multiple actors and rich social context (Shevlane et al., 2023).

23%, GPT-5-2025-08-07 (zero shot with crowd forecast), 6 months with no update

Situational awareness

Situational Awareness Dataset

Situational awareness capability involves distinguishing between training, evaluation, and deployment contexts to behave differently in each case, knowing that one is a model, and having knowledge about oneself and likely surroundings including training company, server locations, feedback providers, and administrative access (Shevlane et al., 2023).

85%, Claude Opus 4.5, 3 months with no update

Persuasion & manipulation

MakeMeSay

Persuasion and manipulation capability involves shaping people's beliefs through dialogue and other settings like social media, promoting narratives persuasively, and convincing people to take actions they wouldn't otherwise take, including unethical acts (Shevlane et al., 2023).

60%, Gemini 2.0 Flash, 12 months with no update

Self-proliferation

RepliBench

Self-proliferation capability encompasses breaking out of local environments, exploiting monitoring system limitations, independently generating revenue through services or attacks, acquiring and operating cloud computing resources, and developing creative strategies for self-discovery or code exfiltration (Shevlane et al., 2023).

68%, Claude 3.7 Sonnet, 12 months with no update

Biological weapons development

Virology Capability Test

Biological weapons acquisition capability involves understanding pathogen engineering and cultivation methods, providing actionable guidance for biological weapon production and deployment, and making scientific discoveries that enable novel biological weapons. While most classes of microbes can be manipulated to pose catastrophic risks to humans, RNA viruses are the class most likely to have this capacity (Adalja, 2019).

48%, Claude Opus 4.5, 3 months with no update