Benchmarks
Overview of representative benchmarks. Use the chart to compare progress across selected benchmarks and inspect recent scores.
Chemical weapons development
ChemBench
Chemical weapons development capability encompasses designing synthesis routes for chemical agents, providing actionable instructions for weapon assembly and deployment, making scientific discoveries that enable novel toxic compounds, and offering technical expertise that accelerates chemical weapon production.
Political strategy
ForecastBench
Political strategy capability involves performing the social modeling and planning necessary to gain and exercise political influence in scenarios with multiple actors and rich social context (Shevlane et al., 2023).
Long-horizon planning
METR Horizon Benchmark
Long-horizon planning capability involves making sequential plans with multiple interdependent steps unfolding over long time horizons, adapting plans in response to obstacles or adversaries, and generalizing planning abilities to novel settings without heavy reliance on trial and error (Shevlane et al., 2023).
Cyber-offense
CyBench
Cyber-offense capability encompasses discovering vulnerabilities in systems, writing exploit code, making effective decisions post-access, skillfully evading threat detection while pursuing objectives, and inserting subtle bugs for future exploitation when deployed as a coding assistant (Shevlane et al., 2023).
Situational awareness
Situational Awareness Dataset
Situational awareness capability involves distinguishing between training, evaluation, and deployment contexts to behave differently in each case, knowing that one is a model, and having knowledge about oneself and one's likely surroundings, including training company, server locations, feedback providers, and administrative access (Shevlane et al., 2023).
AI development
MLE-bench
AI development capability involves building new AI systems from scratch, including systems with dangerous capabilities, adapting existing models to increase performance on extreme-risk-relevant tasks, and significantly improving the productivity of actors building dual-use AI capabilities (Shevlane et al., 2023).
Persuasion & manipulation
MakeMeSay
Persuasion and manipulation capability involves shaping people's beliefs through dialogue and in other settings such as social media, promoting narratives persuasively, and convincing people to take actions they wouldn't otherwise take, including unethical acts (Shevlane et al., 2023).
Self-proliferation
RepliBench
Self-proliferation capability encompasses breaking out of local environments, exploiting monitoring system limitations, independently generating revenue through services or attacks, acquiring and operating cloud computing resources, and developing creative strategies for self-discovery or code exfiltration (Shevlane et al., 2023).
Biological weapons development
Virology Capability Test
Biological weapons acquisition capability involves understanding pathogen engineering and cultivation methods, providing actionable guidance for biological weapon production and deployment, and making scientific discoveries that enable novel biological weapons. While most classes of microbes can be manipulated to pose catastrophic risks to humans, RNA viruses are the class most likely to have this capacity (Adalja, 2019).