Chemical weapons development

Chemical weapons development capability encompasses designing synthesis routes for chemical agents, providing actionable instructions for weapon assembly and deployment, making scientific discoveries that enable novel toxic compounds, and offering technical expertise that accelerates chemical weapon production.

ChemBench

ChemBench is a comprehensive benchmark that evaluates AI models' chemistry knowledge and reasoning across more than 2,700 curated question-answer pairs. The questions span diverse chemistry topics, including analytical chemistry, organic chemistry, and chemical reasoning.

Top score: 66% (Claude 3.7 Sonnet)

Model scores

Model                      Score   Date
Claude 3.7 Sonnet          66%     2025-02-24
O1 Preview                 64%     2024-09-12
Claude 3.5 Sonnet          63%     2024-06-20
GPT-4o                     61%     2024-05-13
Llama 3.1 405B Instruct    58%     2024-06-23
Claude 3 Opus              57%     2024-03-04
Llama 3.1 70B Instruct     53%     2024-06-23
Qwen 2.5 32B Instruct      53%     2024-09-17
Llama 3 70B Instruct       52%     2024-04-18
Gemma 2 9B                 48%     2024-06-27
Llama 3.1 8B Instruct      47%     2024-06-23
GPT-3.5 Turbo 0613         47%     2023-06-13
Claude 2                   47%     2023-07-11
Llama 3 8B Instruct        46%     2024-04-18
Gemini 1.0 Pro             45%     2023-12-06
GPT 4                      41%     2023-03-14
Llama 2 70B Chat           27%     2024-02-24
Llama 2 13B Chat           26%     2024-02-24
DeepSeek R1                1%      2025-01-20
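
The headline figure is simply the highest score among the listed models. As a minimal sketch of how it could be derived, assuming a few of the rows are copied by hand into (model, score, date) tuples: the names `ROWS`, `top_model`, and `best_so_far` are hypothetical and are not part of any ChemBench tooling.

```python
from datetime import date

# Hypothetical hand-copied subset of the leaderboard:
# (model name, score in percent, date listed in the table).
ROWS = [
    ("Claude 3.7 Sonnet", 66, date(2025, 2, 24)),
    ("O1 Preview", 64, date(2024, 9, 12)),
    ("Claude 3.5 Sonnet", 63, date(2024, 6, 20)),
    ("GPT-4o", 61, date(2024, 5, 13)),
]


def top_model(rows):
    """Return the row with the highest score."""
    return max(rows, key=lambda row: row[1])


def best_so_far(rows):
    """Return the rows that set a new high score when read in date order."""
    records = []
    best = -1
    for row in sorted(rows, key=lambda r: r[2]):
        if row[1] > best:
            records.append(row)
            best = row[1]
    return records


leader = top_model(ROWS)
print(f"{leader[0]}: {leader[1]}%")
```

Reading the rows in date order with `best_so_far` gives a rough view of how the top score has moved over time, which is the kind of trend a leaderboard chart would normally show.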

Why this benchmark?

Several benchmarks assess chemistry-related capabilities, including LabBench, WMDP Chemistry, and ChemSafetyBench. WMDP Chemistry shows signs of saturation given its age. LabBench is promising but lacks a public leaderboard; some frontier models, such as GPT-5, are evaluated against it in their system cards, though not consistently or on the same subtasks. ChemSafetyBench focuses more directly on safety concerns, but its results cover outdated models and access to its dataset is restricted. We selected ChemBench for its regularly updated leaderboard and its relevance to chemistry.