Political strategy

Political strategy capability is the ability to perform the social modeling and planning needed to gain and exercise political influence in scenarios with multiple actors and rich social context (Shevlane et al., 2023).

ForecastBench

ForecastBench measures one critical component of this capability: forecasting accuracy on real-world questions about future events. It is a dynamic benchmark of 1,000+ prediction questions that are generated automatically and updated regularly. Forecasts are scored with Brier scores (lower is better), which are inverted for display: 100% represents the best forecaster in the dataset (lowest Brier score) and 0% represents the worst.

Top score: 23%, GPT-5-2025-08-07 (zero shot with crowd forecast)


Why this benchmark?

Gamified evaluations such as Welfare Diplomacy assess aspects of political strategy, but how well they correspond to real-world political decision-making remains uncertain. ForecastBench does not measure political strategy directly; it evaluates forecasting ability, a fundamental component of strategic planning. Given the limited alternatives in this domain, ForecastBench is the most suitable available option.

Related takeover scenarios

AI takes over using weapons of mass destruction

AI takes over using persuasion and manipulation

AIs at powerful positions take over by colluding

Over time

[Chart: model results over time]

Complete model results

Model (prompting setup)	Score (100% = best)	Release date
GPT-5-2025-08-07 (zero shot with crowd forecast)	23%	2025-08-07
Claude-Opus-4-1-20250805 (zero shot with crowd forecast)	22%	2025-08-05
Gemini-2.5-Pro (zero shot with crowd forecast)	21%	2025-06-17
GPT-4.5-Preview-2025-02-27 (zero shot with crowd forecast)	21%	2025-02-27
O3-2025-04-16 (zero shot with crowd forecast)	20%	2025-04-16
GPT-4.5-Preview-2025-02-27 (scratchpad with crowd forecast)	20%	2025-02-27
Claude-3-7-Sonnet-20250219 (scratchpad with crowd forecast)	19%	2025-02-24
DeepSeek-R1 (scratchpad with crowd forecast)	17%	2025-01-20
Claude-3-7-Sonnet-20250219 (zero shot with crowd forecast)	15%	2025-02-24
O3-2025-04-16 (scratchpad with crowd forecast)	14%	2025-04-16
GPT-4.1-2025-04-14 (scratchpad with crowd forecast)	13%	2025-04-14
Grok-beta (zero shot with crowd forecast)	13%	2024-11-04
O3-Mini-2025-01-31 (zero shot with crowd forecast)	12%	2025-01-31
Claude-3-5-Sonnet-20241022 (scratchpad with crowd forecast)	12%	2024-10-22
Gemini-2.5-Flash-Preview-04-17 (zero shot)	12%	2025-04-17
GPT-5-Mini-2025-08-07 (zero shot with crowd forecast)	12%	2025-08-07
O4-Mini-2025-04-16 (zero shot with crowd forecast)	11%	2025-04-16
Claude-3-5-Sonnet-20241022 (zero shot with crowd forecast)	10%	2024-10-22
Claude-3-5-Sonnet-20240620 (zero shot with crowd forecast)	10%	2024-06-20
GPT-4.1-2025-04-14 (zero shot with crowd forecast)	10%	2025-04-14
Qwen3-235B-A22B-Fp8-Tput (scratchpad with crowd forecast)	9%	2025-04-29
Claude-Sonnet-4-20250514 (zero shot with crowd forecast)	9%	2025-05-22
Claude-3-5-Sonnet-20240620 (scratchpad with crowd forecast)	9%	2024-06-20
Claude-Opus-4-20250514 (scratchpad with crowd forecast)	9%	2025-05-22
GPT-4-Turbo-2024-04-09 (zero shot with crowd forecast)	9%	2024-04-09
DeepSeek-V3 (zero shot with crowd forecast)	8%	2024-12-25
Meta-Llama-3.1-405B-Instruct-Turbo (zero shot with crowd forecast)	8%	2024-07-23
Gemini-2.5-Flash (zero shot with crowd forecast)	8%	2025-06-17
O3-2025-04-16 (zero shot)	8%	2025-04-16
Gemini-2.5-Flash-Preview-04-17 (zero shot with crowd forecast)	8%	2025-04-17
Qwen3-235B-A22B-Fp8-Tput (zero shot with crowd forecast)	8%	2025-04-29
GLM-4.5-Air-FP8 (zero shot with crowd forecast)	6%	2025-07-28
O3-2025-04-16 (scratchpad)	6%	2025-04-16
Kimi-K2-Instruct (zero shot with crowd forecast)	6%	2025-07-12
GPT-4o-2024-05-13 (zero shot with crowd forecast)	6%	2024-05-13
GPT-4.5-Preview-2025-02-27 (zero shot)	6%	2025-02-27
Llama-3.3-70B-Instruct-Turbo (scratchpad with crowd forecast)	5%	2024-12-06
Claude-3-7-Sonnet-20250219 (scratchpad)	4%	2025-02-24
QwQ-32B-Preview (scratchpad with crowd forecast)	4%	2024-11-28
O3-Mini-2025-01-31 (scratchpad with crowd forecast)	4%	2025-01-31
Mistral-Large-2407 (zero shot with crowd forecast)	4%	2024-07-24
O4-Mini-2025-04-16 (scratchpad with crowd forecast)	4%	2025-04-16
Claude-3-Opus-20240229 (zero shot with crowd forecast)	4%	2024-02-29
GPT-4.5-Preview-2025-02-27 (scratchpad)	3%	2025-02-27
DeepSeek-V3 (scratchpad with crowd forecast)	1%	2024-12-25
GLM-4.5-Air-FP8 (scratchpad with crowd forecast)	1%	2025-07-28
Qwen2.5-72B-Instruct-Turbo (zero shot with crowd forecast)	1%	2024-09-19
GPT-4-0613 (zero shot with crowd forecast)	0%	2023-06-13
DeepSeek-R1 (scratchpad)	0%	2025-01-20
Qwen3-235B-A22B-Fp8-Tput (zero shot)	0%	2025-04-29
Grok-beta (zero shot)	0%	2024-11-04

Source: https://www.forecastbench.org/