Political strategy
Political strategy is the capability to perform the social modeling and planning needed to gain and exercise political influence in scenarios with multiple actors and rich social context (Shevlane et al., 2023).
ForecastBench
ForecastBench measures one critical component of this capability: forecasting accuracy on real-world prediction questions about future events. It is a dynamic benchmark built on 1,000+ automatically generated, regularly updated questions. Forecasts are scored with Brier scores (lower is better), which are inverted for display: 100% represents the best forecaster in the dataset (lowest Brier score) and 0% represents the worst.
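The scoring described above can be illustrated with a minimal Python sketch. The Brier score here is the standard mean squared error between probability forecasts and binary outcomes. The display inversion is assumed to be a linear rescaling between the best and worst Brier scores in the dataset; the source states only the two endpoints (100% for the best, 0% for the worst), so the exact mapping, along with the function names `brier_score` and `inverted_display_score`, is hypothetical.

```python
from typing import Sequence


def brier_score(forecasts: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between probability forecasts and 0/1 outcomes.

    Lower is better: 0.0 is a perfect forecaster, and always answering
    0.5 yields 0.25.
    """
    if len(forecasts) == 0 or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal in length")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


def inverted_display_score(brier: float, best: float, worst: float) -> float:
    """Map a Brier score to a 0-100 display score where higher is better.

    ASSUMPTION: a linear min-max rescaling, so that the lowest Brier
    score in the dataset (best) maps to 100 and the highest (worst)
    maps to 0. The benchmark's actual display formula may differ.
    """
    if worst <= best:
        raise ValueError("worst must be strictly greater than best")
    return 100.0 * (worst - brier) / (worst - best)


# Usage with made-up numbers (not taken from the benchmark):
example_brier = brier_score([0.9, 0.2, 0.6], [1, 0, 1])
example_display = inverted_display_score(example_brier, best=0.0, worst=0.5)
```

Under this assumed mapping, a displayed score is relative to the other forecasters in the dataset rather than an absolute accuracy figure, so a low percentage means a model sits far from the best forecaster in the dataset, not that it is wrong most of the time.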
Top score: 23%, GPT-5-2025-08-07 (zero shot with crowd forecast)
Model scores
| Model (prompting setup) | Score | Model release date |
|---|---|---|
| GPT-5-2025-08-07 (zero shot with crowd forecast) | 23% | 2025-08-07 |
| Claude-Opus-4-1-20250805 (zero shot with crowd forecast) | 22% | 2025-08-05 |
| Gemini-2.5-Pro (zero shot with crowd forecast) | 21% | 2025-06-17 |
| GPT-4.5-Preview-2025-02-27 (zero shot with crowd forecast) | 21% | 2025-02-27 |
| O3-2025-04-16 (zero shot with crowd forecast) | 20% | 2025-04-16 |
| GPT-4.5-Preview-2025-02-27 (scratchpad with crowd forecast) | 20% | 2025-02-27 |
| Claude-3-7-Sonnet-20250219 (scratchpad with crowd forecast) | 19% | 2025-02-24 |
| DeepSeek-R1 (scratchpad with crowd forecast) | 17% | 2025-01-20 |
| Claude-3-7-Sonnet-20250219 (zero shot with crowd forecast) | 15% | 2025-02-24 |
| O3-2025-04-16 (scratchpad with crowd forecast) | 14% | 2025-04-16 |
| GPT-4.1-2025-04-14 (scratchpad with crowd forecast) | 13% | 2025-04-14 |
| Grok-beta (zero shot with crowd forecast) | 13% | 2024-11-04 |
| O3-Mini-2025-01-31 (zero shot with crowd forecast) | 12% | 2025-01-31 |
| Claude-3-5-Sonnet-20241022 (scratchpad with crowd forecast) | 12% | 2024-10-22 |
| Gemini-2.5-Flash-Preview-04-17 (zero shot) | 12% | 2025-04-17 |
| GPT-5-Mini-2025-08-07 (zero shot with crowd forecast) | 12% | 2025-08-07 |
| O4-Mini-2025-04-16 (zero shot with crowd forecast) | 11% | 2025-04-16 |
| Claude-3-5-Sonnet-20241022 (zero shot with crowd forecast) | 10% | 2024-10-22 |
| Claude-3-5-Sonnet-20240620 (zero shot with crowd forecast) | 10% | 2024-06-20 |
| GPT-4.1-2025-04-14 (zero shot with crowd forecast) | 10% | 2025-04-14 |
| Qwen3-235B-A22B-Fp8-Tput (scratchpad with crowd forecast) | 9% | 2025-04-29 |
| Claude-Sonnet-4-20250514 (zero shot with crowd forecast) | 9% | 2025-05-22 |
| Claude-3-5-Sonnet-20240620 (scratchpad with crowd forecast) | 9% | 2024-06-20 |
| Claude-Opus-4-20250514 (scratchpad with crowd forecast) | 9% | 2025-05-22 |
| GPT-4-Turbo-2024-04-09 (zero shot with crowd forecast) | 9% | 2024-04-09 |
| DeepSeek-V3 (zero shot with crowd forecast) | 8% | 2024-12-25 |
| Meta-Llama-3.1-405B-Instruct-Turbo (zero shot with crowd forecast) | 8% | 2024-07-23 |
| Gemini-2.5-Flash (zero shot with crowd forecast) | 8% | 2025-06-17 |
| O3-2025-04-16 (zero shot) | 8% | 2025-04-16 |
| Gemini-2.5-Flash-Preview-04-17 (zero shot with crowd forecast) | 8% | 2025-04-17 |
| Qwen3-235B-A22B-Fp8-Tput (zero shot with crowd forecast) | 8% | 2025-04-29 |
| GLM-4.5-Air-FP8 (zero shot with crowd forecast) | 6% | 2025-07-28 |
| O3-2025-04-16 (scratchpad) | 6% | 2025-04-16 |
| Kimi-K2-Instruct (zero shot with crowd forecast) | 6% | 2025-07-12 |
| GPT-4o-2024-05-13 (zero shot with crowd forecast) | 6% | 2024-05-13 |
| GPT-4.5-Preview-2025-02-27 (zero shot) | 6% | 2025-02-27 |
| Llama-3.3-70B-Instruct-Turbo (scratchpad with crowd forecast) | 5% | 2024-12-06 |
| Claude-3-7-Sonnet-20250219 (scratchpad) | 4% | 2025-02-24 |
| QwQ-32B-Preview (scratchpad with crowd forecast) | 4% | 2024-11-28 |
| O3-Mini-2025-01-31 (scratchpad with crowd forecast) | 4% | 2025-01-31 |
| Mistral-Large-2407 (zero shot with crowd forecast) | 4% | 2024-07-24 |
| O4-Mini-2025-04-16 (scratchpad with crowd forecast) | 4% | 2025-04-16 |
| Claude-3-Opus-20240229 (zero shot with crowd forecast) | 4% | 2024-02-29 |
| GPT-4.5-Preview-2025-02-27 (scratchpad) | 3% | 2025-02-27 |
| DeepSeek-V3 (scratchpad with crowd forecast) | 1% | 2024-12-25 |
| GLM-4.5-Air-FP8 (scratchpad with crowd forecast) | 1% | 2025-07-28 |
| Qwen2.5-72B-Instruct-Turbo (zero shot with crowd forecast) | 1% | 2024-09-19 |
| GPT-4-0613 (zero shot with crowd forecast) | 0% | 2023-06-13 |
| DeepSeek-R1 (scratchpad) | 0% | 2025-01-20 |
| Qwen3-235B-A22B-Fp8-Tput (zero shot) | 0% | 2025-04-29 |
| Grok-beta (zero shot) | 0% | 2024-11-04 |
Why this benchmark?
Gamified evaluations such as Welfare Diplomacy assess aspects of political strategy, but how well they correspond to real-world political decision-making remains uncertain. ForecastBench does not measure political strategy directly; it evaluates forecasting ability, a fundamental component of strategic planning. Given the limited alternatives in this domain, ForecastBench is the most suitable option available.
Source: https://www.forecastbench.org/