Cyber-offense
Cyber-offense capability encompasses discovering vulnerabilities in systems, writing exploit code, making effective decisions post-access, skillfully evading threat detection while pursuing objectives, and inserting subtle bugs for future exploitation when deployed as a coding assistant (Shevlane et al., 2023).
CyBench
CyBench is a cybersecurity benchmark consisting of 40 professional Capture The Flag (CTF) competition tasks. It evaluates AI agents' ability to identify vulnerabilities, exploit systems, and perform penetration testing, capabilities that matter for both security testing and risk assessment.
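As a rough illustration of how a headline score relates to the 40 tasks, the sketch below converts a count of solved tasks into a whole percentage. It assumes the reported figure is simply the fraction of tasks solved, rounded half up; the scoring rule is not stated here, and the name `success_rate_percent` is hypothetical.

```python
TOTAL_TASKS = 40  # task count given in the benchmark description


def success_rate_percent(solved: int, total: int = TOTAL_TASKS) -> int:
    """Return solved/total as a whole percentage, rounding halves up.

    Assumption: the score is a plain fraction of tasks solved.
    """
    if not 0 <= solved <= total:
        raise ValueError("solved must be between 0 and total")
    # Integer form of floor(100 * solved / total + 0.5).
    # Python's built-in round() rounds halves to even, so it is avoided here.
    return (200 * solved + total) // (2 * total)
```

Under this assumption, a score of 55% would correspond to 22 of 40 tasks, and 38% to 15 of 40 (37.5% before rounding).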
Top score: 55% (Claude Sonnet 4.5). No update for 5 months.
Why this benchmark?
Several established benchmarks assess cybersecurity capabilities in LLMs, especially if benchmarks of broader software proficiency are counted. Notable examples include Meta's CyberSecEval, SWE-Bench, WMDP Cyber, BountyBench, InterCode-CTF, and GDM CTFs. We selected CyBench for three reasons: its concentrated emphasis on cybersecurity tasks, strong academic recognition with nearly 100 citations, and an active leaderboard featuring contemporary models. Public evaluations from the CyberSecEval suite would be an equally valid alternative if recent results become available.
Related takeover scenarios
Over time
Complete model results
| Model | CyBench score | Release date |
|---|---|---|
| Claude Sonnet 4.5 | 55% | 2025-09-29 |
| Claude Opus 4.1 | 38% | 2025-08-05 |
| Claude Opus 4 | 38% | 2025-05-22 |
| Claude Sonnet 4 | 35% | 2025-05-22 |
| o3-mini | 23% | 2025-01-31 |
| Claude 3.7 Sonnet | 20% | 2025-02-24 |
| GPT-4.5 Preview | 18% | 2025-02-27 |
| Claude 3.5 Sonnet | 18% | 2024-06-20 |
| GPT-4o | 13% | 2024-05-13 |
| o1-mini | 10% | 2024-09-12 |
| o1-preview | 10% | 2024-09-12 |
| Claude 3 Opus | 10% | 2024-03-04 |
| Mixtral 8x22B Instruct | 8% | 2024-04-17 |
| Gemini 1.5 Pro | 8% | 2024-02-15 |
| Llama 3.1 405B Instruct | 8% | 2024-07-23 |
| Llama 3 70B Chat | 5% | 2024-04-18 |
Source: https://cybench.github.io/
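The trend over time can be reconstructed from the table. The sketch below takes a hand-copied subset of the rows, sorts them by release date, and keeps each model that set a new high score. The `frontier` helper and the tuple layout are illustrative assumptions and are not part of the CyBench site.

```python
from datetime import date

# (model, score in percent, release date): a subset of the table rows
ROWS = [
    ("Claude Sonnet 4.5", 55, date(2025, 9, 29)),
    ("Claude Opus 4.1", 38, date(2025, 8, 5)),
    ("Claude Opus 4", 38, date(2025, 5, 22)),
    ("o3-mini", 23, date(2025, 1, 31)),
    ("Claude 3.7 Sonnet", 20, date(2025, 2, 24)),
    ("GPT-4.5 Preview", 18, date(2025, 2, 27)),
    ("Claude 3.5 Sonnet", 18, date(2024, 6, 20)),
    ("GPT-4o", 13, date(2024, 5, 13)),
    ("Claude 3 Opus", 10, date(2024, 3, 4)),
    ("Gemini 1.5 Pro", 8, date(2024, 2, 15)),
    ("Llama 3 70B Chat", 5, date(2024, 4, 18)),
]


def frontier(rows):
    """Return the rows that set a new best score, in release order."""
    best = -1
    records = []
    # Sort by date; on the same date, consider the higher score first.
    for model, score, released in sorted(rows, key=lambda r: (r[2], -r[1])):
        if score > best:  # strict: a tie with the current best is not a record
            best = score
            records.append((model, score, released))
    return records


for model, score, released in frontier(ROWS):
    print(f"{released.isoformat()}  {score:>3}%  {model}")
```

On these rows the record holders are, in order, Gemini 1.5 Pro (8%), Claude 3 Opus (10%), GPT-4o (13%), Claude 3.5 Sonnet (18%), o3-mini (23%), Claude Opus 4 (38%), and Claude Sonnet 4.5 (55%). Ordering by release date rather than evaluation date is a simplification, since the table records only the former.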