Cyber-offense

Cyber-offense capability encompasses discovering vulnerabilities in systems, writing exploit code, making effective decisions post-access, skillfully evading threat detection while pursuing objectives, and inserting subtle bugs for future exploitation when deployed as a coding assistant (Shevlane et al., 2023).

CyBench

CyBench is a cybersecurity benchmark consisting of 40 professional-level Capture The Flag (CTF) competition tasks. It evaluates AI agents' ability to identify vulnerabilities, exploit systems, and perform penetration testing, capabilities that are critical for both security testing and risk assessment.
Best score: 55% (Claude Sonnet 4.5). No update for 5 months.
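
To make the percentages concrete, the sketch below converts a reported score into an approximate task count. It assumes that a score is simply the share of the 40 tasks solved, rounded to a whole percent; the page does not state how the figure is computed, so treat this as an illustration only. The names `TOTAL_TASKS`, `tasks_solved`, and `points_per_task` are hypothetical.

```python
TOTAL_TASKS = 40  # task count given in the benchmark description


def tasks_solved(score_percent: float, total: int = TOTAL_TASKS) -> int:
    """Nearest whole number of tasks matching a rounded percentage score.

    Assumption: score = solved / total. The page does not confirm this.
    """
    return round(score_percent / 100 * total)


def points_per_task(total: int = TOTAL_TASKS) -> float:
    """Percentage points gained by solving one additional task."""
    return 100 / total


print(tasks_solved(55))   # 22 tasks under the assumption above
print(points_per_task())  # 2.5 percentage points per task
```

Under this reading, one extra solved task moves a score by 2.5 points, so small gaps between models correspond to only one or two tasks.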

Why this benchmark?

Several established benchmarks assess the cybersecurity capabilities of LLMs, and the list grows if benchmarks of broader software proficiency are included. Notable examples include Meta's CyberSecEval, SWE-Bench, WMDP Cyber, BountyBench, Intercode CTFs, and GDM CTFs. We selected CyBench for three reasons: its focus on cybersecurity tasks, its academic recognition (nearly 100 citations), and an active leaderboard that covers recent models. Public evaluations from the CyberSecEval suite would be an equally valid alternative if recent results were available.

Over time

[Chart: CyBench score by model release date]

Complete model results

Model                      Score   Release date
Claude Sonnet 4.5          55%     2025-09-29
Claude Opus 4.1            38%     2025-08-05
Claude Opus 4              38%     2025-05-22
claude-4-sonnet            35%     2025-05-22
O3 Mini                    23%     2025-01-31
Claude 3.7 Sonnet          20%     2025-02-24
GPT-4.5 Preview            18%     2025-02-27
Claude 3.5 Sonnet          18%     2024-06-20
GPT-4o                     13%     2024-05-13
o1 Mini                    10%     2024-09-12
O1 Preview                 10%     2024-09-12
Claude 3 Opus              10%     2024-03-04
Mixtral 8x22B Instruct     8%      2024-04-17
Gemini 1.5 Pro             8%      2024-02-15
Llama 3.1 405B Instruct    8%      2024-06-23
Llama 3 70B Chat           5%      2024-04-18

Source: https://cybench.github.io/
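
One way to read the results over time is to keep only the models that set a new best score at their release. The sketch below does this with the scores and release dates exactly as listed above; the names `RESULTS` and `frontier` are hypothetical, and the code is an illustration rather than the method used to build the page's chart.

```python
from datetime import date

# (model, score in percent, release date), copied from the results listed above
RESULTS = [
    ("Claude Sonnet 4.5", 55, date(2025, 9, 29)),
    ("Claude Opus 4.1", 38, date(2025, 8, 5)),
    ("Claude Opus 4", 38, date(2025, 5, 22)),
    ("claude-4-sonnet", 35, date(2025, 5, 22)),
    ("O3 Mini", 23, date(2025, 1, 31)),
    ("Claude 3.7 Sonnet", 20, date(2025, 2, 24)),
    ("GPT-4.5 Preview", 18, date(2025, 2, 27)),
    ("Claude 3.5 Sonnet", 18, date(2024, 6, 20)),
    ("GPT-4o", 13, date(2024, 5, 13)),
    ("o1 Mini", 10, date(2024, 9, 12)),
    ("O1 Preview", 10, date(2024, 9, 12)),
    ("Claude 3 Opus", 10, date(2024, 3, 4)),
    ("Mixtral 8x22B Instruct", 8, date(2024, 4, 17)),
    ("Gemini 1.5 Pro", 8, date(2024, 2, 15)),
    ("Llama 3.1 405B Instruct", 8, date(2024, 6, 23)),
    ("Llama 3 70B Chat", 5, date(2024, 4, 18)),
]


def frontier(results):
    """Return (date, model, score) for each release that beat every earlier score."""
    best = -1
    record_setters = []
    # Sort by release date; on the same date, consider the higher score first.
    for model, score, released in sorted(results, key=lambda r: (r[2], -r[1])):
        if score > best:  # strictly better than everything released before
            best = score
            record_setters.append((released, model, score))
    return record_setters


for released, model, score in frontier(RESULTS):
    print(f"{released.isoformat()}  {score:>3}%  {model}")
```

A tie with the current best (for example, a later model matching an earlier 38%) does not count as a new record in this sketch; changing `>` to `>=` would include such ties.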