AI development

The AI development capability covers building new AI systems from scratch, including systems with dangerous capabilities; adapting existing models to perform better on extreme-risk-relevant tasks; and significantly improving the productivity of actors building dual-use AI capabilities (Shevlane et al., 2023).

MLE-bench

MLE-bench is a benchmark for autonomous machine learning engineering built from 75 curated competitions on Kaggle, a data science competition platform. It evaluates AI agents' end-to-end ML engineering capabilities, including data analysis, feature engineering, model selection, hyperparameter tuning, and iterative improvement, on real-world ML challenges. Scores report the percentage of competitions in which the agent's submission earns a Kaggle medal.
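To make concrete the kind of work the benchmark expects agents to automate, below is a minimal, self-contained Python sketch of one competition-style pipeline: feature preprocessing, model selection across candidates, hyperparameter tuning via cross-validation, and writing a submission file. This is illustrative only, not MLE-bench's actual harness; the synthetic dataset and the `submission.csv` format stand in for a real competition's data and grading interface.

```python
# Illustrative sketch (not MLE-bench's harness) of the end-to-end workflow
# an agent must automate for a single Kaggle-style competition:
# preprocessing, model selection, hyperparameter tuning, and submission.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a competition's train/test split.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, _ = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Candidate models, each with a small hyperparameter grid to search.
candidates = {
    "logreg": (
        Pipeline([("scale", StandardScaler()),
                  ("clf", LogisticRegression(max_iter=1000))]),
        {"clf__C": [0.1, 1.0, 10.0]},
    ),
    "gbdt": (
        Pipeline([("clf", GradientBoostingClassifier(random_state=0))]),
        {"clf__n_estimators": [100, 300], "clf__max_depth": [2, 3]},
    ),
}

# Select the model/hyperparameter combination with the best CV score.
best_name, best_search = None, None
for name, (pipe, grid) in candidates.items():
    search = GridSearchCV(pipe, grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    if best_search is None or search.best_score_ > best_search.best_score_:
        best_name, best_search = name, search

print(f"selected {best_name}: cv accuracy {best_search.best_score_:.3f}")

# Kaggle competitions are graded on held-out test predictions submitted
# as a CSV; an agent's pipeline must end with a step like this.
preds = best_search.predict(X_test)
pd.DataFrame({"id": np.arange(len(preds)), "target": preds}).to_csv(
    "submission.csv", index=False
)
```

An agent scoring well on MLE-bench must carry out this full loop unassisted, and typically iterate on it: inspecting validation results, revising features or models, and resubmitting until performance plateaus.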

Top score: 44% (gemini-2.5-pro - FM Agent)

Model scores

| Model | Score | Date |
| --- | --- | --- |
| gemini-2.5-pro - FM Agent | 44% | 2025-07-17 |
| gemini-2.5-pro - CAIR MLE-STAR-Pro | 39% | 2025-07-17 |
| deepseek-r1 - InternAgent | 36% | 2025-01-20 |
| gpt-5 - R&D-Agent | 35% | 2025-08-07 |
| undisclosed - Neo multi-agent | 34% | |
| o3 - AIRA-dojo | 32% | 2025-04-16 |
| o3 + gpt-4.1 - R&D-Agent | 30% | 2025-04-16 |
| deepseek-r1 - ML-Master | 29% | 2025-01-20 |
| o1-preview - R&D-Agent | 22% | 2024-09-12 |
| o1-preview - AIDE | 17% | 2024-09-12 |
| gpt-4o-2024-08-06 - AIDE | 9% | 2024-08-06 |
| claude-3-5-sonnet-20240620 - AIDE | 8% | 2024-06-20 |
| gpt-4o-2024-08-06 - OpenHands | 5% | 2024-08-06 |
| llama-3.1-405b-instruct - AIDE | 3% | 2024-06-23 |
| gpt-4o-2024-08-06 - MLAB | 1% | 2024-08-06 |

Why this benchmark?

While other AI development benchmarks exist, such as RE-Bench, MLE-bench has the advantage of being both well-established (it was developed by OpenAI) and backed by a publicly accessible, regularly updated leaderboard.

Related takeover scenarios