AI development
AI development capability involves building new AI systems from scratch, including systems with dangerous capabilities; adapting existing models to increase performance on extreme-risk-relevant tasks; and significantly improving the productivity of actors building dual-use AI capabilities (Shevlane et al., 2023).
MLE-bench
MLE-bench is a benchmark for autonomous machine learning engineering built from 75 curated competitions from Kaggle, a data science competition platform. It evaluates AI agents' end-to-end ML engineering capabilities, including data analysis, feature engineering, model selection, hyperparameter tuning, and iterative improvement, on real-world ML challenges.
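To make the listed steps concrete, here is a toy, stdlib-only Python sketch of the kind of end-to-end loop an agent must automate: analyze data, engineer features, fit a model, tune a hyperparameter, and evaluate. Everything in it (the synthetic data, the nearest-centroid stand-in model, the parameter grid) is hypothetical and far simpler than a real MLE-bench solution.

```python
import random

# Hypothetical sketch of an end-to-end ML workflow: data prep, feature
# engineering, model fitting, hyperparameter tuning, held-out evaluation.
random.seed(0)

# Synthetic binary-classification data: label is 1 iff x0 + x1 > 1.
points = [(random.random(), random.random()) for _ in range(200)]
data = [((x0, x1), int(x0 + x1 > 1.0)) for x0, x1 in points]
train, test = data[:150], data[150:]

def featurize(x, w):
    # "Feature engineering": raw inputs plus a weighted interaction term.
    x0, x1 = x
    return (x0, x1, w * x0 * x1)

def fit_centroids(rows, w):
    # Trivial nearest-centroid "model": mean feature vector per class.
    sums = {0: [0.0, 0.0, 0.0], 1: [0.0, 0.0, 0.0]}
    counts = {0: 0, 1: 0}
    for x, y in rows:
        counts[y] += 1
        for i, v in enumerate(featurize(x, w)):
            sums[y][i] += v
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict(centroids, x, w):
    f = featurize(x, w)
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(f, c))
    return min(centroids, key=lambda y: dist(centroids[y]))

def accuracy(centroids, rows, w):
    return sum(predict(centroids, x, w) == y for x, y in rows) / len(rows)

# "Hyperparameter tuning": grid search over the interaction weight.
best_w, best_acc = None, -1.0
for w in (0.0, 1.0, 4.0):
    acc = accuracy(fit_centroids(train, w), train, w)
    if acc > best_acc:
        best_w, best_acc = w, acc

final_model = fit_centroids(train, best_w)
test_acc = accuracy(final_model, test, best_w)
print(f"best interaction weight={best_w}, test accuracy={test_acc:.2f}")
```

An MLE-bench agent must perform this whole loop autonomously on real competition data, then iterate on the weakest step rather than running a fixed script.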
Model scores
| Model | Score | Date |
|---|---|---|
| gemini-2.5-pro - FM Agent | 44% | 2025-07-17 |
| gemini-2.5-pro - CAIR MLE-STAR-Pro | 39% | 2025-07-17 |
| deepseek-r1 - InternAgent | 36% | 2025-01-20 |
| gpt-5 - R&D-Agent | 35% | 2025-08-07 |
| undisclosed - Neo multi-agent | 34% | — |
| o3 - AIRA-dojo | 32% | 2025-04-16 |
| o3 + gpt-4.1 - R&D-Agent | 30% | 2025-04-16 |
| deepseek-r1 - ML-Master | 29% | 2025-01-20 |
| o1-preview - R&D-Agent | 22% | 2024-09-12 |
| o1-preview - AIDE | 17% | 2024-09-12 |
| gpt-4o-2024-08-06 - AIDE | 9% | 2024-08-06 |
| claude-3-5-sonnet-20240620 - AIDE | 8% | 2024-06-20 |
| gpt-4o-2024-08-06 - OpenHands | 5% | 2024-08-06 |
| llama-3.1-405b-instruct - AIDE | 3% | 2024-06-23 |
| gpt-4o-2024-08-06 - MLAB | 1% | 2024-08-06 |
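For context on how the percentages above relate to the 75 competitions, MLE-bench's headline metric is the share of competitions in which an agent's submission scores in Kaggle's medal range. A minimal sketch of that aggregation (the per-competition outcomes below are made up; only the 75-competition count comes from the benchmark):

```python
# Hypothetical per-competition outcomes for one agent run: True if the
# submission reached Kaggle's medal threshold in that competition.
NUM_COMPETITIONS = 75  # matches MLE-bench's curated competition count
medals = [i % 3 == 0 for i in range(NUM_COMPETITIONS)]  # toy pattern: 25 medals

def medal_rate(outcomes):
    """Fraction of competitions with any medal, as a percentage."""
    return 100.0 * sum(outcomes) / len(outcomes)

score = medal_rate(medals)
print(f"{score:.0f}%")  # 25 of 75 competitions -> 33%
```

A leaderboard entry like 44% therefore means medals in roughly 33 of the 75 competitions on that run.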
Why this benchmark?
While other AI development benchmarks exist, such as RE-Bench, MLE-bench offers the advantage of being both well-established (it was developed by OpenAI) and having a publicly accessible, regularly updated leaderboard.