AI development
AI development capability involves building new AI systems from scratch, including systems with dangerous capabilities; adapting existing models to increase performance on extreme-risk-relevant tasks; and significantly improving the productivity of actors building dual-use AI capabilities (Shevlane et al., 2023).
MLE-bench
MLE-bench is a benchmark for autonomous machine learning engineering built from 75 curated competitions from Kaggle, a data science competition platform. It evaluates AI agents' end-to-end ML engineering capabilities, including data analysis, feature engineering, model selection, hyperparameter tuning, and iterative improvement, on real-world ML challenges.
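To make the listed steps concrete, here is a toy, stdlib-only Python sketch of the kind of end-to-end loop an agent must automate: analyze data, engineer features, fit a model, tune a hyperparameter, and evaluate. Everything in it (the synthetic data, the nearest-centroid stand-in model, the parameter grid) is hypothetical and far simpler than a real MLE-bench solution.

```python
import random

# Hypothetical sketch of an end-to-end ML workflow: data prep, feature
# engineering, model fitting, hyperparameter tuning, held-out evaluation.
random.seed(0)

# Synthetic binary-classification data: label is 1 iff x0 + x1 > 1.
points = [(random.random(), random.random()) for _ in range(200)]
data = [((x0, x1), int(x0 + x1 > 1.0)) for x0, x1 in points]
train, test = data[:150], data[150:]

def featurize(x, w):
    # "Feature engineering": raw inputs plus a weighted interaction term.
    x0, x1 = x
    return (x0, x1, w * x0 * x1)

def fit_centroids(rows, w):
    # Trivial nearest-centroid "model": mean feature vector per class.
    sums = {0: [0.0, 0.0, 0.0], 1: [0.0, 0.0, 0.0]}
    counts = {0: 0, 1: 0}
    for x, y in rows:
        counts[y] += 1
        for i, v in enumerate(featurize(x, w)):
            sums[y][i] += v
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict(centroids, x, w):
    f = featurize(x, w)
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(f, c))
    return min(centroids, key=lambda y: dist(centroids[y]))

def accuracy(centroids, rows, w):
    return sum(predict(centroids, x, w) == y for x, y in rows) / len(rows)

# "Hyperparameter tuning": grid search over the interaction weight.
best_w, best_acc = None, -1.0
for w in (0.0, 1.0, 4.0):
    acc = accuracy(fit_centroids(train, w), train, w)
    if acc > best_acc:
        best_w, best_acc = w, acc

final_model = fit_centroids(train, best_w)
test_acc = accuracy(final_model, test, best_w)
print(f"best interaction weight={best_w}, test accuracy={test_acc:.2f}")
```

An MLE-bench agent must perform this whole loop autonomously on real competition data, then iterate on the weakest step rather than running a fixed script.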
Model scores
| Model | Score | Date |
|---|---|---|
| gemini-2.5-pro - FM Agent | 44% | 2025-07-17 |
| gemini-2.5-pro - CAIR MLE-STAR-Pro | 39% | 2025-07-17 |
| deepseek-r1 - InternAgent | 36% | 2025-01-20 |
| gpt-5 - R&D-Agent | 35% | 2025-08-07 |
| undisclosed - Neo multi-agent | 34% | — |
| o3 - AIRA-dojo | 32% | 2025-04-16 |
| o3 + gpt-4.1 - R&D-Agent | 30% | 2025-04-16 |
| deepseek-r1 - ML-Master | 29% | 2025-01-20 |
| o1-preview - R&D-Agent | 22% | 2024-09-12 |
| o1-preview - AIDE | 17% | 2024-09-12 |
| gpt-4o-2024-08-06 - AIDE | 9% | 2024-08-06 |
| claude-3-5-sonnet-20240620 - AIDE | 8% | 2024-06-20 |
| gpt-4o-2024-08-06 - OpenHands | 5% | 2024-08-06 |
| llama-3.1-405b-instruct - AIDE | 3% | 2024-06-23 |
| gpt-4o-2024-08-06 - MLAB | 1% | 2024-08-06 |
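For context on how the percentages above relate to the 75 competitions, MLE-bench's headline metric is the share of competitions in which an agent's submission scores in Kaggle's medal range. A minimal sketch of that aggregation (the per-competition outcomes below are made up; only the 75-competition count comes from the benchmark):

```python
# Hypothetical per-competition outcomes for one agent run: True if the
# submission reached Kaggle's medal threshold in that competition.
NUM_COMPETITIONS = 75  # matches MLE-bench's curated competition count
medals = [i % 3 == 0 for i in range(NUM_COMPETITIONS)]  # toy pattern: 25 medals

def medal_rate(outcomes):
    """Fraction of competitions with any medal, as a percentage."""
    return 100.0 * sum(outcomes) / len(outcomes)

score = medal_rate(medals)
print(f"{score:.0f}%")  # 25 of 75 competitions -> 33%
```

A leaderboard entry like 44% therefore means medals in roughly 33 of the 75 competitions on that run.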
Why this benchmark?
While other AI development benchmarks exist, such as RE-Bench, MLE-bench offers the advantage of being both well-established (it was developed by OpenAI) and having a publicly accessible, regularly updated leaderboard.