An open, reproducible academic benchmark that can't be bought — but critics, and even OpenAI, say training-data contamination has eroded what its top scores actually prove.
What it's really for: An academic coding-agent benchmark; agents are graded automatically against real test suites.
What our grade covers: The grade on this page covers how the benchmark scores agents on resolving real GitHub issues, not everything the site does.
Scoring confidence: High. Checked against primary sources. We are confident in the facts and the grade here.
No one pays to be ranked and there is no placement to buy — labs (OpenAI, Anthropic, Google, Meta) submit results voluntarily and are scored by the same open test harness, so position is earned by passing tests, not purchased.
Source →
- Operating since
- 2023 (3 years) · source
- What it costs you
- Free to read. The leaderboard and its results cost nothing to view.
- How they make money
- It doesn't make money: it's a free, MIT-licensed academic research benchmark out of Princeton with no ads, sponsorships, or paid placement.
- What they do
- It ranks AI coding agents by how many real GitHub issues they actually resolve, scored automatically against the projects' own test suites via an open, containerized evaluation harness.
- What to watch for
- A high SWE-bench score isn't proof of real coding skill: researchers have documented "solution leakage" and, by OpenAI's own February 2026 disclosure, frontier models had seen the test tasks in training, so rankings can reward memorization.
- Composite score
- 4.30 / 5.00 → grade A
How the grade was reached
Does the site take money from the very entities it ranks? Pay-for-placement, vendor-funded data, and affiliate commissions all pull this down. The less the ranking can be bought, the higher the score.
What is the ranking actually built on? Hands-on testing scores highest, then verified first-hand reviews, then opinion or popularity surveys and self-reported figures, then pay-to-rank, which scores lowest.
Is the methodology published, specific, and reproducible? Can a reader see how a given rank was reached, or is it a black box?
Are commercial relationships, sponsorships, and affiliate arrangements disclosed clearly and near the rankings themselves, rather than buried?
How hard is it to game? Controls against fake reviews, solicited reviews, and vendor gaming raise this; an open box anyone can stuff lowers it.
Evidence
- SWE-bench is a benchmark for evaluating large language models on real-world software issues collected from GitHub; given a codebase and an issue, a model generates a patch that is then verified against the repository's own tests. The evaluation harness is open source, MIT-licensed, and uses a fully containerized Docker setup for reproducible evaluations, with leaderboard submissions run through the open sb-cli tool. Source: SWE-bench GitHub repository (SWE-bench/SWE-bench) →
- SWE-bench Verified is a 500-instance human-filtered subset created in collaboration with OpenAI, where human annotators reviewed each instance to ensure problem descriptions are clear, test patches are correct, and tasks are solvable. OpenAI's collaboration is disclosed openly on the benchmark's own Verified page. Source: SWE-bench Verified (swebench.com) →
- OpenAI stopped evaluating models against SWE-bench Verified on February 23, 2026, after an audit found 59.4% of failed test cases were flawed and that every frontier model (GPT-5.2, Claude Opus 4.5, Gemini 3) showed training-data contamination — models trained on post-June-2024 GitHub data had seen the 500 Verified tasks, including solutions. Source: OpenAI: Why we no longer evaluate SWE-bench Verified →
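
The grading described in the evidence above (a model's patch is applied, then the repository's own tests decide the outcome) can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the `InstanceResult` class and both function names are hypothetical, and the sketch assumes test outcomes have already been collected from the containerized run. The two test groups follow the FAIL_TO_PASS / PASS_TO_PASS naming used in the SWE-bench dataset.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    """Hypothetical record of one task's test outcomes after the patch is applied."""
    instance_id: str
    fail_to_pass: dict[str, bool]  # tests the fix must turn green: name -> passed?
    pass_to_pass: dict[str, bool]  # tests that must stay green: name -> passed?


def is_resolved(result: InstanceResult) -> bool:
    """An issue counts as resolved only if every target test now passes
    and no previously passing test was broken by the patch."""
    return (
        bool(result.fail_to_pass)  # assumption: a task with no target tests is not resolvable
        and all(result.fail_to_pass.values())
        and all(result.pass_to_pass.values())
    )


def resolve_rate(results: list[InstanceResult]) -> float:
    """Fraction of tasks resolved, the kind of figure a leaderboard row reports."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


if __name__ == "__main__":
    # Made-up outcomes for illustration only; instance IDs are placeholders.
    demo = [
        InstanceResult("task-1", {"test_fix": True}, {"test_old": True}),
        InstanceResult("task-2", {"test_fix": False}, {"test_old": True}),
        InstanceResult("task-3", {"test_fix": True}, {"test_old": False}),
        InstanceResult("task-4", {"test_fix": True}, {"test_old": True}),
    ]
    print(f"resolved: {resolve_rate(demo):.1%}")
```

Because the pass/fail decision rests entirely on test outcomes, a score built this way cannot distinguish a patch the model reasoned out from one it memorized, which is why the contamination findings above matter for reading the rankings.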