AI coding-agent benchmark

SWE-bench

Item: SWE-bench
Rating: 4.3
Author: Plumb

Princeton Language and Intelligence (Princeton University); created with the University of Chicago

An open, reproducible academic benchmark that can't be bought — but critics, and even OpenAI, say training-data contamination has eroded what its top scores actually prove.

What it's really for: An academic coding-agent benchmark; agents are graded automatically against real test suites.

What our grade covers: The grade on this page covers how SWE-bench scores AI coding agents on resolving real GitHub issues, not everything the site does.

Scoring confidence: High. Checked against primary sources. We are confident in the facts and the grade here.

Follow the money

No one pays to be ranked and there is no placement to buy — labs (OpenAI, Anthropic, Google, Meta) submit results voluntarily and are scored by the same open test harness, so position is earned by passing tests, not purchased.

Operating since: 2023 (3 years)
What it costs you: Free to read. The leaderboard is free to read.
How they make money: It doesn't. It's a free, MIT-licensed academic research benchmark out of Princeton with no ads, sponsorships, or paid placement.
What they do: It ranks AI coding agents by how many real GitHub issues they actually resolve, scored automatically against the projects' own test suites via an open, containerized evaluation harness.
What to watch for: A high SWE-bench score isn't proof of real coding skill: researchers have documented "solution leakage" and, by OpenAI's own February 2026 disclosure, frontier models had seen the test tasks in training, so rankings can reward memorization.
Composite score: 4.30 / 5.00 → grade A
Last verified: July 11, 2026 — when we last checked this entry's facts, links, and grade against the live site.
Site confirmed live: July 11, 2026 — our monthly automated check reached SWE-bench and got a normal response.

How the grade was reached

Independence · 30% weight · 5 / 5

Does the site take money from the very entities it ranks? Pay-for-placement, vendor-funded data, and affiliate commissions all pull this down. The less the ranking can be bought, the higher the score.

Evidence basis · 30% weight · 4 / 5

What is the ranking actually built on? Hands-on testing scores highest, then verified first-hand reviews, then opinion or popularity surveys and self-reported figures, then pay-to-rank, which scores lowest.

Method transparency · 20% weight · 5 / 5

Is the methodology published, specific, and reproducible? Can a reader see how a given rank was reached, or is it a black box?

Conflict disclosure · 10% weight · 4 / 5

Are commercial relationships, sponsorships, and affiliate arrangements disclosed clearly and near the rankings themselves, rather than buried?

Manipulation resistance · 10% weight · 2 / 5

How hard is it to game? Controls against fake reviews, solicited reviews, and vendor gaming raise this; an open box anyone can stuff lowers it.
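
The composite score can be checked from the five criterion scores and weights listed above. The sketch below assumes the composite is a plain weighted average of those scores. The names CRITERIA and composite_score are illustrative and not part of any published tooling, and the cutoffs that turn a score into a letter grade are not stated on this page, so they are left out.

```python
# Weights (in percent) and scores (out of 5) as listed on this page.
# Assumption: the composite is a simple weighted average of the five scores.
CRITERIA = {
    "Independence": (30, 5),
    "Evidence basis": (30, 4),
    "Method transparency": (20, 5),
    "Conflict disclosure": (10, 4),
    "Manipulation resistance": (10, 2),
}


def composite_score(criteria):
    """Return the weighted average of the criterion scores, on a 0-5 scale."""
    total_weight = sum(weight for weight, _ in criteria.values())
    if total_weight != 100:
        raise ValueError(f"weights must sum to 100, got {total_weight}")
    # Integer percent weights keep the intermediate sum exact.
    weighted_sum = sum(weight * score for weight, score in criteria.values())
    return weighted_sum / 100


print(f"{composite_score(CRITERIA):.2f} / 5.00")  # prints "4.30 / 5.00"
```

The result matches the 4.30 composite reported for this entry: the low Manipulation resistance score carries only a 10% weight, so it costs 0.3 points against a perfect 5.00.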

Evidence

SWE-bench is a benchmark for evaluating large language models on real-world software issues collected from GitHub; given a codebase and an issue, a model generates a patch that is then verified against the repository's own tests. The evaluation harness is open source, MIT-licensed, and uses a fully containerized Docker setup for reproducible evaluations, with leaderboard submissions run through the open sb-cli tool. Source: SWE-bench GitHub repository (SWE-bench/SWE-bench) →
SWE-bench Verified is a 500-instance human-filtered subset created in collaboration with OpenAI, where human annotators reviewed each instance to ensure problem descriptions are clear, test patches are correct, and tasks are solvable. OpenAI's collaboration is disclosed openly on the benchmark's own Verified page. Source: SWE-bench Verified (swebench.com) →
OpenAI stopped evaluating models against SWE-bench Verified on February 23, 2026, after an audit found 59.4% of failed test cases were flawed and that every frontier model (GPT-5.2, Claude Opus 4.5, Gemini 3) showed training-data contamination — models trained on post-June-2024 GitHub data had seen the 500 Verified tasks, including solutions. Source: OpenAI: Why we no longer evaluate SWE-bench Verified →
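
The automatic grading described in the first evidence item can be sketched as follows. This is a minimal illustration and not the SWE-bench harness itself: the InstanceResult type, its fields, and both function names are hypothetical, and the split into tests a fix should make pass and tests that must keep passing is an assumption about what "verified against the repository's own tests" means in practice.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class InstanceResult:
    """Hypothetical record of test outcomes after applying a model's patch."""
    instance_id: str
    # Tests expected to go from failing to passing once the issue is fixed.
    fail_to_pass: Dict[str, bool]
    # Tests that passed before the patch and must still pass afterwards.
    pass_to_pass: Dict[str, bool]


def is_resolved(result: InstanceResult) -> bool:
    """An issue counts as resolved only if every relevant test passes."""
    return (
        bool(result.fail_to_pass)
        and all(result.fail_to_pass.values())
        and all(result.pass_to_pass.values())
    )


def resolve_rate(results: List[InstanceResult]) -> float:
    """Fraction of issues resolved; this is the kind of number a leaderboard ranks on."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


# Made-up example data: one fixed issue, one patch that breaks an existing test.
results = [
    InstanceResult("example-1", {"test_bug": True}, {"test_existing": True}),
    InstanceResult("example-2", {"test_bug": True}, {"test_existing": False}),
]
print(resolve_rate(results))  # prints 0.5
```

A score built this way measures only whether the tests pass. It cannot tell a patch the model reasoned out from one it memorized, which is why the contamination finding above matters for reading the rankings.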

Compare with others

Others reviewing AI models

A+ EvalPlus Leaderboard
A+ Stanford HELM
A- Artificial Analysis
A- The Verge
A- Vals AI
A- Digital Trends
B+ Engadget
B TechRadar
C+ LMArena
C+ Vellum LLM Leaderboard
C- There's An AI For That (TAAFT)
C- Product Hunt
D+ Futurepedia
D Toolify AI
A+ Hugging Face Open LLM Leaderboard
B- Papers with Code
