Plumb
A

AI coding-agent benchmark

SWE-bench

Princeton Language and Intelligence (Princeton University); created with the University of Chicago


An open, reproducible academic benchmark that can't be bought — but critics, and even OpenAI, say training-data contamination has eroded what its top scores actually prove.

What it's really for
An academic coding-agent benchmark; agents are graded automatically against real test suites.

What our grade covers
The grade on this page covers how SWE-bench scores agents on resolving real GitHub issues, not everything the site does.

Scoring confidence
High. Checked against primary sources. We are confident in the facts and the grade here.

Follow the money

No one pays to be ranked and there is no placement to buy — labs (OpenAI, Anthropic, Google, Meta) submit results voluntarily and are scored by the same open test harness, so position is earned by passing tests, not purchased.

Operating since
2023 (3 years)
What it costs you
Free to read. The benchmark results are free to read.
How they make money
It doesn't make money: it's a free, MIT-licensed academic research benchmark out of Princeton with no ads, sponsorships, or paid placement.
What they do
It ranks AI coding agents by how many real GitHub issues they actually resolve, scored automatically against the projects' own test suites via an open, containerized evaluation harness.
What to watch for
A high SWE-bench score isn't proof of real coding skill: researchers have documented "solution leakage" and, by OpenAI's own February 2026 disclosure, frontier models had seen the test tasks in training, so rankings can reward memorization.
Composite score
4.30 / 5.00 → grade A

How the grade was reached

Independence · 30% weight · 5 / 5

Does the site take money from the very entities it ranks? Pay-for-placement, vendor-funded data, and affiliate commissions all pull this down. The less the ranking can be bought, the higher the score.

Evidence basis · 30% weight · 4 / 5

What is the ranking actually built on? Hands-on testing scores highest, then verified first-hand reviews, then opinion or popularity surveys and self-reported figures, then pay-to-rank, which scores lowest.

Method transparency · 20% weight · 5 / 5

Is the methodology published, specific, and reproducible? Can a reader see how a given rank was reached, or is it a black box?

Conflict disclosure · 10% weight · 4 / 5

Are commercial relationships, sponsorships, and affiliate arrangements disclosed clearly and near the rankings themselves, rather than buried?

Manipulation resistance · 10% weight · 2 / 5

How hard is it to game? Controls against fake reviews, solicited reviews, and vendor gaming raise this; an open box anyone can stuff lowers it.
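
The five weighted scores above can be checked against the composite figure. The page does not spell out its formula, so the sketch below assumes a plain weighted sum of the criterion scores; under that assumption it reproduces the published 4.30. The names `RUBRIC` and `composite` are illustrative only, and the cut-offs that turn a composite into a letter grade are not stated on this page, so the sketch stops at the number.

```python
from decimal import Decimal

# (criterion, weight, score out of 5), copied from the rubric on this page.
RUBRIC = [
    ("Independence", Decimal("0.30"), 5),
    ("Evidence basis", Decimal("0.30"), 4),
    ("Method transparency", Decimal("0.20"), 5),
    ("Conflict disclosure", Decimal("0.10"), 4),
    ("Manipulation resistance", Decimal("0.10"), 2),
]


def composite(rubric):
    """Weighted sum of criterion scores (assumed formula, not confirmed by the page)."""
    total_weight = sum(weight for _, weight, _ in rubric)
    if total_weight != 1:
        raise ValueError(f"weights sum to {total_weight}, expected 1")
    return sum(weight * score for _, weight, score in rubric)


score = composite(RUBRIC)
print(f"{score:.2f} / 5.00")  # prints "4.30 / 5.00"
```

Decimal arithmetic is used so the weights add up to exactly 1 and the result prints as 4.30 without floating-point rounding noise. Note how little the 2 / 5 for manipulation resistance moves the total: at a 10% weight it costs only 0.30 points against a perfect score.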
