EvalPlus Leaderboard

Item: EvalPlus Leaderboard
Rating: 4.7
Author: Plumb

EvalPlus team (researchers at University of Illinois Urbana-Champaign); open-source project (Apache-2.0)

Benchmark · Free to read

An academic, open-source coding benchmark that auto-grades LLMs on hand-verified tests with a published, reproducible method and no money changing hands; the usual public-benchmark caveat is contamination, not commerce.

What it's really for: A non-commercial academic benchmark for code-generation models.

What our grade covers: The grade on this page covers the augmented HumanEval+/MBPP+ code-model scores, not everything the site does.

Scoring confidence: High. Checked against primary sources. We are confident in the facts and the grade here.

Follow the money

No one pays it: it is a free, open-source academic benchmark with no sponsorship or pay-for-placement, so model vendors cannot buy a higher rank.

Operating since: 2023 (3 years)
What it costs you: Nothing; the leaderboard is free to read.
How they make money: It is a non-commercial academic research project with no advertising, subscriptions, or paid placement; the code is released free under Apache-2.0.
What they do: It ranks AI code-generation models by running them against augmented, hand-verified HumanEval+ and MBPP+ test suites and reporting pass@1 scores from automated testing.
What to watch for: It measures only Python pass@1 on two fixed problem sets, so scores can be inflated by training-data contamination. The team itself urges checking multiple benchmarks rather than relying on this one.
Composite score: 4.70 / 5.00 → grade A+
Last verified: July 11, 2026 — when we last checked this entry's facts, links, and grade against the live site.
Site confirmed live: July 11, 2026 — our monthly automated check reached EvalPlus Leaderboard and got a normal response.

How the grade was reached

Independence · 30% weight · 5 / 5

Does the site take money from the very entities it ranks? Pay-for-placement, vendor-funded data, and affiliate commissions all pull this down. The less the ranking can be bought, the higher the score.

Evidence basis · 30% weight · 5 / 5

What is the ranking actually built on? Hands-on testing scores highest, then verified first-hand reviews, then opinion or popularity surveys and self-reported figures, then pay-to-rank, which scores lowest.

Method transparency · 20% weight · 5 / 5

Is the methodology published, specific, and reproducible? Can a reader see how a given rank was reached, or is it a black box?

Conflict disclosure · 10% weight · 4 / 5

Are commercial relationships, sponsorships, and affiliate arrangements disclosed clearly and near the rankings themselves, rather than buried?

Manipulation resistance · 10% weight · 3 / 5

How hard is it to game? Controls against fake reviews, solicited reviews, and vendor gaming raise this; an open box anyone can stuff lowers it.
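
The five sub-scores and weights above can be combined as a quick arithmetic check. The sketch below assumes the composite is a plain weighted sum of the sub-scores, which this page does not state outright; under that assumption it reproduces the published 4.70. The names `RUBRIC` and `composite_score` are illustrative only.

```python
# Weights and 0-5 sub-scores copied from the rubric on this page.
RUBRIC = {
    "Independence": (0.30, 5),
    "Evidence basis": (0.30, 5),
    "Method transparency": (0.20, 5),
    "Conflict disclosure": (0.10, 4),
    "Manipulation resistance": (0.10, 3),
}


def composite_score(rubric):
    """Return the weighted sum of sub-scores (assumed formula, not confirmed)."""
    total_weight = sum(weight for weight, _ in rubric.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError(f"weights sum to {total_weight}, expected 1.0")
    return sum(weight * score for weight, score in rubric.values())


score = composite_score(RUBRIC)
print(f"Composite score: {score:.2f} / 5.00")  # Composite score: 4.70 / 5.00
```

Written out: 0.30 × 5 + 0.30 × 5 + 0.20 × 5 + 0.10 × 4 + 0.10 × 3 = 1.5 + 1.5 + 1.0 + 0.4 + 0.3 = 4.7. The two lowest-weighted criteria account for the whole 0.3-point gap from a perfect score.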

Evidence

EvalPlus creates HumanEval+ and MBPP+ by extending the original HumanEval and MBPP test suites roughly 80x and 35x, respectively. It ranks models by pass@1 under greedy decoding on hand-verified problems, all run through an open-source automated harness whose setup details are published in the GitHub repo. Source: EvalPlus Leaderboard (official) →
The method is documented in the peer-reviewed paper 'Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation' by Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang, first submitted May 2023 and published at NeurIPS 2023. Source: arXiv 2305.01210 →
The full evaluation toolkit is released open-source under the Apache-2.0 license with the test-generation code public, allowing anyone to reproduce the scores; there is no paid submission path or commercial sponsorship of rankings. Source: evalplus/evalplus GitHub →
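
To make the pass@1 metric from the first evidence item concrete: under greedy decoding each model produces one solution per problem, so pass@1 reduces to the fraction of problems whose single solution passes every test. The sketch below is a toy illustration of that idea and of why extra tests can lower a score. It is not the EvalPlus harness; the problems, test cases, and function names are all hypothetical, and a real harness would run untrusted model output in a sandbox.

```python
from typing import Callable, Dict, List, Tuple

TestCase = Tuple[tuple, object]  # (positional arguments, expected result)


def passes_all(solution: Callable, tests: List[TestCase]) -> bool:
    """True only if the solution returns the expected value on every test."""
    for args, expected in tests:
        try:
            if solution(*args) != expected:
                return False
        except Exception:
            return False  # a crash counts as a failure
    return True


def pass_at_1(solutions: Dict[str, Callable],
              tests: Dict[str, List[TestCase]]) -> float:
    """Fraction of problems solved, with one sample per problem."""
    solved = sum(passes_all(solutions[name], tests[name]) for name in solutions)
    return solved / len(solutions)


# Two hypothetical "model outputs"; the second has an edge-case bug.
def add(a, b):
    return a + b


def is_even(n):
    return n % 2 == 0 and n > 0  # wrong for 0 and negative even numbers


solutions = {"add": add, "is_even": is_even}

# A small base suite, then the same suite with added edge cases.
base_tests = {
    "add": [((1, 2), 3)],
    "is_even": [((4,), True), ((3,), False)],
}
plus_tests = {
    "add": base_tests["add"] + [((-1, 1), 0)],
    "is_even": base_tests["is_even"] + [((0,), True), ((-2,), True)],
}

print(pass_at_1(solutions, base_tests))  # 1.0
print(pass_at_1(solutions, plus_tests))  # 0.5
```

The buggy `is_even` passes the thin base suite but fails once edge cases are added, which is the kind of gap the augmented HumanEval+ and MBPP+ suites are meant to expose.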

Compare with others

Others reviewing AI models

A+ Stanford HELM
A SWE-bench
A- Artificial Analysis
A- The Verge
A- Vals AI
A- Digital Trends
B+ Engadget
B TechRadar
C+ LMArena
C+ Vellum LLM Leaderboard
C- There's An AI For That (TAAFT)
C- Product Hunt
D+ Futurepedia
D Toolify AI
A+ Hugging Face Open LLM Leaderboard
B- Papers with Code
