Plumb
A+

AI model benchmarks

Hugging Face Open LLM Leaderboard

Hugging Face, Inc.

Benchmark · Free to read

A free, reproducible benchmark scoreboard that nobody could pay to climb — though Hugging Face itself retired it in 2025, saying its tests had grown gameable and obsolete.

What it's really for
A free, automated open-LLM benchmark run as community infrastructure (now retired).

What our grade covers
The grade on this page is about its standardized six-benchmark ranking of open LLMs, not everything the site does.

High Scoring Confidence
Checked against primary sources. We are confident in the facts and the grade here.

Operating since
2023 (retired March 2025)
What it costs you
Free to read. The leaderboard's results are free to read.
How they make money
It made no money directly: it was a free Hugging Face Space funded by the company as community infrastructure and a draw to its platform, with no ads, fees, or paid placement.
What they do
It automatically ran every submitted open-source LLM through the same fixed suite of six public benchmarks on identical hardware and ranked them by averaged, normalized scores.
What to watch for
It was retired in March 2025 and is now a frozen archive. Even at its peak, its reliance on public benchmarks meant scores could be inflated by training on leaked test data rather than by building genuinely better models.
Composite score
4.70 / 5.00 → grade A+
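
As a rough illustration of the ranking step described under "What they do", the sketch below rescales each raw benchmark score against a random-guess baseline, averages the results, and sorts models by that average. This page does not spell out the normalization formula, so the baseline rescaling is an assumption. The benchmark names, baselines, and model scores are hypothetical placeholders, not leaderboard data.

```python
# Hypothetical sketch: the normalization rule, benchmark names, baselines,
# and scores below are illustrative assumptions, not leaderboard data.

def normalize(raw: float, random_baseline: float, max_score: float = 100.0) -> float:
    """Rescale so random guessing maps to 0 and a perfect score maps to 100."""
    if raw <= random_baseline:
        return 0.0  # scores at or below chance are clamped to zero
    return (raw - random_baseline) / (max_score - random_baseline) * 100.0


def rank_models(raw_scores: dict[str, dict[str, float]],
                baselines: dict[str, float]) -> list[tuple[str, float]]:
    """Average each model's normalized scores and sort best-first."""
    averages = {
        model: sum(normalize(scores[name], baselines[name]) for name in baselines)
        / len(baselines)
        for model, scores in raw_scores.items()
    }
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Placeholder inputs: two made-up benchmarks and two made-up models.
BASELINES = {"bench_a": 25.0, "bench_b": 0.0}
RAW_SCORES = {
    "model_x": {"bench_a": 62.5, "bench_b": 40.0},
    "model_y": {"bench_a": 70.0, "bench_b": 20.0},
}

for model, average in rank_models(RAW_SCORES, BASELINES):
    print(f"{model}: {average:.1f}")
```

Because every model is scored on the same fixed suite with the same rule, the ordering depends only on the submitted model's outputs. That is also why leaked test data is a weakness: nothing in this averaging step can tell memorized answers from real capability.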

How the grade was reached

Independence · 30% weight · 5 / 5

Does the site take money from the very entities it ranks? Pay-for-placement, vendor-funded data, and affiliate commissions all pull this down. The less the ranking can be bought, the higher the score.

Evidence basis · 30% weight · 5 / 5

What is the ranking actually built on? Hands-on testing scores highest, then verified first-hand reviews, then opinion or popularity surveys and self-reported figures, then pay-to-rank, which scores lowest.

Method transparency · 20% weight · 5 / 5

Is the methodology published, specific, and reproducible? Can a reader see how a given rank was reached, or is it a black box?

Conflict disclosure · 10% weight · 4 / 5

Are commercial relationships, sponsorships, and affiliate arrangements disclosed clearly and near the rankings themselves, rather than buried?

Manipulation resistance · 10% weight · 3 / 5

How hard is it to game? Controls against fake reviews, solicited reviews, and vendor gaming raise this; an open box anyone can stuff lowers it.
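
The composite score can be checked from the five weights and scores listed above. The sketch below assumes the composite is a plain weighted average, which matches the published 4.70. The function and variable names are illustrative, and the cutoff that turns 4.70 into an A+ is not stated on this page, so it is left out.

```python
# Weights (in percent) and scores copied from the rubric on this page.
# Assumption: the composite is a simple weighted average of the five scores.
RUBRIC = {
    "Independence": (30, 5),
    "Evidence basis": (30, 5),
    "Method transparency": (20, 5),
    "Conflict disclosure": (10, 4),
    "Manipulation resistance": (10, 3),
}


def composite_score(rubric: dict[str, tuple[int, int]]) -> float:
    """Return the weighted average of the criterion scores on a 0-5 scale."""
    total_weight = sum(weight for weight, _ in rubric.values())
    if total_weight != 100:
        raise ValueError(f"weights must sum to 100, got {total_weight}")
    weighted_sum = sum(weight * score for weight, score in rubric.values())
    return weighted_sum / total_weight


print(f"{composite_score(RUBRIC):.2f} / 5.00")
```

The two lower scores carry only 10% weight each, so together they pull the composite down by 0.30 points from a perfect 5.00.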

