Hugging Face Open LLM Leaderboard

Item: Hugging Face Open LLM Leaderboard
Rating: 4.7
Author: Plumb

Hugging Face, Inc.

Benchmark · Closed / dormant · Free to read

A free, reproducible benchmark scoreboard that nobody could pay to climb — though Hugging Face itself retired it in 2025, saying its tests had grown gameable and obsolete.

What it's really for: A free, automated open-LLM benchmark run as community infrastructure (now retired).

What our grade covers: The grade on this page covers its standardized six-benchmark ranking of open LLMs, not everything the site does.

High Scoring Confidence: Checked against primary sources. We are confident in the facts and the grade here.

Operating since: 2023 (3 years)
What it costs you: Free to read. The archived rankings cost nothing to view.
How they make money: It made no money directly: it was a free Hugging Face Space funded by the company as community infrastructure and a draw to its platform, with no ads, fees, or paid placement.
What they do: It automatically ran every submitted open-source LLM through the same fixed suite of six public benchmarks on identical hardware and ranked them by averaged, normalized scores.
What to watch for: It was retired in March 2025 and is now a frozen archive, and even at its peak its reliance on public benchmarks meant scores could be inflated by training on leaked test data rather than building genuinely better models.
Composite score: 4.70 / 5.00 → grade A+
Last verified: July 11, 2026 — when we last checked this entry's facts, links, and grade against the live site.
Site confirmed live: July 11, 2026 — our monthly automated check reached Hugging Face Open LLM Leaderboard and got a normal response.
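
The "averaged, normalized scores" mentioned under "What they do" can be illustrated with a small sketch. It assumes one common form of normalization, in which each raw accuracy is rescaled so that a random-guess baseline maps to 0 and a perfect score maps to 100, and the per-benchmark values are then averaged. The function names, baselines, and raw scores are hypothetical placeholders, not leaderboard data, and the leaderboard's exact formula may differ.

```python
def normalize(raw: float, baseline: float, best: float = 1.0) -> float:
    """Rescale a raw accuracy so that baseline -> 0 and best -> 100.

    Assumption: scores at or below the random-guess baseline are clamped to 0.
    """
    if best <= baseline:
        raise ValueError("best must be greater than baseline")
    if raw <= baseline:
        return 0.0
    return (raw - baseline) / (best - baseline) * 100.0


def leaderboard_average(raw_scores: dict, baselines: dict) -> float:
    """Average the normalized scores across all benchmarks."""
    normalized = [normalize(raw_scores[name], baselines[name]) for name in raw_scores]
    return sum(normalized) / len(normalized)


# Placeholder values for illustration only (not real benchmark results).
baselines = {"four_choice_task": 0.25, "ten_choice_task": 0.10, "free_form_task": 0.0}
raw_scores = {"four_choice_task": 0.625, "ten_choice_task": 0.55, "free_form_task": 0.30}

average = leaderboard_average(raw_scores, baselines)
print(round(average, 1))
```

Normalizing before averaging keeps a multiple-choice task with a high guessing floor from contributing more points than a free-form task where guessing earns nothing.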

How the grade was reached

Independence · 30% weight · 5 / 5

Does the site take money from the very entities it ranks? Pay-for-placement, vendor-funded data, and affiliate commissions all pull this down. The less the ranking can be bought, the higher the score.

Evidence basis · 30% weight · 5 / 5

What is the ranking actually built on? Hands-on testing scores highest, then verified first-hand reviews, then opinion or popularity surveys and self-reported figures, then pay-to-rank, which scores lowest.

Method transparency · 20% weight · 5 / 5

Is the methodology published, specific, and reproducible? Can a reader see how a given rank was reached, or is it a black box?

Conflict disclosure · 10% weight · 4 / 5

Are commercial relationships, sponsorships, and affiliate arrangements disclosed clearly and near the rankings themselves, rather than buried?

Manipulation resistance · 10% weight · 3 / 5

How hard is it to game? Controls against fake reviews, solicited reviews, and vendor gaming raise this; an open box anyone can stuff lowers it.
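
The composite score follows from the five weighted criteria listed in this section. The sketch below reproduces that arithmetic as a plain weighted sum. The variable and function names are illustrative, and the cutoffs that turn 4.70 into a letter grade are not stated on this page, so they are left out.

```python
# (weight, score out of 5) for each criterion, copied from the breakdown in this section.
criteria = {
    "Independence": (0.30, 5),
    "Evidence basis": (0.30, 5),
    "Method transparency": (0.20, 5),
    "Conflict disclosure": (0.10, 4),
    "Manipulation resistance": (0.10, 3),
}


def composite(criteria: dict) -> float:
    """Return the weighted sum of criterion scores; weights must total 1."""
    total_weight = sum(weight for weight, _ in criteria.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weight * score for weight, score in criteria.values())


score = composite(criteria)
print(f"{score:.2f} / 5.00")  # prints "4.70 / 5.00"
```

Written out, the sum is 1.5 + 1.5 + 1.0 + 0.4 + 0.3 = 4.7, so the two lowest-weighted criteria account for the whole 0.3 shortfall from a perfect 5.00.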

Evidence

The leaderboard evaluated models on six benchmarks (IFEval, BBH, MATH, GPQA, MuSR, MMLU-PRO) using EleutherAI's open lm-evaluation-harness, with published commands and a public results dataset so anyone can reproduce the scores, and it directs users to a fork to 'reproduce our results.' Source: Open LLM Leaderboard official About page
Hugging Face retired the leaderboard on March 13, 2025, stating 'For the last 2 years, we've evaluated over 13K models' and that it was 'slowly becoming obsolete' and 'could encourage people to hill climb irrelevant directions.' Source: HF retirement announcement (discussion #1135)
The v2 overhaul was driven by contamination: the original six benchmarks 'had been so thoroughly leaked into training datasets that the top models were approaching human-level scores not because of genuine capability gains, but because their answers were effectively memorized,' showing the public-test approach was vulnerable to gaming. Source: DeepLearning.AI, The Batch
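
As a rough illustration of what reproducing a score involves, the sketch below assembles an lm-evaluation-harness command line without running it. It is a minimal sketch resting on two assumptions: that the harness is installed with its `lm_eval` entry point, and that the six-benchmark suite is available under the task group name `leaderboard`, as in the upstream harness. The leaderboard's own fork and pinned settings may differ. The model ID and output directory are placeholders, and `build_lm_eval_command` is a hypothetical helper.

```python
import shlex


def build_lm_eval_command(model_id: str, output_dir: str) -> list[str]:
    """Assemble (but do not run) an lm-evaluation-harness invocation.

    Assumption: the 'leaderboard' task group bundles the six benchmarks.
    """
    return [
        "lm_eval",
        "--model", "hf",                          # Hugging Face transformers backend
        "--model_args", f"pretrained={model_id}",
        "--tasks", "leaderboard",
        "--batch_size", "auto",
        "--output_path", output_dir,
    ]


# Placeholder model ID and output directory, for illustration only.
cmd = build_lm_eval_command("your-org/your-model", "results")
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run` would start a real evaluation, which needs the model weights and suitable hardware.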

Compare with others

Others reviewing AI models

A+ EvalPlus Leaderboard
A+ Stanford HELM
A SWE-bench
A- Artificial Analysis
A- The Verge
A- Vals AI
A- Digital Trends
B+ Engadget
B TechRadar
C+ LMArena
C+ Vellum LLM Leaderboard
C- There's An AI For That (TAAFT)
C- Product Hunt
D+ Futurepedia
D Toolify AI
B- Papers with Code
