A free, reproducible benchmark scoreboard that nobody could pay to climb. Hugging Face itself retired it in 2025, saying its tests had grown gameable and obsolete.
What it's really for: A free, automated open-LLM benchmark run as community infrastructure (now retired).
What our grade covers: The grade on this page covers the site's standardized six-benchmark ranking of open LLMs, not everything the site does.
Scoring confidence: High. Checked against primary sources; we are confident in the facts and the grade here.
- Operating since: 2023 (retired March 2025)
- What it costs you: Free to read. The leaderboard and its results are free to view.
- How they make money: It made no money directly. It was a free Hugging Face Space funded by the company as community infrastructure and a draw to its platform, with no ads, fees, or paid placement.
- What they do: It automatically ran every submitted open-source LLM through the same fixed suite of six public benchmarks on identical hardware and ranked them by averaged, normalized scores.
- What to watch for: It was retired in March 2025 and is now a frozen archive. Even at its peak, its reliance on public benchmarks meant scores could be inflated by training on leaked test data rather than by building genuinely better models.
- Composite score: 4.70 / 5.00 → grade A+
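To make "averaged, normalized scores" concrete, here is a minimal Python sketch of one common way to do it: rescale each raw benchmark score so that random guessing maps to 0 and a perfect score maps to 100, then take the plain mean across the six benchmarks. The function names, the baseline values, and the raw scores below are all illustrative assumptions, not the leaderboard's actual code or data (the real leaderboard, for instance, may apply baselines per subtask).

```python
# Hypothetical sketch: baseline-normalized averaging across benchmarks.
# All names and numbers here are illustrative assumptions.

def normalize(raw: float, baseline: float, ceiling: float = 100.0) -> float:
    """Map the random-guess baseline to 0 and the ceiling to 100.

    Scores at or below the baseline are clipped to 0.
    """
    if raw <= baseline:
        return 0.0
    return (raw - baseline) / (ceiling - baseline) * 100.0


def leaderboard_average(raw_scores: dict, baselines: dict) -> float:
    """Average the normalized scores over every benchmark in raw_scores."""
    normalized = [
        normalize(score, baselines[name]) for name, score in raw_scores.items()
    ]
    return sum(normalized) / len(normalized)


# Assumed random-guess baselines in percent (placeholders, not official values).
ILLUSTRATIVE_BASELINES = {
    "IFEval": 0.0,
    "BBH": 25.0,
    "MATH": 0.0,
    "GPQA": 25.0,
    "MuSR": 25.0,
    "MMLU-PRO": 10.0,
}

# Made-up raw accuracies (percent) for a hypothetical model.
example_raw = {
    "IFEval": 60.0,
    "BBH": 55.0,
    "MATH": 20.0,
    "GPQA": 31.0,
    "MuSR": 40.0,
    "MMLU-PRO": 37.0,
}

average = leaderboard_average(example_raw, ILLUSTRATIVE_BASELINES)
print(round(average, 2))  # prints 29.67
```

Normalizing before averaging keeps a multiple-choice benchmark, where guessing already earns a nonzero score, from contributing points a model did not earn. In the made-up example, a raw 31 on a four-option test counts for only 8 normalized points.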
How the grade was reached
Does the site take money from the very entities it ranks? Pay-for-placement, vendor-funded data, and affiliate commissions all pull this down. The less the ranking can be bought, the higher the score.
What is the ranking actually built on? Hands-on testing scores highest, then verified first-hand reviews, then opinion or popularity surveys and self-reported figures, then pay-to-rank, which scores lowest.
Is the methodology published, specific, and reproducible? Can a reader see how a given rank was reached, or is it a black box?
Are commercial relationships, sponsorships, and affiliate arrangements disclosed clearly and near the rankings themselves, rather than buried?
How hard is it to game? Controls against fake reviews, solicited reviews, and vendor gaming raise this; an open box anyone can stuff lowers it.
Evidence
- The leaderboard evaluated models on six benchmarks (IFEval, BBH, MATH, GPQA, MuSR, MMLU-PRO) using EleutherAI's open lm-evaluation-harness, with published commands and a public results dataset so anyone can reproduce the scores. It also directs users to a fork to 'reproduce our results.' Source: Open LLM Leaderboard official About page
- Hugging Face retired the leaderboard on March 13, 2025, stating 'For the last 2 years, we've evaluated over 13K models' and that it was 'slowly becoming obsolete' and 'could encourage people to hill climb irrelevant directions.' Source: HF retirement announcement (discussion #1135)
- The v2 overhaul was driven by contamination: the original six benchmarks 'had been so thoroughly leaked into training datasets that the top models were approaching human-level scores not because of genuine capability gains, but because their answers were effectively memorized,' showing the public-test approach was vulnerable to gaming. Source: DeepLearning.AI, The Batch