Plumb
C+

AI model leaderboards

LMArena

Arena Intelligence Inc.

Benchmark · Free to read

Crowd vibes on which answer sounds better, with the biggest labs structurally advantaged.

What it's really for
An AI leaderboard built from blind head-to-head votes; it measures which answer people prefer, not which is correct.

What our grade covers
The grade on this page covers the site's crowd-voted Elo leaderboard of AI models, not everything the site does.

High Scoring Confidence
Checked against primary sources. We are confident in the facts and the grade here.

Operating since
2023 (3 years)
What it costs you
Free to read. The leaderboards are free to read.
How they make money
It runs a free crowdsourced leaderboard to attract traffic and data, then sells paid model-evaluation services, private testing arenas, and data/analytics tooling to enterprises and the same AI labs it ranks (~$30M annualized revenue by late 2025).
What they do
It produces public leaderboards that rank AI chatbots and models using a Bradley-Terry/Elo score derived from millions of anonymous head-to-head votes, in which users pick the better of two blind responses.
What to watch for
You get an aggregate of strangers' gut-feel preferences about which answer sounds better, not a test of factual accuracy or safety. A peer-reviewed audit also found that the biggest labs can quietly test many variants and cherry-pick their best score, so a top rank may reflect gaming the arena more than a genuinely better model.
Composite score
2.70 / 5.00 → grade C+
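
For readers unfamiliar with the scoring named under "What they do", here is a minimal sketch of how an Elo-style rating can be built from head-to-head votes. It is an illustration only, not LMArena's actual pipeline: the function names, the model names, the K-factor of 32, the 400-point scale, and the starting rating of 1000 are all conventional textbook assumptions rather than values taken from the site.

```python
from collections import defaultdict


def expected_score(r_a, r_b, scale=400.0):
    """Win probability of A over B under the Elo logistic model.

    This is the Bradley-Terry win probability expressed on a base-10 log scale.
    The 400-point scale is the conventional chess value (an assumption here).
    """
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))


def elo_from_votes(votes, k=32.0, initial=1000.0):
    """Replay (winner, loser) votes in order and return final ratings.

    k and initial are illustrative defaults, not LMArena's settings.
    """
    ratings = defaultdict(lambda: initial)
    for winner, loser in votes:
        p_win = expected_score(ratings[winner], ratings[loser])
        # The winner gains more when the win was less expected.
        delta = k * (1.0 - p_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


# Hypothetical votes: each tuple is (preferred model, other model).
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]
ratings = elo_from_votes(votes)
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Note that every input is a preference vote. Nothing in the update checks whether the preferred answer was correct, which is the limitation flagged under "What to watch for".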

How the grade was reached

Independence · 30% weight · 2 / 5

Does the site take money from the very entities it ranks? Pay-for-placement, vendor-funded data, and affiliate commissions all pull this down. The less the ranking can be bought, the higher the score.

Evidence basis · 30% weight · 3 / 5

What is the ranking actually built on? Hands-on testing scores highest, then verified first-hand reviews, then opinion or popularity surveys and self-reported figures, then pay-to-rank, which scores lowest.

Method transparency · 20% weight · 4 / 5

Is the methodology published, specific, and reproducible? Can a reader see how a given rank was reached, or is it a black box?

Conflict disclosure · 10% weight · 2 / 5

Are commercial relationships, sponsorships, and affiliate arrangements disclosed clearly and near the rankings themselves, rather than buried?

Manipulation resistance · 10% weight · 2 / 5

How hard is it to game? Controls against fake reviews, solicited reviews, and vendor gaming raise this; an open box anyone can stuff lowers it.
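
The five criteria above combine into the composite score as a weighted average. The sketch below reproduces that arithmetic from the weights and scores listed on this page. The variable names are illustrative, and the mapping from 2.70 to the letter grade C+ is not computed because the page does not state its grade cutoffs.

```python
# (weight, score out of 5) for each criterion, as listed on this page.
criteria = {
    "Independence": (0.30, 2),
    "Evidence basis": (0.30, 3),
    "Method transparency": (0.20, 4),
    "Conflict disclosure": (0.10, 2),
    "Manipulation resistance": (0.10, 2),
}

# Sanity check: the weights should cover 100% of the grade.
total_weight = sum(weight for weight, _ in criteria.values())

# Weighted sum: 0.60 + 0.90 + 0.80 + 0.20 + 0.20
composite = sum(weight * score for weight, score in criteria.values())

print(f"total weight: {total_weight:.2f}")
print(f"composite: {composite:.2f} / 5.00")
```

Because Independence and Evidence basis together carry 60% of the weight, the 4 / 5 for Method transparency lifts the composite only modestly.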

