LMArena

Item: LMArena
Rating: 2.7
Author: Plumb

Arena Intelligence Inc.

Benchmark · Free to read

Crowd vibes on which answer sounds better, with the biggest labs structurally advantaged.

What it's really for: An AI leaderboard built from blind head-to-head votes; it measures which answer people prefer, not which is correct.

What our grade covers: The grade on this page is about its crowd-voted Elo leaderboard of AI models, not everything the site does.

Scoring confidence: High. Checked against primary sources. We are confident in the facts and the grade here.

Operating since: 2023 (3 years)
What it costs you: Free to read. The reviews are free to read.
How they make money: It runs a free crowdsourced leaderboard to attract traffic and data, then sells paid model-evaluation services, private testing arenas, and data/analytics tooling to enterprises and the same AI labs it ranks (~$30M annualized revenue by late 2025).
What they do: It produces public leaderboards that rank AI chatbots and models using a Bradley-Terry/Elo score derived from millions of anonymous head-to-head votes where users pick the better of two blind responses.
What to watch for: You get an aggregate of strangers' gut-feel preferences about which answer sounds better, not a test of factual accuracy or safety. A peer-reviewed audit also found that the biggest labs can quietly test many variants and cherry-pick their best score, so a top rank may reflect gaming the arena more than a genuinely better model.
Composite score: 2.70 / 5.00 → grade C+
Last verified: July 11, 2026 — when we last checked this entry's facts, links, and grade against the live site.
Site confirmed live: July 11, 2026 — our monthly automated check reached LMArena and got a normal response.
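
The Bradley-Terry/Elo score named under "What they do" can be illustrated with a small sketch. This is a generic textbook fit (the iterative minorization-maximization update) on made-up vote counts, not LMArena's actual pipeline. The function name, the vote matrix, and the 400-point scale with a 1000-point offset are all assumptions chosen for illustration.

```python
import math

def fit_bradley_terry(wins, n_iter=200):
    """Estimate Bradley-Terry strengths from pairwise vote counts.

    wins[i][j] is the number of votes in which model i beat model j.
    Assumes every model has at least one win and one loss.
    """
    n = len(wins)
    strengths = [1.0] * n
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each pairing contributes (games played) / (sum of current strengths).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom)
        # Rescale so the strengths average 1; only ratios matter.
        scale = n / sum(updated)
        strengths = [s * scale for s in updated]
    return strengths

# Hypothetical vote counts for three models A, B, C (illustration only).
votes = [
    [0, 60, 70],   # A beat B 60 times, A beat C 70 times
    [40, 0, 55],   # B beat A 40 times, B beat C 55 times
    [30, 45, 0],   # C beat A 30 times, C beat B 45 times
]
strengths = fit_bradley_terry(votes)

# Assumed Elo-style display scale: 400 points per factor of 10 in strength.
ratings = [1000 + 400 * math.log10(s) for s in strengths]

# Modeled probability that A is preferred over B in a single vote.
p_a_over_b = strengths[0] / (strengths[0] + strengths[1])
print([round(r) for r in ratings], round(p_a_over_b, 2))
```

The ratings here encode only how often one answer was preferred over another. Nothing in the fit checks whether the preferred answer was correct.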

How the grade was reached

Independence · 30% weight · 2 / 5

Does the site take money from the very entities it ranks? Pay-for-placement, vendor-funded data, and affiliate commissions all pull this down. The less the ranking can be bought, the higher the score.

Evidence basis · 30% weight · 3 / 5

What is the ranking actually built on? Hands-on testing scores highest, then verified first-hand reviews, then opinion or popularity surveys and self-reported figures, then pay-to-rank, which scores lowest.

Method transparency · 20% weight · 4 / 5

Is the methodology published, specific, and reproducible? Can a reader see how a given rank was reached, or is it a black box?

Conflict disclosure · 10% weight · 2 / 5

Are commercial relationships, sponsorships, and affiliate arrangements disclosed clearly and near the rankings themselves, rather than buried?

Manipulation resistance · 10% weight · 2 / 5

How hard is it to game? Controls against fake reviews, solicited reviews, and vendor gaming raise this; an open box anyone can stuff lowers it.
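
The composite score follows from the five weighted criteria above. A minimal sketch of the arithmetic, using the weights and scores from this page (the dictionary keys and function name are illustrative, and the mapping from score to letter grade is not reproduced because its thresholds are not stated here):

```python
# Weight and score per criterion, copied from the rubric on this page.
RUBRIC = {
    "independence": (0.30, 2),
    "evidence_basis": (0.30, 3),
    "method_transparency": (0.20, 4),
    "conflict_disclosure": (0.10, 2),
    "manipulation_resistance": (0.10, 2),
}

def composite_score(rubric):
    """Weighted sum of criterion scores; weights are expected to total 1."""
    return sum(weight * score for weight, score in rubric.values())

total_weight = sum(weight for weight, _ in RUBRIC.values())
score = composite_score(RUBRIC)
print(f"{score:.2f} / 5.00")  # prints "2.70 / 5.00"
```

Written out: 0.30 × 2 + 0.30 × 3 + 0.20 × 4 + 0.10 × 2 + 0.10 × 2 = 0.60 + 0.90 + 0.80 + 0.20 + 0.20 = 2.70.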

Evidence

LMArena originated in 2023 as a UC Berkeley research project called Chatbot Arena and was formally incorporated in 2025 as Arena Intelligence Inc. Its easiest path to profitability is selling evaluation tools, data access, and premium leaderboard services to the same labs whose models it ranks, which creates pressure to favor large customers. Source: Contrary Research - LMArena Business Breakdown
The peer-reviewed audit of ~2 million battles found undisclosed private testing (Meta tested 27 Llama-4 variants and disclosed only the best), preferential sampling, score-retraction privileges, and asymmetric deprecation, all of which structurally advantage a handful of proprietary providers. Even limited extra Arena data yielded relative gains of up to 112% on the arena distribution, which indicates overfitting to the leaderboard rather than true model quality. Source: The Leaderboard Illusion (arXiv 2504.20879)
LMArena monetizes through a commercial AI Evaluations service that lets enterprises, model labs, and developers hire its crowdsourced community for testing. The service reached ~$30M annualized revenue within four months of its September 2025 launch, after the company raised $100M+ from a16z and UC Investments. Source: TechCrunch - LMArena lands $1.7B valuation
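
Why does disclosing only the best of many privately tested variants matter? The selection effect can be shown with a toy simulation. This sketch is not the audit's analysis. It assumes every variant is equally good, that each measured score is the true score plus Gaussian noise, and that only the highest measurement is published. The true score, the noise level, and the function name are hypothetical.

```python
import random
import statistics

def published_score(n_variants, rng, true_score=1200.0, noise_sd=15.0):
    """Highest measured score among n_variants equally good variants.

    Each measurement is the same true score plus independent Gaussian noise,
    so any gain from picking the maximum is pure selection, not quality.
    """
    return max(rng.gauss(true_score, noise_sd) for _ in range(n_variants))

rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
trials = 2000

# Average published score when one variant is tested versus 27.
single = statistics.mean(published_score(1, rng) for _ in range(trials))
best_of_27 = statistics.mean(published_score(27, rng) for _ in range(trials))

print(f"one variant:    {single:.1f}")
print(f"best of 27:     {best_of_27:.1f}")
print(f"inflation:      {best_of_27 - single:.1f} points")
```

Under these assumptions the best-of-27 figure sits well above the single-variant figure even though no variant is actually better, which is the mechanism behind the cherry-picking concern.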

Compare with others

Others reviewing AI models

A+ EvalPlus Leaderboard
A+ Stanford HELM
A SWE-bench
A- Artificial Analysis
A- The Verge
A- Vals AI
A- Digital Trends
B+ Engadget
B TechRadar
C+ Vellum LLM Leaderboard
C- There's An AI For That (TAAFT)
C- Product Hunt
D+ Futurepedia
D Toolify AI
A+ Hugging Face Open LLM Leaderboard
B- Papers with Code
