Crowd vibes on which answer sounds better, with the biggest labs structurally advantaged.
What it's really for: An AI leaderboard built from blind head-to-head votes; it measures which answer people prefer, not which is correct.
What our grade covers: The grade on this page is about its crowd-voted Elo leaderboard of AI models, not everything the site does.
High Scoring Confidence: Checked against primary sources. We are confident in the facts and the grade here.
- Operating since: 2023 (3 years)
- What it costs you: Free. The leaderboards are free to read.
- How they make money: It runs a free crowdsourced leaderboard to attract traffic and data, then sells paid model-evaluation services, private testing arenas, and data/analytics tooling to enterprises and to the same AI labs it ranks (~$30M annualized revenue by late 2025).
- What they do: It produces public leaderboards that rank AI chatbots and models using a Bradley-Terry/Elo score derived from millions of anonymous head-to-head votes in which users pick the better of two blind responses.
- What to watch for: You get an aggregate of strangers' gut-feel preferences about which answer sounds better, not a test of factual accuracy or safety. A peer-reviewed audit found that the biggest labs can quietly test many variants and cherry-pick their best score, so a top rank may reflect gaming the arena more than a genuinely better model.
- Composite score: 2.70 / 5.00 → grade C+
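The "What they do" entry above mentions a Bradley-Terry/Elo score. As a rough illustration of how head-to-head votes can turn into a ranking, here is a minimal Python sketch of an online Elo update. The model names, the votes, the starting rating of 1000, and the K-factor of 32 are all made-up assumptions for illustration. This is not LMArena's actual pipeline.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Apply one Elo update for a single blind head-to-head vote.

    k is an assumed step size, not a value taken from the site.
    """
    exp_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - exp_win)  # bigger reward for an unexpected win
    ratings[winner] += delta
    ratings[loser] -= delta      # zero-sum: total rating is conserved


# Hypothetical votes as (preferred model, other model) pairs.
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_a", "model_b"),
]

# Every model starts from the same assumed baseline.
ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
for winner, loser in votes:
    update_elo(ratings, winner, loser)

# Highest rating first.
leaderboard = sorted(ratings, key=ratings.get, reverse=True)
print(leaderboard)
```

Note that nothing in the update looks at whether an answer was correct. The only input is which response the voter preferred, which is why the score tracks preference and not accuracy.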
How the grade was reached
Does the site take money from the very entities it ranks? Pay-for-placement, vendor-funded data, and affiliate commissions all pull this down. The less the ranking can be bought, the higher the score.
What is the ranking actually built on? Hands-on testing scores highest, then verified first-hand reviews, then opinion or popularity surveys and self-reported figures, then pay-to-rank, which scores lowest.
Is the methodology published, specific, and reproducible? Can a reader see how a given rank was reached, or is it a black box?
Are commercial relationships, sponsorships, and affiliate arrangements disclosed clearly and near the rankings themselves, rather than buried?
How hard is it to game? Controls against fake reviews, solicited reviews, and vendor gaming raise this; an open box anyone can stuff lowers it.
Evidence
- LMArena originated in 2023 as a UC Berkeley research project called Chatbot Arena and was formally incorporated in 2025 as Arena Intelligence Inc.; its easiest path to profitability involves selling evaluation tools, data access, and premium leaderboard services to the same labs whose models it ranks, creating pressure to favor large customers. Source: Contrary Research - LMArena Business Breakdown →
- The peer-reviewed audit of ~2 million battles found undisclosed private testing (Meta tested 27 Llama-4 variants, disclosing only the best), preferential sampling, score-retraction privileges, and asymmetric deprecation, all of which structurally advantage a handful of proprietary providers. Even limited extra Arena data yielded relative gains of up to 112% on the arena distribution, indicating overfitting to the leaderboard rather than true model quality. Source: The Leaderboard Illusion (arXiv 2504.20879) →
- LMArena monetizes through a commercial AI Evaluations service that lets enterprises, model labs, and developers hire its crowdsourced community for testing, reaching ~$30M annualized revenue within four months of its September 2025 launch, after raising $100M+ from a16z and UC Investments. Source: TechCrunch - LMArena lands $1.7B valuation →
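The cherry-picking concern in the evidence above can be illustrated with a small simulation. The sketch below assumes every variant has the same true skill and that each measured score carries Gaussian noise. The true skill of 1200, the noise of 10 rating points, and the trial count are invented for illustration; only the figure of 27 variants comes from the audit. It is not the audit's method, just a toy model of why publishing the best of many noisy scores inflates the result.

```python
import random
import statistics


def measured_score(true_skill: float, noise_sd: float, rng: random.Random) -> float:
    """One noisy leaderboard measurement of a model with a fixed true skill."""
    return rng.gauss(true_skill, noise_sd)


def published_score(true_skill: float, noise_sd: float,
                    n_variants: int, rng: random.Random) -> float:
    """Privately test n_variants equally good variants, publish only the best."""
    return max(measured_score(true_skill, noise_sd, rng) for _ in range(n_variants))


rng = random.Random(0)   # fixed seed so the run is repeatable
TRUE_SKILL = 1200.0      # assumed, identical for every variant
NOISE_SD = 10.0          # assumed measurement noise in rating points
TRIALS = 2000            # assumed number of simulated submissions

# A provider that submits a single model.
single = statistics.mean(
    [published_score(TRUE_SKILL, NOISE_SD, 1, rng) for _ in range(TRIALS)]
)
# A provider that tests 27 variants and discloses only the top score.
best_of_27 = statistics.mean(
    [published_score(TRUE_SKILL, NOISE_SD, 27, rng) for _ in range(TRIALS)]
)

inflation = best_of_27 - single
print(f"single: {single:.1f}  best of 27: {best_of_27:.1f}  inflation: {inflation:.1f}")
```

Under these assumptions the published best-of-27 score sits well above the single-submission average even though no variant is actually better, which is the sense in which a top rank can reflect selection and not model quality.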