Plumb
A-

AI model benchmarking

Artificial Analysis

Independent (Artificial Analysis, Inc.; seed-backed by AI Grant / Nat Friedman & Daniel Gross)

Benchmark · Free to read

Standardized, unbought AI benchmarks; a useful filter, not a final verdict.

What it's really for
An independent AI benchmark; the public leaderboards lead into paid enterprise reports.

What our grade covers
The grade on this page is about its LLM intelligence, speed, and price benchmarks, not everything the site does.

High Scoring Confidence
Checked against primary sources. We are confident in the facts and the grade here.

Follow the money

The AI labs and inference providers it ranks are also its paying customers for private custom benchmarking and enterprise reports. The founders state explicitly that "no one pays to be on the website" and "you can't pay us for better results," so by their account paying does not buy public placement.

Operating since
2023 (3 years)
What it costs you
Free to read. The public leaderboards cost nothing to view.
How they make money
It makes money from paid enterprise "insights" report subscriptions and private custom benchmarking commissioned by AI-stack companies, while the public leaderboards remain free.
What they do
It independently benchmarks and ranks LLMs and inference API providers on intelligence (a composite of ~10 eval datasets), speed/latency, and live API price, publishing the results as free public leaderboards.
What to watch for
You get standardized lab-style benchmark scores, not a guarantee that they match your real-world use case. A high score can also reflect a model its vendor tuned for the benchmarks ("benchmark-gamed"), so treat the index as a starting filter rather than a verdict.
Composite score
4.10 / 5.00 → grade A-

How the grade was reached

Independence · 30% weight 4 / 5

Does the site take money from the very entities it ranks? Pay-for-placement, vendor-funded data, and affiliate commissions all pull this down. The less the ranking can be bought, the higher the score.

Evidence basis · 30% weight 5 / 5

What is the ranking actually built on? Hands-on testing scores highest, then verified first-hand reviews, then opinion or popularity surveys and self-reported figures, then pay-to-rank, which scores lowest.

Method transparency · 20% weight 4 / 5

Is the methodology published, specific, and reproducible? Can a reader see how a given rank was reached, or is it a black box?

Conflict disclosure · 10% weight 2 / 5

Are commercial relationships, sponsorships, and affiliate arrangements disclosed clearly and near the rankings themselves, rather than buried?

Manipulation resistance · 10% weight 4 / 5

How hard is it to game? Controls against fake reviews, solicited reviews, and vendor gaming raise this; an open box anyone can stuff lowers it.
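
The composite can be checked from the five weights and scores listed above. The sketch below assumes the composite is a plain weighted average of the criterion scores, which reproduces the 4.10 shown on this page. The names RUBRIC and composite_score are illustrative, not taken from the site, and the cutoffs that turn 4.10 into an A- are not stated here, so the letter grade is not modeled.

```python
# Criterion -> (weight in percent, score out of 5), copied from the rubric above.
# RUBRIC and composite_score are hypothetical names used only for illustration.
RUBRIC = {
    "Independence": (30, 4),
    "Evidence basis": (30, 5),
    "Method transparency": (20, 4),
    "Conflict disclosure": (10, 2),
    "Manipulation resistance": (10, 4),
}


def composite_score(rubric):
    """Return the weighted average of the scores (assumed formula)."""
    total_weight = sum(weight for weight, _ in rubric.values())
    if total_weight != 100:
        raise ValueError(f"weights must sum to 100, got {total_weight}")
    # Integer percent weights keep the sum exact until the final division.
    return sum(weight * score for weight, score in rubric.values()) / 100


if __name__ == "__main__":
    print(f"{composite_score(RUBRIC):.2f} / 5.00")  # prints "4.10 / 5.00"
```

Under this assumption the low Conflict disclosure score (2 / 5) costs little, because it carries only a 10% weight: raising it to 5 would lift the composite by 0.30.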
