Plumb
C+

AI model benchmarks

Vellum LLM Leaderboard

Vellum (Vellum AI, Inc.), YC W2023; founders Akash Sharma, Sidd Seethepalli, Noa Flaherty

Benchmark · Free to read

A clean, free benchmark scoreboard for frontier LLMs that doubles as lead-gen for Vellum's dev platform; useful at a glance, but it mixes provider-reported scores with its own evals and discloses no per-result sourcing.

What it's really for: A benchmark leaderboard that doubles as marketing for Vellum's developer platform.

What our grade covers: The grade on this page covers the site's aggregated public-benchmark LLM rankings, not everything the site does.

Scoring confidence: Medium. Mostly sourced, but a detail or two still needs a primary source, so the grade could shift slightly.

Follow the money

No party pays for placement; the board is funded by Vellum itself as free marketing, and revenue comes from customers of Vellum's paid AI-development platform rather than from the model vendors it ranks.

Operating since: 2023 (3 years)
What it costs you: Free to read.
How they make money: The leaderboard is a free marketing and lead-generation asset for Vellum, an AI development/agent platform that makes money from paid subscription and usage-based plans for building LLM apps.
What they do: Aggregates and ranks frontier large language models on public benchmarks (reasoning, math, coding, tool use) alongside cost, speed, and context-window comparisons.
What to watch for: By its own one-line disclosure, the data mixes "model providers" (self-reported) with "independently run evaluations," and the board shows no per-result sourcing or reproducible protocol, so you can't tell which numbers were verified first-hand.
Composite score: 2.70 / 5.00 → grade C+

How the grade was reached

Independence · 30% weight · 3 / 5

Does the site take money from the very entities it ranks? Pay-for-placement, vendor-funded data, and affiliate commissions all pull this down. The less the ranking can be bought, the higher the score.

Evidence basis · 30% weight · 3 / 5

What is the ranking actually built on? Hands-on testing scores highest, then verified first-hand reviews, then opinion or popularity surveys and self-reported figures, then pay-to-rank, which scores lowest.

Method transparency · 20% weight · 3 / 5

Is the methodology published, specific, and reproducible? Can a reader see how a given rank was reached, or is it a black box?

Conflict disclosure · 10% weight · 1 / 5

Are commercial relationships, sponsorships, and affiliate arrangements disclosed clearly and near the rankings themselves, rather than buried?

Manipulation resistance · 10% weight · 2 / 5

How hard is it to game? Controls against fake reviews, solicited reviews, and vendor gaming raise this; an open box anyone can stuff lowers it.
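
The page does not spell out how the five criteria combine, but a plain weighted average of the listed sub-scores reproduces the composite of 2.70. The Python sketch below shows that arithmetic under that assumption. The names `CRITERIA` and `composite_score` are illustrative, not anything the site publishes, and the mapping from 2.70 to the letter grade C+ is not shown on the page, so it is not modelled here.

```python
# Assumption: the composite is a simple weighted average of the sub-scores.
# Weights (percent) and scores (out of 5) are copied from the breakdown above.
CRITERIA = {
    "Independence": (30, 3),
    "Evidence basis": (30, 3),
    "Method transparency": (20, 3),
    "Conflict disclosure": (10, 1),
    "Manipulation resistance": (10, 2),
}


def composite_score(criteria):
    """Return the weighted average of (weight_percent, score) pairs."""
    total_weight = sum(weight for weight, _ in criteria.values())
    if total_weight != 100:
        raise ValueError(f"weights must sum to 100, got {total_weight}")
    # Integer percent weights keep the sum exact until the final division.
    weighted_sum = sum(weight * score for weight, score in criteria.values())
    return weighted_sum / 100


score = composite_score(CRITERIA)
print(f"{score:.2f} / 5.00")
```

Because conflict disclosure and manipulation resistance carry only 10% each, their low sub-scores pull the composite down by much less than the three 3 / 5 criteria hold it up.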
