Vellum LLM Leaderboard

Item: Vellum LLM Leaderboard
Rating: 2.7
Author: Plumb

Vellum (Vellum AI, Inc.), YC W2023; founders Akash Sharma, Sidd Seethepalli, Noa Flaherty

Benchmark · Free to read

A clean, free benchmark scoreboard for frontier LLMs that doubles as lead-gen for Vellum's dev platform; useful at a glance, but it mixes provider-reported scores with its own evals and discloses no per-result sourcing.

What it's really for: A benchmark leaderboard that doubles as marketing for Vellum's developer platform.

What our grade covers: The grade on this page is about its aggregated public-benchmark LLM rankings, not everything the site does.

Scoring confidence: Medium. Mostly sourced, but a detail or two still needs a primary source, so the grade could shift slightly.

Follow the money

No party pays for placement; the board is funded by Vellum itself as free marketing, and revenue comes from customers of Vellum's paid AI-development platform rather than from the model vendors it ranks.


Operating since: 2023 (3 years)
What it costs you: Free to read.
How they make money: Free marketing and lead-generation asset for Vellum, an AI development/agent platform that makes money from paid subscription and usage-based plans for building LLM apps.
What they do: Aggregates and ranks frontier large language models on public benchmarks (reasoning, math, coding, tool use) alongside cost, speed, and context-window comparisons.
What to watch for: By its own one-line disclosure the data mixes "model providers" (self-reported) with "independently run evaluations," and the board shows no per-result sourcing or reproducible protocol, so you can't tell which numbers were verified first-hand.
Composite score: 2.70 / 5.00 → grade C+
Last verified: July 11, 2026 — when we last checked this entry's facts, links, and grade against the live site.
Site confirmed live: July 11, 2026 — our monthly automated check reached Vellum LLM Leaderboard and got a normal response.

How the grade was reached

Independence · 30% weight · 3 / 5

Does the site take money from the very entities it ranks? Pay-for-placement, vendor-funded data, and affiliate commissions all pull this down. The less the ranking can be bought, the higher the score.

Evidence basis · 30% weight · 3 / 5

What is the ranking actually built on? Hands-on testing scores highest, then verified first-hand reviews, then opinion or popularity surveys and self-reported figures, then pay-to-rank, which scores lowest.

Method transparency · 20% weight · 3 / 5

Is the methodology published, specific, and reproducible? Can a reader see how a given rank was reached, or is it a black box?

Conflict disclosure · 10% weight · 1 / 5

Are commercial relationships, sponsorships, and affiliate arrangements disclosed clearly and near the rankings themselves, rather than buried?

Manipulation resistance · 10% weight · 2 / 5

How hard is it to game? Controls against fake reviews, solicited reviews, and vendor gaming raise this; an open box anyone can stuff lowers it.
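
The composite of 2.70 reported above is consistent with a plain weighted average of the five criterion scores. The page does not state its formula, so the sketch below is an assumption about how the numbers combine, not a description of the site's actual method. The names `CRITERIA` and `composite_score` are hypothetical. The weights and scores are copied from the rubric, and the mapping from 2.70 to the letter grade C+ is not reproduced because the grade boundaries are not given.

```python
# Criterion -> (weight in percent, score out of 5), copied from the rubric above.
# Assumption: the composite is a simple weighted average of these scores.
CRITERIA = {
    "Independence": (30, 3),
    "Evidence basis": (30, 3),
    "Method transparency": (20, 3),
    "Conflict disclosure": (10, 1),
    "Manipulation resistance": (10, 2),
}


def composite_score(criteria):
    """Return the weighted average of the scores on the same 0-5 scale."""
    total_weight = sum(weight for weight, _ in criteria.values())
    if total_weight != 100:
        raise ValueError(f"weights sum to {total_weight}, expected 100")
    weighted_sum = sum(weight * score for weight, score in criteria.values())
    return weighted_sum / total_weight


print(f"{composite_score(CRITERIA):.2f}")  # prints 2.70
```

Under this assumption, the two 10% criteria cost the composite 0.30 points relative to a straight 3 / 5 across the board: Conflict disclosure (1 / 5) accounts for 0.20 and Manipulation resistance (2 / 5) for 0.10.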

Evidence

The leaderboard's own data-source statement reads: "The data comes from model providers as well as independently run evaluations by Vellum or the open-source community." It features non-saturated benchmarks (e.g. GPQA Diamond, AIME 2025, SWE-Bench) and excludes outdated ones such as MMLU. The page carries no conflict-of-interest or independence disclosure, and it shows no pay-for-placement mechanism. Source: Vellum LLM Leaderboard →
Vellum is an AI development platform (YC Winter 2023) founded in 2023 by Akash Sharma, Sidd Seethepalli, and Noa Flaherty, helping teams build and deploy production LLM applications and agents. Source: Y Combinator company profile →
Vellum is a commercial vendor in the LLM tooling space that raised a $5M seed round (2023) and a $20M Series A (July 2025). It monetizes its development/evaluation platform through paid plans, and the public leaderboard serves as a free marketing and lead-generation asset rather than a paid ranking service. Source: Axios / Vellum funding reporting →

Compare with others

Others reviewing AI models

A+ EvalPlus Leaderboard
A+ Stanford HELM
A SWE-bench
A- Artificial Analysis
A- The Verge
A- Vals AI
A- Digital Trends
B+ Engadget
B TechRadar
C+ LMArena
C- There's An AI For That (TAAFT)
C- Product Hunt
D+ Futurepedia
D Toolify AI
A+ Hugging Face Open LLM Leaderboard
B- Papers with Code
