A clean, free benchmark scoreboard for frontier LLMs that doubles as lead-gen for Vellum's dev platform; useful at a glance, but it mixes provider-reported scores with its own evals and discloses no per-result sourcing.
What it's really for: A benchmark leaderboard that doubles as marketing for Vellum's developer platform.
What our grade covers: The grade covers the site's aggregated public-benchmark LLM rankings, not everything the site does.
Medium Scoring Confidence: Mostly sourced, but a detail or two still needs a primary source, so the grade could shift slightly.
No party pays for placement; the board is funded by Vellum itself as free marketing, and revenue comes from customers of Vellum's paid AI-development platform rather than from the model vendors it ranks.
Source →
- Operating since
- 2023 (3 years) · source
- What it costs you
- Free to read. The leaderboard costs nothing to view.
- How they make money
- Free marketing and lead-generation asset for Vellum, an AI development/agent platform that makes money from paid subscription and usage-based plans for building LLM apps.
- What they do
- Aggregates and ranks frontier large language models on public benchmarks (reasoning, math, coding, tool use) alongside cost, speed, and context-window comparisons.
- What to watch for
- By its own one-line disclosure the data mixes "model providers" (self-reported) with "independently run evaluations," and the board shows no per-result sourcing or reproducible protocol, so you can't tell which numbers were verified first-hand.
- Composite score
- 2.70 / 5.00 → grade C+
How the grade was reached
Does the site take money from the very entities it ranks? Pay-for-placement, vendor-funded data, and affiliate commissions all pull this down. The less the ranking can be bought, the higher the score.
What is the ranking actually built on? Hands-on testing scores highest, then verified first-hand reviews, then opinion or popularity surveys and self-reported figures, then pay-to-rank, which scores lowest.
Is the methodology published, specific, and reproducible? Can a reader see how a given rank was reached, or is it a black box?
Are commercial relationships, sponsorships, and affiliate arrangements disclosed clearly and near the rankings themselves, rather than buried?
How hard is it to game? Controls against fake reviews, solicited reviews, and vendor gaming raise this; an open box anyone can stuff lowers it.
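The five questions above read like inputs to the composite score, but the page does not publish per-criterion scores or weights. As a minimal sketch, assuming each criterion is scored on a 0 to 5 scale and the composite is a weighted mean (equal weights unless stated otherwise), the arithmetic might look like this. The criterion names, the weights, and the sample values are all hypothetical; they are not the figures behind the 2.70 reported on this page.

```python
from typing import Mapping, Optional

# Hypothetical labels paraphrasing the five questions above.
CRITERIA = (
    "independence",   # does the site take money from the entities it ranks?
    "evidence_basis", # what is the ranking actually built on?
    "methodology",    # is the method published, specific, reproducible?
    "disclosure",     # are commercial relationships disclosed near the rankings?
    "gaming_resistance",  # how hard is it to game?
)


def composite_score(
    scores: Mapping[str, float],
    weights: Optional[Mapping[str, float]] = None,
) -> float:
    """Return the weighted mean of per-criterion scores on a 0-5 scale."""
    if weights is None:
        # ASSUMPTION: equal weights; the page does not state its weighting.
        weights = {name: 1.0 for name in CRITERIA}

    missing = [name for name in CRITERIA if name not in scores]
    if missing:
        raise ValueError(f"missing criterion scores: {missing}")

    for name in CRITERIA:
        if not 0.0 <= scores[name] <= 5.0:
            raise ValueError(f"{name} must be between 0 and 5")

    total_weight = sum(weights[name] for name in CRITERIA)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")

    weighted_sum = sum(scores[name] * weights[name] for name in CRITERIA)
    return weighted_sum / total_weight


# Illustrative inputs only, not this page's actual sub-scores.
EXAMPLE_SCORES = {
    "independence": 4.0,
    "evidence_basis": 2.0,
    "methodology": 2.0,
    "disclosure": 3.0,
    "gaming_resistance": 3.0,
}

if __name__ == "__main__":
    print(f"{composite_score(EXAMPLE_SCORES):.2f} / 5.00")
```

Passing a different `weights` mapping shows how much a composite depends on the weighting choice, which is why an unpublished weighting makes a single headline score hard to audit.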
Evidence
- The leaderboard's own data-source statement reads: "The data comes from model providers as well as independently run evaluations by Vellum or the open-source community." It features non-saturated benchmarks (e.g. GPQA Diamond, AIME 2025, SWE-Bench) and excludes outdated ones like MMLU. The page carries no conflict-of-interest or independence disclosure and shows no pay-for-placement mechanism. Source: Vellum LLM Leaderboard →
- Vellum is an AI development platform (YC Winter 2023) founded in 2023 by Akash Sharma, Sidd Seethepalli, and Noa Flaherty, helping teams build and deploy production LLM applications and agents. Source: Y Combinator company profile →
- Vellum is a commercial vendor in the LLM tooling space that has raised a $5M seed (2023) and a $20M Series A (July 2025). It monetizes its development/evaluation platform via paid plans, and the public leaderboard serves as a free marketing and lead-generation asset rather than a paid ranking service. Source: Axios / Vellum funding reporting →