Stanford HELM

Item: Stanford HELM
Rating: 4.6
Author: Plumb

Stanford University — Center for Research on Foundation Models (CRFM), part of Stanford HAI

Benchmark · Free to read

A Stanford academic benchmark that runs its own standardized tests on AI models and publishes the raw prompts and code — about as close to an unbuyable, reproducible leaderboard as the field offers.

What it's really for: A non-commercial academic AI benchmark; independence is the whole point.

What our grade covers: The grade on this page is about its standardized multi-scenario LLM evaluations, not everything the site does.

Scoring confidence: High. Checked against primary sources. We are confident in the facts and the grade here.

Follow the money

Funding comes from Stanford HAI's Industrial Affiliates Program (tech-company members) plus donated model APIs from providers like Google, OpenAI, Anthropic, Amazon, Together AI and Writer; per its disclosures these contributions do not buy ranking placement, since HELM runs every model through the same standardized tests.


Operating since: 2022 (4 years)
What it costs you: Free. The leaderboards are free to read.
How they make money: It doesn't make money; it's a non-commercial academic project funded by Stanford's HAI Industrial Affiliates Program, with model APIs donated by providers.
What they do: HELM independently evaluates large language models (and multimodal/audio/domain variants) across many scenarios and metrics using one standardized, open-source pipeline and publishes the leaderboards.
What to watch for: It ranks AI models on benchmark tasks, not real-world product quality, and by its own disclosure the affiliated companies that fund Stanford HAI and donate API access are also among the firms whose models appear on the leaderboards.
Composite score: 4.60 / 5.00 → grade A+
Last verified: July 11, 2026 — when we last checked this entry's facts, links, and grade against the live site.
Site confirmed live: July 11, 2026 — our monthly automated check reached Stanford HELM and got a normal response.

How the grade was reached

Independence · 30% weight · 4 / 5

Does the site take money from the very entities it ranks? Pay-for-placement, vendor-funded data, and affiliate commissions all pull this down. The less the ranking can be bought, the higher the score.

Evidence basis · 30% weight · 5 / 5

What is the ranking actually built on? Hands-on testing scores highest, then verified first-hand reviews, then opinion or popularity surveys and self-reported figures, then pay-to-rank, which scores lowest.

Method transparency · 20% weight · 5 / 5

Is the methodology published, specific, and reproducible? Can a reader see how a given rank was reached, or is it a black box?

Conflict disclosure · 10% weight · 4 / 5

Are commercial relationships, sponsorships, and affiliate arrangements disclosed clearly and near the rankings themselves, rather than buried?

Manipulation resistance · 10% weight · 5 / 5

How hard is it to game? Controls against fake reviews, solicited reviews, and vendor gaming raise this; an open box anyone can stuff lowers it.
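
The composite score appears to be a weighted average of the five criterion scores above. The sketch below reproduces that arithmetic from the weights and scores listed on this page. The names RUBRIC and composite_score are illustrative, not part of any published scoring tool, and the cutoffs that turn a number into a letter grade are not stated here, so the sketch stops at the numeric score.

```python
# Weights (in percent) and scores (out of 5) copied from the rubric on this page.
# RUBRIC and composite_score are hypothetical names used only for illustration.
RUBRIC = {
    "Independence": (30, 4),
    "Evidence basis": (30, 5),
    "Method transparency": (20, 5),
    "Conflict disclosure": (10, 4),
    "Manipulation resistance": (10, 5),
}


def composite_score(rubric):
    """Return the weighted average of (weight_percent, score) pairs."""
    total_weight = sum(weight for weight, _ in rubric.values())
    if total_weight != 100:
        raise ValueError(f"weights must sum to 100, got {total_weight}")
    weighted_sum = sum(weight * score for weight, score in rubric.values())
    return weighted_sum / total_weight


if __name__ == "__main__":
    # (30*4 + 30*5 + 20*5 + 10*4 + 10*5) / 100 = 460 / 100
    print(f"{composite_score(RUBRIC):.2f} / 5.00")  # prints "4.60 / 5.00"
```

Using integer percentages and dividing once at the end avoids the small rounding drift that summing terms like 0.3 * 4 can introduce.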

Evidence

HELM is described as 'an open source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducible and transparent evaluation of foundation models.' It is licensed Apache-2.0 and ships helm-run, helm-summarize and helm-server commands that let anyone re-run the evaluations. Source: stanford-crfm/helm GitHub repository
HELM runs models itself under standardized conditions rather than relying on self-reported numbers: it uses identical prompt templates and uniform scoring across all models, displays raw prompts on the leaderboard, and states that results are 'fully reproducible using the HELM framework.' It also discloses that 'HELM Capabilities is funded by the HAI Industrial Affiliates Program' and thanks Together AI, OpenAI, Google, Anthropic, Amazon and Writer for providing model APIs. Source: Stanford CRFM, HELM Capabilities announcement
The original HELM paper (Liang, Bommasani, Lee et al.), submitted 16 November 2022, established the benchmark. It evaluated 30 models across 42 scenarios and 7 metrics under standardized conditions and publicly released all raw model prompts and completions. Source: Holistic Evaluation of Language Models, arXiv:2211.09110

Compare with others

Others reviewing AI models

A+ EvalPlus Leaderboard
A SWE-bench
A- Artificial Analysis
A- The Verge
A- Vals AI
A- Digital Trends
B+ Engadget
B TechRadar
C+ LMArena
C+ Vellum LLM Leaderboard
C- There's An AI For That (TAAFT)
C- Product Hunt
D+ Futurepedia
D Toolify AI
A+ Hugging Face Open LLM Leaderboard
B- Papers with Code
