Plumb
A-

AI model benchmarking

Vals AI

Independent (Vals AI, Inc.; VC-backed, no corporate parent)

Benchmark · Free to read

A rare independent benchmark of AI on real legal and tax work, with the catch that vendors opt in and some pay.

What it's really for
An independent AI benchmark; testing models on real expert tasks is the whole job.

What our grade covers
The grade on this page covers Vals AI's hands-on leaderboards for legal, tax, and finance tasks, not everything the site does.

High Scoring Confidence
Checked against primary sources. We are confident in the facts and the grade here.

Follow the money

Vals AI is paid by some of the AI vendors it benchmarks. Its Vals Legal AI Report discloses that "Vals AI has a customer relationship with one or more of the participants." Those participants (Harvey, Thomson Reuters, vLex, Vecflow) joined voluntarily and chose which skills to be evaluated on, so some of the parties it ranks also fund it and shape what gets measured.

Operating since
2023 (3 years)
What it costs you
Free to read. The leaderboards cost nothing to view.
How they make money
It earns revenue from enterprise/custom benchmarking engagements, licensing of its private validation datasets, and access to its evaluation platform/infrastructure, including from some of the AI vendors it evaluates.
What they do
It publishes independent leaderboards that score AI models and agentic products on realistic legal, tax, and finance tasks, testing them hands-on against expert-built, private held-out test sets.
What to watch for
In its industry reports the evaluated vendors participate voluntarily and pick which skills they're scored on (and some are also paying Vals customers), so a leaderboard may omit tasks where a vendor would look weak rather than show the full picture.
Composite score
4.00 / 5.00 → grade A-

How the grade was reached

Independence · 30% weight · 3 / 5

Does the site take money from the very entities it ranks? Pay-for-placement, vendor-funded data, and affiliate commissions all pull this down. The less the ranking can be bought, the higher the score.

Evidence basis · 30% weight · 5 / 5

What is the ranking actually built on? Hands-on testing scores highest, then verified first-hand reviews, then opinion or popularity surveys and self-reported figures, then pay-to-rank, which scores lowest.

Method transparency · 20% weight · 4 / 5

Is the methodology published, specific, and reproducible? Can a reader see how a given rank was reached, or is it a black box?

Conflict disclosure · 10% weight · 4 / 5

Are commercial relationships, sponsorships, and affiliate arrangements disclosed clearly and near the rankings themselves, rather than buried?

Manipulation resistance · 10% weight · 4 / 5

How hard is it to game? Controls against fake reviews, solicited reviews, and vendor gaming raise this; an open box anyone can stuff lowers it.
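
The composite score can be checked from the five criteria above. The sketch below assumes the composite is a plain weighted average of the listed scores, which the page does not state outright but which reproduces the published 4.00. The names CRITERIA and composite_score are illustrative, and the mapping from score to letter grade is not given on the page, so it is left out.

```python
# Weights (in percent) and scores (out of 5) as listed in the breakdown above.
# Assumption: the composite is a simple weighted average of these scores.
CRITERIA = {
    "Independence": (30, 3),
    "Evidence basis": (30, 5),
    "Method transparency": (20, 4),
    "Conflict disclosure": (10, 4),
    "Manipulation resistance": (10, 4),
}


def composite_score(criteria):
    """Return the weighted average of the scores, on the same 0-5 scale."""
    total_weight = sum(weight for weight, _ in criteria.values())
    if total_weight != 100:
        raise ValueError(f"weights sum to {total_weight}%, expected 100%")
    # Integer percent weights keep the arithmetic exact until the final division.
    weighted_sum = sum(weight * score for weight, score in criteria.values())
    return weighted_sum / total_weight


if __name__ == "__main__":
    print(f"{composite_score(CRITERIA):.2f} / 5.00")
```

Under this assumption, Independence contributes 0.90 of the 4.00 total, so each extra point on that criterion would move the composite by 0.30.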
