Plumb

Find reviews of an AI coding tool

Who reviews an AI coding tool, and can you trust them?

Plumb does not review AI coding tools itself. We tell you which sites do, and grade each on the one thing that decides whether you should believe it: how independent and evidence-based its rankings are.

Grade | Review site | Independence (30%) | Evidence basis (30%) | Method transparency (20%) | Conflict disclosure (10%) | Manipulation resistance (10%)
1 A+
Hugging Face Open LLM Leaderboard | Scoring confidence: High | Type: Benchmark
Grades: its standardized six-benchmark ranking of open LLMs
A free, reproducible benchmark scoreboard that nobody could pay to climb — though Hugging Face itself retired it in 2025, saying its tests had grown gameable and obsolete.
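Scores: Independence 5, Evidence basis 5, Method transparency 5, Conflict disclosure 4, Manipulation resistance 3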
2 A+
EvalPlus Leaderboard | Scoring confidence: High | Type: Benchmark
Grades: its augmented HumanEval+/MBPP+ code-model scores
An academic, open-source coding benchmark that auto-grades LLMs on hand-verified tests with a published, reproducible method and no money changing hands; the usual public-benchmark caveat is contamination, not commerce.
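Scores: Independence 5, Evidence basis 5, Method transparency 5, Conflict disclosure 4, Manipulation resistance 3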
3 A+
Stanford HELM | Scoring confidence: High | Type: Benchmark
Grades: its standardized multi-scenario LLM evaluations
A Stanford academic benchmark that runs its own standardized tests on AI models and publishes the raw prompts and code — about as close to an unbuyable, reproducible leaderboard as the field offers.
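Scores: Independence 4, Evidence basis 5, Method transparency 5, Conflict disclosure 4, Manipulation resistance 5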
4 A
SWE-bench | Scoring confidence: High | Type: Benchmark
Grades: its score for resolving real GitHub issues
An open, reproducible academic benchmark that can't be bought — but critics, and even OpenAI, say training-data contamination has eroded what its top scores actually prove.
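Scores: Independence 5, Evidence basis 4, Method transparency 5, Conflict disclosure 4, Manipulation resistance 2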
5 A-
Artificial Analysis | Scoring confidence: High | Type: Benchmark
Grades: its LLM intelligence, speed, and price benchmarks
Standardized, unbought AI benchmarks; a useful filter, not a final verdict.
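Scores: Independence 4, Evidence basis 5, Method transparency 4, Conflict disclosure 2, Manipulation resistance 4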
6 A-
Vals AI | Scoring confidence: High | Type: Benchmark
Grades: its hands-on AI leaderboards on legal, tax, and finance tasks
A rare independent benchmark of AI on real legal and tax work, with the catch that vendors opt in and some pay.
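Scores: Independence 3, Evidence basis 5, Method transparency 4, Conflict disclosure 4, Manipulation resistance 4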
7 B-
Papers with Code | Scoring confidence: High | Type: Benchmark
Grades: its state-of-the-art ML leaderboards by task
A free, ad-free, open-data leaderboard for AI research that nobody could pay to top, but its benchmark scores are self-reported from papers rather than independently re-run, and Meta sunset the site in July 2025.
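Scores: Independence 2, Evidence basis 3, Method transparency 4, Conflict disclosure 4, Manipulation resistance 2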
8 C+
LMArena | Scoring confidence: High | Type: Benchmark
Grades: its crowd-voted Elo leaderboard of AI models
Crowd vibes on which answer sounds better, with the biggest labs structurally advantaged.
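Scores: Independence 2, Evidence basis 3, Method transparency 4, Conflict disclosure 2, Manipulation resistance 2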
9 C+
Vellum LLM Leaderboard | Scoring confidence: Medium | Type: Benchmark
Grades: its aggregated public-benchmark LLM rankings
A clean, free benchmark scoreboard for frontier LLMs that doubles as lead-gen for Vellum's dev platform; useful at a glance, but it mixes provider-reported scores with its own evals and discloses no per-result sourcing.
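Scores: Independence 3, Evidence basis 3, Method transparency 3, Conflict disclosure 1, Manipulation resistance 2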
10 C-
There's An AI For That (TAAFT) | Scoring confidence: High | Type: Directory / lead-gen
Grades: its searchable AI-tool leaderboard
A massive, popular map of AI tools ranked by community saves and votes, but the prominent "Featured" slots are an openly paid bid-for-position auction, so treat top placement as advertising, not a verdict.
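Scores: Independence 1, Evidence basis 2, Method transparency 3, Conflict disclosure 3, Manipulation resistance 2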
11 C-
Product Hunt | Scoring confidence: High | Type: Crowd reviews
Grades: its daily upvote leaderboard of new products
A launch-day upvote contest, gamed by solicited votes, that says nothing about quality.
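Scores: Independence 2, Evidence basis 2, Method transparency 1, Conflict disclosure 4, Manipulation resistance 1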
12 D+
Futurepedia | Scoring confidence: High | Type: Directory / lead-gen
Grades: its searchable directory of AI tools
A big, browsable AI-tool directory, but by its own disclosure it runs on affiliate links and vendor-paid "Verified" listings, so it's a discovery catalog, not a hands-on testing lab.
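Scores: Independence 1, Evidence basis 2, Method transparency 2, Conflict disclosure 3, Manipulation resistance 2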
13 D
Toolify AI | Scoring confidence: Medium | Type: Directory / lead-gen
Grades: its AI-tool category and revenue leaderboards
A massive, useful AI-tool index — but by its own model it ranks by popularity and paid signals, not hands-on testing, so treat it as a starting point, not a verdict.
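Scores: Independence 1, Evidence basis 1, Method transparency 2, Conflict disclosure 1, Manipulation resistance 1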

These are the sites that review AI coding tools (and other AI models). Columns are the five rubric dimensions, each scored 0-5, with each column's weight shown in its header (independence and evidence basis carry the most, at 30% each). See the full methodology.
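As a rough illustration of how the five weighted columns might combine, here is a minimal sketch that assumes the overall score is a plain weighted average of the 0-5 column scores. That formula is an assumption, not something this page states, and the names `WEIGHTS_PCT`, `parse_scores`, and `weighted_score` are hypothetical. The cutoffs that turn a score into a letter grade are not given here, so the sketch stops at the numeric score.

```python
# Column weights in percent, taken from the table header.
# Assumption: the overall score is a weighted average of the five columns;
# the page does not state the formula or the letter-grade cutoffs.
WEIGHTS_PCT = {
    "independence": 30,
    "evidence_basis": 30,
    "method_transparency": 20,
    "conflict_disclosure": 10,
    "manipulation_resistance": 10,
}


def parse_scores(digits: str) -> dict[str, int]:
    """Split a five-digit score string such as '55543' into named 0-5 scores."""
    if len(digits) != len(WEIGHTS_PCT) or not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"expected {len(WEIGHTS_PCT)} digits, got {digits!r}")
    scores = {name: int(d) for name, d in zip(WEIGHTS_PCT, digits)}
    if any(value > 5 for value in scores.values()):
        raise ValueError(f"scores must be between 0 and 5, got {digits!r}")
    return scores


def weighted_score(digits: str) -> float:
    """Return the weighted average on the same 0-5 scale as the columns."""
    scores = parse_scores(digits)
    total = sum(WEIGHTS_PCT[name] * value for name, value in scores.items())
    return total / 100


if __name__ == "__main__":
    print(weighted_score("55543"))  # top row of the table: 4.7
    print(weighted_score("11211"))  # bottom row of the table: 1.2
```

Under this assumption the computed scores fall in the same order as the published grades, from 4.7 for the top row down to 1.2 for the bottom row, which fits a weighted average but does not confirm it.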
