Find reviews of an AI coding tool

Who reviews an AI coding tool, and can you trust them?

Plumb does not review AI coding tools itself. We tell you which sites do, and grade each on the one thing that decides whether you should believe it: how independent and evidence-based its rankings are.

#	Grade	Review site	Independence (30%)	Evidence basis (30%)	Method transparency (20%)	Conflict disclosure (10%)	Manipulation resistance (10%)
1	A+	EvalPlus Leaderboard [Benchmark; high scoring confidence]. Grades: its augmented HumanEval+/MBPP+ code-model scores. An academic, open-source coding benchmark that auto-grades LLMs on hand-verified tests with a published, reproducible method and no money changing hands; the usual public-benchmark caveat is contamination, not commerce.	5	5	5	4	3
2	A+	Stanford HELM [Benchmark; high scoring confidence]. Grades: its standardized multi-scenario LLM evaluations. A Stanford academic benchmark that runs its own standardized tests on AI models and publishes the raw prompts and code, about as close to an unbuyable, reproducible leaderboard as the field offers.	4	5	5	4	5
3	A	SWE-bench [Benchmark; high scoring confidence]. Grades: its score for resolving real GitHub issues. An open, reproducible academic benchmark that can't be bought, but critics, and even OpenAI, say training-data contamination has eroded what its top scores actually prove.	5	4	5	4	2
4	A-	Artificial Analysis [Benchmark; high scoring confidence]. Grades: its LLM intelligence, speed, and price benchmarks. Standardized, unbought AI benchmarks; a useful filter, not a final verdict.	4	5	4	2	4
5	A-	The Verge [Editorial reviews; partly paywalled; high scoring confidence]. Grades: its hands-on tech reviews and scores. Hands-on tech journalism with real testing and disclosed affiliate links; by its own policy, commissions and ads don't dictate scores, but review depth varies by category.	4	5	3	4	4
6	A-	Vals AI [Benchmark; high scoring confidence]. Grades: its hands-on AI leaderboards on legal, tax, and finance tasks. A rare independent benchmark of AI on real legal and tax work, with the catch that vendors opt in and some pay.	3	5	4	4	4
7	A-	Digital Trends [Editorial reviews; high scoring confidence]. Grades: its 5-star tech reviews and buying guides. A long-running hands-on tech reviewer with a published testing method, but its "best of" picks ride on the same affiliate links it earns commissions from, which it discloses.	3	5	4	4	4
8	B+	Engadget [Editorial reviews; high scoring confidence]. Grades: its 1-100 editor reviews of gadgets. A veteran tech publication that hands-on tests gadgets with a published ethics statement, while earning affiliate commissions on the products it recommends.	2	5	4	4	4
9	B	TechRadar [Editorial reviews; high scoring confidence]. Grades: its star-rated reviews and 'best of' buying guides for tech. Helpful buying guides built around the buy links that actually pay the bills.	2	4	4	4	4
10	C+	LMArena [Benchmark; high scoring confidence]. Grades: its crowd-voted Elo leaderboard of AI models. Crowd vibes on which answer sounds better, with the biggest labs structurally advantaged.	2	3	4	2	2
11	C+	Vellum LLM Leaderboard [Benchmark; medium scoring confidence]. Grades: its aggregated public-benchmark LLM rankings. A clean, free benchmark scoreboard for frontier LLMs that doubles as lead-gen for Vellum's dev platform; useful at a glance, but it mixes provider-reported scores with its own evals and discloses no per-result sourcing.	3	3	3	1	2
12	C-	There's An AI For That (TAAFT) [Directory / lead-gen; high scoring confidence]. Grades: its searchable AI-tool leaderboard. A massive, popular map of AI tools ranked by community saves and votes, but the prominent "Featured" slots are an openly paid bid-for-position auction, so treat top placement as advertising, not a verdict.	1	2	3	3	2
13	C-	Product Hunt [Crowd reviews; high scoring confidence]. Grades: its daily upvote leaderboard of new products. A launch-day upvote contest, gamed by solicited votes, that says nothing about quality.	2	2	1	4	1
14	D+	Futurepedia [Directory / lead-gen; high scoring confidence]. Grades: its searchable directory of AI tools. A big, browsable AI-tool directory, but by its own disclosure it runs on affiliate links and vendor-paid "Verified" listings, so it's a discovery catalog, not a hands-on testing lab.	1	2	2	3	2
15	D	Toolify AI [Directory / lead-gen; medium scoring confidence]. Grades: its AI-tool category and revenue leaderboards. A massive, useful AI-tool index, but by its own model it ranks by popularity and paid signals, not hands-on testing, so treat it as a starting point, not a verdict.	1	1	2	1	1
16	A+	Hugging Face Open LLM Leaderboard [Benchmark; closed / dormant; high scoring confidence]. Grades: its standardized six-benchmark ranking of open LLMs. A free, reproducible benchmark scoreboard that nobody could pay to climb, though Hugging Face itself retired it in 2025, saying its tests had grown gameable and obsolete.	5	5	5	4	3
17	B-	Papers with Code [Benchmark; closed / dormant; high scoring confidence]. Grades: its state-of-the-art ML leaderboards by task. A free, ad-free, open-data leaderboard for AI research that nobody could pay to top, but its benchmark scores are self-reported from papers rather than independently re-run, and Meta sunset the site in July 2025.	2	3	4	4	2

These are the sites that review AI coding tools (and other AI models). Columns are the five rubric dimensions, each scored 0-5, with each column's weight shown in its header (independence and evidence basis carry the most weight). See the full methodology.
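As a rough illustration of how the column weights could combine into a single number, here is a minimal Python sketch. It assumes the overall score is a plain weighted average of the five 0-5 dimension scores; this page does not state the exact formula, and the mapping from a score to a letter grade is not shown here, so the sketch stops at the number. The names `WEIGHTS` and `weighted_score` are hypothetical, introduced only for this example.

```python
# Column weights from the table header, kept as integer percentages
# so the arithmetic stays exact until the final division.
WEIGHTS = {
    "independence": 30,
    "evidence_basis": 30,
    "method_transparency": 20,
    "conflict_disclosure": 10,
    "manipulation_resistance": 10,
}


def weighted_score(scores):
    """Return a weighted average on the 0-5 scale.

    Assumption: a simple weighted average. The site's actual
    methodology may combine the dimensions differently.
    """
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the five rubric dimensions")
    for name, value in scores.items():
        if not 0 <= value <= 5:
            raise ValueError(f"{name} must be between 0 and 5, got {value}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Dimension scores copied from the first two rows of the table.
evalplus = {
    "independence": 5,
    "evidence_basis": 5,
    "method_transparency": 5,
    "conflict_disclosure": 4,
    "manipulation_resistance": 3,
}
helm = {
    "independence": 4,
    "evidence_basis": 5,
    "method_transparency": 5,
    "conflict_disclosure": 4,
    "manipulation_resistance": 5,
}

print(weighted_score(evalplus))  # 4.7
print(weighted_score(helm))      # 4.6
```

Under this assumed formula, EvalPlus Leaderboard comes out at 4.7 and Stanford HELM at 4.6, which matches their order in the table.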
