Find reviews of an AI coding tool
Who reviews an AI coding tool, and can you trust them?
Plumb does not review AI coding tools itself. We tell you which sites do, and grade each on the one thing that decides whether you can believe it: how independent and evidence-based its ranking is.
| # | Grade | Review site |
|---|---|---|
| 1 | A+ | Grades: its standardized six-benchmark ranking of open LLMs. A free, reproducible benchmark scoreboard that nobody could pay to climb, though Hugging Face itself retired it in 2025, saying its tests had grown gameable and obsolete. |
| 2 | A+ | Grades: its augmented HumanEval+/MBPP+ code-model scores. An academic, open-source coding benchmark that auto-grades LLMs on hand-verified tests with a published, reproducible method and no money changing hands; the usual public-benchmark caveat is contamination, not commerce. |
| 3 | A+ | Grades: its standardized multi-scenario LLM evaluations. A Stanford academic benchmark that runs its own standardized tests on AI models and publishes the raw prompts and code, about as close to an unbuyable, reproducible leaderboard as the field offers. |
| 4 | A | Grades: its score for resolving real GitHub issues. An open, reproducible academic benchmark that can't be bought, but critics, and even OpenAI, say training-data contamination has eroded what its top scores actually prove. |
| 5 | A- | Grades: its LLM intelligence, speed, and price benchmarks. Standardized, unbought AI benchmarks; a useful filter, not a final verdict. |
| 6 | A- | Grades: its hands-on AI leaderboards on legal, tax, and finance tasks. A rare independent benchmark of AI on real legal and tax work, with the catch that vendors opt in and some pay. |
| 7 | B- | Grades: its state-of-the-art ML leaderboards by task. A free, ad-free, open-data leaderboard for AI research that nobody could pay to top, but its benchmark scores are self-reported from papers rather than independently re-run, and Meta sunset the site in July 2025. |
| 8 | C+ | Grades: its crowd-voted Elo leaderboard of AI models. Crowd vibes on which answer sounds better, with the biggest labs structurally advantaged. |
| 9 | C+ | Grades: its aggregated public-benchmark LLM rankings. A clean, free benchmark scoreboard for frontier LLMs that doubles as lead-gen for Vellum's dev platform; useful at a glance, but it mixes provider-reported scores with its own evals and discloses no per-result sourcing. |
| 10 | C- | Grades: its searchable AI-tool leaderboard. A massive, popular map of AI tools ranked by community saves and votes, but the prominent "Featured" slots are an openly paid bid-for-position auction, so treat top placement as advertising, not a verdict. |
| 11 | C- | Grades: its daily upvote leaderboard of new products. A launch-day upvote contest, gamed by solicited votes, that says nothing about quality. |
| 12 | D+ | Grades: its searchable directory of AI tools. A big, browsable AI-tool directory, but by its own disclosure it runs on affiliate links and vendor-paid "Verified" listings, so it's a discovery catalog, not a hands-on testing lab. |
| 13 | D | Grades: its AI-tool category and revenue leaderboards. A massive, useful AI-tool index, but by its own model it ranks by popularity and paid signals, not hands-on testing, so treat it as a starting point, not a verdict. |
These are the sites that review AI coding tools (and AI models more broadly). Columns are the five rubric dimensions, each scored 0-5, with each column's weight shown in its header (independence and evidence carry the most). See the full methodology.
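
As a rough illustration of how five weighted 0-5 dimension scores could be combined into one number, here is a minimal Python sketch. Only "independence" and "evidence" are named in the text above; the other three dimension names, every weight, and the weighted-average formula itself are assumptions made for this example, not Plumb's published method.

```python
# Hypothetical rubric weights. Only "independence" and "evidence" appear in the
# text; the remaining names and all numeric weights are placeholders.
WEIGHTS = {
    "independence": 0.30,
    "evidence": 0.30,
    "dimension_3": 0.15,
    "dimension_4": 0.15,
    "dimension_5": 0.10,
}


def weighted_score(scores: dict[str, float]) -> float:
    """Return the weighted average of 0-5 dimension scores (also on a 0-5 scale)."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the rubric dimensions")
    for name, value in scores.items():
        if not 0 <= value <= 5:
            raise ValueError(f"{name} must be between 0 and 5, got {value}")
    total_weight = sum(WEIGHTS.values())
    # Dividing by the total keeps the result on a 0-5 scale even if the
    # weights do not sum to exactly 1.
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / total_weight


# Made-up scores for an imaginary review site.
example = {
    "independence": 5,
    "evidence": 4,
    "dimension_3": 3,
    "dimension_4": 3,
    "dimension_5": 2,
}
print(round(weighted_score(example), 2))
```

Because independence and evidence carry the largest weights in this sketch, a site that scores well on both lands high overall even with middling scores elsewhere, which matches the emphasis described above. How a combined score would map to a letter grade is not stated here, so the sketch stops at the number.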