How we built the Crucible Score: seven axes, one number
A deep look at the methodology behind the Crucible Score — seven axes, category-specific weights, and why composite scoring beats single-metric ratings.
Published 2026-05-29
The problem with single-number reviews
Most AI tool reviews boil everything down to a single star rating or a score out of 10. That’s convenient for readers, but it hides the dimensions that actually matter in production.
A tool can score 9/10 on output quality and 2/10 on reliability. Average those into a single 5.5 and the number no longer tells you that the tool regularly, and silently, corrupts your workflow.
The Crucible Score was built to fix that.
The seven axes
Every tool is scored 1–10 on each axis, then combined into a weighted composite out of 100. The weights are category-specific and published, so you can see exactly why a tool scored what it did.
The seven axes:
- Performance — Success on a fixed battery of hard, category-specific tasks. The core stress test: does it do the job well, repeatedly?
- Reliability — Consistency across repeated runs, error rate, production latency, and workflow resilience under pressure.
- Price / Value — Real cost-per-result at scale, not the sticker price. Hidden costs and free-tier honesty.
- Setup Friction — Time-to-first-value, onboarding quality, and docs quality. High friction is the signal that a tool needs a build partner.
- Integrations — API quality, Zapier / Make / n8n fit. Does it slot into a real production stack?
- Support Maturity — Support responsiveness, documentation, and funding/longevity risk.
- Privacy / Compliance — Data handling, whether it trains on your data, and SOC 2 / HIPAA / GDPR posture.
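To make the arithmetic concrete, here is a minimal sketch of how seven 1–10 axis scores could roll up into a composite out of 100. It assumes a plain weighted mean scaled by 10. The axis keys, the `composite_score` function, and the scaling choice are illustrative assumptions, not the exact published formula.

```python
# Hypothetical axis keys, one per axis in the list above.
AXES = (
    "performance",
    "reliability",
    "price_value",
    "setup_friction",
    "integrations",
    "support_maturity",
    "privacy_compliance",
)


def composite_score(scores, weights):
    """Combine seven 1-10 axis scores into a weighted composite out of 100.

    Assumption: a weighted mean on the 1-10 scale, multiplied by 10.
    Under this assumption the lowest possible composite is 10, not 0.
    """
    if set(scores) != set(AXES) or set(weights) != set(AXES):
        raise ValueError("scores and weights must cover exactly the seven axes")
    for axis, value in scores.items():
        if not 1 <= value <= 10:
            raise ValueError(f"{axis} score {value} is outside the 1-10 range")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Normalize by the total weight so weights can be given as shares or points.
    weighted_mean = sum(scores[a] * weights[a] for a in AXES) / total_weight
    return round(weighted_mean * 10, 1)


# Example: strong performance, weak reliability, everything else a 7.
example_scores = {axis: 7 for axis in AXES}
example_scores["performance"] = 9
example_scores["reliability"] = 2
equal_weights = {axis: 1 for axis in AXES}
print(composite_score(example_scores, equal_weights))  # 65.7
```

With equal weights the weak reliability score is diluted by the other six axes. The per-axis breakdown exists so that the 2 stays visible next to the composite.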
Why category-specific weights matter
A cold-email tool and a voice AI tool shouldn’t be judged by the same standard.
For cold-email tools, deliverability and price matter most. We weight reliability and price/value highest because those are the axes that determine whether the tool pays for itself or burns your sender reputation.
For image generation tools, performance (output quality on hard, standardized prompts) dominates the weighting because creative consistency is the product.
For automation tools, integrations and reliability are king. A Zapier or n8n alternative that can’t connect to your existing stack is a toy, not a tool.
The weights are published per category so you can see the tradeoffs.
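As a rough illustration of why the weights matter, the sketch below scores the same tool under two category profiles. The weight values here are invented for the example and are not the published weights. The only thing taken from the text above is the direction: cold email leans on reliability and price/value, automation leans on integrations and reliability.

```python
# Hypothetical weights for illustration only; the published per-category
# weights are the authority. Each profile is in points that sum to 100.
CATEGORY_WEIGHTS = {
    "cold_email": {
        "performance": 15, "reliability": 25, "price_value": 25,
        "setup_friction": 10, "integrations": 10,
        "support_maturity": 5, "privacy_compliance": 10,
    },
    "automation": {
        "performance": 15, "reliability": 25, "price_value": 10,
        "setup_friction": 10, "integrations": 25,
        "support_maturity": 5, "privacy_compliance": 10,
    },
}


def score_for_category(scores, category):
    """Weighted composite out of 100 for one category profile.

    Scores are 1-10 and weights sum to 100, so the weighted sum tops out
    at 1000; dividing by 10 puts it on a 100-point scale.
    """
    weights = CATEGORY_WEIGHTS[category]
    return sum(scores[axis] * weight for axis, weight in weights.items()) / 10


# One made-up tool: cheap and dependable, but with weak integrations.
tool_scores = {
    "performance": 8, "reliability": 7, "price_value": 9,
    "setup_friction": 6, "integrations": 3,
    "support_maturity": 6, "privacy_compliance": 7,
}

print(score_for_category(tool_scores, "cold_email"))  # 71.0
print(score_for_category(tool_scores, "automation"))  # 62.0
```

The axis scores are identical in both runs. Only the weights change, and the weak integrations score costs the tool nine points once it is judged as an automation tool.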
The score is not the verdict
The Crucible Score is a composite of the seven axes, but it’s not the only thing that matters.
A tool that scores 85/100 but struggles on the one axis that’s critical to your workflow may still be a poor fit. The score helps you shortlist. The testing notes and the per-axis breakdown tell you whether it fits.
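One way to apply this when shortlisting is to pair the composite with a floor on the axes you cannot compromise on. The sketch below is a reader-side helper, not part of the Crucible Score itself. The `shortlist_check` function and its thresholds (a composite of 80, an axis floor of 6) are assumptions chosen for the example.

```python
def shortlist_check(composite, axis_scores, critical_axes,
                    min_composite=80.0, axis_floor=6):
    """Return (fits, weak_axes) for one tool.

    A tool fits only if its composite clears min_composite AND every axis
    the reader marked as critical is at or above axis_floor. Both
    thresholds are illustrative defaults, not published cutoffs.
    """
    weak_axes = sorted(
        axis for axis in critical_axes if axis_scores[axis] < axis_floor
    )
    fits = composite >= min_composite and not weak_axes
    return fits, weak_axes


# A made-up tool with a high composite but a poor privacy score.
# The composite is passed in as given rather than recomputed here.
axis_scores = {
    "performance": 9, "reliability": 9, "price_value": 9,
    "setup_friction": 9, "integrations": 9,
    "support_maturity": 9, "privacy_compliance": 3,
}

# A reader handling regulated data marks privacy as critical.
print(shortlist_check(85.0, axis_scores, {"privacy_compliance"}))
# (False, ['privacy_compliance'])

# A reader who only cares about integrations gets a different answer.
print(shortlist_check(85.0, axis_scores, {"integrations"}))
# (True, [])
```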
Read the score first. Read the testing notes second.
Last reviewed 2026-05-29. See our methodology and affiliate policy.