Why We Treat AI-Generated Code as Legacy Code — And the Review Checklist That Catches 90% of Bugs
Tool Crucible evaluation: real-world testing, tradeoffs, and current stack.
Published 2026-06-07
TL;DR: Debugging AI-generated code takes 2-3x longer than debugging human-written code, and comprehension time jumped from 20% to 60% of the workflow. Our mandatory 4-step review checklist cuts rework by 70%.
The Context
4-person team, heavy AI-assisted development (Cursor composer, Copilot, local models). Shipped 3 features in Sprint 12 that passed CI but failed in staging: hallucinated API imports, missing error handling, synthetic test mocks. Root cause: devs were reviewing AI output as if a peer had written it, but AI doesn’t “think” about your codebase; it predicts plausible code. That mental-model mismatch cost us 2 sprint days.
What We Tested
| Approach | Use Case | Verdict | Why |
|---|---|---|---|
| Standard PR review | All AI-generated code | ❌ | Reviewers check logic, not provenance; miss hallucinated imports, wrong API versions |
| “Trust but verify” (run tests) | Feature completion | ❌ | Tests often generated by same AI; shared blind spots (e.g., both miss auth edge cases) |
| Mandatory comprehension checklist | Every AI-assisted PR | ✅ | 4 explicit steps force mental model reconstruction; catches 90% of AI-specific bugs |
| AI-assisted review (Claude reviews Cursor) | Second pass | ⚠️ | Catches different errors but same class of hallucination; not a substitute for human comprehension |
The Pivot Point
A PR added Redis caching to 5 services. Cursor wrote the middleware, config, and tests. All passed. Staging revealed that the middleware used redis-py v4 patterns (`redis.asyncio`, which only ships from redis-py 4.2 onward) while requirements.txt pinned v3. The AI had produced an import path that does not exist in our pinned version. The tests mocked the v4 API, so they passed. Production would have crashed. The dev who wrote it didn’t know which redis-py version was installed. That’s the trap: AI writes code for a context it doesn’t have.
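A mismatch like this one can be caught mechanically before staging. The sketch below is one possible approach, not the tooling we actually ran: it reads the imports from a source file with `ast`, reads `==` pins from requirements text, and flags modules that need a newer release than the pin. The `MIN_VERSION_FOR_MODULE` table and all function names are hypothetical; the only entry in the table is the `redis.asyncio` case from the incident, and you would have to maintain further entries by hand.

```python
import ast
import re

# Hypothetical lookup table: module path -> (package, first version that ships it).
# Only the redis.asyncio entry reflects the incident described above.
MIN_VERSION_FOR_MODULE = {"redis.asyncio": ("redis", (4, 2))}


def parse_pins(requirements_text):
    """Return {package: version_tuple} for lines pinned with '=='."""
    pins = {}
    for line in requirements_text.splitlines():
        match = re.match(r"^\s*([A-Za-z0-9_.\-]+)\s*==\s*([0-9]+(?:\.[0-9]+)*)", line)
        if match:
            name, version = match.groups()
            pins[name.lower()] = tuple(int(part) for part in version.split("."))
    return pins


def imported_modules(source):
    """Collect dotted module paths from the import statements in source."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module)
            # "from redis import asyncio" also refers to redis.asyncio
            modules.update(f"{node.module}.{alias.name}" for alias in node.names)
    return modules


def version_mismatches(source, requirements_text):
    """List (module, package, pinned, minimum) where the pin is too old."""
    pins = parse_pins(requirements_text)
    problems = []
    for module in sorted(imported_modules(source)):
        if module in MIN_VERSION_FOR_MODULE:
            package, minimum = MIN_VERSION_FOR_MODULE[module]
            pinned = pins.get(package)
            if pinned is not None and pinned < minimum:
                problems.append((module, package, pinned, minimum))
    return problems
```

Calling `version_mismatches(middleware_source, requirements_text)` returns an empty list when every known module is covered by its pin. The table is the weak point: it only knows about modules someone has already been burned by, so it supplements reading the lockfile rather than replacing it.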
What We Use Now
4-step AI Code Comprehension Checklist (enforced via PR template, required checkboxes):
- Run tests you didn’t generate: Write at least one integration test manually, covering the happy path plus one error path. If you can’t, you don’t understand the code.
- Trace 3 critical paths manually: Open the diff. Follow request → handler → DB → response. Verify each hop exists, types match, and errors propagate. No skipping.
- Check for hallucinated imports/APIs: For every external import, run `grep -r "import X" --include="*.py" .` and verify the version in the lockfile. For every API call, check the official docs for the version you have pinned, not whatever was current at the model’s training cutoff.
- Verify error handling isn’t synthetic: AI loves `try: ... except Exception: pass`. Search for bare `except:`, `pass` in except blocks, and `logger.error` without a re-raise. Replace with typed exceptions.
Tooling: `.github/pull_request_template.md` with the checklist (GitHub auto-loads a PR template only under that filename). A CI step fails if any checkbox is left unchecked (GitHub Actions + `gh pr checks`).
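The core of such a CI gate is small. Here is a hedged sketch of the checkbox logic only, assuming the workflow has already obtained the PR description as plain text; how the body is fetched, and the names `unchecked_items`, `checklist_passes`, and `expected_items`, are illustrative assumptions rather than our exact setup.

```python
import re

# Matches markdown task-list items such as "- [ ] text" or "* [x] text".
CHECKBOX = re.compile(r"^\s*[-*]\s+\[( |x|X)\]\s+(.*)$")


def unchecked_items(pr_body):
    """Return the text of every task-list item that is still unchecked."""
    missing = []
    for line in pr_body.splitlines():
        match = CHECKBOX.match(line)
        if match and match.group(1) == " ":
            missing.append(match.group(2).strip())
    return missing


def checklist_passes(pr_body, expected_items=4):
    """True only if at least expected_items boxes exist and all are checked."""
    total = sum(1 for line in pr_body.splitlines() if CHECKBOX.match(line))
    # Requiring a minimum count stops a PR from passing by deleting the checklist.
    return total >= expected_items and not unchecked_items(pr_body)
```

A CI job could call `checklist_passes(body)` and exit non-zero on `False`. The `expected_items` floor is a design choice: without it, a description with no checklist at all would pass trivially.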
When You’d Choose Differently
- Throwaway prototypes / spikes: Skip the checklist and tag the PR `[spike]`; the code gets the comprehension pass only when it is promoted to production.
- Solo dev, full context ownership: If you wrote the surrounding system, comprehension is faster, but still run steps 1 & 3.
- Greenfield with no legacy: Fewer version mismatches; hallucination risk shifts to architecture (wrong patterns) not imports.
Tool Crucible Rating
| Overall | Ease | Value | Support |
|---|---|---|---|
| 4.3/5 | 3.0/5 | 5/5 | 3.5/5 |
This is part of our AI development workflow series. See full comparison: AI Code Comprehension Workflow 2026
Last reviewed 2026-06-07. See our methodology and affiliate policy.