Agent Development in the AI Era: Why Writing Tests First with TDD Is Critical

2026-06-24

TDD in the AI Era: Why Agents Need "Test First" More Than Humans Do

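The previous article broke down the execution and review mechanics of Subagent-Driven Development. But review can only catch problems after the fact; it cannot prevent them. Superpowers' strategy is to put a line of defense at the source: the Agent is required to write the test first and only then implement the feature.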

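Strictly speaking, test-driven development (TDD) is not a new idea; it has been around for more than 20 years. But Superpowers positions TDD very differently from the traditional view: it does not argue that "TDD is a good practice"; it argues that "for an Agent, TDD is the only path that can objectively demonstrate code correctness." That starting point changes what TDD means.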

Why Agents Need TDD More Than Humans Do

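A human developer who skips TDD still has several fallbacks: manual verification, an IDE debugger, running the program and reading its output, and experience-based judgment about edge cases. An Agent has far fewer of these. It cannot "run the program and confirm the result" unless an automated execution environment has been configured, so it has only two ways to "verify" its work:

- Reasoning about whether the code logic is correct. This is highly unreliable, because a model has blind spots about code it produced itself.
- Running automated tests.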

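Without tests, the Agent's only means of verification is self-reasoning. And one of the things models do best is rationalize and explain away their own mistakes: ask a model to review code it wrote and it will most likely report "no problems, everything works."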

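TDD is what breaks this closed loop: write the test first, watch it fail, then write code to make it pass. This introduces an objective, verifiable checkpoint that does not depend on the Agent's subjective judgment.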

The Iron Law: No Code Without a Failing Test

NO PRODUCTION CODE WITHOUT A FAILING TEST FIRST

This is the iron law of the TDD skill: all caps, unconditional, no exceptions. The consequence of violating it is blunt:

Write code before the test? Delete it. Start over.

No exceptions:
- Don't keep it as "reference"
- Don't "adapt" it while writing tests
- Don't look at it
- Delete means delete

Implement fresh from tests. Period.

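Note that this is not about "the test was badly written and needs revising." If you wrote production code first and then added tests, the code is deleted, with no room for negotiation. Why so extreme?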

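Because once code exists, the tests you write unconsciously shift toward "verifying existing behavior" rather than "defining expected behavior." You write assertions while staring at the code, and the test becomes a confirmation of the implementation instead of a specification of the requirement. Humans fall into this psychological trap; Agents fall in deeper.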

The Red-Green-Refactor Cycle

The core TDD cycle has three steps, each with a mandatory verification point:

RED → Verify RED → GREEN → Verify GREEN → REFACTOR → (stay green) → next RED
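One way to read the cycle above is as a small state machine in which the two Verify steps are gates that depend on the observed test result. The sketch below is illustrative only: the `Phase` type and the `nextPhase` function are hypothetical names, not part of Superpowers, and the rule for a failing run during REFACTOR (stay in REFACTOR until green again) is an assumption of this sketch.

```typescript
// Hypothetical model of the Red-Green-Refactor cycle; all names are illustrative.
type Phase = "RED" | "VERIFY_RED" | "GREEN" | "VERIFY_GREEN" | "REFACTOR";

// Returns the next phase given the current one and whether the latest test run passed.
// RED and GREEN are writing phases, so they ignore `testsPassed`.
function nextPhase(phase: Phase, testsPassed: boolean): Phase {
  switch (phase) {
    case "RED":
      return "VERIFY_RED"; // a test was written; now it must be run
    case "VERIFY_RED":
      // A test that passes immediately proves nothing: go back and fix the test.
      return testsPassed ? "RED" : "GREEN";
    case "GREEN":
      return "VERIFY_GREEN"; // minimal code was written; run the suite
    case "VERIFY_GREEN":
      // Still failing: fix the code, not the test.
      return testsPassed ? "REFACTOR" : "GREEN";
    case "REFACTOR":
      // Assumption: stay in REFACTOR until the suite is green, then start the next RED.
      return testsPassed ? "RED" : "REFACTOR";
  }
}
```

The point of modeling it this way is that there is no transition from RED straight to GREEN: the only route to writing production code goes through a test run that was observed to fail.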

RED: Write a Failing Test

Write a minimal test that describes the behavior you expect:

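test('retries failed operations 3 times', async () => {
  let attempts = 0;
  // Fails on the first two calls, succeeds on the third.
  // Declared async so it matches the Promise-returning signature of retryOperation.
  const operation = async () => {
    attempts++;
    if (attempts < 3) throw new Error('fail');
    return 'success';
  };

  const result = await retryOperation(operation);

  expect(result).toBe('success');
  expect(attempts).toBe(3);
});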

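One test verifies one behavior. The test name should make clear which scenario is being exercised and what result is expected.

Compare with a bad example: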

test('retry works', async () => {
  const mock = jest.fn()
    .mockRejectedValueOnce(new Error())
    .mockRejectedValueOnce(new Error())
    .mockResolvedValueOnce('success');

  await retryOperation(mock);

  expect(mock).toHaveBeenCalledTimes(3);
});

The problems are obvious: the name is vague ("retry works" says nothing), and the test checks how many times the mock was called rather than real behavior.

Verify RED: Confirm the Test Fails

MANDATORY. Never skip.

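Run the test and confirm:

- The test fails (an assertion is not met) rather than errors (the test itself crashes); the two are different
- It fails because the feature is missing, not because of a typo
- The failure message is what you expected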

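What if the test passes right away? Then you are testing existing behavior: either the feature already exists or the test is wrong. Fix the test; do not skip the verification.

The point of this step is to show that the test really can detect the missing feature. If you skip it and go straight to the implementation, you can never be sure the test is checking anything at all.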
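The fail-versus-error distinction can be made concrete with a tiny classifier. This is a minimal sketch rather than how any real test runner works internally: `AssertionFailure`, `expectEqual`, `classify`, and the `slugify` placeholder are all hypothetical names introduced for the example.

```typescript
// Hypothetical marker type for "an assertion was not met".
class AssertionFailure {
  message: string;
  constructor(message: string) {
    this.message = message;
  }
}

type Outcome = "pass" | "fail" | "error";

// Minimal assertion: throws AssertionFailure when the values differ.
function expectEqual<T>(actual: T, expected: T): void {
  if (actual !== expected) {
    throw new AssertionFailure(`expected ${String(expected)}, got ${String(actual)}`);
  }
}

// Runs a test body and reports whether it passed, failed an assertion,
// or crashed for some other reason.
function classify(testBody: () => void): Outcome {
  try {
    testBody();
    return "pass";
  } catch (e) {
    return e instanceof AssertionFailure ? "fail" : "error";
  }
}

// Placeholder for a feature that has not been implemented yet.
function slugify(_title: string): string {
  return "";
}

// Feature missing: the assertion runs and is not met. This is a valid RED.
const missingFeature = classify(() => expectEqual(slugify("Hello World"), "hello-world"));

// Broken test: it crashes before any assertion runs (simulated with a thrown TypeError).
// This is not a valid RED; the test must be fixed first.
const brokenTest = classify(() => {
  throw new TypeError("slugify is not defined");
});
```

Only the first outcome says anything about the feature. The second says the test itself is broken, which is why Verify RED asks for a failure and not just a non-passing run.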

GREEN: Minimal Implementation

Write just enough code to make the test pass, no more and no less:

async function retryOperation<T>(fn: () => Promise<T>): Promise<T> {
  for (let i = 0; i < 3; i++) {
    try {
      return await fn();
    } catch (e) {
      if (i === 2) throw e; // third attempt failed: rethrow the error
    }
  }
  throw new Error('unreachable');
}

Compare with an over-engineered version:

async function retryOperation<T>(
  fn: () => Promise<T>,
  options?: {
    maxRetries?: number;
    backoff?: 'linear' | 'exponential';
    onRetry?: (attempt: number) => void;
  }
): Promise<T> {
  // YAGNI: the test does not ask for any of this
}

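Core principle: do not add features the test does not require, do not refactor other code, and do not make any optimization that goes "beyond the test."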

Verify GREEN: Confirm the Test Passes

Run the tests and confirm:

- The new test passes
- All other tests still pass
- No warnings or errors

Test fails? Fix code, not test.
Other tests fail? Fix now.

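REFACTOR: Clean Up the Code

Refactor only after reaching GREEN:

- Remove duplication
- Improve names
- Extract helper functions

The constraint is explicit: refactoring must not change behavior, so the tests must stay green throughout.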

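Then: the Next RED

Go back to the start and write the next test. The cycle is incremental: each iteration takes one small step, and every step leaves a verifiable record.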

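Why the Order Is an Iron Law

Superpowers spends a lot of space arguing why you cannot write code first and add tests afterward. These arguments are not dogma; they target the mistakes models are most prone to make.

"Writing code first and adding tests afterward works just as well"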

Tests written after code pass immediately. Passing immediately proves nothing:
- Might test wrong thing
- Might test implementation, not behavior
- Might miss edge cases you forgot
- You never saw it catch the bug

Test-first forces you to see the test fail, proving it actually tests something.

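This is especially damaging for an Agent: the tests it adds afterward are almost always derived from the implementation it has already written. It will not deliberately write a test that fails, which defeats the purpose of TDD entirely.

"I have already manually verified all the edge cases"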

Manual testing is ad-hoc. You think you tested everything but:
- No record of what you tested
- Can't re-run when code changes
- Easy to forget cases under pressure
- "It worked when I tried it" ≠ comprehensive

"Deleting code that is already written is too wasteful"

Sunk cost fallacy. The time is already gone. Your choice now:
- Delete and rewrite with TDD (X more hours, high confidence)
- Keep it and add tests after (30 min, low confidence, likely bugs)

The "waste" is keeping code you can't trust.

"TDD is dogma; being pragmatic is the right way"

TDD IS pragmatic:
- Finds bugs before commit (faster than debugging after)
- Prevents regressions (tests catch breaks immediately)
- Documents behavior (tests show how to use code)
- Enables refactoring (change freely, tests catch breaks)

"Pragmatic" shortcuts = debugging in production = slower.

Common Rationalizations Table

This is Superpowers' Rationalization Prevention mechanism applied to TDD: put every likely excuse on the table in advance and knock each one down:

Excuse	Reality
"Too simple to test"	Simple code breaks. Test takes 30 seconds.
"I'll test after"	Tests passing immediately prove nothing.
"Tests after achieve same goals"	Tests-after = "what does this do?" Tests-first = "what should this do?"
"Already manually tested"	Ad-hoc ≠ systematic. No record, can't re-run.
"Deleting X hours is wasteful"	Sunk cost fallacy. Keeping unverified code is technical debt.
"Keep as reference, write tests first"	You'll adapt it. That's testing after. Delete means delete.
"Need to explore first"	Fine. Throw away exploration, start with TDD.
"Test hard = design unclear"	Listen to test. Hard to test = hard to use.
"TDD will slow me down"	TDD faster than debugging. Pragmatic = test-first.
"Manual test faster"	Manual doesn't prove edge cases. You'll re-test every change.
"Existing code has no tests"	You're improving it. Add tests for existing code.

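Note row 7, "Need to explore first": Superpowers allows exploratory coding (a spike), but requires that all the exploration code be deleted afterward and the work restarted with TDD. The purpose of exploration is to understand the problem, not to produce code.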

Red Flags: Signs That TDD Has Gone Wrong

- Code before test
- Test after implementation
- Test passes immediately
- Can't explain why test failed
- Tests added "later"
- Rationalizing "just this once"
- "I already manually tested it"
- "Tests after achieve the same purpose"
- "It's about spirit not ritual"
- "Keep as reference" or "adapt existing code"
- "Already spent X hours, deleting is wasteful"
- "TDD is dogmatic, I'm being pragmatic"
- "This is different because..."

All of these mean: Delete code. Start over with TDD.

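The last item, "This is different because...", is the most insidious: a model can always find a reason why "this case is special." Superpowers' answer is blunt: there are no special cases.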

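Criteria for Good Tests

Dimension	Good test	Bad test
Minimal	Tests one thing. "and" in the name? Split it.	`test('validates email and domain and whitespace')`
Clear	Name describes the behavior	`test('test1')`
Shows intent	Demonstrates the desired API usage	Obscures how the code is meant to be used
Real code	Uses real code (mocks only when unavoidable)	Mocks everywhere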

One key principle: a test should show the ideal usage of the API. If the test code is ugly and complicated, the API design itself has a problem.

Test hard = design unclear. Listen to test. Hard to test = hard to use.

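The TDD Workflow for Bug Fixes

Bug fixes are an especially good fit for TDD: write a test that reproduces the bug, then fix it, and the bug stays covered from then on.

Suppose there is a bug: an empty email address is accepted.

RED (reproduce the bug):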

test('rejects empty email', async () => {
  const result = await submitForm({ email: '' });
  expect(result.error).toBe('Email required');
});

Verify RED:

$ npm test
FAIL: expected 'Email required', got undefined

The test fails, which confirms the bug exists.

GREEN (minimal fix):

function submitForm(data: FormData) {
  if (!data.email?.trim()) {
    return { error: 'Email required' };
  }
  // ...
}

Verify GREEN:

$ npm test
PASS

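REFACTOR: if several fields need similar validation, extract the validation logic.

The benefit of this workflow: after the fix, the test stays in the suite permanently. If someone later changes related code and causes a regression, the test catches it immediately.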

Bug found? Write failing test reproducing it. Follow TDD cycle. Test proves fix and prevents regression.

Never fix bugs without a test.
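For the REFACTOR step of this bug fix, one possible shape is to pull the empty check into a shared helper. This is a sketch under assumptions: `requireField`, the `SignupData` and `FormResult` types, the `name` field, and the 'Name required' message are hypothetical; only the email rule and its 'Email required' message come from the example above.

```typescript
// Hypothetical form shape; the original example only shows the `email` field.
interface SignupData {
  email?: string;
  name?: string;
}

interface FormResult {
  error?: string;
}

// Shared rule extracted during REFACTOR: a field counts as missing if it is
// undefined, empty, or whitespace only. Returns an error message or undefined.
function requireField(value: string | undefined, label: string): string | undefined {
  return value?.trim() ? undefined : `${label} required`;
}

function submitForm(data: SignupData): FormResult {
  // Report the first missing field, checking email before name.
  const error = requireField(data.email, "Email") ?? requireField(data.name, "Name");
  if (error) return { error };
  // ... the rest of the submission logic is out of scope for this sketch
  return {};
}
```

The email behavior is unchanged, so the 'rejects empty email' test stays green. The rule for `name` is new behavior in this sketch, so under the iron law it would need its own failing test before being added.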

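When Stuck: What to Do When You Can't Write the Test

Problem	Solution
Don't know how to test it	Write the ideal API call first. Write the assertion first. If that still doesn't work, ask your human partner.
Test is too complicated	The design is too complicated. Simplify the interface.
Have to mock everything	The code is too tightly coupled. Use dependency injection to decouple it.
Test setup is too large	Extract helpers. If it is still complex, the design needs simplifying.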

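The core idea: a test that is hard to write is a symptom of a design problem, not a problem with TDD as a method. If the test is hard to write, the conclusion is not "TDD doesn't fit this scenario" but "the code design needs to improve."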
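As an illustration of the dependency-injection advice in the table, the sketch below passes the clock in as a parameter, so a test can run the real function with a fixed time instead of mocking a global. `isExpired`, `UserSession`, and `ClockFn` are hypothetical names chosen for this example.

```typescript
// Hypothetical example: the clock is injected instead of being read inside the function.
type ClockFn = () => number; // returns milliseconds since the epoch

interface UserSession {
  expiresAt: number; // milliseconds since the epoch
}

// Production callers omit `now` and get the system clock; tests pass a fixed one.
function isExpired(session: UserSession, now: ClockFn = Date.now): boolean {
  return now() >= session.expiresAt;
}

// In a test, a fixed clock replaces the system time; no mocking library is needed.
const fixedClock: ClockFn = () => 1000;
const expired = isExpired({ expiresAt: 999 }, fixedClock);
const active = isExpired({ expiresAt: 1001 }, fixedClock);
```

The test exercises the same code path production uses. The only thing that differs is a plain argument, which is what "real code, mocks only when unavoidable" looks like in practice.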

Verification Checklist

A self-check list to run before completing any task:

Before marking work complete:
- [ ] Every new function/method has a test
- [ ] Watched each test fail before implementing
- [ ] Each test failed for expected reason (feature missing, not typo)
- [ ] Wrote minimal code to pass each test
- [ ] All tests pass
- [ ] Output pristine (no errors, warnings)
- [ ] Tests use real code (mocks only if unavoidable)
- [ ] Edge cases and errors covered

Can't check all boxes? You skipped TDD. Start over.

How TDD Works with SDD from the Previous Article

In the Subagent-Driven Development workflow, TDD operates inside the Implementer subagent stage:

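Coordinator dispatches task
    │
    ▼
Implementer subagent starts work
    │
    ├── RED: write a failing test
    ├── Verify RED: run it and confirm it fails
    ├── GREEN: minimal implementation
    ├── Verify GREEN: run it and confirm it passes
    ├── REFACTOR: clean up
    └── Commit
    │
    ▼
Spec Reviewer: does the implementation match the spec?
    │
    ▼
Code Quality Reviewer: how is the code quality?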

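Internally, the Implementer uses TDD to ensure the code it writes is correct. The Reviewers check from the outside whether it meets the spec and the quality bar. Internal assurance (TDD) plus external review gives a two-layer quality loop.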

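Why This Matters Especially for Agents

For humans, TDD is a "better practice." For an Agent, TDD is close to the "only reliable means of verification":

Verification method	Human	Agent
Running the program manually	✅ Any time	❌ Requires a specially configured environment
IDE debugger	✅ Set breakpoints, inspect variables	❌ Not available
Intuition / experience	✅ Reasonably reliable	❌ Models rationalize
Automated tests	✅ Optional	✅ The only objective means
Self-review	✅ Somewhat effective	⚠️ Strongly biased toward its own code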

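This is why Superpowers elevates TDD from a "recommended practice" to an "iron law": not out of dogma, but because the Agent has no other reliable means of quality assurance.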

Practical Advice

You Don't Need Superpowers to Make an Agent Do TDD

Add a constraint like the following to CLAUDE.md or your system prompt:

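## TDD Iron Law

All new features and bug fixes must follow RED-GREEN-REFACTOR:
1. Write a failing test first
2. Run the test and confirm it fails (a failure, not an error)
3. Write the minimum code to make the test pass
4. Run the tests and confirm they pass
5. Clean up the code if needed

Violation (code written before the test) → delete the code and start over from the test.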

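The Key Is Enforcing "Verify RED"

Most Agents are willing to write the test first, but they often skip the step of running it to confirm it fails. That step is the heart of TDD: if you never saw the test fail, you do not know what it is actually checking.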

If you didn't watch the test fail, you don't know if it tests the right thing.

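Use the Rationalization Table Against Model Shortcuts

Models find all sorts of reasons to skip TDD. List those reasons in advance and mark each one as "not a valid reason to skip"; that is the power of Rationalization Prevention.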

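Summary

Superpowers' treatment of TDD reflects a deeper insight: an AI Agent needs TDD not because it is a "good development habit" but because automated testing is its only objective means of verifying quality.

An Agent without TDD can only judge whether code is correct through self-reasoning, which is precisely the least reliable method. TDD provides a verification path that is objective, repeatable, and independent of self-judgment.

The next article is the final one in this series: when you want to create your own Skill (behavior-shaping code), how do you use "pressure testing" to make sure it actually constrains the Agent? That is TDD for Writing Skills.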

Take It with You: A TDD Iron Law for CLAUDE.md

You don't need Superpowers. Add the following to your CLAUDE.md:

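## TDD Iron Law

NO PRODUCTION CODE WITHOUT A FAILING TEST FIRST.

Consequences of violation:
- Wrote code before the test? Delete it and start over.
- Don't keep it as "reference", don't "adapt" it, don't look at it. Delete means delete.

Red-Green-Refactor cycle:
1. RED: write a minimal failing test and run it to confirm it fails
2. GREEN: write the minimum code to make the test pass and run it to confirm it passes
3. REFACTOR: clean up the code once the tests pass
4. At every step, actually run the command and read the output. "I think it will pass" does not count.

Banned phrases: should pass, probably works, seems right, I tested it manually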
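If you want to check the banned-phrase list mechanically, for example in a script that reviews an Agent's completion report, a simple substring scan may be enough. This is a hypothetical sketch, not something Superpowers ships: `findBannedPhrases` is an invented name, and case-insensitive substring matching is an assumption about what counts as a hit.

```typescript
// Phrases copied from the banned-phrases line of the CLAUDE.md snippet above.
const BANNED_PHRASES: string[] = [
  "should pass",
  "probably works",
  "seems right",
  "I tested it manually",
];

// Hypothetical helper: returns every banned phrase that appears in the report,
// using a case-insensitive substring match.
function findBannedPhrases(report: string): string[] {
  const lower = report.toLowerCase();
  return BANNED_PHRASES.filter((phrase) => lower.indexOf(phrase.toLowerCase()) !== -1);
}

// A report that asserts success without showing test output is flagged.
const hits = findBannedPhrases("Implemented retry. The test should pass now.");
```

A non-empty result is a cue to ask for the actual test command and its output, which is what item 4 of the snippet requires.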

Source material for this article: obra/superpowers/skills/test-driven-development/SKILL.md