diff options
| author | Craig Jennings <c@cjennings.net> | 2026-05-22 15:13:05 -0500 |
|---|---|---|
| committer | Craig Jennings <c@cjennings.net> | 2026-05-22 15:13:05 -0500 |
| commit | e5a01fd95771962f0ec3d0164266137f84975f39 (patch) | |
| tree | 073aacbb83b3b01f5db2f306a22405cdc06508ac | |
| parent | 0afe48df3d4fa49da889fbeaeaa38e8f971e030a (diff) | |
| download | rulesets-e5a01fd95771962f0ec3d0164266137f84975f39.tar.gz rulesets-e5a01fd95771962f0ec3d0164266137f84975f39.zip | |
docs(rules): add escalation testing and a spike protocol to testing.md
Two additions. An "Escalation Beyond Category and Pairwise" section adds property-based testing (for invariants across a broad input domain) and mutation testing (for when high coverage hides thin assertions), both as escalation paths rather than always-on gates. And the "I need to spike first" excuse is formalized into a disciplined spike protocol: TDD stays the default, but a spike is sanctioned only when timeboxed, not committed, and followed by the first failing test before productionizing.
| -rw-r--r-- | claude-rules/testing.md | 62 |
1 files changed, 61 insertions, 1 deletions
diff --git a/claude-rules/testing.md b/claude-rules/testing.md index b2ff606..b3fa5bf 100644 --- a/claude-rules/testing.md +++ b/claude-rules/testing.md @@ -78,6 +78,42 @@ the context requires *provably* exhaustive coverage (regulated systems — docum in an ADR), or the testing target is non-parametric (single happy path, performance regression, a specific error). +## Escalation Beyond Category and Pairwise + +The Normal/Boundary/Error categories and the pairwise matrix are the default +discipline. Two further techniques escalate beyond them — reach for them when +the default leaves a gap, not on every unit. + +### Property-Based Testing + +When an invariant holds across a broad input domain — round-trips +(`decode(encode(x)) == x`), idempotence (`f(f(x)) == f(x)`), ordering +invariants (output is always sorted), or any "output always satisfies X" — +generate inputs and assert the property instead of enumerating cases. The +generator explores corners you wouldn't think to write by hand, and a +failing case shrinks to a minimal reproducer. Use the standard tool for the +language (Hypothesis for Python, fast-check for JS, proptest for Rust). +State the property as the test name and let the framework supply the inputs. + +Reach for this when the behavior is a law over a domain rather than a fixed +set of examples. Keep category-discipline cases for the specific edges that +must always hold; the property test covers the space between them. + +### Mutation Testing + +When line coverage is high but you suspect the assertions are thin — tests +that execute the code without checking its output, or that pass with a +function body replaced by a stub — use mutation testing to measure whether +the suite actually kills injected faults. The tool flips conditionals, swaps +operators, and deletes statements, then reruns the suite; a surviving mutant +is a fault the tests didn't catch. Use mutmut or cosmic-ray for Python, +Stryker for JS. High line coverage with a low mutation score means weak +assertions, not a tested codebase. + +Reach for this on critical logic where coverage looks reassuring but you +want evidence the tests would fail on a regression. It's a diagnostic, not a +gate on every change — mutation runs are slow. + ## Test Organization Typical layout: @@ -274,10 +310,34 @@ TDD is non-negotiable. These are the rationalizations agents use to skip it — | "I'm only changing one line" | One-line changes cause production outages. Write a test that covers the line you're changing. | | "The existing code has no tests" | Start with a characterization test. Don't make the problem worse. | | "This is demo/prototype code" | Demos build habits. Untested demo code becomes untested production code. | -| "I need to spike first" | Spikes are fine — then throw away the spike, write the test, and implement properly. | +| "I need to spike first" | Spikes are fine — under the protocol below. Throw the spike away, then write the first failing test before productionizing. | If you catch yourself thinking any of these, stop and write the test. +### The Spike Exception (Disciplined) + +TDD stays the default. The one sanctioned way to write code before a test is +a spike — exploratory code that answers "is this approach even viable?" when +you can't yet write a meaningful failing test because the shape of the +solution is unknown. A spike is disciplined only when all three hold: + +1. **Timebox it.** Set a limit before starting (an hour, an afternoon) and + stop when it's up. An open-ended spike is just untested implementation + wearing a different name. +2. **Do not commit spike code.** The spike is a learning artifact, not a + deliverable. It never enters the branch history. Keep it in a scratch + file or a throwaway worktree. +3. **Throw the spike away, then start with a failing test.** Once the spike + has answered the viability question, delete it. Write the first failing + test against the now-understood behavior, then productionize under normal + Red/Green/Refactor. The production code is written test-first even though + the exploration wasn't — you don't promote the spike into production by + bolting tests on after. + +The spike buys understanding, not code. If you find yourself keeping the +spike because rewriting it feels wasteful, the timebox was too long or the +problem was tractable enough to TDD from the start. + ## Anti-Patterns (Do Not Do) - Hardcoded dates or timestamps (they rot) |
