docs(rules): add escalation testing and a spike protocol to testing.md

Two additions. An "Escalation Beyond Category and Pairwise" section adds property-based testing (for invariants across a broad input domain) and mutation testing (for when high coverage hides thin assertions), both as escalation paths rather than always-on gates. And the "I need to spike first" excuse is formalized into a disciplined spike protocol: TDD stays the default, but a spike is sanctioned only when timeboxed, not committed, and followed by the first failing test before productionizing.
author: Craig Jennings <c@cjennings.net> 2026-05-22 15:13:05 -0500
committer: Craig Jennings <c@cjennings.net> 2026-05-22 15:13:05 -0500
commit: e5a01fd95771962f0ec3d0164266137f84975f39 (patch)
tree: 073aacbb83b3b01f5db2f306a22405cdc06508ac
parent: 0afe48df3d4fa49da889fbeaeaa38e8f971e030a (diff)
1 file changed, 61 insertions, 1 deletion
diff --git a/claude-rules/testing.md b/claude-rules/testing.md
index b2ff606..b3fa5bf 100644
--- a/claude-rules/testing.md
+++ b/claude-rules/testing.md
@@ -78,6 +78,42 @@ the context requires *provably* exhaustive coverage (regulated systems — docum
 in an ADR), or the testing target is non-parametric (single happy path,
 performance regression, a specific error).
 
+## Escalation Beyond Category and Pairwise
+
+The Normal/Boundary/Error categories and the pairwise matrix are the default
+discipline. Two further techniques escalate beyond them — reach for them when
+the default leaves a gap, not on every unit.
+
+### Property-Based Testing
+
+When an invariant holds across a broad input domain — round-trips
+(`decode(encode(x)) == x`), idempotence (`f(f(x)) == f(x)`), ordering
+invariants (output is always sorted), or any "output always satisfies X" —
+generate inputs and assert the property instead of enumerating cases. The
+generator explores corners you wouldn't think to write by hand, and a
+failing case shrinks to a minimal reproducer. Use the standard tool for the
+language (Hypothesis for Python, fast-check for JS, proptest for Rust).
+State the property as the test name and let the framework supply the inputs.
+
+Reach for this when the behavior is a law over a domain rather than a fixed
+set of examples. Keep category-discipline cases for the specific edges that
+must always hold; the property test covers the space between them.
+
+### Mutation Testing
+
+When line coverage is high but you suspect the assertions are thin — tests
+that execute the code without checking its output, or that pass with a
+function body replaced by a stub — use mutation testing to measure whether
+the suite actually kills injected faults. The tool flips conditionals, swaps
+operators, and deletes statements, then reruns the suite; a surviving mutant
+is a fault the tests didn't catch. Use mutmut or cosmic-ray for Python,
+Stryker for JS. High line coverage with a low mutation score means weak
+assertions, not a tested codebase.
+
+Reach for this on critical logic where coverage looks reassuring but you
+want evidence the tests would fail on a regression. It's a diagnostic, not a
+gate on every change — mutation runs are slow.
+
 ## Test Organization
 
 Typical layout:
@@ -274,10 +310,34 @@ TDD is non-negotiable. These are the rationalizations agents use to skip it —
 | "I'm only changing one line" | One-line changes cause production outages. Write a test that covers the line you're changing. |
 | "The existing code has no tests" | Start with a characterization test. Don't make the problem worse. |
 | "This is demo/prototype code" | Demos build habits. Untested demo code becomes untested production code. |
-| "I need to spike first" | Spikes are fine — then throw away the spike, write the test, and implement properly. |
+| "I need to spike first" | Spikes are fine — under the protocol below. Throw the spike away, then write the first failing test before productionizing. |
 
 If you catch yourself thinking any of these, stop and write the test.
 
+### The Spike Exception (Disciplined)
+
+TDD stays the default. The one sanctioned way to write code before a test is
+a spike — exploratory code that answers "is this approach even viable?" when
+you can't yet write a meaningful failing test because the shape of the
+solution is unknown. A spike is disciplined only when all three hold:
+
+1. **Timebox it.** Set a limit before starting (an hour, an afternoon) and
+   stop when it's up. An open-ended spike is just untested implementation
+   wearing a different name.
+2. **Do not commit spike code.** The spike is a learning artifact, not a
+   deliverable. It never enters the branch history. Keep it in a scratch
+   file or a throwaway worktree.
+3. **Throw the spike away, then start with a failing test.** Once the spike
+   has answered the viability question, delete it. Write the first failing
+   test against the now-understood behavior, then productionize under normal
+   Red/Green/Refactor. The production code is written test-first even though
+   the exploration wasn't — you don't promote the spike into production by
+   bolting tests on after.
+
+The spike buys understanding, not code. If you find yourself keeping the
+spike because rewriting it feels wasteful, the timebox was too long or the
+problem was tractable enough to TDD from the start.
+
 ## Anti-Patterns (Do Not Do)
 
 - Hardcoded dates or timestamps (they rot)
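
For illustration only (this is not part of the patch): a minimal Python sketch of the two escalation techniques the new section describes. It uses only the standard library, so the property check is a hand-rolled stand-in for a tool like Hypothesis (no shrinking), and the "mutant" is written by hand rather than generated by mutmut or Stryker. The `encode`/`decode` codec, `clamp`, and every helper name here are hypothetical, chosen only to make the ideas concrete.

```python
import random


def encode(data: bytes) -> str:
    """Hypothetical codec under test: bytes to lowercase hex."""
    return data.hex()


def decode(text: str) -> bytes:
    """Inverse of encode."""
    return bytes.fromhex(text)


def check_round_trip(encode_fn, decode_fn, trials=500, seed=0):
    """Property: decode(encode(x)) == x for generated byte strings.

    Returns the first counterexample found, or None if the property held
    for every generated input. A real property-testing tool would also
    shrink the counterexample; this sketch does not.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        size = rng.randint(0, 32)
        x = bytes(rng.randrange(256) for _ in range(size))
        if decode_fn(encode_fn(x)) != x:
            return x
    return None


def clamp(value, low, high):
    """Function under test: limit value to the range [low, high]."""
    if value < low:
        return low
    if value > high:
        return high
    return value


def clamp_mutant(value, low, high):
    """Hand-written mutant: the first comparison is flipped from < to >."""
    if value > low:  # mutated operator
        return low
    if value > high:
        return high
    return value


def weak_test(fn):
    """Thin assertions: calls the code but only checks it returns something."""
    return all(fn(v, 0, 10) is not None for v in (-5, 50, 5))


def strong_test(fn):
    """Checks the actual outputs, so a flipped comparison is detected."""
    return fn(-5, 0, 10) == 0 and fn(50, 0, 10) == 10 and fn(5, 0, 10) == 5
```

Under these assumptions, `check_round_trip(encode, decode)` finds no counterexample, while a lossy encoder such as `lambda d: d[:16].hex()` is caught by the generated inputs. The mutant survives `weak_test` but is killed by `strong_test`, which is the gap between line coverage and mutation score that the section points at.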