docs(skills): tighten debug, root-cause-trace, and five-whys

Three audit-pass fixes across the debugging skills. debug now captures environment and recent-change context (versions, flags, dataset, seed/clock, concurrency, recent commits) as a Phase-1 step. Many intermittent bugs live in state or environment, not a local code path, and "what changed recently" is often the fastest route to the cause. root-cause-trace's defense-in-depth said to add a check at every layer that could have caught the bad value, which breeds validation spam. It now adds checks only at boundary-owning layers (ingress, persistence, the invariant owner, final render), and says a pass-through function that owns neither a boundary nor an invariant shouldn't get a duplicate null check. five-whys now makes each link carry an evidence field and a counterfactual: if you remove this cause, does the symptom above still happen? That's the guard against a tidy chain that reads well but wouldn't have prevented the failure.
author: Craig Jennings <c@cjennings.net> 2026-05-22 14:26:43 -0500
committer: Craig Jennings <c@cjennings.net> 2026-05-22 14:26:43 -0500
commit: 3916dc446c8925f64a974498d326637b34d46575 (patch)
tree: c6f685d27707511a6f0bd8fc18279e70371d45f7
parent: 282a35d30edca8c5ae7a5955111440908413f267 (diff)
download: rulesets-3916dc446c8925f64a974498d326637b34d46575.tar.gz
rulesets-3916dc446c8925f64a974498d326637b34d46575.zip
3 files changed, 16 insertions, 7 deletions
diff --git a/debug/SKILL.md b/debug/SKILL.md
index ae864db..4db6bbd 100644
--- a/debug/SKILL.md
+++ b/debug/SKILL.md
@@ -22,8 +22,9 @@ No fixes during this phase. Gather, don't guess.
 1. **Reproduce the failure** — run the failing test or trigger the bug. Capture the exact error message, stack trace, or incorrect output. If you can't reproduce it, say so plainly; intermittent failures get treated differently from deterministic ones.
 2. **Check logs and observability** — application logs, error tracking, APM traces, dashboards around the time of failure. Logs often reveal context that code reading alone cannot.
 3. **Locate the failure point** — name the file and line where the error surfaces. Read the surrounding code so you understand what it was *trying* to do, not just where it broke.
+4. **Record the environment and recent changes** — versions (runtime, key deps), feature-flag and config state, the dataset or fixture in play, seed and clock/time, concurrency level, and the recent commits or config/infra changes around when it started failing. Many intermittent bugs live in environment or state transitions, not a local code path — and "what changed recently" is often the fastest route to the cause. For a deterministic local bug this is one line; for an intermittent one it's the most important step here.
 
-You now have: a reproducer (or a "can't reproduce" note), a captured failure signature, and a failure-site address.
+You now have: a reproducer (or a "can't reproduce" note), a captured failure signature, a failure-site address, and the environment/recent-change context around it.
 
 ## Phase 2 — Route to the Right Technique
 
diff --git a/five-whys/SKILL.md b/five-whys/SKILL.md
index 7d7080d..e645f82 100644
--- a/five-whys/SKILL.md
+++ b/five-whys/SKILL.md
@@ -37,13 +37,20 @@ Good: "The 2026-04-17 release was rolled back at 14:02 after the cart-checkout e
 
 Bad: "Our releases are flaky."
 
-### 2. Ask Why — One Answer
+### 2. Ask Why — One Answer, With Evidence and a Counterfactual
 
 Not three possible answers. One best-supported answer, based on evidence you can point to. If the question genuinely has multiple independent causes, you'll branch in step 4.
 
+Each link in the chain owes two things, not just an answer:
+
+- **Evidence** — what you can point to that supports this cause (a log line, a commit, a metric, a config value). "It seems like" without evidence is a guess, and a guessed link derails every why below it.
+- **Counterfactual check** — if this cause were removed, would the symptom above it plausibly not have happened? If removing the cause leaves the symptom standing, you've named a coincidence, not a cause. This is the main guard against monocausal storytelling — a tidy chain that reads well but wouldn't actually have prevented the failure.
+
 ```
 Why did the release roll back?
   → The cart-checkout endpoint returned 500 on ~8% of traffic.
+    evidence: rollback log 14:02; APM 500-rate panel; error tracker grouped on CartController#checkout
+    counterfactual: no 500 spike → no rollback trigger fires → release stays up. Holds.
 ```
 
 ### 3. Take the Answer as the New Question
diff --git a/root-cause-trace/SKILL.md b/root-cause-trace/SKILL.md
index fad4601..bb46f41 100644
--- a/root-cause-trace/SKILL.md
+++ b/root-cause-trace/SKILL.md
@@ -99,13 +99,14 @@ Two actions, in order:
 
 **a. Fix at the trigger.** In the example: change the query to `INNER JOIN`, or explicitly handle customer-less orders (depending on intent).
 
-**b. Add defense-in-depth.** For each layer between the trigger and the symptom, ask: *could this layer have caught the bad value?* If yes, add the check.
+**b. Add defense-in-depth — at boundaries, not everywhere.** Don't armor every layer between the trigger and the symptom; that's validation spam that buries the real check and slows every call. Add a check only at a layer that *owns* a boundary or invariant:
 
-- Parser/validator layer: reject rows without `customer_id`
-- Service layer: throw if `order.customer` is nil instead of passing it downstream
-- Formatter layer: render "Unknown customer" rather than crashing
+- **Ingress / trust boundary** — where untrusted or external data first enters (parser/validator layer: reject rows without `customer_id`)
+- **Persistence boundary** — before writing to or reading from a store
+- **Invariant-owning layer** — the service that's supposed to guarantee a fact (throw if `order.customer` is nil rather than passing it downstream)
+- **Final render/output** — degrade gracefully (render "Unknown customer" rather than crashing)
 
-Each defense means the next time something similar happens, it surfaces earlier and with better context. The goal isn't any single check — it's that the bad value can't propagate silently.
+A pass-through function that neither owns the invariant nor crosses a boundary should *not* get a duplicate null check — let the boundary layers carry it. Each boundary defense means a similar bad value surfaces earlier and with better context; the goal isn't any single check, it's that the bad value can't propagate silently past the layer responsible for it.
 
 ## Adding Instrumentation
author	Craig Jennings <c@cjennings.net>	2026-05-22 14:26:43 -0500
committer	Craig Jennings <c@cjennings.net>	2026-05-22 14:26:43 -0500
commit	3916dc446c8925f64a974498d326637b34d46575 (patch)
tree	c6f685d27707511a6f0bd8fc18279e70371d45f7
parent	282a35d30edca8c5ae7a5955111440908413f267 (diff)
download	rulesets-3916dc446c8925f64a974498d326637b34d46575.tar.gz rulesets-3916dc446c8925f64a974498d326637b34d46575.zip