From 3916dc446c8925f64a974498d326637b34d46575 Mon Sep 17 00:00:00 2001
From: Craig Jennings
Date: Fri, 22 May 2026 14:26:43 -0500
Subject: docs(skills): tighten debug, root-cause-trace, and five-whys

Three audit-pass fixes across the debugging skills.

debug now captures environment and recent-change context (versions, flags, dataset, seed/clock, concurrency, recent commits) as a Phase-1 step. Many intermittent bugs live in state or environment, not a local code path, and "what changed recently" is often the fastest route to the cause.

root-cause-trace's defense-in-depth said to add a check at every layer that could have caught the bad value, which breeds validation spam. It now adds checks only at boundary-owning layers (ingress, persistence, the invariant owner, final render), and says a pass-through function that owns neither a boundary nor an invariant shouldn't get a duplicate null check.

five-whys now makes each link carry an evidence field and a counterfactual: if you remove this cause, does the symptom above still happen? That's the guard against a tidy chain that reads well but wouldn't have prevented the failure.
---
 debug/SKILL.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

(limited to 'debug')

diff --git a/debug/SKILL.md b/debug/SKILL.md
index ae864db..4db6bbd 100644
--- a/debug/SKILL.md
+++ b/debug/SKILL.md
@@ -22,8 +22,9 @@ No fixes during this phase. Gather, don't guess.
 1. **Reproduce the failure** — run the failing test or trigger the bug. Capture the exact error message, stack trace, or incorrect output. If you can't reproduce it, say so plainly; intermittent failures get treated differently from deterministic ones.
 2. **Check logs and observability** — application logs, error tracking, APM traces, dashboards around the time of failure. Logs often reveal context that code reading alone cannot.
 3. **Locate the failure point** — name the file and line where the error surfaces. Read the surrounding code so you understand what it was *trying* to do, not just where it broke.
+4. **Record the environment and recent changes** — versions (runtime, key deps), feature-flag and config state, the dataset or fixture in play, seed and clock/time, concurrency level, and the recent commits or config/infra changes around when it started failing. Many intermittent bugs live in environment or state transitions, not a local code path — and "what changed recently" is often the fastest route to the cause. For a deterministic local bug this is one line; for an intermittent one it's the most important step here.
 
-You now have: a reproducer (or a "can't reproduce" note), a captured failure signature, and a failure-site address.
+You now have: a reproducer (or a "can't reproduce" note), a captured failure signature, a failure-site address, and the environment/recent-change context around it.
 
 ## Phase 2 — Route to the Right Technique
 
-- 
cgit v1.2.3
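
Reviewer note: the new step 4 (record the environment and recent changes) could be sketched as a small capture script. This is only an illustration and not part of the patch. The function names, the `FEATURE_` environment-variable prefix, and the use of `os.cpu_count()` as a rough stand-in for concurrency level are all assumptions made for the example.

```python
import json
import os
import platform
import subprocess
import sys
from datetime import datetime, timezone
from importlib import metadata


def recent_commits(limit=5):
    """Return the last `limit` commits as short strings, or [] if git is unavailable."""
    try:
        out = subprocess.run(
            ["git", "log", f"-{limit}", "--pretty=format:%h %ad %s", "--date=short"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        # No git binary, not a repository, or no commits yet.
        return []
    return out.splitlines()


def dep_versions(names):
    """Map each distribution name to its installed version, or None if not installed."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def capture_context(seed=None, deps=(), flag_prefix="FEATURE_"):
    """Collect the step-4 context into one dict that can be pasted into a bug report."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "deps": dep_versions(deps),
        # Hypothetical convention: feature flags exposed as FEATURE_* env vars.
        "flags": {k: v for k, v in os.environ.items() if k.startswith(flag_prefix)},
        "seed": seed,
        "captured_at_utc": datetime.now(timezone.utc).isoformat(),
        # Rough proxy only; the real concurrency level depends on the application.
        "cpu_count": os.cpu_count(),
        "recent_commits": recent_commits(),
    }


if __name__ == "__main__":
    print(json.dumps(capture_context(seed=1234), indent=2))
```

For a deterministic local bug the output is mostly noise, which matches the step's "one line" guidance. For an intermittent failure, saving the JSON next to each failing run makes it possible to compare runs and see which field changed.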