| author | Craig Jennings <c@cjennings.net> | 2026-05-06 05:23:01 -0500 |
|---|---|---|
| committer | Craig Jennings <c@cjennings.net> | 2026-05-06 05:23:01 -0500 |
| commit | 7f51ed751ce8bf1dad23cfd9534d1238f538671e | |
| tree | 2b3e21eccea8b67eddcd7c6318ee262216328dc6 | |
| parent | 40dda3a48fbfbc14dd76c3e21d289a1b0fa813a8 | |
refactor(debug): make debug a triage router, align specialist cross-refs
I sharpened the debug skill so it stops duplicating root-cause-trace and five-whys. Phase 1 captures evidence and stops there. Phase 2 routes to the right specialist instead of asking why three times inline. Phases 3 and 4 keep the verify-and-fix discipline. I also updated the companion lines in root-cause-trace and five-whys so all three descriptions stay in sync.
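The Phase 2 routing described above can be sketched as a small decision function. This is only an illustration of the branching, not part of the commit: the `Evidence` fields, the `Route` enum, and the `route` function are hypothetical names, and the real skill makes this decision in prose rather than code.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    """Hypothetical labels for the three Phase 2 branches."""
    ROOT_CAUSE_TRACE = "root-cause-trace"
    FIVE_WHYS = "five-whys"
    JUST_FIX_IT = "just-fix-it"


@dataclass
class Evidence:
    """Assumed summary of what Phase 1 captured (field names are illustrative)."""
    cause_is_upstream_of_failure_site: bool = False  # fix site is not the trigger site
    recurring_or_process_failure: bool = False       # e.g. "we've fixed this three times"


def route(evidence: Evidence) -> list[Route]:
    """Return every specialist that applies; both may apply to one bug."""
    routes = []
    if evidence.cause_is_upstream_of_failure_site:
        routes.append(Route.ROOT_CAUSE_TRACE)
    if evidence.recurring_or_process_failure:
        routes.append(Route.FIVE_WHYS)
    # No specialist matched: the failure point and the cause coincide.
    return routes or [Route.JUST_FIX_IT]
```

Returning a list rather than a single value mirrors the rule that the two techniques are independent and can both run when a code bug also exposes a process gap.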
| Mode | File | Lines changed |
|---|---|---|
| -rw-r--r-- | debug/SKILL.md | 62 |
| -rw-r--r-- | five-whys/SKILL.md | 2 |
| -rw-r--r-- | root-cause-trace/SKILL.md | 2 |
3 files changed, 36 insertions, 30 deletions
diff --git a/debug/SKILL.md b/debug/SKILL.md
index 1a64e82..ae864db 100644
--- a/debug/SKILL.md
+++ b/debug/SKILL.md
@@ -1,11 +1,11 @@
 ---
 name: debug
-description: Investigate a bug or test failure methodically through four phases — understand the symptom (reproduce, read logs, locate failure point, trace data flow), isolate variables (minimal repro, bisect), form and test hypotheses, then fix at the root. Captures evidence before proposing fixes; rejects shotgun debugging; escalates to architectural investigation after three failed fix attempts. Use when the failure mode is unclear, the failure reproduces inconsistently, or you're about to start guessing. Do NOT use for clear local bugs where the fix site is obvious (just fix it), for ticket-driven implementation work with a known fix (use start-work), for backward-walking a specific error up the call stack (use root-cause-trace), or for process/organizational root-cause analysis of recurring incidents (use five-whys). Companion to start-work / root-cause-trace / five-whys — debug is the broad investigative workflow; the others specialize.
+description: Triage a bug or test failure — capture evidence, decide which investigation technique fits, then hand off and verify the fix. The skill itself covers Phase 1 (capture) and Phases 3-4 (verify and fix); cause-finding is delegated to root-cause-trace (for code-execution chains) or five-whys (for process / decision chains). Use when the failure mode is unclear and you don't yet know which technique applies. Do NOT use for clear local bugs where the fix site is obvious (just fix it), for ticket-driven implementation work with a known fix (use start-work), when you already know the bug is a stack-trace walk (skip straight to root-cause-trace), or when it's a process post-mortem (skip straight to five-whys).
 ---

-# /debug — Systematic Debugging
+# /debug — Triage and Route

-Investigate a bug or test failure methodically. No guessing, no shotgun fixes.
+A bug investigation has three jobs: gather evidence, decide what kind of bug it is, and verify the fix. This skill covers the first and third. The middle — actually finding the cause — is delegated to whichever specialist matches the bug's shape.

 ## Usage

@@ -15,48 +15,54 @@ Investigate a bug or test failure methodically. No guessing, no shotgun fixes.

 If no description is given, prompt the user to describe the symptom.

-## Instructions
+## Phase 1 — Capture Evidence

-Work through four phases in order. Do not skip to a fix.
+No fixes during this phase. Gather, don't guess.

-### Phase 1: Understand the Symptom
+1. **Reproduce the failure** — run the failing test or trigger the bug. Capture the exact error message, stack trace, or incorrect output. If you can't reproduce it, say so plainly; intermittent failures get treated differently from deterministic ones.
+2. **Check logs and observability** — application logs, error tracking, APM traces, dashboards around the time of failure. Logs often reveal context that code reading alone cannot.
+3. **Locate the failure point** — name the file and line where the error surfaces. Read the surrounding code so you understand what it was *trying* to do, not just where it broke.

-1. **Reproduce the failure** — run the failing test or trigger the bug. Capture the exact error message, stack trace, or incorrect output. If you can't reproduce it, say so — don't guess.
-2. **Check logs and observability** — review application logs, error tracking, and metrics around the time of failure. For deployed services, check structured logs, APM traces, and alerting dashboards. Logs often reveal context that code reading alone cannot.
-3. **Locate the failure point** — identify the exact file and line where the error occurs. Read the surrounding code. Understand what the code is trying to do, not just where it fails.
-4. **Trace the data flow** — follow the inputs from their origin to the failure point. Read callers, callees, models, serializers, and middleware in the path. Understand how the data got into the state that caused the failure.
+You now have: a reproducer (or a "can't reproduce" note), a captured failure signature, and a failure-site address.

-Do not propose any fix during this phase. You are gathering evidence.
+## Phase 2 — Route to the Right Technique

-### Phase 2: Identify the Root Cause
+Decide which kind of bug this is. Each branch hands off to the appropriate specialist.

-5. **Ask "why?" at least three times** — if a value is wrong in a view, why? Because the service returned bad data. Why? Because the model query missed a filter. Why? Because the migration didn't add the index. That's the root cause.
-6. **Check for related symptoms** — search for similar patterns elsewhere in the codebase. If the bug is in one endpoint, check sibling endpoints for the same mistake. Bugs often travel in packs.
-7. **Form a hypothesis** — state the root cause clearly: "The bug is caused by [X] in [file:line] because [reason]." If you have multiple hypotheses, rank them by likelihood.
+- **Code-execution chain** — there's a stack trace, the failure is downstream of a bad value, the fix site isn't the trigger site → use [`root-cause-trace`](../root-cause-trace/SKILL.md). It walks the call chain backward to the origin and adds defense-in-depth at each layer.
+- **Process / decision / organizational failure** — recurring incident, missed review, "we've fixed this three times," slow CI, manual step that didn't get run → use [`five-whys`](../five-whys/SKILL.md). It drives why-questioning until the cause is a missing mechanism, not a person.
+- **Local, obvious fix site** — the failure point and the cause are the same line, no tracing or post-morteming needed → don't use this skill. Just fix it.

-### Phase 3: Verify the Hypothesis
+If two branches genuinely apply (a code bug also revealed a process gap), run both — the techniques are independent and complement each other.

-8. **Write a failing test** that proves your hypothesis — the test should fail for the reason you identified, not just any reason. If the test passes, your hypothesis is wrong. Go back to Phase 2.
-9. **Confirm the test fails for the right reason** — read the failure output. Does it match your hypothesis? A test that fails for a different reason than expected is not evidence.
+## Phase 3 — Verify the Hypothesis

-### Phase 4: Fix and Verify
+Once the chosen specialist has produced a hypothesis:

-10. **Write the minimal fix** — change only what is necessary to address the root cause. Do not refactor, clean up, or improve adjacent code.
-11. **Run the failing test** — confirm it passes.
-12. **Add boundary and error case tests** — cover edge cases around the fix.
-13. **Run the full test suite** — confirm no regressions.
-14. **Commit** following conventional commit format.
+1. **Write a failing test** that proves the hypothesis — the test should fail for the exact reason identified, not just any reason.
+2. **Confirm it fails for the right reason** — read the failure output. A test that fails for a different reason than expected is not evidence.
+
+If the test passes when the hypothesis says it should fail, the hypothesis is wrong. Go back to Phase 2 and re-route, or rerun the specialist with new evidence.
+
+## Phase 4 — Fix and Verify
+
+1. **Write the minimal fix** — change only what addresses the root cause. Don't refactor or improve adjacent code.
+2. **Run the failing test** — confirm it passes.
+3. **Add boundary and error case tests** — cover edge cases around the fix.
+4. **Run the full test suite** — confirm no regressions.
+
+Commit handling is governed by `commits.md` and the publish flow — not by this skill.

 ## Escalation Rule

-If you've attempted 3 fixes and the bug persists, stop. The problem is likely architectural, not local. Report what you've learned and recommend a broader investigation rather than attempting fix #4.
+If you've attempted three fixes and the bug persists, stop. The problem is likely architectural, not local. Report what you've learned and recommend a broader investigation (often `arch-evaluate`) rather than attempting fix #4.

-When fanning out investigation across multiple independent files or subsystems, follow `subagents.md` — use parallel read-only agents for exploration, never for concurrent writes, and dispatch a fresh fix-agent on failure rather than retrying in the main context.
+When fanning out investigation across multiple independent files or subsystems, follow `subagents.md` — parallel read-only agents for exploration, never for concurrent writes, and dispatch a fresh fix-agent on failure rather than retrying in the main context.

 ## What Not to Do

-- Don't propose fixes before completing Phase 2
-- Don't change multiple things at once — isolate variables
+- Don't propose fixes before completing Phase 1
+- Don't run Phase 2's hand-off and then re-implement the specialist's technique inline — let the specialist do its job
 - Don't suppress errors or add try/catch to hide symptoms
 - Don't add logging as a fix (logging is a diagnostic, not a solution)
 - Don't delete or skip a failing test to "fix" the suite
diff --git a/five-whys/SKILL.md b/five-whys/SKILL.md
index 3549222..3cfacf4 100644
--- a/five-whys/SKILL.md
+++ b/five-whys/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: five-whys
-description: Drive iterative "why?" questioning from an observed problem to its actual root cause, then propose fixes that target the root rather than the symptom. Default depth is five, but the real stop condition is reaching a cause that, if eliminated, would prevent every observed symptom in the chain — that may take three whys or eight. Handles branching (multiple contributing causes, each explored separately), validates the chain by working backward from root to symptom, and rejects "human error" as a terminal answer (keep asking why the process allowed that error). Use for process, decision, and organizational failures — missed code reviews, recurring incidents, slow deploys, flaky releases, "we've fixed this three times already" problems. Do NOT use for debugging a stack trace (use root-cause-trace, which walks the call chain), for tactical defect investigation where the fix site is local and obvious, or for blame attribution (the skill refuses to terminate at a person). Companion to root-cause-trace — that's for code execution; this is for process/decision root-causes.
+description: Drive iterative "why?" questioning from an observed problem to its actual root cause, then propose fixes that target the root rather than the symptom. Default depth is five, but the real stop condition is reaching a cause that, if eliminated, would prevent every observed symptom in the chain — that may take three whys or eight. Handles branching (multiple contributing causes, each explored separately), validates the chain by working backward from root to symptom, and rejects "human error" as a terminal answer (keep asking why the process allowed that error). Use for process, decision, and organizational failures — missed code reviews, recurring incidents, slow deploys, flaky releases, "we've fixed this three times already" problems. Do NOT use for debugging a stack trace (use root-cause-trace, which walks the call chain), for tactical defect investigation where the fix site is local and obvious, or for blame attribution (the skill refuses to terminate at a person). Companion to root-cause-trace — that's for code execution; this is for process/decision root-causes. Often dispatched from `debug`'s Phase 2 routing when the failure is a process or organizational chain rather than a stack trace.
 ---

 # Five Whys
diff --git a/root-cause-trace/SKILL.md b/root-cause-trace/SKILL.md
index 210ee69..74e275d 100644
--- a/root-cause-trace/SKILL.md
+++ b/root-cause-trace/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: root-cause-trace
-description: Given an error that manifests deep in the call stack, trace backward through the call chain to find the original trigger, then fix at the source and add defense-in-depth at each intermediate layer. Covers the backward-trace workflow (observe symptom → identify immediate cause → walk up the chain → find origin → fix + layer defenses), when and how to add instrumentation (stack capture before the dangerous operation, not after), and the bisection pattern for identifying which test pollutes shared state. Use when an error appears in the middle or end of an execution path, when a stack trace shows a long chain, when invalid data has unknown origin, or when a failure reproduces inconsistently across runs. Do NOT use for clear local bugs where the fix site is obvious (just fix it), for design-level root-cause analysis of processes/decisions (use five-whys instead), for performance regressions (different class of investigation), or when there's no symptom yet to trace from. Companion to the general `debug` skill — `debug` is broader; `root-cause-trace` is specifically the backward-walk technique.
+description: Given an error that manifests deep in the call stack, trace backward through the call chain to find the original trigger, then fix at the source and add defense-in-depth at each intermediate layer. Covers the backward-trace workflow (observe symptom → identify immediate cause → walk up the chain → find origin → fix + layer defenses), when and how to add instrumentation (stack capture before the dangerous operation, not after), and the bisection pattern for identifying which test pollutes shared state. Use when an error appears in the middle or end of an execution path, when a stack trace shows a long chain, when invalid data has unknown origin, or when a failure reproduces inconsistently across runs. Do NOT use for clear local bugs where the fix site is obvious (just fix it), for design-level root-cause analysis of processes/decisions (use five-whys instead), for performance regressions (different class of investigation), or when there's no symptom yet to trace from. Often dispatched from `debug`'s Phase 2 routing when the bug is a code-execution chain; usable directly when the symptom is already clearly a stack-trace walk.
 ---

 # Root-Cause Trace
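As a reading aid for the Escalation Rule in the new `debug/SKILL.md`, the loop of Phase 4 attempts might look like the sketch below. All names here (`run_fix_attempts`, `MAX_FIX_ATTEMPTS`, the returned strings) are hypothetical, and the `"needs-new-hypothesis"` outcome for running out of candidate fixes early is an assumption of this sketch rather than something the skill text states.

```python
from itertools import islice
from typing import Callable, Iterable

MAX_FIX_ATTEMPTS = 3  # the skill says to stop after three failed fixes


def run_fix_attempts(
    fixes: Iterable[Callable[[], None]],
    failing_test_passes: Callable[[], bool],
) -> str:
    """Apply candidate fixes one at a time, re-running the failing test after each."""
    attempts = 0
    # islice guarantees a fourth fix is never applied, however many are supplied.
    for fix in islice(fixes, MAX_FIX_ATTEMPTS):
        fix()
        attempts += 1
        if failing_test_passes():
            return "fixed"
    if attempts == MAX_FIX_ATTEMPTS:
        # Three attempts failed: treat the problem as architectural and report.
        return "escalate"
    # Assumption of this sketch: fewer than three fixes were available.
    return "needs-new-hypothesis"
```

Applying one fix per iteration and checking the test each time keeps a single variable changing between runs, which is what makes a passing result attributable to a specific fix.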
