Diffstat (limited to 'debug')
-rw-r--r-- debug/SKILL.md 62
1 files changed, 34 insertions, 28 deletions
diff --git a/debug/SKILL.md b/debug/SKILL.md
index 1a64e82..ae864db 100644
--- a/debug/SKILL.md
+++ b/debug/SKILL.md
@@ -1,11 +1,11 @@
---
name: debug
-description: Investigate a bug or test failure methodically through four phases — understand the symptom (reproduce, read logs, locate failure point, trace data flow), isolate variables (minimal repro, bisect), form and test hypotheses, then fix at the root. Captures evidence before proposing fixes; rejects shotgun debugging; escalates to architectural investigation after three failed fix attempts. Use when the failure mode is unclear, the failure reproduces inconsistently, or you're about to start guessing. Do NOT use for clear local bugs where the fix site is obvious (just fix it), for ticket-driven implementation work with a known fix (use start-work), for backward-walking a specific error up the call stack (use root-cause-trace), or for process/organizational root-cause analysis of recurring incidents (use five-whys). Companion to start-work / root-cause-trace / five-whys — debug is the broad investigative workflow; the others specialize.
+description: Triage a bug or test failure — capture evidence, decide which investigation technique fits, then hand off and verify the fix. The skill itself covers Phase 1 (capture) and Phases 3-4 (verify and fix); cause-finding is delegated to root-cause-trace (for code-execution chains) or five-whys (for process / decision chains). Use when the failure mode is unclear and you don't yet know which technique applies. Do NOT use for clear local bugs where the fix site is obvious (just fix it), for ticket-driven implementation work with a known fix (use start-work), when you already know the bug is a stack-trace walk (skip straight to root-cause-trace), or when it's a process post-mortem (skip straight to five-whys).
---
-# /debug — Systematic Debugging
+# /debug — Triage and Route
-Investigate a bug or test failure methodically. No guessing, no shotgun fixes.
+A bug investigation has three jobs: gather evidence, decide what kind of bug it is, and verify the fix. This skill covers the first and third. The middle — actually finding the cause — is delegated to whichever specialist matches the bug's shape.
## Usage
@@ -15,48 +15,54 @@ Investigate a bug or test failure methodically. No guessing, no shotgun fixes.
If no description is given, prompt the user to describe the symptom.
-## Instructions
+## Phase 1 — Capture Evidence
-Work through four phases in order. Do not skip to a fix.
+No fixes during this phase. Gather, don't guess.
-### Phase 1: Understand the Symptom
+1. **Reproduce the failure** — run the failing test or trigger the bug. Capture the exact error message, stack trace, or incorrect output. If you can't reproduce it, say so plainly; intermittent failures get treated differently from deterministic ones.
+2. **Check logs and observability** — application logs, error tracking, APM traces, dashboards around the time of failure. Logs often reveal context that code reading alone cannot.
+3. **Locate the failure point** — name the file and line where the error surfaces. Read the surrounding code so you understand what it was *trying* to do, not just where it broke.
-1. **Reproduce the failure** — run the failing test or trigger the bug. Capture the exact error message, stack trace, or incorrect output. If you can't reproduce it, say so — don't guess.
-2. **Check logs and observability** — review application logs, error tracking, and metrics around the time of failure. For deployed services, check structured logs, APM traces, and alerting dashboards. Logs often reveal context that code reading alone cannot.
-3. **Locate the failure point** — identify the exact file and line where the error occurs. Read the surrounding code. Understand what the code is trying to do, not just where it fails.
-4. **Trace the data flow** — follow the inputs from their origin to the failure point. Read callers, callees, models, serializers, and middleware in the path. Understand how the data got into the state that caused the failure.
+You now have: a reproducer (or a "can't reproduce" note), a captured failure signature, and a failure-site address.
-Do not propose any fix during this phase. You are gathering evidence.
+## Phase 2 — Route to the Right Technique
-### Phase 2: Identify the Root Cause
+Decide which kind of bug this is. Each branch hands off to the appropriate specialist.
-5. **Ask "why?" at least three times** — if a value is wrong in a view, why? Because the service returned bad data. Why? Because the model query missed a filter. Why? Because the migration didn't add the index. That's the root cause.
-6. **Check for related symptoms** — search for similar patterns elsewhere in the codebase. If the bug is in one endpoint, check sibling endpoints for the same mistake. Bugs often travel in packs.
-7. **Form a hypothesis** — state the root cause clearly: "The bug is caused by [X] in [file:line] because [reason]." If you have multiple hypotheses, rank them by likelihood.
+- **Code-execution chain** — there's a stack trace, the failure is downstream of a bad value, the fix site isn't the trigger site → use [`root-cause-trace`](../root-cause-trace/SKILL.md). It walks the call chain backward to the origin and adds defense-in-depth at each layer.
+- **Process / decision / organizational failure** — recurring incident, missed review, "we've fixed this three times," slow CI, manual step that didn't get run → use [`five-whys`](../five-whys/SKILL.md). It drives why-questioning until the cause is a missing mechanism, not a person.
+- **Local, obvious fix site** — the failure point and the cause are the same line, no tracing or post-morteming needed → don't use this skill. Just fix it.
-### Phase 3: Verify the Hypothesis
+If two branches genuinely apply (a code bug also revealed a process gap), run both — the techniques are independent and complement each other.
-8. **Write a failing test** that proves your hypothesis — the test should fail for the reason you identified, not just any reason. If the test passes, your hypothesis is wrong. Go back to Phase 2.
-9. **Confirm the test fails for the right reason** — read the failure output. Does it match your hypothesis? A test that fails for a different reason than expected is not evidence.
+## Phase 3 — Verify the Hypothesis
-### Phase 4: Fix and Verify
+Once the chosen specialist has produced a hypothesis:
-10. **Write the minimal fix** — change only what is necessary to address the root cause. Do not refactor, clean up, or improve adjacent code.
-11. **Run the failing test** — confirm it passes.
-12. **Add boundary and error case tests** — cover edge cases around the fix.
-13. **Run the full test suite** — confirm no regressions.
-14. **Commit** following conventional commit format.
+1. **Write a failing test** that proves the hypothesis — the test should fail for the exact reason identified, not just any reason.
+2. **Confirm it fails for the right reason** — read the failure output. A test that fails for a different reason than expected is not evidence.
+
+If the test passes when the hypothesis says it should fail, the hypothesis is wrong. Go back to Phase 2 and re-route, or rerun the specialist with new evidence.
+
+## Phase 4 — Fix and Verify
+
+1. **Write the minimal fix** — change only what addresses the root cause. Don't refactor or improve adjacent code.
+2. **Run the failing test** — confirm it passes.
+3. **Add boundary and error case tests** — cover edge cases around the fix.
+4. **Run the full test suite** — confirm no regressions.
+
+Commit handling is governed by `commits.md` and the publish flow — not by this skill.
## Escalation Rule
-If you've attempted 3 fixes and the bug persists, stop. The problem is likely architectural, not local. Report what you've learned and recommend a broader investigation rather than attempting fix #4.
+If you've attempted three fixes and the bug persists, stop. The problem is likely architectural, not local. Report what you've learned and recommend a broader investigation (often `arch-evaluate`) rather than attempting fix #4.
-When fanning out investigation across multiple independent files or subsystems, follow `subagents.md` — use parallel read-only agents for exploration, never for concurrent writes, and dispatch a fresh fix-agent on failure rather than retrying in the main context.
+When fanning out investigation across multiple independent files or subsystems, follow `subagents.md` — parallel read-only agents for exploration, never for concurrent writes, and dispatch a fresh fix-agent on failure rather than retrying in the main context.
## What Not to Do
-- Don't propose fixes before completing Phase 2
-- Don't change multiple things at once — isolate variables
+- Don't propose fixes before completing Phase 1
+- Don't run Phase 2's hand-off and then re-implement the specialist's technique inline — let the specialist do its job
- Don't suppress errors or add try/catch to hide symptoms
- Don't add logging as a fix (logging is a diagnostic, not a solution)
- Don't delete or skip a failing test to "fix" the suite