claude-rules/verification.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112

# Verification Before Completion

Applies to: `**/*`

## The Rule

Do not claim work is done without fresh verification evidence. Run the command, read the output, confirm it matches the claim, then — and only then — declare success.

This applies to every completion claim:
- "Tests pass" → Run the test suite. Read the output. Confirm all green.
- "Linter is clean" → Run the linter. Read the output. Confirm no warnings.
- "Build succeeds" → Run the build. Read the output. Confirm no errors.
- "Bug is fixed" → Run the reproduction steps. Confirm the bug is gone.
- "No regressions" → Run the full test suite, not just the tests you added.

## Green Baseline Before Starting Work

Run the test suite before you start work, not only before you finish. A clean run at the start confirms the tree is in the known-good state you assume it is, so the baseline you build on and measure your changes against is actually green.

If the suite is red before you touch anything, fix or explicitly triage the failure first. A pre-existing failure left in place poisons every later "did I break this?" check: you can't separate your own regressions from the noise, and the end-of-work run stops being readable as pass/fail at a glance. Work that assumes a known-good base may also be built on a broken assumption you never saw.

When a pre-existing failure genuinely can't be fixed before the work begins (out of scope, or it needs a decision), record it as a tracked task with the diagnosis and carry its name forward. The green bar for the rest of the work is then explicitly "only this known failure remains," not a silent tolerance for red.

Two cases are not failures of this rule. A project with no suite has nothing to baseline — note that and proceed. A suite that can't run (no network, a missing dependency, a sandbox limit) is the "When You Cannot Verify" case below, not a work-blocker — record what you couldn't run and proceed with the risk named.

This is the start-of-work counterpart to the Before Committing gate below: one confirms the ground is solid before you build, the other confirms you didn't crack it.

## What Fresh Means

- Run the verification command **now**, in the current session
- Do not rely on a previous run from before your changes
- Do not assume your changes didn't break something unrelated
- Do not extrapolate from partial output — read the whole result

## Red Flags

If you find yourself using these words, you haven't verified:

- "should" ("tests should pass")
- "probably" ("this probably works")
- "I believe" ("I believe the build is clean")
- "based on the changes" ("based on the changes, nothing should break")

Replace beliefs with evidence. Run the command.

## A Passing Gate Can Skip Your New File

A green check only counts for the files the check actually ran on. When a lint, test, or format gate runs against an explicit, hand-maintained file list rather than a glob or auto-discovery, a newly-added file is silently skipped until someone adds its path by hand. The gate reports clean while never looking at the new file — a false pass that reads exactly like a real one.

The failure mode isn't tool-specific. It fires anywhere a quality gate enumerates files instead of discovering them: a Makefile lint target listing paths, a pre-commit `files:` regex that doesn't match a new extension, a CI matrix with hardcoded paths, a coverage config with an explicit include list.

So when you add a file, confirm the gates see it. Check whether each lint/test/format gate discovers files automatically or enumerates them, and if it enumerates, add the new path before trusting the green check. "Linter is clean" means nothing for a file the linter never ran on.

## When You Cannot Verify

Sometimes the verification command cannot run: the tool is absent, there is no network, a sandbox blocks it, or the environment is missing a dependency. A check that did not run must never be reported as a pass. "Unable to verify" is an honest, required outcome — not silence, and not an optimistic "should work."

When a check cannot run, report it in this order:

1. **Command attempted** — the exact command you tried to run.
2. **Why it could not run** — tool absent, no network, sandbox limit, missing environment, etc.
3. **Risk left unverified** — what might be broken that you would not know about, given the check did not run.
4. **Next command to close the gap** — the smallest command the user can run to verify themselves.

Do not let an unverifiable check vanish into a confident summary. State it plainly and hand the gap to the user.

## Handing Off Manual Verification

Some checks can only be run by the user: interactive UI a script can't drive, a live external service, visual rendering (colors, layout, faces), a real device, or anything where the verification *is* a human looking at the result. When the gap needs the user's hands or eyes — not just a command they could paste — don't bury the steps in prose. Write them as a structured, runnable checklist in the project's task file.

Create (or append to) a single parent task named **"Manual testing and validation"** in the project's todo file (`todo.org`, or the project's equivalent). Under it, write **one org sub-header per test**:

- **Title** — descriptive, naming the behavior under test (not "test 1").
- **"What we're verifying:"** — one line on the intent, so a failed test explains itself.
- **Steps** — **one action per item**, in order. Manual actions (a key to press, a phone step, a file to open, the `M-x` command under test) are list bullets, stated concretely. A step that *is* code the user will execute goes in an org src block, not a bullet — see below.
- **"Expected:"** — the observable result after the last step. One outcome, stated plainly, so a mismatch is unambiguous.

**Executable steps go in src blocks.** When a step is code the user runs — verification setup, bug-reproduction steps, walkthrough wiring — put it in an org src block so the user can execute it in place (`C-c C-c`) and read the result in the buffer, rather than copy-pasting a bullet:

- Use the correct language tag (`emacs-lisp`, `sh`, `python`, …).
- Let results land in the buffer: `emacs-lisp` default-value results are fine; shell blocks get `:results output`.
- Group statements into logical blocks (setup, action, restore) so each block is one `C-c C-c`.
- Prose context, manual actions (phone steps, key presses, the UI actions under test), and `Expected:` lines stay as list items between the blocks.

Format:

```
** TODO Manual testing and validation
*** <descriptive title>
What we're verifying: <one line>.
- <manual action — a key press, a phone step>
#+begin_src emacs-lisp
(setup-form)
(action-form)
#+end_src
- <manual action between blocks>
Expected: <the observable result after the last step>.
```

**Promote on failure.** If the actual behavior matches Expected, the test passed — the user marks or deletes it. If it differs, the user writes the actual behavior and any notes under that test, flips its header to a `TODO`, and promotes it to a top-level task. The structured shape is what makes that one-step: the failing case already carries the repro and the expected result, so it becomes a tracked bug without rewriting anything.

Use this whenever the verification gap from "When You Cannot Verify" above is a human-in-the-loop check rather than a command the user can run blind. Write every such test you want run, not a representative sample — the user runs the checklist once and reports back.

## Before Committing

Before any commit:
1. Run the full test suite as its own command, read the result, and commit only when failures are zero — never bundle the run with the commit (e.g. `make test; git commit`), where a red suite can't stop the commit. Run the whole suite, not just the touched file: a change can break a test elsewhere. If the suite can't run, that's "unable to verify" (see When You Cannot Verify above) — surface it, don't commit silently.
2. Run the linter — confirm no new warnings
3. Run the type checker — confirm no new errors
4. Review the diff — confirm only intended changes are staged

Do not commit based on the assumption that nothing broke. Verify.