aboutsummaryrefslogtreecommitdiff
path: root/.ai/workflows/spec-review.org
blob: d956f00a7e1af912387bbf02f9a453384ea05503 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
#+TITLE: Spec-Review Workflow
#+AUTHOR: Craig Jennings
#+DATE: 2026-05-23
#+STARTUP: showall

* Overview

The spec-review workflow evaluates a feature/specification document before implementation and decides one thing: can an engineer implement it confidently, test it thoroughly, and ship behavior that matches the user's mental model? If yes, say so and stop. If no, write a review file next to the spec naming every blocking gap and the concrete change that closes it.

This is the *reviewer* side of a pair. Its counterpart is the spec-response workflow, which the spec's author runs to fold a review back in. The contract between them is the review file: =<spec-basename>-review.org= (e.g. =docs/issue-query-spec.org= → =docs/issue-query-spec-review.org=). spec-review produces it; spec-response consumes it.

The goal is not to prove the spec is clever. It is to leave the implementer with *fewer* hidden decisions, not more prose.

* Problem We're Solving

Without a disciplined review before implementation:

- *Specs look finished but force implementers to invent product behavior.* Ownership, conflict, empty-state, and error semantics get discovered (and decided ad hoc) mid-implementation, where they're expensive to change.
- *API assumptions ship unverified.* A spec quietly depends on a field/mutation/enum shape that turns out wrong, and the work stalls at integration.
- *Reviews are vague or sprawling.* "Add tests," "consider edge cases," or a full rewrite of the spec — none of which an implementer can act on.
- *Deferred decisions live only in chat.* The next session re-litigates them.

*Impact:* viability and consistency problems that should surface in review instead surface during implementation, when they cost the most.

* Exit Criteria

A review is complete when:

1. *The implementation-readiness gate has been evaluated* and a rubric label assigned (=Ready= / =Ready with caveats= / =Not ready= / =Needs research=).
2. *If ready:* the user is told plainly ("This spec is implementation-ready. I have no further blocking review notes."), and the review stops — no churn for its own sake.
3. *If not ready:* a =<spec>-review.org= file is written next to the spec, in the standard structure, with every finding specific and actionable (current behavior named, risk explained, change recommended, blocking-or-not stated).
4. *The spec's review history is updated* with who reviewed it, when, which iteration it was, what changed or was recommended, and why.
5. *Deferred work is logged* to =todo.org= (v1 = =[#B]=, vNext/someday = =[#D]=), not left only in chat.
6. *Implementation tasks are enumerated* — the spec's =Implementation phases= section is lifted into a drop-in =todo.org= block (one entry per phase plus a test-surface entry), or, if the spec has no phase decomposition, that gap is raised as a finding.
7. *Generalizable review questions are harvested.* If Craig or the review uncovers a question/topic that applies beyond the current spec, rewrite it generically, add it to this workflow, and note the workflow change in the rulesets inbox/update note. Spec-specific observations stay in the spec review; reusable review practice moves here.

*Measurable validation:* after reading the review, an implementer knows exactly what to decide or change before starting, and an author running spec-response can give every finding an accept/modify/reject without guessing what it means.

* When to Use This Workflow

Trigger when:

- A spec is up for implementation and someone needs to know if it's ready.
- A spec changes materially (significant design conclusions changed → fresh review round).
- Craig says "review the spec" / "is this spec implementation-ready?" / "let's run the spec-review workflow."

Run it *early* — design review exists to catch viability problems and costly mistakes before implementation, not after.

* Precondition: spec filename

Before Phase 1, verify the file under review ends with =-spec.org=. Every design, decision, or planning document under a project's =docs/= directory carries that suffix as its identifier. The =.org= extension alone is not enough because =docs/= holds non-spec org files too (tutorials, frozen inventories, reference material).

If the file does not end with =-spec.org=, stop immediately and surface the mismatch:

#+begin_example
The file <path> does not end with -spec.org. Either the wrong file was named, or
the file should be renamed first. Spec workflows require the -spec.org suffix as a
guard against pointing the workflow at tutorial, inventory, or setup docs.
#+end_example

The user resolves the mismatch (rename the file, or point at a different one) and re-invokes the workflow. Do not proceed with the review against a misnamed file.

Anywhere the rest of this workflow refers to "the spec," "the spec file," or "the file under review," the path is the =-spec.org= file confirmed here.

* Approach: How We Work Together

** Phase 1: First-pass readiness gate

This is a fast triage on the spec's face — can it be rejected as obviously not ready before you invest in the full read? It is not the final word: several items below (architecture fit, API verification, integration points) can only be judged after Phase 2's code read, so Phase 3 re-runs this gate as the authoritative pass. Here, decide what you already can, and flag the rest to confirm after reading.

Mark the spec implementation-ready only if *all* of these hold:

- The user problem and target behavior are clear.
- Scope is explicit: v1, out-of-scope, and vNext are separated.
- Project principles are explicit where the feature touches user data, remote systems, destructive actions, long-running operations, or public extension points.
- All prior review questions/recommendations are answered or intentionally deferred.
- Every decision in the spec's =* Decisions= section is resolved — no =TODO= remains (the =[/]= cookie reads complete; =SUPERSEDED= and =CANCELLED= count as resolved), or each still-open decision carries a consciously-accepted, recorded risk. A spec still on the retired =State: proposed | accepted | superseded= field model fails this item until converted to decision tasks — convert on the next material touch.
- The expected UX is concrete enough to implement without guessing.
- Existing platform/framework affordances are reused, wrapped, or intentionally replaced; lost opportunities are called out.
- The data model, state ownership, persistence, and sync behavior are defined.
- Error strings, conflict, empty-state, and partial-failure behavior are defined with actionable user-facing messages.
- Security and privacy behavior is defined wherever the feature touches credentials, personal or shared data, or external services — or there is nothing sensitive to define and that's stated.
- Observability and diagnostic tooling are defined: how the user (or operator) sees what the feature is doing, how failures surface, how logs/debug commands isolate issues, and how sensitive data is redacted — or the feature is simple enough not to need it, and that's stated.
- Performance expectations and likely bottlenecks are considered, including long-running operations and how they are tested without making the normal suite slow/flaky.
- For transfer, sync, backup, import/export, queue, or other long-running data-movement specs: common and edge failure modes have been researched against current sources and comparable applications, and each material risk is classified as covered, avoided by scope, deferred with safeguards, or missing.
- The architecture fits the existing codebase and names its integration points.
- Architecture weak points are named with mitigation steps, especially around concurrency, lifecycle states, extension points, and failure isolation.
- The public customization/configuration surface is named with defaults, safety notes, and clear user-facing variable names.
- The documentation plan is defined: README, user guide, developer guide, migration/upgrade notes, or an explicit reason they are unnecessary.
- The project tooling is part of scope where appropriate: Makefile/package scripts expose setup, test, compile, lint, coverage, cleanup, and slow/manual verification.
- The test strategy covers unit, integration/fixture, regression, user-flow, slow/long-running behavior, coverage reporting, and optional live/external tests as appropriate.
- Rollout, migration, rollback/reversibility, and compatibility are defined when the feature changes public behavior, persisted data, external resources, package APIs, user configuration, or shared remote state.
- The plan can be phased without shipping broken intermediate states, and phases are small enough to reach a clean stopping point in one focused work session.
- External API assumptions are verified or explicitly listed as prerequisites.

If all true → tell the user it's ready and stop unless they ask for more. If any false → continue and write the review file. A "ready" at this phase is provisional; confirm it at Phase 3 after the code read.

** Phase 2: Required reading order

Never review a spec in isolation.

1. *Read the existing implementation first.* The code paths the spec would touch: public commands and entry points, internal helpers/boundaries, current data representation, persistence/write-back, async/sync, caching, error handling, existing tests, naming/style. Capture current-state facts with function names and file paths. Don't recommend designs that ignore how the package works today.
2. *Read related specs and task tracking.* Companion specs, relevant =todo.org= tasks, README/testing docs, prior review files. Record which tasks the spec absorbs, which stay separate, which decisions are already made, which are still open.
3. *Read the target spec end to end — twice.* First for its problem/behavior/phases/assumptions; second looking only for gaps. The second read asks: "What would an implementer still have to invent?"

** Phase 3: Re-run the gate (authoritative)

After reading code and spec, re-run the Phase 1 gate — this is the pass that counts, because now you can actually judge the items that needed the code: architecture fit, API verification, integration points. If now ready, don't manufacture churn. If not, write the review file.

** Phase 4: Evaluate across dimensions

Work the spec against these. Each is a source of concrete findings, not a box to tick.

- *User problem & mental model.* Real problem or just machinery? Does UX match how users think? Predictable names/commands/prompts/layouts? Is editable-vs-generated content obvious? Are destructive/workspace-mutating actions explicit and confirmed? For org features, be *strict about ownership* — users treat visible org text as editable unless told otherwise.
- *Project principles and non-negotiables.* Does the spec state do/don't rules that protect user trust? For filesystem/remote/data-mutating features, does it forbid silent data loss, require residue/temp-file cleanup verification, make destructive and fallback paths visible, define cancellation/hung-work behavior, and prevent secret leakage? For extension points, does it define contracts plugin authors must obey?
- *Reuse and lost opportunities.* What does the underlying platform/framework/library already provide? Does the spec preserve useful existing affordances, naming, configuration, keymaps, commands, hooks, logs, caches, or test utilities? If it replaces them, is that intentional and justified? Are there obvious user-value opportunities being left on the table?
- *Scope & boundaries.* What's v1 / out-of-scope / vNext? Are deferrals captured (=todo.org=)? Does the phased plan avoid half-working states? If a decision is deferred, is v1 still coherent without it? Can each phase finish cleanly in one focused work session, or is it too large to stop safely?
- *Requirements clarity.* Each requirement unambiguous, testable, traceable to a source (user ask / bug / API capability / decision)? Are "good/fast/intuitive/smart/seamless" replaced with concrete behavior? Terms used consistently? Infeasible/unverified requirements flagged?
- *Data model & ownership.* Durable IDs vs display-only names? What's local/remote/generated/cached/user-authored? Who owns each editable region? What's replaced on refresh, survives view-switch / restart? What if one remote object could appear in multiple local places? Prefer stable IDs for execution, names for display; don't overload generic org properties (=:ID:=) for app-specific remote IDs unless intentional.
- *Sync, refresh, conflict.* What triggers sync (command / save hook / state hook / timer) and refresh? One-way or bidirectional? No-op detection? Remote-conflict detection? v1 conflict behavior? Dirty-buffer behavior? API success but GraphQL-level failure? Network failure mid-pagination/mutation? For v1, simple is fine if explicit: detect, refuse, tell the user what to do next.
- *Transfer/sync failure-mode research.* For features that move or reconcile user data, do online/current research on how the domain fails in general and in comparable products. Look for partial writes, temp/residue files, source mutation during transfer, locked/read-only files, permission/ACL/xattr loss, quota/space failure, case/restricted-name collisions, invalid encodings, duplicate remote objects, symlink/hardlink/special-file behavior, clock skew/mtime unreliability, watcher delays, missing roots/unmounted volumes, ignored/selective trees, mass deletes, ransomware-like churn, reconnect/offline gaps, conflict propagation, stale status, and "stuck but no file list" complaints. Classify every material item as =covered=, =avoided by v1 scope=, =deferred with guardrails=, or =missing=; missing items become findings.
- *API & external dependencies.* Which fields/mutations/filters/enums are assumed, and are they verified against current schema/docs/live responses? Permission differences, rate limits, pagination, nulls, partial errors handled? Any optional external tool — what's the fallback? Don't let a spec silently depend on an unverified API shape.
- *CLI/tool-wrapper value.* If the feature wraps command-line tools or external executables, why would a user prefer the package over running the tool directly? Specs must cover common tool-specific failures, exit codes, stderr/log signatures, destination/backend gotchas, preflight/doctor checks, redaction, retry/cancel behavior, and a failure explainer. Helpful-error success criteria: identify the failure class, explain likely cause, prove it with redacted evidence, state safety outcome, offer repair actions, and provide a redacted reproduction command/log bundle.
- *Architecture & maintainability.* Fits current module boundaries, or should new ones precede growth? Which parts are pure vs side-effecting? Are transport/parsing/normalization/rendering/command-orchestration separated enough to test? Will it multiply special cases? Backward-compat preserved where needed? Favor small pure cores surrounded by thin IO/command layers. *Generated config across namespaces:* when a generated config (a package-manager config, a mount unit, a container or VM image, an installer profile) is consumed in more than one namespace, does every path or URL inside it resolve in the namespace that *consumes* it, not the one that *wrote* it? Build host, chroot, image runtime, installed target system, container, and VM are distinct namespaces — a path valid in the one that generates the config (a build-time staging dir) can be absent or different in the one that reads it (the live or target root).
- *Simplicity and complexity control.* Is the UX dead simple on the hot path? Are advanced controls out of the way? Are command functions thin? Are branchy functions split or justified? Would cyclomatic complexity, coverage, or code-size reports reveal refactoring targets? Are there explicit refactoring triggers before complexity hardens?
- *Extension/developer experience.* If plugins/backends/extensions are part of the feature, is the public ABI precise? Are capability flags defined? Are lifecycle callbacks, failure states, redaction rules, and contract tests documented? Can another engineer add an extension without reading private internals? Is the extension contract tiered so authors can build a minimum runnable plugin quickly, then graduate to publishable/excellent diagnostics? Does the spec clearly say what information is required, what is optional, and what can be unknown/conservative at first? Will authors know what evidence/patterns/capability metadata to add, or are they expected to infer it from core internals? Are scaffolds, templates, fixture capture tools, pattern-table helpers, contract-test tiers, and "where the line is" docs provided? Could the contract be so complex that reasonable plugin authors opt out, and if so, what helper API removes that burden? Treat plugin authors as end-users of the developer experience and require their plugins to give downstream users the same diagnostic quality the core promises.
- *Robustness & error handling.* Expected failure modes? Errors distinguishable from empty-but-successful results? Actionable messages? Do error strings name the operation, object, backend/API/tool involved, and next step? Retries? Stale/missing caches? Can malformed local files or missing properties crash commands? nil/null/missing-field behavior defined? Data-loss-by-default avoided? Anything that writes files or remote data needs a clear data-loss story.
- *Performance & scale.* Expected counts (issues/comments/labels/teams/projects/views)? Server-side filtering where possible? Bounded, visible pagination? Cached name→ID lookups? Sync calls in the command path acceptable? Could a save hook or whole-file scan make N network calls? Rendering linear? Full-file rewrites avoided? Long-running operations async/cancellable/observable? Is concurrency/queueing/backpressure defined? Are high-output process filters throttled and cheap? Is progress/ETA exposed only when defensible, and are hung/stalled operations detectable and killable? Identify UI freezes, repeated network calls, unbounded pagination — without premature optimization.
- *Security & privacy.* API keys safe? Debug logs leaking secrets or private issue text? Confirmations before mutating shared workspace objects? Personal vs shared distinguished? Local files holding sensitive descriptions/comments? Anything to redact from messages/logs? Any work-tracker integration may handle private company data.
- *UX & accessibility.* Discoverable commands? Recoverable mistakes? Prompts ordered to the task? Safe, useful defaults? Informative-not-noisy status messages? Does the UI avoid implying unsupported actions are supported? Match the upstream product's permissions/concepts? Are customizations named in user language, with clear defaults and docstrings? For Emacs packages, command names, completion candidates, buffer layout, defcustom names, and message wording *are* the UX.
- *Test strategy and coverage.* Characterization tests before behavior changes? Pure functions to unit-test? API responses needing fixtures? Command flows needing stubs? Regression tests for prior bugs? Boundary/error cases? What's covered elsewhere and shouldn't be re-tested? Which existing tests must change? How is coverage generated, summarized, and used to find untested/refactor-worthy code? Prefer tests that lock contracts: representation shape, query compilation, sync no-op, conflict refusal, pagination, dirty-buffer protection, log redaction, and long-running/slow-operation behavior via fakes rather than flaky live dependencies.
- *Observability & operations.* How does a user see what the package is doing? Progress messages for long ops? Useful, safe debug logging? Are logs structured enough to isolate issues from a bug report? Are commands provided to inspect/clear caches, test connectivity, diagnose backends/tools, copy redacted debug info, or reproduce command invocations? How are terminal states discovered: completion, failure, partial success, stalled/hung, cancelled, cleanup-unverified, and "needs user action"? Does the product notify only when useful, avoid noisy success spam, and keep non-success states visible until acknowledged? For generated org files, headers should often carry source, filter/view name, refresh time, count, truncation state.
- *Comparable-product sentiment.* When there are obvious adjacent products, research what users love and hate about them from official docs plus current community reports. Do not cargo-cult their feature set; translate findings into the spec's scope. For each loved behavior, say whether the spec provides it, intentionally omits it, or defers it. For each hated behavior, say whether the spec avoids, resolves, inherits, or accepts it.
- *Documentation & migration.* README/config changes? New commands and defcustoms documented? User guide needed? Developer guide/API guide needed? Makefile or package scripts documented as part of the product? Old behaviors removed/renamed/migrated? Breaking changes acceptable? Changelog note needed? Don't ship with stale examples.
- *Rollout, compatibility, and rollback.* How does the feature ship safely? Are old config/data formats migrated or read compatibly? Can users opt out or revert? If remote/shared data is mutated, is there a dry-run, backup/trash/versioning, or recovery path? If a public API/plugin contract changes, is the compatibility story explicit? For local packages, "rollback" may mean disabling the module, removing generated files, or restoring prior config; spell that out where it matters.
- *Development tooling.* Does the repo give contributors obvious commands for setup, fast tests, specific tests, compile, lint, coverage, cleanup, slow/manual tests, and release checks? Are optional/live tests gated by explicit environment variables? Is the Makefile/script surface consistent with sibling projects?
- *Small enhancement radar.* Are there low-complexity, high-value affordances already provided by the platform that should be surfaced now or explicitly deferred? Examples: archive/compress commands in file managers, built-in history, previews, diagnostics, or doctor commands. Keep the hot path simple; capture the opportunity rather than accidentally losing it.

** Phase 5: Write the review file

Use this structure for =<spec-basename>-review.org= unless the spec calls for something different:

#+begin_src org
,#+TITLE: Review: <Spec Title>
,#+AUTHOR: <reviewer>
,#+DATE: <date>
,#+STARTUP: showall

,* Scope reviewed
What code, tests, docs, and specs you read.

,* Implementation-readiness
Whether the spec is ready. If not, summarize the blockers.

,* Overall assessment
The short senior-engineering read: what's right, what's risky, what must be clarified.

,* High-priority findings
Concrete headings. Each: why it matters and what to change.

,* Medium-priority findings
Important improvements that shouldn't block all progress.

,* UX observations
,* Architecture observations
,* Robustness and performance observations
,* Test strategy recommendations
Specific test cases, not generic "add tests".
,* Documentation and tooling recommendations
README/user/developer docs, Makefile/package scripts, coverage, debug tools, and customization surface.

,* Suggested spec edits
Concrete edits to make the spec implementation-ready.

,* Agreed decisions
Answers reached during review. Omit if none.

,* Open questions
Only questions that truly block or materially affect implementation.

,* vNext candidates
Deferred features to capture in task tracking.
#+end_src

** Phase 6: Assign the rubric and update tracking

Assign one label consistently:

- =Ready= — no blocking open questions; implementation can start. Requires no decision in the spec's =* Decisions= section to still be =TODO= (the =[/]= cookie reads complete; =SUPERSEDED= and =CANCELLED= count as resolved) — a decision still =TODO= holds the rubric at =Not ready=, or =Ready with caveats= if the author consciously accepts and records the risk.
- =Ready with caveats= — can start if the caveats are accepted and tracked.
- =Not ready= — blocking ambiguity / missing decisions would force implementers to invent product behavior.
- =Needs research= — external API/library/platform assumptions must be verified first.

The most useful reviews move a spec from =Not ready= to =Ready with caveats= or =Ready= once decisions are captured.

Finding severity maps to blocking power: *high-priority findings block =Ready=* — they hold the rubric at =Not ready= (or =Ready with caveats= if the author accepts and tracks them) until dispositioned; *medium-priority findings are the author's discretion* and don't block. State the blocking status on each finding so the author running spec-response knows which ones gate the rubric.

Then update the spec's review history. Specs should carry a bottom section named =Review and iteration history= (or the nearest existing equivalent) that tracks each material author/reviewer pass. Add a concise entry for this review even when the spec is ready and no review file is written.

Each entry is an org subheading with a compound id followed by three body fields.

Heading format: =YYYY-MM-DD Day @ HH:MM:SS -ZZZZ — Contributor — Role=

The timestamp + contributor + role concatenate as an opaque id — nothing more should be read into it. Timestamp matches the project's todo-format event-log convention and gives natural temporal sort without implying any decision ordering. Contributor uses short parenthetical project context where session matters (e.g. =Claude Code (linear-emacs)=, =Claude Code (rulesets)=, =Codex=, =Craig Jennings=). Role can be compound (e.g. =reviewer + responder=) when one pass fused multiple roles. Author, reviewer, responder, verifier, and researcher are the standard role names.

Body fields:

- *What changed or was recommended:* high-signal summary, not a duplicate of the whole review.
- *Why:* the decision pressure or rationale that caused the contribution.
- *Artifacts:* links to the review file, response/disposition section, commits, task IDs, or source checks when useful.

If the spec has no such section, add it at the bottom. Keep the history short and cumulative; it is provenance for future readers, not a session transcript.

*Emit implementation tasks (drop-in for =todo.org=).* Read the spec's =Implementation phases= section and turn it into a paste-ready block in the review file, under a heading =Implementation tasks (drop-in for todo.org)=. One =** TODO= entry per phase, plus a final entry for the test surface. The point: the handoff to whoever implements is one paste, not a re-read of the spec, and a spec that can't be decomposed into phases fails this step, surfacing a shape problem before =Ready=.

Per-phase entry, following =todo-format.md= (terse heading names the phase; body holds the one-line deliverable plus a pointer back to the spec; tags on the heading):

#+begin_example
** TODO [#B] <phase name — smallest noun phrase> :feature:
<what this phase delivers, one line>. Spec: [[file:<spec path>]] (Implementation phases, phase N).
#+end_example

Final test-surface entry, mirroring the spec's =Acceptance criteria= when present:

#+begin_example
** TODO [#B] <feature> — test surface :test:
Unit: <...>. Integration: <...>. E2e / manual-verify: <acceptance criteria as checkable items>. Spec: [[file:<spec path>]] (Acceptance criteria).
#+end_example

Priority and tags follow the deferred-work rule below. Emit the block in the review file; the author pastes it into =todo.org= during spec-response, or you log it directly when you're also closing the loop. If the spec has no =Implementation phases= section, don't invent one — that absence is the finding, and the step becomes the prompt to ask the author to add a phase decomposition before the spec can be =Ready=.

Then log deferred work to =todo.org=: v1 implementation = =[#B]= (unless urgent or speculative); vNext/someday = =[#D]=. Tag =:feature:= / =:bug:= / =:refactor:= / =:test:= / =:quick:= / =:solo:= only when accurate. Don't leave important deferred decisions only in chat.

* Principles to Follow

** Read the code before critiquing the spec
A recommendation that ignores how the package works today is noise. Cite function names and file paths; ground findings in current behavior.

** Specific over vague
Name the current behavior or spec text, explain the risk, recommend the change, say whether it blocks. Avoid vague preferences, whole-spec rewrites, architecture astronautics, unbounded "consider" lists, and questions whose answer doesn't change implementation.

Model finding:
#+begin_quote
The spec says fetched comments render as subheadings but doesn't define whether editing an existing comment syncs back. That's risky because Linear only allows users to edit their own comments. V1 should treat fetched comments as remote-owned display content and support only adding new comments; editing own comments can be vNext.
#+end_quote

** Decide readiness first, critique second
The gate is the product of the review. A long critique on a spec that's actually ready just creates churn; a "looks fine" on one that isn't ships hidden decisions.

** Review as a risk reducer, not a rewrite contest
Good spec review improves the chance of a correct implementation. It is not a venue for proving cleverness, redesigning the whole product, or teaching broad lessons at the author's expense. Prefer the smallest spec change that removes the implementation risk. If a finding implies a major redesign, name the decision pressure and the options; don't silently rewrite the architecture in review prose.

** Separate blockers from preferences
Every material comment should be tagged by force: blocking, should-fix, or optional. Avoid burying a true blocker in a long list of preferences, and avoid making stylistic or taste comments look like correctness concerns. Bikeshedding belongs in linters, formatters, local conventions, or explicit optional notes.

** Make feedback author-usable
Review comments should be specific, neutral, and actionable: quote or name the spec behavior, explain the risk, recommend the smallest concrete change, and say how the author can verify the fix. Avoid personal language, rhetorical questions, vague "this needs work" comments, and comments that require the author to infer the desired edit.

** Preserve iteration provenance
Future reviewers and implementers need to know not just the current decision, but how the spec got there: how many review/response loops happened, who contributed, what they changed or recommended, and why. Keep that record in the spec itself under =Review and iteration history= so the trail survives deleted review files, chat loss, and agent handoffs.

** Be strict about ownership
Especially for org-mode features: a user treats visible text as editable unless the representation says otherwise. Make generated-vs-editable explicit.

** Never depend on an unverified API shape
If the spec assumes fields/mutations/enums, they're verified against current schema/docs/live responses, or listed as a research prerequisite. =Needs research= is a real, useful verdict.

** Favor small pure cores and thin IO layers
Push findings toward separable, unit-testable pure functions surrounded by thin command/transport layers.

** Block only on answer-changing questions
Open questions in the review are the ones whose answer materially changes implementation. Everything else is a suggestion, not a blocker.

* Living Document

Update this workflow as review practice sharpens — new dimensions worth checking, better model findings, refinements to the readiness gate and rubric.

When a review conversation surfaces a question that would improve future reviews, decide whether it is generalizable:

- *Generalizable:* applies to a class of future specs (for example, "How will users discover long-running failures?", "What do comparable products make users love/hate?", "Which transfer/sync edge cases are covered, avoided, or deferred?"). Rewrite it in product-neutral language, add it to this workflow, and append a short note to the rulesets inbox/update note explaining what changed and why.
- *Spec-specific:* depends on this product's exact domain, names, or implementation choices. Keep it in the review/spec history and do not bloat the workflow.

This workflow is allowed to grow, but only by turning repeated or broadly useful review instincts into clear checks that produce implementation-changing findings.

* Research notes behind this checklist

Combines the review process used on the linear-emacs issue query/representation specs with broader design/spec-review practice:

- Design reviews should happen early enough to find viability problems and costly mistakes before implementation (Google design-review research).
- Product/spec docs should create shared understanding — goals, assumptions, and an explicit "what we're not doing" (Atlassian PRD guidance).
- Review sessions should capture open questions and gaps so they're tracked and resolved (Atlassian design-review template).
- Requirements should be checked for correctness, ambiguity, completeness, consistency, verifiability, traceability, feasibility (SRS checklist categories).
- Design-doc review should be iterative but restart/refocus when significant conclusions change (Gerrit design-doc guidance).

Sources:

- [[https://research.google/pubs/improving-design-reviews-at-google/][Google Research: Improving Design Reviews at Google]]
- [[https://www.atlassian.com/agile/product-management/requirements][Atlassian: Product Requirements Documents]]
- [[https://www.atlassian.com/software/confluence/templates/design-review][Atlassian: Design Review Template]]
- [[https://www.geeksforgeeks.org/software-engineering/software-requirement-specification-srs-document-checklist/][GeeksforGeeks: SRS Document Checklist]]
- [[https://gerrit-review.googlesource.com/Documentation/dev-design-docs.html][Gerrit: Design Docs Review Process]]
- [[https://google.github.io/eng-practices/review/reviewer/][Google Engineering Practices: Code Review Developer Guide]]
- [[https://google.github.io/styleguide/docguide/best_practices.html][Google Documentation Guide: Best Practices]]
- [[https://spec-coding.dev/guides/spec-review-checklist][Spec Coding: Spec Review Checklist]]

* Review and iteration history

** 2026-05-23 Sat @ 00:00:00 -0500 — Claude Code (linear-emacs) — author
- *What:* Initial draft of the workflow file. Validated by a real 2026-05-23 run against a linear-emacs ticket before being handed off to rulesets.
- *Why:* Craig wanted to use each contributor for their value — his time goes into user scenarios and direction-setting, agents' time goes into structured technical review (gate-checking, finding-tracking, disposition rigor) where agents are more consistent than humans. spec-review formalized the reviewer-side discipline so Craig doesn't spend his time on gate-checks an agent can do better.
- *Artifacts:* Delivered to =rulesets/inbox/= on 2026-05-23 as part of a 4-file handoff (2 workflow drafts + 2 handoff notes). Inbox files since deleted after install. Heading timestamp is approximate per the opaque-id convention; no exact time was recorded.

** 2026-05-23 Sat @ 12:59:31 -0500 — Claude Code (rulesets) — reviewer + responder
- *What:* Three tightening edits to the review workflow itself before install: Phase 1 reframed as fast triage with Phase 3 the authoritative gate after the code read; readiness gate expanded from 11 to 13 items adding security/privacy and observability (each with an escape hatch for trivial features); severity-to-blocking-power mapping added in Phase 6 (high blocks Ready, medium is author discretion). Paired response-workflow edits landed in the same commit and are recorded in spec-response.org's history.
- *Why:* Three parallel research subagents (Definition-of-Ready / INVEST, requirements-quality IEEE 830 / ISO 29148, disposition-tracking peer-review practice) confirmed gaps. Close-read identified the Phase 1 vs Phase 3 ordering tension (gate items needed code-read context) and a security/observability blind spot.
- *Artifacts:* Commit =7f2aea1=, +706 lines across 6 files including INDEX entries and the paired spec-response.org edits.

** 2026-05-28 Thu @ 03:22:25 -0500 — Craig Jennings — author
- *What:* Added the Review-and-iteration-history requirement itself. New goal item ("the spec's review history is updated"); new Phase 6 sub-section with the entry shape (heading-as-compound-id + What/Why/Artifacts body); new "Preserve iteration provenance" principle.
- *Why:* Craig had no way to scan a spec under review and tell which iterations had already been processed, which were new, and which were pending — file mtime / created-time was the only signal, silly and error-prone. The iteration-history section makes processing state visible inside the spec content itself.
- *Artifacts:* Commit =55adf6e=.

** 2026-05-28 Thu @ 03:23:01 -0500 — Claude Code (rulesets) — author
- *What:* Backfilled this section with entries for the workflow file's prior iterations (Draft 1, Review-and-Fold, Requirement addition). Used commit author dates where exact; Draft 1's timestamp is an opaque-id placeholder since no exact time was recorded.
- *Why:* The iteration-history requirement landed in =55adf6e= without retroactive entries for the workflow file's own prior history. This pass closes that gap so the section starts with full provenance rather than empty.
- *Artifacts:* Working draft sources at =working/spec-workflows-iteration-history-backfill/draft-entries.org= (deleted on this commit). Spec-response cycle (two Codex reviews and three rulesets responses on the draft) validated the entries and the entry-shape before splicing.

** 2026-05-31 Sun @ 10:39:49 -0500 — Claude Code (rulesets) — author
- *What:* Added a Phase 6 step, "Emit implementation tasks (drop-in for todo.org)": lift the spec's =Implementation phases= section into a paste-ready =todo.org= block (one =[#B]= TODO per phase plus a test-surface entry mirroring =Acceptance criteria=), with the missing-phase-decomposition case raised as a finding rather than invented. Added Exit Criterion 6 to match.
- *Why:* From a pearl handoff. Phase 6 only logged deferred/v1 work in passing; this makes the implementer handoff one paste instead of a re-read, and forces specs to be decomposable into phases (a spec that resists it surfaces a shape problem before =Ready=).
- *Artifacts:* =todo.org= task ([#C] =:feature:solo:=, closed this commit). Phase 6 + Exit Criteria edits in this file.

** 2026-06-06 Sat @ 04:52:13 -0500 — Codex (emacs-d) — reviewer + author
- *What:* Added reusable checks from DUET's third review pass: transfer/sync failure-mode research, covered/avoided/deferred classification, comparable-product love/hate sentiment, terminal-state discovery/notification, and explicit harvesting of generalizable user questions into the workflow with a rulesets inbox note.
- *Why:* Craig asked for deeper online research into how transfers and sync go wrong, whether DUET covers or avoids those failures, whether users would discover failures, and how future spec reviews should absorb new reusable questions rather than leaving them in chat.
- *Artifacts:* DUET spec/review updates in =docs/design/duet-spec.org= and =docs/design/duet-spec-review.org=; rulesets inbox note =2026-06-06-spec-review-workflow-update.md= updated.

** 2026-06-06 Sat @ 05:18:30 -0500 — Codex (emacs-d) — researcher + author
- *What:* Added reviewer anti-pattern guidance and a rollout/rollback/reversibility dimension after a meta-review of design/spec-review advice. New principles: review as risk reduction rather than rewrite contest, separate blockers from preferences, and make feedback author-usable. Added source links for Google code-review/documentation guidance and a modern spec-review checklist.
- *Why:* Craig asked for online research into what good spec reviewers should always do and avoid, including reputable sources and developer-community advice. The workflow already covered most readiness mechanics; the missing reusable content was reviewer discipline and compatibility/rollback framing.
- *Artifacts:* This workflow; DUET final review pass; rulesets inbox note =2026-06-06-spec-review-workflow-update.md= updated.

** 2026-06-06 Sat @ 07:25:11 -0500 — Claude Code (rulesets) — responder
- *What:* Promoted the two Codex (emacs-d) passes above into the canonical workflow. The reusable checks had been written into =.emacs.d='s synced copy, which the startup rsync overwrites from canonical, so they would not have survived or propagated. Folded them into =claude-templates/.ai/workflows/spec-review.org= and synced the =.ai/= mirror, so every project picks them up on next startup. Verified the promotion is a pure superset — every change is an added or expanded generic check, no regressions.
- *Why:* The handoff =inbox/2026-06-06-spec-review-workflow-update.md= asked for it: reusable review practice belongs in the canonical workflow, spec-specific findings stay in the spec. Exit Criterion 7 and the Living-Document harvesting guidance, added in the same passes, name this as the intended path.
- *Artifacts:* Handoff preserved at =docs/design/2026-06-06-spec-review-reusable-checks-handoff.md=. Paired spec-response.org Phase 6 (author-side implementation-task breakdown) committed in the same session.

** 2026-06-12 Fri @ 19:09:00 -0500 — Claude Code (rulesets) — responder
- *What:* Reconciled this workflow to spec-create's new Decisions convention (each decision is an org =TODO= task that flips to =DONE= on agreement, with a =[/]= cookie on the =* Decisions= heading). The Phase 1/3 readiness gate gains an all-decisions-=DONE= item, and the Phase 6 =Ready= rubric label now requires the =[/]= cookie to read complete (a still-=TODO= decision holds the rubric at =Not ready=, or =Ready with caveats= on a recorded, consciously-accepted risk).
- *Why:* The convention change landed in spec-create.org via an .emacs.d handoff (originated in its keymap-consolidation spec); the rubric still gated on the retired =State:= field model.
- *Artifacts:* Handoff =inbox/2026-06-12-1906-from-.emacs.d-spec-create-decisions-todo-note.org=. Paired spec-create.org and spec-response.org edits in the same commit.

** 2026-06-12 Fri @ 19:30:03 -0500 — Claude Code (rulesets) — author
- *What:* Two refinements to the same-day decisions convention after Craig's review: the gate item and =Ready= rubric now read "no decision is still =TODO=" with =SUPERSEDED= and =CANCELLED= counting as resolved (spec-create's template defines them as done-class keywords via a =#+TODO:= header), and a spec still on the retired =State:= field model explicitly fails the gate item until converted — closing the vacuous-pass hole on old specs.
- *Why:* Review of the freshly-landed convention flagged that TODO/DONE alone lost the old model's superseded state and that the gate as written would silently pass a spec with no decision tasks at all. Craig chose the two done-class keywords and the auto-added =#+TODO:= header (the in-file header is what makes custom keywords portable).
- *Artifacts:* Paired spec-create.org edits (keyword scheme + template header) in the same commit.