diff options
| author | Craig Jennings <c@cjennings.net> | 2026-06-06 07:32:53 -0500 |
|---|---|---|
| committer | Craig Jennings <c@cjennings.net> | 2026-06-06 07:32:53 -0500 |
| commit | b1a6d0e60ddec39538d068bd991c5286830ff870 (patch) | |
| tree | 62ab564af989b6f1cf91c10cfe1f4e9fd3e908d9 /claude-templates | |
| parent | 255031a403fdfb4ea49938068399270121ed1975 (diff) | |
| download | rulesets-b1a6d0e60ddec39538d068bd991c5286830ff870.tar.gz rulesets-b1a6d0e60ddec39538d068bd991c5286830ff870.zip | |
feat(workflows): promote reusable spec-review checks from emacs-d review passes
I folded the reusable, product-neutral checks from two emacs-d review passes into the canonical spec-review.org, so they survive the startup rsync and reach every project instead of living only in a downstream copy. The additions cover package-readiness and Makefile scope, actionable error strings, observability and diagnostics, long-running performance and failure-mode research, defcustom surface, a documentation plan, architecture weak-point mitigation, simplicity controls, extension/plugin developer experience, comparable-product sentiment, terminal-state discovery, CLI-wrapper value, and rollout/rollback, plus three reviewer principles and a generalizable-question harvesting rule.
The promotion is a pure superset. Every change adds or expands a generic check, nothing regresses. Project-specific findings stayed in the source spec. The handoff that asked for this is preserved under docs/design.
Diffstat (limited to 'claude-templates')
| -rw-r--r-- | claude-templates/.ai/workflows/spec-review.org | 79 |
1 files changed, 67 insertions, 12 deletions
diff --git a/claude-templates/.ai/workflows/spec-review.org b/claude-templates/.ai/workflows/spec-review.org index c0df333..35e9d0e 100644 --- a/claude-templates/.ai/workflows/spec-review.org +++ b/claude-templates/.ai/workflows/spec-review.org @@ -32,6 +32,7 @@ A review is complete when: 4. *The spec's review history is updated* with who reviewed it, when, which iteration it was, what changed or was recommended, and why. 5. *Deferred work is logged* to =todo.org= (v1 = =[#B]=, vNext/someday = =[#D]=), not left only in chat. 6. *Implementation tasks are enumerated* — the spec's =Implementation phases= section is lifted into a drop-in =todo.org= block (one entry per phase plus a test-surface entry), or, if the spec has no phase decomposition, that gap is raised as a finding. +7. *Generalizable review questions are harvested.* If Craig or the review uncovers a question/topic that applies beyond the current spec, rewrite it generically, add it to this workflow, and note the workflow change in the rulesets inbox/update note. Spec-specific observations stay in the spec review; reusable review practice moves here. *Measurable validation:* after reading the review, an implementer knows exactly what to decide or change before starting, and an author running spec-response can give every finding an accept/modify/reject without guessing what it means. @@ -71,16 +72,24 @@ Mark the spec implementation-ready only if *all* of these hold: - The user problem and target behavior are clear. - Scope is explicit: v1, out-of-scope, and vNext are separated. +- Project principles are explicit where the feature touches user data, remote systems, destructive actions, long-running operations, or public extension points. - All prior review questions/recommendations are answered or intentionally deferred. - The expected UX is concrete enough to implement without guessing. +- Existing platform/framework affordances are reused, wrapped, or intentionally replaced; lost opportunities are called out. - The data model, state ownership, persistence, and sync behavior are defined. -- Error, conflict, empty-state, and partial-failure behavior are defined. +- Error strings, conflict, empty-state, and partial-failure behavior are defined with actionable user-facing messages. - Security and privacy behavior is defined wherever the feature touches credentials, personal or shared data, or external services — or there is nothing sensitive to define and that's stated. -- Observability is defined: how the user (or operator) sees what the feature is doing and how failures surface — or the feature is simple enough not to need it, and that's stated. -- Performance expectations and likely bottlenecks are considered. +- Observability and diagnostic tooling are defined: how the user (or operator) sees what the feature is doing, how failures surface, how logs/debug commands isolate issues, and how sensitive data is redacted — or the feature is simple enough not to need it, and that's stated. +- Performance expectations and likely bottlenecks are considered, including long-running operations and how they are tested without making the normal suite slow/flaky. +- For transfer, sync, backup, import/export, queue, or other long-running data-movement specs: common and edge failure modes have been researched against current sources and comparable applications, and each material risk is classified as covered, avoided by scope, deferred with safeguards, or missing. - The architecture fits the existing codebase and names its integration points. -- The test strategy covers unit, integration/fixture, regression, and user-flow as appropriate. -- The plan can be phased without shipping broken intermediate states. +- Architecture weak points are named with mitigation steps, especially around concurrency, lifecycle states, extension points, and failure isolation. +- The public customization/configuration surface is named with defaults, safety notes, and clear user-facing variable names. +- The documentation plan is defined: README, user guide, developer guide, migration/upgrade notes, or an explicit reason they are unnecessary. +- The project tooling is part of scope where appropriate: Makefile/package scripts expose setup, test, compile, lint, coverage, cleanup, and slow/manual verification. +- The test strategy covers unit, integration/fixture, regression, user-flow, slow/long-running behavior, coverage reporting, and optional live/external tests as appropriate. +- Rollout, migration, rollback/reversibility, and compatibility are defined when the feature changes public behavior, persisted data, external resources, package APIs, user configuration, or shared remote state. +- The plan can be phased without shipping broken intermediate states, and phases are small enough to reach a clean stopping point in one focused work session. - External API assumptions are verified or explicitly listed as prerequisites. If all true → tell the user it's ready and stop unless they ask for more. If any false → continue and write the review file. A "ready" at this phase is provisional; confirm it at Phase 3 after the code read. @@ -102,19 +111,29 @@ After reading code and spec, re-run the Phase 1 gate — this is the pass that c Work the spec against these. Each is a source of concrete findings, not a box to tick. - *User problem & mental model.* Real problem or just machinery? Does UX match how users think? Predictable names/commands/prompts/layouts? Is editable-vs-generated content obvious? Are destructive/workspace-mutating actions explicit and confirmed? For org features, be *strict about ownership* — users treat visible org text as editable unless told otherwise. -- *Scope & boundaries.* What's v1 / out-of-scope / vNext? Are deferrals captured (=todo.org=)? Does the phased plan avoid half-working states? If a decision is deferred, is v1 still coherent without it? +- *Project principles and non-negotiables.* Does the spec state do/don't rules that protect user trust? For filesystem/remote/data-mutating features, does it forbid silent data loss, require residue/temp-file cleanup verification, make destructive and fallback paths visible, define cancellation/hung-work behavior, and prevent secret leakage? For extension points, does it define contracts plugin authors must obey? +- *Reuse and lost opportunities.* What does the underlying platform/framework/library already provide? Does the spec preserve useful existing affordances, naming, configuration, keymaps, commands, hooks, logs, caches, or test utilities? If it replaces them, is that intentional and justified? Are there obvious user-value opportunities being left on the table? +- *Scope & boundaries.* What's v1 / out-of-scope / vNext? Are deferrals captured (=todo.org=)? Does the phased plan avoid half-working states? If a decision is deferred, is v1 still coherent without it? Can each phase finish cleanly in one focused work session, or is it too large to stop safely? - *Requirements clarity.* Each requirement unambiguous, testable, traceable to a source (user ask / bug / API capability / decision)? Are "good/fast/intuitive/smart/seamless" replaced with concrete behavior? Terms used consistently? Infeasible/unverified requirements flagged? - *Data model & ownership.* Durable IDs vs display-only names? What's local/remote/generated/cached/user-authored? Who owns each editable region? What's replaced on refresh, survives view-switch / restart? What if one remote object could appear in multiple local places? Prefer stable IDs for execution, names for display; don't overload generic org properties (=:ID:=) for app-specific remote IDs unless intentional. - *Sync, refresh, conflict.* What triggers sync (command / save hook / state hook / timer) and refresh? One-way or bidirectional? No-op detection? Remote-conflict detection? v1 conflict behavior? Dirty-buffer behavior? API success but GraphQL-level failure? Network failure mid-pagination/mutation? For v1, simple is fine if explicit: detect, refuse, tell the user what to do next. +- *Transfer/sync failure-mode research.* For features that move or reconcile user data, do online/current research on how the domain fails in general and in comparable products. Look for partial writes, temp/residue files, source mutation during transfer, locked/read-only files, permission/ACL/xattr loss, quota/space failure, case/restricted-name collisions, invalid encodings, duplicate remote objects, symlink/hardlink/special-file behavior, clock skew/mtime unreliability, watcher delays, missing roots/unmounted volumes, ignored/selective trees, mass deletes, ransomware-like churn, reconnect/offline gaps, conflict propagation, stale status, and "stuck but no file list" complaints. Classify every material item as =covered=, =avoided by v1 scope=, =deferred with guardrails=, or =missing=; missing items become findings. - *API & external dependencies.* Which fields/mutations/filters/enums are assumed, and are they verified against current schema/docs/live responses? Permission differences, rate limits, pagination, nulls, partial errors handled? Any optional external tool — what's the fallback? Don't let a spec silently depend on an unverified API shape. +- *CLI/tool-wrapper value.* If the feature wraps command-line tools or external executables, why would a user prefer the package over running the tool directly? Specs must cover common tool-specific failures, exit codes, stderr/log signatures, destination/backend gotchas, preflight/doctor checks, redaction, retry/cancel behavior, and a failure explainer. Helpful-error success criteria: identify the failure class, explain likely cause, prove it with redacted evidence, state safety outcome, offer repair actions, and provide a redacted reproduction command/log bundle. - *Architecture & maintainability.* Fits current module boundaries, or should new ones precede growth? Which parts are pure vs side-effecting? Are transport/parsing/normalization/rendering/command-orchestration separated enough to test? Will it multiply special cases? Backward-compat preserved where needed? Favor small pure cores surrounded by thin IO/command layers. -- *Robustness & error handling.* Expected failure modes? Errors distinguishable from empty-but-successful results? Actionable messages? Retries? Stale/missing caches? Can malformed local files or missing properties crash commands? nil/null/missing-field behavior defined? Data-loss-by-default avoided? Anything that writes files or remote data needs a clear data-loss story. -- *Performance & scale.* Expected counts (issues/comments/labels/teams/projects/views)? Server-side filtering where possible? Bounded, visible pagination? Cached name→ID lookups? Sync calls in the command path acceptable? Could a save hook or whole-file scan make N network calls? Rendering linear? Full-file rewrites avoided? Identify UI freezes, repeated network calls, unbounded pagination — without premature optimization. +- *Simplicity and complexity control.* Is the UX dead simple on the hot path? Are advanced controls out of the way? Are command functions thin? Are branchy functions split or justified? Would cyclomatic complexity, coverage, or code-size reports reveal refactoring targets? Are there explicit refactoring triggers before complexity hardens? +- *Extension/developer experience.* If plugins/backends/extensions are part of the feature, is the public ABI precise? Are capability flags defined? Are lifecycle callbacks, failure states, redaction rules, and contract tests documented? Can another engineer add an extension without reading private internals? Is the extension contract tiered so authors can build a minimum runnable plugin quickly, then graduate to publishable/excellent diagnostics? Does the spec clearly say what information is required, what is optional, and what can be unknown/conservative at first? Will authors know what evidence/patterns/capability metadata to add, or are they expected to infer it from core internals? Are scaffolds, templates, fixture capture tools, pattern-table helpers, contract-test tiers, and "where the line is" docs provided? Could the contract be so complex that reasonable plugin authors opt out, and if so, what helper API removes that burden? Treat plugin authors as end-users of the developer experience and require their plugins to give downstream users the same diagnostic quality the core promises. +- *Robustness & error handling.* Expected failure modes? Errors distinguishable from empty-but-successful results? Actionable messages? Do error strings name the operation, object, backend/API/tool involved, and next step? Retries? Stale/missing caches? Can malformed local files or missing properties crash commands? nil/null/missing-field behavior defined? Data-loss-by-default avoided? Anything that writes files or remote data needs a clear data-loss story. +- *Performance & scale.* Expected counts (issues/comments/labels/teams/projects/views)? Server-side filtering where possible? Bounded, visible pagination? Cached name→ID lookups? Sync calls in the command path acceptable? Could a save hook or whole-file scan make N network calls? Rendering linear? Full-file rewrites avoided? Long-running operations async/cancellable/observable? Is concurrency/queueing/backpressure defined? Are high-output process filters throttled and cheap? Is progress/ETA exposed only when defensible, and are hung/stalled operations detectable and killable? Identify UI freezes, repeated network calls, unbounded pagination — without premature optimization. - *Security & privacy.* API keys safe? Debug logs leaking secrets or private issue text? Confirmations before mutating shared workspace objects? Personal vs shared distinguished? Local files holding sensitive descriptions/comments? Anything to redact from messages/logs? Any work-tracker integration may handle private company data. -- *UX & accessibility.* Discoverable commands? Recoverable mistakes? Prompts ordered to the task? Safe, useful defaults? Informative-not-noisy status messages? Does the UI avoid implying unsupported actions are supported? Match the upstream product's permissions/concepts? For Emacs packages, command names, completion candidates, buffer layout, and message wording *are* the UX. -- *Test strategy.* Characterization tests before behavior changes? Pure functions to unit-test? API responses needing fixtures? Command flows needing stubs? Regression tests for prior bugs? Boundary/error cases? What's covered elsewhere and shouldn't be re-tested? Which existing tests must change? Prefer tests that lock contracts: representation shape, query compilation, sync no-op, conflict refusal, pagination, dirty-buffer protection. -- *Observability & operations.* How does a user see what the package is doing? Progress messages for long ops? Useful, safe debug logging? A way to inspect/clear caches? Verify setup/API connectivity? For generated org files, headers should often carry source, filter/view name, refresh time, count, truncation state. -- *Documentation & migration.* README/config changes? New commands and defcustoms documented? Old behaviors removed/renamed/migrated? Breaking changes acceptable? Changelog note needed? Don't ship with stale examples. +- *UX & accessibility.* Discoverable commands? Recoverable mistakes? Prompts ordered to the task? Safe, useful defaults? Informative-not-noisy status messages? Does the UI avoid implying unsupported actions are supported? Match the upstream product's permissions/concepts? Are customizations named in user language, with clear defaults and docstrings? For Emacs packages, command names, completion candidates, buffer layout, defcustom names, and message wording *are* the UX. +- *Test strategy and coverage.* Characterization tests before behavior changes? Pure functions to unit-test? API responses needing fixtures? Command flows needing stubs? Regression tests for prior bugs? Boundary/error cases? What's covered elsewhere and shouldn't be re-tested? Which existing tests must change? How is coverage generated, summarized, and used to find untested/refactor-worthy code? Prefer tests that lock contracts: representation shape, query compilation, sync no-op, conflict refusal, pagination, dirty-buffer protection, log redaction, and long-running/slow-operation behavior via fakes rather than flaky live dependencies. +- *Observability & operations.* How does a user see what the package is doing? Progress messages for long ops? Useful, safe debug logging? Are logs structured enough to isolate issues from a bug report? Are commands provided to inspect/clear caches, test connectivity, diagnose backends/tools, copy redacted debug info, or reproduce command invocations? How are terminal states discovered: completion, failure, partial success, stalled/hung, cancelled, cleanup-unverified, and "needs user action"? Does the product notify only when useful, avoid noisy success spam, and keep non-success states visible until acknowledged? For generated org files, headers should often carry source, filter/view name, refresh time, count, truncation state. +- *Comparable-product sentiment.* When there are obvious adjacent products, research what users love and hate about them from official docs plus current community reports. Do not cargo-cult their feature set; translate findings into the spec's scope. For each loved behavior, say whether the spec provides it, intentionally omits it, or defers it. For each hated behavior, say whether the spec avoids, resolves, inherits, or accepts it. +- *Documentation & migration.* README/config changes? New commands and defcustoms documented? User guide needed? Developer guide/API guide needed? Makefile or package scripts documented as part of the product? Old behaviors removed/renamed/migrated? Breaking changes acceptable? Changelog note needed? Don't ship with stale examples. +- *Rollout, compatibility, and rollback.* How does the feature ship safely? Are old config/data formats migrated or read compatibly? Can users opt out or revert? If remote/shared data is mutated, is there a dry-run, backup/trash/versioning, or recovery path? If a public API/plugin contract changes, is the compatibility story explicit? For local packages, "rollback" may mean disabling the module, removing generated files, or restoring prior config; spell that out where it matters. +- *Development tooling.* Does the repo give contributors obvious commands for setup, fast tests, specific tests, compile, lint, coverage, cleanup, slow/manual tests, and release checks? Are optional/live tests gated by explicit environment variables? Is the Makefile/script surface consistent with sibling projects? +- *Small enhancement radar.* Are there low-complexity, high-value affordances already provided by the platform that should be surfaced now or explicitly deferred? Examples: archive/compress commands in file managers, built-in history, previews, diagnostics, or doctor commands. Keep the hot path simple; capture the opportunity rather than accidentally losing it. ** Phase 5: Write the review file @@ -146,6 +165,8 @@ Important improvements that shouldn't block all progress. ,* Robustness and performance observations ,* Test strategy recommendations Specific test cases, not generic "add tests". +,* Documentation and tooling recommendations +README/user/developer docs, Makefile/package scripts, coverage, debug tools, and customization surface. ,* Suggested spec edits Concrete edits to make the spec implementation-ready. @@ -225,6 +246,15 @@ The spec says fetched comments render as subheadings but doesn't define whether ** Decide readiness first, critique second The gate is the product of the review. A long critique on a spec that's actually ready just creates churn; a "looks fine" on one that isn't ships hidden decisions. +** Review as a risk reducer, not a rewrite contest +Good spec review improves the chance of a correct implementation. It is not a venue for proving cleverness, redesigning the whole product, or teaching broad lessons at the author's expense. Prefer the smallest spec change that removes the implementation risk. If a finding implies a major redesign, name the decision pressure and the options; don't silently rewrite the architecture in review prose. + +** Separate blockers from preferences +Every material comment should be tagged by force: blocking, should-fix, or optional. Avoid burying a true blocker in a long list of preferences, and avoid making stylistic or taste comments look like correctness concerns. Bikeshedding belongs in linters, formatters, local conventions, or explicit optional notes. + +** Make feedback author-usable +Review comments should be specific, neutral, and actionable: quote or name the spec behavior, explain the risk, recommend the smallest concrete change, and say how the author can verify the fix. Avoid personal language, rhetorical questions, vague "this needs work" comments, and comments that require the author to infer the desired edit. + ** Preserve iteration provenance Future reviewers and implementers need to know not just the current decision, but how the spec got there: how many review/response loops happened, who contributed, what they changed or recommended, and why. Keep that record in the spec itself under =Review and iteration history= so the trail survives deleted review files, chat loss, and agent handoffs. @@ -244,6 +274,13 @@ Open questions in the review are the ones whose answer materially changes implem Update this workflow as review practice sharpens — new dimensions worth checking, better model findings, refinements to the readiness gate and rubric. +When a review conversation surfaces a question that would improve future reviews, decide whether it is generalizable: + +- *Generalizable:* applies to a class of future specs (for example, "How will users discover long-running failures?", "What do comparable products make users love/hate?", "Which transfer/sync edge cases are covered, avoided, or deferred?"). Rewrite it in product-neutral language, add it to this workflow, and append a short note to the rulesets inbox/update note explaining what changed and why. +- *Spec-specific:* depends on this product's exact domain, names, or implementation choices. Keep it in the review/spec history and do not bloat the workflow. + +This workflow is allowed to grow, but only by turning repeated or broadly useful review instincts into clear checks that produce implementation-changing findings. + * Research notes behind this checklist Combines the review process used on the linear-emacs issue query/representation specs with broader design/spec-review practice: @@ -261,6 +298,9 @@ Sources: - [[https://www.atlassian.com/software/confluence/templates/design-review][Atlassian: Design Review Template]] - [[https://www.geeksforgeeks.org/software-engineering/software-requirement-specification-srs-document-checklist/][GeeksforGeeks: SRS Document Checklist]] - [[https://gerrit-review.googlesource.com/Documentation/dev-design-docs.html][Gerrit: Design Docs Review Process]] +- [[https://google.github.io/eng-practices/review/reviewer/][Google Engineering Practices: Code Review Developer Guide]] +- [[https://google.github.io/styleguide/docguide/best_practices.html][Google Documentation Guide: Best Practices]] +- [[https://spec-coding.dev/guides/spec-review-checklist][Spec Coding: Spec Review Checklist]] * Review and iteration history @@ -288,3 +328,18 @@ Sources: - *What:* Added a Phase 6 step, "Emit implementation tasks (drop-in for todo.org)": lift the spec's =Implementation phases= section into a paste-ready =todo.org= block (one =[#B]= TODO per phase plus a test-surface entry mirroring =Acceptance criteria=), with the missing-phase-decomposition case raised as a finding rather than invented. Added Exit Criterion 6 to match. - *Why:* From a pearl handoff. Phase 6 only logged deferred/v1 work in passing; this makes the implementer handoff one paste instead of a re-read, and forces specs to be decomposable into phases (a spec that resists it surfaces a shape problem before =Ready=). - *Artifacts:* =todo.org= task ([#C] =:feature:solo:=, closed this commit). Phase 6 + Exit Criteria edits in this file. + +** 2026-06-06 Sat @ 04:52:13 -0500 — Codex (emacs-d) — reviewer + author +- *What:* Added reusable checks from DUET's third review pass: transfer/sync failure-mode research, covered/avoided/deferred classification, comparable-product love/hate sentiment, terminal-state discovery/notification, and explicit harvesting of generalizable user questions into the workflow with a rulesets inbox note. +- *Why:* Craig asked for deeper online research into how transfers and sync go wrong, whether DUET covers or avoids those failures, whether users would discover failures, and how future spec reviews should absorb new reusable questions rather than leaving them in chat. +- *Artifacts:* DUET spec/review updates in =docs/design/duet-spec.org= and =docs/design/duet-spec-review.org=; rulesets inbox note =2026-06-06-spec-review-workflow-update.md= updated. + +** 2026-06-06 Sat @ 05:18:30 -0500 — Codex (emacs-d) — researcher + author +- *What:* Added reviewer anti-pattern guidance and a rollout/rollback/reversibility dimension after a meta-review of design/spec-review advice. New principles: review as risk reduction rather than rewrite contest, separate blockers from preferences, and make feedback author-usable. Added source links for Google code-review/documentation guidance and a modern spec-review checklist. +- *Why:* Craig asked for online research into what good spec reviewers should always do and avoid, including reputable sources and developer-community advice. The workflow already covered most readiness mechanics; the missing reusable content was reviewer discipline and compatibility/rollback framing. +- *Artifacts:* This workflow; DUET final review pass; rulesets inbox note =2026-06-06-spec-review-workflow-update.md= updated. + +** 2026-06-06 Sat @ 07:25:11 -0500 — Claude Code (rulesets) — responder +- *What:* Promoted the two Codex (emacs-d) passes above into the canonical workflow. The reusable checks had been written into =.emacs.d='s synced copy, which the startup rsync overwrites from canonical, so they would not have survived or propagated. Folded them into =claude-templates/.ai/workflows/spec-review.org= and synced the =.ai/= mirror, so every project picks them up on next startup. Verified the promotion is a pure superset — every change is an added or expanded generic check, no regressions. +- *Why:* The handoff =inbox/2026-06-06-spec-review-workflow-update.md= asked for it: reusable review practice belongs in the canonical workflow, spec-specific findings stay in the spec. Exit Criterion 7 and the Living-Document harvesting guidance, added in the same passes, name this as the intended path. +- *Artifacts:* Handoff preserved at =docs/design/2026-06-06-spec-review-reusable-checks-handoff.md=. Paired spec-response.org Phase 6 (author-side implementation-task breakdown) committed in the same session. |
