| author | Craig Jennings <c@cjennings.net> | 2026-06-16 00:58:27 -0500 |
|---|---|---|
| committer | Craig Jennings <c@cjennings.net> | 2026-06-16 00:58:27 -0500 |
| commit | 7467d1f0c1da0ce3ffe07337cdaf689e010890a3 (patch) | |
| tree | d032ee1a7484a4a7802bd7ec5b545d810d43a691 /docs/design/2026-06-16-autonomous-batch-execution-spec.org | |
| parent | 26bcae666ac648812bb24bd666383f4da50976df (diff) | |
docs: spec autonomous-batch execution and KB contribution
The parked Phase E proposal and the "fix speedrun" mode describe the same capability, so I reconciled them into one autonomous-batch spec: a dedicated work-the-backlog.org holds the execution loop, inbox-zero keeps its routing, and "fix speedrun" is a thin preset over it. The spec also designs an effectiveness-measurement trial (a per-task metrics log plus periodic org-roam synthesis articles). The second spec wires light KB-contribution prompts into four workflows plus a curated best-practices node.
Both tasks now carry a review VERIFY. The wrap-up-routing implementation stays open: it moves tasks between projects' todo.org files, so it needs a focused session with a data-loss checkpoint, not a tail-end rush.
Diffstat (limited to 'docs/design/2026-06-16-autonomous-batch-execution-spec.org')
| -rw-r--r-- | docs/design/2026-06-16-autonomous-batch-execution-spec.org | 329 |
1 files changed, 329 insertions, 0 deletions
diff --git a/docs/design/2026-06-16-autonomous-batch-execution-spec.org b/docs/design/2026-06-16-autonomous-batch-execution-spec.org
new file mode 100644
index 0000000..e2e0f90
--- /dev/null
+++ b/docs/design/2026-06-16-autonomous-batch-execution-spec.org
@@ -0,0 +1,329 @@
+#+TITLE: Autonomous-Batch Task Execution — Spec
+#+AUTHOR: Craig Jennings & Claude
+#+DATE: 2026-06-16
+#+TODO: TODO | DONE SUPERSEDED CANCELLED
+
+* Metadata
+| Status   | draft                                                              |
+|----------+--------------------------------------------------------------------|
+| Owner    | Craig Jennings                                                     |
+|----------+--------------------------------------------------------------------|
+| Reviewer | Craig Jennings                                                     |
+|----------+--------------------------------------------------------------------|
+| Date     | 2026-06-16                                                         |
+|----------+--------------------------------------------------------------------|
+| Related  | [[file:../../working/inbox-zero-phase-e/proposed-inbox-zero.org][Phase E proposal]]; [[file:2026-06-15-fix-speedrun-workflow-proposal.org][fix-speedrun proposal]] |
+|----------+--------------------------------------------------------------------|
+
+* Summary
+
+Two proposals arrived within a day of each other describing the same capability: have Claude work a batch of small, well-marked tasks autonomously, with a full quality bar per task and no per-step approval gate. The inbox-zero "Phase E" proposal drives it from a tag/priority query on a recurring loop; the "fix speedrun" proposal drives it from an explicit ordered list a human dictates in-session. This spec reconciles both into one feature: a single dedicated workflow, =work-the-backlog.org=, that holds the task-execution logic, with two thin callers feeding it. It also designs the instrumentation that measures whether the autonomy is actually paying off.
+
+* Problem / Context
+
+Craig has a standing backlog of small, solo-doable fixes across several projects, already marked with a tag convention (=:next:=, =:quick:+:solo:=). Doing them by hand one at a time is the bottleneck — each is 30 minutes or less, but the context-switch and the per-commit approval ceremony dominate the actual work. He wants Claude to burn these down unattended: on a recurring loop for the routed inbox case, and on demand when he batches a named list and says "fix speedrun, no approvals until done."
+
+Two separate proposals tried to answer this:
+
+- *Phase E* (in =inbox-zero.org=, edited in =.emacs.d= as a stopgap) bolted autonomous execution onto the inbox-zero workflow's on-demand and loop callers. The sender flagged the seam as the open question: coupling capture-routing with autonomous-implementation pollutes inbox-zero's three existing callers (startup, wrap-up, on-demand), two of which must never execute anything.
+- *fix speedrun* (a =.emacs.d= theme-studio session that worked well) is the same execution loop driven by an explicit ordered task set, with end-of-set paging and always-push.
+
+They overlap almost entirely. The execution loop — eligibility gate, act-vs-file decision, per-task quality bar, bounded run — is identical. Only the *input* differs (tag query vs explicit list) and the *session mode* differs (loop default vs no-approvals + always-push + page). Building them as two features would duplicate the execution logic and let the two copies drift. The forces: keep inbox-zero's callers clean, share one execution loop, and make the autonomy safe enough to run unattended on a 30-minute timer without Craig watching.
+
+A second, explicit ask from Craig: instrument this so its effectiveness is measurable. "Gather data on this and create some org-roam articles we can look at later." Autonomous execution that silently makes bad commits is worse than no autonomy; the only way to know which it is, is to measure tasks completed vs deferred vs reverted, and human corrections in the following session, over time.
+
+* Goals and Non-Goals
+
+** Goals
+- One workflow, =work-the-backlog.org=, owns the task-execution loop. Both input shapes (tag query, explicit list) and both session modes feed it.
+- inbox-zero's three existing callers stay clean: the loop caller chains into =work-the-backlog= *after* routing; startup and wrap-up never touch it.
+- "fix speedrun" is a thin named preset, not a second implementation: no-approvals session mode + always-push + end-of-set page, feeding an explicit ordered list.
+- Commit autonomy defaults to file-only (surface a diff, no auto-commit). A project opts into autonomous commit+push explicitly via its per-project waiver.
+- Hard guardrails: refuse to speedrun any task needing a design decision or carrying data-loss risk without a checkpoint; file a =VERIFY= and move on rather than guess-implement an underspecified task; a per-run cap / kill switch beyond "one task per run."
+- A lightweight per-run metrics log plus a periodic synthesis step that writes org-roam KB articles summarizing the trend.
+
+** Non-Goals
+- *Not* a replacement for =/start-work=. Tasks needing deliberation, design, or an hour-plus stay with =/start-work= and its approval gates. This feature only touches the small, marked, solo set.
+- *Not* a new tag convention. It reads the project's own priority/tag scheme header; it never invents or hardcodes tags across projects.
+- *Not* an inbox-routing change. =inbox-zero.org= keeps its A-D phases. The Phase E text added in =.emacs.d= as a stopgap is *removed* and its logic moves here.
+- *Not* a multi-project orchestrator. One run works one project's backlog. Cross-project handoff stays with =inbox-send= and the paging reply.
+- *Not* a credential-handling or external-API feature. Tasks that touch secrets or external mutations are out of the eligible set by the guardrail.
+
+** Scope tiers
+- *v1:* =work-the-backlog.org=; the eligibility gate reading the project's scheme header; the act-vs-file decision with VERIFY-on-ambiguity; file-only commit default with per-project opt-in; the loop caller wiring and inbox-zero Phase E removal; the "fix speedrun" preset with end-of-set =notify --persist= page; the per-run metrics log (structured JSONL).
+- *Out of scope:* a token-budget kill switch (cap is a task count in v1); cross-project batch runs; a dashboard or live UI over the metrics.
+- *vNext (log to todo.org):* the periodic org-roam synthesis step if it doesn't make v1; a token/cost budget alongside the task-count cap; auto-detection of "human corrected my autonomous commit" from the next session's diff.
+
+* Design
+
+** Overview
+
+The architecture is one execution workflow with two callers and one preset, plus an instrumentation sidecar.
+
+#+begin_example
+  inbox-zero loop caller ──(after Phase D routing)──┐
+                                                    ├──▶ work-the-backlog.org ──▶ metrics log (JSONL)
+  "fix speedrun" preset ──(explicit ordered list)───┘                                     │
+    = no-approvals + always-push + end-page                                               ▼
+                                                                                 periodic synthesis ──▶ org-roam KB articles
+#+end_example
+
+=work-the-backlog.org= is the only place the execution loop lives. It takes a *task set* (however assembled) and a *session mode* (which gates commit autonomy and paging), and works the set under a fixed safety contract. The two callers differ only in how they build the task set and which session mode they pass.
+
+This is the seam the Phase E sender asked for: separating capture-routing (inbox-zero) from autonomous-implementation (work-the-backlog) keeps inbox-zero's startup and wrap-up callers — which must never execute anything — untouched. The loop caller is the only one of inbox-zero's callers that chains forward into execution, and it does so as an explicit second step after routing completes, not as a phase buried inside inbox-zero.
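Editorial sketch (not part of the commit): the "one workflow, two thin callers" shape from the Overview can be pictured as two small builders producing the same request object. This is a minimal illustration only; `RunRequest`, `loop_request`, `fix_speedrun_request`, and the ceiling value of 10 are hypothetical names and numbers, since the workflows in this spec are prose rather than code and the spec does not fix the ceiling.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class RunRequest:
    """What a caller hands to work-the-backlog (hypothetical shape)."""
    tasks: Tuple[str, ...]   # ordered candidate task headings
    commit_mode: str         # "file-only" or "autonomous-commit"
    always_push: bool        # push after each commit (fix-speedrun preset)
    paging: bool             # fire the end-of-set page
    cap: int                 # hard per-run task cap (the kill switch)


# Assumed ceiling; the spec only says the explicit list is "capped at a ceiling".
SPEEDRUN_CEILING = 10


def loop_request(query_results: Iterable[str]) -> RunRequest:
    """Loop caller: tag-query results, file-only, paging off, cap 1."""
    return RunRequest(tuple(query_results), "file-only", False, False, 1)


def fix_speedrun_request(ordered: Iterable[str],
                         ceiling: int = SPEEDRUN_CEILING) -> RunRequest:
    """fix-speedrun preset: explicit list, autonomous-commit, push, page."""
    tasks = tuple(ordered)
    return RunRequest(tasks, "autonomous-commit", True, True,
                      min(len(tasks), ceiling))
```

Both builders return the same type, which is the point of the design: the callers differ only in the task set and the mode flags, never in the execution logic.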
+
+** The execution loop (two-altitude: caller's view)
+
+A caller hands =work-the-backlog= three things:
+
+1. *A task set* — either an explicit ordered list of task headings (fix speedrun), or the result of a tag/priority query against =todo.org= (the loop). The workflow does not care which; it receives an ordered list of candidate tasks.
+2. *A session mode* — =file-only= (default) or =autonomous-commit= (requires the project's per-project waiver), and a paging flag.
+3. *A run cap* — the maximum number of tasks to complete this run.
+
+It returns: per-task outcome (implemented+committed / implemented+diff-surfaced / deferred-VERIFY / deferred-too-large / skipped-ineligible), and a metrics record per task.
+
+** The execution loop (implementer's view)
+
+For the task set, in order, until the run cap is hit:
+
+1. *Eligibility gate* (below). Ineligible → record =skipped-ineligible=, next task.
+2. *Scope read* of the relevant code. Cheap; just enough to make the act-vs-file call.
+3. *Act-vs-file decision* (below). File → record the deferral reason, next task.
+4. *Implement* under the project's commit discipline: TDD red→green→refactor, then =/review-code --staged=, fix all Critical/Important, then close the task per =todo-format.md=.
+5. *Commit autonomy branch:*
+   - =file-only= → surface the diff, do *not* commit. Record =implemented-diff-surfaced=.
+   - =autonomous-commit= → =/voice personal= on the message, commit individually, push per the project's flow. Record =implemented-committed=.
+6. *Record metrics* for the task (the JSONL append, below).
+7. Decrement the cap. At zero, stop.
+
+After the set: if the paging flag is set, fire the end-of-set page (below). Surface the run summary.
+
+** Eligibility gate
+
+A task is autonomous-safe when *all* hold:
+
+1. *Status is =TODO=* — never =VERIFY=, =DOING=, =DONE=, or =CANCELLED=. =VERIFY= is the "awaiting Craig's manual confirmation" marker; auto-implementing one defeats the manual check it represents.
+2. *Tagged per the project's autonomous-safe set* — resolved by reading the project's priority/tag scheme header at the top of its =todo.org=, not by hardcoding. The default reading is =:next:= OR both =:quick:= AND =:solo:=, but a project whose scheme declares a different autonomous-safe tag set overrides that.
+3. *Solo-doable* — no input or undecided judgment call from Craig.
+4. *Roughly 30 minutes or less* of focused work.
+
+** Act-vs-file decision (the guardrail)
+
+After the scope read, for each eligible candidate:
+
+- *Clear, bounded, solo, ≤ ~30 min* → implement.
+- *Needs a design decision, Craig's input, or discussion* → do NOT implement. File a one-line note on the task naming the input it needs; surface it.
+- *Carries data-loss risk without a checkpoint* (deletes data, rewrites persisted state, touches external/shared state irreversibly) → do NOT implement. File a =VERIFY= explaining the risk; surface it.
+- *Underspecified or already-satisfied* → do NOT guess-implement. File a =VERIFY= noting why (the fix-speedrun "raise max spans to 5 — every cap was already 8" case) and move on.
+- *An hour or more* → do NOT implement. File and surface as a =/start-work= candidate.
+
+When unsure which side a task falls on, file rather than implement. A wrong auto-implement costs more than a deferred task — it costs a revert *and* the human correction in the next session that the metrics are designed to catch.
+
+** Session modes and the "fix speedrun" preset
+
+Two orthogonal session-mode dimensions feed the loop:
+
+- *Commit autonomy:* =file-only= (default) or =autonomous-commit=. =autonomous-commit= is honored only when the project carries the per-project waiver (=.emacs.d= and =rulesets= have it; most projects do not). Absent the waiver, a request for =autonomous-commit= degrades to =file-only= and says so.
+- *Paging:* on or off. End-of-set only.
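Editorial sketch (not part of the commit): the commit-autonomy rule in the list above reduces to a small fail-safe function. The name `resolve_commit_mode` and the tri-state `has_waiver` argument (`True`, `False`, or `None` for an ambiguous read) are assumptions made for illustration; the spec leaves the waiver's location and format as an open decision.

```python
from typing import Optional, Tuple

VALID_MODES = ("file-only", "autonomous-commit")


def resolve_commit_mode(requested: str,
                        has_waiver: Optional[bool]) -> Tuple[str, Optional[str]]:
    """Return (effective_mode, notice).

    autonomous-commit is honored only when the waiver read is an
    unambiguous True. A missing or ambiguous waiver (False or None)
    degrades to file-only, and the notice says so.
    """
    if requested not in VALID_MODES:
        raise ValueError(f"unknown commit mode: {requested!r}")
    if requested == "autonomous-commit" and has_waiver is not True:
        return ("file-only",
                "autonomous-commit requested without a confirmed waiver; "
                "degraded to file-only")
    return (requested, None)
```

Treating `None` the same as `False` keeps the gate fail-safe: an unreadable waiver can only reduce what the run is allowed to do.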
+
+"fix speedrun" is the named preset = =autonomous-commit= + always-push + paging-on, fed an *explicit ordered list*. It is not a separate code path; it is a label for that combination of mode flags plus the explicit-list input. The loop caller, by contrast, runs =file-only= (unless the project has the waiver and opts the loop into commits) with paging off, fed the *tag query*.
+
+** Bounding the run and the kill switch
+
+Default cap: one task per run for the loop caller — implement the highest-priority eligible candidate (=[#A]= before =[#B]= before =[#C]=), record, then stop and let the next tick continue. The fix-speedrun preset works the whole explicit list in order (the human bounded it by naming it), still one commit per task.
+
+The kill switch is a hard per-run task cap passed by the caller, independent of "one per run": even fix-speedrun stops at the cap and pages with the remainder listed. A loop that fires every 30 minutes and commits unattended needs a ceiling that a runaway can't exceed.
+
+** End-of-set paging
+
+When the set is done (or the cap is hit), if paging is on, fire one page — end-of-set only, never per-task:
+
+#+begin_src sh
+notify alarm "Page" "<project>: <N> done, <M> remaining — <one-line summary>" --persist
+#+end_src
+
+=--persist= keeps it on screen until dismissed (the page-me convention). The message carries the project name, the completed count, and the remaining count, so Craig can reply confirming ready + naming the next project in one turn. The page-signal wrapper removed 2026-06-12 is reconciled to =notify= here — there is no separate page-signal call.
+
+* Alternatives Considered
+
+** Fold execution into inbox-zero (the Phase E stopgap shape)
+- Good, because it's the smallest diff — the loop caller already runs inbox-zero, so execution is "one more phase."
+- Bad, because it couples capture-routing with implementation. inbox-zero has three callers; startup and wrap-up must never execute. A Phase E inside inbox-zero forces both to carry a "skip Phase E" caveat and risks a future caller running it by accident.
+- Neutral, because the eligibility-gate and act-vs-file text is identical either way — only its *home* differs.
+
+** Two separate features (keep Phase E and fix-speedrun distinct)
+- Good, because each proposal ships as written with no reconciliation work.
+- Bad, because the execution loop is duplicated in two places and will drift; a guardrail tightened in one won't reach the other. Two ways to do autonomous execution is two things to audit.
+- Neutral, because the input and session-mode differences are real — but they're thin caller-level differences, not a reason to fork the engine.
+
+** Autonomous-commit as the default
+- Good, because it's faster end-to-end with no diff to review.
+- Bad, because most projects lack the per-project waiver, and an unattended loop committing to a project that never opted in is exactly the failure the file-only default prevents. The blast radius of a bad autonomous commit is a revert plus lost trust in the loop.
+- Neutral, because the projects that *do* want it (=.emacs.d=, =rulesets=) opt in explicitly, so the capability is available where it's wanted without being the default everywhere.
+
+* Decisions [/]
+
+** TODO Where the eligibility gate reads its tag set
+- Owner / by-when: Craig / spec-review
+- Context: Phase E hardcoded =:next:= / =:quick:+:solo:=. Projects' priority/tag schemes vary, and the =todo-format.md= scheme header is the declared source of truth per project.
+- Decision: We will read the project's =todo.org= priority/tag scheme header to resolve the autonomous-safe tag set, defaulting to =:next:= OR =:quick:+:solo:= when the header doesn't declare an explicit autonomous-safe set.
+- Consequences: easier — one workflow works correctly across projects with different tag vocabularies; harder — a project with no scheme header (or a malformed one) needs a fallback, and the "default reading" has to be specified precisely enough that two projects agree on it.
+
+** TODO The do-not-auto-implement marker set
+- Owner / by-when: Craig / spec-review
+- Context: =VERIFY= means "awaiting Craig's manual confirmation" in =.emacs.d= and =rulesets=. Other projects may use =VERIFY= differently or not at all. The gate excludes =VERIFY=, =DOING=, =DONE=, =CANCELLED= by status, but the *marker semantics* are what matter.
+- Decision: We will define the do-not-auto-implement set as: any status that is not =TODO=, plus any task carrying a project-declared "hold" marker. The canonical default treats =VERIFY= as do-not-implement; a project overrides only by declaring its marker semantics in its scheme header.
+- Consequences: easier — the gate is portable and a project can't accidentally have its manual-check tasks auto-run; harder — requires the scheme header to carry marker semantics, which most don't yet, so the default has to be safe-by-omission (exclude anything not plainly =TODO=).
+
+** TODO Commit-autonomy opt-in mechanism
+- Owner / by-when: Craig / spec-review
+- Context: =file-only= is the default; =.emacs.d= and =rulesets= have a per-project waiver allowing autonomous commits. Where does the workflow *read* that a project has opted in?
+- Decision: We will read the opt-in from the project's existing per-project waiver location (the same place the commit discipline's "no approval gate" waiver lives — =notes.org= Workflow State or =CLAUDE.md=), not introduce a new config file.
+- Consequences: easier — no new config surface, reuses the existing waiver concept; harder — the waiver's exact location and format must be pinned so the workflow can detect it deterministically, and a project with the commit waiver but *not* wanting the loop to commit needs a way to say "waiver yes, loop-commit no" (two flags, not one).
+
+** TODO Run-cap default and the kill switch shape
+- Owner / by-when: Craig / spec-review
+- Context: The loop default is one task per run; fix-speedrun works an explicit list. Both need a hard ceiling a runaway can't exceed.
+- Decision: We will pass a hard per-run task cap from the caller (loop default 1; fix-speedrun = length of the explicit list, capped at a ceiling), and stop + page with the remainder when the cap is hit. v1 caps by task count, not token budget.
+- Consequences: easier — a simple integer the caller controls; bounded blast radius; harder — a task-count cap doesn't bound *cost* (one 30-min task can burn many tokens), so a token budget is vNext, and until then a pathological task can run long within a single cap slot.
+
+** TODO Metrics log location and format
+- Owner / by-when: Craig / spec-review
+- Context: Per-run metrics must land somewhere structured and queryable, per-project, and survive across sessions for the synthesis step to read.
+- Decision: We will append one JSONL record per task to a per-project log at =.ai/metrics/work-the-backlog.jsonl=, git-tracked, with the synthesis step reading the union across projects.
+- Consequences: easier — append-only JSONL is trivial to write and =jq=-queryable; per-project keeps it local to where the work happened; harder — a git-tracked log adds churn to every autonomous run's commit (or needs its own commit), and "union across projects" needs the synthesis step to know where every project's log lives.
+
+** TODO Synthesis cadence and trigger
+- Owner / by-when: Craig / spec-review
+- Context: Craig wants periodic org-roam articles summarizing the data. What triggers synthesis, and how often?
+- Decision: We will run synthesis on an explicit trigger ("synthesize backlog metrics") and optionally a weekly scheduled run, writing one KB node per synthesis under =~/org/roam/agents/= per the knowledge-base rule.
+- Consequences: easier — explicit trigger means no surprise writes, and the KB rule already governs node shape; harder — a weekly scheduled run needs a scheduler entry and the KB write-classification (personal-only) must gate it so work-project metrics never land in the KB.
+
+* Implementation phases
+
+** Phase 1 — Extract the execution loop into work-the-backlog.org
+Write =work-the-backlog.org= holding the eligibility gate, act-vs-file decision, per-task quality bar, and run-cap logic — taking a task set + session mode + cap as input. Remove the stopgap "Phase E" text from =inbox-zero.org= (restore it to its A-D shape) in the same change so there's one home, not two. Tree stays working: inbox-zero reverts to routing-only, and the new workflow is callable but not yet wired to the loop.
+
+** Phase 2 — Wire the two callers
+Add the loop caller's chain step (after inbox-zero Phase D, invoke work-the-backlog with the tag query + file-only + cap 1) and the "fix speedrun" preset (explicit list + autonomous-commit + always-push + paging-on). Both go through the same workflow. Tree stays working: each caller is independently testable.
+
+** Phase 3 — File-only vs autonomous-commit gate
+Implement the commit-autonomy branch: read the per-project waiver, degrade =autonomous-commit= to =file-only= when absent, surface the degrade. Tree stays working: default file-only behavior is the safe path even before the waiver-read lands.
+
+** Phase 4 — Guardrails and the page
+Implement the data-loss / design-decision refusal, the VERIFY-on-ambiguity filing, and the end-of-set =notify alarm ... --persist= page. Tree stays working: guardrails only ever *reduce* what runs, so adding them can't break a passing run.
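Editorial sketch (not part of the commit): the Phase 4 refusals, together with the eligibility gate from the Design section, amount to two filters that can only shrink the set of tasks that run. The `Task` fields below (`needs_input`, `data_loss_risk`, `underspecified`, `est_minutes`) are hypothetical stand-ins for judgments the workflow makes during the scope read; the spec does not define them as stored data. Treating anything over 30 minutes as too large is also an assumption, following the "when unsure, file" rule.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Task:
    heading: str
    status: str = "TODO"
    tags: FrozenSet[str] = frozenset()
    needs_input: bool = False      # design decision or Craig's input needed
    data_loss_risk: bool = False   # irreversible change without a checkpoint
    underspecified: bool = False   # unclear, or already satisfied
    est_minutes: int = 15          # rough estimate from the scope read


def is_eligible(task: Task) -> bool:
    """Eligibility gate with the default tag reading: next OR (quick AND solo)."""
    if task.status != "TODO":      # VERIFY, DOING, DONE, CANCELLED never run
        return False
    return "next" in task.tags or {"quick", "solo"} <= task.tags


def act_or_file(task: Task) -> Optional[str]:
    """Return None to implement, otherwise the defer reason to file."""
    if task.needs_input:
        return "needs-input"
    if task.data_loss_risk:
        return "data-loss"
    if task.underspecified:
        return "underspecified"
    if task.est_minutes > 30:      # assumed cutoff: when unsure, file
        return "too-large"
    return None
```

Because both functions only ever reject, adding a new refusal reason cannot turn a previously refused task into one that runs, which matches the "guardrails only ever reduce what runs" claim above.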
+
+** Phase 5 — Metrics log
+Append the per-task JSONL record at each task outcome. Tree stays working: logging is a side effect that doesn't alter execution.
+
+** Phase 6 — Synthesis to org-roam
+Write the synthesis step: read the JSONL union, compute the per-run and trend metrics (below), write a KB node under =~/org/roam/agents/= per the knowledge-base rule, personal-projects-only classification enforced. Tree stays working: synthesis is read-only over the logs plus a KB write.
+
+* Acceptance criteria
+- [ ] =work-the-backlog.org= exists and is the only home for the execution loop; =inbox-zero.org= is back to its A-D routing-only shape with no Phase E.
+- [ ] The loop caller chains into work-the-backlog after routing; startup and wrap-up never invoke it.
+- [ ] "fix speedrun" runs as the preset (autonomous-commit + always-push + end-page) over an explicit ordered list, one commit per task.
+- [ ] A task tagged for autonomous execution but at status =VERIFY= / =DOING= / =DONE= / =CANCELLED= is skipped by the gate.
+- [ ] The eligibility tag set is read from the project's =todo.org= scheme header, not hardcoded.
+- [ ] In a project without the commit waiver, an =autonomous-commit= request degrades to file-only and says so; no commit is made.
+- [ ] A task carrying data-loss risk or needing a design decision is refused with a filed VERIFY, not implemented.
+- [ ] An underspecified / already-satisfied task files a VERIFY noting why and the run continues.
+- [ ] The run stops at the per-run cap and pages with the remaining tasks listed.
+- [ ] Each task outcome appends one JSONL record to =.ai/metrics/work-the-backlog.jsonl=.
+- [ ] The synthesis step reads the logs and writes a KB node under =~/org/roam/agents/=; it refuses to write for work-classified projects.
+
+* Effectiveness measurement
+
+This section answers Craig's explicit ask: measure whether autonomous-batch execution is actually effective, and build the "gather data → org-roam articles" loop.
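Editorial sketch (not part of the commit): the Phase 5 append and the matching acceptance criterion could be as small as the helper below. The field names come from this spec's per-run metrics table; the function name `append_record`, its keyword arguments, and the choice of a UTC ISO timestamp are assumptions.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_record(log_path, *, run_id, project, caller, task, outcome,
                  defer_reason="", wall_clock_s=0.0, commit_sha="",
                  review_findings=0):
    """Append one JSON object as one line to the metrics log; return it."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "run_id": run_id,              # shared by all tasks in one run
        "project": project,
        "caller": caller,              # "loop" or "fix-speedrun"
        "task": task,
        "outcome": outcome,
        "defer_reason": defer_reason,  # empty unless the task was deferred
        "wall_clock_s": wall_clock_s,
        "commit_sha": commit_sha,      # empty unless committed
        "review_findings": review_findings,
    }
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" keeps the log append-only; one record per line (JSONL).
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```

A caller would invoke it once per task outcome, for example with `log_path=".ai/metrics/work-the-backlog.jsonl"` and a `run_id` generated once per run via `uuid.uuid4()`.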
+
+** What "effective" means here
+
+The autonomy is effective if it completes real work that *stays* completed — i.e. tasks land green and the next session doesn't have to undo or fix them. The two failure modes to catch are (1) the loop defers everything (over-cautious, no value delivered) and (2) the loop implements badly (commits that get reverted or hand-corrected next session). Both are measurable.
+
+** Per-run metrics (the JSONL record)
+
+One record per task, appended to =.ai/metrics/work-the-backlog.jsonl= at each task outcome:
+
+| Field             | Meaning                                                            |
+|-------------------+--------------------------------------------------------------------|
+| =ts=              | ISO timestamp of the task outcome                                  |
+|-------------------+--------------------------------------------------------------------|
+| =run_id=          | UUID shared by all tasks in one run                                |
+|-------------------+--------------------------------------------------------------------|
+| =project=         | project basename                                                   |
+|-------------------+--------------------------------------------------------------------|
+| =caller=          | =loop= or =fix-speedrun=                                           |
+|-------------------+--------------------------------------------------------------------|
+| =task=            | task heading (slug)                                                |
+|-------------------+--------------------------------------------------------------------|
+| =outcome=         | implemented-committed / implemented-diff / deferred-verify /       |
+|                   | deferred-too-large / skipped-ineligible                            |
+|-------------------+--------------------------------------------------------------------|
+| =defer_reason=    | for deferrals: needs-input / data-loss / underspecified / too-large |
+|-------------------+--------------------------------------------------------------------|
+| =wall_clock_s=    | seconds from task start to outcome                                 |
+|-------------------+--------------------------------------------------------------------|
+| =commit_sha=      | for committed tasks; empty otherwise                               |
+|-------------------+--------------------------------------------------------------------|
+| =review_findings= | count of /review-code Critical+Important findings on this task     |
+|-------------------+--------------------------------------------------------------------|
+
+Per-run rollups computed at synthesis (not stored per record): tasks attempted, completed, VERIFY-deferred, reverted; wall-clock total; commits landed; review findings per commit.
+
+** The corrections signal (the key metric)
+
+The hardest and most valuable metric is *human corrections in the following session* — did Craig revert or hand-fix an autonomous commit? v1 captures the cheap proxy: at synthesis, for each =commit_sha=, check whether a later commit touching the same files reverted it or carries a "fix"/"revert" of that change within N days. A clean run is one where the autonomous commits survive untouched. (Auto-detecting "this later commit corrected that autonomous one" precisely is a vNext refinement; the proxy — reverted-or-touched-soon-after — is good enough to flag a problem run for human review.)
+
+** Where the data lands
+
+Per-project git-tracked JSONL at =.ai/metrics/work-the-backlog.jsonl=. Append-only, =jq=-queryable, survives across sessions and machines via the normal project sync. Git-tracked so the history is auditable and the synthesis step can read it from any clone.
+
+** The synthesis loop (gather → article)
+
+On the "synthesize backlog metrics" trigger (and optionally a weekly scheduled run):
+
+1. Read the JSONL union across the personal projects the synthesizer can see.
+2. Compute the rollups and the trend: completion rate over time, defer-reason distribution, review-findings-per-commit trend, and the corrections-signal flag count.
+3. Write one org-roam KB node under =~/org/roam/agents/YYYYMMDDHHMMSS-backlog-metrics-<window>.org= per the knowledge-base rule — filetags =:agent:metrics:=, a concise title, the rollup table, the trend narrative, and =[[id:...]]= links to prior synthesis nodes so the series is traceable.
+4. Enforce the KB write-classification: *personal projects only*. A work-classified project's metrics never write to the KB — they stay in that project's own =.ai/metrics/= log and the synthesizer reports the refusal per the KB refusal contract.
+
+The KB node is the artifact Craig reviews later — "are the autonomous runs completing more and getting corrected less over the last month?" reads off the trend table without re-querying raw logs.
+
+* Readiness dimensions
+
+- *Data model & ownership:* The task set is read from =todo.org= (project-owned, user-authored). The metrics JSONL is generated, append-only, git-tracked, project-owned. KB nodes are agent-generated under =~/org/roam/agents/= (never overwriting Craig's hand-authored nodes — link only). No editable region is co-owned.
+- *Errors, empty states & failure:* Empty task set → report "nothing eligible" and stop. Malformed scheme header → fall back to the default tag reading and surface the fallback. A task that fails mid-implementation → leave the tree working (don't commit a broken state), record the failure outcome, surface it, continue to the next task. No silent data loss: the data-loss guardrail refuses irreversible tasks outright.
+- *Security & privacy:* Tasks touching credentials or external mutations are excluded by the data-loss / external-state guardrail. The KB write is personal-projects-only; work metrics never leave the project. No secrets in the JSONL (task slugs and SHAs only).
+- *Observability:* The end-of-set page surfaces the run outcome. The per-task surface (implemented / deferred + reason / skipped) is the live progress view. The metrics log + KB synthesis is the long-run observability. A bad run is isolable from the JSONL (which task, which outcome, which review findings).
+- *Performance & scale:* Expected counts are small — a handful of tasks per run, one run per 30-min tick. No bottleneck at this scale. The cap bounds the worst case. Synthesis over months of JSONL is still a small file (one record per task).
+- *Reuse & lost opportunities:* Reuses =todo-format.md= for task close, =/review-code= and =/voice personal= for the quality bar, =notify= for paging, the knowledge-base rule for KB writes, the per-project waiver for commit-autonomy. No new config file (the opt-in rides the existing waiver). The execution loop is the one new shared asset.
+- *Architecture fit & weak points:* Integration points — inbox-zero loop caller (chain after Phase D), the per-project waiver location, =todo.org= scheme header, =~/org/roam/agents/=. Weak point: the commit-autonomy gate depends on deterministically reading the waiver; mitigated by defaulting to file-only when the read is ambiguous (fail safe, not open). Second weak point: a 30-min loop committing unattended; mitigated by the hard cap and file-only default.
+- *Config surface:* Per-project — commit-autonomy opt-in (via existing waiver), optional loop-commit flag, optional autonomous-safe tag override in the scheme header. Per-call — task set, session mode, run cap. Defaults: file-only, paging-off (loop) / paging-on (fix-speedrun), cap 1 (loop).
+- *Documentation plan:* The workflow file itself is the user/operator doc (matches inbox-zero.org's self-documenting style). The =.emacs.d= stopgap note and the fix-speedrun proposal are superseded by this spec; no separate migration doc needed beyond removing the Phase E text.
+- *Dev tooling:* N/A for new build targets — the workflows are prose, exercised by invocation. The metrics JSONL is =jq=-inspectable by hand; a tiny rollup helper may be added under =.ai/scripts/= if the synthesis prose proves to need it (decided at Phase 6, not a v1 prerequisite).
+- *Rollout, compatibility & rollback:* Rollout is removing Phase E from inbox-zero and adding work-the-backlog — both prose changes, instantly reversible. Compatibility: inbox-zero's three callers are unchanged except the loop caller gaining a forward chain. Rollback: delete work-the-backlog and the loop chain step; inbox-zero is already back to A-D. The file-only default means the worst pre-rollback state is surfaced diffs, not committed changes.
+- *External APIs & deps:* =notify alarm "Page" "<msg>" --persist= verified against =/home/cjennings/.local/bin/notify= and the page-me workflow. =~/org/roam/= KB write path and node shape verified against the knowledge-base rule. No external API calls.
+
+* Risks, Rabbit Holes, and Drawbacks
+
+- *The corrections signal is a proxy, not ground truth.* "A later commit touched the same files" over-counts (legitimate follow-up work) and under-counts (a correction in a different file). It's a flag for human review, not a verdict. Don't rabbit-hole on making it precise in v1 — the proxy plus a human glance is the design.
+- *Waiver detection drift.* If the per-project waiver location moves or its format changes, the commit-autonomy gate could mis-read. Mitigation: fail safe to file-only. Pin the waiver format in the Phase 3 decision before building.
+- *Unattended-commit blast radius.* The headline risk. Mitigated three ways: file-only default, the hard cap, and the data-loss guardrail. The metrics loop is the fourth layer — it makes a bad run visible after the fact even if the first three let something through.
+- *Scope creep into /start-work territory.* The temptation to let "≤ 30 min" stretch. The act-vs-file gate and the "when unsure, file" rule are the brake; keep them strict.
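Editorial sketch (not part of the commit): the corrections proxy named in the first risk above can be expressed over plain commit records, which also makes its over-counting and under-counting easy to see. The `Commit` record, the function name `flag_corrections`, and the 7-day default window are hypothetical; the spec leaves N unspecified and does not say how commit history is read.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, FrozenSet, Iterable, List


@dataclass(frozen=True)
class Commit:
    sha: str
    when: datetime
    files: FrozenSet[str]
    subject: str


def flag_corrections(autonomous_shas: Iterable[str],
                     history: List[Commit],
                     window_days: int = 7) -> Dict[str, List[str]]:
    """Map each autonomous commit to later commits that look like corrections.

    Proxy only: a later commit is flagged when it lands within the window,
    touches at least one of the same files, and its subject mentions
    "fix" or "revert". It is a prompt for human review, not a verdict.
    """
    by_sha = {c.sha: c for c in history}
    flagged: Dict[str, List[str]] = {}
    for sha in autonomous_shas:
        base = by_sha.get(sha)
        if base is None:
            continue  # sha not in this clone's history; nothing to check
        deadline = base.when + timedelta(days=window_days)
        hits = [
            c.sha for c in history
            if c.sha != sha
            and base.when < c.when <= deadline
            and c.files & base.files
            and any(word in c.subject.lower() for word in ("fix", "revert"))
        ]
        if hits:
            flagged[sha] = hits
    return flagged
```

A correction made in a different file is missed by the `c.files & base.files` test, and ordinary follow-up work with "fix" in its subject is flagged, which is exactly the imprecision the risk describes.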
+
+* Testing / Verification / Rollout
+
+Verification is by invocation against a project's real =todo.org=: run the loop caller in file-only mode and confirm it surfaces diffs without committing; run fix-speedrun against a small explicit list in a waiver-carrying project and confirm one commit per task + the end page; plant a =VERIFY=-status task and a data-loss task and confirm both are skipped/refused; confirm the JSONL grows one record per task; run synthesis and confirm a KB node lands (personal project) or is refused (work project). Rollout is the Phase 1-6 sequence, each leaving the tree working; the file-only default makes early phases safe to ship before the commit and paging phases land.
+
+* References / Appendix
+
+- [[file:../../working/inbox-zero-phase-e/proposed-inbox-zero.org][Phase E proposal (inbox-zero stopgap)]] and [[file:../../working/inbox-zero-phase-e/sender-note.org][its sender note with the 5 open questions]].
+- [[file:2026-06-15-fix-speedrun-workflow-proposal.org][fix-speedrun proposal]].
+- [[file:../../.ai/workflows/inbox-zero.org][inbox-zero.org (canonical, A-D)]] — the routing workflow this feature decouples from.
+- =~/code/rulesets/claude-rules/knowledge-base.md= — the org-roam write contract the synthesis step follows.
+
+* Review and iteration history
+** 2026-06-16 Tue — author
+- What: initial draft reconciling the Phase E and fix-speedrun proposals into one work-the-backlog.org feature, plus the effectiveness-measurement instrumentation.
+- Why: two overlapping proposals arrived within a day; building them separately would duplicate the execution loop and let it drift. Craig also asked explicitly for measurement + org-roam synthesis.
+- Artifacts: this spec; the two source proposals under docs/design/ and working/inbox-zero-phase-e/.
