#+TITLE: Autonomous-Batch Task Execution — Spec
#+AUTHOR: Craig Jennings & Claude
#+DATE: 2026-06-16
#+TODO: TODO | DONE SUPERSEDED CANCELLED

* Metadata
| Status   | draft                                                              |
|----------+--------------------------------------------------------------------|
| Owner    | Craig Jennings                                                     |
|----------+--------------------------------------------------------------------|
| Reviewer | Craig Jennings                                                     |
|----------+--------------------------------------------------------------------|
| Date     | 2026-06-16                                                         |
|----------+--------------------------------------------------------------------|
| Related  | [[file:../../working/inbox-zero-phase-e/proposed-inbox-zero.org][Phase E proposal]]; [[file:2026-06-15-fix-speedrun-workflow-proposal.org][fix-speedrun proposal]] |
|----------+--------------------------------------------------------------------|

* Summary

Two proposals arrived within a day of each other describing the same capability: have Claude work a batch of small, well-marked tasks autonomously, with a full quality bar per task and no per-step approval gate. The inbox-zero "Phase E" proposal drives it from a tag/priority query on a recurring loop; the "fix speedrun" proposal drives it from an explicit ordered list a human dictates in-session. This spec reconciles both into one feature: a single dedicated workflow, =work-the-backlog.org=, that holds the task-execution logic, with two thin callers feeding it. It also designs the instrumentation that measures whether the autonomy is actually paying off.

* Problem / Context

Craig has a standing backlog of small, solo-doable fixes across several projects, already marked with a tag convention (=:next:=, =:quick:+:solo:=). Doing them by hand one at a time is the bottleneck — each is 30 minutes or less, but the context-switch and the per-commit approval ceremony dominate the actual work. He wants Claude to burn these down unattended: on a recurring loop for the routed inbox case, and on demand when he batches a named list and says "fix speedrun, no approvals until done."

Two separate proposals tried to answer this:

- *Phase E* (in =inbox-zero.org=, edited in =.emacs.d= as a stopgap) bolted autonomous execution onto the inbox-zero workflow's on-demand and loop callers. The sender flagged the seam as the open question: coupling capture-routing with autonomous-implementation pollutes inbox-zero's three existing callers (startup, wrap-up, on-demand), two of which must never execute anything.
- *fix speedrun* (a =.emacs.d= theme-studio session that worked well) is the same execution loop driven by an explicit ordered task set, with end-of-set paging and always-push.

They overlap almost entirely. The execution loop — eligibility gate, act-vs-file decision, per-task quality bar, bounded run — is identical. Only the *input* differs (tag query vs explicit list) and the *session mode* differs (loop default vs no-approvals + always-push + page). Building them as two features would duplicate the execution logic and let the two copies drift. The forces: keep inbox-zero's callers clean, share one execution loop, and make the autonomy safe enough to run unattended on a 30-minute timer without Craig watching.

A second, explicit ask from Craig: instrument this so its effectiveness is measurable. "Gather data on this and create some org-roam articles we can look at later." Autonomous execution that silently makes bad commits is worse than no autonomy, and the only way to tell the two apart is to track, over time, tasks completed vs deferred vs reverted, plus human corrections in the following session.

* Goals and Non-Goals

** Goals
- One workflow, =work-the-backlog.org=, owns the task-execution loop. Both input shapes (tag query, explicit list) and both session modes feed it.
- inbox-zero's three existing callers stay clean: the loop caller chains into =work-the-backlog= *after* routing; startup and wrap-up never touch it.
- "fix speedrun" is a thin named preset, not a second implementation: no-approvals session mode + always-push + end-of-set page, feeding an explicit ordered list.
- Commit autonomy defaults to file-only (surface a diff, no auto-commit). A project opts into autonomous commit+push explicitly via its per-project waiver.
- Hard guardrails: refuse to speedrun any task that needs a design decision or carries data-loss risk without a checkpoint; file a =VERIFY= and move on rather than guess-implement an underspecified task; enforce a per-run cap (the kill switch) that holds independently of the loop's "one task per run" default.
- A lightweight per-run metrics log plus a periodic synthesis step that writes org-roam KB articles summarizing the trend.

** Non-Goals
- *Not* a replacement for =/start-work=. Tasks needing deliberation, design, or an hour-plus stay with =/start-work= and its approval gates. This feature only touches the small, marked, solo set.
- *Not* a new tag convention. It reads the project's own priority/tag scheme header; it never invents or hardcodes tags across projects.
- *Not* an inbox-routing change. =inbox-zero.org= keeps its A-D phases. The Phase E text added in =.emacs.d= as a stopgap is *removed* and its logic moves here.
- *Not* a multi-project orchestrator. One run works one project's backlog. Cross-project handoff stays with =inbox-send= and the paging reply.
- *Not* a credential-handling or external-API feature. Tasks that touch secrets or external mutations are out of the eligible set by the guardrail.

** Scope tiers
- *v1:* =work-the-backlog.org=; the eligibility gate reading the project's scheme header; the act-vs-file decision with VERIFY-on-ambiguity; file-only commit default with per-project opt-in; the loop caller wiring and inbox-zero Phase E removal; the "fix speedrun" preset with end-of-set =notify --persist= page; the per-run metrics log (structured JSONL).
- *Out of scope:* a token-budget kill switch (cap is a task count in v1); cross-project batch runs; a dashboard or live UI over the metrics.
- *vNext (log to todo.org):* the periodic org-roam synthesis step if it doesn't make v1; a token/cost budget alongside the task-count cap; auto-detection of "human corrected my autonomous commit" from the next session's diff.

* Design

** Overview

The architecture is one execution workflow with two callers and one preset, plus an instrumentation sidecar.

#+begin_example
  inbox-zero loop caller  ──(after Phase D routing)──┐
                                                     ├──▶  work-the-backlog.org  ──▶ metrics log (JSONL)
  "fix speedrun" preset  ──(explicit ordered list)───┘                                      │
       = no-approvals + always-push + end-page                                              ▼
                                                                          periodic synthesis ──▶ org-roam KB articles
#+end_example

=work-the-backlog.org= is the only place the execution loop lives. It takes a *task set* (however assembled) and a *session mode* (which gates commit autonomy and paging), and works the set under a fixed safety contract. The two callers differ only in how they build the task set and which session mode they pass.

This is the seam the Phase E sender asked for: separating capture-routing (inbox-zero) from autonomous-implementation (work-the-backlog) keeps inbox-zero's startup and wrap-up callers — which must never execute anything — untouched. The loop caller is the only one of inbox-zero's callers that chains forward into execution, and it does so as an explicit second step after routing completes, not as a phase buried inside inbox-zero.

** The execution loop (two-altitude: caller's view)

A caller hands =work-the-backlog= three things:

1. *A task set* — either an explicit ordered list of task headings (fix speedrun), or the result of a tag/priority query against =todo.org= (the loop). The workflow does not care which; it receives an ordered list of candidate tasks.
2. *A session mode* — =file-only= (default) or =autonomous-commit= (requires the project's per-project waiver), and a paging flag.
3. *A run cap* — the maximum number of tasks to complete this run.

It returns a per-task outcome (=implemented-committed=, =implemented-diff-surfaced=, =deferred-verify=, =deferred-too-large=, or =skipped-ineligible=) and a metrics record per task.
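
A minimal sketch of that interface, assuming a Python rendering of what is really a prose workflow. The names =RunRequest= and =Outcome= are hypothetical, and the hyphenated outcome strings are one consistent spelling of the five outcomes listed above.

#+begin_src python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    """Per-task outcomes the workflow reports back to its caller."""
    IMPLEMENTED_COMMITTED = "implemented-committed"
    IMPLEMENTED_DIFF_SURFACED = "implemented-diff-surfaced"
    DEFERRED_VERIFY = "deferred-verify"
    DEFERRED_TOO_LARGE = "deferred-too-large"
    SKIPPED_INELIGIBLE = "skipped-ineligible"


@dataclass(frozen=True)
class RunRequest:
    """The three things a caller hands over (hypothetical shape)."""
    tasks: tuple                    # ordered candidate task headings
    commit_mode: str = "file-only"  # or "autonomous-commit" (needs the waiver)
    paging: bool = False            # end-of-set page on or off
    run_cap: int = 1                # maximum tasks to complete this run

    def __post_init__(self):
        if self.commit_mode not in ("file-only", "autonomous-commit"):
            raise ValueError(f"unknown commit mode: {self.commit_mode}")
        if self.run_cap < 1:
            raise ValueError("run cap must be at least 1")
#+end_src

Defaulting =commit_mode= to =file-only= and =run_cap= to 1 mirrors the loop caller's safe defaults, so a caller has to ask explicitly for anything more permissive.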

** The execution loop (implementer's view)

For the task set, in order, until the run cap is hit:

1. *Eligibility gate* (below). Ineligible → record =skipped-ineligible=, next task.
2. *Scope read* of the relevant code. Cheap; just enough to make the act-vs-file call.
3. *Act-vs-file decision* (below). File → record the deferral reason, next task.
4. *Implement* under the project's commit discipline: TDD red→green→refactor, then =/review-code --staged=, fix all Critical/Important findings, then close the task per =todo-format.md=.
5. *Commit autonomy branch:*
   - =file-only= → surface the diff, do *not* commit. Record =implemented-diff-surfaced=.
   - =autonomous-commit= → =/voice personal= on the message, commit individually, push per the project's flow. Record =implemented-committed=.
6. *Record metrics* for the task (the JSONL append, below).
7. Decrement the cap. At zero, stop.

After the set: if the paging flag is set, fire the end-of-set page (below). Surface the run summary.
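
The steps above can be sketched as a single function. This is an illustration under stated assumptions, not the workflow itself: =is_eligible=, =decide=, and =implement= are hypothetical callables standing in for the gate, the act-vs-file decision, and the TDD-plus-review step, and the returned list stands in for the metrics append. The sketch reads step 7 as counting only implemented tasks against the cap, since skipped and filed tasks jump to the next task before reaching it.

#+begin_src python
def work_the_backlog(tasks, is_eligible, decide, implement, commit_mode, run_cap):
    """Work an ordered task set and return a list of (task, outcome) pairs."""
    results = []
    remaining = run_cap
    for task in tasks:
        if remaining <= 0:
            break  # cap reached: stop, leave the rest for the next run
        if not is_eligible(task):
            results.append((task, "skipped-ineligible"))
            continue
        verdict = decide(task)  # "implement" or a deferral outcome string
        if verdict != "implement":
            results.append((task, verdict))
            continue
        implement(task)  # TDD, review, task close (not modeled here)
        if commit_mode == "autonomous-commit":
            results.append((task, "implemented-committed"))
        else:
            # file-only: the diff is surfaced, nothing is committed
            results.append((task, "implemented-diff-surfaced"))
        remaining -= 1
    return results
#+end_src

With a cap of 1, the loop still walks past ineligible and deferred tasks until it implements one, which matches "implement the highest-priority eligible candidate, record, then stop."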

** Eligibility gate

A task is autonomous-safe when *all* hold:

1. *Status is =TODO=* — never =VERIFY=, =DOING=, =DONE=, or =CANCELLED=. =VERIFY= is the "awaiting Craig's manual confirmation" marker; auto-implementing one defeats the manual check it represents.
2. *Tagged per the project's autonomous-safe set* — resolved by reading the project's priority/tag scheme header at the top of its =todo.org=, not by hardcoding. The default reading is =:next:= OR both =:quick:= AND =:solo:=, but a project whose scheme declares a different autonomous-safe tag set overrides that.
3. *Solo-doable* — no input or undecided judgment call from Craig.
4. *Roughly 30 minutes or less* of focused work.
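
The four conditions can be expressed as one predicate. This is a sketch only: the =Task= fields are hypothetical, and =needs_input= and =estimate_min= stand in for judgment calls the workflow makes while reading the task, not for data stored in =todo.org=. The tag rule is modeled as a list of alternatives, each a set of tags that must all be present, so the default reading is "=next=, or both =quick= and =solo=."

#+begin_src python
from dataclasses import dataclass

# Default reading: :next: OR (:quick: AND :solo:)
DEFAULT_SAFE_TAGS = [{"next"}, {"quick", "solo"}]


@dataclass
class Task:
    heading: str
    status: str
    tags: frozenset
    needs_input: bool = False  # hypothetical: needs Craig's input or judgment
    estimate_min: int = 30     # hypothetical: rough size of focused work


def is_autonomous_safe(task, safe_tag_sets=None, max_minutes=30):
    """True only when every eligibility condition holds."""
    # A project's scheme header may override the default alternatives.
    alternatives = safe_tag_sets or DEFAULT_SAFE_TAGS
    return (
        task.status == "TODO"
        and any(required <= task.tags for required in alternatives)
        and not task.needs_input
        and task.estimate_min <= max_minutes
    )
#+end_src

Passing =safe_tag_sets= models a project whose scheme header declares its own autonomous-safe set; omitting it falls back to the default reading.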

** Act-vs-file decision (the guardrail)

After the scope read, for each eligible candidate:

- *Clear, bounded, solo, ≤ ~30 min* → implement.
- *Needs a design decision, Craig's input, or discussion* → do NOT implement. File a one-line note on the task naming the input it needs; surface it.
- *Carries data-loss risk without a checkpoint* (deletes data, rewrites persisted state, touches external/shared state irreversibly) → do NOT implement. File a =VERIFY= explaining the risk; surface it.
- *Underspecified or already-satisfied* → do NOT guess-implement. File a =VERIFY= noting why (the fix-speedrun "raise max spans to 5 — every cap was already 8" case) and move on.
- *An hour or more* → do NOT implement. File and surface as a =/start-work= candidate.

When unsure which side a task falls on, file rather than implement. A wrong auto-implement costs more than a deferred task — it costs a revert *and* the human correction in the next session that the metrics are designed to catch.
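
The decision table can be sketched as an ordered chain of refusals with "implement" as the last resort. The function name, its flags, and the action labels are hypothetical. Two choices here are assumptions rather than anything this spec states: data-loss risk is checked first, and anything estimated above 30 minutes is filed as too large, which applies the "when unsure, file" rule to the gap between 30 minutes and an hour.

#+begin_src python
def act_or_file(*, needs_input=False, data_loss_risk=False,
                underspecified=False, estimate_min=30):
    """Return (action, defer_reason); defer_reason is None when implementing."""
    if data_loss_risk:
        return ("file-verify", "data-loss")       # VERIFY explaining the risk
    if needs_input:
        return ("file-note", "needs-input")       # one-line note naming the input
    if underspecified:
        return ("file-verify", "underspecified")  # covers already-satisfied too
    if estimate_min > 30:
        return ("file-start-work", "too-large")   # /start-work candidate
    return ("implement", None)
#+end_src

Because every refusal is tested before the implement branch, a task with any risk flag set can never fall through to implementation.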

** Session modes and the "fix speedrun" preset

Two orthogonal session-mode dimensions feed the loop:

- *Commit autonomy:* =file-only= (default) or =autonomous-commit=. =autonomous-commit= is honored only when the project carries the per-project waiver (=.emacs.d= and =rulesets= have it; most projects do not). Absent the waiver, a request for =autonomous-commit= degrades to =file-only= and says so.
- *Paging:* on or off. End-of-set only.
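
The degrade rule for commit autonomy is small enough to sketch directly. =resolve_commit_mode= is a hypothetical helper, and =has_waiver= stands in for however the workflow ends up detecting the per-project waiver.

#+begin_src python
def resolve_commit_mode(requested, has_waiver):
    """Return (effective_mode, notice); notice is None when nothing changed."""
    if requested == "autonomous-commit" and not has_waiver:
        notice = ("autonomous-commit requested but the project has no waiver; "
                  "degraded to file-only")
        return "file-only", notice
    return requested, None
#+end_src

Returning the notice alongside the mode keeps the "and says so" half of the rule from being dropped: the caller always has the text to surface.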

"fix speedrun" is the named preset that combines =autonomous-commit=, always-push, and paging-on, fed an *explicit ordered list*. It is not a separate code path; it is a label for that combination of mode flags plus the explicit-list input. The loop caller, by contrast, runs =file-only= (unless the project has the waiver and opts the loop into commits) with paging off, fed the *tag query*.

** Bounding the run and the kill switch

Default cap: one task per run for the loop caller — implement the highest-priority eligible candidate (=[#A]= before =[#B]= before =[#C]=), record, then stop and let the next tick continue. The fix-speedrun preset works the whole explicit list in order (the human bounded it by naming it), still one commit per task.

The kill switch is a hard per-run task cap passed by the caller, independent of "one per run": even fix-speedrun stops at the cap and pages with the remainder listed. A loop that fires every 30 minutes and commits unattended needs a ceiling that a runaway can't exceed.
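
For the loop caller, the priority ordering could look like the sketch below. It assumes headings carry Org priority cookies such as =[#A]=, and it places headings with no cookie after =[#C]=, which is an assumption this spec does not make. =order_by_priority= is a hypothetical name. The fix-speedrun preset would skip this step, since it works its list in the order given.

#+begin_src python
import re

PRIORITY_COOKIE = re.compile(r"\[#([A-C])\]")


def order_by_priority(headings):
    """Sort headings [#A] before [#B] before [#C]; unprioritized go last."""
    def rank(heading):
        match = PRIORITY_COOKIE.search(heading)
        return match.group(1) if match else "D"
    # sorted() is stable, so ties keep their original relative order
    return sorted(headings, key=rank)
#+end_src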

** End-of-set paging

When the set is done (or the cap is hit), if paging is on, fire one page — end-of-set only, never per-task:

#+begin_src sh
notify alarm "Page" "<project>: <N> done, <M> remaining — <one-line summary>" --persist
#+end_src

=--persist= keeps it on screen until dismissed (the page-me convention). The message carries the project name, the completed count, and the remaining count, so Craig can reply confirming ready + naming the next project in one turn. The page-signal wrapper removed 2026-06-12 is reconciled to =notify= here — there is no separate page-signal call.

* Alternatives Considered

** Fold execution into inbox-zero (the Phase E stopgap shape)
- Good, because it's the smallest diff — the loop caller already runs inbox-zero, so execution is "one more phase."
- Bad, because it couples capture-routing with implementation. inbox-zero has three callers; startup and wrap-up must never execute. A Phase E inside inbox-zero forces both to carry a "skip Phase E" caveat and risks a future caller running it by accident.
- Neutral, because the eligibility-gate and act-vs-file text is identical either way — only its *home* differs.

** Two separate features (keep Phase E and fix-speedrun distinct)
- Good, because each proposal ships as written with no reconciliation work.
- Bad, because the execution loop is duplicated in two places and will drift; a guardrail tightened in one won't reach the other. Two ways to do autonomous execution is two things to audit.
- Neutral, because the input and session-mode differences are real — but they're thin caller-level differences, not a reason to fork the engine.

** Autonomous-commit as the default
- Good, because it's faster end-to-end with no diff to review.
- Bad, because most projects lack the per-project waiver, and an unattended loop committing to a project that never opted in is exactly the failure the file-only default prevents. The blast radius of a bad autonomous commit is a revert plus lost trust in the loop.
- Neutral, because the projects that *do* want it (=.emacs.d=, =rulesets=) opt in explicitly, so the capability is available where it's wanted without being the default everywhere.

* Decisions [/]

** TODO Where the eligibility gate reads its tag set
- Owner / by-when: Craig / spec-review
- Context: Phase E hardcoded =:next:= / =:quick:+:solo:=. Projects' priority/tag schemes vary, and the =todo-format.md= scheme header is the declared source of truth per project.
- Decision: We will read the project's =todo.org= priority/tag scheme header to resolve the autonomous-safe tag set, defaulting to =:next:= OR =:quick:+:solo:= when the header doesn't declare an explicit autonomous-safe set.
- Consequences: easier — one workflow works correctly across projects with different tag vocabularies; harder — a project with no scheme header (or a malformed one) needs a fallback, and the "default reading" has to be specified precisely enough that two projects agree on it.

** TODO The do-not-auto-implement marker set
- Owner / by-when: Craig / spec-review
- Context: =VERIFY= means "awaiting Craig's manual confirmation" in =.emacs.d= and =rulesets=. Other projects may use =VERIFY= differently or not at all. The gate excludes =VERIFY=, =DOING=, =DONE=, =CANCELLED= by status, but the *marker semantics* are what matter.
- Decision: We will define the do-not-auto-implement set as: any status that is not =TODO=, plus any task carrying a project-declared "hold" marker. The canonical default treats =VERIFY= as do-not-implement; a project overrides only by declaring its marker semantics in its scheme header.
- Consequences: easier — the gate is portable and a project can't accidentally have its manual-check tasks auto-run; harder — requires the scheme header to carry marker semantics, which most don't yet, so the default has to be safe-by-omission (exclude anything not plainly =TODO=).

** TODO Commit-autonomy opt-in mechanism
- Owner / by-when: Craig / spec-review
- Context: =file-only= is the default; =.emacs.d= and =rulesets= have a per-project waiver allowing autonomous commits. Where does the workflow *read* that a project has opted in?
- Decision: We will read the opt-in from the project's existing per-project waiver location (the same place the commit discipline's "no approval gate" waiver lives — =notes.org= Workflow State or =CLAUDE.md=), not introduce a new config file.
- Consequences: easier — no new config surface, reuses the existing waiver concept; harder — the waiver's exact location and format must be pinned so the workflow can detect it deterministically, and a project with the commit waiver but *not* wanting the loop to commit needs a way to say "waiver yes, loop-commit no" (two flags, not one).

** TODO Run-cap default and the kill switch shape
- Owner / by-when: Craig / spec-review
- Context: The loop default is one task per run; fix-speedrun works an explicit list. Both need a hard ceiling a runaway can't exceed.
- Decision: We will pass a hard per-run task cap from the caller (loop default 1; fix-speedrun = length of the explicit list, capped at a ceiling), and stop + page with the remainder when the cap is hit. v1 caps by task count, not token budget.
- Consequences: easier — a simple integer the caller controls; bounded blast radius; harder — a task-count cap doesn't bound *cost* (one 30-min task can burn many tokens), so a token budget is vNext, and until then a pathological task can run long within a single cap slot.

** TODO Metrics log location and format
- Owner / by-when: Craig / spec-review
- Context: Per-run metrics must land somewhere structured and queryable, per-project, and survive across sessions for the synthesis step to read.
- Decision: We will append one JSONL record per task to a per-project log at =.ai/metrics/work-the-backlog.jsonl=, git-tracked, with the synthesis step reading the union across projects.
- Consequences: easier — append-only JSONL is trivial to write and =jq=-queryable; per-project keeps it local to where the work happened; harder — a git-tracked log adds churn to every autonomous run's commit (or needs its own commit), and "union across projects" needs the synthesis step to know where every project's log lives.

** TODO Synthesis cadence and trigger
- Owner / by-when: Craig / spec-review
- Context: Craig wants periodic org-roam articles summarizing the data. What triggers synthesis, and how often?
- Decision: We will run synthesis on an explicit trigger ("synthesize backlog metrics") and optionally a weekly scheduled run, writing one KB node per synthesis under =~/org/roam/agents/= per the knowledge-base rule.
- Consequences: easier — explicit trigger means no surprise writes, and the KB rule already governs node shape; harder — a weekly scheduled run needs a scheduler entry and the KB write-classification (personal-only) must gate it so work-project metrics never land in the KB.

* Implementation phases

** Phase 1 — Extract the execution loop into work-the-backlog.org
Write =work-the-backlog.org= holding the eligibility gate, act-vs-file decision, per-task quality bar, and run-cap logic — taking a task set + session mode + cap as input. Remove the stopgap "Phase E" text from =inbox-zero.org= (restore it to its A-D shape) in the same change so there's one home, not two. Tree stays working: inbox-zero reverts to routing-only, and the new workflow is callable but not yet wired to the loop.

** Phase 2 — Wire the two callers
Add the loop caller's chain step (after inbox-zero Phase D, invoke work-the-backlog with the tag query + file-only + cap 1) and the "fix speedrun" preset (explicit list + autonomous-commit + always-push + paging-on). Both go through the same workflow. Tree stays working: each caller is independently testable.

** Phase 3 — File-only vs autonomous-commit gate
Implement the commit-autonomy branch: read the per-project waiver, degrade =autonomous-commit= to =file-only= when absent, surface the degrade. Tree stays working: default file-only behavior is the safe path even before the waiver-read lands.

** Phase 4 — Guardrails and the page
Implement the data-loss / design-decision refusal, the VERIFY-on-ambiguity filing, and the end-of-set =notify alarm ... --persist= page. Tree stays working: guardrails only ever *reduce* what runs, so adding them can't break a passing run.

** Phase 5 — Metrics log
Append the per-task JSONL record at each task outcome. Tree stays working: logging is a side effect that doesn't alter execution.

** Phase 6 — Synthesis to org-roam
Write the synthesis step: read the JSONL union, compute the per-run and trend metrics (below), write a KB node under =~/org/roam/agents/= per the knowledge-base rule, personal-projects-only classification enforced. Tree stays working: synthesis is read-only over the logs plus a KB write.

* Acceptance criteria
- [ ] =work-the-backlog.org= exists and is the only home for the execution loop; =inbox-zero.org= is back to its A-D routing-only shape with no Phase E.
- [ ] The loop caller chains into work-the-backlog after routing; startup and wrap-up never invoke it.
- [ ] "fix speedrun" runs as the preset (autonomous-commit + always-push + end-page) over an explicit ordered list, one commit per task.
- [ ] A task tagged for autonomous execution but at status =VERIFY= / =DOING= / =DONE= / =CANCELLED= is skipped by the gate.
- [ ] The eligibility tag set is read from the project's =todo.org= scheme header, not hardcoded.
- [ ] In a project without the commit waiver, an =autonomous-commit= request degrades to file-only and says so; no commit is made.
- [ ] A task carrying data-loss risk or needing a design decision is refused and filed (a =VERIFY= for data-loss risk, a one-line needs-input note for a design decision), not implemented.
- [ ] An underspecified / already-satisfied task files a VERIFY noting why and the run continues.
- [ ] The run stops at the per-run cap and pages with the remaining tasks listed.
- [ ] Each task outcome appends one JSONL record to =.ai/metrics/work-the-backlog.jsonl=.
- [ ] The synthesis step reads the logs and writes a KB node under =~/org/roam/agents/=; it refuses to write for work-classified projects.

* Effectiveness measurement

This section answers Craig's explicit ask: measure whether autonomous-batch execution is actually effective, and build the "gather data → org-roam articles" loop.

** What "effective" means here

The autonomy is effective if it completes real work that *stays* completed — i.e. tasks land green and the next session doesn't have to undo or fix them. The two failure modes to catch are (1) the loop defers everything (over-cautious, no value delivered) and (2) the loop implements badly (commits that get reverted or hand-corrected next session). Both are measurable.

** Per-run metrics (the JSONL record)

One record per task, appended to =.ai/metrics/work-the-backlog.jsonl= at each task outcome:

| Field             | Meaning                                                             |
|-------------------+--------------------------------------------------------------------|
| =ts=              | ISO timestamp of the task outcome                                   |
|-------------------+--------------------------------------------------------------------|
| =run_id=          | UUID shared by all tasks in one run                                |
|-------------------+--------------------------------------------------------------------|
| =project=         | project basename                                                    |
|-------------------+--------------------------------------------------------------------|
| =caller=          | =loop= or =fix-speedrun=                                            |
|-------------------+--------------------------------------------------------------------|
| =task=            | task heading (slug)                                                 |
|-------------------+--------------------------------------------------------------------|
| =outcome=         | implemented-committed / implemented-diff-surfaced /                 |
|                   | deferred-verify / deferred-too-large / skipped-ineligible           |
|-------------------+--------------------------------------------------------------------|
| =defer_reason=    | for deferrals: needs-input / data-loss / underspecified / too-large |
|-------------------+--------------------------------------------------------------------|
| =wall_clock_s=    | seconds from task start to outcome                                  |
|-------------------+--------------------------------------------------------------------|
| =commit_sha=      | for committed tasks; empty otherwise                               |
|-------------------+--------------------------------------------------------------------|
| =review_findings= | count of /review-code Critical+Important findings on this task      |
|-------------------+--------------------------------------------------------------------|

Per-run rollups computed at synthesis (not stored per record): tasks attempted, completed, VERIFY-deferred, reverted; wall-clock total; commits landed; review findings per commit.
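
A minimal sketch of the append, assuming Python's standard library and the field names from the table. =append_record= is a hypothetical helper, and the empty-string and zero defaults for optional fields are an assumption (the table only says =commit_sha= is "empty otherwise").

#+begin_src python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_record(log_path, *, run_id, project, caller, task, outcome,
                  defer_reason="", wall_clock_s=0.0, commit_sha="",
                  review_findings=0):
    """Append one task record as a single JSON line and return it."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "run_id": run_id,
        "project": project,
        "caller": caller,
        "task": task,
        "outcome": outcome,
        "defer_reason": defer_reason,
        "wall_clock_s": wall_clock_s,
        "commit_sha": commit_sha,
        "review_findings": review_findings,
    }
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Append mode keeps the log append-only: one record per line.
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return record
#+end_src

One record per line is what keeps the file =jq=-queryable and lets synthesis stream it without parsing the whole log at once.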

** The corrections signal (the key metric)

The hardest and most valuable metric is *human corrections in the following session*: did Craig revert or hand-fix an autonomous commit? v1 captures a cheap proxy. At synthesis, for each =commit_sha=, check whether a later commit within N days either reverts it or touches the same files with a "fix" or "revert" of that change (N is a synthesis parameter this spec leaves open). A clean run is one where the autonomous commits survive untouched. Precisely auto-detecting "this later commit corrected that autonomous one" is a vNext refinement; the proxy (reverted, or touched soon after) is good enough to flag a problem run for human review.
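
One possible reading of the proxy, sketched over plain data rather than live git calls. The input shapes and the name =flag_corrections= are hypothetical, and matching "fix" or "revert" in the commit subject is an assumption about how a correction would be recognized.

#+begin_src python
from datetime import datetime, timedelta


def flag_corrections(autonomous, later_commits, window_days):
    """Return the set of autonomous SHAs that look corrected.

    autonomous:    {sha: (committed_at, set_of_files)}
    later_commits: [{"when": datetime, "subject": str, "files": set}, ...]
    """
    flagged = set()
    for sha, (committed_at, files) in autonomous.items():
        deadline = committed_at + timedelta(days=window_days)
        for commit in later_commits:
            if not (committed_at < commit["when"] <= deadline):
                continue  # outside the N-day window
            subject = commit["subject"].lower()
            looks_like_fix = "fix" in subject or "revert" in subject
            if looks_like_fix and files & commit["files"]:
                flagged.add(sha)
                break
    return flagged
#+end_src

The result is a list of runs worth a human glance, not a verdict: a legitimate follow-up fix in the same file would be flagged too.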

** Where the data lands

Per-project git-tracked JSONL at =.ai/metrics/work-the-backlog.jsonl=. Append-only, =jq=-queryable, survives across sessions and machines via the normal project sync. Git-tracked so the history is auditable and the synthesis step can read it from any clone.

** The synthesis loop (gather → article)

On the "synthesize backlog metrics" trigger (and optionally a weekly scheduled run):

1. Read the JSONL union across the personal projects the synthesizer can see.
2. Compute the rollups and the trend: completion rate over time, defer-reason distribution, review-findings-per-commit trend, and the corrections-signal flag count.
3. Write one org-roam KB node under =~/org/roam/agents/YYYYMMDDHHMMSS-backlog-metrics-<window>.org= per the knowledge-base rule — filetags =:agent:metrics:=, a concise title, the rollup table, the trend narrative, and =[[id:...]]= links to prior synthesis nodes so the series is traceable.
4. Enforce the KB write-classification: *personal projects only*. A work-classified project's metrics never write to the KB — they stay in that project's own =.ai/metrics/= log and the synthesizer reports the refusal per the KB refusal contract.

The KB node is the artifact Craig reviews later — "are the autonomous runs completing more and getting corrected less over the last month?" reads off the trend table without re-querying raw logs.
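
Steps 1 through 4 can be sketched as pure functions over already-loaded records. All names here are hypothetical. Two choices are assumptions: a project with no known classification is refused along with work-classified ones (fail safe), and "attempted" excludes tasks the gate skipped.

#+begin_src python
from collections import Counter
from datetime import datetime


def rollup(records):
    """Completion rate and defer-reason distribution for a set of records."""
    outcomes = Counter(r["outcome"] for r in records)
    completed = sum(n for name, n in outcomes.items()
                    if name.startswith("implemented"))
    attempted = len(records) - outcomes["skipped-ineligible"]
    reasons = Counter(r["defer_reason"] for r in records if r.get("defer_reason"))
    rate = completed / attempted if attempted else 0.0
    return {"attempted": attempted, "completed": completed,
            "completion_rate": rate, "defer_reasons": dict(reasons)}


def kb_node_name(now, window):
    """File name in the YYYYMMDDHHMMSS-backlog-metrics-<window>.org shape."""
    return f"{now:%Y%m%d%H%M%S}-backlog-metrics-{window}.org"


def synthesize(logs_by_project, classification, now, window):
    """Return (node_name, rollup, refused_projects); personal projects only."""
    refused = sorted(p for p in logs_by_project
                     if classification.get(p) != "personal")
    records = [r for project, recs in logs_by_project.items()
               if project not in refused for r in recs]
    return kb_node_name(now, window), rollup(records), refused
#+end_src

Returning the refused projects explicitly gives the synthesizer something concrete to report, so a refusal is visible rather than silent.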

* Readiness dimensions

- *Data model & ownership:* The task set is read from =todo.org= (project-owned, user-authored). The metrics JSONL is generated, append-only, git-tracked, project-owned. KB nodes are agent-generated under =~/org/roam/agents/= (never overwriting Craig's hand-authored nodes — link only). No editable region is co-owned.
- *Errors, empty states & failure:* Empty task set → report "nothing eligible" and stop. Malformed scheme header → fall back to the default tag reading and surface the fallback. A task that fails mid-implementation → leave the tree working (don't commit a broken state), record the failure outcome, surface it, continue to the next task. No silent data loss: the data-loss guardrail refuses irreversible tasks outright.
- *Security & privacy:* Tasks touching credentials or external mutations are excluded by the data-loss / external-state guardrail. The KB write is personal-projects-only; work metrics never leave the project. No secrets in the JSONL (task slugs and SHAs only).
- *Observability:* The end-of-set page surfaces the run outcome. The per-task surface (implemented / deferred + reason / skipped) is the live progress view. The metrics log + KB synthesis is the long-run observability. A bad run is isolable from the JSONL (which task, which outcome, which review findings).
- *Performance & scale:* Expected counts are small — a handful of tasks per run, one run per 30-min tick. No bottleneck at this scale. The cap bounds the worst case. Synthesis over months of JSONL is still a small file (one record per task).
- *Reuse & lost opportunities:* Reuses =todo-format.md= for task close, =/review-code= and =/voice personal= for the quality bar, =notify= for paging, the knowledge-base rule for KB writes, the per-project waiver for commit-autonomy. No new config file (the opt-in rides the existing waiver). The execution loop is the one new shared asset.
- *Architecture fit & weak points:* Integration points — inbox-zero loop caller (chain after Phase D), the per-project waiver location, =todo.org= scheme header, =~/org/roam/agents/=. Weak point: the commit-autonomy gate depends on deterministically reading the waiver; mitigated by defaulting to file-only when the read is ambiguous (fail safe, not open). Second weak point: a 30-min loop committing unattended; mitigated by the hard cap and file-only default.
- *Config surface:* Per-project — commit-autonomy opt-in (via existing waiver), optional loop-commit flag, optional autonomous-safe tag override in the scheme header. Per-call — task set, session mode, run cap. Defaults: file-only, paging-off (loop) / paging-on (fix-speedrun), cap 1 (loop).
- *Documentation plan:* The workflow file itself is the user/operator doc (matches inbox-zero.org's self-documenting style). The =.emacs.d= stopgap note and the fix-speedrun proposal are superseded by this spec; no separate migration doc needed beyond removing the Phase E text.
- *Dev tooling:* N/A for new build targets — the workflows are prose, exercised by invocation. The metrics JSONL is =jq=-inspectable by hand; a tiny rollup helper may be added under =.ai/scripts/= if the synthesis prose proves to need it (decided at Phase 6, not a v1 prerequisite).
- *Rollout, compatibility & rollback:* Rollout is removing Phase E from inbox-zero and adding work-the-backlog — both prose changes, instantly reversible. Compatibility: inbox-zero's three callers are unchanged except the loop caller gaining a forward chain. Rollback: delete work-the-backlog and the loop chain step; inbox-zero is already back to A-D. The file-only default means the worst pre-rollback state is surfaced diffs, not committed changes.
- *External APIs & deps:* =notify alarm "Page" "<msg>" --persist= verified against =/home/cjennings/.local/bin/notify= and the page-me workflow. =~/org/roam/= KB write path and node shape verified against the knowledge-base rule. No external API calls.

* Risks, Rabbit Holes, and Drawbacks

- *The corrections signal is a proxy, not ground truth.* "A later commit touched the same files" over-counts (legitimate follow-up work) and under-counts (a correction in a different file). It's a flag for human review, not a verdict. Don't rabbit-hole on making it precise in v1 — the proxy plus a human glance is the design.
- *Waiver detection drift.* If the per-project waiver location moves or its format changes, the commit-autonomy gate could mis-read. Mitigation: fail safe to file-only. Pin the waiver format in the commit-autonomy opt-in decision before building Phase 3.
- *Unattended-commit blast radius.* The headline risk. Mitigated three ways: file-only default, the hard cap, and the data-loss guardrail. The metrics loop is the fourth layer — it makes a bad run visible after the fact even if the first three let something through.
- *Scope creep into /start-work territory.* The temptation to let "≤ 30 min" stretch. The act-vs-file gate and the "when unsure, file" rule are the brake; keep them strict.

* Testing / Verification / Rollout

Verification is by invocation against a project's real =todo.org=:

- Run the loop caller in file-only mode and confirm it surfaces diffs without committing.
- Run fix-speedrun against a small explicit list in a waiver-carrying project and confirm one commit per task plus the end page.
- Plant a =VERIFY=-status task and a data-loss task and confirm the first is skipped by the gate and the second is refused.
- Confirm the JSONL grows by one record per task.
- Run synthesis and confirm a KB node lands (personal project) or is refused (work project).

Rollout is the Phase 1-6 sequence, each leaving the tree working; the file-only default makes early phases safe to ship before the commit and paging phases land.

* References / Appendix

- [[file:../../working/inbox-zero-phase-e/proposed-inbox-zero.org][Phase E proposal (inbox-zero stopgap)]] and [[file:../../working/inbox-zero-phase-e/sender-note.org][its sender note with the 5 open questions]].
- [[file:2026-06-15-fix-speedrun-workflow-proposal.org][fix-speedrun proposal]].
- [[file:../../.ai/workflows/inbox-zero.org][inbox-zero.org (canonical, A-D)]] — the routing workflow this feature decouples from.
- =~/code/rulesets/claude-rules/knowledge-base.md= — the org-roam write contract the synthesis step follows.

* Review and iteration history
** 2026-06-16 Tue — author
- What: initial draft reconciling the Phase E and fix-speedrun proposals into one work-the-backlog.org feature, plus the effectiveness-measurement instrumentation.
- Why: two overlapping proposals arrived within a day; building them separately would duplicate the execution loop and let it drift. Craig also asked explicitly for measurement + org-roam synthesis.
- Artifacts: this spec; the two source proposals under docs/design/ and working/inbox-zero-phase-e/.