#+TITLE: Work the Backlog
#+AUTHOR: Craig Jennings & Claude
#+DATE: 2026-07-02

* Overview

The single home for the autonomous task-execution loop: take a set of marked, solo-doable tasks from the project's =todo.org= and work them unattended, each held to the full quality bar, under a fixed safety contract. Spec: =rulesets/docs/specs/2026-06-16-autonomous-batch-execution-spec.org=.

Two callers feed it, differing only in how they build the task set and which session mode they pass:

- The *inbox auto-loop* (=inbox.org= auto mode) chains here after its routing completes, with a tag/priority query, file-only mode, cap 1.
- The *no-approvals speedrun* preset feeds an explicit ordered list with autonomous-commit + always-push + paging-on, after a pre-flight Q&A that front-loads every decision.

This workflow owns the execution logic — eligibility gate, defer checklist, quality bar, run cap. Callers own input assembly and mode selection. Capture-routing (inbox surfaces) stays entirely in =inbox.org=; this file never reads an inbox.

* When to Use This Workflow

Invoked by its two callers, or directly by phrase:

- *Speedrun triggers:* "speedrun", "no approvals speedrun", "speedrun these: <task set>" — run the no-approvals speedrun preset (below). The word "speedrun" always routes here, even when the phrase also says "no approvals": plain =no-approvals.org= is the general session mode; the speedrun is this workflow's preset over an explicit task set.
- *Loop caller:* =inbox.org= auto mode chains here after its routing (below). Not phrase-triggered.

Manual fallback: "work the backlog" / "work the backlog with <task set>" — gather the three inputs below (ask for whichever are missing, defaulting to file-only mode; default cap is the list length for an explicit set, 1 for a query) and run the loop.

* Inputs — the caller contract

A caller hands this workflow three things:

1. *A task set* — an ordered list of candidate task headings from the project's =todo.org=. Either an explicit ordered list (speedrun) or the result of a tag/priority query (the loop). The loop does not care how the set was assembled; it receives an ordered list of candidates.
2. *A session mode* — two orthogonal flags:
   - *Commit autonomy:* =file-only= (default) or =autonomous-commit=. See "Commit autonomy" below.
   - *Paging:* on or off. End-of-set only.
3. *A run cap* — the hard maximum number of tasks to complete this run.

It returns a per-task outcome and a run summary.

* Outcomes — the per-task vocabulary

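Every task the run reaches ends in exactly one of the outcomes below; tasks left untouched when the cap stops the run are listed as remaining instead: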

- =implemented-committed= — implemented, committed (and pushed per the project's flow) under =autonomous-commit=.
- =implemented-diff-surfaced= — implemented, diff surfaced, *not* committed (=file-only=).
- =deferred-VERIFY= — a defer-checklist hit; a =VERIFY= filed naming what's missing or risky.
- =dropped-by-craig= — removed from the run at the speedrun pre-flight Q&A ("skip this").
- =skipped-ineligible= — failed the mechanical eligibility gate.
- =failed= — implementation was attempted and abandoned: the tree is left working (never commit a broken state), the failure is surfaced in the run summary, and the run continues to the next task.

The run summary lists each task with its outcome, plus the remaining set when the cap stopped the run.

* The loop

For the task set, in order, until the run cap is hit:

1. *Eligibility gate* (below). Ineligible → record =skipped-ineligible=, next task.
2. *Scope read* of the relevant code. Cheap; just enough to run the defer checklist.
3. *Defer checklist* (below). Any hit → defer: file the =VERIFY= naming the gap and record =deferred-VERIFY= (or, under the speedrun preset, route a quick-question gap to the pre-flight Q&A), next task.
4. *Implement* under the project's commit discipline: TDD red→green→refactor, then =/review-code --staged=, fix all Critical/Important findings, then close the task per =todo-format.md='s completion rules. Decompose into as many logical commits as the change needs — size is not capped. If implementation fails partway, leave the tree working, record =failed=, surface it, and continue to the next task.
5. *Commit autonomy branch:*
   - =file-only= → surface the diff, do *not* commit. Record =implemented-diff-surfaced=.
   - =autonomous-commit= → =/voice personal= on the message, commit individually, push per the project's flow. Record =implemented-committed=.
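6. *Record metrics* for the task (the JSONL append; see Metrics below). A task that left the loop early at step 1, 3, or 4 still gets its record, written when its outcome is decided.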
7. Decrement the cap. At zero, stop.

After the set: if the paging flag is set, fire the end-of-set page (below). Surface the run summary either way.
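
The control flow above can be sketched in POSIX shell. This is a minimal sketch under two assumptions that are not in the spec: =process_task= is a hypothetical stand-in for steps 1-6 (here it reads a canned outcome from the task name so the flow can be exercised), and only implemented tasks spend the cap, which is how the step order reads.

#+begin_src sh
#!/bin/sh
# Hypothetical stand-in for steps 1-6. A real run would do the gate, scope
# read, defer checklist, and implementation here and report the outcome.
process_task() {
  case $1 in
    skip:*)  echo skipped-ineligible ;;
    defer:*) echo deferred-VERIFY ;;
    fail:*)  echo failed ;;
    *)       echo implemented-diff-surfaced ;;
  esac
}

# run_loop CAP TASK...
# Prints "outcome<TAB>task" for each task reached, then "remaining<TAB>task"
# for whatever the cap left untouched. The subshell body keeps variables local.
run_loop() (
  cap=$1; shift
  while [ "$#" -gt 0 ] && [ "$cap" -gt 0 ]; do
    task=$1; shift
    outcome=$(process_task "$task")
    printf '%s\t%s\n' "$outcome" "$task"
    case $outcome in
      # Assumption: skips, deferrals, and failures do not spend the cap.
      implemented-*) cap=$((cap - 1)) ;;
    esac
  done
  for task in "$@"; do
    printf 'remaining\t%s\n' "$task"
  done
)
#+end_src

With a cap of 1, =run_loop 1 'skip:old task' 'fix parser' 'add flag'= would record the skip, implement the second task, and report the third as remaining.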

* Eligibility gate — mechanical, no judgment

A task is autonomous-safe when *both* hold. This layer is a lookup, not a judgment; all the judgment lives in the defer checklist.

1. *Status is =TODO=* — never =VERIFY=, =DOING=, =DONE=, or =CANCELLED=. =VERIFY= marks "awaiting Craig's input"; auto-implementing one defeats the check it represents. The do-not-implement set is safe-by-omission: anything not plainly =TODO= (plus any project-declared "hold" marker) is out.
2. *Tagged =:solo:=* — the autonomy tag, resolved against the project's priority/tag scheme header in =todo.org= (never hardcoded). =:solo:= carries the hard definition in =todo-format.md=: completable and verifiable without Craig beyond at most one or two quick decisions answerable up front, no design deliberation. A project whose scheme declares a different autonomous-safe tag set overrides the default.

Priority and =:next:= drive *ordering* within the eligible set, not eligibility ([#A] before [#B] before [#C], then the author's ordering). =:quick:= is an effort hint for batching and duration estimates — never a gate.
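
One possible way to apply the priority ordering to a list of already-eligible headings, as a sketch only. It assumes the priority cookie is spelled =[#A]= through =[#C]= in the heading line and that a heading without a cookie sorts last (the text above does not say). The =:next:= refinement is left out because its exact rule is not spelled out here. The function name is hypothetical.

#+begin_src sh
#!/bin/sh
# order_by_priority: reads heading lines on stdin, writes them sorted by
# priority cookie, keeping the author's order within each priority.
order_by_priority() {
  awk '{
    p = "Z"                                  # assumption: no cookie sorts last
    if (match($0, /\[#[A-C]\]/)) p = substr($0, RSTART + 2, 1)
    printf "%s\t%d\t%s\n", p, NR, $0         # decorate: priority, input order
  }' |
    sort -t "$(printf '\t')" -k1,1 -k2,2n |
    cut -f3-                                 # undecorate
}
#+end_src

The input line number is carried as a second sort key so the result does not depend on a stable-sort flag.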

Task *size* is deliberately absent from this gate. A large but well-specified, decision-free task is in scope and gets decomposed into per-logical-commit chunks during implementation. Size never sends a task away; only *deliberation* or *risk* does (the checklist below).

*No scheme header → don't run.* The gate reads =:solo:= semantics from the project's scheme header; a =todo.org= without one leaves the tag undefined (=todo-format.md= makes the header mandatory). Surface that the header is missing and stop rather than guessing eligibility.
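
Because the gate is a lookup, it can be expressed as a pure string check on the heading line. The sketch below assumes the usual org layout (stars, keyword, title, trailing =:tag:tag:= block) and takes the autonomy tag as an argument, on the assumption that the caller has already resolved it from the scheme header. The function name =is_eligible= is hypothetical, and the tag is assumed to be plain alphanumerics.

#+begin_src sh
#!/bin/sh
# is_eligible HEADING [TAG]
# Succeeds only when the heading's status is exactly TODO and TAG (default
# "solo") appears in the trailing tag block. Subshell body keeps "tag" local.
is_eligible() (
  tag=${2:-solo}
  # Status: the keyword right after the stars must be TODO.
  printf '%s\n' "$1" | grep -Eq '^\*+[[:space:]]+TODO[[:space:]]' || exit 1
  # Tag: matched as a whole tag at the end of the line, not as a substring.
  printf '%s\n' "$1" |
    grep -Eq "[[:space:]]:([^[:space:]:]+:)*${tag}:([^[:space:]:]+:)*[[:space:]]*\$"
)
#+end_src

For example, =is_eligible '** TODO [#A] Fix parser :solo:quick:'= succeeds, while a =VERIFY= heading carrying the same tags, or a =TODO= tagged only =:quick:=, fails.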

* The defer checklist — act vs file

After the scope read, run each eligible candidate through the checklist. Each item is a concrete, answerable question, not an adjective. *Any* hit — or any "unsure" — defers the task. Only a task that clears every item is implemented.

1. *Test-writability (the keystone).* Can I write the failing test from the task text — plus any decisions gathered up front — without inventing a requirement? *No / unsure* → underspecified. Under the speedrun preset, if the gap is one or two quick answerable questions, route it to the pre-flight Q&A; otherwise file a =VERIFY= noting what's missing. Under the unattended loop, file the =VERIFY= (no one to ask).
2. *Data-loss / irreversible / external operation.* Does implementing it require any of: =rm= of non-scratch data, =git reset --hard= / force-push, =DROP= / =DELETE= / =TRUNCATE=, file truncate/overwrite of persisted content, a schema or data migration, any external or shared-state mutation, any credential touch? *Yes* → do NOT implement; file a =VERIFY= naming the risk. This is the hard safety gate; an upfront answer never overrides it without an explicit checkpoint.
3. *Already-satisfied.* Does the scope read show the desired end-state already holds? *Yes* → file a =VERIFY= noting it and move on. Don't make a no-op change.
4. *Design deliberation.* Does the task carry an unresolved design question, a "weigh these approaches" with real tradeoffs, or a TBD that isn't a quick factual answer? *Yes* → under the speedrun preset, if it collapses to one or two quick questions, route to the pre-flight Q&A; otherwise file and surface as a =/start-work= candidate. Under the loop, file. The discriminator is *quick-answerable question* vs *deliberation* — never task size.

When genuinely unsure which side a task falls on, defer — a wrong auto-implement costs a revert *and* the next-session correction.

** Filing the deferral =VERIFY=

Every checklist hit files a =VERIFY= in the project's =todo.org=, per =todo-format.md='s VERIFY rules:

- *Dedup first.* If a =VERIFY= sibling for this deferral already exists (a prior run filed it), don't file another — record the outcome as =deferred-VERIFY= with a "previously filed" note and move on. The deferred task keeps its =TODO= status and tags, so without this check every subsequent run would re-defer and re-file.
- *Placement:* sibling of the deferred task (the deferred task is the trigger) — a =**= task gets its =VERIFY= at =**=, a =***= sub-task gets it at =***= under the same parent, never deeper.
- *Heading:* carries the question or risk on its own ("VERIFY <topic> — migration touches persisted rows").
- *Body:* which checklist item hit, what's missing or risky, and what answer or action would make the task runnable. For an already-satisfied hit, the evidence that the end-state already holds.
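
The dedup check could be approximated as below. This is a simplified sketch: it assumes the deferral heading starts with =VERIFY <topic>= as in the example above, and it only checks for a heading at the same depth anywhere in the file rather than strictly under the same parent. The function name is hypothetical, and the topic is assumed to contain no backslashes.

#+begin_src sh
#!/bin/sh
# verify_already_filed TODO_FILE STARS TOPIC
# Succeeds when some heading of exactly depth STARS begins "VERIFY TOPIC".
verify_already_filed() {
  awk -v stars="$2" -v topic="$3" '
    BEGIN { prefix = stars " VERIFY " topic }
    index($0, prefix) == 1 { found = 1 }     # literal prefix match, no regex
    END { exit found ? 0 : 1 }
  ' "$1"
}
#+end_src

A run would call it before filing, for example =verify_already_filed todo.org '**' 'cache migration' || file_verify ...=, where =file_verify= stands for whatever the run uses to write the heading.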

** Routing a quick-question gap (speedrun only)

Under the speedrun preset, a checklist-1 or checklist-4 hit that collapses to one or two quick answerable questions routes to the pre-flight Q&A instead of deferring (see the preset section below). The discriminator: a *quick question* is a factual or preference pick answerable in one line without weighing tradeoffs ("cap at 5 or 8?", "which config key name?"); *deliberation* is anything that needs tradeoffs weighed, options explored, or code read by Craig. A task needing three or more questions isn't quick-question-gapped — it's underspecified; file the =VERIFY=. Checklist item 2 (data-loss / irreversible) never routes to the Q&A: an upfront answer doesn't override the hard safety gate.

The unattended loop has no one to ask — every hit defers there.

* Per-task quality bar

Autonomy changes who approves, not what quality means. Per task, non-negotiable:

- *TDD* per =testing.md=: red first, green, refactor. The keystone checklist item already proved the failing test is writable.
- *Verification* per =verification.md=: fresh evidence, full suite green before any commit.
- *=/review-code --staged=* before every commit; Critical and Important findings block until fixed.
- *=/voice personal=* on every commit message on the =autonomous-commit= path (or the patterns walked inline if the skill is unavailable), message printed inline so the log shows what landed.
- *Task closure* per =todo-format.md=: depth-based completion (keyword + =CLOSED:= at level 2, dated rewrite at level 3+).
- *One logical change per commit.* A large task becomes several commits, not one omnibus.

* Commit autonomy

=file-only= is the default: surface the diff, never commit. =autonomous-commit= is honored only when the project carries the commit-autonomy waiver, read fresh each run — never from memory of past runs or "this project usually allows it."

The waiver lives in the project's =.ai/notes.org= *Workflow State* section as marker lines, the same shape as the workflow markers already there:

#+begin_example
:COMMIT_AUTONOMY: yes
:LOOP_MAY_COMMIT: yes
#+end_example

- =:COMMIT_AUTONOMY: yes= — the project has the waiver. An =autonomous-commit= request (the speedrun preset, or a manual run asking for it) is honored.
- =:LOOP_MAY_COMMIT: yes= — the *unattended loop caller* may also commit. It requires =:COMMIT_AUTONOMY:= alongside it; the split exists because "Craig-initiated speedrun may commit" and "the recurring loop may commit unattended" are different levels of trust. Without this flag the loop stays =file-only= even when the project holds the waiver.

An absent marker means no. Anything other than a plain =yes= value also means no. The read is one grep of the Workflow State section — a lookup, not a judgment.

*The degrade contract.* When a caller requests =autonomous-commit= and the required marker is missing, degrade to =file-only= and surface it in both the run intro and the run summary: "autonomous-commit requested, no :COMMIT_AUTONOMY: waiver in notes.org — running file-only." Never honor the request without the marker, and never drop to file-only silently — the first commits into a project that didn't opt in, the second hides why nothing got committed.
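
The marker read and the degrade contract can be sketched together. The sketch assumes the section heading text is exactly =Workflow State= at any depth and that the section ends at the next heading. The function names =has_marker= and =effective_mode= are hypothetical, and the notices are paraphrased rather than the exact wording a run should surface.

#+begin_src sh
#!/bin/sh
# has_marker NOTES_FILE MARKER
# Succeeds only when the Workflow State section holds a line that is exactly
# ":MARKER: yes" (surrounding whitespace ignored). Absent or other value: no.
has_marker() {
  [ -r "$1" ] || return 1
  awk -v want=":$2: yes" '
    /^\*+ / { in_section = ($0 ~ /^\*+ +Workflow State[ \t]*$/); next }
    in_section {
      line = $0
      sub(/^[ \t]+/, "", line); sub(/[ \t]+$/, "", line)
      if (line == want) found = 1
    }
    END { exit found ? 0 : 1 }
  ' "$1"
}

# effective_mode NOTES_FILE CALLER REQUESTED_MODE
# Prints the mode the run may use; any degrade is announced on stderr.
effective_mode() {
  if [ "$3" != autonomous-commit ]; then
    echo file-only
  elif ! has_marker "$1" COMMIT_AUTONOMY; then
    echo "autonomous-commit requested, no :COMMIT_AUTONOMY: waiver in notes.org; running file-only" >&2
    echo file-only
  elif [ "$2" = loop ] && ! has_marker "$1" LOOP_MAY_COMMIT; then
    echo "loop requested autonomous-commit, no :LOOP_MAY_COMMIT: marker in notes.org; running file-only" >&2
    echo file-only
  else
    echo autonomous-commit
  fi
}
#+end_src

Calling =effective_mode .ai/notes.org speedrun autonomous-commit= at the start of every run, rather than caching the answer, keeps the read fresh.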

* Bounding the run

The cap is a hard per-run task ceiling passed by the caller — the kill switch a runaway can't exceed:

- *Loop caller default: 1.* Implement the highest-priority eligible candidate, record, stop; the next tick continues.
- *Speedrun: the length of the explicit list*, capped at a ceiling — the human bounded the set by naming it.

Even the speedrun stops at the cap and surfaces (and, with paging on, pages) the remainder. The cap bounds task *count*, not cost; a token budget is logged as vNext.

* Context hygiene — auto-flush between tasks

Task boundaries are clean boundaries by construction: the previous task is closed and committed (or filed), nothing is half-edited. When the context window grows heavy mid-run, run the flush skill's *auto mode* between tasks: checkpoint the session anchor with the remaining task set, session mode, and cap in Next Steps (so the resumed context continues the run blind), arm the self-injection (=.ai/scripts/self-inject.sh= via =tmux run-shell -b=), and end the turn. The fresh context resumes from the anchor and works on. Unattended runs only — the keystroke-collision hazard and the full mechanism live in the flush skill.

* End-of-set page

With paging on, fire one page when the set is done or the cap is hit — end-of-set only, never per-task:

#+begin_src sh
notify info "Page" "<project>: <N> done, <M> remaining — <one-line summary>" --persist
#+end_src

=--persist= keeps it on screen until dismissed, and =info= is the page-me urgency convention (persistent but never crash-scary). The page fires when the set completes *or* the cap stops the run — either way exactly once. The message carries the project name, the completed count, and the remaining count (with skipped tasks noted in the run summary) so Craig can confirm ready and name the next project in one reply. There is no separate page-signal call — =notify= is the paging surface.

* Metrics

Each task outcome appends one JSON line to the project's =.ai/metrics/work-the-backlog.jsonl= — git-tracked, append-only, =jq=-queryable. Create the directory and file on the first append. Logging is a side effect only: a failed append surfaces a warning in the run summary but never blocks, reorders, or aborts execution.

One record per task, written at the moment its outcome is decided:

| Field              | Meaning                                                                                         |
|--------------------+-------------------------------------------------------------------------------------------------|
| =ts=               | ISO-8601 timestamp of the task outcome                                                          |
|--------------------+-------------------------------------------------------------------------------------------------|
| =run_id=           | UUID shared by every record in one run (=uuidgen= at run start)                                 |
|--------------------+-------------------------------------------------------------------------------------------------|
| =project=          | project basename                                                                                |
|--------------------+-------------------------------------------------------------------------------------------------|
| =caller=           | =loop= / =speedrun= / =manual=                                                                  |
|--------------------+-------------------------------------------------------------------------------------------------|
| =task=             | the task heading (slug)                                                                         |
|--------------------+-------------------------------------------------------------------------------------------------|
| =outcome=          | =implemented-committed= / =implemented-diff= / =deferred-verify= / =skipped-ineligible= /       |
|                    | =dropped-by-craig= / =failed=                                                                   |
|--------------------+-------------------------------------------------------------------------------------------------|
| =defer_reason=     | =underspecified= / =data-loss= / =already-satisfied= / =needs-deliberation= — set on            |
|                    | =deferred-verify= records only                                                                  |
|--------------------+-------------------------------------------------------------------------------------------------|
| =upfront_decision= | =true= when a pre-flight answer was recorded and used for this task                             |
|--------------------+-------------------------------------------------------------------------------------------------|
| =wall_clock_s=     | seconds from task start to outcome                                                              |
|--------------------+-------------------------------------------------------------------------------------------------|
| =commit_sha=       | committed tasks: the commit SHA (comma-separated when the task decomposed into several); empty  |
|                    | otherwise                                                                                       |
|--------------------+-------------------------------------------------------------------------------------------------|
| =review_findings=  | count of =/review-code= Critical + Important findings on this task                              |
|--------------------+-------------------------------------------------------------------------------------------------|

The =outcome= slugs map one-to-one onto the outcome vocabulary above (=implemented-diff= is =implemented-diff-surfaced=; =deferred-verify= is =deferred-VERIFY=). Per-run rollups (attempted / completed / deferred / dropped, wall-clock total, findings per commit) are computed at synthesis, not stored per record. The =commit_sha= field is what the synthesis step's corrections signal keys on — whether a later commit reverted or hand-fixed an autonomous one — so never omit it on a committed task.
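
A minimal sketch of the append, assuming plain POSIX tools and single-line task headings. It writes only six of the fields from the table to stay short; the rest would be added the same way. The function names are hypothetical, and only backslashes and double quotes are escaped, so it is not a general JSON encoder.

#+begin_src sh
#!/bin/sh
# json_escape STRING: escapes backslashes and double quotes for a JSON string.
json_escape() {
  printf '%s\n' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# append_metric FILE RUN_ID PROJECT CALLER TASK OUTCOME
# Creates the directory and file on first use and appends one JSON line.
append_metric() {
  mkdir -p "$(dirname "$1")" || return 1
  printf '{"ts":"%s","run_id":"%s","project":"%s","caller":"%s","task":"%s","outcome":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$2" "$(json_escape "$3")" "$4" \
    "$(json_escape "$5")" "$6" >> "$1"
}
#+end_src

To honor the side-effect-only rule, a caller would treat failure as a warning, for example =append_metric ... || echo "warning: metrics append failed" >&2=, and carry on. The resulting file can then be queried with something like =jq -r 'select(.outcome == "failed") | .task' .ai/metrics/work-the-backlog.jsonl=.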

* Caller: the inbox auto-loop

=inbox.org= auto mode chains here as an explicit second step *after* its routing completes — never as a phase inside inbox processing. When a cycle files new items and Craig answers "run this batch next?" with yes, auto mode invokes this workflow with:

- *Task set:* the eligibility query over the queued/filed items — status =TODO= + =:solo:= per the scheme header, priority-ordered.
- *Session mode:* =file-only=, paging off. (A project carrying both =:COMMIT_AUTONOMY:= and =:LOOP_MAY_COMMIT:= markers opts the loop into commits — see Commit autonomy above.)
- *Cap: 1.* The highest-priority eligible candidate runs, gets recorded, and the loop's next tick (or the next yes) continues from there.

No human is present when the loop kicks off a task, so a needs-quick-decisions task defers with a =VERIFY=; the pre-flight Q&A is a speedrun capability, not a loop one. Startup and wrap-up never invoke this workflow.

* Preset: the no-approvals speedrun

The named preset is a label for one flag combination, not a second code path: *explicit ordered list + =autonomous-commit= + always-push + paging-on*, with every approval front-loaded into a single pre-flight step. "No approvals" means all input first, then hands-off — not no input ever. =autonomous-commit= still requires the =:COMMIT_AUTONOMY:= waiver (Commit autonomy above); without it the preset degrades to =file-only= and says so in the pre-flight intro.

When Craig names a task set and says "speedrun":

1. *Gather* the named task set.
2. *Scope-read and classify* each task against the eligibility gate + defer checklist: *ready* (clears everything), *needs-quick-decisions* (one or two upfront-answerable questions — checklist item 1 or 4), or *drop* (data-loss/irreversible, or deliberation that isn't a quick question).
3. *Order* the list — priority, then the author's ordering / =:next:=.
4. *Intro the work* — present the ordered plan: what will run, what was dropped and why, and the batched questions for the needs-quick-decisions tasks.
5. *Craig answers each question or says "skip this"* — a skip removes the task (recorded =dropped-by-craig=; the task itself stays =TODO=); an answer is recorded so implementation works from the decision, not a guess.
6. *Run the finalized list autonomously* — no further approvals until done. Cap = the list length (the human bounded the set by naming it), still one commit per logical change, always-push per the project's flow, auto-flushing between tasks when the context grows heavy (see Context hygiene above).
7. *End-of-set page* with the completed and remaining counts; skipped tasks are noted in the run summary.

The batch-ask (steps 4-5) is one message: each question names its task, puts the recommended answer at item 1 when there is one (per =interaction.md=: inline numbered, no popup), and offers "skip this" as the last option. Before the run starts, write each answer into its task's body in =todo.org= as a dated line, so the implementation works from the recorded decision and the record survives the session. The Q&A fires only under this preset; the loop caller never asks (its decision-needing tasks defer).

** Per-item disposition rule

For every item the run picks up (this holds for any executing caller, including an auto-inbox-zero run given a standing yes):

- *Feature-level task* → write a spec first (=spec-create=), don't implement directly. The spec is the run's deliverable for that item.
- *Needs decisions you can't confidently guess* → file it as a =VERIFY= carrying the question (under this preset, one or two quick questions route to the pre-flight Q&A instead).
- *Well-defined* → implement it, taking the time it needs.

This extends the defer checklist: the checklist decides *act vs file*; this rule decides the *shape* of the act.

* Synthesis: metrics → org-roam KB

Trigger: "synthesize backlog metrics" (optionally a weekly scheduled run). This is the read side of the metrics log — Craig's ask was "gather data and create org-roam articles we can look at later," and this step is the second half. It is read-only over the logs plus exactly one KB write.

1. *Gather the JSONL union.* Discover =.ai/metrics/work-the-backlog.jsonl= across the project roots (dirs carrying =.ai/protocols.org= under =~/code=, =~/projects=, =~/.emacs.d=). Classify each project per =knowledge-base.md= (work-root denylist, never inference) before reading it into the union.
2. *Enforce personal-only.* A work-classified or unknown project's metrics never enter the KB write — they stay in that project's own log. Report the exclusion per the KB refusal contract: the classification, a one-line redacted summary, and where the data stayed.
3. *Compute the rollups and trends.* Per run: attempted / completed / deferred (by reason) / dropped / failed, wall-clock total, commits landed, review findings per commit. Trends across runs: completion rate over time, defer-reason distribution, findings-per-commit trend.
4. *Compute the corrections signal* — the key metric. For each =commit_sha= in the window, check that project's history for a later commit (within ~14 days) that reverts it or carries a fix touching the same files. A clean run is one whose autonomous commits survive untouched; a flagged run is what Craig reviews by hand. This is a cheap proxy, not proof — it flags candidates, it doesn't convict.
5. *Write one KB node* at =~/org/roam/agents/YYYYMMDDHHMMSS-backlog-metrics-<window>.org= per =knowledge-base.md=: =:agent:metrics:= filetags, a concise title, the rollup table, the trend narrative, and =[[id:...]]= links to prior synthesis nodes so the series is traceable. Pull before writing, commit and push after — the normal KB session discipline.

The KB node is the artifact Craig reads later: "are the runs completing more and getting corrected less?" should read off the trend table without touching raw logs. Synthesis never mutates the JSONL, todo.org, or any project tree.
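
The corrections signal from step 4 might be approximated with plain =git= plumbing, as in the sketch below. It assumes it runs inside the project's repository and flags any later commit within the window that touches a file the autonomous commit touched. That is a superset of "reverts or fixes", consistent with the signal being a proxy that flags candidates. The function name is hypothetical, and file names are assumed not to need =git='s path quoting.

#+begin_src sh
#!/bin/sh
# corrections_for SHA [WINDOW_DAYS]
# Prints the SHAs of later commits, within WINDOW_DAYS (default 14) of SHA's
# commit date, that touch any file SHA touched. Subshell body keeps state local.
corrections_for() (
  sha=$1
  days=${2:-14}
  base=$(git show -s --format=%ct "$sha") || exit 1
  cutoff=$((base + days * 86400))
  files=$(git diff-tree --root --no-commit-id --name-only -r "$sha")
  [ -n "$files" ] || exit 0
  # POSIX sh has no arrays, so the file list goes into the positional parameters.
  set --
  while IFS= read -r f; do
    set -- "$@" "$f"
  done <<EOF
$files
EOF
  git log --format='%H %ct' "$sha..HEAD" -- "$@" |
    awk -v cutoff="$cutoff" '$2 <= cutoff { print $1 }'
)
#+end_src

An empty result marks the commit as clean for the window; any output is a candidate for review by hand, not a verdict.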

* Common Mistakes

1. *Implementing a =VERIFY= or =DOING= task.* The gate is status =TODO= only — a =VERIFY= exists precisely because Craig's input is pending.
2. *Treating =:quick:= as eligibility.* It's an effort hint. =:solo:= is the gate.
3. *Deferring on size.* A large, well-specified, decision-free task runs — decomposed into logical commits. Size is not a checklist item.
4. *Guessing past the keystone.* If the failing test isn't writable from the task text, the task isn't ready. Inventing the requirement is the failure the checklist exists to stop.
5. *Rationalizing through the data-loss list.* "The migration is small" doesn't clear checklist item 2. Enumerated operations defer, full stop.
6. *Committing in =file-only= mode.* The diff is the deliverable; the commit is Craig's.
7. *One omnibus commit for the whole run.* Every logical change is its own reviewed commit.
8. *Skipping =/review-code= or =/voice= because nobody's watching.* Autonomy removes interaction gates, never engineering-discipline gates (same contract as =no-approvals.org=).
9. *Running past the cap.* The cap is the kill switch; hitting it means stop and surface, even mid-set.
10. *Paging per-task.* One page, end of set.
11. *Honoring =autonomous-commit= from memory.* The waiver is the marker line in =notes.org=, read fresh each run. "This project usually allows it" isn't a read.
12. *Re-filing the same deferral =VERIFY= every run.* The deferred task stays =TODO=, so a run that skips the existing-sibling check spams =todo.org= with duplicates.
13. *Routing a data-loss hit to the pre-flight Q&A.* Checklist item 2 is the hard gate — an upfront answer never clears it without an explicit checkpoint.

* Living Document

Refine as the dogfooding signal arrives — the metrics log and the corrections-in-next-session signal are the feedback loop. Fold recurring adjustments in rather than accumulating caller-side workarounds.

* History

Created 2026-07-02 as Phase 1 of the autonomous-batch execution spec, reconciling the inbox-zero "Phase E" proposal and the =.emacs.d= speedrun proposal into one execution loop. The auto-inbox-zero execute step in =inbox.org= reverted to routing-only in the same change so this file is the loop's only home. Phases 2-6 (same day) wired both callers, pinned the commit-autonomy waiver markers, fleshed out the defer/Q&A/page mechanics, and added the metrics record + KB synthesis step.