#+TITLE: Cross-Agent Communication Workflow (v5)
#+AUTHOR: Craig Jennings & Claude (homelab + career sessions)
#+DATE: 2026-04-27
#+VERSION: 5

* Status

Draft. Iterating between the homelab and career sessions through a multi-round design discussion. Awaiting Craig's review for promotion to =~/code/rulesets/claude-templates/.ai/workflows/=.

v5 changes from v4:
- *Script absorption.* Seven operational scripts (=cross-agent-send=, =cross-agent-recv=, =cross-agent-watch=, =cross-agent-status=, =cross-agent-discover=, =cross-agent-halt=, =cross-agent-resume=) now own most implementation detail. Their READMEs are the operational source of truth. The spec stays declarative.
- *Failsafe halt.* Layered HALT-file mechanism stops all cross-agent activity on a machine within ~5 min, without visiting individual sessions or restarting Claude Code. =cross-agent-halt= and =cross-agent-resume= are the convenience entry points; every other component checks the HALT file independently.
- *Identity.* Messages are GPG-signed by sender and verified by receiver. Combined with POSIX permissions on =from-agents/= and Tailscale-level network auth, identity becomes a three-layer story.
- *Atomic writes.* Writers MUST use temp-file + rename. =cross-agent-send= handles this; the spec just states the contract.
- *Dedup.* Sequence-collision dedup is now binary SHA-256 equality, not a fuzzy ">90% match" threshold.
- *Cold-start handling.* Layered: =cross-agent-watch= (push notifications via =inotifywait=) is the primary mechanism; startup-workflow check and user-direct-injection are coverage layers.
- *Spec stays roughly the same length but does more protocol work.* Operational detail (rsync retry numbers, inotifywait recipes, peers.toml schema, GPG flags, dedup mechanics) moved to the script READMEs. The spec adds new protocol elements (identity layer, atomic-writes contract, SHA-256 dedup, =escalate= type, =RELEASE_STATUS= values, =REQUIRES_TOOLS= optional field) in the freed space. Total documentation surface (spec + seven READMEs ≈ 1000 lines) is larger than v4's 259 lines, but the spec and the READMEs serve different audiences — protocol-thinkers and CLI-users — and a reader of just the spec can comprehend the protocol without consulting any README.

* When to use

When two Claude sessions in different projects (same machine or different machines on the same Tailscale tailnet) need to coordinate on a shared task that one session can't complete alone — typically because one has tooling, context, or MCP access the other doesn't.

Examples that fit:
- Session A asks session B to apply a workflow patch in B's project, then verify it.
- Session A runs a long task and needs session B to monitor results in B's domain.
- Two sessions co-design a workflow.

Examples that don't fit:
- A simple file handoff that doesn't require iteration.
- A task one session can do alone.
- Cross-tailnet or cross-organization. The protocol is local-tailnet-scoped.

* Protocol

** File location

Each project has =inbox/from-agents/= as its agent-comms mailbox. Create the directory if it doesn't exist; set permissions =chmod 700= and ownership to the user.

- Sender writes to receiver's =inbox/from-agents/=.
- Receiver polls (or watches) =inbox/from-agents/=, *not* the parent =inbox/=.
- The parent =inbox/= stays reserved for human-triage items.
- Out-of-band artifacts (PDFs, datasets) live at =inbox/from-agents/artifacts/=. Reference by relative path in the message body.

The user does NOT write directly to =from-agents/=. To inject input into a running conversation, the user tells one of the agents in that agent's session; the agent writes the input as a normal message attributed to the user.

** File naming

=YYYYMMDDTHHMMSSZ-from-<sender>-<short-conv-id>.org=

- Timestamp is UTC ISO 8601 compact. The trailing =Z= is mandatory.
- =from-<sender>= prefix.
- =<short-conv-id>= is a stable kebab-case slug across the back-and-forth. Reusable across time; ordering relies on filename timestamps.

Frontmatter =#+TIMESTAMP= carries the same instant in local time with explicit offset. The two MUST refer to the same instant.

The implementation (=cross-agent-send=) generates the canonical filename from the message's frontmatter (=CONVERSATION_ID=, current UTC time) and the sender's project context. Senders supply only the message body file; the script handles naming. Senders MUST NOT pre-name files in this format and pass them through; the script overwrites with its own canonical name to ensure consistency and enable the sender-side max-seen sequence-collision-reduction scan.

GPG signatures live in a sibling file =YYYYMMDDTHHMMSSZ-from-<sender>-<short-conv-id>.org.asc=. Receivers verify before processing. See =* Writes are atomic= for the two-file delivery ordering rule.

** Frontmatter

Required:

#+begin_example
#+TITLE: <human-readable subject>
#+CONVERSATION_ID: <stable across the thread>
#+MESSAGE_TYPE: <see types below>
#+SEQUENCE: <integer hint>
#+TIMESTAMP: <ISO 8601 with explicit offset>
#+PROTOCOL_VERSION: 5
#+end_example

Optional:

#+begin_example
#+REQUIRES_TOOLS: <comma-separated tool/MCP slugs, e.g. gmail-mcp, slack-mcp>
#+RELEASE_STATUS: <see release-statuses; valid only on MESSAGE_TYPE: release>
#+WORKFLOW_VERSION: <sender's version of cross-agent-comms.org; informational only in v5 — no enforcement>
#+end_example

Receiver sanity-checks frontmatter before acting. Missing or malformed frontmatter → surface to user, don't proceed. Mismatched =PROTOCOL_VERSION= → receiver writes a =query= asking the originator to upgrade.

** Identity

Messages are GPG-signed by the sender. Receivers verify the detached signature before processing the message body.

The implementation (=cross-agent-send=) signs automatically with the sender's configured key (the user's primary GPG key by default; configurable via =--key= flag or environment). Receivers verify automatically against the keys in their GPG keyring.

Identity is a three-layer story:

1. *Tailscale layer.* Only tailnet members can reach the rsync-over-SSH endpoint at all.
2. *POSIX layer.* =chmod 700= on =from-agents/= means only processes running as the directory's owner can write.
3. *GPG layer.* Sender's signature on each message proves the message originated from a process holding the key.

Three independent layers. Per-user GPG (using existing keys) gives a correctness check more than a security boundary — unsigned messages are almost certainly bugs, not attackers. That's still load-bearing.

** Writes are atomic

Writers MUST use a temp-file + rename pattern (=mktemp= + =mv= within the same filesystem) so receivers never see partial files. The implementation script (=cross-agent-send=) handles this.

Receivers ignore =.tmp.*= files, processing only the final renamed name.

*Two-file ordering.* When a message has a sibling GPG signature file (=.org.asc=), the writer MUST rename the =.asc= to its final name *before* renaming the =.org=. Two =mv= operations are not atomic together — without this ordering, a receiver could read the =.org= in the window between the two renames and fail GPG verify because the =.asc= hasn't landed yet. The rule: receiver only acts on =.org= files, and a =.org= without a corresponding =.asc= means the signature is genuinely missing (not still in flight).

** Sequence numbering

=#+SEQUENCE= is a *hint*, not a strict counter. Canonical order is =#+TIMESTAMP=. Sequences may collide under rapid back-and-forth (both sides write what they think is sequence N near-simultaneously). Treat collision as a normal protocol event.

*Receiver-side dedup rule.* When a new file shares =CONVERSATION_ID= + =SEQUENCE= with an already-processed message, compare SHA-256 hashes. Identical hashes → silent dedup, treat as a retry. Different hashes → process both, ordered by =#+TIMESTAMP=.

*Sender-side collision-reduction (best-effort).* Before picking sequence, scan the receiver's =from-agents/= for the highest existing sequence in this conversation across both sender prefixes. Use =max(seen) + 1=.

** Message types

- *request* — a side asks for work, input, or a decision. Sequence 1 is always =request=.
- *progress* — work-in-progress checkpoint. "Here's where I am, no action needed from you, more coming." Originator's poll loop should NOT page the user on progress messages.
- *query* — either side asks a clarifying question that blocks further work. Originator's poll loop SHOULD surface this immediately. Originator answers and work continues.
- *pushback* — receiver formally disagrees with the request and has *not* started the work. Carries reasoning. Distinct from =query= because the originator's response path differs.
- *complete* — receiver signals the requested work is done. Triggers verification.
- *release* — terminal type. Originator writes after verifying =complete=. Carries =RELEASE_STATUS= to disambiguate the closure mode.
- *escalate* — punts the conversation to the user for adjudication. Both sides pause polling on =escalate=; the user resolves.

Reply expectation is implied by type: =request=, =query=, =pushback=, =escalate= expect a reply; =progress=, =complete=, =release= don't.

** Conversation lifecycle

A conversation is a directed loop between an originator (issued sequence 1) and a receiver:

1. Originator writes =request= (sequence 1). Begins polling for replies.
2. *Optional acknowledgment.* Receiver may write a =progress= at sequence 2 to acknowledge receipt and set expectations. Required if work will take >5 minutes (so the originator's poll loop doesn't waste wakes).
3. *Optional echo-back.* For ambiguous or large requests, receiver writes a =progress= that restates work items and announces "starting now unless you push back within N minutes."
4. Receiver works. May write =progress= updates. =query= mid-work if blocked. =pushback= if the request is wrong.
5. Receiver writes =complete=. Begins polling for =release=.
6. Originator reads, *verifies the deliverable directly*. For subjective deliverables, verification is the originator's editorial accept.
7. If verified: =release= with =RELEASE_STATUS: complete=. If problems: new =request= (next sequence number).
8. Receiver sees =release=, stops polling.

The verification step is load-bearing. =complete= is a *claim*; =release= is *verification*.

** Pushback path

On receiving a =pushback=, the originator chooses:

1. *Revise* — new =request= with adjusted scope.
2. *Insist* — new =request= addressing the pushback's reasoning, standing by direction.
3. *Withdraw* — =release= with =RELEASE_STATUS: withdrawn-after-pushback=.

*Deadlock cap.* After two pushback-insist exchanges, the next message MUST be =MESSAGE_TYPE: escalate=. Both agents pause polling; the user resolves.

** =RELEASE_STATUS= values

| Status | Meaning |
|---+---|
| =complete= | Goal achieved, originator verified |
| =cancelled= | Originator changed their mind mid-conversation |
| =withdrawn-after-pushback= | Originator chose option 3 on receiver's =pushback= |
| =abandoned-after-escalation= | User adjudicated and chose to close the conversation |
| =abandoned-after-timeout= | Receiver auto-closed after originator never returned to verify |

** Async fallback

If the originator session ends between =request= and =complete=, the receiver's =complete= goes unverified. Receiver behavior:

- Polls for =release= up to ~24 hours of cycles (implementation default).
- After timeout, writes a final =progress= message ("treating as terminal-without-verification; originator never returned to release") and stops polling. Receiver does NOT write =release= itself — that would contradict the lifecycle rule that =release= is the originator's terminal action.
- Next time the originator project starts, the unreleased =complete= is surfaced as a startup item. The user can issue a late =release= (with whichever =RELEASE_STATUS= fits) or open a fresh conversation to revisit. =RELEASE_STATUS: abandoned-after-timeout= is used at that point if the user wants to formally close the orphaned thread.

** Escalation

A side writes =escalate= when:
- Pushback-insist deadlock cap reached.
- Conversation has stalled (no productive movement in N exchanges).
- A reply-expecting message has gone unanswered past timeout.

Body summarizes both sides' positions in 60 seconds of reading. Both agents pause polling; the user resolves.

* Implementation notes

This sub-section describes how to operate the protocol. Operational detail lives in the seven scripts' READMEs.

** Recommended scripts

| Script | Replaces user action | README |
|---+---+---|
| =cross-agent-send <dest> <msg>= | Filename generation, GPG sign, atomic write, peer lookup, rsync push, retry+backoff, failure surfacing — seven mechanical sender-side steps. Frontmatter and message body are still author-supplied. | =cross-agent-send.md= |
| =cross-agent-recv <msg>= | Frontmatter sanity-check, =PROTOCOL_VERSION= verify, GPG verify, SHA-256 dedup, =REQUIRES_TOOLS= check — five mechanical receiver-side steps. Output is a structured decision (=process= / =dedup= / =query= / =reject=) the agent acts on. | =cross-agent-recv.md= |
| =cross-agent-watch= | Manually checking inboxes; "did I get a message?" | =cross-agent-watch.md= |
| =cross-agent-status= | Walking each project to count pending messages | =cross-agent-status.md= |
| =cross-agent-discover= | Remembering project topology and reachability | =cross-agent-discover.md= |
| =cross-agent-halt [reason] [--tailnet]= | Visiting each session to stop polling, restarting Claude Code, or hand-killing processes when comms go runaway. =--tailnet= propagates HALT to all peers. | =cross-agent-halt.md= |
| =cross-agent-resume [--tailnet]= | Manually clearing the HALT state and restarting the watcher. Per-session polling does NOT auto-resume — the user re-engages each session explicitly. | =cross-agent-resume.md= |

The scripts are tools the user runs from any terminal. They do not depend on agent context — =cross-agent-status= run from a fresh shell works.

A reader can comprehend this protocol from this spec alone. Script READMEs add operational detail that makes the protocol practical to use, but understanding the protocol's semantics requires only this document.

** Polling

Default cadence: 270 seconds (≈4.5 min). Sits just under the 5-minute prompt-cache TTL.

If a side needs to slow down (heads-down work, idle wait), it writes a =progress= message saying so in prose. The other side adapts. There are no named polling modes.

After ~12 empty polls in a row, the poll loop surfaces the silence to the user.

A future runtime with native filesystem-event support could replace polling for active sessions; =cross-agent-watch= already provides event-driven notifications outside active sessions.

** User multi-tasking

- *Deferral.* If the user's last message in the agent's session was less than 60 seconds ago AND a poll fires, queue the inbox check until either the user sends another message OR 5 minutes pass without further input.
- *Surfacing.* On the next user-facing response: "While we were working on X, a cross-agent message landed from <project>. It's a =<type>= — want me to handle it now or after we finish?"
- *Mid-question.* Answer the user first.
- *Project switch.* If the user moves to the receiver project mid-conversation, the receiver agent surfaces the in-flight thread on first user prompt.
- *Conversation state.* Always include in any response that mentions a cross-agent thread: "<conv-id> at sequence N, awaiting <event>."

** Failure modes

The seven scripts surface most failures with concrete error messages. Spec-level failure modes:

- *Malformed frontmatter on a received file.* Surface to user; do not act.
- *Mismatched =PROTOCOL_VERSION=.* Receiver writes =query= asking originator to upgrade.
- *Missing or invalid GPG signature.* Receiver surfaces "unsigned/unverified message"; refuses to act.
- *Sequence collision* with non-matching SHA-256. Process both, ordered by timestamp.
- *Required tool unavailable.* Receiver checks =REQUIRES_TOOLS= during frontmatter-sanity-check (before any work begins). On a missing tool, receiver writes =query= asking the originator to reframe the request to avoid the unavailable tool. Originator may revise (new =request=) or withdraw (=release= with =RELEASE_STATUS: cancelled=). =query= is the right type rather than =pushback= because missing-tool is a capability gap, not disagreement.
- *Runaway resource usage.* User invokes =cross-agent-halt= globally (or =cross-agent-halt --tailnet= for cross-machine). HALT file stops all components within one polling cycle (~5 min). See =* Halt mechanism= for the layered checks.
- *User halts mid-conversation.* Both sides write a final =progress= note ("HALT fired; pausing"); polling stops within one cadence; conversations resume on explicit per-session re-engage after HALT clears.
- *HALT file accidentally created* (typo, errant =touch=). =cross-agent-status= prominently flags HALT active; user clears with =cross-agent-resume=. Cost: no messages send during the typo window.
- *HALT file unreadable* (perms wrong, partial write). Each component fails-closed (treats as halted) and reports "HALT file present but unreadable; treat as halted." Safer than fail-open.

Operational failures (rsync push fails, watcher dies, peer unreachable) live in the script READMEs' failure-mode tables.

* Halt mechanism

A failsafe to stop all cross-agent activity on a machine without visiting individual sessions or restarting Claude Code. Designed for the runaway-polling case: an agent has spun up conversations with N other agents, polling is eating CPU, and the user needs to stop everything *now*.

** The HALT file

Path: =~/.config/cross-agent-comms/HALT=.

Existence triggers halt across all components on the machine. The file's body may carry an optional human-readable reason (reviewed by the user later when deciding to resume).

User commands:

#+begin_example
$ touch ~/.config/cross-agent-comms/HALT      # halt
$ rm ~/.config/cross-agent-comms/HALT         # resume
#+end_example

Or via convenience scripts (=cross-agent-halt= / =cross-agent-resume=) that also handle the watcher service and cross-machine propagation.

** Layered checks (the failsafe property)

Every component MUST check the HALT file. The "any one component stops the system independently" property is what makes this failsafe — the system doesn't depend on a single point doing the right thing.

| Component | Check timing | Behavior on HALT |
|---+---+---|
| =cross-agent-send= | At start of send + between =.asc= and =.org= rsync + between retry iterations | Refuse to start new send; complete current step then exit. Worst case: one in-flight send finishes within a few seconds. |
| =cross-agent-recv= | Before any verify or dedup | Leave inbound message in place — do NOT dedup, reject, or move. Resume picks it up via cold-start handling. |
| =cross-agent-watch= | At iteration start | Suppress notifications; log only. Continues running, no-op until HALT clears. |
| =cross-agent-status= | At start | Print prominent "⚠ HALT ACTIVE" banner before normal output. Read-only, continues. |
| =cross-agent-discover= | At start | Print HALT banner; continue read-only enumeration. |
| Agent polling loop | First action on every wake | Write a final =progress= note to any active conversation ("HALT fired; pausing"), do NOT reschedule, surface "halt active" to user. Polling decays within one cadence (~5 min). |
| Agent user-facing responses | Every response while HALT is set | Append "(HALT active; cross-agent comms paused)" to the response. On HALT clear, the next response says "(HALT cleared; cross-agent comms ready to resume — say so to re-engage polling)." Persistent, not just first-response — keeps awareness alive. |
| Conversation initiator | Before writing sequence 1 of any new conversation | Refuse and surface to user. |
| Startup workflow | Phase A on session start | If HALT exists, surface immediately and skip cross-agent inbox checks. |

The agent polling-loop check is the load-bearing one for "stops eating CPU." Wake-ups already scheduled fire, but each wake on-HALT is a no-op + reschedule-prevention. Within one polling cadence (~5 min) all polling stops.

*Fail-closed on unreadable HALT.* If the HALT file exists but is unreadable (wrong permissions, partial write), components MUST treat as halted. Safer than fail-open.

** Resume asymmetry (deliberate)

Halt is automatic everywhere. Resume requires explicit user intent per-session.

When the user removes HALT (or runs =cross-agent-resume=), components stop refusing to act, but agent polling does NOT auto-resume. The user must open each session and tell that agent to resume polling for its conversations.

The asymmetry exists because:

1. Auto-resume could silently invert intentional kills. If the user halted because a session was misbehaving, removing HALT shouldn't quietly revive it.
2. Per-session resume forces the user to look at each session and confirm the situation is resolved before re-engaging.

** Cross-machine halt

=cross-agent-halt --tailnet= iterates =peers.toml= and SSH-touches HALT on each peer. Same shape for resume.

Reports per-peer status with non-zero exit on partial halt:

#+begin_example
$ cross-agent-halt --tailnet
Halting velox.local      ✓ (HALT file written)
Halting bastion.local    ✗ (ssh exit 255: no route to host)
Halting locally          ✓ (HALT file written)

PARTIAL HALT: 2/3 machines halted. bastion.local needs manual halt.
Exit 1.
#+end_example

Scripting can detect partial halt via the exit code. Same pattern for =--tailnet= on resume.

* Limitations

- *Local-tailnet only.* Filesystem IPC + rsync over SSH. Cross-tailnet or cross-organization is out of scope.
- *Identity has three layers (Tailscale + POSIX + GPG)* but no message-content encryption. Confidentiality is not the goal; signing is correctness, not secrecy.
- *Single-receiver per conversation.* Fan-out to multiple receivers requires manually orchestrating multiple parallel conversations.
- *Polling is best-effort.* A wake may be delayed by an in-flight tool call until the runtime is idle. =cross-agent-watch= mitigates by offering event-driven notifications.
- *Project-extension drift.* If two projects' =.ai/project-workflows/= modify shared workflow definitions in incompatible ways, cross-agent assumptions can diverge silently. The optional =#+WORKFLOW_VERSION= advisory field is informational only in v5 — no implementation reads or acts on it. A future version may add enforcement on mismatch (e.g. receiver writes =query= asking which side is stale). Today, alignment is verified manually before high-stakes conversations.

* Persistence after release

Conversation files persist by default. The conversation log is the audit trail.

Manual archival is fine if the inbox grows unmanageable. Suggested cadence: once the conversation has been =release='d AND the work it produced has shipped, archive both projects' message files into =.ai/sessions/cross-agent/= as a flat directory — no per-conversation subdirectories. Rename each archived file to lead with the conversation-id so messages from the same conversation cluster on =ls=: =<conv-id>-<TIMESTAMP>-from-<sender>.org= (and the matching =.asc= sibling, if present). Inbox filenames lead with the timestamp because chronological arrival is what matters in =from-agents/=; archives invert that because grouping by conversation is what matters when reading history. Keep the =.asc= signatures alongside the =.org= files in archive — they're small and document the GPG verification chain.

Old messages don't affect protocol behavior (=cross-agent-status='s pending semantics correctly ignore released messages) but the =from-agents/= directory grows indefinitely without manual archival. =cross-agent-status= performance degrades noticeably when a project's =from-agents/= exceeds a few hundred files. =cross-agent-init= (deferred to v6) would include an archival sub-command.

* Open questions

- *=cross-agent-init= and =cross-agent-compose= helper scripts.* =-init= would be one-command project bootstrap (creates =inbox/from-agents/= with =chmod 700=, installs the =cross-agent-watch= systemd path unit, validates peer config, runs a discovery probe). =-compose= would be interactive frontmatter authoring (prompts for required fields, produces a draft message file). Both deferred to v6. Current onboarding requires manual =mkdir= + systemd setup per =cross-agent-watch.md='s install recipe; current message authoring requires writing the file by hand or via a small in-agent template.
- *Hard conversation timeout.* The async-fallback timeout is implementation-default ~24 hours. Right number depends on use case; tighten as patterns emerge.
- *=paused= polling state.* Today there's no clean signal for "pause without ending." Add when first user complaint surfaces.
- *Multi-LLM context.* If we ever bring in a non-Claude agent, the protocol's natural-language framing may need formalization.

* Examples

** =prep-fixup= conversation (2026-04-26 → 2026-04-27)

Eleven exchanges between homelab and career produced the v4 spec by iterative critique-and-simplification. Three real-time sequence collisions during the conversation drove the sequence-as-hint rule that landed in v4 and persists in v5.

Files at =~/projects/{homelab,career}/inbox/from-agents/= named =*-prep-fixup.org=. Worth re-reading when designing future cross-agent flows.

** =comms-cold-start-discovery= conversation (2026-04-27)

The follow-up that produced this v5 spec. Cold-start, watcher tooling, agent discovery, GPG identity, sha256 dedup, atomic writes, POSIX perms, script absorption, and process-vs-text simplification. Tonight's first cold-start in real time (career session went dormant after =prep-fixup= release; Craig's user-injection re-engaged it) is the worked demonstration of the v5 user-injection rule.

Files at =~/projects/{homelab,career}/inbox/from-agents/= named =*-comms-cold-start-discovery.org=.