#+TITLE: Cross-Agent Communication Workflow (v5)
#+AUTHOR: Craig Jennings & Claude (homelab + career sessions)
#+DATE: 2026-04-27
#+VERSION: 5

* Status

Draft. Iterating between the homelab and career sessions through a multi-round design discussion. Awaiting Craig's review for promotion to =~/code/rulesets/claude-templates/.ai/workflows/=.

v5 changes from v4:

- *Script absorption.* Seven operational scripts (=cross-agent-send=, =cross-agent-recv=, =cross-agent-watch=, =cross-agent-status=, =cross-agent-discover=, =cross-agent-halt=, =cross-agent-resume=) now own most implementation detail. Their READMEs are the operational source of truth. The spec stays declarative.
- *Failsafe halt.* Layered HALT-file mechanism stops all cross-agent activity on a machine within ~5 min, without visiting individual sessions or restarting Claude Code. =cross-agent-halt= and =cross-agent-resume= are the convenience entry points; every other component checks the HALT file independently.
- *Identity.* Messages are GPG-signed by sender and verified by receiver. Combined with POSIX permissions on =from-agents/= and Tailscale-level network auth, identity becomes a three-layer story.
- *Atomic writes.* Writers MUST use temp-file + rename. =cross-agent-send= handles this; the spec just states the contract.
- *Dedup.* Sequence-collision dedup is now binary SHA-256 equality, not a fuzzy ">90% match" threshold.
- *Cold-start handling.* Layered: =cross-agent-watch= (push notifications via =inotifywait=) is the primary mechanism; startup-workflow check and user-direct-injection are coverage layers.
- *Spec stays roughly the same length but does more protocol work.* Operational detail (rsync retry numbers, inotifywait recipes, peers.toml schema, GPG flags, dedup mechanics) moved to the script READMEs. The spec adds new protocol elements (identity layer, atomic-writes contract, SHA-256 dedup, =escalate= type, =RELEASE_STATUS= values, =REQUIRES_TOOLS= optional field) in the freed space.
  Total documentation surface (spec + seven READMEs ≈ 1000 lines) is larger than v4's 259 lines, but the spec and the READMEs serve different audiences — protocol-thinkers and CLI-users — and a reader of just the spec can comprehend the protocol without consulting any README.

* When to use

When two Claude sessions in different projects (same machine or different machines on the same Tailscale tailnet) need to coordinate on a shared task that one session can't complete alone — typically because one has tooling, context, or MCP access the other doesn't.

Examples that fit:

- Session A asks session B to apply a workflow patch in B's project, then verify it.
- Session A runs a long task and needs session B to monitor results in B's domain.
- Two sessions co-design a workflow.

Examples that don't fit:

- A simple file handoff that doesn't require iteration.
- A task one session can do alone.
- Cross-tailnet or cross-organization. The protocol is local-tailnet-scoped.

* Protocol

** File location

Each project has =inbox/from-agents/= as its agent-comms mailbox. Create the directory if it doesn't exist; set permissions =chmod 700= and ownership to the user.

- Sender writes to receiver's =inbox/from-agents/=.
- Receiver polls (or watches) =inbox/from-agents/=, *not* the parent =inbox/=.
- The parent =inbox/= stays reserved for human-triage items.
- Out-of-band artifacts (PDFs, datasets) live at =inbox/from-agents/artifacts/=. Reference by relative path in the message body.

The user does NOT write directly to =from-agents/=. To inject input into a running conversation, the user tells one of the agents in that agent's session; the agent writes the input as a normal message attributed to the user.

** File naming

=YYYYMMDDTHHMMSSZ-from--.org=

- Timestamp is UTC ISO 8601 compact. The trailing =Z= is mandatory.
- =from-= prefix.
- == is a stable kebab-case slug across the back-and-forth. Reusable across time; ordering relies on filename timestamps.
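
The timestamp rule above can be sketched in a few lines. This is a minimal illustration, not the =cross-agent-send= implementation: the names =STAMP_FORMAT=, =utc_stamp=, and =parse_stamp= are hypothetical, and only the timestamp prefix of the filename is handled.

```python
from datetime import datetime, timezone

# Compact ISO 8601 in UTC with a literal trailing Z, as the naming rule requires.
STAMP_FORMAT = "%Y%m%dT%H%M%SZ"


def utc_stamp(moment: datetime) -> str:
    """Format a timezone-aware datetime as the filename timestamp prefix."""
    if moment.tzinfo is None:
        # A naive datetime does not identify an instant, so refuse it.
        raise ValueError("naive datetime: cannot determine the UTC instant")
    return moment.astimezone(timezone.utc).strftime(STAMP_FORMAT)


def parse_stamp(stamp: str) -> datetime:
    """Parse a filename timestamp prefix; a stamp without the trailing Z is rejected."""
    return datetime.strptime(stamp, STAMP_FORMAT).replace(tzinfo=timezone.utc)
```

Because =utc_stamp= converts to UTC before formatting, a local time with an explicit offset and its UTC equivalent produce the same prefix.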

Frontmatter =#+TIMESTAMP= carries the same instant in local time with explicit offset. The two MUST refer to the same instant.

The implementation (=cross-agent-send=) generates the canonical filename from the message's frontmatter (=CONVERSATION_ID=, current UTC time) and the sender's project context. Senders supply only the message body file; the script handles naming. Senders MUST NOT pre-name files in this format and pass them through; the script overwrites with its own canonical name to ensure consistency and enable the sender-side max-seen sequence-collision-reduction scan.

GPG signatures live in a sibling file =YYYYMMDDTHHMMSSZ-from--.org.asc=. Receivers verify before processing. See =* Writes are atomic= for the two-file delivery ordering rule.

** Frontmatter

Required:

#+begin_example
#+TITLE:
#+CONVERSATION_ID:
#+MESSAGE_TYPE:
#+SEQUENCE:
#+TIMESTAMP:
#+PROTOCOL_VERSION: 5
#+end_example

Optional:

#+begin_example
#+REQUIRES_TOOLS:
#+RELEASE_STATUS:
#+WORKFLOW_VERSION:
#+end_example

Receiver sanity-checks frontmatter before acting. Missing or malformed frontmatter → surface to user, don't proceed. Mismatched =PROTOCOL_VERSION= → receiver writes a =query= asking the originator to upgrade.

** Identity

Messages are GPG-signed by the sender. Receivers verify the detached signature before processing the message body.

The implementation (=cross-agent-send=) signs automatically with the sender's configured key (the user's primary GPG key by default; configurable via =--key= flag or environment). Receivers verify automatically against the keys in their GPG keyring.

Identity is a three-layer story:

1. *Tailscale layer.* Only tailnet members can reach the rsync-over-SSH endpoint at all.
2. *POSIX layer.* =chmod 700= on =from-agents/= means only processes running as the directory's owner can write.
3. *GPG layer.* Sender's signature on each message proves the message originated from a process holding the key.

Three independent layers.
Per-user GPG (using existing keys) gives a correctness check more than a security boundary — unsigned messages are almost certainly bugs, not attackers. That's still load-bearing.

** Writes are atomic

Writers MUST use a temp-file + rename pattern (=mktemp= + =mv= within the same filesystem) so receivers never see partial files. The implementation script (=cross-agent-send=) handles this. Receivers ignore =.tmp.*= files, processing only the final renamed name.

*Two-file ordering.* When a message has a sibling GPG signature file (=.org.asc=), the writer MUST rename the =.asc= to its final name *before* renaming the =.org=. Two =mv= operations are not atomic together — without this ordering, a receiver could read the =.org= in the window between the two renames and fail GPG verify because the =.asc= hasn't landed yet. The rule: receiver only acts on =.org= files, and a =.org= without a corresponding =.asc= means the signature is genuinely missing (not still in flight).

** Sequence numbering

=#+SEQUENCE= is a *hint*, not a strict counter. Canonical order is =#+TIMESTAMP=. Sequences may collide under rapid back-and-forth (both sides write what they think is sequence N near-simultaneously). Treat collision as a normal protocol event.

*Receiver-side dedup rule.* When a new file shares =CONVERSATION_ID= + =SEQUENCE= with an already-processed message, compare SHA-256 hashes. Identical hashes → silent dedup, treat as a retry. Different hashes → process both, ordered by =#+TIMESTAMP=.

*Sender-side collision-reduction (best-effort).* Before picking sequence, scan the receiver's =from-agents/= for the highest existing sequence in this conversation across both sender prefixes. Use =max(seen) + 1=.

** Message types

- *request* — a side asks for work, input, or a decision. Sequence 1 is always =request=.
- *progress* — work-in-progress checkpoint. "Here's where I am, no action needed from you, more coming." Originator's poll loop should NOT page the user on progress messages.
- *query* — either side asks a clarifying question that blocks further work. Originator's poll loop SHOULD surface this immediately. Originator answers and work continues.
- *pushback* — receiver formally disagrees with the request and has *not* started the work. Carries reasoning. Distinct from =query= because the originator's response path differs.
- *complete* — receiver signals the requested work is done. Triggers verification.
- *release* — terminal type. Originator writes after verifying =complete=. Carries =RELEASE_STATUS= to disambiguate the closure mode.
- *escalate* — punts the conversation to the user for adjudication. Both sides pause polling on =escalate=; the user resolves.

Reply expectation is implied by type: =request=, =query=, =pushback=, =escalate= expect a reply; =progress=, =complete=, =release= don't.

** Conversation lifecycle

A conversation is a directed loop between an originator (issued sequence 1) and a receiver:

1. Originator writes =request= (sequence 1). Begins polling for replies.
2. *Optional acknowledgment.* Receiver may write a =progress= at sequence 2 to acknowledge receipt and set expectations. Required if work will take >5 minutes (so the originator's poll loop doesn't waste wakes).
3. *Optional echo-back.* For ambiguous or large requests, receiver writes a =progress= that restates work items and announces "starting now unless you push back within N minutes."
4. Receiver works. May write =progress= updates. =query= mid-work if blocked. =pushback= if the request is wrong.
5. Receiver writes =complete=. Begins polling for =release=.
6. Originator reads, *verifies the deliverable directly*. For subjective deliverables, verification is the originator's editorial accept.
7. If verified: =release= with =RELEASE_STATUS: complete=. If problems: new =request= (next sequence number).
8. Receiver sees =release=, stops polling.

The verification step is load-bearing. =complete= is a *claim*; =release= is *verification*.
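
The reply-expectation rule and the verification gate in step 7 can be written down as data plus one small function. This is an illustrative sketch only; the names =MESSAGE_TYPES=, =EXPECTS_REPLY=, =expects_reply=, and =originator_reply_to_complete= are hypothetical and do not come from any of the scripts.

```python
from typing import Optional, Tuple

# The seven message types defined by the protocol.
MESSAGE_TYPES = (
    "request", "progress", "query", "pushback",
    "complete", "release", "escalate",
)

# Types whose sender waits for an answer; the other three do not.
EXPECTS_REPLY = frozenset({"request", "query", "pushback", "escalate"})


def expects_reply(message_type: str) -> bool:
    """Return True when the sender of this message type should expect a reply."""
    if message_type not in MESSAGE_TYPES:
        raise ValueError(f"unknown MESSAGE_TYPE: {message_type!r}")
    return message_type in EXPECTS_REPLY


def originator_reply_to_complete(verified: bool) -> Tuple[str, Optional[str]]:
    """Lifecycle step 7: return (MESSAGE_TYPE, RELEASE_STATUS) for the originator's answer.

    A verified deliverable closes the conversation; anything else reopens it
    with a new request, which carries no RELEASE_STATUS.
    """
    if verified:
        return ("release", "complete")
    return ("request", None)
```

Keeping the reply expectation in a single set means a poll loop can decide whether to keep waiting without inspecting the message body.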

** Pushback path

On receiving a =pushback=, the originator chooses:

1. *Revise* — new =request= with adjusted scope.
2. *Insist* — new =request= addressing the pushback's reasoning, standing by direction.
3. *Withdraw* — =release= with =RELEASE_STATUS: withdrawn-after-pushback=.

*Deadlock cap.* After two pushback-insist exchanges, the next message MUST be =MESSAGE_TYPE: escalate=. Both agents pause polling; the user resolves.

** =RELEASE_STATUS= values

| Status | Meaning |
|---+---|
| =complete= | Goal achieved, originator verified |
| =cancelled= | Originator changed their mind mid-conversation |
| =withdrawn-after-pushback= | Originator chose option 3 on receiver's =pushback= |
| =abandoned-after-escalation= | User adjudicated and chose to close the conversation |
| =abandoned-after-timeout= | Originator never returned to verify; receiver timed out, and the thread was formally closed later |

** Async fallback

If the originator session ends between =request= and =complete=, the receiver's =complete= goes unverified. Receiver behavior:

- Polls for =release= for up to ~24 hours (implementation default).
- After timeout, writes a final =progress= message ("treating as terminal-without-verification; originator never returned to release") and stops polling. Receiver does NOT write =release= itself — that would contradict the lifecycle rule that =release= is the originator's terminal action.
- Next time the originator project starts, the unreleased =complete= is surfaced as a startup item. The user can issue a late =release= (with whichever =RELEASE_STATUS= fits) or open a fresh conversation to revisit. =RELEASE_STATUS: abandoned-after-timeout= is used at that point if the user wants to formally close the orphaned thread.

** Escalation

A side writes =escalate= when:

- Pushback-insist deadlock cap reached.
- Conversation has stalled (no productive movement in N exchanges).
- A reply-expecting message has gone unanswered past timeout.

The body summarizes both sides' positions briefly enough to read in about 60 seconds.
Both agents pause polling; the user resolves.

* Implementation notes

This section describes how to operate the protocol. Operational detail lives in the seven scripts' READMEs.

** Recommended scripts

| Script | Replaces user action | README |
|---+---+---|
| =cross-agent-send = | Filename generation, GPG sign, atomic write, peer lookup, rsync push, retry+backoff, failure surfacing — seven mechanical sender-side steps. Frontmatter and message body are still author-supplied. | =cross-agent-send.md= |
| =cross-agent-recv = | Frontmatter sanity-check, =PROTOCOL_VERSION= verify, GPG verify, SHA-256 dedup, =REQUIRES_TOOLS= check — five mechanical receiver-side steps. Output is a structured decision (=process= / =dedup= / =query= / =reject=) the agent acts on. | =cross-agent-recv.md= |
| =cross-agent-watch= | Manually checking inboxes; "did I get a message?" | =cross-agent-watch.md= |
| =cross-agent-status= | Walking each project to count pending messages | =cross-agent-status.md= |
| =cross-agent-discover= | Remembering project topology and reachability | =cross-agent-discover.md= |
| =cross-agent-halt [reason] [--tailnet]= | Visiting each session to stop polling, restarting Claude Code, or hand-killing processes when comms go runaway. =--tailnet= propagates HALT to all peers. | =cross-agent-halt.md= |
| =cross-agent-resume [--tailnet]= | Manually clearing the HALT state and restarting the watcher. Per-session polling does NOT auto-resume — the user re-engages each session explicitly. | =cross-agent-resume.md= |

The scripts are tools the user runs from any terminal. They do not depend on agent context — =cross-agent-status= run from a fresh shell works.

A reader can comprehend this protocol from this spec alone. Script READMEs add operational detail that makes the protocol practical to use, but understanding the protocol's semantics requires only this document.

** Polling

Default cadence: 270 seconds (4.5 min). Sits just under the 5-minute prompt-cache TTL.
If a side needs to slow down (heads-down work, idle wait), it writes a =progress= message saying so in prose. The other side adapts. There are no named polling modes.

After ~12 empty polls in a row (about 54 minutes at the default cadence), the poll loop surfaces the silence to the user.

A future runtime with native filesystem-event support could replace polling for active sessions; =cross-agent-watch= already provides event-driven notifications outside active sessions.

** User multi-tasking

- *Deferral.* If the user's last message in the agent's session was less than 60 seconds ago AND a poll fires, queue the inbox check until either the user sends another message OR 5 minutes pass without further input.
- *Surfacing.* On the next user-facing response: "While we were working on X, a cross-agent message landed from . It's a == — want me to handle it now or after we finish?"
- *Mid-question.* Answer the user first.
- *Project switch.* If the user moves to the receiver project mid-conversation, the receiver agent surfaces the in-flight thread on first user prompt.
- *Conversation state.* Always include in any response that mentions a cross-agent thread: " at sequence N, awaiting ."

** Failure modes

The seven scripts surface most failures with concrete error messages. Spec-level failure modes:

- *Malformed frontmatter on a received file.* Surface to user; do not act.
- *Mismatched =PROTOCOL_VERSION=.* Receiver writes =query= asking originator to upgrade.
- *Missing or invalid GPG signature.* Receiver surfaces "unsigned/unverified message"; refuses to act.
- *Sequence collision* with non-matching SHA-256. Process both, ordered by timestamp.
- *Required tool unavailable.* Receiver checks =REQUIRES_TOOLS= during frontmatter-sanity-check (before any work begins). On a missing tool, receiver writes =query= asking the originator to reframe the request to avoid the unavailable tool. Originator may revise (new =request=) or withdraw (=release= with =RELEASE_STATUS: cancelled=).
  =query= is the right type rather than =pushback= because missing-tool is a capability gap, not disagreement.
- *Runaway resource usage.* User invokes =cross-agent-halt= globally (or =cross-agent-halt --tailnet= for cross-machine). HALT file stops all components within one polling cycle (~5 min). See =* Halt mechanism= for the layered checks.
- *User halts mid-conversation.* Both sides write a final =progress= note ("HALT fired; pausing"); polling stops within one cadence; conversations resume on explicit per-session re-engage after HALT clears.
- *HALT file accidentally created* (typo, errant =touch=). =cross-agent-status= prominently flags HALT active; user clears with =cross-agent-resume=. Cost: no messages send during the typo window.
- *HALT file unreadable* (perms wrong, partial write). Each component fails-closed (treats as halted) and reports "HALT file present but unreadable; treat as halted." Safer than fail-open.

Operational failures (rsync push fails, watcher dies, peer unreachable) live in the script READMEs' failure-mode tables.

* Halt mechanism

A failsafe to stop all cross-agent activity on a machine without visiting individual sessions or restarting Claude Code. Designed for the runaway-polling case: an agent has spun up conversations with N other agents, polling is eating CPU, and the user needs to stop everything *now*.

** The HALT file

Path: =~/.config/cross-agent-comms/HALT=. Existence triggers halt across all components on the machine. The file's body may carry an optional human-readable reason (reviewed by the user later when deciding to resume).

User commands:

#+begin_example
$ touch ~/.config/cross-agent-comms/HALT   # halt
$ rm ~/.config/cross-agent-comms/HALT      # resume
#+end_example

Or via convenience scripts (=cross-agent-halt= / =cross-agent-resume=) that also handle the watcher service and cross-machine propagation.

** Layered checks (the failsafe property)

Every component MUST check the HALT file.
The "any one component stops the system independently" property is what makes this failsafe — the system doesn't depend on a single point doing the right thing.

| Component | Check timing | Behavior on HALT |
|---+---+---|
| =cross-agent-send= | At start of send + between =.asc= and =.org= rsync + between retry iterations | Refuse to start new send; complete current step then exit. Worst case: one in-flight send finishes within a few seconds. |
| =cross-agent-recv= | Before any verify or dedup | Leave inbound message in place — do NOT dedup, reject, or move. Resume picks it up via cold-start handling. |
| =cross-agent-watch= | At iteration start | Suppress notifications; log only. Continues running, no-op until HALT clears. |
| =cross-agent-status= | At start | Print prominent "⚠ HALT ACTIVE" banner before normal output. Read-only, continues. |
| =cross-agent-discover= | At start | Print HALT banner; continue read-only enumeration. |
| Agent polling loop | First action on every wake | Write a final =progress= note to any active conversation ("HALT fired; pausing"), do NOT reschedule, surface "halt active" to user. Polling decays within one cadence (~5 min). |
| Agent user-facing responses | Every response while HALT is set | Append "(HALT active; cross-agent comms paused)" to the response. On HALT clear, the next response says "(HALT cleared; cross-agent comms ready to resume — say so to re-engage polling)." Persistent, not just first-response — keeps awareness alive. |
| Conversation initiator | Before writing sequence 1 of any new conversation | Refuse and surface to user. |
| Startup workflow | Phase A on session start | If HALT exists, surface immediately and skip cross-agent inbox checks. |

The agent polling-loop check is the load-bearing one for "stops eating CPU." Wake-ups that are already scheduled still fire, but under HALT each one does no work and does not reschedule itself. Within one polling cadence (~5 min) all polling stops.
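
A per-component HALT check, including the fail-closed behavior for an unreadable file, might look like the following. This is a minimal sketch under stated assumptions: =halt_state= and its return shape are hypothetical, and the real scripts may implement the check differently. Only the path and the "present but unreadable" message come from this spec.

```python
from pathlib import Path
from typing import Tuple, Union

# Path given in "The HALT file" section.
HALT_PATH = Path.home() / ".config" / "cross-agent-comms" / "HALT"

UNREADABLE = "HALT file present but unreadable; treat as halted."


def halt_state(path: Union[str, Path] = HALT_PATH) -> Tuple[bool, str]:
    """Return (halted, reason) for the given HALT path.

    Existence alone triggers halt; the body is only an optional reason.
    Any error other than "file does not exist" is treated as halted
    (fail-closed), matching the unreadable-HALT rule.
    """
    try:
        with open(path, "r", encoding="utf-8", errors="replace") as handle:
            reason = handle.read().strip()
    except FileNotFoundError:
        return (False, "")
    except OSError:
        # Wrong permissions, a directory in place of the file, I/O errors.
        return (True, UNREADABLE)
    return (True, reason)
```

A polling loop would call this as its first action on every wake and skip rescheduling whenever the first element is true.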

*Fail-closed on unreadable HALT.* If the HALT file exists but is unreadable (wrong permissions, partial write), components MUST treat as halted. Safer than fail-open.

** Resume asymmetry (deliberate)

Halt is automatic everywhere. Resume requires explicit user intent per-session.

When the user removes HALT (or runs =cross-agent-resume=), components stop refusing to act, but agent polling does NOT auto-resume. The user must open each session and tell that agent to resume polling for its conversations.

The asymmetry exists because:

1. Auto-resume could silently invert intentional kills. If the user halted because a session was misbehaving, removing HALT shouldn't quietly revive it.
2. Per-session resume forces the user to look at each session and confirm the situation is resolved before re-engaging.

** Cross-machine halt

=cross-agent-halt --tailnet= iterates =peers.toml= and SSH-touches HALT on each peer. Same shape for resume. Reports per-peer status with non-zero exit on partial halt:

#+begin_example
$ cross-agent-halt --tailnet
Halting velox.local ✓ (HALT file written)
Halting bastion.local ✗ (ssh exit 255: no route to host)
Halting locally ✓ (HALT file written)
PARTIAL HALT: 2/3 machines halted. bastion.local needs manual halt.
Exit 1.
#+end_example

Scripting can detect partial halt via the exit code. Same pattern for =--tailnet= on resume.

* Limitations

- *Local-tailnet only.* Filesystem IPC + rsync over SSH. Cross-tailnet or cross-organization is out of scope.
- *Identity has three layers (Tailscale + POSIX + GPG)* but no message-content encryption. Confidentiality is not the goal; signing is correctness, not secrecy.
- *Single-receiver per conversation.* Fan-out to multiple receivers requires manually orchestrating multiple parallel conversations.
- *Polling is best-effort.* A wake may be delayed by an in-flight tool call until the runtime is idle. =cross-agent-watch= mitigates by offering event-driven notifications.
- *Project-extension drift.* If two projects' =.ai/project-workflows/= modify shared workflow definitions in incompatible ways, cross-agent assumptions can diverge silently. The optional =#+WORKFLOW_VERSION= advisory field is informational only in v5 — no implementation reads or acts on it. A future version may add enforcement on mismatch (e.g. receiver writes =query= asking which side is stale). Today, alignment is verified manually before high-stakes conversations.

* Persistence after release

Conversation files persist by default. The conversation log is the audit trail. Manual archival is fine if the inbox grows unmanageable.

Suggested cadence: once the conversation has been =release='d AND the work it produced has shipped, archive both projects' message files into =.ai/sessions/cross-agent/= as a flat directory — no per-conversation subdirectories. Rename each archived file to lead with the conversation-id so messages from the same conversation cluster on =ls=: =--from-.org= (and the matching =.asc= sibling, if present). Inbox filenames lead with the timestamp because chronological arrival is what matters in =from-agents/=; archives invert that because grouping by conversation is what matters when reading history.

Keep the =.asc= signatures alongside the =.org= files in archive — they're small and document the GPG verification chain.

Old messages don't affect protocol behavior (=cross-agent-status='s pending semantics correctly ignore released messages) but the =from-agents/= directory grows indefinitely without manual archival. =cross-agent-status= performance degrades noticeably when a project's =from-agents/= exceeds a few hundred files. =cross-agent-init= (deferred to v6) would include an archival sub-command.

* Open questions

- *=cross-agent-init= and =cross-agent-compose= helper scripts.* =-init= would be one-command project bootstrap (creates =inbox/from-agents/= with =chmod 700=, installs the =cross-agent-watch= systemd path unit, validates peer config, runs a discovery probe). =-compose= would be interactive frontmatter authoring (prompts for required fields, produces a draft message file). Both deferred to v6. Current onboarding requires manual =mkdir= + systemd setup per =cross-agent-watch.md='s install recipe; current message authoring requires writing the file by hand or via a small in-agent template.
- *Hard conversation timeout.* The async-fallback timeout is implementation-default ~24 hours. Right number depends on use case; tighten as patterns emerge.
- *=paused= polling state.* Today there's no clean signal for "pause without ending." Add when first user complaint surfaces.
- *Multi-LLM context.* If we ever bring in a non-Claude agent, the protocol's natural-language framing may need formalization.

* Examples

** =prep-fixup= conversation (2026-04-26 → 2026-04-27)

Eleven exchanges between homelab and career produced the v4 spec by iterative critique-and-simplification. Three real-time sequence collisions during the conversation drove the sequence-as-hint rule that landed in v4 and persists in v5. Files at =~/projects/{homelab,career}/inbox/from-agents/= named =*-prep-fixup.org=. Worth re-reading when designing future cross-agent flows.

** =comms-cold-start-discovery= conversation (2026-04-27)

The follow-up that produced this v5 spec. It covered cold-start, watcher tooling, agent discovery, GPG identity, SHA-256 dedup, atomic writes, POSIX perms, script absorption, and process-vs-text simplification. Tonight's first cold-start in real time (career session went dormant after =prep-fixup= release; Craig's user-injection re-engaged it) is the worked demonstration of the v5 user-injection rule.
Files at =~/projects/{homelab,career}/inbox/from-agents/= named =*-comms-cold-start-discovery.org=.
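
The sequence collisions noted in the =prep-fixup= conversation are the case the receiver-side dedup rule resolves. The sketch below is illustrative only and is not taken from =cross-agent-recv=: the =Message= class, =classify=, and =processing_order= are hypothetical names, and it assumes the SHA-256 hash is computed over the raw bytes of the message file. The decision labels =process= and =dedup= are the ones listed for =cross-agent-recv= in the scripts table.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass(frozen=True)
class Message:
    conversation_id: str
    sequence: int
    timestamp: datetime  # parsed from #+TIMESTAMP, timezone-aware
    body: bytes          # raw file bytes (assumption: this is what gets hashed)


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def classify(new: Message, processed: Iterable[Message]) -> str:
    """Return "dedup" for an identical retry, otherwise "process".

    Only messages sharing CONVERSATION_ID and SEQUENCE are compared.
    A collision with a different hash is a normal event: both are processed.
    """
    new_hash = sha256_hex(new.body)
    for old in processed:
        same_slot = (old.conversation_id, old.sequence) == (
            new.conversation_id, new.sequence)
        if same_slot and sha256_hex(old.body) == new_hash:
            return "dedup"
    return "process"


def processing_order(messages: Iterable[Message]) -> List[Message]:
    """Canonical order is the timestamp, not the sequence hint."""
    return sorted(messages, key=lambda m: m.timestamp)
```

Equality is binary here: a single differing byte makes the colliding message a distinct one to process, which mirrors the move away from the fuzzy match threshold.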