aboutsummaryrefslogtreecommitdiff
path: root/.ai/scripts/cross-agent-comms/cross-agent-resume.md
blob: 8aa83575797cc90e46a007c86d9ed61988f160b5 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
# cross-agent-resume

**Purpose.** Clear the HALT file and restart the watcher service. Counterpart
to `cross-agent-halt`. Resuming agent polling is **explicit per-session** —
this script doesn't auto-revive halted polling loops; you tell each session
to re-engage.

## Usage

```
cross-agent-resume [--tailnet]
```

### Flags

| Flag | Default | Purpose |
|---|---|---|
| `--tailnet` | local only | Clear HALT on every peer in `peers.toml` via SSH over Tailscale. |

## Behavior

### Local resume (default)

1. Remove the HALT file: `rm -f ~/.config/cross-agent-comms/HALT`. (Use `-f`
   so a missing file isn't an error — running resume when not halted is safe.)
2. Restart the watcher service: `systemctl --user start cross-agent-watch.path`.
3. Print a summary:
   ```
   ✓ HALT file removed
   ✓ Watcher service started (cross-agent-watch.path)
   - cross-agent-send and cross-agent-recv will accept new operations.
   - Inbound messages held during halt will be picked up by the watcher.
   - Agent polling does NOT auto-resume. To re-engage polling in a paused
     session, open that Claude session and tell the agent to resume.
   ```
4. Exit 0.

### Cross-tailnet resume (`--tailnet`)

1. Apply local resume steps 1-2 first.
2. Read `peers.toml` for the list of remote machines.
3. For each peer, SSH:
   ```
   ssh <user>@<host> "rm -f ~/.config/cross-agent-comms/HALT && \
                       systemctl --user start cross-agent-watch.path"
   ```
4. Track per-peer success/failure:
   ```
   Resuming velox.local       ✓ (HALT cleared, watcher started)
   Resuming bastion.local     ✗ (ssh exit 255: no route to host)
   Resuming locally           ✓

   PARTIAL RESUME: 2/3 machines resumed. bastion.local still halted.
   ```
5. Exit 0 if all peers resumed; exit 1 on any failure.

## Why agent polling doesn't auto-resume

Two reasons the asymmetry is deliberate:

1. *Auto-resume could silently invert intentional kills.* If you halted
   because a session was misbehaving, removing HALT shouldn't quietly revive
   that session's polling. You re-engage explicitly so you're aware of which
   sessions came back online.

2. *You may want to inspect before resuming.* After a halt, you might want to
   read pending messages, fix configuration, or kill a particular Claude
   session entirely. Per-session resume forces that pause.

## Re-engaging polling in a Claude session

After `cross-agent-resume`, open the relevant Claude session and say something
like:

```
HALT is cleared; resume polling.
```

The agent will check the HALT file (now absent), re-create its polling
schedule, and continue the in-flight conversation from wherever it left off.
The conversation file is intact; the receiver will pick up any new messages
that arrived during the halt window.

## Failure modes

| Symptom | Cause | Fix |
|---|---|---|
| HALT file doesn't exist | Already resumed (or never halted) | OK — `-f` makes this a no-op. |
| `systemctl --user start` fails | Watcher service not installed | Install per `cross-agent-watch.md`'s systemd recipe. |
| `--tailnet` resumes some peers but not others | Same as halt: peer unreachable | Per-peer status reported; resolve manually for unreachable peers. |
| Permission denied removing HALT file | File owned by another user | Check ownership; HALT files should be owned by the running user. |

## Examples

```bash
# Local resume after a halt
cross-agent-resume

# Resume all tailnet peers + local
cross-agent-resume --tailnet
```

## Recovery flow

After a halt:

1. Investigate whatever caused the halt (runaway loop, bad config, etc.).
2. Fix the underlying issue.
3. Run `cross-agent-resume`.
4. Open each Claude session that was polling and tell its agent to re-engage.
5. Confirm operation with `cross-agent-status`.

## See also

- `cross-agent-halt` — counterpart that creates the HALT file.
- `cross-agent-status` — verify HALT cleared and see pending messages.
- `cross-agent-comms.org` — protocol spec, `* Halt mechanism` section.