aboutsummaryrefslogtreecommitdiff
path: root/docs/design
diff options
context:
space:
mode:
Diffstat (limited to 'docs/design')
-rw-r--r--docs/design/2026-06-29-waybar-network-module-spec.org1557
1 files changed, 1422 insertions, 135 deletions
diff --git a/docs/design/2026-06-29-waybar-network-module-spec.org b/docs/design/2026-06-29-waybar-network-module-spec.org
index bf195b8..37b87b0 100644
--- a/docs/design/2026-06-29-waybar-network-module-spec.org
+++ b/docs/design/2026-06-29-waybar-network-module-spec.org
@@ -2,12 +2,26 @@
#+AUTHOR: Craig Jennings & Claude
#+DATE: 2026-06-29
+* Status
+
+Ready for Phase 1; Ready-with-caveats overall. Three Codex review rounds + Craig's
+cj comments are all incorporated — every finding has a disposition and the findings
+cookie reads complete ([31/31]), with no open decisions (enterprise scope settled:
+open + WPA-PSK in v1, 802.1X add/edit vNext, activate-only). The cj comments
+reshaped several decisions (no separate credential store — use NM's own; =net
+doctor= + Makefile console-recovery in v1; rfkill + full-stack-bounce repair;
+airplane module absorbed; VPN a later Phase 5). The only remaining caveats are
+Phase-2/3 build unknowns named under Open items (gtk4-layer-shell anchoring, the
+=captive= =--json= refactor) — not Phase-1 blockers. Phase 1 (indicator + console
+recovery) is ready to build.
+
* Goal
One waybar network component that does the whole job: shows connection state
(including the missing "associated but no internet / captive portal" state),
-manages connections from a dropdown (nmcli-backed, optional GPG-encrypted secret
-store), and runs the network diagnostics and remediation off the same place
+manages connections from a dropdown (nmcli-backed; secrets stay in
+NetworkManager's own store, no separate credential file), and runs the network
+diagnostics and remediation off the same place
(captive-portal detection + forcing, bounce/reset, gateway/DNS checks, speed
test).
@@ -15,8 +29,9 @@ It unifies three todo tasks that are really one feature:
- =[#C]= "archsetup Waybar Wi-Fi module should show no-internet state" — the
indicator state plus the 2026-06-22 roam expansion (bounce, diagnostics, speed
test off the component).
-- =[#B]= "Network-manager dropdown, nmcli-backed with GPG-stored secrets" — the
- management dropdown.
+- =[#B]= "Network-manager dropdown, nmcli-backed" — the management dropdown. (The
+ todo task's original "GPG-stored secrets" framing is superseded: secrets stay in
+ NM's own store, decision 5.)
- The network diagnostics already shipped in =captive= (the hotel/captive-portal
tool, formerly =login-page=) become this module's diagnostics engine rather
than a standalone CLI.
@@ -26,31 +41,61 @@ It unifies three todo tasks that are really one feature:
** In
- *Indicator* — wifi/ethernet icon + signal + SSID, plus an internet sub-state:
online / captive / no-internet / connecting / disconnected / airplane.
+- *Absorbs the airplane module* — the airplane state + toggle move into
+ =custom/net= (airplane is a network concern). Once this ships, the standalone
+ =custom/airplane= module, the =waybar-airplane= + =airplane-mode= scripts, their
+ =tests/=, and the css are deleted (listed under Files touched). The
+ desktop-settings panel (sibling =[#B]=) no longer needs an airplane row.
- *Interface-correct* — targets the wifi (or chosen) device, not the
default-route interface, so an active USB tether or wired link can't mask
wifi state. (Same lesson =captive= fixed; the current =custom/netspeed= keys
off the default route and has the bug.)
- *Connection management (panel)* — list saved connections most-recently-used
- first, live signal for in-range wifi, click to switch; add / edit / remove;
- ethernet↔wifi and wifi↔wifi switching even when a link appears mid-session.
-- *Diagnostics (panel)* — captive probe (204-vs-portal) with the extracted
- portal URL and an Open button; bounce/reset (fresh MAC); gateway ping; DNS
- config + temporary 1.1.1.1 override test.
+ first, live signal for in-range wifi, click to switch; add / edit / remove for
+ open + WPA-PSK; activate any existing saved profile (including enterprise ones
+ NM already stores); ethernet↔wifi and wifi↔wifi switching even when a link
+ appears mid-session.
+- *Diagnostics (panel)* — read-only Diagnose (captive probe 204-vs-portal with
+ the extracted portal URL, gateway ping, DNS config) separated from mutating
+ Repair. Repair has tiers, lightest first: rfkill-unblock, per-connection reset
+ (fresh MAC), full-stack bounce (=nmcli networking off/on=, then restart
+ NetworkManager if that fails), and the temporary 1.1.1.1 override test. Each
+ Repair action confirms and verifies cleanup.
- *Speed test (panel)* — down/up/ping with a progress indicator and last-result
- shown.
-- *Credential store* — optional GPG-encrypted connection+secret file under
- =~/.config=, opt-in (default unencrypted), passphrase cadence via gpg-agent
- TTL. Supplements NetworkManager, does not replace it.
+ shown, via the already-installed =speedtest-go --json=.
+- *Connection secrets* — none of our own. Settings and passwords live where NM
+ already keeps them: =/etc/NetworkManager/system-connections/*.nmconnection=
+ (root-only =0600=, the PSK/EAP secret stored inline). We read/write them through
+ nmcli, which handles the privilege. No separate file, no GPG, no gpg-agent — one
+ fewer dependency, and NM's store is already the secure-at-rest source of truth.
- *Persistence* — connectivity probe result cached in the runtime dir so the
bar reads it cheaply between probes.
+- *Observability* — a redacted JSONL event log so a post-failure session can
+ diagnose without re-running destructive actions.
** Out (v1, note for later)
- No replacement of NetworkManager's connection engine. NM stays the thing that
connects; we drive it via nmcli.
-- No VPN / wireguard management (separate tooling already exists).
-- No per-connection captive-portal auto-login automation beyond opening the
- portal page.
+- No add/edit *form* for WPA-Enterprise / 802.1X in v1. The reason is effort vs
+ payoff: 802.1X has many interdependent fields (CA cert, client cert, identity,
+ anonymous identity, phase-2 auth) where a wrong entry silently fails auth, so a
+ trustworthy form is a lot of UI for connections Craig rarely adds (open +
+ WPA-PSK covers home, hotels, and phone hotspots). v1 still *activates* existing
+ saved enterprise profiles and points editing at =nmtui=/=nmcli=. Settled
+ (Craig, 2026-06-29): enterprise add/edit is vNext — 24 saved profiles on velox,
+ 0 enterprise, so the form would be unused UI; if one ever appears nmtui adds it
+ once and the module activates it thereafter.
+- No per-connection captive-portal *auto-login* in v1. (That would mean storing a
+ portal's login form answers — room number, surname, a checkbox — and replaying
+ them automatically when a known portal is detected, so the page never appears.
+ Out for v1 because every portal's form differs and it means storing per-venue
+ answers; v1 just opens the portal for you.)
- No graphing/history of speed-test results beyond the last run.
+- No static-IP / proxy / metered / MAC-randomization editing in v1 (activate
+ existing, edit elsewhere).
+- No VPN / WireGuard management in v1, but it's a planned later phase (Phase 5),
+ not a permanent exclusion — it folds the existing archsetup wireguard tooling
+ into the same panel/CLI.
- The desktop-settings dropdown (sibling =[#B]=) is a separate module, but it
shares the GTK4 layer-shell panel shell built here.
@@ -58,11 +103,12 @@ It unifies three todo tasks that are really one feature:
Three layers. Keep the bar cheap, the panel rich, the logic in one tested place.
-1. *Engine* — a =net= Python package (src-layout, pytest), exposing a CLI. Wraps
+1. *Engine* — a =net= Python package (src-layout, unittest), exposing a CLI. Wraps
every nmcli op and owns the diagnostics. Emits JSON. This is the testable
- core (fake =nmcli= / =curl= on PATH, like the existing =waybar-netspeed= and
- =waybar-sysmon= test harnesses). Precedent: pocketbook is Python in the
- dotfiles repo; =wtimer= is Python for the same testability reason.
+ core (fake =nmcli= / =curl= / =speedtest-go= on PATH, like the existing
+ =waybar-netspeed= and =waybar-sysmon= test harnesses). Precedent: pocketbook is
+ Python in the dotfiles repo; =wtimer= is Python for the same testability
+ reason.
2. *Indicator* — a thin =waybar-net= script that calls =net status --json= and
renders icon + signal + state + tooltip. Replaces =custom/netspeed=
(throughput folds into the tooltip).
@@ -75,15 +121,48 @@ How the existing pieces map in:
interactive portal-force flow (sudo reset, DNS override, browser launch). Its
cheap portal-detection logic is mirrored natively in the engine for the fast
status path so the bar never blocks on a subprocess. =captive= stays a usable
- standalone CLI.
+ standalone CLI. The refactor (below) extracts its probe + reset into functions
+ the engine can call non-interactively.
- =waybar-netspeed= (sh, shipped) — retired; its throughput sampling moves into
the engine's status output and renders in the indicator tooltip only.
- =nmcli= — the connection backend for every op.
Language note: the engine is Python; the indicator is a thin Python or sh
-wrapper over =net status --json=. The bar path must stay fast (sub-100ms for the
-cheap poll), so the indicator does no network I/O itself — it reads link state
-and the cached connectivity result.
+wrapper over =net status --json=. The bar path must stay fast (see Performance
+budgets), so the indicator does no network I/O itself — it reads link state and
+the cached connectivity result.
+
+* Repository + dependencies
+
+- *Code lives in the dotfiles repo* (=~/.dotfiles=), not archsetup. The =net=
+ package sits in-tree like pocketbook (src-layout, unittest, Makefile target);
+ =waybar-net= and the =net= CLI entry live in the hyprland tier
+ (=hyprland/.local/bin/=). Tests under =tests/net/= and =tests/waybar-net/=.
+ archsetup owns only the *dependency install*, not the code.
+- *archsetup installs the deps* in its Hyprland step: =gtk4-layer-shell=,
+ =python-gobject=, plus =nmcli=/=curl=/=resolvectl=/=rfkill= (already present via
+ NetworkManager/curl/systemd/util-linux). Speed test uses =speedtest-go= (AUR
+ =speedtest-go-bin=, already installed on velox); archsetup adds it to the AUR
+ list. librespeed-cli is the documented fallback if a self-hosted LibreSpeed
+ server is ever wanted. No =gpg= dependency (secrets live in NM's own store).
+- *Daily-drivers*: a stowed-script + AUR-dep feature, so ratio needs the same
+ =git pull= + stow + the archsetup-added deps. Note the manual dep step in the
+ rollout.
+
+** Makefile targets (console recovery is a first-class path)
+=net doctor= and the diagnostics are reachable from a bare TTY when waybar and
+the GUI are down — that's the case where you most need them. The dotfiles
+Makefile carries targets that wrap the =net= CLI so "get back online" is one make
+command from the console:
+- =make online= — =net doctor --fix= (diagnose, then apply the lightest repair:
+ rfkill-unblock → reset → bounce → open portal). The headline recovery target.
+- =make net-doctor= — =net doctor= (read-only diagnose + recommendation).
+- =make net-status= / =make net-diagnose= / =make net-portal= / =make net-reset=
+ / =make net-bounce= — the individual ops.
+- =make test= — already runs =tests/*=; the =net= package's unittest suites are
+ collected the same way.
+These intentionally need only nmcli/curl/rfkill (no GUI, no waybar, no Python
+GTK), so they work from a TTY on a broken graphical session.
* Connectivity model — split cadence
@@ -91,208 +170,1416 @@ The indicator polls every ~2s, but a real internet/captive probe every 2s wastes
battery and can re-trigger a captive portal. So split it:
- *Fast path (every poll, cheap, no network)* — interface, type, SSID, signal,
- IPv4 presence, throughput sample. From nmcli / sysfs only.
+ IPv4 presence, throughput sample. From nmcli / sysfs only. No network I/O.
- *Slow path (cached, TTL ~45s)* — the actual internet/captive probe (the 204
check + meta-refresh portal extraction). Result cached at
=$XDG_RUNTIME_DIR/waybar/net-connectivity.json= with a timestamp.
The indicator reads the cache each poll. When the cache is older than the TTL,
-=net status= kicks =net probe= in the background (non-blocking) and renders the
-last cached sub-state meanwhile. A user-triggered diagnose/reconnect refreshes
-the cache immediately. This keeps the bar responsive and the portal un-poked.
+=net status= kicks =net probe= in the background (spawn + detach, never awaited)
+and renders the last cached sub-state meanwhile. A user-triggered
+diagnose/reconnect refreshes the cache immediately. This keeps the bar
+responsive and the portal un-poked.
+
+** Concurrency, atomicity, staleness
+- *Single-flight* — =net probe= takes a lock file at
+ =$XDG_RUNTIME_DIR/waybar/net-probe.lock= (flock, non-blocking). A second probe
+ while one runs is a no-op, so a flapping 2s poll can't pile up overlapping
+ probes.
+- *Atomic writes* — the cache is written to a temp file + =os.replace= (atomic
+ rename), so a reader never sees a half-written cache. Same pattern as =wtimer=.
+- *Max probe runtime* — the probe has a hard timeout (≤ 6s total: curl
+ =--max-time 5= + slack). On timeout it writes an =unknown= result, never hangs.
+- *Stale classes* the indicator distinguishes: fresh (< TTL), stale (TTL..3×TTL,
+ shown with a subdued/aging hint), expired (> 3×TTL → treat as unknown),
+ unknown (no cache / probe failed). The bar never shows a confident "online"
+ past the expired threshold.
+- *Invalidation* — the cache records the iface + SSID + active-connection UUID it
+ was taken under; a change in any of them invalidates it immediately (a
+ reconnect must not show the old network's verdict).
+- *Crash cleanup* — a stale lock older than the max runtime is ignored/reclaimed.
+
+* Performance budgets (hot path)
+
+The bar exec path (=waybar-net= → =net status=) must stay responsive:
+- *Budget*: =net status= returns in < 100ms typical, < 250ms worst case.
+- *No sleeping in the bar path.* Throughput is sampled from two reads of
+ =/sys/class/net/<iface>/statistics/{rx,tx}_bytes= across the *waybar poll
+ interval itself* (delta since the last cached sample + timestamp), not via an
+ in-process =sleep= like the old =waybar-netspeed=. The cache holds the prior
+ counters.
+- *Subprocess cap*: at most one =nmcli= invocation on the hot path (a single
+ =nmcli -t -f ...= multi-field query), plus sysfs reads. Never a per-field
+ nmcli call.
+- *Every subprocess has a timeout* (=nmcli --wait 2=, =subprocess timeout=). On
+ timeout or error the indicator emits a degraded JSON state (class
+ =net-degraded=, a neutral glyph) rather than blocking or crashing waybar.
+- *Benchmark test*: a fake slow =nmcli= asserts =net status= still returns within
+ budget by falling back to the degraded state.
* Engine — =net= CLI surface
All subcommands take =--json= where a machine reads them. Pure formatting/state
-functions under the CLI; IO (nmcli, curl, file) at the edges.
+functions under the CLI; IO (nmcli, curl, file) at the edges. Every subcommand
+exits non-zero with a JSON error envelope (see JSON schemas) on failure.
- =net status [--json] [--iface IF]= — fast link state + cached connectivity
- sub-state + throughput. The indicator's source.
+ sub-state + throughput. The indicator's source. Never does network I/O.
- =net probe [--iface IF]= — run the connectivity/captive probe now, update the
- cache, print online | captive (+ portal URL) | no-internet. Mirrors =captive='s
- cheap detection natively.
+ cache (single-flight, atomic), print online | captive (+ portal URL) |
+ no-internet | unknown. Mirrors =captive='s cheap detection natively.
- =net list [--json]= — saved connections, MRU order, active flag, plus in-range
wifi with signal.
-- =net up <id|ssid>= / =net down [--iface IF]= — switch / disconnect.
-- =net add= / =net edit <id>= / =net remove <id>= — manage connections; sync the
- GPG store (below).
+- =net up <uuid>= / =net down [--iface IF]= — switch / disconnect. Operates on
+ UUID, not name (see nmcli contract).
+- =net add= / =net edit <uuid>= / =net remove <uuid>= — manage connections
+ (open + WPA-PSK) through nmcli; the secret lands in NM's own
+ =.nmconnection=. Enterprise profiles are activate-only.
- =net rescan [--iface IF]= — wifi rescan.
-- =net diagnose [--json]= — full report: gateway ping, DNS config, DNS 1.1.1.1
- override test, captive probe. Shells to =captive= for the interactive/sudo
- parts; native for the read-only parts.
+- =net diagnose [--json]= — read-only report: gateway ping, DNS config, captive
+ probe. The structured contract below. Doubles as the post-failure snapshot.
+- =net repair <action> [--json]= — mutating remediation, lightest first:
+ =rfkill= (unblock + radio on), =reset= (fresh MAC), =bounce= (full-stack:
+ =nmcli networking off/on=, escalating to =systemctl restart NetworkManager=),
+ =dns-test= (temporary 1.1.1.1 override, auto-reverted). Each confirms via the
+ caller and verifies cleanup.
+- =net doctor [--json] [--fix]= — one-shot "get me online" mode for the console:
+ runs the full diagnose, then applies the lightest repair that fits (unblock
+ rfkill, reset, bounce, open portal) — read-only without =--fix=, acting with
+ it. The TTY recovery path when waybar/the GUI is down (see the Makefile
+ targets).
- =net portal= — run =captive='s portal-force flow (reset if needed, extract +
open the portal page).
-- =net reset [--hardware-mac]= — fresh-MAC reconnect (=captive='s =fresh_mac=).
-- =net speedtest [--json]= — librespeed run; down/up/ping.
+- =net speedtest [--json]= — =speedtest-go --json= run; down/up/ping.
+
+* nmcli contract
+
+The command wrapper is the reliability boundary; SSIDs and connection names
+contain spaces, colons, duplicates, hidden names, and non-ASCII. Rules:
+
+- *Terse, field-selected output*: =nmcli -t -f <fields> --escape yes ...= and
+ =nmcli -g <fields> ...= (get-values) for single-value reads. Parse with the
+ documented escaping (=\:= and =\\=); never naive =cut -d:=.
+- *UUID is the handle.* Every saved-profile op (=up=, =down=, =modify=, =delete=)
+ uses the connection UUID, never the display name — names duplicate and contain
+ separators. =net list= surfaces UUIDs; the panel maps row → UUID.
+- *Wait budgets*: activation/deactivation use =nmcli --wait <n>= with an explicit
+ budget (hot-path reads =--wait 2=; activation =--wait 30=). No unbounded waits.
+- *Connectivity*: NM's own =nmcli networking connectivity= can return
+ =none/portal/limited/full/unknown=. Use it as a *cheap hint* on the fast path
+ when present, but the authoritative captive verdict is still our own probe
+ (NM's portal detection is coarser and config-dependent).
+- *Parser tests* (fake nmcli fixtures): escaped colons and backslashes in SSIDs,
+ embedded newlines, duplicate connection names, hidden SSID (empty name),
+ non-ASCII SSID, the wired-appears-mid-session case, and the multi-active case
+ (wifi + tether both up).
+
+* JSON schemas
+
+Versioned (="v": 1=) envelopes so tests lock the contract. Sketches (fields
+nullable unless noted):
+
+- =status=: ={v, iface, type: wifi|ethernet|none, ssid, signal, ipv4,
+ gateway, throughput: {rx_bps, tx_bps}, connectivity: online|captive|no-internet|unknown,
+ connectivity_age_s, connectivity_class: fresh|stale|expired|unknown, state:
+ online|captive|no-internet|connecting|disconnected|airplane|wired|degraded}=.
+- =probe=: ={v, result: online|captive|no-internet|unknown, portal_url, http_code,
+ redirect_host, elapsed_ms, ts}=.
+- =list=: ={v, connections: [{uuid, name, type, active, last_used, signal,
+ in_range, security}]}=.
+- =diagnose=: ={v, steps: [<diagnostic step, see contract>], overall:
+ ok|warn|fail}=.
+- =speedtest=: ={v, down_mbps, up_mbps, ping_ms, server, elapsed_ms, ts}=.
+- error envelope (any command): ={v, error: {code, message, detail, partial:
+ bool}}= with a non-zero exit.
+
+* Diagnostics contract
+
+=net diagnose --json= returns an ordered list of steps. Each step is the unit the
+panel renders and the log records:
+
+- =id= — stable identifier (e.g. =link=, =dhcp=, =gateway-ping=, =dns-config=,
+ =dns-resolve=, =http-probe=, =portal=).
+- =status= — =pending | running | pass | warn | fail | skipped=.
+- =title= — short human label.
+- =evidence= — redacted detail (the value seen), per the redaction rules.
+- =elapsed_ms=.
+- =safety= — =read-only= or =mutating= (diagnose steps are all read-only).
+- =next_action= — what the user/agent should do on warn/fail (e.g. "open portal",
+ "reset connection", "switch network").
+
+Repair actions (=net repair=) carry the same shape but =safety: mutating=, plus a
+=cleanup_verified: bool= field (e.g. the DNS override was reverted) and a
+terminal =cleanup-unverified= status when revert can't be confirmed.
+
+** Diagnose vs Repair (read-only vs mutating)
+The panel separates them visually and behaviorally:
+- *Diagnose* — probe, gateway ping, DNS config read, captive check. No state
+ change, no sudo, runnable freely.
+- *Repair* — reset (fresh MAC, deletes+recreates the NM profile), DNS override
+ test (mutates resolver, auto-reverts), portal force. Each needs an explicit
+ confirm, shows that it's privacy/state-changing, and verifies cleanup. A
+ Repair whose cleanup can't be verified ends in a visible =cleanup-unverified=
+ state, never a silent success.
+
+* Failure states, messages, recovery
+
+Each row below gives the *exact, final* user-facing string (not a template) with
+=<placeholders>= for redacted evidence, plus the evidence field included and the
+next action. The string is canonical: every surface renders the same text, so
+there's one source of truth.
+
+Per-surface rendering of the canonical string:
+- *Indicator* — the matching glyph + CSS class; the string is the tooltip
+ (untruncated).
+- *Notification* (=notify=) — title = the failure label, body = the string.
+- *CLI* — the string on stderr; =--json= puts it in =error.message= with the
+ evidence in =error.detail= and a stable =error.code=.
+- *Panel* — the string as the section banner, with the diagnostic step's evidence
+ shown beneath.
+Evidence is always redacted per the redaction rules (SSID/host shown; PSK/EAP/
+portal tokens never).
+
+- *associated, no DHCP* — "Connected to <SSID>, no IP (DHCP failed)" →
+ evidence: SSID, iface → reset / reconnect.
+- *no-internet* — "On <SSID>, no internet (gateway reachable, no route out)" →
+ diagnose / switch network.
+- *captive* — "Captive portal at <host> — login required" → Open portal.
+- *DNS hijack* — "DNS is being redirected (portal)" → Open portal.
+- *DNS broken* — "DNS not resolving (hotel DNS down); 1.1.1.1 works" → use
+ override / report.
+- *HTTP intercepted* — "Traffic is being intercepted before it leaves" → Open
+ portal.
+- *sudo declined* — "Reset needs admin; it was declined — nothing changed" →
+ retry with auth.
+- *command timed out* — "<op> timed out; the system was left unchanged" → retry.
+- *partial mutation* — "<op> partially applied: <what>; rolled back to <state>"
+ → review.
+- *missing speedtest-go* — "speedtest-go not installed" → install hint.
+- *no wifi hardware* (desktop) — wifi rows hidden; ethernet-only view.
+- *wifi rfkill-blocked* — "WiFi is blocked (rfkill)" → unblock. The indicator
+ detects a soft-blocked radio (=rfkill list= shows the radio off though hardware
+ is present) and shows this distinct from disconnected. =net repair rfkill= (and
+ =net doctor --fix= as its first step) runs =rfkill unblock wifi= + =nmcli radio
+ wifi on= and reconnects. This is the framework-laptop case: an out-of-power
+ shutdown sometimes leaves wifi soft-blocked at next boot, and yes — the module
+ recovers it (the rfkill state is the indicator; the rfkill repair / doctor is
+ the one-step fix). A *hard* block (physical switch) is reported as
+ not-recoverable-in-software with that message.
+- *wifi rfkill hard-blocked* — "WiFi is blocked by the hardware switch" →
+ evidence: rfkill hard state → flip the physical switch.
+- *wrong password / missing secret* — "Saved password for <SSID> was rejected" →
+ evidence: SSID, NM auth-failure reason → re-enter the password.
+- *enterprise auth/cert failure* — "Enterprise login failed for <SSID> (802.1X)"
+ → evidence: SSID, EAP failure reason → edit the profile in nmtui/nmcli.
+- *upstream / AP / provider* — "On <SSID>, link is fine but the network has no
+ uplink" → evidence: gateway reachable, no route out, not a portal → switch
+ network or contact the venue.
+- *VPN-routed* — "Connected; internet is routed through a VPN (<dev>)" →
+ evidence: default route on a tun/wg device or non-NM DNS owner → check the VPN,
+ not WiFi.
+- *HTTP interception, no parseable portal URL* — "A portal is intercepting
+ traffic but didn't give a login link" → evidence: HTTP code, redirect host →
+ opens neverssl + the gateway page to log in manually.
+- *DNS override cleanup unverified* — "Couldn't confirm DNS was restored after the
+ test" → evidence: iface, attempted revert → revert DNS manually
+ (=resolvectl revert <iface>=).
+
+Each message names whether the system was left unchanged, partially changed (with
+what), or fully changed, so the user knows the residue.
+
+* Doctor: escalation, classification, terminal states
+
+=net doctor= diagnoses, classifies the failure, then (with =--fix=) applies the
+*lightest* repair that fits and re-checks — it never loops destructive repairs
+against a failure they can't fix. Each failure resolves to one of four outcomes,
+and the doctor stops at any terminal one:
+
+- =fixable= — a local repair should help. Escalate lightest-first: rfkill-unblock
+ → reset (fresh MAC) → bounce (full stack) → portal, re-probing after each, and
+ stop as soon as the probe returns online.
+- =needs-user-action= (terminal) — no reset/bounce will help; doctor stops and
+ names the exact next step. Covers: wrong WPA password / missing NM secret
+ (enter the password), locked keyring or polkit denial (retry with auth),
+ enterprise 802.1X cert/identity failure (edit the profile in =nmtui=/=nmcli=),
+ captive portal login-required (open the portal + accept terms). Doctor must not
+ delete/recreate the profile against these — that loses the saved password and
+ makes things worse.
+- =upstream-not-local= (terminal) — the local link is up but the problem is past
+ it: AP has no uplink, gateway down/dropping traffic, DHCP server broken, ISP
+ outage, portal backend failing. =diagnose= proves it (link up + IP + gateway
+ reachable, but no route out and not a captive redirect), and =doctor --fix=
+ stops after local repairs are exhausted with "local repairs tried; likely
+ upstream/AP/provider" + the evidence. Next action: switch networks or contact
+ the venue.
+- =deferred/vpn= (terminal for v1) — an active VPN / policy route / non-NM
+ resolver owns the default route or DNS, so "no internet" may be the VPN's fault,
+ not WiFi's. v1 *detects* this (default route on a =tun/wg= device, or DNS owned
+ by something other than the NM link) and classifies it separately — "link is
+ fine; internet is VPN-routed" — rather than misclassifying it as a WiFi failure.
+ v1 does not repair it (VPN management is Phase 5); it names the VPN as the likely
+ owner and stops.
+
+** DNS handling in doctor (explicit per class)
+- *Captive DNS hijack* — open the portal (the hijack clears on login). No DNS
+ mutation.
+- *Broken resolver, 1.1.1.1 works* — doctor offers an explicit *temporary* 1.1.1.1
+ override as a repair with cleanup verification (auto-revert, =cleanup_verified=);
+ without =--fix= it only recommends the command. It does not leave a permanent
+ resolver change.
+- *Port-53 / egress blocked* (even 1.1.1.1 fails) — terminal =upstream-not-local=;
+ doctor stops, since it's not locally fixable.
+
+* Failure-mode coverage
+
+For each common field failure: does =net diagnose= detect it, can =net doctor
+--fix= repair it, and what terminal user action remains when it can't. (The
+=needs-user-action= / =upstream-not-local= / =deferred/vpn= outcomes are defined
+above.)
+
+| Failure mode | diagnose detects | doctor --fix | terminal user action |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| rfkill soft block | yes | yes (unblock) | none |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| rfkill hard block | yes | no | flip the physical switch |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| no wifi hardware | yes | n/a | use ethernet |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| associated, no DHCP | yes | yes (reset/bounce) | none, else switch network |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| gateway unreachable | yes | yes (bounce) | switch network if it |
+| | | | persists |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| captive DNS hijack | yes | opens portal | log in at the portal |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| broken DNS, 1.1.1.1 works | yes | yes (temp override, | report the venue's DNS |
+| | | auto-reverted) | |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| HTTP captive portal | yes | opens portal | log in at the portal |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| HTTP interception, no | yes | opens neverssl + gateway | log in manually |
+| parseable URL | | | |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| upstream / AP outage | yes (link up, no route out) | no (stops after local) | switch network / contact |
+| | | | venue |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| wrong WPA password / | yes | no | enter the password |
+| missing secret | | | |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| enterprise auth / cert | yes | no | edit the profile in |
+| failure | | | nmtui/nmcli |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| duplicate SSID / | yes (UUID-keyed) | yes (activate by UUID) | none |
+| connection-name | | | |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| hidden SSID | yes | yes (connect by name) | enter SSID + password |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| multiple active links | yes | n/a | pick the interface |
+| (wifi+tether) | | | |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| wedged NetworkManager | yes | yes (bounce → restart NM) | none, else reboot |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| slow / hung command | yes (degraded) | retries within budget | retry |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| stale / corrupt cache | yes | self-heals (atomic + | none |
+| | | invalidation) | |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| DNS cleanup failure | yes | flags cleanup-unverified | revert DNS manually |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| missing speedtest backend | yes | n/a | install speedtest-go |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| VPN / policy-routing | yes (route/DNS ownership) | no (deferred to Phase 5) | check the VPN |
+| interference | | | |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+
+* Observability — logging + redaction
+
+- *Event log*: JSONL at =$XDG_STATE_HOME/net/events.jsonl= (fallback
+ =~/.local/state/net/events.jsonl=), size-rotated (e.g. 1 MB × 3). Every
+ mutating op and probe appends an event: =ts, op, argv (redacted), exit_code,
+ stderr_tail, elapsed_ms, iface, nm_uuid, probe_url_class, http_code,
+ redirect_host, cache_event=.
+- *Redaction (always on)*: PSKs, EAP identities/passwords, NM secrets, and
+ portal query tokens are never logged. MAC addresses, full IPs, and SSID are
+ redacted when configured (=redact_mac=, =redact_ip=, =redact_ssid= in config).
+- *Post-failure diagnosis*: =net doctor --json= is the snapshot + recommendation
+ (diagnose plus the suggested repair), =net diagnose --json= the raw report, and
+ the event log the history. =net doctor= is the console-recoverable entry point
+ (reachable as =make online= / =make net-doctor=).
+- *Secret-leak tests*: assert no PSK/EAP/portal-token ever appears in any JSON
+ output, log line, or error message.
* Indicator (task #C — Phase 1, the fast win)
** States (internet sub-state on top of link state)
- online — associated and the probe returned 204. Normal icon.
- captive — associated, probe hit a portal. Distinct glyph + warning CSS class;
- tooltip names the portal host; left-click opens the panel's diagnostics with
- the portal ready to open.
+ tooltip names the portal host; left-click opens diagnostics with the portal
+ ready to open (Phase 2+; see interactions for the Phase-1 interim).
- no-internet — associated, probe failed (no portal, no 204). Distinct glyph +
warning class.
+- degraded — =net status= couldn't read link state within budget (slow/failed
+ nmcli). Neutral glyph, =net-degraded= class. Never blocks the bar.
+- rfkill-blocked — the radio is soft-blocked (=rfkill=), distinct from
+ disconnected. Distinct glyph; the fix is =net repair rfkill= / =net doctor=.
- connecting / disconnected / airplane / wired — as today, plus wired shown
- correctly even when it appears after session start.
+ correctly even when it appears after session start. (airplane is now this
+ module's state, absorbed from the retired airplane module.)
** Glyphs
Nerd-font codepoints, final values verified live before merge (same discipline
as wtimer). Reuse the signal-strength ramp already in =waybar-netspeed=; add a
-captive/no-internet overlay glyph.
+captive / no-internet / degraded overlay glyph.
** Tooltip
SSID + signal + IPv4 + gateway + the throughput readout (absorbed from
-netspeed) + the last probe result and age.
+netspeed) + the last probe result and its age (stale/expired hinted).
-** Interactions (no keyboard-modifier clicks — waybar can't qualify clicks by
-modifier; the panel hosts the rich actions)
-- left-click — open the panel.
-- right-click — quick reconnect / bounce (=net reset=, no panel).
-- middle-click — run =net portal= (force the captive page).
-- scroll — cycle nothing in v1 (reserved; could cycle saved connections later).
+** Interactions (phase-aware; no keyboard-modifier clicks — waybar can't qualify
+clicks by modifier, so the rich actions live in the panel, not ctrl/super-click)
+Clicks never block the bar: each dispatches a detached background job and reports
+via =notify=, single-flight per action.
+- *Phase 1 (no panel yet)*: left-click runs =net probe= + notify (refreshes the
+ state on demand) and keeps the existing =pypr toggle network= scratchpad as the
+ interim manager; right-click runs =net repair reset= in the background +
+ notify; middle-click runs =net portal=.
+- *Phase 2+ (panel exists)*: left-click opens the panel (focused on the relevant
+ section — diagnostics when captive); right/middle keep the background
+ reset/portal shortcuts.
* Panel (tasks #B + #C diagnostics — Phases 2-3)
-GTK4 + gtk4-layer-shell, pocketbook scaffold (src-layout package, pytest,
+GTK4 + gtk4-layer-shell, pocketbook scaffold (src-layout package, unittest,
Makefile, gtk4-layer-shell anchored dropdown under the bar). One panel shell,
reused by the future desktop-settings panel.
Sections:
1. *Connections* — list, MRU-first, active marked, live signal bars for in-range
wifi; row click switches; buttons for add / edit / remove; a rescan control.
-2. *Diagnostics* — buttons: Probe (204/captive, shows portal URL + Open),
- Bounce/Reset, Gateway ping, DNS override test. Streaming output area.
-3. *Speed test* — Run button, progress, down/up/ping result + last-run line.
+2. *Diagnose* (read-only) — Probe (204/captive, shows portal URL + Open), Gateway
+ ping, DNS config. Streaming step output (the diagnostics contract).
+3. *Repair* (mutating, confirmed) — tiered lightest-first: Unblock rfkill, Reset
+ (fresh MAC), Bounce (full stack), DNS override test, Force portal. A "Get me
+ online" button runs =net doctor --fix= (the auto-escalating sequence).
+4. *Speed test* — Run button, progress, down/up/ping result + last-run line.
+
+** Panel state, cancellation, permissions
+State machines for: connection-list loading, rescan-in-progress,
+activation-in-progress, diagnose-running, repair-running, speedtest-running. Plus
+the real terminal states on this two-machine fleet: no-wifi-hardware (desktop →
+ethernet-only view) and missing speedtest-go. (No GPG-key state — there's no
+credential store; secrets live in NM.) ("No NetworkManager" is not a modeled
+state — NM is always present
+on these machines; if nmcli is somehow absent the panel shows a single hard-error
+and exits.) Long operations show elapsed time and are cancellable where the
+underlying op allows (rescan, speedtest, probe); clearly non-cancellable ones
+(an in-flight activation) show elapsed + a disabled control. Permission-denied
+(sudo/polkit declined) is a first-class outcome with the "nothing changed"
+message, never a silent failure.
Interaction-pattern catalog (=~/code/rulesets/patterns/=) principles that apply:
- transient-state-buttons — all the network levers in one place, reachable by
one chord (the bar click), state visible.
- default-most-common-friction-proportional — connections MRU-ordered so the
- common pick is first; destructive ops (remove) get a confirm, switching does
- not.
+ common pick is first; destructive ops (remove) and privacy-changing ones
+ (reset, override) get a confirm, switching does not.
- one-prompt-picker-typed-prefix — if the connection picker ever goes
keyboard-driven, kind (wifi/eth/saved/in-range) + name in one typed picker.
+** Panel UX flow (settle before Phase 2)
+The concrete interaction defaults, so the GTK build isn't inventing them:
+- *Default focus*: the Connections section, current connection's row selected. If
+ the indicator opened the panel because of a captive/no-internet state, focus
+ Diagnose instead with the relevant action highlighted.
+- *Row content*: glyph (signal bars / wired / active check) + name + a secondary
+ line (security type, "active"/last-used). The active row is visually pinned at
+ top of its group.
+- *Buttons*: one *primary* per section (Connections: Connect to the selected row;
+ Diagnose: Run diagnose; Repair: "Get me online"; Speed test: Run). Secondary
+ actions (add / edit / remove / rescan; individual repair tiers) are smaller and
+ grouped.
+- *Disabled rules*: Connect disabled on the already-active row; Repair tiers
+ disabled while one runs; Speed test disabled while running; add/edit disabled
+ for enterprise (with the "edit in nmtui/nmcli" hint).
+- *Confirmations* (exact wording): Reset → "Reset <SSID>? This drops the
+ connection and reconnects with a new MAC."; Bounce → "Restart networking? All
+ links drop briefly."; DNS override → "Temporarily set DNS to 1.1.1.1 for the
+ test? It reverts automatically."; Remove → "Forget <SSID>? The saved password is
+ deleted."
+- *"Get me online" reporting*: shows each escalation step live (Unblock rfkill →
+ Reset → Bounce → Portal) with per-step pass/fail and stops at the first that
+ restores internet or at a terminal state, naming the next action.
+- *After close*: the bar reflects the new state immediately (signal/refresh on
+ next poll); a running speedtest/diagnose keeps running and notifies on finish
+ (panel close doesn't cancel it).
+- *Keyboard*: Esc closes; Tab moves between sections; arrows move rows; Enter
+ fires the section primary; the connection list is type-to-filter.
+
* Connection management (nmcli)
-- Every op via nmcli: =device status=, =connection show=, =con up/down=,
- =con add/modify/delete=, =dev wifi rescan/list=.
+- Every op via nmcli per the nmcli contract above (terse, escaped, UUID-keyed,
+ bounded =--wait=).
- MRU ordering from NM's =connection.timestamp= (last activated), descending.
- Ethernet appears in the list whenever a wired device is present, selectable at
any time; switching just brings the chosen connection up.
-- TDD with a fake =nmcli= on PATH returning canned output, asserting the exact
- nmcli command sequence (behavior, not implementation) — the established
- pattern in =tests/waybar-netspeed= and =tests/layout-navigate=.
-
-* Credential storage (GPG)
-
-- Store: =~/.config/net/connections.<ext>= — connection definitions + secrets
- (PSK/EAP), one record per connection, with =last_used=.
-- *Default unencrypted* (=connections.json=). Encryption is opt-in: when enabled,
- =connections.json.gpg= encrypted to Craig's private key (=c@cjennings.net=).
-- Passphrase cadence via gpg-agent: once per session (long cache TTL or
- decrypt-and-hold), once per hour (=default-cache-ttl=), or never (plaintext).
- Configured in =~/.config/net/config=.
-- *Supplements NM, does not replace it.* NM's own store
- (=/etc/NetworkManager/system-connections=, root-only) stays the source of
- truth that actually connects. The GPG store is a portable, user-owned export +
- secret vault. =net add/edit/remove= writes both (nmcli + store). =net import=
- rebuilds NM connections from the store on a fresh machine. They sync on every
- mutating op; NM wins on conflict for the live connection.
-
-* Diagnostics + speed test
-
-- Diagnostics reuse =captive= verbatim for the interactive flow (=net portal=,
- =net reset=, the 1.1.1.1 DNS override test) and mirror its cheap probe natively
- for =net status= / =net probe=. No logic is duplicated by hand beyond the small
- portal-URL parser, which is already unit-tested in =tests/captive=.
-- Speed test: *librespeed-cli* (no account, self-hostable, AUR), chosen over
- Ookla speedtest-cli. =net speedtest --json= parses its JSON; the panel shows a
- progress indicator and the down/up/ping result.
+- *Mutation safety + rollback*: switching keeps the current connection up until
+ the new one activates successfully (=nmcli --wait 30=); on failure it does not
+ tear down the working link, surfaces the failure, and leaves the prior
+ connection active. =net down= notes that NM may auto-reactivate a profile and
+ reports the post-op active connection so the user isn't surprised. A switch that
+ needs a password it doesn't have prompts (or fails with "password required"),
+ never silently strands. The exact NM command sequence (preflight active-state
+ read → activate target → verify default route → on failure, confirm prior
+ still up) is pinned in the engine and tested against fake nmcli.
+- *Add/edit scope*: open + WPA-PSK only in v1. Existing saved profiles of any
+ type (including enterprise) can be *activated*; editing an enterprise profile
+ shows "edit via nmtui/nmcli" rather than a broken partial form.
+
+* Connection secrets (no separate store)
+
+Per Craig's call: don't build a parallel credential store. Settings and secrets
+live where NetworkManager already keeps them, so there's one source of truth and
+no extra dependency (no GPG, no gpg-agent, no =~/.config/net/connections=).
+
+- *Where secrets live*: =/etc/NetworkManager/system-connections/<name>.nmconnection=,
+ root-owned =0600=, with the PSK/EAP secret stored inline (the default
+ =secret-flags=0= "owned by NM"). That's already secure-at-rest (root-only) and
+ is what =nmcli= reads/writes.
+- *How we touch them*: every add/edit/remove goes through =nmcli= (=connection add
+ / modify / delete=), which writes the =.nmconnection= with the right ownership
+ and perms. We never read or write =system-connections= files directly (root) and
+ never copy a secret out of them.
+- *No export / import / sync* — there's nothing to sync. A new machine gets its
+ connections the way it always has (the user joins, or restores NM profiles),
+ not from a tool-specific vault.
+- *config file*: =~/.config/net/config= still exists, but only for non-secret
+ preferences (speedtest server, redaction flags, probe TTL). It holds no
+ credentials.
+- *No secret leakage*: PSK/EAP never appear in =net=' =--json= output, the event
+ log, or error text (tested) — even though NM is the store, our surfaces must not
+ echo a secret =nmcli= happens to return.
+
+* Speed test
+
+- Backend: *=speedtest-go=* (=--json=, =--server=, =--no-download/--no-upload=),
+ already installed on velox (AUR =speedtest-go-bin=). No new dependency for v1.
+ librespeed-cli is the documented fallback for a self-hosted LibreSpeed server.
+- =net speedtest --json= parses speedtest-go's JSON into the =speedtest= schema.
+- *Server policy*: auto-select nearest by default; allow a pinned server id in
+ =~/.config/net/config=.
+- *Timeout + cancellation*: a hard run timeout (e.g. 60s); the panel run is
+ cancellable (kills the child). Offline / rate-limited / no-server errors map to
+ the failure-message table.
+- *Tests*: fixture JSON (success) and fixture stderr (offline, no server,
+ malformed output) drive =net speedtest= parsing without touching the network.
+
+* Help + documentation
+
+In-app help has three layers, each reachable in the situation it's needed:
+
+- *CLI help (works from a dead-GUI TTY)*: =net --help= lists the subcommands in
+ one screen; =net <cmd> --help= documents each (flags, what it mutates, the
+ console-recovery targets). The Makefile targets are self-describing (=make help=
+ lists =online= / =net-doctor= / etc. with one-line descriptions). This is the
+ layer that matters most when you're at a console with no network.
+- *Panel help (in the GUI)*: a small =?= affordance in the panel header opens an
+ inline help pane — what each section does, which Repair actions mutate state,
+ what the indicator glyphs/colors mean. Per-control tooltips on the less-obvious
+ buttons (rfkill, bounce, DNS override). No external help browser.
+- *User guide (the durable doc)*: a README / docs page covering every command,
+ the indicator states + glyphs, the panel sections, the config file keys, the
+ recovery make targets, troubleshooting (the failure-message table), and
+ rollback. Written so a future session — or Craig six months out — can operate
+ and recover the module from the doc alone.
+
+The failure-message table above is the single source of truth for the
+troubleshooting text; the guide and the panel help both render from it rather
+than restating it.
+
+* Enhancement radar
+
+Low-cost adjacent affordances, each dispositioned so cheap wins aren't lost and
+the v1 panel stays focused. (Several are already in v1 by virtue of other
+sections; marked here so the consideration is visible.)
+
+| Enhancement | Disposition | Reason |
+|-------------------------------------+-------------+--------------------------------------------------------|
+| Open / copy portal URL | v1 | already in the captive flow; trivial Open + Copy |
+|-------------------------------------+-------------+--------------------------------------------------------|
+| Forget network | v1 | it's the remove op, already specced |
+|-------------------------------------+-------------+--------------------------------------------------------|
+| Rescan now | v1 | already a Connections control |
+|-------------------------------------+-------------+--------------------------------------------------------|
+| Retry with hardware MAC | v1 | captive already has --hardware-mac; expose in Repair |
+|-------------------------------------+-------------+--------------------------------------------------------|
+| Pin speedtest server | v1 | already a config key |
+|-------------------------------------+-------------+--------------------------------------------------------|
+| Copy redacted doctor report | v1 | cheap, serves the observability/support goal |
+|-------------------------------------+-------------+--------------------------------------------------------|
+| Show last good network / result | vNext | needs small history persistence |
+|-------------------------------------+-------------+--------------------------------------------------------|
+| Watch mode for net doctor | vNext | a --watch loop; handy at a TTY, not v1-critical |
+|-------------------------------------+-------------+--------------------------------------------------------|
+| Actionable desktop notifications | vNext | dunst supports actions; extra wiring |
+|-------------------------------------+-------------+--------------------------------------------------------|
+| Keyboard connection picker (fuzzel) | vNext | the typed-prefix pattern; panel covers v1 |
+|-------------------------------------+-------------+--------------------------------------------------------|
+| QR-code share / import WiFi | rejected | low value for a personal 2-machine setup; phones do QR |
+|-------------------------------------+-------------+--------------------------------------------------------|
* Waybar wiring
- Replace =custom/netspeed= with =custom/net= in the bar's module list (same
slot).
- Module def: =exec: waybar-net=, =return-type: json=, =interval: 2=, a =signal=
- for on-demand refresh (next free signal after wtimer's 14), =on-click: <open
- panel>=, =on-click-right: net reset=, =on-click-middle: net portal=.
-- Remove the old =on-click: pypr toggle network= scratchpad once the panel
- replaces it.
+ for on-demand refresh (next free signal after wtimer's 14), =on-click=,
+ =on-click-right=, =on-click-middle= per the phase-aware interactions (each
+ dispatches a detached job, never blocks).
+- Remove the old =on-click: pypr toggle network= scratchpad only once the panel
+ replaces it (Phase 2); Phase 1 keeps it as the interim manager.
* Testing plan (TDD)
-- *Engine* — fake =nmcli= + fake =curl= on PATH; assert command sequences and
- parsed/emitted JSON for status, list, up/down, add/edit/remove, probe,
- diagnose, speedtest. Pure state/format functions tested directly.
+- *Engine (normal)* — fake =nmcli= + =curl= + =speedtest-go= on PATH; assert
+ command sequences and parsed/emitted JSON for status, list, up/down,
+ add/edit/remove, probe, diagnose, repair, speedtest. Pure state/format
+ functions tested directly. JSON schemas locked by example.
- *Portal parser* — already covered in =tests/captive= (Normal/Boundary/Error +
the real SONIFI body). The engine's native probe reuses the same cases.
+- *nmcli parsing* — escaped colon/backslash/newline in SSID, duplicate names,
+ hidden SSID, non-ASCII, wired-mid-session, multi-active (wifi+tether).
+- *Failure + concurrency (the risky classes)* — slow/hung nmcli/curl/speedtest
+ (degraded state within budget), concurrent =net status= probe refresh
+ (single-flight), corrupt cache (recovered), stale cache after SSID change
+ (invalidated), permission denied / sudo declined, DNS-override cleanup failure
+ (=cleanup-unverified=), NM partial activation (rollback keeps prior link),
+ secret redaction, missing speedtest-go, no wifi hardware, rfkill soft/hard
+ block.
+- *Doctor classification* — fixture-driven =net doctor= over fake nmcli/curl
+ asserting the right terminal classification + that =--fix= stops before
+ destructive repairs: auth failures (=needs-user-action=), upstream/AP failure
+ (=upstream-not-local=), VPN-routed failure (=deferred/vpn=), and the DNS classes
+ (hijack → portal, broken-but-1.1.1.1-works → offered override, egress-blocked →
+ upstream). Assert the failure-mode coverage table's "detects / repairs / terminal
+ action" holds for each row.
- *Indicator* — drive =net status --json= through =waybar-net=, assert the JSON
- the bar renders for each state (online / captive / no-internet / wired /
- disconnected), interface override via env like =WAYBAR_NETSPEED_IFACE=.
-- *Panel* — pocketbook-style: test the backing logic (list ordering, op
- dispatch, store read/write, gpg round-trip with a test key), not the GTK
- widgets.
-- *GPG store* — round-trip encrypt/decrypt with a throwaway test key; sync
- on add/edit/remove; import rebuilds NM ops (asserted against fake nmcli).
+ per state (online / captive / no-internet / degraded / wired / disconnected /
+ rfkill), iface override via env.
+- *Panel* — pocketbook-style: backing logic (list ordering, op dispatch,
+ state-machine transitions), not GTK widgets.
+- *NM secrets / no-leak* — add/edit writes the secret into NM via nmcli (asserted
+ against fake nmcli, never to a tool-owned file); assert no PSK/EAP appears in any
+ =--json=, log line, or error (there is no credential store to round-trip).
+- *Live checklist (gated out of the suite)* — a "Manual testing and validation"
+ task per phase for the real-network states (captive at a hotel, no-internet,
+ switch under load, reset, speedtest) that can't be faked.
+
+** Harness + coverage gate
+The concrete contract, matching the repo's existing convention (not pytest — the
+dotfiles suites are =unittest=, run by =make test= as =python3 -m unittest= over
+=tests/*/test_*.py=; 33 suites today):
+- *Framework*: =unittest=. Each suite is =tests/<name>/test_<name>.py=
+ (=tests/net/=, =tests/waybar-net/=), collected by the existing =make test= loop
+ — no new runner, no pytest dependency.
+- *Fakes on a temp PATH*: =fake-nmcli=, =fake-curl=, =fake-speedtest-go=,
+ =fake-rfkill=, =fake-resolvectl= live as executable stubs in =tests/<name>/=
+ (the =tests/layout-navigate/fake-hyprctl= pattern). A fixture file encodes the
+ command→canned-output map and the stub appends each invocation to a log the test
+ asserts against. Subprocess timeouts are simulated by a stub that sleeps past the
+ budget; =net status= must still return the degraded state.
+- *Waybar wrappers end-to-end*: =waybar-net= is run as a subprocess with the fake
+ PATH and the env overrides (iface, cache path), asserting the emitted JSON — same
+ as =tests/waybar-netspeed=.
+- *Coverage*: coverage.py is absent system-wide (and not importable), so coverage
+ runs in a throwaway venv (=python3 -m venv=, =pip install coverage=, =coverage
+ run -m unittest=, =coverage report=) — the method the wtimer suite used (95%).
+ Target: *branch* coverage over =net/= and the wrapper, ≥ 90% on the pure
+ classifier/parser modules.
-* Files touched (planned)
+** Coverage as a gap-finder, not a number (per phase)
+Line coverage alone misses the branches that matter here, so each phase ends with
+a *coverage-gap pass*, not just a percentage:
+- After the first green run, read the branch report and map every uncovered branch
+ to either a new test or a consciously-excluded live-only behavior (with a comment
+ or a Manual-testing entry naming it).
+- *Branch coverage is required* for the pure logic: the doctor classifier (every
+ outcome — fixable / needs-user-action / upstream-not-local / deferred-vpn), the
+ cleanup-unverified path, the redaction paths, the degraded hot-path fallback, the
+ timeout branches, and the portal/nmcli parsers.
+- A phase isn't "done" until its coverage-gap pass is recorded — uncovered logic is
+ either tested or explicitly excused, never silently uncovered.
+* Files touched (planned, all in =~/.dotfiles=)
+
+- =net/= package (src-layout, like pocketbook) — engine + panel.
- =hyprland/.local/bin/waybar-net= — the indicator (replaces =waybar-netspeed=).
-- =hyprland/.local/bin/net= — engine CLI entry (or a package console-script).
-- =net/= package (src-layout, like pocketbook) — engine + panel, in the dotfiles
- repo (or in-tree as pocketbook is).
-- =hyprland/.config/waybar/config= — swap =custom/netspeed= → =custom/net=.
-- =hyprland/.config/waybar/style.css= — captive / no-internet state classes.
+- =hyprland/.local/bin/net= — engine CLI entry (console-script shim).
+- =hyprland/.config/waybar/config= — swap =custom/netspeed= → =custom/net=;
+ remove =custom/airplane=.
+- =hyprland/.config/waybar/style.css= — captive / no-internet / degraded /
+ rfkill classes; remove airplane classes.
- =tests/net/=, =tests/waybar-net/= — suites.
-- =~/.config/net/= — config + connection store (machine-local; not stowed
- content beyond a seed config).
-- =captive= — minor refactor so the engine can reuse its probe cleanly.
+- =captive= — refactor: extract probe + reset into functions callable
+ non-interactively (a =--json= probe mode) so the engine reuses them.
+- =~/.config/net/config= — seed config (probe TTL, speedtest server, redaction
+ flags). No secrets; not a credential store.
+- dotfiles =Makefile= — add the console-recovery targets (=online=, =net-doctor=,
+ =net-status=, =net-diagnose=, =net-portal=, =net-reset=, =net-bounce=).
+- *Deletions once net ships* (the airplane module is absorbed):
+ =hyprland/.local/bin/waybar-airplane=, =hyprland/.local/bin/airplane-mode=,
+ =tests/waybar-airplane/=, =tests/airplane-mode/=, and the =custom/airplane=
+ module + its css.
+- archsetup Hyprland step — add =gtk4-layer-shell=, =python-gobject=,
+ =speedtest-go-bin= to the install lists (the only archsetup change; no =gpg=
+ added, secrets stay in NM's store).
-* Resolved decisions (this session, Craig's calls)
+* Resolved decisions (Craig's calls + this response)
1. Panel UI tech → GTK4 + gtk4-layer-shell, shared pocketbook scaffold (one
panel shell, reused by the desktop-settings sibling).
2. Engine language → Python =net= package; shells out to =captive= for the
portal-force flow, native cheap probe for the bar path.
3. Connectivity probe → split cadence (fast link poll every 2s + slow cached
- internet/captive probe, TTL ~45s).
+ internet/captive probe, TTL ~45s) with single-flight + atomic cache.
4. No keyboard-modifier clicks (waybar can't qualify them) — the panel hosts the
- rich actions; bar clicks are left=panel, right=reset, middle=portal.
-5. GPG store supplements NM; NM stays the source of truth.
+ rich actions; bar clicks dispatch detached jobs (phase-aware).
+5. No separate credential store (Craig's call, cj). Secrets live in NM's own
+ =system-connections= (root =0600=, inline), touched via nmcli. No GPG, no
+ gpg-agent, no =~/.config/net/connections=. Supersedes the earlier GPG-store
+ design.
6. =custom/netspeed= absorbed into =custom/net=; throughput moves to the tooltip.
-7. Speed-test backend → librespeed-cli.
+7. Speed-test backend → =speedtest-go= (already installed), not a new
+ librespeed-cli dependency; librespeed-cli is the self-hosted fallback.
+8. Code lives in the dotfiles repo; archsetup only installs deps.
+9. v1 add/edit scope = open + WPA-PSK; enterprise/802.1X is activate-only,
+ add/edit is vNext (settled by Craig 2026-06-29 — no enterprise networks in his
+ history, so the form would be unused UI).
+10. =net doctor= is in v1 (Craig's call, cj) — a one-shot diagnose+fix mode,
+ reachable from a TTY via =make online= / =make net-doctor=. (The earlier
+ "defer the doctor/bundle command" decision is reversed.)
+11. Diagnose (read-only) and Repair (mutating, confirmed) are separated in the
+ panel and the CLI; Repair is tiered lightest-first (rfkill → reset → bounce).
+12. =custom/net= absorbs the airplane module (Craig's call, cj); the standalone
+ airplane module + scripts + tests are deleted once net ships.
+13. Repair includes a full-stack bounce and an rfkill-unblock (Craig's calls,
+ cj) — the latter recovers the framework-laptop post-power-loss soft-block.
+14. VPN / WireGuard is a planned Phase 5 (Craig's call, cj), not a permanent
+ exclusion.
-* Phasing
+* Implementation phases
-- *Phase 1 — Indicator (task #C).* =net status= + =net probe= (native cheap
- probe, reusing captive's logic) + =waybar-net= + the split-cadence cache + CSS
- states. Ships the no-internet/captive state on the bar. Smallest, highest
- value, fully testable without the GTK panel.
+- *Phase 1 — Indicator + console recovery (task #C).* =net status= + =net probe=
+ (native cheap probe, reusing captive's logic) + the =captive= probe refactor +
+ =waybar-net= + the split-cadence cache (single-flight, atomic, stale classes) +
+ CSS states (incl. rfkill) + performance budget. Plus the CLI-only recovery path:
+ =net repair= tiers (rfkill / reset / bounce), =net doctor [--fix]=, and the
+ Makefile targets (=make online= etc.) — all testable without the GTK panel.
+ Absorbs the airplane state and removes the standalone airplane module. Interim
+ left-click keeps the existing scratchpad until the panel lands.
+ - *Acceptance*: fresh-login waybar smoke test shows correct state on
+ online/captive/no-internet/wired/rfkill; =net status= stays within budget
+ under a fake slow nmcli (degraded state); =net doctor --fix= recovers a
+ soft-blocked radio from a TTY; the live captive checklist passes at a real
+ portal; the airplane state works and the old airplane module is gone;
+ reverting = swap =custom/netspeed= + =custom/airplane= back.
- *Phase 2 — Panel shell + connection management (task #B core).* GTK4
- layer-shell scaffold + =net list/up/down/add/edit/remove/rescan= + MRU list.
+ layer-shell scaffold + =net list/up/down/add/edit/remove/rescan= + MRU list +
+ mutation safety/rollback + panel state machines.
+ - *Acceptance*: switch wifi↔wifi and ethernet↔wifi without stranding; a failed
+ switch leaves the prior link up; add/edit open + WPA-PSK writes the secret to
+ NM; remove confirms; panel states render for loading/rescan/activation.
- *Phase 3 — Diagnostics + speed test in the panel.* Wire =net diagnose= /
- =net portal= / =net reset= / =net speedtest= into the panel; portal Open
- button.
-- *Phase 4 — GPG credential store.* Opt-in encryption, cadence config, NM sync,
- import.
+ =net repair= / =net doctor= / =net portal= / =net speedtest= into the Diagnose
+ vs Repair sections; the "Get me online" button; portal Open button; speedtest
+ progress + cancel.
+ - *Acceptance*: diagnose runs read-only; each repair tier confirms + verifies
+ cleanup (DNS override reverts, shown); speedtest result parses from
+ speedtest-go and a fixture-driven failure shows the right message.
+- *Phase 4 — Docs + rollout.* In-app help (=net --help= / per-command help, the
+ panel help affordance), README/user-guide (commands, panel, config,
+ troubleshooting, the make targets, rollback), and the manual dep step on ratio.
+ - *Acceptance*: =net --help= and each subcommand's help are complete; the
+ user-guide covers every command + the recovery targets; ratio rollout
+ documented.
+- *Phase 5 — VPN / WireGuard (future).* Fold the existing archsetup wireguard
+ tooling into the same panel + CLI (=net vpn ...=). Out of the v1 milestone;
+ specced separately when picked up.
* Open items / risks
- gtk4-layer-shell dropdown anchoring under a waybar module needs the same
- positioning work pocketbook solved; reuse it.
-- librespeed-cli availability + a default server choice (public list vs a pinned
- server) to confirm before Phase 3.
+ positioning work pocketbook solved; reuse it. (Phase 2.)
+- The =captive= refactor must keep the standalone CLI behavior identical while
+ exposing a non-interactive =--json= probe; covered by the existing
+ =tests/captive= suite plus new probe-mode tests. (Phase 1.)
+- speedtest-go server selection variance (nearest-server flor) — pin a server in
+ config if results are noisy. (Phase 3.)
- The background-probe kick from =net status= must be truly non-blocking (spawn +
- detach) so a slow/hanging probe never stalls the bar.
-- NM↔GPG-store conflict policy on edit needs a concrete rule (NM wins for the
- live connection) — confirm during Phase 4.
+ detach); enforced by the single-flight lock and the performance benchmark test.
* Rollback
Each phase is independent. The indicator (Phase 1) is a drop-in replacement for
-=custom/netspeed=; reverting is swapping the module back in the config. The panel
-and store are additive — not wiring their clicks / not enabling encryption leaves
-the bar working as before.
+=custom/netspeed= (and =custom/airplane=); reverting is swapping those modules
+back in the config and restoring their scripts. The panel is additive — not
+wiring its clicks leaves the bar working as before. No credential store to roll
+back (secrets stay in NM throughout).
+
+* Review findings [31/31]
+
+** DONE Define the structured diagnostics contract :blocking:
+The spec says the engine "emits JSON" and that diagnostics "reuse =captive=
+verbatim", but the current =~/.dotfiles/common/.local/bin/captive= flow is a
+human-readable bash script that mixes diagnostics, sudo prompts, DNS mutation,
+browser launch, and terminal prose. A GTK panel cannot reliably turn that into
+clear state, progress, cancellation, or useful error messages. Define the
+machine contract before implementation: every diagnostic step should have a
+stable id, status (=pending/running/pass/warn/fail/skipped=), redacted evidence,
+elapsed time, safety outcome, and next action. Keep =captive= as the interactive
+CLI, but either refactor reusable probe/reset functions behind =net diagnose
+--json= or make =captive= expose a non-interactive JSON mode. This blocks the
+panel and logging work because otherwise the implementer must invent the
+boundary.
+
+Disposition: accept — added the "Diagnostics contract" section (per-step id /
+status / evidence / elapsed / safety / next_action) and the =captive= =--json=
+probe-mode refactor under Architecture + Files touched.
+
+** DONE Specify user-facing failure messages and recovery actions :blocking:
+The spec names failure states like =no-internet=, =captive=, failed probe,
+failed reset, missing DNS, and missing speed-test backend, but it does not define
+the messages the user sees or what each message tells them to do next. For this
+feature, "error" is not enough: a user needs to know whether WiFi is associated,
+whether DHCP succeeded, whether DNS is hijacked/broken, whether HTTP is
+intercepted, whether sudo was declined, whether a command timed out, and whether
+the system was left unchanged or partially changed. Add a message table for the
+indicator, panel, and CLI with: failure class, visible text, evidence included,
+redaction rule, and next action. This is blocking because UX quality here is the
+product, not an implementation detail.
+
+Disposition: accept — added the "Failure states, messages, recovery" section
+covering each class, the visible message, the "what changed" residue note, and
+the next action across indicator/panel/CLI.
+
+** DONE Define the debug log and redacted support bundle :blocking:
+There is no observability section. When this fails in a hotel or cafe, an agent
+needs enough evidence to diagnose it without rerunning destructive actions. Add
+log location, rotation/retention, JSONL event schema, command argv logging,
+exit-code/stderr capture, elapsed time, selected iface, NM active connection
+UUID, probe URL class, HTTP code, redirect host, DNS servers, and cache
+read/write events. Also define a =net doctor --json= or =net debug-bundle=
+command that emits redacted status, recent log events, dependency versions, and
+a reproduction command. Redact SSID if configured, MAC addresses, portal query
+tokens, PSKs, EAP identities/passwords, IPs when requested, and all GPG/NM
+secrets. This blocks implementation readiness because post-failure diagnosis is
+currently left to ad hoc terminal spelunking.
+
+Disposition: modify — accepted the JSONL event log, the schema, and the redaction
+rules in full (new "Observability" section). Deferred the dedicated =net
+debug-bundle= / =net doctor= command to vNext: for a single-user tool =net
+diagnose --json= (the snapshot) plus the event log (the history) cover
+post-failure diagnosis; a bundle command is gold-plating for v1. Recorded under
+Out + Resolved decision 10.
+
+** DONE Pin the nmcli parsing and timeout contract :blocking:
+The spec lists nmcli operations but not the exact fields, output modes, escaping
+rules, ID semantics, or timeouts. This is risky because SSIDs and connection
+names can contain spaces, colons, duplicates, hidden names, and non-ASCII; the
+current =waybar-netspeed= already had an SSID parsing bug. The nmcli manual
+documents =--terse=, =--get-values=, =--escape=, =--wait=, ID/UUID/path
+selection, =passwd-file=, and built-in connectivity states
+(=none/portal/limited/full/unknown=) at
+https://man.archlinux.org/man/nmcli.1.en. The spec should require UUIDs for
+saved-profile operations, explicit =--wait= budgets, parser tests for escaped
+colons/backslashes/newlines/duplicate names/hidden SSIDs, and a decision on when
+to use or ignore =nmcli networking connectivity [check]=. This is blocking
+because the command wrapper is the core reliability boundary.
+
+Disposition: accept — added the "nmcli contract" section: terse + =--escape= +
+=--get-values=, UUID-keyed ops, explicit =--wait= budgets, NM connectivity as a
+cheap hint (our probe authoritative), and the parser test matrix.
+
+** DONE Define cache concurrency, atomicity, and stale-state behavior :blocking:
+=net status= may spawn =net probe= whenever the cache is stale, but the spec
+does not define locking, process coalescing, atomic writes, crash cleanup, or
+what happens when the probe hangs. With a 2s Waybar interval, a bad network could
+start overlapping probes, corrupt the runtime cache, or keep showing stale
+"online" while the link is gone. Add a single-flight lock under
+=$XDG_RUNTIME_DIR/waybar=, atomic write+rename for cache updates, max probe
+runtime, stale age classes (fresh/stale/expired/unknown), cache invalidation on
+iface/SSID/connection UUID change, and tests for concurrent =net status= calls.
+This blocks the fast-path design because it is the main performance and
+correctness risk.
+
+Disposition: accept — added "Concurrency, atomicity, staleness" under the
+Connectivity model: flock single-flight, temp+rename atomic write, ≤6s probe
+timeout, fresh/stale/expired/unknown classes, iface/SSID/UUID invalidation, stale
+lock reclaim, plus concurrency tests in the test plan.
+
+** DONE Bound hot-path performance with measured budgets :blocking:
+The spec says the cheap poll should be sub-100ms, but the proposed fast path
+still may call multiple =nmcli= commands every two seconds, read sysfs, parse
+throughput, and maybe spawn a background probe. The existing =waybar-netspeed=
+had a deliberate sleep for throughput sampling; replacing it must define how
+throughput is sampled without sleeping in the bar path. Add a per-command budget
+for =waybar-net= and =net status=, a maximum number of subprocesses on the hot
+path, a timeout for every subprocess, benchmark tests with fake slow =nmcli=,
+and a rule that the indicator emits a degraded JSON state rather than blocking.
+This is blocking because Waybar custom modules can visibly freeze or lag when
+their exec path stalls.
+
+Disposition: accept — added the "Performance budgets" section: <100ms typical /
+<250ms worst, throughput sampled across the poll interval (no in-process sleep),
+one nmcli call max on the hot path, timeouts on every subprocess, the degraded
+state, and a fake-slow-nmcli benchmark test.
+
+** DONE Make click actions non-blocking and visible :blocking:
+Waybar right-click runs =net reset= and middle-click runs =net portal= directly.
+Those operations can require sudo, open browsers, mutate DNS, delete/recreate NM
+profiles, or hang on network commands, but Waybar click handlers provide no
+panel, terminal, progress, or cancellation surface by default. Define whether
+right/middle click instead opens the panel focused on the action, dispatches a
+background job with notifications, or is removed from v1. If kept, specify
+single-flight behavior, how sudo/polkit prompts surface, how success/failure is
+reported, and how the user can inspect logs. This blocks UX readiness because
+the fastest remediation path is currently the easiest place to hide failure.
+
+Disposition: modify — accepted the concern; made the interactions phase-aware and
+non-blocking. Every click dispatches a detached, single-flight background job and
+reports via =notify=; sudo surfaces through polkit/the normal prompt; failures go
+to the notify + the event log. In Phase 1 (no panel) left-click runs probe +
+notify and keeps the scratchpad; from Phase 2 left-click opens the panel focused
+on the action. Recorded in the Indicator "Interactions" subsection.
+
+** DONE Specify connection mutation safety and rollback :blocking:
+The spec says row click switches connections and remove gets a confirm, but it
+does not define what happens when a switch partially succeeds, disconnects the
+current working link, needs a password, loses the default route, or triggers
+auto-activation. The nmcli manual warns that =connection down= does not prevent
+future auto-activation and may internally block a profile until user action.
+Define preflight, the exact NM command sequence, whether the old active
+connection is kept until the new one proves usable, when rollback is attempted,
+how long activation waits, and what the panel says when rollback fails. This is
+blocking because the module can strand the user offline.
+
+Disposition: accept — added "Mutation safety + rollback" under Connection
+management: keep the prior link up until the target activates (=--wait 30=), no
+teardown on failure, password-required surfaced not stranded, =net down= reports
+post-op active state + the auto-reactivation caveat, and the pinned NM command
+sequence is tested against fake nmcli.
+
+** DONE Define the credential-store security model :blocking:
+The GPG store is described as optional and default-unencrypted, but the spec does
+not define file modes, schema, secret-source rules, import/export prompts,
+recipient verification, stale secret handling, or what is logged. It also says
+NM remains source of truth while the user-owned store contains PSK/EAP secrets,
+which creates two truth sources for sensitive data. Add a precise schema,
+=0600= file creation with parent-dir permissions, encrypted-recipient checks,
+plaintext warning text, explicit opt-in flow, redaction requirements, behavior
+when NM has a secret not in the store, behavior when the store has a secret NM
+rejects, and tests for no secret leakage in JSON/logs/errors. This blocks Phase
+4 and the full spec because otherwise the implementer must make security
+decisions mid-code.
+
+Disposition: accept — rewrote "Credential storage" with the versioned schema,
+=0600= file / =0700= dir, recipient verification on opt-in, the plaintext
+warning, secret-source rule (entered/exported, never harvested from root store),
+the two-source reconciliation policy (NM wins live, store wins for what NM
+lacks, stale-secret flagging), and the no-leak tests.
+
+** DONE Define EAP, enterprise WiFi, and unsupported connection behavior :blocking:
+The store says "PSK/EAP" and connection management says add/edit, but there is
+no v1 contract for WPA-Enterprise fields, certificates, identity vs anonymous
+identity, hidden networks, static IP, proxy settings, metered flags, MAC
+randomization, or 802.1X prompt behavior. Either scope v1 to open/WPA-PSK plus
+existing saved-profile activation, or define the minimum EAP form and the
+unsupported-state messages. This blocks add/edit/import because enterprise WiFi
+is too sensitive to hand-wave.
+
+Disposition: modify (scope) — scoped v1 to open + WPA-PSK add/edit, with
+*activation* of any existing saved profile (including enterprise). Enterprise /
+802.1X add/edit, static-IP, proxy, metered, and MAC-randomization editing are
+vNext, shown as "edit via nmtui/nmcli". Recorded in Scope/Out, Connection
+management, and Resolved decision 9.
+
+** DONE Split read-only diagnostics from mutating remediation :blocking:
+The panel's diagnostics section includes probe, bounce/reset, gateway ping, and
+DNS override test in one area, while =captive= currently performs resets and
+temporary DNS changes as part of its flow. Users need to know which buttons are
+read-only and which mutate NM profiles, MAC mode, DNS, or browser state. Add
+separate "Diagnose" and "Repair" actions, confirmations for destructive or
+privacy-changing operations, explicit cleanup verification for DNS override, and
+a terminal state when cleanup is unverified. This blocks readiness because
+network repair must not surprise the user or leave hidden residue.
+
+Disposition: accept — split the panel into a read-only Diagnose section and a
+confirmed, mutating Repair section (and split the CLI into =net diagnose= vs =net
+repair=). Added =cleanup_verified= + a terminal =cleanup-unverified= state to the
+diagnostics contract.
+
+** DONE Define panel state, cancellation, and permissions UX :blocking:
+The panel sections list buttons and a streaming output area, but not loading
+states, disabled states, empty states, keyboard/focus behavior, cancellation, or
+permission-denied handling. Add panel state machines for connection list loading,
+rescan in progress, activation in progress, diagnostics running, speedtest
+running, and no NetworkManager/no WiFi/no permissions/no GPG key/no
+librespeed-cli. Each long operation should be cancellable where possible or
+clearly non-cancellable with an elapsed-time display. This blocks the GTK work
+because without it the implementer must invent the user flow.
+
+Disposition: modify — accepted the state-machine requirement (added "Panel state,
+cancellation, permissions"), but scoped the state set to what can actually occur
+on the two-machine fleet: dropped "no NetworkManager" as a modeled state (NM is
+always present; a missing nmcli is a single hard-error exit) and kept
+no-wifi-hardware, missing speedtest-go, no-GPG-key, plus the in-progress states
+with elapsed-time + cancellation where the op allows.
+
+** DONE Verify speed-test dependency, server choice, and failure contract :blocking:
+The spec chooses =librespeed-cli= and notes availability/default-server research
+as an open risk, but Phase 3 still depends on parsing its JSON and showing
+progress. I checked the upstream project page
+(https://github.com/librespeed/speedtest-cli) and the AUR URL named by search is
+not sufficient as a verified package/install contract in this spec. Add the
+exact package name/source to install, command version expected, JSON shape,
+server-selection policy, timeout, cancellation behavior, offline/rate-limited
+messages, and tests with fixture JSON and fixture stderr. This blocks Phase 3
+because speed-test failure modes are otherwise undefined.
+
+Disposition: modify — verified live and changed the backend: =speedtest-go= (AUR
+=speedtest-go-bin=, 1.x) is already installed on velox and supports =--json=,
+=--server=, =--no-download/--no-upload=, so v1 needs no new dependency.
+librespeed-cli (AUR =librespeed-cli= / =-bin=) is the documented self-hosted
+fallback. Added the "Speed test" section with server policy, timeout,
+cancellation, the failure-message mapping, and fixture-JSON/stderr tests.
+
+** DONE Define dependency installation and repo boundaries :blocking:
+The files touched section alternates between archsetup paths and the external
+dotfiles repo, while pocketbook has been folded into this repo and its previous
+archsetup provisioning was intentionally removed. The spec should state where
+the =net= package actually lives, which repository owns the scripts/tests,
+whether =gtk4-layer-shell=, =python-gobject=, =librespeed-cli=, =gpg=, =nmcli=,
+=curl=, and =resolvectl= are installed by archsetup or assumed present, and the
+Makefile targets for test/lint/install. This blocks implementation because the
+current path plan can produce code that is not installed on a fresh machine.
+
+Disposition: accept — added the "Repository + dependencies" section: all code in
+=~/.dotfiles= (=net/= package in-tree like pocketbook, scripts in the hyprland
+tier, tests under =tests/=), archsetup owns only the dep install
+(=gtk4-layer-shell=, =python-gobject=, =speedtest-go-bin=; nmcli/curl/resolvectl
+already present), Makefile =make test= collects the package suite, and a
+daily-drivers note for ratio. Rewrote Files touched to match.
+
+** DONE Expand the test plan for failure, concurrency, and live verification :blocking:
+The testing plan covers normal parsing and fake command sequences, but it misses
+the riskiest behaviors: slow/hung =nmcli=/=curl=/=librespeed=, concurrent
+=net status= cache refresh, corrupt cache, stale cache after SSID change,
+permission denied, sudo declined, DNS override cleanup failure, NM partial
+activation, duplicate connection names, secret redaction, missing optional
+dependencies, no WiFi hardware, wired+tether+WiFi ambiguity, portal redirect
+tokens, and Waybar click handlers. Add unit/fixture tests for each class plus a
+manual/live checklist gated out of the normal suite. This is blocking because
+the current plan would leave the exact "things that can go wrong here" mostly
+untested.
+
+Disposition: accept — rewrote the Testing plan with the "Failure + concurrency"
+class (slow/hung commands, single-flight, corrupt/stale cache, perm-denied,
+cleanup-failure, partial activation, redaction, missing deps, no-wifi,
+multi-active) and a per-phase live checklist gated out of the suite.
+
+** DONE Define status JSON schemas and compatibility rules
+The spec says all subcommands take =--json= but does not define schemas. Add
+versioned JSON examples for =status=, =probe=, =list=, =diagnose=, =speedtest=,
+and error envelopes, including nullable fields and unknown/degraded states. This
+is non-blocking for product direction but should be fixed before code so tests
+can lock the CLI contract.
+
+Disposition: accept — added the "JSON schemas" section with versioned (=v:1=)
+envelopes for status / probe / list / diagnose / speedtest and a shared error
+envelope, including the degraded/unknown states.
+
+** DONE Rename or alias the phasing section for workflow compatibility
+The spec has a usable =Phasing= section, but the spec-review workflow expects an
+=Implementation phases= section that can be lifted into =todo.org=. Rename it or
+add an alias heading during response. This is non-blocking because the existing
+phase decomposition is understandable, but aligning the heading prevents future
+workflow friction.
+
+Disposition: accept — renamed =Phasing= → =Implementation phases= and added
+per-phase acceptance criteria.
+
+** DONE Add documentation and rollout acceptance checks
+Rollback is described, but docs and rollout are thin. Add README/user-guide
+updates for commands, panel behavior, config file, GPG opt-in, troubleshooting,
+and rollback; add acceptance checks for each phase, including a fresh-login
+Waybar smoke test and restoring =custom/netspeed=. This is non-blocking but
+important for handing the feature to a future session without re-discovery.
+
+Disposition: accept — added per-phase acceptance criteria under Implementation
+phases (incl. the fresh-login waybar smoke test and the =custom/netspeed=
+restore), a Phase 4 "Docs + rollout", and (answering Craig's cj follow-up) a
+dedicated "Help + documentation" section with the three help layers (CLI help,
+panel help affordance, user guide).
+
+** DONE Add a failure-mode coverage table :blocking:
+The spec now names many individual network failures, but it still does not carry
+one compact coverage matrix that says, for each common failure mode, whether
+=net diagnose= detects it, whether =net doctor --fix= can repair it, and what
+terminal user action remains when it cannot. Add a table covering at least:
+rfkill soft block, rfkill hard block, no WiFi hardware, associated/no DHCP,
+gateway unreachable, captive DNS hijack, broken DNS where 1.1.1.1 works, HTTP
+portal, HTTP interception without a parseable portal URL, upstream/AP outage,
+wrong WPA password or missing secret, enterprise auth/cert failure, duplicate
+SSID/connection-name ambiguity, hidden SSID, multiple active links, wedged
+NetworkManager, slow/hung command, stale/corrupt cache, DNS cleanup failure,
+missing speedtest backend, and VPN/routing interference. This blocks because
+Craig asked for confidence that the diagnostics and doctor cover the real field
+failures, and prose scattered across sections is too easy to misread.
+
+Disposition: accept — added the "Failure-mode coverage" section: a 22-row table
+(every mode the finding named) with detect / doctor-fix / terminal-action
+columns, conformed to the org-table standard (rules under every row, ≤120).
+
+** DONE Pin DNS repair semantics in doctor :blocking:
+The spec diagnoses DNS hijack, broken hotel DNS, and the temporary 1.1.1.1
+override test, but =net doctor --fix= does not say whether it merely recommends
+the override, applies a temporary override during recovery, or leaves DNS alone
+after diagnosis. Define the exact behavior for each DNS class: captive hijack
+should open the portal, broken DNS where 1.1.1.1 works should either offer an
+explicit temporary repair with cleanup verification or recommend the command,
+and port-53/egress blocking should stop as upstream/not locally fixable. This is
+blocking because DNS is one of the most common "connected but unusable" failures
+and the current doctor contract is ambiguous.
+
+Disposition: accept — added "DNS handling in doctor (explicit per class)" under
+the new Doctor section: hijack → open portal (no DNS mutation); broken-but-1.1.1.1
+→ explicit temporary override with cleanup verification under =--fix=, recommend
+otherwise; egress-blocked → terminal =upstream-not-local=.
+
+** DONE Make auth failures terminal user-action states :blocking:
+Wrong WPA password, missing NM secret, locked keyring/polkit denial, enterprise
+802.1X certificate/identity failure, and portal login-required are not fixed by
+resetting or bouncing NetworkManager. The doctor sequence should classify these
+as =needs-user-action= terminal states, stop before looping through destructive
+repairs, and tell the user the exact next action (enter password, edit profile in
+=nmtui=/=nmcli=, accept portal terms, provide cert/identity, or retry with
+admin auth). This blocks because repeated reset/bounce against auth failures is
+slow, noisy, and can make the network state worse without helping.
+
+Disposition: accept — added the =needs-user-action= terminal outcome to the
+Doctor section: wrong password / missing secret / keyring-or-polkit denial /
+802.1X cert-or-identity failure / portal-login-required all stop the doctor before
+any destructive repair and name the exact next step.
+
+** DONE Define upstream/AP/provider failure terminal states :blocking:
+Some failures are not client-repairable: AP has no uplink, hotel gateway is
+down, DHCP server is broken, gateway drops traffic, ISP outage, or captive
+portal backend is failing. The spec should define how =diagnose= proves "local
+link is up but upstream is broken" and how =doctor --fix= stops after local
+repairs are exhausted with a clear message like "local repairs tried; likely
+upstream/AP/provider" plus the evidence. This blocks because users need to know
+when to stop poking the laptop and switch networks or contact the venue.
+
+Disposition: accept — added the =upstream-not-local= terminal outcome: diagnose
+proves link-up + IP + gateway-reachable but no route out and no captive redirect;
+=doctor --fix= stops after local repairs with "local repairs tried; likely
+upstream/AP/provider" + evidence → switch network / contact venue.
+
+** DONE Decide how VPN and policy routing affect v1 diagnosis
+VPN/WireGuard management is Phase 5, but active VPNs, policy routes, DNS
+overrides, and firewall killswitches can break apparent internet access in v1.
+The current spec does not say whether v1 detects active VPN/policy routing and
+classifies "network is fine, VPN route/DNS is broken" separately from WiFi
+failure. Add either a v1 diagnostic check for active VPN/default-route/DNS
+ownership with a "deferred repair" outcome, or explicitly state that VPN-routed
+failures are out of scope and may be misclassified. This is blocking if Craig
+expects the module to diagnose normal daily-driver network failures while VPN
+tooling remains separate.
+
+Disposition: accept (chose the detect-and-classify option) — v1 detects an active
+VPN / non-NM default route / non-NM DNS owner and classifies =deferred/vpn= ("link
+is fine; internet is VPN-routed"), distinct from a WiFi failure. v1 does not
+repair it (VPN management is Phase 5); it names the VPN as the likely owner and
+stops. Added to the Doctor section + the coverage table + a doctor-classification
+test.
+
+** DONE Remove stale GPG-store references from the resolved spec
+The spec now decides "no separate credential store; secrets live in
+NetworkManager", but the Testing plan still mentions =gpg round-trip= and =GPG
+store= tests, and the panel-state list still mentions a no-GPG-key state. Remove
+those stale references and replace them with NM-secret/no-secret-leak tests.
+This is non-blocking for product behavior but blocking for implementation
+clarity: otherwise tests will be written for a credential store that no longer
+exists.
+
+Disposition: accept — replaced the Testing-plan =gpg round-trip= / =GPG store=
+bullets with an "NM secrets / no-leak" test (add/edit writes the secret via nmcli;
+assert no PSK/EAP in any JSON/log/error; no store to round-trip) and dropped the
+=no-GPG-key= panel state. Residue from the cj-comment pass that dropped the store.
+
+** DONE Reconcile status, goal, and task text before implementation :blocking:
+The spec status says "Implementation-ready with caveats" and "Phase 1 ready to
+build", but the body still has an unresolved enterprise add/edit VERIFY, the
+Goal still says "optional GPG-encrypted secret store", and the unified task title
+still names "GPG-stored secrets" even though the accepted design removed the
+store. Before implementation, make the top-level status, goal, scope, task
+mapping, and resolved decisions agree with the current design. This blocks
+readiness because a developer starting from the top of the file would still build
+or plan around abandoned GPG-store behavior.
+
+Disposition: accept — fixed the Goal ("secrets stay in NM's own store"), the
+=[#B]= task-mapping line (notes the "GPG-stored secrets" framing is superseded by
+decision 5), the enterprise VERIFY (now resolved → Status updated), and corrected
+the stale =pytest= mentions to =unittest= (the repo's actual harness). Top-of-file
+status/goal/scope/decisions now agree with the design.
+
+** DONE Resolve enterprise add/edit scope or make the caveat explicit :blocking:
+The spec still says "One open question for Craig: pull enterprise add/edit into
+v1?" and points to a VERIFY in =todo.org=. That is a real product-scope decision:
+if enterprise add/edit is in v1, panel forms, nmcli command sequences, tests,
+error messages, and docs change materially; if it is out, the UI must consistently
+show activate-only with "edit in nmtui/nmcli". Decide it in the spec before
+implementation, or downgrade the status to =Ready with caveats= with this exact
+accepted caveat. As written, the spec cannot be plain =Ready=.
+
+Disposition: accept — Craig decided (2026-06-29): enterprise add/edit is vNext,
+activate-only in v1. Settled in the Status line, the Scope/Out bullet, decision 9,
+and the VERIFY (now DONE in todo.org). The UI shows activate-only with "edit in
+nmtui/nmcli" consistently. Evidence: 24 saved profiles, 0 enterprise.
+
+** DONE Define the concrete test harness and coverage gate :blocking:
+The spec says TDD, fake binaries on PATH, and benchmark tests, but it does not
+define the actual harness contract: pytest vs unittest for the =net= package,
+where fake =nmcli=/=curl=/=speedtest-go=/=rfkill=/=resolvectl= live, how test
+fixtures encode command histories, how subprocess timeouts are simulated, how
+Waybar scripts are executed end-to-end, and how coverage is run. Add the exact
+Makefile targets (=test=, =test-unit= or package-local =pytest=), pytest config,
+coverage command (e.g. branch coverage over =net/= and =waybar-net= wrappers),
+minimum threshold, and the rule for reading the coverage report to add missing
+tests before declaring a phase done. This blocks readiness because "what is the
+test harness?" is still answerable only by analogy to older suites.
+
+Disposition: accept — added the "Harness + coverage gate" section. Corrected the
+premise: the repo is =unittest= (=make test= → =python3 -m unittest=, 33 suites),
+not pytest. Pinned the fake-binary stub convention (=tests/<name>/fake-*= on a
+temp PATH), the fixture command→output map, timeout simulation, the end-to-end
+=waybar-net= subprocess run, and coverage via a throwaway venv (coverage.py is
+absent system-wide) with a ≥90% branch target on the pure modules.
+
+** DONE Use coverage to find missing behavior, not just report a percentage :blocking:
+The spec does not say how coverage findings affect implementation. For this
+feature, line coverage alone can miss the important holes: doctor classification
+branches, cleanup-unverified paths, redaction paths, degraded hot-path fallbacks,
+timeout branches, and auth/upstream/VPN terminal states. Define coverage review
+criteria per phase: branch coverage for pure classifiers and parsers, named
+untested branches allowed only with comments or manual-check entries, and a
+required "coverage gap pass" after the first green test run that maps uncovered
+logic back to tests or consciously excluded live-only behavior. This blocks
+readiness because the current test plan is broad but does not force the suite to
+expose missing edge tests.
+
+Disposition: accept — added the "Coverage as a gap-finder, not a number (per
+phase)" subsection: branch coverage required for the doctor classifier (every
+outcome), cleanup-unverified, redaction, degraded-fallback, timeout, and the
+parsers; a mandatory coverage-gap pass after the first green run mapping each
+uncovered branch to a test or a named live-only exclusion; a phase isn't done
+until that pass is recorded.
+
+** DONE Convert error classes into exact user-facing strings and evidence fields :blocking:
+The failure table and doctor outcomes classify errors well, but many messages
+are still templates or descriptions rather than final text. Add exact strings
+for indicator tooltip, notification, CLI stderr, JSON =error.message=, and panel
+banner/step text for every failure-mode row, including cases doctor cannot fix:
+wrong password, missing secret, enterprise cert failure, upstream/AP/provider
+failure, VPN-routed failure, hard rfkill block, DNS cleanup failure, speedtest
+missing, and HTTP interception without parseable URL. For each string, specify
+the redacted evidence included and the next action. This blocks UX readiness
+because "useful error" is only testable once the actual text and evidence are
+defined.
+
+Disposition: accept — rewrote the Failure states section: each row now carries the
+exact final string (with =<placeholder>= evidence), the evidence field, and the
+next action, plus a per-surface rendering rule (indicator tooltip / notify /
+CLI+JSON error.message+detail+code / panel banner all render the one canonical
+string). Added the missing doctor-unfixable rows: hard rfkill, wrong password /
+missing secret, enterprise cert failure, upstream/AP/provider, VPN-routed, HTTP
+interception without a parseable URL, and DNS cleanup-unverified.
+
+** DONE Add an enhancement disposition table
+The spec captures several good enhancements (doctor, Makefile recovery, rfkill,
+airplane absorption, VPN phase), but it does not show that low-cost adjacent
+enhancements were considered and accepted/deferred/rejected. Add a small radar
+table for likely affordances: copy redacted doctor report, open/copy portal URL,
+retry with hardware MAC, forget network, rescan now, pin speedtest server, show
+last good network/result, watch mode for =net doctor=, desktop notification
+actions, QR-code/share WiFi import/export, and keyboard picker. Mark each
+=v1=, =vNext=, or =rejected= with a one-line reason. This is non-blocking, but it
+prevents accidental loss of cheap UX wins and keeps the v1 panel focused.
+
+Disposition: accept — added the "Enhancement radar" table dispositioning all the
+named affordances: open/copy portal URL, forget network, rescan, hardware-MAC
+retry, pin speedtest server, copy redacted doctor report = v1; last-good
+network/result, doctor watch mode, actionable notifications, keyboard picker =
+vNext; QR-share = rejected (low value for a 2-machine personal setup).
+
+** DONE Tighten the panel UX flow before Phase 2
+The panel has sections and state machines, but not a concrete interaction flow:
+default focused section, row content, primary/secondary buttons, disabled-state
+rules, confirmation wording for reset/bounce/DNS override, how "Get me online"
+reports each escalation, what stays visible after the panel closes, and keyboard
+navigation. Add a short UX flow spec or wire-level outline before Phase 2. This
+is non-blocking for Phase 1, but it blocks Phase 2 implementation because a GTK
+panel can easily become noisy or surprising if these defaults are invented while
+coding.
+
+Disposition: accept — added the "Panel UX flow (settle before Phase 2)"
+subsection: default focus (Connections, or Diagnose when opened from a captive
+state), row content, one primary button per section, disabled-state rules, exact
+confirmation wording for reset/bounce/DNS-override/remove, the live "Get me
+online" escalation reporting, what survives panel close, and keyboard nav.
+
+* Review and iteration history
+
+** 2026-06-29 Mon @ 17:00:39 -0400 — Codex — reviewer
+
+- *What changed or was recommended:* Rubric: =Not ready=. Applied the
+ spec-review workflow and added blocking findings for diagnostics structure,
+ user-facing errors, observability, nmcli contracts, cache concurrency,
+ hot-path performance, Waybar click actions, mutation rollback, credential
+ security, unsupported WiFi types, panel states, speed-test dependency
+ verification, install boundaries, and test coverage.
+- *Why:* The spec has the right broad shape, but this feature combines a status
+ indicator, NetworkManager mutation, sudo-backed remediation, captive-portal
+ probing, secret storage, and GTK UI. Too many failure modes were still left for
+ implementers to invent during coding.
+- *Artifacts:* Findings recorded in =Review findings=. Local code read included
+ =~/.dotfiles/hyprland/.local/bin/waybar-netspeed=,
+ =~/.dotfiles/common/.local/bin/captive=,
+ =~/.dotfiles/tests/waybar-netspeed/test_waybar_netspeed.py=,
+ =~/.dotfiles/tests/captive/test_captive.py=, pocketbook's GTK layer-shell
+ scaffold, current Waybar config, and the linked =todo.org= tasks. External
+ checks included the Arch nmcli man page and upstream LibreSpeed CLI page.
+
+** 2026-06-29 Mon @ 17:35:00 -0400 — Claude Code (archsetup) — responder
+
+- *What changed:* Processed all 18 findings — 13 accepted, 5 modified, 0
+ rejected. Accepts added new sections: Repository + dependencies, Performance
+ budgets, nmcli contract, JSON schemas, Diagnostics contract (+ Diagnose vs
+ Repair), Failure states/messages/recovery, Observability, cache
+ concurrency/atomicity/staleness, mutation safety + rollback, the rewritten
+ credential security model, the expanded test plan, JSON schemas, the renamed
+ Implementation phases with per-phase acceptance criteria, and a Status line.
+ Modifies: deferred the =net debug-bundle= command (log + =diagnose --json=
+ cover it), made click actions phase-aware + non-blocking, scoped enterprise
+ WiFi to activate-only (add/edit vNext), trimmed the panel state set to states
+ that occur on the two-machine fleet, and switched the speed-test backend to the
+ already-installed =speedtest-go= (verified live) instead of a new
+ librespeed-cli dependency.
+- *Why:* Codex's review was strong and mostly right; the modifies are scope calls
+ for a single-user, two-machine tool (avoid gold-plating, use installed deps,
+ don't model impossible states) and one phasing reconciliation (the panel is
+ Phase 2, so Phase-1 clicks can't open it).
+- *Artifacts:* All 18 findings completed =DONE= in place with dispositions. Live
+ check: =pacman -Qq | grep speed= (speedtest-go-bin installed; =speedtest-go
+ --json= confirmed). Status now "review incorporated (Codex, 2026-06-29)",
+ implementation-ready with the named Phase-2/3 caveats; Phase 1 ready to build.
+
+** 2026-06-29 Mon @ 17:37:58 -0400 — Claude Code (archsetup) — responder (cj comments)
+
+- *What changed:* Folded in Craig's 10 cj comments on the spec. Design changes:
+ (1) dropped the separate credential store entirely — secrets stay in NM's own
+ =system-connections= (root =0600=, inline), touched via nmcli; no GPG, no
+ gpg-agent (rewrote the secrets section, decision 5, dropped Phase 4 + the gpg
+ dep). (2) Added =net doctor [--fix]= + Makefile console-recovery targets
+ (=make online= etc.) as a first-class TTY path; reversed the earlier
+ defer-the-doctor call (decision 10). (3) Added a full-stack =bounce= repair and
+ an =rfkill= unblock repair + indicator state — the rfkill one recovers the
+ framework-laptop post-power-loss soft-block Craig hit. (4) =custom/net= absorbs
+ the airplane module; the standalone airplane scripts/tests/module are deleted on
+ ship. (5) Moved VPN/WireGuard from "out" to a planned Phase 5. (6) Added a
+ "Help + documentation" section (CLI help / panel help / user guide). Answered
+ the enterprise-defer rationale and the captive-auto-login explanation inline.
+- *Why:* Craig's comments simplify (NM-only secrets, fewer deps) and harden the
+ recovery story (doctor + make targets + rfkill/bounce reachable from a dead-GUI
+ console — his stated need). Enterprise add/edit in v1 is the one open call,
+ raised as a VERIFY.
+- *Artifacts:* All 10 cj blocks removed. Live checks: airplane scripts/tests
+ present (confirmed deletion targets); =/etc/NetworkManager/system-connections/=
+ is root =0600= with inline secrets (confirms the NM-only secret model);
+ =rfkill= present. VERIFY filed in =todo.org= for the enterprise-scope decision.
+
+** 2026-06-29 Mon @ 17:49:24 -0400 — Codex — reviewer
+
+- *What changed or was recommended:* Added a second feedback batch from the
+ common network failure-mode review: require a diagnose/doctor coverage table,
+ pin DNS repair semantics, make auth failures terminal user-action states,
+ define upstream/AP/provider terminal states, decide how active VPN/policy
+ routing affects v1 diagnosis, and remove stale GPG-store references.
+- *Why:* The design now has a solid recovery shape, but the field question is
+ sharper: for each common failure, does =diagnose= catch it, does =doctor --fix=
+ fix it, or does the user need to switch networks/provide credentials/contact
+ the venue? That mapping needs to live in the spec, not in chat.
+- *Artifacts:* Findings appended under =Review findings= after reviewing the
+ current spec's Diagnostics, Failure states, Doctor, Testing plan, and
+ responder history.
+
+** 2026-06-29 Mon @ 18:06:22 -0400 — Codex — reviewer
+
+- *What changed or was recommended:* Added a readiness pass focused on
+ development-time questions, missed enhancements, UX cleanliness, error wording,
+ and test/coverage mechanics. New findings require reconciling stale status/goal
+ text, resolving or explicitly caveating enterprise add/edit, defining the
+ concrete harness and coverage gate, making coverage actionable, turning failure
+ classes into exact user-facing strings, adding an enhancement disposition
+ table, and tightening the Phase-2 panel UX flow.
+- *Why:* The network failure model is now strong, but implementation would still
+ force developers to invent test infrastructure, coverage expectations, final
+ error copy, and several UX details. The spec should decide those up front or
+ explicitly mark them as caveats.
+- *Artifacts:* Findings appended under =Review findings= after reading the
+ updated spec, current archsetup Makefile test targets, pocketbook's pyproject,
+ and the linked =todo.org= enterprise VERIFY.
+
+** 2026-06-29 Mon @ 17:55:08 -0400 — Claude Code (archsetup) — responder (2nd review)
+
+- *What changed:* Processed all 6 second-batch findings — 6 accepted, 0 modified,
+ 0 rejected. Added a "Doctor: escalation, classification, terminal states"
+ section with four outcomes (=fixable= escalating lightest-first; =needs-user-
+ action=, =upstream-not-local=, =deferred/vpn= terminal) and explicit per-class
+ DNS handling. Added a "Failure-mode coverage" table (22 rows: detect / doctor-fix
+ / terminal action, org-table-standard conformant, ≤120). Cleaned the last
+ GPG-store residue (Testing plan → NM-secret/no-leak test; dropped the no-GPG-key
+ panel state). Added a doctor-classification test class.
+- *Why:* The findings sharpen the doctor from "escalate repairs" to "classify,
+ then stop at the right terminal state" — the safety property that keeps
+ =doctor --fix= from looping destructive repairs against a wrong password or an
+ upstream outage. The VPN finding got the detect-and-classify option (not repair;
+ that's Phase 5) so v1 doesn't misread a VPN-routed failure as WiFi down. The
+ coverage table is the single artifact Craig asked for to trust the field
+ coverage.
+- *Artifacts:* All 6 findings =DONE= in place; cookie =[24/24]=. Table conformed
+ via =wrap-org-table.el= (max line 120). No new findings introduced; no scope
+ expansion past what the findings asked. Spec remains implementation-ready, the
+ enterprise add/edit VERIFY still the one open decision.
+
+** 2026-06-29 Mon @ 18:24:51 -0400 — Claude Code (archsetup) — responder (3rd review)
+
+- *What changed:* Processed the third review (7 findings). The enterprise-scope
+ finding closed first on Craig's call (vNext, activate-only). The other 6, all
+ accepted: (1) reconciled the top-of-file text — fixed the Goal's GPG-store
+ wording, the =[#B]= task-mapping line, the resolved enterprise VERIFY, and
+ corrected the stale =pytest= mentions to =unittest= (the repo's real harness).
+ (2) Added a "Harness + coverage gate" section (unittest, fake-binary stubs on a
+ temp PATH, venv coverage, ≥90% branch on pure modules). (3) Added a per-phase
+ "coverage as a gap-finder" pass. (4) Rewrote the Failure states section to exact
+ final strings + evidence fields + a per-surface rendering rule, and added the
+ missing doctor-unfixable rows. (5) Added the "Enhancement radar" table
+ (v1/vNext/rejected). (6) Added the "Panel UX flow" subsection.
+- *Why:* The findings close the gap between "design decided" and "a developer can
+ start": the harness/coverage contract, the exact UX strings, and the panel flow
+ are the things otherwise invented mid-code. The =pytest=→=unittest= correction
+ was a real defect — the spec contradicted the repo's actual test convention.
+- *Artifacts:* All 31 findings =DONE=; cookie =[31/31]=. Both new tables conformed
+ via =wrap-org-table.el= (coverage 120, radar 110). Harness verified against the
+ live repo (33 unittest suites, =make test=, coverage.py absent → venv). Status
+ raised to "Ready for Phase 1; Ready-with-caveats overall" — no open decisions
+ remain.