#+TITLE: Waybar Network Module — Design Spec #+AUTHOR: Craig Jennings & Claude #+DATE: 2026-06-29 * Status *Phase 1 SHIPPED* (2026-06-29, dotfiles =5254bd8=..=c095a22=, 10 commits): engine (=net status/probe/diagnose/repair/doctor/portal=) + =waybar-net= indicator + split-cadence cache + redacted event log + Makefile recovery targets + airplane absorption. 160 net tests; pure modules ≥90% branch. One as-built deviation from this spec: airplane absorption is *display-only* (Craig's call, option 1) — net shows the airplane state but the =airplane-mode= low-power toggle is KEPT (it does radios + CPU + brightness + services, not a network concern); only =waybar-airplane= + =custom/airplane= + =waybar-netspeed= were deleted. See decision 12. Phases 2-5 remain. Live waybar eyeball is under todo.org "Manual testing and validation". Ready for Phase 1; Ready-with-caveats overall. Three Codex review rounds + Craig's cj comments are all incorporated — every finding has a disposition and the findings cookie reads complete ([31/31]), with no open decisions (enterprise scope settled: open + WPA-PSK in v1, 802.1X add/edit vNext, activate-only). The cj comments reshaped several decisions (no separate credential store — use NM's own; =net doctor= + Makefile console-recovery in v1; rfkill + full-stack-bounce repair; airplane module absorbed; VPN a later Phase 5). The only remaining caveats are Phase-2/3 build unknowns named under Open items (gtk4-layer-shell anchoring, the =captive= =--json= refactor) — not Phase-1 blockers. Phase 1 (indicator + console recovery) is ready to build. * Goal One waybar network component that does the whole job: shows connection state (including the missing "associated but no internet / captive portal" state), manages connections from a dropdown (nmcli-backed; secrets stay in NetworkManager's own store, no separate credential file), and runs the network diagnostics and remediation off the same place (captive-portal detection + forcing, bounce/reset, gateway/DNS checks, speed test). It unifies three todo tasks that are really one feature: - =[#C]= "archsetup Waybar Wi-Fi module should show no-internet state" — the indicator state plus the 2026-06-22 roam expansion (bounce, diagnostics, speed test off the component). - =[#B]= "Network-manager dropdown, nmcli-backed" — the management dropdown. (The todo task's original "GPG-stored secrets" framing is superseded: secrets stay in NM's own store, decision 5.) - The network diagnostics already shipped in =captive= (the hotel/captive-portal tool, formerly =login-page=) become this module's diagnostics engine rather than a standalone CLI. * Scope ** In - *Indicator* — wifi/ethernet icon + signal + SSID, plus an internet sub-state: online / captive / no-internet / connecting / disconnected / airplane. - *Absorbs the airplane module* — the airplane state + toggle move into =custom/net= (airplane is a network concern). Once this ships, the standalone =custom/airplane= module, the =waybar-airplane= + =airplane-mode= scripts, their =tests/=, and the css are deleted (listed under Files touched). The desktop-settings panel (sibling =[#B]=) no longer needs an airplane row. - *Interface-correct* — targets the wifi (or chosen) device, not the default-route interface, so an active USB tether or wired link can't mask wifi state. (Same lesson =captive= fixed; the current =custom/netspeed= keys off the default route and has the bug.) - *Connection management (panel)* — list saved connections most-recently-used first, live signal for in-range wifi, click to switch; add / edit / remove for open + WPA-PSK; activate any existing saved profile (including enterprise ones NM already stores); ethernet↔wifi and wifi↔wifi switching even when a link appears mid-session. - *Diagnostics (panel)* — read-only Diagnose (captive probe 204-vs-portal with the extracted portal URL, gateway ping, DNS config) separated from mutating Repair. Repair has tiers, lightest first: rfkill-unblock, per-connection reset (fresh MAC), full-stack bounce (=nmcli networking off/on=, then restart NetworkManager if that fails), and the temporary 1.1.1.1 override test. Each Repair action confirms and verifies cleanup. - *Speed test (panel)* — down/up/ping with a progress indicator and last-result shown, via the already-installed =speedtest-go --json=. - *Connection secrets* — none of our own. Settings and passwords live where NM already keeps them: =/etc/NetworkManager/system-connections/*.nmconnection= (root-only =0600=, the PSK/EAP secret stored inline). We read/write them through nmcli, which handles the privilege. No separate file, no GPG, no gpg-agent — one fewer dependency, and NM's store is already the secure-at-rest source of truth. - *Persistence* — connectivity probe result cached in the runtime dir so the bar reads it cheaply between probes. - *Observability* — a redacted JSONL event log so a post-failure session can diagnose without re-running destructive actions. ** Out (v1, note for later) - No replacement of NetworkManager's connection engine. NM stays the thing that connects; we drive it via nmcli. - No add/edit *form* for WPA-Enterprise / 802.1X in v1. The reason is effort vs payoff: 802.1X has many interdependent fields (CA cert, client cert, identity, anonymous identity, phase-2 auth) where a wrong entry silently fails auth, so a trustworthy form is a lot of UI for connections Craig rarely adds (open + WPA-PSK covers home, hotels, and phone hotspots). v1 still *activates* existing saved enterprise profiles and points editing at =nmtui=/=nmcli=. Settled (Craig, 2026-06-29): enterprise add/edit is vNext — 24 saved profiles on velox, 0 enterprise, so the form would be unused UI; if one ever appears nmtui adds it once and the module activates it thereafter. - No per-connection captive-portal *auto-login* in v1. (That would mean storing a portal's login form answers — room number, surname, a checkbox — and replaying them automatically when a known portal is detected, so the page never appears. Out for v1 because every portal's form differs and it means storing per-venue answers; v1 just opens the portal for you.) - No graphing/history of speed-test results beyond the last run. - No static-IP / proxy / metered / MAC-randomization editing in v1 (activate existing, edit elsewhere). - No VPN / WireGuard management in v1, but it's a planned later phase (Phase 5), not a permanent exclusion — it folds the existing archsetup wireguard tooling into the same panel/CLI. - The desktop-settings dropdown (sibling =[#B]=) is a separate module, but it shares the GTK4 layer-shell panel shell built here. * Architecture Three layers. Keep the bar cheap, the panel rich, the logic in one tested place. 1. *Engine* — a =net= Python package (src-layout, unittest), exposing a CLI. Wraps every nmcli op and owns the diagnostics. Emits JSON. This is the testable core (fake =nmcli= / =curl= / =speedtest-go= on PATH, like the existing =waybar-netspeed= and =waybar-sysmon= test harnesses). Precedent: pocketbook is Python in the dotfiles repo; =wtimer= is Python for the same testability reason. 2. *Indicator* — a thin =waybar-net= script that calls =net status --json= and renders icon + signal + state + tooltip. Replaces =custom/netspeed= (throughput folds into the tooltip). 3. *Panel* — a GTK4 + gtk4-layer-shell app (mirrors pocketbook's structure) that imports the engine. Hosts connection management, diagnostics, and the speed test. How the existing pieces map in: - =captive= (bash, shipped) — the engine shells out to it for the heavy, interactive portal-force flow (sudo reset, DNS override, browser launch). Its cheap portal-detection logic is mirrored natively in the engine for the fast status path so the bar never blocks on a subprocess. =captive= stays a usable standalone CLI. The refactor (below) extracts its probe + reset into functions the engine can call non-interactively. - =waybar-netspeed= (sh, shipped) — retired; its throughput sampling moves into the engine's status output and renders in the indicator tooltip only. - =nmcli= — the connection backend for every op. Language note: the engine is Python; the indicator is a thin Python or sh wrapper over =net status --json=. The bar path must stay fast (see Performance budgets), so the indicator does no network I/O itself — it reads link state and the cached connectivity result. * Repository + dependencies - *Code lives in the dotfiles repo* (=~/.dotfiles=), not archsetup. The =net= package sits in-tree like pocketbook (src-layout, unittest, Makefile target); =waybar-net= and the =net= CLI entry live in the hyprland tier (=hyprland/.local/bin/=). Tests under =tests/net/= and =tests/waybar-net/=. archsetup owns only the *dependency install*, not the code. - *archsetup installs the deps* in its Hyprland step: =gtk4-layer-shell=, =python-gobject=, plus =nmcli=/=curl=/=resolvectl=/=rfkill= (already present via NetworkManager/curl/systemd/util-linux). Speed test uses =speedtest-go= (AUR =speedtest-go-bin=, already installed on velox); archsetup adds it to the AUR list. librespeed-cli is the documented fallback if a self-hosted LibreSpeed server is ever wanted. No =gpg= dependency (secrets live in NM's own store). - *Daily-drivers*: a stowed-script + AUR-dep feature, so ratio needs the same =git pull= + stow + the archsetup-added deps. Note the manual dep step in the rollout. ** Makefile targets (console recovery is a first-class path) =net doctor= and the diagnostics are reachable from a bare TTY when waybar and the GUI are down — that's the case where you most need them. The dotfiles Makefile carries targets that wrap the =net= CLI so "get back online" is one make command from the console: - =make online= — =net doctor --fix= (diagnose, then apply the lightest repair: rfkill-unblock → reset → bounce → open portal). The headline recovery target. - =make net-doctor= — =net doctor= (read-only diagnose + recommendation). - =make net-status= / =make net-diagnose= / =make net-portal= / =make net-reset= / =make net-bounce= — the individual ops. - =make test= — already runs =tests/*=; the =net= package's unittest suites are collected the same way. These intentionally need only nmcli/curl/rfkill (no GUI, no waybar, no Python GTK), so they work from a TTY on a broken graphical session. * Connectivity model — split cadence The indicator polls every ~2s, but a real internet/captive probe every 2s wastes battery and can re-trigger a captive portal. So split it: - *Fast path (every poll, cheap, no network)* — interface, type, SSID, signal, IPv4 presence, throughput sample. From nmcli / sysfs only. No network I/O. - *Slow path (cached, TTL ~45s)* — the actual internet/captive probe (the 204 check + meta-refresh portal extraction). Result cached at =$XDG_RUNTIME_DIR/waybar/net-connectivity.json= with a timestamp. The indicator reads the cache each poll. When the cache is older than the TTL, =net status= kicks =net probe= in the background (spawn + detach, never awaited) and renders the last cached sub-state meanwhile. A user-triggered diagnose/reconnect refreshes the cache immediately. This keeps the bar responsive and the portal un-poked. ** Concurrency, atomicity, staleness - *Single-flight* — =net probe= takes a lock file at =$XDG_RUNTIME_DIR/waybar/net-probe.lock= (flock, non-blocking). A second probe while one runs is a no-op, so a flapping 2s poll can't pile up overlapping probes. - *Atomic writes* — the cache is written to a temp file + =os.replace= (atomic rename), so a reader never sees a half-written cache. Same pattern as =wtimer=. - *Max probe runtime* — the probe has a hard timeout (≤ 6s total: curl =--max-time 5= + slack). On timeout it writes an =unknown= result, never hangs. - *Stale classes* the indicator distinguishes: fresh (< TTL), stale (TTL..3×TTL, shown with a subdued/aging hint), expired (> 3×TTL → treat as unknown), unknown (no cache / probe failed). The bar never shows a confident "online" past the expired threshold. - *Invalidation* — the cache records the iface + SSID + active-connection UUID it was taken under; a change in any of them invalidates it immediately (a reconnect must not show the old network's verdict). - *Crash cleanup* — a stale lock older than the max runtime is ignored/reclaimed. * Performance budgets (hot path) The bar exec path (=waybar-net= → =net status=) must stay responsive: - *Budget*: =net status= returns in < 100ms typical, < 250ms worst case. - *No sleeping in the bar path.* Throughput is sampled from two reads of =/sys/class/net//statistics/{rx,tx}_bytes= across the *waybar poll interval itself* (delta since the last cached sample + timestamp), not via an in-process =sleep= like the old =waybar-netspeed=. The cache holds the prior counters. - *Subprocess cap*: at most one =nmcli= invocation on the hot path (a single =nmcli -t -f ...= multi-field query), plus sysfs reads. Never a per-field nmcli call. - *Every subprocess has a timeout* (=nmcli --wait 2=, =subprocess timeout=). On timeout or error the indicator emits a degraded JSON state (class =net-degraded=, a neutral glyph) rather than blocking or crashing waybar. - *Benchmark test*: a fake slow =nmcli= asserts =net status= still returns within budget by falling back to the degraded state. * Engine — =net= CLI surface All subcommands take =--json= where a machine reads them. Pure formatting/state functions under the CLI; IO (nmcli, curl, file) at the edges. Every subcommand exits non-zero with a JSON error envelope (see JSON schemas) on failure. - =net status [--json] [--iface IF]= — fast link state + cached connectivity sub-state + throughput. The indicator's source. Never does network I/O. - =net probe [--iface IF]= — run the connectivity/captive probe now, update the cache (single-flight, atomic), print online | captive (+ portal URL) | no-internet | unknown. Mirrors =captive='s cheap detection natively. - =net list [--json]= — saved connections, MRU order, active flag, plus in-range wifi with signal. - =net up = / =net down [--iface IF]= — switch / disconnect. Operates on UUID, not name (see nmcli contract). - =net add= / =net edit = / =net remove = — manage connections (open + WPA-PSK) through nmcli; the secret lands in NM's own =.nmconnection=. Enterprise profiles are activate-only. - =net rescan [--iface IF]= — wifi rescan. - =net diagnose [--json]= — read-only report: gateway ping, DNS config, captive probe. The structured contract below. Doubles as the post-failure snapshot. - =net repair [--json]= — mutating remediation, lightest first: =rfkill= (unblock + radio on), =reset= (fresh MAC), =bounce= (full-stack: =nmcli networking off/on=, escalating to =systemctl restart NetworkManager=), =dns-test= (temporary 1.1.1.1 override, auto-reverted). Each confirms via the caller and verifies cleanup. - =net doctor [--json] [--fix]= — one-shot "get me online" mode for the console: runs the full diagnose, then applies the lightest repair that fits (unblock rfkill, reset, bounce, open portal) — read-only without =--fix=, acting with it. The TTY recovery path when waybar/the GUI is down (see the Makefile targets). - =net portal= — run =captive='s portal-force flow (reset if needed, extract + open the portal page). - =net speedtest [--json]= — =speedtest-go --json= run; down/up/ping. * nmcli contract The command wrapper is the reliability boundary; SSIDs and connection names contain spaces, colons, duplicates, hidden names, and non-ASCII. Rules: - *Terse, field-selected output*: =nmcli -t -f --escape yes ...= and =nmcli -g ...= (get-values) for single-value reads. Parse with the documented escaping (=\:= and =\\=); never naive =cut -d:=. - *UUID is the handle.* Every saved-profile op (=up=, =down=, =modify=, =delete=) uses the connection UUID, never the display name — names duplicate and contain separators. =net list= surfaces UUIDs; the panel maps row → UUID. - *Wait budgets*: activation/deactivation use =nmcli --wait = with an explicit budget (hot-path reads =--wait 2=; activation =--wait 30=). No unbounded waits. - *Connectivity*: NM's own =nmcli networking connectivity= can return =none/portal/limited/full/unknown=. Use it as a *cheap hint* on the fast path when present, but the authoritative captive verdict is still our own probe (NM's portal detection is coarser and config-dependent). - *Parser tests* (fake nmcli fixtures): escaped colons and backslashes in SSIDs, embedded newlines, duplicate connection names, hidden SSID (empty name), non-ASCII SSID, the wired-appears-mid-session case, and the multi-active case (wifi + tether both up). * JSON schemas Versioned (="v": 1=) envelopes so tests lock the contract. Sketches (fields nullable unless noted): - =status=: ={v, iface, type: wifi|ethernet|none, ssid, signal, ipv4, gateway, throughput: {rx_bps, tx_bps}, connectivity: online|captive|no-internet|unknown, connectivity_age_s, connectivity_class: fresh|stale|expired|unknown, state: online|captive|no-internet|connecting|disconnected|airplane|wired|degraded}=. - =probe=: ={v, result: online|captive|no-internet|unknown, portal_url, http_code, redirect_host, elapsed_ms, ts}=. - =list=: ={v, connections: [{uuid, name, type, active, last_used, signal, in_range, security}]}=. - =diagnose=: ={v, steps: [], overall: ok|warn|fail}=. - =speedtest=: ={v, down_mbps, up_mbps, ping_ms, server, elapsed_ms, ts}=. - error envelope (any command): ={v, error: {code, message, detail, partial: bool}}= with a non-zero exit. * Diagnostics contract =net diagnose --json= returns an ordered list of steps. Each step is the unit the panel renders and the log records: - =id= — stable identifier (e.g. =link=, =dhcp=, =gateway-ping=, =dns-config=, =dns-resolve=, =http-probe=, =portal=). - =status= — =pending | running | pass | warn | fail | skipped=. - =title= — short human label. - =evidence= — redacted detail (the value seen), per the redaction rules. - =elapsed_ms=. - =safety= — =read-only= or =mutating= (diagnose steps are all read-only). - =next_action= — what the user/agent should do on warn/fail (e.g. "open portal", "reset connection", "switch network"). Repair actions (=net repair=) carry the same shape but =safety: mutating=, plus a =cleanup_verified: bool= field (e.g. the DNS override was reverted) and a terminal =cleanup-unverified= status when revert can't be confirmed. ** Diagnose vs Repair (read-only vs mutating) The panel separates them visually and behaviorally: - *Diagnose* — probe, gateway ping, DNS config read, captive check. No state change, no sudo, runnable freely. - *Repair* — reset (fresh MAC, deletes+recreates the NM profile), DNS override test (mutates resolver, auto-reverts), portal force. Each needs an explicit confirm, shows that it's privacy/state-changing, and verifies cleanup. A Repair whose cleanup can't be verified ends in a visible =cleanup-unverified= state, never a silent success. * Failure states, messages, recovery Each row below gives the *exact, final* user-facing string (not a template) with == for redacted evidence, plus the evidence field included and the next action. The string is canonical: every surface renders the same text, so there's one source of truth. Per-surface rendering of the canonical string: - *Indicator* — the matching glyph + CSS class; the string is the tooltip (untruncated). - *Notification* (=notify=) — title = the failure label, body = the string. - *CLI* — the string on stderr; =--json= puts it in =error.message= with the evidence in =error.detail= and a stable =error.code=. - *Panel* — the string as the section banner, with the diagnostic step's evidence shown beneath. Evidence is always redacted per the redaction rules (SSID/host shown; PSK/EAP/ portal tokens never). - *associated, no DHCP* — "Connected to , no IP (DHCP failed)" → evidence: SSID, iface → reset / reconnect. - *no-internet* — "On , no internet (gateway reachable, no route out)" → diagnose / switch network. - *captive* — "Captive portal at — login required" → Open portal. - *DNS hijack* — "DNS is being redirected (portal)" → Open portal. - *DNS broken* — "DNS not resolving (hotel DNS down); 1.1.1.1 works" → use override / report. - *HTTP intercepted* — "Traffic is being intercepted before it leaves" → Open portal. - *sudo declined* — "Reset needs admin; it was declined — nothing changed" → retry with auth. - *command timed out* — " timed out; the system was left unchanged" → retry. - *partial mutation* — " partially applied: ; rolled back to " → review. - *missing speedtest-go* — "speedtest-go not installed" → install hint. - *no wifi hardware* (desktop) — wifi rows hidden; ethernet-only view. - *wifi rfkill-blocked* — "WiFi is blocked (rfkill)" → unblock. The indicator detects a soft-blocked radio (=rfkill list= shows the radio off though hardware is present) and shows this distinct from disconnected. =net repair rfkill= (and =net doctor --fix= as its first step) runs =rfkill unblock wifi= + =nmcli radio wifi on= and reconnects. This is the framework-laptop case: an out-of-power shutdown sometimes leaves wifi soft-blocked at next boot, and yes — the module recovers it (the rfkill state is the indicator; the rfkill repair / doctor is the one-step fix). A *hard* block (physical switch) is reported as not-recoverable-in-software with that message. - *wifi rfkill hard-blocked* — "WiFi is blocked by the hardware switch" → evidence: rfkill hard state → flip the physical switch. - *wrong password / missing secret* — "Saved password for was rejected" → evidence: SSID, NM auth-failure reason → re-enter the password. - *enterprise auth/cert failure* — "Enterprise login failed for (802.1X)" → evidence: SSID, EAP failure reason → edit the profile in nmtui/nmcli. - *upstream / AP / provider* — "On , link is fine but the network has no uplink" → evidence: gateway reachable, no route out, not a portal → switch network or contact the venue. - *VPN-routed* — "Connected; internet is routed through a VPN ()" → evidence: default route on a tun/wg device or non-NM DNS owner → check the VPN, not WiFi. - *HTTP interception, no parseable portal URL* — "A portal is intercepting traffic but didn't give a login link" → evidence: HTTP code, redirect host → opens neverssl + the gateway page to log in manually. - *DNS override cleanup unverified* — "Couldn't confirm DNS was restored after the test" → evidence: iface, attempted revert → revert DNS manually (=resolvectl revert =). Each message names whether the system was left unchanged, partially changed (with what), or fully changed, so the user knows the residue. * Doctor: escalation, classification, terminal states =net doctor= diagnoses, classifies the failure, then (with =--fix=) applies the *lightest* repair that fits and re-checks — it never loops destructive repairs against a failure they can't fix. Each failure resolves to one of four outcomes, and the doctor stops at any terminal one: - =fixable= — a local repair should help. Escalate lightest-first: rfkill-unblock → reset (fresh MAC) → bounce (full stack) → portal, re-probing after each, and stop as soon as the probe returns online. - =needs-user-action= (terminal) — no reset/bounce will help; doctor stops and names the exact next step. Covers: wrong WPA password / missing NM secret (enter the password), locked keyring or polkit denial (retry with auth), enterprise 802.1X cert/identity failure (edit the profile in =nmtui=/=nmcli=), captive portal login-required (open the portal + accept terms). Doctor must not delete/recreate the profile against these — that loses the saved password and makes things worse. - =upstream-not-local= (terminal) — the local link is up but the problem is past it: AP has no uplink, gateway down/dropping traffic, DHCP server broken, ISP outage, portal backend failing. =diagnose= proves it (link up + IP + gateway reachable, but no route out and not a captive redirect), and =doctor --fix= stops after local repairs are exhausted with "local repairs tried; likely upstream/AP/provider" + the evidence. Next action: switch networks or contact the venue. - =deferred/vpn= (terminal for v1) — an active VPN / policy route / non-NM resolver owns the default route or DNS, so "no internet" may be the VPN's fault, not WiFi's. v1 *detects* this (default route on a =tun/wg= device, or DNS owned by something other than the NM link) and classifies it separately — "link is fine; internet is VPN-routed" — rather than misclassifying it as a WiFi failure. v1 does not repair it (VPN management is Phase 5); it names the VPN as the likely owner and stops. ** DNS handling in doctor (explicit per class) - *Captive DNS hijack* — open the portal (the hijack clears on login). No DNS mutation. - *Broken resolver, 1.1.1.1 works* — doctor offers an explicit *temporary* 1.1.1.1 override as a repair with cleanup verification (auto-revert, =cleanup_verified=); without =--fix= it only recommends the command. It does not leave a permanent resolver change. - *Port-53 / egress blocked* (even 1.1.1.1 fails) — terminal =upstream-not-local=; doctor stops, since it's not locally fixable. * Failure-mode coverage For each common field failure: does =net diagnose= detect it, can =net doctor --fix= repair it, and what terminal user action remains when it can't. (The =needs-user-action= / =upstream-not-local= / =deferred/vpn= outcomes are defined above.) | Failure mode | diagnose detects | doctor --fix | terminal user action | |----------------------------+-----------------------------+-----------------------------+-----------------------------| | rfkill soft block | yes | yes (unblock) | none | |----------------------------+-----------------------------+-----------------------------+-----------------------------| | rfkill hard block | yes | no | flip the physical switch | |----------------------------+-----------------------------+-----------------------------+-----------------------------| | no wifi hardware | yes | n/a | use ethernet | |----------------------------+-----------------------------+-----------------------------+-----------------------------| | associated, no DHCP | yes | yes (reset/bounce) | none, else switch network | |----------------------------+-----------------------------+-----------------------------+-----------------------------| | gateway unreachable | yes | yes (bounce) | switch network if it | | | | | persists | |----------------------------+-----------------------------+-----------------------------+-----------------------------| | captive DNS hijack | yes | opens portal | log in at the portal | |----------------------------+-----------------------------+-----------------------------+-----------------------------| | broken DNS, 1.1.1.1 works | yes | yes (temp override, | report the venue's DNS | | | | auto-reverted) | | |----------------------------+-----------------------------+-----------------------------+-----------------------------| | HTTP captive portal | yes | opens portal | log in at the portal | |----------------------------+-----------------------------+-----------------------------+-----------------------------| | HTTP interception, no | yes | opens neverssl + gateway | log in manually | | parseable URL | | | | |----------------------------+-----------------------------+-----------------------------+-----------------------------| | upstream / AP outage | yes (link up, no route out) | no (stops after local) | switch network / contact | | | | | venue | |----------------------------+-----------------------------+-----------------------------+-----------------------------| | wrong WPA password / | yes | no | enter the password | | missing secret | | | | |----------------------------+-----------------------------+-----------------------------+-----------------------------| | enterprise auth / cert | yes | no | edit the profile in | | failure | | | nmtui/nmcli | |----------------------------+-----------------------------+-----------------------------+-----------------------------| | duplicate SSID / | yes (UUID-keyed) | yes (activate by UUID) | none | | connection-name | | | | |----------------------------+-----------------------------+-----------------------------+-----------------------------| | hidden SSID | yes | yes (connect by name) | enter SSID + password | |----------------------------+-----------------------------+-----------------------------+-----------------------------| | multiple active links | yes | n/a | pick the interface | | (wifi+tether) | | | | |----------------------------+-----------------------------+-----------------------------+-----------------------------| | wedged NetworkManager | yes | yes (bounce → restart NM) | none, else reboot | |----------------------------+-----------------------------+-----------------------------+-----------------------------| | slow / hung command | yes (degraded) | retries within budget | retry | |----------------------------+-----------------------------+-----------------------------+-----------------------------| | stale / corrupt cache | yes | self-heals (atomic + | none | | | | invalidation) | | |----------------------------+-----------------------------+-----------------------------+-----------------------------| | DNS cleanup failure | yes | flags cleanup-unverified | revert DNS manually | |----------------------------+-----------------------------+-----------------------------+-----------------------------| | missing speedtest backend | yes | n/a | install speedtest-go | |----------------------------+-----------------------------+-----------------------------+-----------------------------| | VPN / policy-routing | yes (route/DNS ownership) | no (deferred to Phase 5) | check the VPN | | interference | | | | |----------------------------+-----------------------------+-----------------------------+-----------------------------| * Observability — logging + redaction - *Event log*: JSONL at =$XDG_STATE_HOME/net/events.jsonl= (fallback =~/.local/state/net/events.jsonl=), size-rotated (e.g. 1 MB × 3). Every mutating op and probe appends an event: =ts, op, argv (redacted), exit_code, stderr_tail, elapsed_ms, iface, nm_uuid, probe_url_class, http_code, redirect_host, cache_event=. - *Redaction (always on)*: PSKs, EAP identities/passwords, NM secrets, and portal query tokens are never logged. MAC addresses, full IPs, and SSID are redacted when configured (=redact_mac=, =redact_ip=, =redact_ssid= in config). - *Post-failure diagnosis*: =net doctor --json= is the snapshot + recommendation (diagnose plus the suggested repair), =net diagnose --json= the raw report, and the event log the history. =net doctor= is the console-recoverable entry point (reachable as =make online= / =make net-doctor=). - *Secret-leak tests*: assert no PSK/EAP/portal-token ever appears in any JSON output, log line, or error message. * Indicator (task #C — Phase 1, the fast win) ** States (internet sub-state on top of link state) - online — associated and the probe returned 204. Normal icon. - captive — associated, probe hit a portal. Distinct glyph + warning CSS class; tooltip names the portal host; left-click opens diagnostics with the portal ready to open (Phase 2+; see interactions for the Phase-1 interim). - no-internet — associated, probe failed (no portal, no 204). Distinct glyph + warning class. - degraded — =net status= couldn't read link state within budget (slow/failed nmcli). Neutral glyph, =net-degraded= class. Never blocks the bar. - rfkill-blocked — the radio is soft-blocked (=rfkill=), distinct from disconnected. Distinct glyph; the fix is =net repair rfkill= / =net doctor=. - connecting / disconnected / airplane / wired — as today, plus wired shown correctly even when it appears after session start. (airplane is now this module's state, absorbed from the retired airplane module.) ** Glyphs Nerd-font codepoints, final values verified live before merge (same discipline as wtimer). Reuse the signal-strength ramp already in =waybar-netspeed=; add a captive / no-internet / degraded overlay glyph. ** Tooltip SSID + signal + IPv4 + gateway + the throughput readout (absorbed from netspeed) + the last probe result and its age (stale/expired hinted). ** Interactions (phase-aware; no keyboard-modifier clicks — waybar can't qualify clicks by modifier, so the rich actions live in the panel, not ctrl/super-click) Clicks never block the bar: each dispatches a detached background job and reports via =notify=, single-flight per action. - *Phase 1 (no panel yet)*: left-click runs =net probe= + notify (refreshes the state on demand) and keeps the existing =pypr toggle network= scratchpad as the interim manager; right-click runs =net repair reset= in the background + notify; middle-click runs =net portal=. - *Phase 2+ (panel exists)*: left-click opens the panel (focused on the relevant section — diagnostics when captive); right/middle keep the background reset/portal shortcuts. * Panel (tasks #B + #C diagnostics — Phases 2-3) GTK4 + gtk4-layer-shell, pocketbook scaffold (src-layout package, unittest, Makefile, gtk4-layer-shell anchored dropdown under the bar). One panel shell, reused by the future desktop-settings panel. Sections: 1. *Connections* — list, MRU-first, active marked, live signal bars for in-range wifi; row click switches; buttons for add / edit / remove; a rescan control. 2. *Diagnose* (read-only) — Probe (204/captive, shows portal URL + Open), Gateway ping, DNS config. Streaming step output (the diagnostics contract). 3. *Repair* (mutating, confirmed) — tiered lightest-first: Unblock rfkill, Reset (fresh MAC), Bounce (full stack), DNS override test, Force portal. A "Get me online" button runs =net doctor --fix= (the auto-escalating sequence). 4. *Speed test* — Run button, progress, down/up/ping result + last-run line. ** Panel state, cancellation, permissions State machines for: connection-list loading, rescan-in-progress, activation-in-progress, diagnose-running, repair-running, speedtest-running. Plus the real terminal states on this two-machine fleet: no-wifi-hardware (desktop → ethernet-only view) and missing speedtest-go. (No GPG-key state — there's no credential store; secrets live in NM.) ("No NetworkManager" is not a modeled state — NM is always present on these machines; if nmcli is somehow absent the panel shows a single hard-error and exits.) Long operations show elapsed time and are cancellable where the underlying op allows (rescan, speedtest, probe); clearly non-cancellable ones (an in-flight activation) show elapsed + a disabled control. Permission-denied (sudo/polkit declined) is a first-class outcome with the "nothing changed" message, never a silent failure. Interaction-pattern catalog (=~/code/rulesets/patterns/=) principles that apply: - transient-state-buttons — all the network levers in one place, reachable by one chord (the bar click), state visible. - default-most-common-friction-proportional — connections MRU-ordered so the common pick is first; destructive ops (remove) and privacy-changing ones (reset, override) get a confirm, switching does not. - one-prompt-picker-typed-prefix — if the connection picker ever goes keyboard-driven, kind (wifi/eth/saved/in-range) + name in one typed picker. ** Panel UX flow (settle before Phase 2) The concrete interaction defaults, so the GTK build isn't inventing them: - *Default focus*: the Connections section, current connection's row selected. If the indicator opened the panel because of a captive/no-internet state, focus Diagnose instead with the relevant action highlighted. - *Row content*: glyph (signal bars / wired / active check) + name + a secondary line (security type, "active"/last-used). The active row is visually pinned at top of its group. - *Buttons*: one *primary* per section (Connections: Connect to the selected row; Diagnose: Run diagnose; Repair: "Get me online"; Speed test: Run). Secondary actions (add / edit / remove / rescan; individual repair tiers) are smaller and grouped. - *Disabled rules*: Connect disabled on the already-active row; Repair tiers disabled while one runs; Speed test disabled while running; add/edit disabled for enterprise (with the "edit in nmtui/nmcli" hint). - *Confirmations* (exact wording): Reset → "Reset ? This drops the connection and reconnects with a new MAC."; Bounce → "Restart networking? All links drop briefly."; DNS override → "Temporarily set DNS to 1.1.1.1 for the test? It reverts automatically."; Remove → "Forget ? The saved password is deleted." - *"Get me online" reporting*: shows each escalation step live (Unblock rfkill → Reset → Bounce → Portal) with per-step pass/fail and stops at the first that restores internet or at a terminal state, naming the next action. - *After close*: the bar reflects the new state immediately (signal/refresh on next poll); a running speedtest/diagnose keeps running and notifies on finish (panel close doesn't cancel it). - *Keyboard*: Esc closes; Tab moves between sections; arrows move rows; Enter fires the section primary; the connection list is type-to-filter. * Connection management (nmcli) - Every op via nmcli per the nmcli contract above (terse, escaped, UUID-keyed, bounded =--wait=). - MRU ordering from NM's =connection.timestamp= (last activated), descending. - Ethernet appears in the list whenever a wired device is present, selectable at any time; switching just brings the chosen connection up. - *Mutation safety + rollback*: switching keeps the current connection up until the new one activates successfully (=nmcli --wait 30=); on failure it does not tear down the working link, surfaces the failure, and leaves the prior connection active. =net down= notes that NM may auto-reactivate a profile and reports the post-op active connection so the user isn't surprised. A switch that needs a password it doesn't have prompts (or fails with "password required"), never silently strands. The exact NM command sequence (preflight active-state read → activate target → verify default route → on failure, confirm prior still up) is pinned in the engine and tested against fake nmcli. - *Add/edit scope*: open + WPA-PSK only in v1. Existing saved profiles of any type (including enterprise) can be *activated*; editing an enterprise profile shows "edit via nmtui/nmcli" rather than a broken partial form. * Connection secrets (no separate store) Per Craig's call: don't build a parallel credential store. Settings and secrets live where NetworkManager already keeps them, so there's one source of truth and no extra dependency (no GPG, no gpg-agent, no =~/.config/net/connections=). - *Where secrets live*: =/etc/NetworkManager/system-connections/.nmconnection=, root-owned =0600=, with the PSK/EAP secret stored inline (the default =secret-flags=0= "owned by NM"). That's already secure-at-rest (root-only) and is what =nmcli= reads/writes. - *How we touch them*: every add/edit/remove goes through =nmcli= (=connection add / modify / delete=), which writes the =.nmconnection= with the right ownership and perms. We never read or write =system-connections= files directly (root) and never copy a secret out of them. - *No export / import / sync* — there's nothing to sync. A new machine gets its connections the way it always has (the user joins, or restores NM profiles), not from a tool-specific vault. - *config file*: =~/.config/net/config= still exists, but only for non-secret preferences (speedtest server, redaction flags, probe TTL). It holds no credentials. - *No secret leakage*: PSK/EAP never appear in =net=' =--json= output, the event log, or error text (tested) — even though NM is the store, our surfaces must not echo a secret =nmcli= happens to return. * Speed test - Backend: *=speedtest-go=* (=--json=, =--server=, =--no-download/--no-upload=), already installed on velox (AUR =speedtest-go-bin=). No new dependency for v1. librespeed-cli is the documented fallback for a self-hosted LibreSpeed server. - =net speedtest --json= parses speedtest-go's JSON into the =speedtest= schema. - *Server policy*: auto-select nearest by default; allow a pinned server id in =~/.config/net/config=. - *Timeout + cancellation*: a hard run timeout (e.g. 60s); the panel run is cancellable (kills the child). Offline / rate-limited / no-server errors map to the failure-message table. - *Tests*: fixture JSON (success) and fixture stderr (offline, no server, malformed output) drive =net speedtest= parsing without touching the network. * Help + documentation In-app help has three layers, each reachable in the situation it's needed: - *CLI help (works from a dead-GUI TTY)*: =net --help= lists the subcommands in one screen; =net --help= documents each (flags, what it mutates, the console-recovery targets). The Makefile targets are self-describing (=make help= lists =online= / =net-doctor= / etc. with one-line descriptions). This is the layer that matters most when you're at a console with no network. - *Panel help (in the GUI)*: a small =?= affordance in the panel header opens an inline help pane — what each section does, which Repair actions mutate state, what the indicator glyphs/colors mean. Per-control tooltips on the less-obvious buttons (rfkill, bounce, DNS override). No external help browser. - *User guide (the durable doc)*: a README / docs page covering every command, the indicator states + glyphs, the panel sections, the config file keys, the recovery make targets, troubleshooting (the failure-message table), and rollback. Written so a future session — or Craig six months out — can operate and recover the module from the doc alone. The failure-message table above is the single source of truth for the troubleshooting text; the guide and the panel help both render from it rather than restating it. * Enhancement radar Low-cost adjacent affordances, each dispositioned so cheap wins aren't lost and the v1 panel stays focused. (Several are already in v1 by virtue of other sections; marked here so the consideration is visible.) | Enhancement | Disposition | Reason | |-------------------------------------+-------------+--------------------------------------------------------| | Open / copy portal URL | v1 | already in the captive flow; trivial Open + Copy | |-------------------------------------+-------------+--------------------------------------------------------| | Forget network | v1 | it's the remove op, already specced | |-------------------------------------+-------------+--------------------------------------------------------| | Rescan now | v1 | already a Connections control | |-------------------------------------+-------------+--------------------------------------------------------| | Retry with hardware MAC | v1 | captive already has --hardware-mac; expose in Repair | |-------------------------------------+-------------+--------------------------------------------------------| | Pin speedtest server | v1 | already a config key | |-------------------------------------+-------------+--------------------------------------------------------| | Copy redacted doctor report | v1 | cheap, serves the observability/support goal | |-------------------------------------+-------------+--------------------------------------------------------| | Show last good network / result | vNext | needs small history persistence | |-------------------------------------+-------------+--------------------------------------------------------| | Watch mode for net doctor | vNext | a --watch loop; handy at a TTY, not v1-critical | |-------------------------------------+-------------+--------------------------------------------------------| | Actionable desktop notifications | vNext | dunst supports actions; extra wiring | |-------------------------------------+-------------+--------------------------------------------------------| | Keyboard connection picker (fuzzel) | vNext | the typed-prefix pattern; panel covers v1 | |-------------------------------------+-------------+--------------------------------------------------------| | QR-code share / import WiFi | rejected | low value for a personal 2-machine setup; phones do QR | |-------------------------------------+-------------+--------------------------------------------------------| * Waybar wiring - Replace =custom/netspeed= with =custom/net= in the bar's module list (same slot). - Module def: =exec: waybar-net=, =return-type: json=, =interval: 2=, a =signal= for on-demand refresh (next free signal after wtimer's 14), =on-click=, =on-click-right=, =on-click-middle= per the phase-aware interactions (each dispatches a detached job, never blocks). - Remove the old =on-click: pypr toggle network= scratchpad only once the panel replaces it (Phase 2); Phase 1 keeps it as the interim manager. * Testing plan (TDD) - *Engine (normal)* — fake =nmcli= + =curl= + =speedtest-go= on PATH; assert command sequences and parsed/emitted JSON for status, list, up/down, add/edit/remove, probe, diagnose, repair, speedtest. Pure state/format functions tested directly. JSON schemas locked by example. - *Portal parser* — already covered in =tests/captive= (Normal/Boundary/Error + the real SONIFI body). The engine's native probe reuses the same cases. - *nmcli parsing* — escaped colon/backslash/newline in SSID, duplicate names, hidden SSID, non-ASCII, wired-mid-session, multi-active (wifi+tether). - *Failure + concurrency (the risky classes)* — slow/hung nmcli/curl/speedtest (degraded state within budget), concurrent =net status= probe refresh (single-flight), corrupt cache (recovered), stale cache after SSID change (invalidated), permission denied / sudo declined, DNS-override cleanup failure (=cleanup-unverified=), NM partial activation (rollback keeps prior link), secret redaction, missing speedtest-go, no wifi hardware, rfkill soft/hard block. - *Doctor classification* — fixture-driven =net doctor= over fake nmcli/curl asserting the right terminal classification + that =--fix= stops before destructive repairs: auth failures (=needs-user-action=), upstream/AP failure (=upstream-not-local=), VPN-routed failure (=deferred/vpn=), and the DNS classes (hijack → portal, broken-but-1.1.1.1-works → offered override, egress-blocked → upstream). Assert the failure-mode coverage table's "detects / repairs / terminal action" holds for each row. - *Indicator* — drive =net status --json= through =waybar-net=, assert the JSON per state (online / captive / no-internet / degraded / wired / disconnected / rfkill), iface override via env. - *Panel* — pocketbook-style: backing logic (list ordering, op dispatch, state-machine transitions), not GTK widgets. - *NM secrets / no-leak* — add/edit writes the secret into NM via nmcli (asserted against fake nmcli, never to a tool-owned file); assert no PSK/EAP appears in any =--json=, log line, or error (there is no credential store to round-trip). - *Live checklist (gated out of the suite)* — a "Manual testing and validation" task per phase for the real-network states (captive at a hotel, no-internet, switch under load, reset, speedtest) that can't be faked. ** Harness + coverage gate The concrete contract, matching the repo's existing convention (not pytest — the dotfiles suites are =unittest=, run by =make test= as =python3 -m unittest= over =tests/*/test_*.py=; 33 suites today): - *Framework*: =unittest=. Each suite is =tests//test_.py= (=tests/net/=, =tests/waybar-net/=), collected by the existing =make test= loop — no new runner, no pytest dependency. - *Fakes on a temp PATH*: =fake-nmcli=, =fake-curl=, =fake-speedtest-go=, =fake-rfkill=, =fake-resolvectl= live as executable stubs in =tests//= (the =tests/layout-navigate/fake-hyprctl= pattern). A fixture file encodes the command→canned-output map and the stub appends each invocation to a log the test asserts against. Subprocess timeouts are simulated by a stub that sleeps past the budget; =net status= must still return the degraded state. - *Waybar wrappers end-to-end*: =waybar-net= is run as a subprocess with the fake PATH and the env overrides (iface, cache path), asserting the emitted JSON — same as =tests/waybar-netspeed=. - *Coverage*: coverage.py is absent system-wide (and not importable), so coverage runs in a throwaway venv (=python3 -m venv=, =pip install coverage=, =coverage run -m unittest=, =coverage report=) — the method the wtimer suite used (95%). Target: *branch* coverage over =net/= and the wrapper, ≥ 90% on the pure classifier/parser modules. ** Coverage as a gap-finder, not a number (per phase) Line coverage alone misses the branches that matter here, so each phase ends with a *coverage-gap pass*, not just a percentage: - After the first green run, read the branch report and map every uncovered branch to either a new test or a consciously-excluded live-only behavior (with a comment or a Manual-testing entry naming it). - *Branch coverage is required* for the pure logic: the doctor classifier (every outcome — fixable / needs-user-action / upstream-not-local / deferred-vpn), the cleanup-unverified path, the redaction paths, the degraded hot-path fallback, the timeout branches, and the portal/nmcli parsers. - A phase isn't "done" until its coverage-gap pass is recorded — uncovered logic is either tested or explicitly excused, never silently uncovered. * Files touched (planned, all in =~/.dotfiles=) - =net/= package (src-layout, like pocketbook) — engine + panel. - =hyprland/.local/bin/waybar-net= — the indicator (replaces =waybar-netspeed=). - =hyprland/.local/bin/net= — engine CLI entry (console-script shim). - =hyprland/.config/waybar/config= — swap =custom/netspeed= → =custom/net=; remove =custom/airplane=. - =hyprland/.config/waybar/style.css= — captive / no-internet / degraded / rfkill classes; remove airplane classes. - =tests/net/=, =tests/waybar-net/= — suites. - =captive= — refactor: extract probe + reset into functions callable non-interactively (a =--json= probe mode) so the engine reuses them. - =~/.config/net/config= — seed config (probe TTL, speedtest server, redaction flags). No secrets; not a credential store. - dotfiles =Makefile= — add the console-recovery targets (=online=, =net-doctor=, =net-status=, =net-diagnose=, =net-portal=, =net-reset=, =net-bounce=). - *Deletions once net ships* (the airplane module is absorbed): =hyprland/.local/bin/waybar-airplane=, =hyprland/.local/bin/airplane-mode=, =tests/waybar-airplane/=, =tests/airplane-mode/=, and the =custom/airplane= module + its css. - archsetup Hyprland step — add =gtk4-layer-shell=, =python-gobject=, =speedtest-go-bin= to the install lists (the only archsetup change; no =gpg= added, secrets stay in NM's store). * Resolved decisions (Craig's calls + this response) 1. Panel UI tech → GTK4 + gtk4-layer-shell, shared pocketbook scaffold (one panel shell, reused by the desktop-settings sibling). 2. Engine language → Python =net= package; shells out to =captive= for the portal-force flow, native cheap probe for the bar path. 3. Connectivity probe → split cadence (fast link poll every 2s + slow cached internet/captive probe, TTL ~45s) with single-flight + atomic cache. 4. No keyboard-modifier clicks (waybar can't qualify them) — the panel hosts the rich actions; bar clicks dispatch detached jobs (phase-aware). 5. No separate credential store (Craig's call, cj). Secrets live in NM's own =system-connections= (root =0600=, inline), touched via nmcli. No GPG, no gpg-agent, no =~/.config/net/connections=. Supersedes the earlier GPG-store design. 6. =custom/netspeed= absorbed into =custom/net=; throughput moves to the tooltip. 7. Speed-test backend → =speedtest-go= (already installed), not a new librespeed-cli dependency; librespeed-cli is the self-hosted fallback. 8. Code lives in the dotfiles repo; archsetup only installs deps. 9. v1 add/edit scope = open + WPA-PSK; enterprise/802.1X is activate-only, add/edit is vNext (settled by Craig 2026-06-29 — no enterprise networks in his history, so the form would be unused UI). 10. =net doctor= is in v1 (Craig's call, cj) — a one-shot diagnose+fix mode, reachable from a TTY via =make online= / =make net-doctor=. (The earlier "defer the doctor/bundle command" decision is reversed.) 11. Diagnose (read-only) and Repair (mutating, confirmed) are separated in the panel and the CLI; Repair is tiered lightest-first (rfkill → reset → bounce). 12. =custom/net= absorbs the airplane module (Craig's call, cj). *As built (2026-06-29, option 1): display-only.* net shows the airplane state (reads the airplane-mode state file); the =airplane-mode= low-power toggle is kept (radios + CPU + brightness + services is not a network concern) and moved to =custom/net='s right-click + signal 15. Only the redundant display pieces — =waybar-airplane=, =custom/airplane=, and the retired =waybar-netspeed= — plus their tests/css were deleted. The earlier "delete airplane-mode" framing is superseded. 13. Repair includes a full-stack bounce and an rfkill-unblock (Craig's calls, cj) — the latter recovers the framework-laptop post-power-loss soft-block. 14. VPN / WireGuard is a planned Phase 5 (Craig's call, cj), not a permanent exclusion. * Implementation phases - *Phase 1 — Indicator + console recovery (task #C).* =net status= + =net probe= (native cheap probe, reusing captive's logic) + the =captive= probe refactor + =waybar-net= + the split-cadence cache (single-flight, atomic, stale classes) + CSS states (incl. rfkill) + performance budget. Plus the CLI-only recovery path: =net repair= tiers (rfkill / reset / bounce), =net doctor [--fix]=, and the Makefile targets (=make online= etc.) — all testable without the GTK panel. Absorbs the airplane state and removes the standalone airplane module. Interim left-click keeps the existing scratchpad until the panel lands. - *Acceptance*: fresh-login waybar smoke test shows correct state on online/captive/no-internet/wired/rfkill; =net status= stays within budget under a fake slow nmcli (degraded state); =net doctor --fix= recovers a soft-blocked radio from a TTY; the live captive checklist passes at a real portal; the airplane state works and the old airplane module is gone; reverting = swap =custom/netspeed= + =custom/airplane= back. - *Phase 2 — Panel shell + connection management (task #B core).* GTK4 layer-shell scaffold + =net list/up/down/add/edit/remove/rescan= + MRU list + mutation safety/rollback + panel state machines. - *Acceptance*: switch wifi↔wifi and ethernet↔wifi without stranding; a failed switch leaves the prior link up; add/edit open + WPA-PSK writes the secret to NM; remove confirms; panel states render for loading/rescan/activation. - *Phase 3 — Diagnostics + speed test in the panel.* Wire =net diagnose= / =net repair= / =net doctor= / =net portal= / =net speedtest= into the Diagnose vs Repair sections; the "Get me online" button; portal Open button; speedtest progress + cancel. - *Acceptance*: diagnose runs read-only; each repair tier confirms + verifies cleanup (DNS override reverts, shown); speedtest result parses from speedtest-go and a fixture-driven failure shows the right message. - *Phase 4 — Docs + rollout.* In-app help (=net --help= / per-command help, the panel help affordance), README/user-guide (commands, panel, config, troubleshooting, the make targets, rollback), and the manual dep step on ratio. - *Acceptance*: =net --help= and each subcommand's help are complete; the user-guide covers every command + the recovery targets; ratio rollout documented. - *Phase 5 — VPN / WireGuard (future).* Fold the existing archsetup wireguard tooling into the same panel + CLI (=net vpn ...=). Out of the v1 milestone; specced separately when picked up. * Open items / risks - gtk4-layer-shell dropdown anchoring under a waybar module needs the same positioning work pocketbook solved; reuse it. (Phase 2.) - The =captive= refactor must keep the standalone CLI behavior identical while exposing a non-interactive =--json= probe; covered by the existing =tests/captive= suite plus new probe-mode tests. (Phase 1.) - speedtest-go server selection variance (nearest-server flor) — pin a server in config if results are noisy. (Phase 3.) - The background-probe kick from =net status= must be truly non-blocking (spawn + detach); enforced by the single-flight lock and the performance benchmark test. * Rollback Each phase is independent. The indicator (Phase 1) is a drop-in replacement for =custom/netspeed= (and =custom/airplane=); reverting is swapping those modules back in the config and restoring their scripts. The panel is additive — not wiring its clicks leaves the bar working as before. No credential store to roll back (secrets stay in NM throughout). * Review findings [31/31] ** DONE Define the structured diagnostics contract :blocking: The spec says the engine "emits JSON" and that diagnostics "reuse =captive= verbatim", but the current =~/.dotfiles/common/.local/bin/captive= flow is a human-readable bash script that mixes diagnostics, sudo prompts, DNS mutation, browser launch, and terminal prose. A GTK panel cannot reliably turn that into clear state, progress, cancellation, or useful error messages. Define the machine contract before implementation: every diagnostic step should have a stable id, status (=pending/running/pass/warn/fail/skipped=), redacted evidence, elapsed time, safety outcome, and next action. Keep =captive= as the interactive CLI, but either refactor reusable probe/reset functions behind =net diagnose --json= or make =captive= expose a non-interactive JSON mode. This blocks the panel and logging work because otherwise the implementer must invent the boundary. Disposition: accept — added the "Diagnostics contract" section (per-step id / status / evidence / elapsed / safety / next_action) and the =captive= =--json= probe-mode refactor under Architecture + Files touched. ** DONE Specify user-facing failure messages and recovery actions :blocking: The spec names failure states like =no-internet=, =captive=, failed probe, failed reset, missing DNS, and missing speed-test backend, but it does not define the messages the user sees or what each message tells them to do next. For this feature, "error" is not enough: a user needs to know whether WiFi is associated, whether DHCP succeeded, whether DNS is hijacked/broken, whether HTTP is intercepted, whether sudo was declined, whether a command timed out, and whether the system was left unchanged or partially changed. Add a message table for the indicator, panel, and CLI with: failure class, visible text, evidence included, redaction rule, and next action. This is blocking because UX quality here is the product, not an implementation detail. Disposition: accept — added the "Failure states, messages, recovery" section covering each class, the visible message, the "what changed" residue note, and the next action across indicator/panel/CLI. ** DONE Define the debug log and redacted support bundle :blocking: There is no observability section. When this fails in a hotel or cafe, an agent needs enough evidence to diagnose it without rerunning destructive actions. Add log location, rotation/retention, JSONL event schema, command argv logging, exit-code/stderr capture, elapsed time, selected iface, NM active connection UUID, probe URL class, HTTP code, redirect host, DNS servers, and cache read/write events. Also define a =net doctor --json= or =net debug-bundle= command that emits redacted status, recent log events, dependency versions, and a reproduction command. Redact SSID if configured, MAC addresses, portal query tokens, PSKs, EAP identities/passwords, IPs when requested, and all GPG/NM secrets. This blocks implementation readiness because post-failure diagnosis is currently left to ad hoc terminal spelunking. Disposition: modify — accepted the JSONL event log, the schema, and the redaction rules in full (new "Observability" section). Deferred the dedicated =net debug-bundle= / =net doctor= command to vNext: for a single-user tool =net diagnose --json= (the snapshot) plus the event log (the history) cover post-failure diagnosis; a bundle command is gold-plating for v1. Recorded under Out + Resolved decision 10. ** DONE Pin the nmcli parsing and timeout contract :blocking: The spec lists nmcli operations but not the exact fields, output modes, escaping rules, ID semantics, or timeouts. This is risky because SSIDs and connection names can contain spaces, colons, duplicates, hidden names, and non-ASCII; the current =waybar-netspeed= already had an SSID parsing bug. The nmcli manual documents =--terse=, =--get-values=, =--escape=, =--wait=, ID/UUID/path selection, =passwd-file=, and built-in connectivity states (=none/portal/limited/full/unknown=) at https://man.archlinux.org/man/nmcli.1.en. The spec should require UUIDs for saved-profile operations, explicit =--wait= budgets, parser tests for escaped colons/backslashes/newlines/duplicate names/hidden SSIDs, and a decision on when to use or ignore =nmcli networking connectivity [check]=. This is blocking because the command wrapper is the core reliability boundary. Disposition: accept — added the "nmcli contract" section: terse + =--escape= + =--get-values=, UUID-keyed ops, explicit =--wait= budgets, NM connectivity as a cheap hint (our probe authoritative), and the parser test matrix. ** DONE Define cache concurrency, atomicity, and stale-state behavior :blocking: =net status= may spawn =net probe= whenever the cache is stale, but the spec does not define locking, process coalescing, atomic writes, crash cleanup, or what happens when the probe hangs. With a 2s Waybar interval, a bad network could start overlapping probes, corrupt the runtime cache, or keep showing stale "online" while the link is gone. Add a single-flight lock under =$XDG_RUNTIME_DIR/waybar=, atomic write+rename for cache updates, max probe runtime, stale age classes (fresh/stale/expired/unknown), cache invalidation on iface/SSID/connection UUID change, and tests for concurrent =net status= calls. This blocks the fast-path design because it is the main performance and correctness risk. Disposition: accept — added "Concurrency, atomicity, staleness" under the Connectivity model: flock single-flight, temp+rename atomic write, ≤6s probe timeout, fresh/stale/expired/unknown classes, iface/SSID/UUID invalidation, stale lock reclaim, plus concurrency tests in the test plan. ** DONE Bound hot-path performance with measured budgets :blocking: The spec says the cheap poll should be sub-100ms, but the proposed fast path still may call multiple =nmcli= commands every two seconds, read sysfs, parse throughput, and maybe spawn a background probe. The existing =waybar-netspeed= had a deliberate sleep for throughput sampling; replacing it must define how throughput is sampled without sleeping in the bar path. Add a per-command budget for =waybar-net= and =net status=, a maximum number of subprocesses on the hot path, a timeout for every subprocess, benchmark tests with fake slow =nmcli=, and a rule that the indicator emits a degraded JSON state rather than blocking. This is blocking because Waybar custom modules can visibly freeze or lag when their exec path stalls. Disposition: accept — added the "Performance budgets" section: <100ms typical / <250ms worst, throughput sampled across the poll interval (no in-process sleep), one nmcli call max on the hot path, timeouts on every subprocess, the degraded state, and a fake-slow-nmcli benchmark test. ** DONE Make click actions non-blocking and visible :blocking: Waybar right-click runs =net reset= and middle-click runs =net portal= directly. Those operations can require sudo, open browsers, mutate DNS, delete/recreate NM profiles, or hang on network commands, but Waybar click handlers provide no panel, terminal, progress, or cancellation surface by default. Define whether right/middle click instead opens the panel focused on the action, dispatches a background job with notifications, or is removed from v1. If kept, specify single-flight behavior, how sudo/polkit prompts surface, how success/failure is reported, and how the user can inspect logs. This blocks UX readiness because the fastest remediation path is currently the easiest place to hide failure. Disposition: modify — accepted the concern; made the interactions phase-aware and non-blocking. Every click dispatches a detached, single-flight background job and reports via =notify=; sudo surfaces through polkit/the normal prompt; failures go to the notify + the event log. In Phase 1 (no panel) left-click runs probe + notify and keeps the scratchpad; from Phase 2 left-click opens the panel focused on the action. Recorded in the Indicator "Interactions" subsection. ** DONE Specify connection mutation safety and rollback :blocking: The spec says row click switches connections and remove gets a confirm, but it does not define what happens when a switch partially succeeds, disconnects the current working link, needs a password, loses the default route, or triggers auto-activation. The nmcli manual warns that =connection down= does not prevent future auto-activation and may internally block a profile until user action. Define preflight, the exact NM command sequence, whether the old active connection is kept until the new one proves usable, when rollback is attempted, how long activation waits, and what the panel says when rollback fails. This is blocking because the module can strand the user offline. Disposition: accept — added "Mutation safety + rollback" under Connection management: keep the prior link up until the target activates (=--wait 30=), no teardown on failure, password-required surfaced not stranded, =net down= reports post-op active state + the auto-reactivation caveat, and the pinned NM command sequence is tested against fake nmcli. ** DONE Define the credential-store security model :blocking: The GPG store is described as optional and default-unencrypted, but the spec does not define file modes, schema, secret-source rules, import/export prompts, recipient verification, stale secret handling, or what is logged. It also says NM remains source of truth while the user-owned store contains PSK/EAP secrets, which creates two truth sources for sensitive data. Add a precise schema, =0600= file creation with parent-dir permissions, encrypted-recipient checks, plaintext warning text, explicit opt-in flow, redaction requirements, behavior when NM has a secret not in the store, behavior when the store has a secret NM rejects, and tests for no secret leakage in JSON/logs/errors. This blocks Phase 4 and the full spec because otherwise the implementer must make security decisions mid-code. Disposition: accept — rewrote "Credential storage" with the versioned schema, =0600= file / =0700= dir, recipient verification on opt-in, the plaintext warning, secret-source rule (entered/exported, never harvested from root store), the two-source reconciliation policy (NM wins live, store wins for what NM lacks, stale-secret flagging), and the no-leak tests. ** DONE Define EAP, enterprise WiFi, and unsupported connection behavior :blocking: The store says "PSK/EAP" and connection management says add/edit, but there is no v1 contract for WPA-Enterprise fields, certificates, identity vs anonymous identity, hidden networks, static IP, proxy settings, metered flags, MAC randomization, or 802.1X prompt behavior. Either scope v1 to open/WPA-PSK plus existing saved-profile activation, or define the minimum EAP form and the unsupported-state messages. This blocks add/edit/import because enterprise WiFi is too sensitive to hand-wave. Disposition: modify (scope) — scoped v1 to open + WPA-PSK add/edit, with *activation* of any existing saved profile (including enterprise). Enterprise / 802.1X add/edit, static-IP, proxy, metered, and MAC-randomization editing are vNext, shown as "edit via nmtui/nmcli". Recorded in Scope/Out, Connection management, and Resolved decision 9. ** DONE Split read-only diagnostics from mutating remediation :blocking: The panel's diagnostics section includes probe, bounce/reset, gateway ping, and DNS override test in one area, while =captive= currently performs resets and temporary DNS changes as part of its flow. Users need to know which buttons are read-only and which mutate NM profiles, MAC mode, DNS, or browser state. Add separate "Diagnose" and "Repair" actions, confirmations for destructive or privacy-changing operations, explicit cleanup verification for DNS override, and a terminal state when cleanup is unverified. This blocks readiness because network repair must not surprise the user or leave hidden residue. Disposition: accept — split the panel into a read-only Diagnose section and a confirmed, mutating Repair section (and split the CLI into =net diagnose= vs =net repair=). Added =cleanup_verified= + a terminal =cleanup-unverified= state to the diagnostics contract. ** DONE Define panel state, cancellation, and permissions UX :blocking: The panel sections list buttons and a streaming output area, but not loading states, disabled states, empty states, keyboard/focus behavior, cancellation, or permission-denied handling. Add panel state machines for connection list loading, rescan in progress, activation in progress, diagnostics running, speedtest running, and no NetworkManager/no WiFi/no permissions/no GPG key/no librespeed-cli. Each long operation should be cancellable where possible or clearly non-cancellable with an elapsed-time display. This blocks the GTK work because without it the implementer must invent the user flow. Disposition: modify — accepted the state-machine requirement (added "Panel state, cancellation, permissions"), but scoped the state set to what can actually occur on the two-machine fleet: dropped "no NetworkManager" as a modeled state (NM is always present; a missing nmcli is a single hard-error exit) and kept no-wifi-hardware, missing speedtest-go, no-GPG-key, plus the in-progress states with elapsed-time + cancellation where the op allows. ** DONE Verify speed-test dependency, server choice, and failure contract :blocking: The spec chooses =librespeed-cli= and notes availability/default-server research as an open risk, but Phase 3 still depends on parsing its JSON and showing progress. I checked the upstream project page (https://github.com/librespeed/speedtest-cli) and the AUR URL named by search is not sufficient as a verified package/install contract in this spec. Add the exact package name/source to install, command version expected, JSON shape, server-selection policy, timeout, cancellation behavior, offline/rate-limited messages, and tests with fixture JSON and fixture stderr. This blocks Phase 3 because speed-test failure modes are otherwise undefined. Disposition: modify — verified live and changed the backend: =speedtest-go= (AUR =speedtest-go-bin=, 1.x) is already installed on velox and supports =--json=, =--server=, =--no-download/--no-upload=, so v1 needs no new dependency. librespeed-cli (AUR =librespeed-cli= / =-bin=) is the documented self-hosted fallback. Added the "Speed test" section with server policy, timeout, cancellation, the failure-message mapping, and fixture-JSON/stderr tests. ** DONE Define dependency installation and repo boundaries :blocking: The files touched section alternates between archsetup paths and the external dotfiles repo, while pocketbook has been folded into this repo and its previous archsetup provisioning was intentionally removed. The spec should state where the =net= package actually lives, which repository owns the scripts/tests, whether =gtk4-layer-shell=, =python-gobject=, =librespeed-cli=, =gpg=, =nmcli=, =curl=, and =resolvectl= are installed by archsetup or assumed present, and the Makefile targets for test/lint/install. This blocks implementation because the current path plan can produce code that is not installed on a fresh machine. Disposition: accept — added the "Repository + dependencies" section: all code in =~/.dotfiles= (=net/= package in-tree like pocketbook, scripts in the hyprland tier, tests under =tests/=), archsetup owns only the dep install (=gtk4-layer-shell=, =python-gobject=, =speedtest-go-bin=; nmcli/curl/resolvectl already present), Makefile =make test= collects the package suite, and a daily-drivers note for ratio. Rewrote Files touched to match. ** DONE Expand the test plan for failure, concurrency, and live verification :blocking: The testing plan covers normal parsing and fake command sequences, but it misses the riskiest behaviors: slow/hung =nmcli=/=curl=/=librespeed=, concurrent =net status= cache refresh, corrupt cache, stale cache after SSID change, permission denied, sudo declined, DNS override cleanup failure, NM partial activation, duplicate connection names, secret redaction, missing optional dependencies, no WiFi hardware, wired+tether+WiFi ambiguity, portal redirect tokens, and Waybar click handlers. Add unit/fixture tests for each class plus a manual/live checklist gated out of the normal suite. This is blocking because the current plan would leave the exact "things that can go wrong here" mostly untested. Disposition: accept — rewrote the Testing plan with the "Failure + concurrency" class (slow/hung commands, single-flight, corrupt/stale cache, perm-denied, cleanup-failure, partial activation, redaction, missing deps, no-wifi, multi-active) and a per-phase live checklist gated out of the suite. ** DONE Define status JSON schemas and compatibility rules The spec says all subcommands take =--json= but does not define schemas. Add versioned JSON examples for =status=, =probe=, =list=, =diagnose=, =speedtest=, and error envelopes, including nullable fields and unknown/degraded states. This is non-blocking for product direction but should be fixed before code so tests can lock the CLI contract. Disposition: accept — added the "JSON schemas" section with versioned (=v:1=) envelopes for status / probe / list / diagnose / speedtest and a shared error envelope, including the degraded/unknown states. ** DONE Rename or alias the phasing section for workflow compatibility The spec has a usable =Phasing= section, but the spec-review workflow expects an =Implementation phases= section that can be lifted into =todo.org=. Rename it or add an alias heading during response. This is non-blocking because the existing phase decomposition is understandable, but aligning the heading prevents future workflow friction. Disposition: accept — renamed =Phasing= → =Implementation phases= and added per-phase acceptance criteria. ** DONE Add documentation and rollout acceptance checks Rollback is described, but docs and rollout are thin. Add README/user-guide updates for commands, panel behavior, config file, GPG opt-in, troubleshooting, and rollback; add acceptance checks for each phase, including a fresh-login Waybar smoke test and restoring =custom/netspeed=. This is non-blocking but important for handing the feature to a future session without re-discovery. Disposition: accept — added per-phase acceptance criteria under Implementation phases (incl. the fresh-login waybar smoke test and the =custom/netspeed= restore), a Phase 4 "Docs + rollout", and (answering Craig's cj follow-up) a dedicated "Help + documentation" section with the three help layers (CLI help, panel help affordance, user guide). ** DONE Add a failure-mode coverage table :blocking: The spec now names many individual network failures, but it still does not carry one compact coverage matrix that says, for each common failure mode, whether =net diagnose= detects it, whether =net doctor --fix= can repair it, and what terminal user action remains when it cannot. Add a table covering at least: rfkill soft block, rfkill hard block, no WiFi hardware, associated/no DHCP, gateway unreachable, captive DNS hijack, broken DNS where 1.1.1.1 works, HTTP portal, HTTP interception without a parseable portal URL, upstream/AP outage, wrong WPA password or missing secret, enterprise auth/cert failure, duplicate SSID/connection-name ambiguity, hidden SSID, multiple active links, wedged NetworkManager, slow/hung command, stale/corrupt cache, DNS cleanup failure, missing speedtest backend, and VPN/routing interference. This blocks because Craig asked for confidence that the diagnostics and doctor cover the real field failures, and prose scattered across sections is too easy to misread. Disposition: accept — added the "Failure-mode coverage" section: a 22-row table (every mode the finding named) with detect / doctor-fix / terminal-action columns, conformed to the org-table standard (rules under every row, ≤120). ** DONE Pin DNS repair semantics in doctor :blocking: The spec diagnoses DNS hijack, broken hotel DNS, and the temporary 1.1.1.1 override test, but =net doctor --fix= does not say whether it merely recommends the override, applies a temporary override during recovery, or leaves DNS alone after diagnosis. Define the exact behavior for each DNS class: captive hijack should open the portal, broken DNS where 1.1.1.1 works should either offer an explicit temporary repair with cleanup verification or recommend the command, and port-53/egress blocking should stop as upstream/not locally fixable. This is blocking because DNS is one of the most common "connected but unusable" failures and the current doctor contract is ambiguous. Disposition: accept — added "DNS handling in doctor (explicit per class)" under the new Doctor section: hijack → open portal (no DNS mutation); broken-but-1.1.1.1 → explicit temporary override with cleanup verification under =--fix=, recommend otherwise; egress-blocked → terminal =upstream-not-local=. ** DONE Make auth failures terminal user-action states :blocking: Wrong WPA password, missing NM secret, locked keyring/polkit denial, enterprise 802.1X certificate/identity failure, and portal login-required are not fixed by resetting or bouncing NetworkManager. The doctor sequence should classify these as =needs-user-action= terminal states, stop before looping through destructive repairs, and tell the user the exact next action (enter password, edit profile in =nmtui=/=nmcli=, accept portal terms, provide cert/identity, or retry with admin auth). This blocks because repeated reset/bounce against auth failures is slow, noisy, and can make the network state worse without helping. Disposition: accept — added the =needs-user-action= terminal outcome to the Doctor section: wrong password / missing secret / keyring-or-polkit denial / 802.1X cert-or-identity failure / portal-login-required all stop the doctor before any destructive repair and name the exact next step. ** DONE Define upstream/AP/provider failure terminal states :blocking: Some failures are not client-repairable: AP has no uplink, hotel gateway is down, DHCP server is broken, gateway drops traffic, ISP outage, or captive portal backend is failing. The spec should define how =diagnose= proves "local link is up but upstream is broken" and how =doctor --fix= stops after local repairs are exhausted with a clear message like "local repairs tried; likely upstream/AP/provider" plus the evidence. This blocks because users need to know when to stop poking the laptop and switch networks or contact the venue. Disposition: accept — added the =upstream-not-local= terminal outcome: diagnose proves link-up + IP + gateway-reachable but no route out and no captive redirect; =doctor --fix= stops after local repairs with "local repairs tried; likely upstream/AP/provider" + evidence → switch network / contact venue. ** DONE Decide how VPN and policy routing affect v1 diagnosis VPN/WireGuard management is Phase 5, but active VPNs, policy routes, DNS overrides, and firewall killswitches can break apparent internet access in v1. The current spec does not say whether v1 detects active VPN/policy routing and classifies "network is fine, VPN route/DNS is broken" separately from WiFi failure. Add either a v1 diagnostic check for active VPN/default-route/DNS ownership with a "deferred repair" outcome, or explicitly state that VPN-routed failures are out of scope and may be misclassified. This is blocking if Craig expects the module to diagnose normal daily-driver network failures while VPN tooling remains separate. Disposition: accept (chose the detect-and-classify option) — v1 detects an active VPN / non-NM default route / non-NM DNS owner and classifies =deferred/vpn= ("link is fine; internet is VPN-routed"), distinct from a WiFi failure. v1 does not repair it (VPN management is Phase 5); it names the VPN as the likely owner and stops. Added to the Doctor section + the coverage table + a doctor-classification test. ** DONE Remove stale GPG-store references from the resolved spec The spec now decides "no separate credential store; secrets live in NetworkManager", but the Testing plan still mentions =gpg round-trip= and =GPG store= tests, and the panel-state list still mentions a no-GPG-key state. Remove those stale references and replace them with NM-secret/no-secret-leak tests. This is non-blocking for product behavior but blocking for implementation clarity: otherwise tests will be written for a credential store that no longer exists. Disposition: accept — replaced the Testing-plan =gpg round-trip= / =GPG store= bullets with an "NM secrets / no-leak" test (add/edit writes the secret via nmcli; assert no PSK/EAP in any JSON/log/error; no store to round-trip) and dropped the =no-GPG-key= panel state. Residue from the cj-comment pass that dropped the store. ** DONE Reconcile status, goal, and task text before implementation :blocking: The spec status says "Implementation-ready with caveats" and "Phase 1 ready to build", but the body still has an unresolved enterprise add/edit VERIFY, the Goal still says "optional GPG-encrypted secret store", and the unified task title still names "GPG-stored secrets" even though the accepted design removed the store. Before implementation, make the top-level status, goal, scope, task mapping, and resolved decisions agree with the current design. This blocks readiness because a developer starting from the top of the file would still build or plan around abandoned GPG-store behavior. Disposition: accept — fixed the Goal ("secrets stay in NM's own store"), the =[#B]= task-mapping line (notes the "GPG-stored secrets" framing is superseded by decision 5), the enterprise VERIFY (now resolved → Status updated), and corrected the stale =pytest= mentions to =unittest= (the repo's actual harness). Top-of-file status/goal/scope/decisions now agree with the design. ** DONE Resolve enterprise add/edit scope or make the caveat explicit :blocking: The spec still says "One open question for Craig: pull enterprise add/edit into v1?" and points to a VERIFY in =todo.org=. That is a real product-scope decision: if enterprise add/edit is in v1, panel forms, nmcli command sequences, tests, error messages, and docs change materially; if it is out, the UI must consistently show activate-only with "edit in nmtui/nmcli". Decide it in the spec before implementation, or downgrade the status to =Ready with caveats= with this exact accepted caveat. As written, the spec cannot be plain =Ready=. Disposition: accept — Craig decided (2026-06-29): enterprise add/edit is vNext, activate-only in v1. Settled in the Status line, the Scope/Out bullet, decision 9, and the VERIFY (now DONE in todo.org). The UI shows activate-only with "edit in nmtui/nmcli" consistently. Evidence: 24 saved profiles, 0 enterprise. ** DONE Define the concrete test harness and coverage gate :blocking: The spec says TDD, fake binaries on PATH, and benchmark tests, but it does not define the actual harness contract: pytest vs unittest for the =net= package, where fake =nmcli=/=curl=/=speedtest-go=/=rfkill=/=resolvectl= live, how test fixtures encode command histories, how subprocess timeouts are simulated, how Waybar scripts are executed end-to-end, and how coverage is run. Add the exact Makefile targets (=test=, =test-unit= or package-local =pytest=), pytest config, coverage command (e.g. branch coverage over =net/= and =waybar-net= wrappers), minimum threshold, and the rule for reading the coverage report to add missing tests before declaring a phase done. This blocks readiness because "what is the test harness?" is still answerable only by analogy to older suites. Disposition: accept — added the "Harness + coverage gate" section. Corrected the premise: the repo is =unittest= (=make test= → =python3 -m unittest=, 33 suites), not pytest. Pinned the fake-binary stub convention (=tests//fake-*= on a temp PATH), the fixture command→output map, timeout simulation, the end-to-end =waybar-net= subprocess run, and coverage via a throwaway venv (coverage.py is absent system-wide) with a ≥90% branch target on the pure modules. ** DONE Use coverage to find missing behavior, not just report a percentage :blocking: The spec does not say how coverage findings affect implementation. For this feature, line coverage alone can miss the important holes: doctor classification branches, cleanup-unverified paths, redaction paths, degraded hot-path fallbacks, timeout branches, and auth/upstream/VPN terminal states. Define coverage review criteria per phase: branch coverage for pure classifiers and parsers, named untested branches allowed only with comments or manual-check entries, and a required "coverage gap pass" after the first green test run that maps uncovered logic back to tests or consciously excluded live-only behavior. This blocks readiness because the current test plan is broad but does not force the suite to expose missing edge tests. Disposition: accept — added the "Coverage as a gap-finder, not a number (per phase)" subsection: branch coverage required for the doctor classifier (every outcome), cleanup-unverified, redaction, degraded-fallback, timeout, and the parsers; a mandatory coverage-gap pass after the first green run mapping each uncovered branch to a test or a named live-only exclusion; a phase isn't done until that pass is recorded. ** DONE Convert error classes into exact user-facing strings and evidence fields :blocking: The failure table and doctor outcomes classify errors well, but many messages are still templates or descriptions rather than final text. Add exact strings for indicator tooltip, notification, CLI stderr, JSON =error.message=, and panel banner/step text for every failure-mode row, including cases doctor cannot fix: wrong password, missing secret, enterprise cert failure, upstream/AP/provider failure, VPN-routed failure, hard rfkill block, DNS cleanup failure, speedtest missing, and HTTP interception without parseable URL. For each string, specify the redacted evidence included and the next action. This blocks UX readiness because "useful error" is only testable once the actual text and evidence are defined. Disposition: accept — rewrote the Failure states section: each row now carries the exact final string (with == evidence), the evidence field, and the next action, plus a per-surface rendering rule (indicator tooltip / notify / CLI+JSON error.message+detail+code / panel banner all render the one canonical string). Added the missing doctor-unfixable rows: hard rfkill, wrong password / missing secret, enterprise cert failure, upstream/AP/provider, VPN-routed, HTTP interception without a parseable URL, and DNS cleanup-unverified. ** DONE Add an enhancement disposition table The spec captures several good enhancements (doctor, Makefile recovery, rfkill, airplane absorption, VPN phase), but it does not show that low-cost adjacent enhancements were considered and accepted/deferred/rejected. Add a small radar table for likely affordances: copy redacted doctor report, open/copy portal URL, retry with hardware MAC, forget network, rescan now, pin speedtest server, show last good network/result, watch mode for =net doctor=, desktop notification actions, QR-code/share WiFi import/export, and keyboard picker. Mark each =v1=, =vNext=, or =rejected= with a one-line reason. This is non-blocking, but it prevents accidental loss of cheap UX wins and keeps the v1 panel focused. Disposition: accept — added the "Enhancement radar" table dispositioning all the named affordances: open/copy portal URL, forget network, rescan, hardware-MAC retry, pin speedtest server, copy redacted doctor report = v1; last-good network/result, doctor watch mode, actionable notifications, keyboard picker = vNext; QR-share = rejected (low value for a 2-machine personal setup). ** DONE Tighten the panel UX flow before Phase 2 The panel has sections and state machines, but not a concrete interaction flow: default focused section, row content, primary/secondary buttons, disabled-state rules, confirmation wording for reset/bounce/DNS override, how "Get me online" reports each escalation, what stays visible after the panel closes, and keyboard navigation. Add a short UX flow spec or wire-level outline before Phase 2. This is non-blocking for Phase 1, but it blocks Phase 2 implementation because a GTK panel can easily become noisy or surprising if these defaults are invented while coding. Disposition: accept — added the "Panel UX flow (settle before Phase 2)" subsection: default focus (Connections, or Diagnose when opened from a captive state), row content, one primary button per section, disabled-state rules, exact confirmation wording for reset/bounce/DNS-override/remove, the live "Get me online" escalation reporting, what survives panel close, and keyboard nav. * Review and iteration history ** 2026-06-29 Mon @ 17:00:39 -0400 — Codex — reviewer - *What changed or was recommended:* Rubric: =Not ready=. Applied the spec-review workflow and added blocking findings for diagnostics structure, user-facing errors, observability, nmcli contracts, cache concurrency, hot-path performance, Waybar click actions, mutation rollback, credential security, unsupported WiFi types, panel states, speed-test dependency verification, install boundaries, and test coverage. - *Why:* The spec has the right broad shape, but this feature combines a status indicator, NetworkManager mutation, sudo-backed remediation, captive-portal probing, secret storage, and GTK UI. Too many failure modes were still left for implementers to invent during coding. - *Artifacts:* Findings recorded in =Review findings=. Local code read included =~/.dotfiles/hyprland/.local/bin/waybar-netspeed=, =~/.dotfiles/common/.local/bin/captive=, =~/.dotfiles/tests/waybar-netspeed/test_waybar_netspeed.py=, =~/.dotfiles/tests/captive/test_captive.py=, pocketbook's GTK layer-shell scaffold, current Waybar config, and the linked =todo.org= tasks. External checks included the Arch nmcli man page and upstream LibreSpeed CLI page. ** 2026-06-29 Mon @ 17:35:00 -0400 — Claude Code (archsetup) — responder - *What changed:* Processed all 18 findings — 13 accepted, 5 modified, 0 rejected. Accepts added new sections: Repository + dependencies, Performance budgets, nmcli contract, JSON schemas, Diagnostics contract (+ Diagnose vs Repair), Failure states/messages/recovery, Observability, cache concurrency/atomicity/staleness, mutation safety + rollback, the rewritten credential security model, the expanded test plan, JSON schemas, the renamed Implementation phases with per-phase acceptance criteria, and a Status line. Modifies: deferred the =net debug-bundle= command (log + =diagnose --json= cover it), made click actions phase-aware + non-blocking, scoped enterprise WiFi to activate-only (add/edit vNext), trimmed the panel state set to states that occur on the two-machine fleet, and switched the speed-test backend to the already-installed =speedtest-go= (verified live) instead of a new librespeed-cli dependency. - *Why:* Codex's review was strong and mostly right; the modifies are scope calls for a single-user, two-machine tool (avoid gold-plating, use installed deps, don't model impossible states) and one phasing reconciliation (the panel is Phase 2, so Phase-1 clicks can't open it). - *Artifacts:* All 18 findings completed =DONE= in place with dispositions. Live check: =pacman -Qq | grep speed= (speedtest-go-bin installed; =speedtest-go --json= confirmed). Status now "review incorporated (Codex, 2026-06-29)", implementation-ready with the named Phase-2/3 caveats; Phase 1 ready to build. ** 2026-06-29 Mon @ 17:37:58 -0400 — Claude Code (archsetup) — responder (cj comments) - *What changed:* Folded in Craig's 10 cj comments on the spec. Design changes: (1) dropped the separate credential store entirely — secrets stay in NM's own =system-connections= (root =0600=, inline), touched via nmcli; no GPG, no gpg-agent (rewrote the secrets section, decision 5, dropped Phase 4 + the gpg dep). (2) Added =net doctor [--fix]= + Makefile console-recovery targets (=make online= etc.) as a first-class TTY path; reversed the earlier defer-the-doctor call (decision 10). (3) Added a full-stack =bounce= repair and an =rfkill= unblock repair + indicator state — the rfkill one recovers the framework-laptop post-power-loss soft-block Craig hit. (4) =custom/net= absorbs the airplane module; the standalone airplane scripts/tests/module are deleted on ship. (5) Moved VPN/WireGuard from "out" to a planned Phase 5. (6) Added a "Help + documentation" section (CLI help / panel help / user guide). Answered the enterprise-defer rationale and the captive-auto-login explanation inline. - *Why:* Craig's comments simplify (NM-only secrets, fewer deps) and harden the recovery story (doctor + make targets + rfkill/bounce reachable from a dead-GUI console — his stated need). Enterprise add/edit in v1 is the one open call, raised as a VERIFY. - *Artifacts:* All 10 cj blocks removed. Live checks: airplane scripts/tests present (confirmed deletion targets); =/etc/NetworkManager/system-connections/= is root =0600= with inline secrets (confirms the NM-only secret model); =rfkill= present. VERIFY filed in =todo.org= for the enterprise-scope decision. ** 2026-06-29 Mon @ 17:49:24 -0400 — Codex — reviewer - *What changed or was recommended:* Added a second feedback batch from the common network failure-mode review: require a diagnose/doctor coverage table, pin DNS repair semantics, make auth failures terminal user-action states, define upstream/AP/provider terminal states, decide how active VPN/policy routing affects v1 diagnosis, and remove stale GPG-store references. - *Why:* The design now has a solid recovery shape, but the field question is sharper: for each common failure, does =diagnose= catch it, does =doctor --fix= fix it, or does the user need to switch networks/provide credentials/contact the venue? That mapping needs to live in the spec, not in chat. - *Artifacts:* Findings appended under =Review findings= after reviewing the current spec's Diagnostics, Failure states, Doctor, Testing plan, and responder history. ** 2026-06-29 Mon @ 18:06:22 -0400 — Codex — reviewer - *What changed or was recommended:* Added a readiness pass focused on development-time questions, missed enhancements, UX cleanliness, error wording, and test/coverage mechanics. New findings require reconciling stale status/goal text, resolving or explicitly caveating enterprise add/edit, defining the concrete harness and coverage gate, making coverage actionable, turning failure classes into exact user-facing strings, adding an enhancement disposition table, and tightening the Phase-2 panel UX flow. - *Why:* The network failure model is now strong, but implementation would still force developers to invent test infrastructure, coverage expectations, final error copy, and several UX details. The spec should decide those up front or explicitly mark them as caveats. - *Artifacts:* Findings appended under =Review findings= after reading the updated spec, current archsetup Makefile test targets, pocketbook's pyproject, and the linked =todo.org= enterprise VERIFY. ** 2026-06-29 Mon @ 17:55:08 -0400 — Claude Code (archsetup) — responder (2nd review) - *What changed:* Processed all 6 second-batch findings — 6 accepted, 0 modified, 0 rejected. Added a "Doctor: escalation, classification, terminal states" section with four outcomes (=fixable= escalating lightest-first; =needs-user- action=, =upstream-not-local=, =deferred/vpn= terminal) and explicit per-class DNS handling. Added a "Failure-mode coverage" table (22 rows: detect / doctor-fix / terminal action, org-table-standard conformant, ≤120). Cleaned the last GPG-store residue (Testing plan → NM-secret/no-leak test; dropped the no-GPG-key panel state). Added a doctor-classification test class. - *Why:* The findings sharpen the doctor from "escalate repairs" to "classify, then stop at the right terminal state" — the safety property that keeps =doctor --fix= from looping destructive repairs against a wrong password or an upstream outage. The VPN finding got the detect-and-classify option (not repair; that's Phase 5) so v1 doesn't misread a VPN-routed failure as WiFi down. The coverage table is the single artifact Craig asked for to trust the field coverage. - *Artifacts:* All 6 findings =DONE= in place; cookie =[24/24]=. Table conformed via =wrap-org-table.el= (max line 120). No new findings introduced; no scope expansion past what the findings asked. Spec remains implementation-ready, the enterprise add/edit VERIFY still the one open decision. ** 2026-06-29 Mon @ 18:24:51 -0400 — Claude Code (archsetup) — responder (3rd review) - *What changed:* Processed the third review (7 findings). The enterprise-scope finding closed first on Craig's call (vNext, activate-only). The other 6, all accepted: (1) reconciled the top-of-file text — fixed the Goal's GPG-store wording, the =[#B]= task-mapping line, the resolved enterprise VERIFY, and corrected the stale =pytest= mentions to =unittest= (the repo's real harness). (2) Added a "Harness + coverage gate" section (unittest, fake-binary stubs on a temp PATH, venv coverage, ≥90% branch on pure modules). (3) Added a per-phase "coverage as a gap-finder" pass. (4) Rewrote the Failure states section to exact final strings + evidence fields + a per-surface rendering rule, and added the missing doctor-unfixable rows. (5) Added the "Enhancement radar" table (v1/vNext/rejected). (6) Added the "Panel UX flow" subsection. - *Why:* The findings close the gap between "design decided" and "a developer can start": the harness/coverage contract, the exact UX strings, and the panel flow are the things otherwise invented mid-code. The =pytest=→=unittest= correction was a real defect — the spec contradicted the repo's actual test convention. - *Artifacts:* All 31 findings =DONE=; cookie =[31/31]=. Both new tables conformed via =wrap-org-table.el= (coverage 120, radar 110). Harness verified against the live repo (33 unittest suites, =make test=, coverage.py absent → venv). Status raised to "Ready for Phase 1; Ready-with-caveats overall" — no open decisions remain.