From ce9af72ac4a5a6a30443ccdc4d8d785f0fe28c1b Mon Sep 17 00:00:00 2001 From: Craig Jennings Date: Mon, 29 Jun 2026 18:30:40 -0400 Subject: docs: finalize waybar network module spec (reviews incorporated) Incorporated the review feedback and my inline comments into the network-module spec. It's now implementation-ready, every finding resolved. The reviews reshaped the design in a few ways. Secrets stay in NetworkManager's own store instead of a separate GPG file, dropping that dependency. A net doctor mode plus Makefile targets make recovery work from a bare TTY when the GUI is down. The doctor classifies failures and stops at the right terminal state (needs-user-action, upstream-not-local, deferred-vpn) instead of looping destructive repairs. The module absorbs the airplane indicator, and enterprise WiFi add/edit is vNext (activate-only in v1, since the saved history has no enterprise networks). Added a failure-mode coverage table, exact user-facing strings, the test harness and coverage gate, and the panel UX flow. Also corrected the spec's test framework from pytest to unittest, which is what the repo uses. --- .../2026-06-29-waybar-network-module-spec.org | 1557 ++++++++++++++++++-- 1 file changed, 1422 insertions(+), 135 deletions(-) (limited to 'docs') diff --git a/docs/design/2026-06-29-waybar-network-module-spec.org b/docs/design/2026-06-29-waybar-network-module-spec.org index bf195b8..37b87b0 100644 --- a/docs/design/2026-06-29-waybar-network-module-spec.org +++ b/docs/design/2026-06-29-waybar-network-module-spec.org @@ -2,12 +2,26 @@ #+AUTHOR: Craig Jennings & Claude #+DATE: 2026-06-29 +* Status + +Ready for Phase 1; Ready-with-caveats overall. Three Codex review rounds + Craig's +cj comments are all incorporated — every finding has a disposition and the findings +cookie reads complete ([31/31]), with no open decisions (enterprise scope settled: +open + WPA-PSK in v1, 802.1X add/edit vNext, activate-only). The cj comments +reshaped several decisions (no separate credential store — use NM's own; =net +doctor= + Makefile console-recovery in v1; rfkill + full-stack-bounce repair; +airplane module absorbed; VPN a later Phase 5). The only remaining caveats are +Phase-2/3 build unknowns named under Open items (gtk4-layer-shell anchoring, the +=captive= =--json= refactor) — not Phase-1 blockers. Phase 1 (indicator + console +recovery) is ready to build. + * Goal One waybar network component that does the whole job: shows connection state (including the missing "associated but no internet / captive portal" state), -manages connections from a dropdown (nmcli-backed, optional GPG-encrypted secret -store), and runs the network diagnostics and remediation off the same place +manages connections from a dropdown (nmcli-backed; secrets stay in +NetworkManager's own store, no separate credential file), and runs the network +diagnostics and remediation off the same place (captive-portal detection + forcing, bounce/reset, gateway/DNS checks, speed test). @@ -15,8 +29,9 @@ It unifies three todo tasks that are really one feature: - =[#C]= "archsetup Waybar Wi-Fi module should show no-internet state" — the indicator state plus the 2026-06-22 roam expansion (bounce, diagnostics, speed test off the component). -- =[#B]= "Network-manager dropdown, nmcli-backed with GPG-stored secrets" — the - management dropdown. +- =[#B]= "Network-manager dropdown, nmcli-backed" — the management dropdown. (The + todo task's original "GPG-stored secrets" framing is superseded: secrets stay in + NM's own store, decision 5.) - The network diagnostics already shipped in =captive= (the hotel/captive-portal tool, formerly =login-page=) become this module's diagnostics engine rather than a standalone CLI. @@ -26,31 +41,61 @@ It unifies three todo tasks that are really one feature: ** In - *Indicator* — wifi/ethernet icon + signal + SSID, plus an internet sub-state: online / captive / no-internet / connecting / disconnected / airplane. +- *Absorbs the airplane module* — the airplane state + toggle move into + =custom/net= (airplane is a network concern). Once this ships, the standalone + =custom/airplane= module, the =waybar-airplane= + =airplane-mode= scripts, their + =tests/=, and the css are deleted (listed under Files touched). The + desktop-settings panel (sibling =[#B]=) no longer needs an airplane row. - *Interface-correct* — targets the wifi (or chosen) device, not the default-route interface, so an active USB tether or wired link can't mask wifi state. (Same lesson =captive= fixed; the current =custom/netspeed= keys off the default route and has the bug.) - *Connection management (panel)* — list saved connections most-recently-used - first, live signal for in-range wifi, click to switch; add / edit / remove; - ethernet↔wifi and wifi↔wifi switching even when a link appears mid-session. -- *Diagnostics (panel)* — captive probe (204-vs-portal) with the extracted - portal URL and an Open button; bounce/reset (fresh MAC); gateway ping; DNS - config + temporary 1.1.1.1 override test. + first, live signal for in-range wifi, click to switch; add / edit / remove for + open + WPA-PSK; activate any existing saved profile (including enterprise ones + NM already stores); ethernet↔wifi and wifi↔wifi switching even when a link + appears mid-session. +- *Diagnostics (panel)* — read-only Diagnose (captive probe 204-vs-portal with + the extracted portal URL, gateway ping, DNS config) separated from mutating + Repair. Repair has tiers, lightest first: rfkill-unblock, per-connection reset + (fresh MAC), full-stack bounce (=nmcli networking off/on=, then restart + NetworkManager if that fails), and the temporary 1.1.1.1 override test. Each + Repair action confirms and verifies cleanup. - *Speed test (panel)* — down/up/ping with a progress indicator and last-result - shown. -- *Credential store* — optional GPG-encrypted connection+secret file under - =~/.config=, opt-in (default unencrypted), passphrase cadence via gpg-agent - TTL. Supplements NetworkManager, does not replace it. + shown, via the already-installed =speedtest-go --json=. +- *Connection secrets* — none of our own. Settings and passwords live where NM + already keeps them: =/etc/NetworkManager/system-connections/*.nmconnection= + (root-only =0600=, the PSK/EAP secret stored inline). We read/write them through + nmcli, which handles the privilege. No separate file, no GPG, no gpg-agent — one + fewer dependency, and NM's store is already the secure-at-rest source of truth. - *Persistence* — connectivity probe result cached in the runtime dir so the bar reads it cheaply between probes. +- *Observability* — a redacted JSONL event log so a post-failure session can + diagnose without re-running destructive actions. ** Out (v1, note for later) - No replacement of NetworkManager's connection engine. NM stays the thing that connects; we drive it via nmcli. -- No VPN / wireguard management (separate tooling already exists). -- No per-connection captive-portal auto-login automation beyond opening the - portal page. +- No add/edit *form* for WPA-Enterprise / 802.1X in v1. The reason is effort vs + payoff: 802.1X has many interdependent fields (CA cert, client cert, identity, + anonymous identity, phase-2 auth) where a wrong entry silently fails auth, so a + trustworthy form is a lot of UI for connections Craig rarely adds (open + + WPA-PSK covers home, hotels, and phone hotspots). v1 still *activates* existing + saved enterprise profiles and points editing at =nmtui=/=nmcli=. Settled + (Craig, 2026-06-29): enterprise add/edit is vNext — 24 saved profiles on velox, + 0 enterprise, so the form would be unused UI; if one ever appears nmtui adds it + once and the module activates it thereafter. +- No per-connection captive-portal *auto-login* in v1. (That would mean storing a + portal's login form answers — room number, surname, a checkbox — and replaying + them automatically when a known portal is detected, so the page never appears. + Out for v1 because every portal's form differs and it means storing per-venue + answers; v1 just opens the portal for you.) - No graphing/history of speed-test results beyond the last run. +- No static-IP / proxy / metered / MAC-randomization editing in v1 (activate + existing, edit elsewhere). +- No VPN / WireGuard management in v1, but it's a planned later phase (Phase 5), + not a permanent exclusion — it folds the existing archsetup wireguard tooling + into the same panel/CLI. - The desktop-settings dropdown (sibling =[#B]=) is a separate module, but it shares the GTK4 layer-shell panel shell built here. @@ -58,11 +103,12 @@ It unifies three todo tasks that are really one feature: Three layers. Keep the bar cheap, the panel rich, the logic in one tested place. -1. *Engine* — a =net= Python package (src-layout, pytest), exposing a CLI. Wraps +1. *Engine* — a =net= Python package (src-layout, unittest), exposing a CLI. Wraps every nmcli op and owns the diagnostics. Emits JSON. This is the testable - core (fake =nmcli= / =curl= on PATH, like the existing =waybar-netspeed= and - =waybar-sysmon= test harnesses). Precedent: pocketbook is Python in the - dotfiles repo; =wtimer= is Python for the same testability reason. + core (fake =nmcli= / =curl= / =speedtest-go= on PATH, like the existing + =waybar-netspeed= and =waybar-sysmon= test harnesses). Precedent: pocketbook is + Python in the dotfiles repo; =wtimer= is Python for the same testability + reason. 2. *Indicator* — a thin =waybar-net= script that calls =net status --json= and renders icon + signal + state + tooltip. Replaces =custom/netspeed= (throughput folds into the tooltip). @@ -75,15 +121,48 @@ How the existing pieces map in: interactive portal-force flow (sudo reset, DNS override, browser launch). Its cheap portal-detection logic is mirrored natively in the engine for the fast status path so the bar never blocks on a subprocess. =captive= stays a usable - standalone CLI. + standalone CLI. The refactor (below) extracts its probe + reset into functions + the engine can call non-interactively. - =waybar-netspeed= (sh, shipped) — retired; its throughput sampling moves into the engine's status output and renders in the indicator tooltip only. - =nmcli= — the connection backend for every op. Language note: the engine is Python; the indicator is a thin Python or sh -wrapper over =net status --json=. The bar path must stay fast (sub-100ms for the -cheap poll), so the indicator does no network I/O itself — it reads link state -and the cached connectivity result. +wrapper over =net status --json=. The bar path must stay fast (see Performance +budgets), so the indicator does no network I/O itself — it reads link state and +the cached connectivity result. + +* Repository + dependencies + +- *Code lives in the dotfiles repo* (=~/.dotfiles=), not archsetup. The =net= + package sits in-tree like pocketbook (src-layout, unittest, Makefile target); + =waybar-net= and the =net= CLI entry live in the hyprland tier + (=hyprland/.local/bin/=). Tests under =tests/net/= and =tests/waybar-net/=. + archsetup owns only the *dependency install*, not the code. +- *archsetup installs the deps* in its Hyprland step: =gtk4-layer-shell=, + =python-gobject=, plus =nmcli=/=curl=/=resolvectl=/=rfkill= (already present via + NetworkManager/curl/systemd/util-linux). Speed test uses =speedtest-go= (AUR + =speedtest-go-bin=, already installed on velox); archsetup adds it to the AUR + list. librespeed-cli is the documented fallback if a self-hosted LibreSpeed + server is ever wanted. No =gpg= dependency (secrets live in NM's own store). +- *Daily-drivers*: a stowed-script + AUR-dep feature, so ratio needs the same + =git pull= + stow + the archsetup-added deps. Note the manual dep step in the + rollout. + +** Makefile targets (console recovery is a first-class path) +=net doctor= and the diagnostics are reachable from a bare TTY when waybar and +the GUI are down — that's the case where you most need them. The dotfiles +Makefile carries targets that wrap the =net= CLI so "get back online" is one make +command from the console: +- =make online= — =net doctor --fix= (diagnose, then apply the lightest repair: + rfkill-unblock → reset → bounce → open portal). The headline recovery target. +- =make net-doctor= — =net doctor= (read-only diagnose + recommendation). +- =make net-status= / =make net-diagnose= / =make net-portal= / =make net-reset= + / =make net-bounce= — the individual ops. +- =make test= — already runs =tests/*=; the =net= package's unittest suites are + collected the same way. +These intentionally need only nmcli/curl/rfkill (no GUI, no waybar, no Python +GTK), so they work from a TTY on a broken graphical session. * Connectivity model — split cadence @@ -91,208 +170,1416 @@ The indicator polls every ~2s, but a real internet/captive probe every 2s wastes battery and can re-trigger a captive portal. So split it: - *Fast path (every poll, cheap, no network)* — interface, type, SSID, signal, - IPv4 presence, throughput sample. From nmcli / sysfs only. + IPv4 presence, throughput sample. From nmcli / sysfs only. No network I/O. - *Slow path (cached, TTL ~45s)* — the actual internet/captive probe (the 204 check + meta-refresh portal extraction). Result cached at =$XDG_RUNTIME_DIR/waybar/net-connectivity.json= with a timestamp. The indicator reads the cache each poll. When the cache is older than the TTL, -=net status= kicks =net probe= in the background (non-blocking) and renders the -last cached sub-state meanwhile. A user-triggered diagnose/reconnect refreshes -the cache immediately. This keeps the bar responsive and the portal un-poked. +=net status= kicks =net probe= in the background (spawn + detach, never awaited) +and renders the last cached sub-state meanwhile. A user-triggered +diagnose/reconnect refreshes the cache immediately. This keeps the bar +responsive and the portal un-poked. + +** Concurrency, atomicity, staleness +- *Single-flight* — =net probe= takes a lock file at + =$XDG_RUNTIME_DIR/waybar/net-probe.lock= (flock, non-blocking). A second probe + while one runs is a no-op, so a flapping 2s poll can't pile up overlapping + probes. +- *Atomic writes* — the cache is written to a temp file + =os.replace= (atomic + rename), so a reader never sees a half-written cache. Same pattern as =wtimer=. +- *Max probe runtime* — the probe has a hard timeout (≤ 6s total: curl + =--max-time 5= + slack). On timeout it writes an =unknown= result, never hangs. +- *Stale classes* the indicator distinguishes: fresh (< TTL), stale (TTL..3×TTL, + shown with a subdued/aging hint), expired (> 3×TTL → treat as unknown), + unknown (no cache / probe failed). The bar never shows a confident "online" + past the expired threshold. +- *Invalidation* — the cache records the iface + SSID + active-connection UUID it + was taken under; a change in any of them invalidates it immediately (a + reconnect must not show the old network's verdict). +- *Crash cleanup* — a stale lock older than the max runtime is ignored/reclaimed. + +* Performance budgets (hot path) + +The bar exec path (=waybar-net= → =net status=) must stay responsive: +- *Budget*: =net status= returns in < 100ms typical, < 250ms worst case. +- *No sleeping in the bar path.* Throughput is sampled from two reads of + =/sys/class/net//statistics/{rx,tx}_bytes= across the *waybar poll + interval itself* (delta since the last cached sample + timestamp), not via an + in-process =sleep= like the old =waybar-netspeed=. The cache holds the prior + counters. +- *Subprocess cap*: at most one =nmcli= invocation on the hot path (a single + =nmcli -t -f ...= multi-field query), plus sysfs reads. Never a per-field + nmcli call. +- *Every subprocess has a timeout* (=nmcli --wait 2=, =subprocess timeout=). On + timeout or error the indicator emits a degraded JSON state (class + =net-degraded=, a neutral glyph) rather than blocking or crashing waybar. +- *Benchmark test*: a fake slow =nmcli= asserts =net status= still returns within + budget by falling back to the degraded state. * Engine — =net= CLI surface All subcommands take =--json= where a machine reads them. Pure formatting/state -functions under the CLI; IO (nmcli, curl, file) at the edges. +functions under the CLI; IO (nmcli, curl, file) at the edges. Every subcommand +exits non-zero with a JSON error envelope (see JSON schemas) on failure. - =net status [--json] [--iface IF]= — fast link state + cached connectivity - sub-state + throughput. The indicator's source. + sub-state + throughput. The indicator's source. Never does network I/O. - =net probe [--iface IF]= — run the connectivity/captive probe now, update the - cache, print online | captive (+ portal URL) | no-internet. Mirrors =captive='s - cheap detection natively. + cache (single-flight, atomic), print online | captive (+ portal URL) | + no-internet | unknown. Mirrors =captive='s cheap detection natively. - =net list [--json]= — saved connections, MRU order, active flag, plus in-range wifi with signal. -- =net up = / =net down [--iface IF]= — switch / disconnect. -- =net add= / =net edit = / =net remove = — manage connections; sync the - GPG store (below). +- =net up = / =net down [--iface IF]= — switch / disconnect. Operates on + UUID, not name (see nmcli contract). +- =net add= / =net edit = / =net remove = — manage connections + (open + WPA-PSK) through nmcli; the secret lands in NM's own + =.nmconnection=. Enterprise profiles are activate-only. - =net rescan [--iface IF]= — wifi rescan. -- =net diagnose [--json]= — full report: gateway ping, DNS config, DNS 1.1.1.1 - override test, captive probe. Shells to =captive= for the interactive/sudo - parts; native for the read-only parts. +- =net diagnose [--json]= — read-only report: gateway ping, DNS config, captive + probe. The structured contract below. Doubles as the post-failure snapshot. +- =net repair [--json]= — mutating remediation, lightest first: + =rfkill= (unblock + radio on), =reset= (fresh MAC), =bounce= (full-stack: + =nmcli networking off/on=, escalating to =systemctl restart NetworkManager=), + =dns-test= (temporary 1.1.1.1 override, auto-reverted). Each confirms via the + caller and verifies cleanup. +- =net doctor [--json] [--fix]= — one-shot "get me online" mode for the console: + runs the full diagnose, then applies the lightest repair that fits (unblock + rfkill, reset, bounce, open portal) — read-only without =--fix=, acting with + it. The TTY recovery path when waybar/the GUI is down (see the Makefile + targets). - =net portal= — run =captive='s portal-force flow (reset if needed, extract + open the portal page). -- =net reset [--hardware-mac]= — fresh-MAC reconnect (=captive='s =fresh_mac=). -- =net speedtest [--json]= — librespeed run; down/up/ping. +- =net speedtest [--json]= — =speedtest-go --json= run; down/up/ping. + +* nmcli contract + +The command wrapper is the reliability boundary; SSIDs and connection names +contain spaces, colons, duplicates, hidden names, and non-ASCII. Rules: + +- *Terse, field-selected output*: =nmcli -t -f --escape yes ...= and + =nmcli -g ...= (get-values) for single-value reads. Parse with the + documented escaping (=\:= and =\\=); never naive =cut -d:=. +- *UUID is the handle.* Every saved-profile op (=up=, =down=, =modify=, =delete=) + uses the connection UUID, never the display name — names duplicate and contain + separators. =net list= surfaces UUIDs; the panel maps row → UUID. +- *Wait budgets*: activation/deactivation use =nmcli --wait = with an explicit + budget (hot-path reads =--wait 2=; activation =--wait 30=). No unbounded waits. +- *Connectivity*: NM's own =nmcli networking connectivity= can return + =none/portal/limited/full/unknown=. Use it as a *cheap hint* on the fast path + when present, but the authoritative captive verdict is still our own probe + (NM's portal detection is coarser and config-dependent). +- *Parser tests* (fake nmcli fixtures): escaped colons and backslashes in SSIDs, + embedded newlines, duplicate connection names, hidden SSID (empty name), + non-ASCII SSID, the wired-appears-mid-session case, and the multi-active case + (wifi + tether both up). + +* JSON schemas + +Versioned (="v": 1=) envelopes so tests lock the contract. Sketches (fields +nullable unless noted): + +- =status=: ={v, iface, type: wifi|ethernet|none, ssid, signal, ipv4, + gateway, throughput: {rx_bps, tx_bps}, connectivity: online|captive|no-internet|unknown, + connectivity_age_s, connectivity_class: fresh|stale|expired|unknown, state: + online|captive|no-internet|connecting|disconnected|airplane|wired|degraded}=. +- =probe=: ={v, result: online|captive|no-internet|unknown, portal_url, http_code, + redirect_host, elapsed_ms, ts}=. +- =list=: ={v, connections: [{uuid, name, type, active, last_used, signal, + in_range, security}]}=. +- =diagnose=: ={v, steps: [], overall: + ok|warn|fail}=. +- =speedtest=: ={v, down_mbps, up_mbps, ping_ms, server, elapsed_ms, ts}=. +- error envelope (any command): ={v, error: {code, message, detail, partial: + bool}}= with a non-zero exit. + +* Diagnostics contract + +=net diagnose --json= returns an ordered list of steps. Each step is the unit the +panel renders and the log records: + +- =id= — stable identifier (e.g. =link=, =dhcp=, =gateway-ping=, =dns-config=, + =dns-resolve=, =http-probe=, =portal=). +- =status= — =pending | running | pass | warn | fail | skipped=. +- =title= — short human label. +- =evidence= — redacted detail (the value seen), per the redaction rules. +- =elapsed_ms=. +- =safety= — =read-only= or =mutating= (diagnose steps are all read-only). +- =next_action= — what the user/agent should do on warn/fail (e.g. "open portal", + "reset connection", "switch network"). + +Repair actions (=net repair=) carry the same shape but =safety: mutating=, plus a +=cleanup_verified: bool= field (e.g. the DNS override was reverted) and a +terminal =cleanup-unverified= status when revert can't be confirmed. + +** Diagnose vs Repair (read-only vs mutating) +The panel separates them visually and behaviorally: +- *Diagnose* — probe, gateway ping, DNS config read, captive check. No state + change, no sudo, runnable freely. +- *Repair* — reset (fresh MAC, deletes+recreates the NM profile), DNS override + test (mutates resolver, auto-reverts), portal force. Each needs an explicit + confirm, shows that it's privacy/state-changing, and verifies cleanup. A + Repair whose cleanup can't be verified ends in a visible =cleanup-unverified= + state, never a silent success. + +* Failure states, messages, recovery + +Each row below gives the *exact, final* user-facing string (not a template) with +== for redacted evidence, plus the evidence field included and the +next action. The string is canonical: every surface renders the same text, so +there's one source of truth. + +Per-surface rendering of the canonical string: +- *Indicator* — the matching glyph + CSS class; the string is the tooltip + (untruncated). +- *Notification* (=notify=) — title = the failure label, body = the string. +- *CLI* — the string on stderr; =--json= puts it in =error.message= with the + evidence in =error.detail= and a stable =error.code=. +- *Panel* — the string as the section banner, with the diagnostic step's evidence + shown beneath. +Evidence is always redacted per the redaction rules (SSID/host shown; PSK/EAP/ +portal tokens never). + +- *associated, no DHCP* — "Connected to , no IP (DHCP failed)" → + evidence: SSID, iface → reset / reconnect. +- *no-internet* — "On , no internet (gateway reachable, no route out)" → + diagnose / switch network. +- *captive* — "Captive portal at — login required" → Open portal. +- *DNS hijack* — "DNS is being redirected (portal)" → Open portal. +- *DNS broken* — "DNS not resolving (hotel DNS down); 1.1.1.1 works" → use + override / report. +- *HTTP intercepted* — "Traffic is being intercepted before it leaves" → Open + portal. +- *sudo declined* — "Reset needs admin; it was declined — nothing changed" → + retry with auth. +- *command timed out* — " timed out; the system was left unchanged" → retry. +- *partial mutation* — " partially applied: ; rolled back to " + → review. +- *missing speedtest-go* — "speedtest-go not installed" → install hint. +- *no wifi hardware* (desktop) — wifi rows hidden; ethernet-only view. +- *wifi rfkill-blocked* — "WiFi is blocked (rfkill)" → unblock. The indicator + detects a soft-blocked radio (=rfkill list= shows the radio off though hardware + is present) and shows this distinct from disconnected. =net repair rfkill= (and + =net doctor --fix= as its first step) runs =rfkill unblock wifi= + =nmcli radio + wifi on= and reconnects. This is the framework-laptop case: an out-of-power + shutdown sometimes leaves wifi soft-blocked at next boot, and yes — the module + recovers it (the rfkill state is the indicator; the rfkill repair / doctor is + the one-step fix). A *hard* block (physical switch) is reported as + not-recoverable-in-software with that message. +- *wifi rfkill hard-blocked* — "WiFi is blocked by the hardware switch" → + evidence: rfkill hard state → flip the physical switch. +- *wrong password / missing secret* — "Saved password for was rejected" → + evidence: SSID, NM auth-failure reason → re-enter the password. +- *enterprise auth/cert failure* — "Enterprise login failed for (802.1X)" + → evidence: SSID, EAP failure reason → edit the profile in nmtui/nmcli. +- *upstream / AP / provider* — "On , link is fine but the network has no + uplink" → evidence: gateway reachable, no route out, not a portal → switch + network or contact the venue. +- *VPN-routed* — "Connected; internet is routed through a VPN ()" → + evidence: default route on a tun/wg device or non-NM DNS owner → check the VPN, + not WiFi. +- *HTTP interception, no parseable portal URL* — "A portal is intercepting + traffic but didn't give a login link" → evidence: HTTP code, redirect host → + opens neverssl + the gateway page to log in manually. +- *DNS override cleanup unverified* — "Couldn't confirm DNS was restored after the + test" → evidence: iface, attempted revert → revert DNS manually + (=resolvectl revert =). + +Each message names whether the system was left unchanged, partially changed (with +what), or fully changed, so the user knows the residue. + +* Doctor: escalation, classification, terminal states + +=net doctor= diagnoses, classifies the failure, then (with =--fix=) applies the +*lightest* repair that fits and re-checks — it never loops destructive repairs +against a failure they can't fix. Each failure resolves to one of four outcomes, +and the doctor stops at any terminal one: + +- =fixable= — a local repair should help. Escalate lightest-first: rfkill-unblock + → reset (fresh MAC) → bounce (full stack) → portal, re-probing after each, and + stop as soon as the probe returns online. +- =needs-user-action= (terminal) — no reset/bounce will help; doctor stops and + names the exact next step. Covers: wrong WPA password / missing NM secret + (enter the password), locked keyring or polkit denial (retry with auth), + enterprise 802.1X cert/identity failure (edit the profile in =nmtui=/=nmcli=), + captive portal login-required (open the portal + accept terms). Doctor must not + delete/recreate the profile against these — that loses the saved password and + makes things worse. +- =upstream-not-local= (terminal) — the local link is up but the problem is past + it: AP has no uplink, gateway down/dropping traffic, DHCP server broken, ISP + outage, portal backend failing. =diagnose= proves it (link up + IP + gateway + reachable, but no route out and not a captive redirect), and =doctor --fix= + stops after local repairs are exhausted with "local repairs tried; likely + upstream/AP/provider" + the evidence. Next action: switch networks or contact + the venue. +- =deferred/vpn= (terminal for v1) — an active VPN / policy route / non-NM + resolver owns the default route or DNS, so "no internet" may be the VPN's fault, + not WiFi's. v1 *detects* this (default route on a =tun/wg= device, or DNS owned + by something other than the NM link) and classifies it separately — "link is + fine; internet is VPN-routed" — rather than misclassifying it as a WiFi failure. + v1 does not repair it (VPN management is Phase 5); it names the VPN as the likely + owner and stops. + +** DNS handling in doctor (explicit per class) +- *Captive DNS hijack* — open the portal (the hijack clears on login). No DNS + mutation. +- *Broken resolver, 1.1.1.1 works* — doctor offers an explicit *temporary* 1.1.1.1 + override as a repair with cleanup verification (auto-revert, =cleanup_verified=); + without =--fix= it only recommends the command. It does not leave a permanent + resolver change. +- *Port-53 / egress blocked* (even 1.1.1.1 fails) — terminal =upstream-not-local=; + doctor stops, since it's not locally fixable. + +* Failure-mode coverage + +For each common field failure: does =net diagnose= detect it, can =net doctor +--fix= repair it, and what terminal user action remains when it can't. (The +=needs-user-action= / =upstream-not-local= / =deferred/vpn= outcomes are defined +above.) + +| Failure mode | diagnose detects | doctor --fix | terminal user action | +|----------------------------+-----------------------------+-----------------------------+-----------------------------| +| rfkill soft block | yes | yes (unblock) | none | +|----------------------------+-----------------------------+-----------------------------+-----------------------------| +| rfkill hard block | yes | no | flip the physical switch | +|----------------------------+-----------------------------+-----------------------------+-----------------------------| +| no wifi hardware | yes | n/a | use ethernet | +|----------------------------+-----------------------------+-----------------------------+-----------------------------| +| associated, no DHCP | yes | yes (reset/bounce) | none, else switch network | +|----------------------------+-----------------------------+-----------------------------+-----------------------------| +| gateway unreachable | yes | yes (bounce) | switch network if it | +| | | | persists | +|----------------------------+-----------------------------+-----------------------------+-----------------------------| +| captive DNS hijack | yes | opens portal | log in at the portal | +|----------------------------+-----------------------------+-----------------------------+-----------------------------| +| broken DNS, 1.1.1.1 works | yes | yes (temp override, | report the venue's DNS | +| | | auto-reverted) | | +|----------------------------+-----------------------------+-----------------------------+-----------------------------| +| HTTP captive portal | yes | opens portal | log in at the portal | +|----------------------------+-----------------------------+-----------------------------+-----------------------------| +| HTTP interception, no | yes | opens neverssl + gateway | log in manually | +| parseable URL | | | | +|----------------------------+-----------------------------+-----------------------------+-----------------------------| +| upstream / AP outage | yes (link up, no route out) | no (stops after local) | switch network / contact | +| | | | venue | +|----------------------------+-----------------------------+-----------------------------+-----------------------------| +| wrong WPA password / | yes | no | enter the password | +| missing secret | | | | +|----------------------------+-----------------------------+-----------------------------+-----------------------------| +| enterprise auth / cert | yes | no | edit the profile in | +| failure | | | nmtui/nmcli | +|----------------------------+-----------------------------+-----------------------------+-----------------------------| +| duplicate SSID / | yes (UUID-keyed) | yes (activate by UUID) | none | +| connection-name | | | | +|----------------------------+-----------------------------+-----------------------------+-----------------------------| +| hidden SSID | yes | yes (connect by name) | enter SSID + password | +|----------------------------+-----------------------------+-----------------------------+-----------------------------| +| multiple active links | yes | n/a | pick the interface | +| (wifi+tether) | | | | +|----------------------------+-----------------------------+-----------------------------+-----------------------------| +| wedged NetworkManager | yes | yes (bounce → restart NM) | none, else reboot | +|----------------------------+-----------------------------+-----------------------------+-----------------------------| +| slow / hung command | yes (degraded) | retries within budget | retry | +|----------------------------+-----------------------------+-----------------------------+-----------------------------| +| stale / corrupt cache | yes | self-heals (atomic + | none | +| | | invalidation) | | +|----------------------------+-----------------------------+-----------------------------+-----------------------------| +| DNS cleanup failure | yes | flags cleanup-unverified | revert DNS manually | +|----------------------------+-----------------------------+-----------------------------+-----------------------------| +| missing speedtest backend | yes | n/a | install speedtest-go | +|----------------------------+-----------------------------+-----------------------------+-----------------------------| +| VPN / policy-routing | yes (route/DNS ownership) | no (deferred to Phase 5) | check the VPN | +| interference | | | | +|----------------------------+-----------------------------+-----------------------------+-----------------------------| + +* Observability — logging + redaction + +- *Event log*: JSONL at =$XDG_STATE_HOME/net/events.jsonl= (fallback + =~/.local/state/net/events.jsonl=), size-rotated (e.g. 1 MB × 3). Every + mutating op and probe appends an event: =ts, op, argv (redacted), exit_code, + stderr_tail, elapsed_ms, iface, nm_uuid, probe_url_class, http_code, + redirect_host, cache_event=. +- *Redaction (always on)*: PSKs, EAP identities/passwords, NM secrets, and + portal query tokens are never logged. MAC addresses, full IPs, and SSID are + redacted when configured (=redact_mac=, =redact_ip=, =redact_ssid= in config). +- *Post-failure diagnosis*: =net doctor --json= is the snapshot + recommendation + (diagnose plus the suggested repair), =net diagnose --json= the raw report, and + the event log the history. =net doctor= is the console-recoverable entry point + (reachable as =make online= / =make net-doctor=). +- *Secret-leak tests*: assert no PSK/EAP/portal-token ever appears in any JSON + output, log line, or error message. * Indicator (task #C — Phase 1, the fast win) ** States (internet sub-state on top of link state) - online — associated and the probe returned 204. Normal icon. - captive — associated, probe hit a portal. Distinct glyph + warning CSS class; - tooltip names the portal host; left-click opens the panel's diagnostics with - the portal ready to open. + tooltip names the portal host; left-click opens diagnostics with the portal + ready to open (Phase 2+; see interactions for the Phase-1 interim). - no-internet — associated, probe failed (no portal, no 204). Distinct glyph + warning class. +- degraded — =net status= couldn't read link state within budget (slow/failed + nmcli). Neutral glyph, =net-degraded= class. Never blocks the bar. +- rfkill-blocked — the radio is soft-blocked (=rfkill=), distinct from + disconnected. Distinct glyph; the fix is =net repair rfkill= / =net doctor=. - connecting / disconnected / airplane / wired — as today, plus wired shown - correctly even when it appears after session start. + correctly even when it appears after session start. (airplane is now this + module's state, absorbed from the retired airplane module.) ** Glyphs Nerd-font codepoints, final values verified live before merge (same discipline as wtimer). Reuse the signal-strength ramp already in =waybar-netspeed=; add a -captive/no-internet overlay glyph. +captive / no-internet / degraded overlay glyph. ** Tooltip SSID + signal + IPv4 + gateway + the throughput readout (absorbed from -netspeed) + the last probe result and age. +netspeed) + the last probe result and its age (stale/expired hinted). -** Interactions (no keyboard-modifier clicks — waybar can't qualify clicks by -modifier; the panel hosts the rich actions) -- left-click — open the panel. -- right-click — quick reconnect / bounce (=net reset=, no panel). -- middle-click — run =net portal= (force the captive page). -- scroll — cycle nothing in v1 (reserved; could cycle saved connections later). +** Interactions (phase-aware; no keyboard-modifier clicks — waybar can't qualify +clicks by modifier, so the rich actions live in the panel, not ctrl/super-click) +Clicks never block the bar: each dispatches a detached background job and reports +via =notify=, single-flight per action. +- *Phase 1 (no panel yet)*: left-click runs =net probe= + notify (refreshes the + state on demand) and keeps the existing =pypr toggle network= scratchpad as the + interim manager; right-click runs =net repair reset= in the background + + notify; middle-click runs =net portal=. +- *Phase 2+ (panel exists)*: left-click opens the panel (focused on the relevant + section — diagnostics when captive); right/middle keep the background + reset/portal shortcuts. * Panel (tasks #B + #C diagnostics — Phases 2-3) -GTK4 + gtk4-layer-shell, pocketbook scaffold (src-layout package, pytest, +GTK4 + gtk4-layer-shell, pocketbook scaffold (src-layout package, unittest, Makefile, gtk4-layer-shell anchored dropdown under the bar). One panel shell, reused by the future desktop-settings panel. Sections: 1. *Connections* — list, MRU-first, active marked, live signal bars for in-range wifi; row click switches; buttons for add / edit / remove; a rescan control. -2. *Diagnostics* — buttons: Probe (204/captive, shows portal URL + Open), - Bounce/Reset, Gateway ping, DNS override test. Streaming output area. -3. *Speed test* — Run button, progress, down/up/ping result + last-run line. +2. *Diagnose* (read-only) — Probe (204/captive, shows portal URL + Open), Gateway + ping, DNS config. Streaming step output (the diagnostics contract). +3. *Repair* (mutating, confirmed) — tiered lightest-first: Unblock rfkill, Reset + (fresh MAC), Bounce (full stack), DNS override test, Force portal. A "Get me + online" button runs =net doctor --fix= (the auto-escalating sequence). +4. *Speed test* — Run button, progress, down/up/ping result + last-run line. + +** Panel state, cancellation, permissions +State machines for: connection-list loading, rescan-in-progress, +activation-in-progress, diagnose-running, repair-running, speedtest-running. Plus +the real terminal states on this two-machine fleet: no-wifi-hardware (desktop → +ethernet-only view) and missing speedtest-go. (No GPG-key state — there's no +credential store; secrets live in NM.) ("No NetworkManager" is not a modeled +state — NM is always present +on these machines; if nmcli is somehow absent the panel shows a single hard-error +and exits.) Long operations show elapsed time and are cancellable where the +underlying op allows (rescan, speedtest, probe); clearly non-cancellable ones +(an in-flight activation) show elapsed + a disabled control. Permission-denied +(sudo/polkit declined) is a first-class outcome with the "nothing changed" +message, never a silent failure. Interaction-pattern catalog (=~/code/rulesets/patterns/=) principles that apply: - transient-state-buttons — all the network levers in one place, reachable by one chord (the bar click), state visible. - default-most-common-friction-proportional — connections MRU-ordered so the - common pick is first; destructive ops (remove) get a confirm, switching does - not. + common pick is first; destructive ops (remove) and privacy-changing ones + (reset, override) get a confirm, switching does not. - one-prompt-picker-typed-prefix — if the connection picker ever goes keyboard-driven, kind (wifi/eth/saved/in-range) + name in one typed picker. +** Panel UX flow (settle before Phase 2) +The concrete interaction defaults, so the GTK build isn't inventing them: +- *Default focus*: the Connections section, current connection's row selected. If + the indicator opened the panel because of a captive/no-internet state, focus + Diagnose instead with the relevant action highlighted. +- *Row content*: glyph (signal bars / wired / active check) + name + a secondary + line (security type, "active"/last-used). The active row is visually pinned at + top of its group. +- *Buttons*: one *primary* per section (Connections: Connect to the selected row; + Diagnose: Run diagnose; Repair: "Get me online"; Speed test: Run). Secondary + actions (add / edit / remove / rescan; individual repair tiers) are smaller and + grouped. +- *Disabled rules*: Connect disabled on the already-active row; Repair tiers + disabled while one runs; Speed test disabled while running; add/edit disabled + for enterprise (with the "edit in nmtui/nmcli" hint). +- *Confirmations* (exact wording): Reset → "Reset ? This drops the + connection and reconnects with a new MAC."; Bounce → "Restart networking? All + links drop briefly."; DNS override → "Temporarily set DNS to 1.1.1.1 for the + test? It reverts automatically."; Remove → "Forget ? The saved password is + deleted." +- *"Get me online" reporting*: shows each escalation step live (Unblock rfkill → + Reset → Bounce → Portal) with per-step pass/fail and stops at the first that + restores internet or at a terminal state, naming the next action. +- *After close*: the bar reflects the new state immediately (signal/refresh on + next poll); a running speedtest/diagnose keeps running and notifies on finish + (panel close doesn't cancel it). +- *Keyboard*: Esc closes; Tab moves between sections; arrows move rows; Enter + fires the section primary; the connection list is type-to-filter. + * Connection management (nmcli) -- Every op via nmcli: =device status=, =connection show=, =con up/down=, - =con add/modify/delete=, =dev wifi rescan/list=. +- Every op via nmcli per the nmcli contract above (terse, escaped, UUID-keyed, + bounded =--wait=). - MRU ordering from NM's =connection.timestamp= (last activated), descending. - Ethernet appears in the list whenever a wired device is present, selectable at any time; switching just brings the chosen connection up. -- TDD with a fake =nmcli= on PATH returning canned output, asserting the exact - nmcli command sequence (behavior, not implementation) — the established - pattern in =tests/waybar-netspeed= and =tests/layout-navigate=. - -* Credential storage (GPG) - -- Store: =~/.config/net/connections.= — connection definitions + secrets - (PSK/EAP), one record per connection, with =last_used=. -- *Default unencrypted* (=connections.json=). Encryption is opt-in: when enabled, - =connections.json.gpg= encrypted to Craig's private key (=c@cjennings.net=). -- Passphrase cadence via gpg-agent: once per session (long cache TTL or - decrypt-and-hold), once per hour (=default-cache-ttl=), or never (plaintext). - Configured in =~/.config/net/config=. -- *Supplements NM, does not replace it.* NM's own store - (=/etc/NetworkManager/system-connections=, root-only) stays the source of - truth that actually connects. The GPG store is a portable, user-owned export + - secret vault. =net add/edit/remove= writes both (nmcli + store). =net import= - rebuilds NM connections from the store on a fresh machine. They sync on every - mutating op; NM wins on conflict for the live connection. - -* Diagnostics + speed test - -- Diagnostics reuse =captive= verbatim for the interactive flow (=net portal=, - =net reset=, the 1.1.1.1 DNS override test) and mirror its cheap probe natively - for =net status= / =net probe=. No logic is duplicated by hand beyond the small - portal-URL parser, which is already unit-tested in =tests/captive=. -- Speed test: *librespeed-cli* (no account, self-hostable, AUR), chosen over - Ookla speedtest-cli. =net speedtest --json= parses its JSON; the panel shows a - progress indicator and the down/up/ping result. +- *Mutation safety + rollback*: switching keeps the current connection up until + the new one activates successfully (=nmcli --wait 30=); on failure it does not + tear down the working link, surfaces the failure, and leaves the prior + connection active. =net down= notes that NM may auto-reactivate a profile and + reports the post-op active connection so the user isn't surprised. A switch that + needs a password it doesn't have prompts (or fails with "password required"), + never silently strands. The exact NM command sequence (preflight active-state + read → activate target → verify default route → on failure, confirm prior + still up) is pinned in the engine and tested against fake nmcli. +- *Add/edit scope*: open + WPA-PSK only in v1. Existing saved profiles of any + type (including enterprise) can be *activated*; editing an enterprise profile + shows "edit via nmtui/nmcli" rather than a broken partial form. + +* Connection secrets (no separate store) + +Per Craig's call: don't build a parallel credential store. Settings and secrets +live where NetworkManager already keeps them, so there's one source of truth and +no extra dependency (no GPG, no gpg-agent, no =~/.config/net/connections=). + +- *Where secrets live*: =/etc/NetworkManager/system-connections/.nmconnection=, + root-owned =0600=, with the PSK/EAP secret stored inline (the default + =secret-flags=0= "owned by NM"). That's already secure-at-rest (root-only) and + is what =nmcli= reads/writes. +- *How we touch them*: every add/edit/remove goes through =nmcli= (=connection add + / modify / delete=), which writes the =.nmconnection= with the right ownership + and perms. We never read or write =system-connections= files directly (root) and + never copy a secret out of them. +- *No export / import / sync* — there's nothing to sync. A new machine gets its + connections the way it always has (the user joins, or restores NM profiles), + not from a tool-specific vault. +- *config file*: =~/.config/net/config= still exists, but only for non-secret + preferences (speedtest server, redaction flags, probe TTL). It holds no + credentials. +- *No secret leakage*: PSK/EAP never appear in =net=' =--json= output, the event + log, or error text (tested) — even though NM is the store, our surfaces must not + echo a secret =nmcli= happens to return. + +* Speed test + +- Backend: *=speedtest-go=* (=--json=, =--server=, =--no-download/--no-upload=), + already installed on velox (AUR =speedtest-go-bin=). No new dependency for v1. + librespeed-cli is the documented fallback for a self-hosted LibreSpeed server. +- =net speedtest --json= parses speedtest-go's JSON into the =speedtest= schema. +- *Server policy*: auto-select nearest by default; allow a pinned server id in + =~/.config/net/config=. +- *Timeout + cancellation*: a hard run timeout (e.g. 60s); the panel run is + cancellable (kills the child). Offline / rate-limited / no-server errors map to + the failure-message table. +- *Tests*: fixture JSON (success) and fixture stderr (offline, no server, + malformed output) drive =net speedtest= parsing without touching the network. + +* Help + documentation + +In-app help has three layers, each reachable in the situation it's needed: + +- *CLI help (works from a dead-GUI TTY)*: =net --help= lists the subcommands in + one screen; =net --help= documents each (flags, what it mutates, the + console-recovery targets). The Makefile targets are self-describing (=make help= + lists =online= / =net-doctor= / etc. with one-line descriptions). This is the + layer that matters most when you're at a console with no network. +- *Panel help (in the GUI)*: a small =?= affordance in the panel header opens an + inline help pane — what each section does, which Repair actions mutate state, + what the indicator glyphs/colors mean. Per-control tooltips on the less-obvious + buttons (rfkill, bounce, DNS override). No external help browser. +- *User guide (the durable doc)*: a README / docs page covering every command, + the indicator states + glyphs, the panel sections, the config file keys, the + recovery make targets, troubleshooting (the failure-message table), and + rollback. Written so a future session — or Craig six months out — can operate + and recover the module from the doc alone. + +The failure-message table above is the single source of truth for the +troubleshooting text; the guide and the panel help both render from it rather +than restating it. + +* Enhancement radar + +Low-cost adjacent affordances, each dispositioned so cheap wins aren't lost and +the v1 panel stays focused. (Several are already in v1 by virtue of other +sections; marked here so the consideration is visible.) + +| Enhancement | Disposition | Reason | +|-------------------------------------+-------------+--------------------------------------------------------| +| Open / copy portal URL | v1 | already in the captive flow; trivial Open + Copy | +|-------------------------------------+-------------+--------------------------------------------------------| +| Forget network | v1 | it's the remove op, already specced | +|-------------------------------------+-------------+--------------------------------------------------------| +| Rescan now | v1 | already a Connections control | +|-------------------------------------+-------------+--------------------------------------------------------| +| Retry with hardware MAC | v1 | captive already has --hardware-mac; expose in Repair | +|-------------------------------------+-------------+--------------------------------------------------------| +| Pin speedtest server | v1 | already a config key | +|-------------------------------------+-------------+--------------------------------------------------------| +| Copy redacted doctor report | v1 | cheap, serves the observability/support goal | +|-------------------------------------+-------------+--------------------------------------------------------| +| Show last good network / result | vNext | needs small history persistence | +|-------------------------------------+-------------+--------------------------------------------------------| +| Watch mode for net doctor | vNext | a --watch loop; handy at a TTY, not v1-critical | +|-------------------------------------+-------------+--------------------------------------------------------| +| Actionable desktop notifications | vNext | dunst supports actions; extra wiring | +|-------------------------------------+-------------+--------------------------------------------------------| +| Keyboard connection picker (fuzzel) | vNext | the typed-prefix pattern; panel covers v1 | +|-------------------------------------+-------------+--------------------------------------------------------| +| QR-code share / import WiFi | rejected | low value for a personal 2-machine setup; phones do QR | +|-------------------------------------+-------------+--------------------------------------------------------| * Waybar wiring - Replace =custom/netspeed= with =custom/net= in the bar's module list (same slot). - Module def: =exec: waybar-net=, =return-type: json=, =interval: 2=, a =signal= - for on-demand refresh (next free signal after wtimer's 14), =on-click: =, =on-click-right: net reset=, =on-click-middle: net portal=. -- Remove the old =on-click: pypr toggle network= scratchpad once the panel - replaces it. + for on-demand refresh (next free signal after wtimer's 14), =on-click=, + =on-click-right=, =on-click-middle= per the phase-aware interactions (each + dispatches a detached job, never blocks). +- Remove the old =on-click: pypr toggle network= scratchpad only once the panel + replaces it (Phase 2); Phase 1 keeps it as the interim manager. * Testing plan (TDD) -- *Engine* — fake =nmcli= + fake =curl= on PATH; assert command sequences and - parsed/emitted JSON for status, list, up/down, add/edit/remove, probe, - diagnose, speedtest. Pure state/format functions tested directly. +- *Engine (normal)* — fake =nmcli= + =curl= + =speedtest-go= on PATH; assert + command sequences and parsed/emitted JSON for status, list, up/down, + add/edit/remove, probe, diagnose, repair, speedtest. Pure state/format + functions tested directly. JSON schemas locked by example. - *Portal parser* — already covered in =tests/captive= (Normal/Boundary/Error + the real SONIFI body). The engine's native probe reuses the same cases. +- *nmcli parsing* — escaped colon/backslash/newline in SSID, duplicate names, + hidden SSID, non-ASCII, wired-mid-session, multi-active (wifi+tether). +- *Failure + concurrency (the risky classes)* — slow/hung nmcli/curl/speedtest + (degraded state within budget), concurrent =net status= probe refresh + (single-flight), corrupt cache (recovered), stale cache after SSID change + (invalidated), permission denied / sudo declined, DNS-override cleanup failure + (=cleanup-unverified=), NM partial activation (rollback keeps prior link), + secret redaction, missing speedtest-go, no wifi hardware, rfkill soft/hard + block. +- *Doctor classification* — fixture-driven =net doctor= over fake nmcli/curl + asserting the right terminal classification + that =--fix= stops before + destructive repairs: auth failures (=needs-user-action=), upstream/AP failure + (=upstream-not-local=), VPN-routed failure (=deferred/vpn=), and the DNS classes + (hijack → portal, broken-but-1.1.1.1-works → offered override, egress-blocked → + upstream). Assert the failure-mode coverage table's "detects / repairs / terminal + action" holds for each row. - *Indicator* — drive =net status --json= through =waybar-net=, assert the JSON - the bar renders for each state (online / captive / no-internet / wired / - disconnected), interface override via env like =WAYBAR_NETSPEED_IFACE=. -- *Panel* — pocketbook-style: test the backing logic (list ordering, op - dispatch, store read/write, gpg round-trip with a test key), not the GTK - widgets. -- *GPG store* — round-trip encrypt/decrypt with a throwaway test key; sync - on add/edit/remove; import rebuilds NM ops (asserted against fake nmcli). + per state (online / captive / no-internet / degraded / wired / disconnected / + rfkill), iface override via env. +- *Panel* — pocketbook-style: backing logic (list ordering, op dispatch, + state-machine transitions), not GTK widgets. +- *NM secrets / no-leak* — add/edit writes the secret into NM via nmcli (asserted + against fake nmcli, never to a tool-owned file); assert no PSK/EAP appears in any + =--json=, log line, or error (there is no credential store to round-trip). +- *Live checklist (gated out of the suite)* — a "Manual testing and validation" + task per phase for the real-network states (captive at a hotel, no-internet, + switch under load, reset, speedtest) that can't be faked. + +** Harness + coverage gate +The concrete contract, matching the repo's existing convention (not pytest — the +dotfiles suites are =unittest=, run by =make test= as =python3 -m unittest= over +=tests/*/test_*.py=; 33 suites today): +- *Framework*: =unittest=. Each suite is =tests//test_.py= + (=tests/net/=, =tests/waybar-net/=), collected by the existing =make test= loop + — no new runner, no pytest dependency. +- *Fakes on a temp PATH*: =fake-nmcli=, =fake-curl=, =fake-speedtest-go=, + =fake-rfkill=, =fake-resolvectl= live as executable stubs in =tests//= + (the =tests/layout-navigate/fake-hyprctl= pattern). A fixture file encodes the + command→canned-output map and the stub appends each invocation to a log the test + asserts against. Subprocess timeouts are simulated by a stub that sleeps past the + budget; =net status= must still return the degraded state. +- *Waybar wrappers end-to-end*: =waybar-net= is run as a subprocess with the fake + PATH and the env overrides (iface, cache path), asserting the emitted JSON — same + as =tests/waybar-netspeed=. +- *Coverage*: coverage.py is absent system-wide (and not importable), so coverage + runs in a throwaway venv (=python3 -m venv=, =pip install coverage=, =coverage + run -m unittest=, =coverage report=) — the method the wtimer suite used (95%). + Target: *branch* coverage over =net/= and the wrapper, ≥ 90% on the pure + classifier/parser modules. -* Files touched (planned) +** Coverage as a gap-finder, not a number (per phase) +Line coverage alone misses the branches that matter here, so each phase ends with +a *coverage-gap pass*, not just a percentage: +- After the first green run, read the branch report and map every uncovered branch + to either a new test or a consciously-excluded live-only behavior (with a comment + or a Manual-testing entry naming it). +- *Branch coverage is required* for the pure logic: the doctor classifier (every + outcome — fixable / needs-user-action / upstream-not-local / deferred-vpn), the + cleanup-unverified path, the redaction paths, the degraded hot-path fallback, the + timeout branches, and the portal/nmcli parsers. +- A phase isn't "done" until its coverage-gap pass is recorded — uncovered logic is + either tested or explicitly excused, never silently uncovered. +* Files touched (planned, all in =~/.dotfiles=) + +- =net/= package (src-layout, like pocketbook) — engine + panel. - =hyprland/.local/bin/waybar-net= — the indicator (replaces =waybar-netspeed=). -- =hyprland/.local/bin/net= — engine CLI entry (or a package console-script). -- =net/= package (src-layout, like pocketbook) — engine + panel, in the dotfiles - repo (or in-tree as pocketbook is). -- =hyprland/.config/waybar/config= — swap =custom/netspeed= → =custom/net=. -- =hyprland/.config/waybar/style.css= — captive / no-internet state classes. +- =hyprland/.local/bin/net= — engine CLI entry (console-script shim). +- =hyprland/.config/waybar/config= — swap =custom/netspeed= → =custom/net=; + remove =custom/airplane=. +- =hyprland/.config/waybar/style.css= — captive / no-internet / degraded / + rfkill classes; remove airplane classes. - =tests/net/=, =tests/waybar-net/= — suites. -- =~/.config/net/= — config + connection store (machine-local; not stowed - content beyond a seed config). -- =captive= — minor refactor so the engine can reuse its probe cleanly. +- =captive= — refactor: extract probe + reset into functions callable + non-interactively (a =--json= probe mode) so the engine reuses them. +- =~/.config/net/config= — seed config (probe TTL, speedtest server, redaction + flags). No secrets; not a credential store. +- dotfiles =Makefile= — add the console-recovery targets (=online=, =net-doctor=, + =net-status=, =net-diagnose=, =net-portal=, =net-reset=, =net-bounce=). +- *Deletions once net ships* (the airplane module is absorbed): + =hyprland/.local/bin/waybar-airplane=, =hyprland/.local/bin/airplane-mode=, + =tests/waybar-airplane/=, =tests/airplane-mode/=, and the =custom/airplane= + module + its css. +- archsetup Hyprland step — add =gtk4-layer-shell=, =python-gobject=, + =speedtest-go-bin= to the install lists (the only archsetup change; no =gpg= + added, secrets stay in NM's store). -* Resolved decisions (this session, Craig's calls) +* Resolved decisions (Craig's calls + this response) 1. Panel UI tech → GTK4 + gtk4-layer-shell, shared pocketbook scaffold (one panel shell, reused by the desktop-settings sibling). 2. Engine language → Python =net= package; shells out to =captive= for the portal-force flow, native cheap probe for the bar path. 3. Connectivity probe → split cadence (fast link poll every 2s + slow cached - internet/captive probe, TTL ~45s). + internet/captive probe, TTL ~45s) with single-flight + atomic cache. 4. No keyboard-modifier clicks (waybar can't qualify them) — the panel hosts the - rich actions; bar clicks are left=panel, right=reset, middle=portal. -5. GPG store supplements NM; NM stays the source of truth. + rich actions; bar clicks dispatch detached jobs (phase-aware). +5. No separate credential store (Craig's call, cj). Secrets live in NM's own + =system-connections= (root =0600=, inline), touched via nmcli. No GPG, no + gpg-agent, no =~/.config/net/connections=. Supersedes the earlier GPG-store + design. 6. =custom/netspeed= absorbed into =custom/net=; throughput moves to the tooltip. -7. Speed-test backend → librespeed-cli. +7. Speed-test backend → =speedtest-go= (already installed), not a new + librespeed-cli dependency; librespeed-cli is the self-hosted fallback. +8. Code lives in the dotfiles repo; archsetup only installs deps. +9. v1 add/edit scope = open + WPA-PSK; enterprise/802.1X is activate-only, + add/edit is vNext (settled by Craig 2026-06-29 — no enterprise networks in his + history, so the form would be unused UI). +10. =net doctor= is in v1 (Craig's call, cj) — a one-shot diagnose+fix mode, + reachable from a TTY via =make online= / =make net-doctor=. (The earlier + "defer the doctor/bundle command" decision is reversed.) +11. Diagnose (read-only) and Repair (mutating, confirmed) are separated in the + panel and the CLI; Repair is tiered lightest-first (rfkill → reset → bounce). +12. =custom/net= absorbs the airplane module (Craig's call, cj); the standalone + airplane module + scripts + tests are deleted once net ships. +13. Repair includes a full-stack bounce and an rfkill-unblock (Craig's calls, + cj) — the latter recovers the framework-laptop post-power-loss soft-block. +14. VPN / WireGuard is a planned Phase 5 (Craig's call, cj), not a permanent + exclusion. -* Phasing +* Implementation phases -- *Phase 1 — Indicator (task #C).* =net status= + =net probe= (native cheap - probe, reusing captive's logic) + =waybar-net= + the split-cadence cache + CSS - states. Ships the no-internet/captive state on the bar. Smallest, highest - value, fully testable without the GTK panel. +- *Phase 1 — Indicator + console recovery (task #C).* =net status= + =net probe= + (native cheap probe, reusing captive's logic) + the =captive= probe refactor + + =waybar-net= + the split-cadence cache (single-flight, atomic, stale classes) + + CSS states (incl. rfkill) + performance budget. Plus the CLI-only recovery path: + =net repair= tiers (rfkill / reset / bounce), =net doctor [--fix]=, and the + Makefile targets (=make online= etc.) — all testable without the GTK panel. + Absorbs the airplane state and removes the standalone airplane module. Interim + left-click keeps the existing scratchpad until the panel lands. + - *Acceptance*: fresh-login waybar smoke test shows correct state on + online/captive/no-internet/wired/rfkill; =net status= stays within budget + under a fake slow nmcli (degraded state); =net doctor --fix= recovers a + soft-blocked radio from a TTY; the live captive checklist passes at a real + portal; the airplane state works and the old airplane module is gone; + reverting = swap =custom/netspeed= + =custom/airplane= back. - *Phase 2 — Panel shell + connection management (task #B core).* GTK4 - layer-shell scaffold + =net list/up/down/add/edit/remove/rescan= + MRU list. + layer-shell scaffold + =net list/up/down/add/edit/remove/rescan= + MRU list + + mutation safety/rollback + panel state machines. + - *Acceptance*: switch wifi↔wifi and ethernet↔wifi without stranding; a failed + switch leaves the prior link up; add/edit open + WPA-PSK writes the secret to + NM; remove confirms; panel states render for loading/rescan/activation. - *Phase 3 — Diagnostics + speed test in the panel.* Wire =net diagnose= / - =net portal= / =net reset= / =net speedtest= into the panel; portal Open - button. -- *Phase 4 — GPG credential store.* Opt-in encryption, cadence config, NM sync, - import. + =net repair= / =net doctor= / =net portal= / =net speedtest= into the Diagnose + vs Repair sections; the "Get me online" button; portal Open button; speedtest + progress + cancel. + - *Acceptance*: diagnose runs read-only; each repair tier confirms + verifies + cleanup (DNS override reverts, shown); speedtest result parses from + speedtest-go and a fixture-driven failure shows the right message. +- *Phase 4 — Docs + rollout.* In-app help (=net --help= / per-command help, the + panel help affordance), README/user-guide (commands, panel, config, + troubleshooting, the make targets, rollback), and the manual dep step on ratio. + - *Acceptance*: =net --help= and each subcommand's help are complete; the + user-guide covers every command + the recovery targets; ratio rollout + documented. +- *Phase 5 — VPN / WireGuard (future).* Fold the existing archsetup wireguard + tooling into the same panel + CLI (=net vpn ...=). Out of the v1 milestone; + specced separately when picked up. * Open items / risks - gtk4-layer-shell dropdown anchoring under a waybar module needs the same - positioning work pocketbook solved; reuse it. -- librespeed-cli availability + a default server choice (public list vs a pinned - server) to confirm before Phase 3. + positioning work pocketbook solved; reuse it. (Phase 2.) +- The =captive= refactor must keep the standalone CLI behavior identical while + exposing a non-interactive =--json= probe; covered by the existing + =tests/captive= suite plus new probe-mode tests. (Phase 1.) +- speedtest-go server selection variance (nearest-server flor) — pin a server in + config if results are noisy. (Phase 3.) - The background-probe kick from =net status= must be truly non-blocking (spawn + - detach) so a slow/hanging probe never stalls the bar. -- NM↔GPG-store conflict policy on edit needs a concrete rule (NM wins for the - live connection) — confirm during Phase 4. + detach); enforced by the single-flight lock and the performance benchmark test. * Rollback Each phase is independent. The indicator (Phase 1) is a drop-in replacement for -=custom/netspeed=; reverting is swapping the module back in the config. The panel -and store are additive — not wiring their clicks / not enabling encryption leaves -the bar working as before. +=custom/netspeed= (and =custom/airplane=); reverting is swapping those modules +back in the config and restoring their scripts. The panel is additive — not +wiring its clicks leaves the bar working as before. No credential store to roll +back (secrets stay in NM throughout). + +* Review findings [31/31] + +** DONE Define the structured diagnostics contract :blocking: +The spec says the engine "emits JSON" and that diagnostics "reuse =captive= +verbatim", but the current =~/.dotfiles/common/.local/bin/captive= flow is a +human-readable bash script that mixes diagnostics, sudo prompts, DNS mutation, +browser launch, and terminal prose. A GTK panel cannot reliably turn that into +clear state, progress, cancellation, or useful error messages. Define the +machine contract before implementation: every diagnostic step should have a +stable id, status (=pending/running/pass/warn/fail/skipped=), redacted evidence, +elapsed time, safety outcome, and next action. Keep =captive= as the interactive +CLI, but either refactor reusable probe/reset functions behind =net diagnose +--json= or make =captive= expose a non-interactive JSON mode. This blocks the +panel and logging work because otherwise the implementer must invent the +boundary. + +Disposition: accept — added the "Diagnostics contract" section (per-step id / +status / evidence / elapsed / safety / next_action) and the =captive= =--json= +probe-mode refactor under Architecture + Files touched. + +** DONE Specify user-facing failure messages and recovery actions :blocking: +The spec names failure states like =no-internet=, =captive=, failed probe, +failed reset, missing DNS, and missing speed-test backend, but it does not define +the messages the user sees or what each message tells them to do next. For this +feature, "error" is not enough: a user needs to know whether WiFi is associated, +whether DHCP succeeded, whether DNS is hijacked/broken, whether HTTP is +intercepted, whether sudo was declined, whether a command timed out, and whether +the system was left unchanged or partially changed. Add a message table for the +indicator, panel, and CLI with: failure class, visible text, evidence included, +redaction rule, and next action. This is blocking because UX quality here is the +product, not an implementation detail. + +Disposition: accept — added the "Failure states, messages, recovery" section +covering each class, the visible message, the "what changed" residue note, and +the next action across indicator/panel/CLI. + +** DONE Define the debug log and redacted support bundle :blocking: +There is no observability section. When this fails in a hotel or cafe, an agent +needs enough evidence to diagnose it without rerunning destructive actions. Add +log location, rotation/retention, JSONL event schema, command argv logging, +exit-code/stderr capture, elapsed time, selected iface, NM active connection +UUID, probe URL class, HTTP code, redirect host, DNS servers, and cache +read/write events. Also define a =net doctor --json= or =net debug-bundle= +command that emits redacted status, recent log events, dependency versions, and +a reproduction command. Redact SSID if configured, MAC addresses, portal query +tokens, PSKs, EAP identities/passwords, IPs when requested, and all GPG/NM +secrets. This blocks implementation readiness because post-failure diagnosis is +currently left to ad hoc terminal spelunking. + +Disposition: modify — accepted the JSONL event log, the schema, and the redaction +rules in full (new "Observability" section). Deferred the dedicated =net +debug-bundle= / =net doctor= command to vNext: for a single-user tool =net +diagnose --json= (the snapshot) plus the event log (the history) cover +post-failure diagnosis; a bundle command is gold-plating for v1. Recorded under +Out + Resolved decision 10. + +** DONE Pin the nmcli parsing and timeout contract :blocking: +The spec lists nmcli operations but not the exact fields, output modes, escaping +rules, ID semantics, or timeouts. This is risky because SSIDs and connection +names can contain spaces, colons, duplicates, hidden names, and non-ASCII; the +current =waybar-netspeed= already had an SSID parsing bug. The nmcli manual +documents =--terse=, =--get-values=, =--escape=, =--wait=, ID/UUID/path +selection, =passwd-file=, and built-in connectivity states +(=none/portal/limited/full/unknown=) at +https://man.archlinux.org/man/nmcli.1.en. The spec should require UUIDs for +saved-profile operations, explicit =--wait= budgets, parser tests for escaped +colons/backslashes/newlines/duplicate names/hidden SSIDs, and a decision on when +to use or ignore =nmcli networking connectivity [check]=. This is blocking +because the command wrapper is the core reliability boundary. + +Disposition: accept — added the "nmcli contract" section: terse + =--escape= + +=--get-values=, UUID-keyed ops, explicit =--wait= budgets, NM connectivity as a +cheap hint (our probe authoritative), and the parser test matrix. + +** DONE Define cache concurrency, atomicity, and stale-state behavior :blocking: +=net status= may spawn =net probe= whenever the cache is stale, but the spec +does not define locking, process coalescing, atomic writes, crash cleanup, or +what happens when the probe hangs. With a 2s Waybar interval, a bad network could +start overlapping probes, corrupt the runtime cache, or keep showing stale +"online" while the link is gone. Add a single-flight lock under +=$XDG_RUNTIME_DIR/waybar=, atomic write+rename for cache updates, max probe +runtime, stale age classes (fresh/stale/expired/unknown), cache invalidation on +iface/SSID/connection UUID change, and tests for concurrent =net status= calls. +This blocks the fast-path design because it is the main performance and +correctness risk. + +Disposition: accept — added "Concurrency, atomicity, staleness" under the +Connectivity model: flock single-flight, temp+rename atomic write, ≤6s probe +timeout, fresh/stale/expired/unknown classes, iface/SSID/UUID invalidation, stale +lock reclaim, plus concurrency tests in the test plan. + +** DONE Bound hot-path performance with measured budgets :blocking: +The spec says the cheap poll should be sub-100ms, but the proposed fast path +still may call multiple =nmcli= commands every two seconds, read sysfs, parse +throughput, and maybe spawn a background probe. The existing =waybar-netspeed= +had a deliberate sleep for throughput sampling; replacing it must define how +throughput is sampled without sleeping in the bar path. Add a per-command budget +for =waybar-net= and =net status=, a maximum number of subprocesses on the hot +path, a timeout for every subprocess, benchmark tests with fake slow =nmcli=, +and a rule that the indicator emits a degraded JSON state rather than blocking. +This is blocking because Waybar custom modules can visibly freeze or lag when +their exec path stalls. + +Disposition: accept — added the "Performance budgets" section: <100ms typical / +<250ms worst, throughput sampled across the poll interval (no in-process sleep), +one nmcli call max on the hot path, timeouts on every subprocess, the degraded +state, and a fake-slow-nmcli benchmark test. + +** DONE Make click actions non-blocking and visible :blocking: +Waybar right-click runs =net reset= and middle-click runs =net portal= directly. +Those operations can require sudo, open browsers, mutate DNS, delete/recreate NM +profiles, or hang on network commands, but Waybar click handlers provide no +panel, terminal, progress, or cancellation surface by default. Define whether +right/middle click instead opens the panel focused on the action, dispatches a +background job with notifications, or is removed from v1. If kept, specify +single-flight behavior, how sudo/polkit prompts surface, how success/failure is +reported, and how the user can inspect logs. This blocks UX readiness because +the fastest remediation path is currently the easiest place to hide failure. + +Disposition: modify — accepted the concern; made the interactions phase-aware and +non-blocking. Every click dispatches a detached, single-flight background job and +reports via =notify=; sudo surfaces through polkit/the normal prompt; failures go +to the notify + the event log. In Phase 1 (no panel) left-click runs probe + +notify and keeps the scratchpad; from Phase 2 left-click opens the panel focused +on the action. Recorded in the Indicator "Interactions" subsection. + +** DONE Specify connection mutation safety and rollback :blocking: +The spec says row click switches connections and remove gets a confirm, but it +does not define what happens when a switch partially succeeds, disconnects the +current working link, needs a password, loses the default route, or triggers +auto-activation. The nmcli manual warns that =connection down= does not prevent +future auto-activation and may internally block a profile until user action. +Define preflight, the exact NM command sequence, whether the old active +connection is kept until the new one proves usable, when rollback is attempted, +how long activation waits, and what the panel says when rollback fails. This is +blocking because the module can strand the user offline. + +Disposition: accept — added "Mutation safety + rollback" under Connection +management: keep the prior link up until the target activates (=--wait 30=), no +teardown on failure, password-required surfaced not stranded, =net down= reports +post-op active state + the auto-reactivation caveat, and the pinned NM command +sequence is tested against fake nmcli. + +** DONE Define the credential-store security model :blocking: +The GPG store is described as optional and default-unencrypted, but the spec does +not define file modes, schema, secret-source rules, import/export prompts, +recipient verification, stale secret handling, or what is logged. It also says +NM remains source of truth while the user-owned store contains PSK/EAP secrets, +which creates two truth sources for sensitive data. Add a precise schema, +=0600= file creation with parent-dir permissions, encrypted-recipient checks, +plaintext warning text, explicit opt-in flow, redaction requirements, behavior +when NM has a secret not in the store, behavior when the store has a secret NM +rejects, and tests for no secret leakage in JSON/logs/errors. This blocks Phase +4 and the full spec because otherwise the implementer must make security +decisions mid-code. + +Disposition: accept — rewrote "Credential storage" with the versioned schema, +=0600= file / =0700= dir, recipient verification on opt-in, the plaintext +warning, secret-source rule (entered/exported, never harvested from root store), +the two-source reconciliation policy (NM wins live, store wins for what NM +lacks, stale-secret flagging), and the no-leak tests. + +** DONE Define EAP, enterprise WiFi, and unsupported connection behavior :blocking: +The store says "PSK/EAP" and connection management says add/edit, but there is +no v1 contract for WPA-Enterprise fields, certificates, identity vs anonymous +identity, hidden networks, static IP, proxy settings, metered flags, MAC +randomization, or 802.1X prompt behavior. Either scope v1 to open/WPA-PSK plus +existing saved-profile activation, or define the minimum EAP form and the +unsupported-state messages. This blocks add/edit/import because enterprise WiFi +is too sensitive to hand-wave. + +Disposition: modify (scope) — scoped v1 to open + WPA-PSK add/edit, with +*activation* of any existing saved profile (including enterprise). Enterprise / +802.1X add/edit, static-IP, proxy, metered, and MAC-randomization editing are +vNext, shown as "edit via nmtui/nmcli". Recorded in Scope/Out, Connection +management, and Resolved decision 9. + +** DONE Split read-only diagnostics from mutating remediation :blocking: +The panel's diagnostics section includes probe, bounce/reset, gateway ping, and +DNS override test in one area, while =captive= currently performs resets and +temporary DNS changes as part of its flow. Users need to know which buttons are +read-only and which mutate NM profiles, MAC mode, DNS, or browser state. Add +separate "Diagnose" and "Repair" actions, confirmations for destructive or +privacy-changing operations, explicit cleanup verification for DNS override, and +a terminal state when cleanup is unverified. This blocks readiness because +network repair must not surprise the user or leave hidden residue. + +Disposition: accept — split the panel into a read-only Diagnose section and a +confirmed, mutating Repair section (and split the CLI into =net diagnose= vs =net +repair=). Added =cleanup_verified= + a terminal =cleanup-unverified= state to the +diagnostics contract. + +** DONE Define panel state, cancellation, and permissions UX :blocking: +The panel sections list buttons and a streaming output area, but not loading +states, disabled states, empty states, keyboard/focus behavior, cancellation, or +permission-denied handling. Add panel state machines for connection list loading, +rescan in progress, activation in progress, diagnostics running, speedtest +running, and no NetworkManager/no WiFi/no permissions/no GPG key/no +librespeed-cli. Each long operation should be cancellable where possible or +clearly non-cancellable with an elapsed-time display. This blocks the GTK work +because without it the implementer must invent the user flow. + +Disposition: modify — accepted the state-machine requirement (added "Panel state, +cancellation, permissions"), but scoped the state set to what can actually occur +on the two-machine fleet: dropped "no NetworkManager" as a modeled state (NM is +always present; a missing nmcli is a single hard-error exit) and kept +no-wifi-hardware, missing speedtest-go, no-GPG-key, plus the in-progress states +with elapsed-time + cancellation where the op allows. + +** DONE Verify speed-test dependency, server choice, and failure contract :blocking: +The spec chooses =librespeed-cli= and notes availability/default-server research +as an open risk, but Phase 3 still depends on parsing its JSON and showing +progress. I checked the upstream project page +(https://github.com/librespeed/speedtest-cli) and the AUR URL named by search is +not sufficient as a verified package/install contract in this spec. Add the +exact package name/source to install, command version expected, JSON shape, +server-selection policy, timeout, cancellation behavior, offline/rate-limited +messages, and tests with fixture JSON and fixture stderr. This blocks Phase 3 +because speed-test failure modes are otherwise undefined. + +Disposition: modify — verified live and changed the backend: =speedtest-go= (AUR +=speedtest-go-bin=, 1.x) is already installed on velox and supports =--json=, +=--server=, =--no-download/--no-upload=, so v1 needs no new dependency. +librespeed-cli (AUR =librespeed-cli= / =-bin=) is the documented self-hosted +fallback. Added the "Speed test" section with server policy, timeout, +cancellation, the failure-message mapping, and fixture-JSON/stderr tests. + +** DONE Define dependency installation and repo boundaries :blocking: +The files touched section alternates between archsetup paths and the external +dotfiles repo, while pocketbook has been folded into this repo and its previous +archsetup provisioning was intentionally removed. The spec should state where +the =net= package actually lives, which repository owns the scripts/tests, +whether =gtk4-layer-shell=, =python-gobject=, =librespeed-cli=, =gpg=, =nmcli=, +=curl=, and =resolvectl= are installed by archsetup or assumed present, and the +Makefile targets for test/lint/install. This blocks implementation because the +current path plan can produce code that is not installed on a fresh machine. + +Disposition: accept — added the "Repository + dependencies" section: all code in +=~/.dotfiles= (=net/= package in-tree like pocketbook, scripts in the hyprland +tier, tests under =tests/=), archsetup owns only the dep install +(=gtk4-layer-shell=, =python-gobject=, =speedtest-go-bin=; nmcli/curl/resolvectl +already present), Makefile =make test= collects the package suite, and a +daily-drivers note for ratio. Rewrote Files touched to match. + +** DONE Expand the test plan for failure, concurrency, and live verification :blocking: +The testing plan covers normal parsing and fake command sequences, but it misses +the riskiest behaviors: slow/hung =nmcli=/=curl=/=librespeed=, concurrent +=net status= cache refresh, corrupt cache, stale cache after SSID change, +permission denied, sudo declined, DNS override cleanup failure, NM partial +activation, duplicate connection names, secret redaction, missing optional +dependencies, no WiFi hardware, wired+tether+WiFi ambiguity, portal redirect +tokens, and Waybar click handlers. Add unit/fixture tests for each class plus a +manual/live checklist gated out of the normal suite. This is blocking because +the current plan would leave the exact "things that can go wrong here" mostly +untested. + +Disposition: accept — rewrote the Testing plan with the "Failure + concurrency" +class (slow/hung commands, single-flight, corrupt/stale cache, perm-denied, +cleanup-failure, partial activation, redaction, missing deps, no-wifi, +multi-active) and a per-phase live checklist gated out of the suite. + +** DONE Define status JSON schemas and compatibility rules +The spec says all subcommands take =--json= but does not define schemas. Add +versioned JSON examples for =status=, =probe=, =list=, =diagnose=, =speedtest=, +and error envelopes, including nullable fields and unknown/degraded states. This +is non-blocking for product direction but should be fixed before code so tests +can lock the CLI contract. + +Disposition: accept — added the "JSON schemas" section with versioned (=v:1=) +envelopes for status / probe / list / diagnose / speedtest and a shared error +envelope, including the degraded/unknown states. + +** DONE Rename or alias the phasing section for workflow compatibility +The spec has a usable =Phasing= section, but the spec-review workflow expects an +=Implementation phases= section that can be lifted into =todo.org=. Rename it or +add an alias heading during response. This is non-blocking because the existing +phase decomposition is understandable, but aligning the heading prevents future +workflow friction. + +Disposition: accept — renamed =Phasing= → =Implementation phases= and added +per-phase acceptance criteria. + +** DONE Add documentation and rollout acceptance checks +Rollback is described, but docs and rollout are thin. Add README/user-guide +updates for commands, panel behavior, config file, GPG opt-in, troubleshooting, +and rollback; add acceptance checks for each phase, including a fresh-login +Waybar smoke test and restoring =custom/netspeed=. This is non-blocking but +important for handing the feature to a future session without re-discovery. + +Disposition: accept — added per-phase acceptance criteria under Implementation +phases (incl. the fresh-login waybar smoke test and the =custom/netspeed= +restore), a Phase 4 "Docs + rollout", and (answering Craig's cj follow-up) a +dedicated "Help + documentation" section with the three help layers (CLI help, +panel help affordance, user guide). + +** DONE Add a failure-mode coverage table :blocking: +The spec now names many individual network failures, but it still does not carry +one compact coverage matrix that says, for each common failure mode, whether +=net diagnose= detects it, whether =net doctor --fix= can repair it, and what +terminal user action remains when it cannot. Add a table covering at least: +rfkill soft block, rfkill hard block, no WiFi hardware, associated/no DHCP, +gateway unreachable, captive DNS hijack, broken DNS where 1.1.1.1 works, HTTP +portal, HTTP interception without a parseable portal URL, upstream/AP outage, +wrong WPA password or missing secret, enterprise auth/cert failure, duplicate +SSID/connection-name ambiguity, hidden SSID, multiple active links, wedged +NetworkManager, slow/hung command, stale/corrupt cache, DNS cleanup failure, +missing speedtest backend, and VPN/routing interference. This blocks because +Craig asked for confidence that the diagnostics and doctor cover the real field +failures, and prose scattered across sections is too easy to misread. + +Disposition: accept — added the "Failure-mode coverage" section: a 22-row table +(every mode the finding named) with detect / doctor-fix / terminal-action +columns, conformed to the org-table standard (rules under every row, ≤120). + +** DONE Pin DNS repair semantics in doctor :blocking: +The spec diagnoses DNS hijack, broken hotel DNS, and the temporary 1.1.1.1 +override test, but =net doctor --fix= does not say whether it merely recommends +the override, applies a temporary override during recovery, or leaves DNS alone +after diagnosis. Define the exact behavior for each DNS class: captive hijack +should open the portal, broken DNS where 1.1.1.1 works should either offer an +explicit temporary repair with cleanup verification or recommend the command, +and port-53/egress blocking should stop as upstream/not locally fixable. This is +blocking because DNS is one of the most common "connected but unusable" failures +and the current doctor contract is ambiguous. + +Disposition: accept — added "DNS handling in doctor (explicit per class)" under +the new Doctor section: hijack → open portal (no DNS mutation); broken-but-1.1.1.1 +→ explicit temporary override with cleanup verification under =--fix=, recommend +otherwise; egress-blocked → terminal =upstream-not-local=. + +** DONE Make auth failures terminal user-action states :blocking: +Wrong WPA password, missing NM secret, locked keyring/polkit denial, enterprise +802.1X certificate/identity failure, and portal login-required are not fixed by +resetting or bouncing NetworkManager. The doctor sequence should classify these +as =needs-user-action= terminal states, stop before looping through destructive +repairs, and tell the user the exact next action (enter password, edit profile in +=nmtui=/=nmcli=, accept portal terms, provide cert/identity, or retry with +admin auth). This blocks because repeated reset/bounce against auth failures is +slow, noisy, and can make the network state worse without helping. + +Disposition: accept — added the =needs-user-action= terminal outcome to the +Doctor section: wrong password / missing secret / keyring-or-polkit denial / +802.1X cert-or-identity failure / portal-login-required all stop the doctor before +any destructive repair and name the exact next step. + +** DONE Define upstream/AP/provider failure terminal states :blocking: +Some failures are not client-repairable: AP has no uplink, hotel gateway is +down, DHCP server is broken, gateway drops traffic, ISP outage, or captive +portal backend is failing. The spec should define how =diagnose= proves "local +link is up but upstream is broken" and how =doctor --fix= stops after local +repairs are exhausted with a clear message like "local repairs tried; likely +upstream/AP/provider" plus the evidence. This blocks because users need to know +when to stop poking the laptop and switch networks or contact the venue. + +Disposition: accept — added the =upstream-not-local= terminal outcome: diagnose +proves link-up + IP + gateway-reachable but no route out and no captive redirect; +=doctor --fix= stops after local repairs with "local repairs tried; likely +upstream/AP/provider" + evidence → switch network / contact venue. + +** DONE Decide how VPN and policy routing affect v1 diagnosis +VPN/WireGuard management is Phase 5, but active VPNs, policy routes, DNS +overrides, and firewall killswitches can break apparent internet access in v1. +The current spec does not say whether v1 detects active VPN/policy routing and +classifies "network is fine, VPN route/DNS is broken" separately from WiFi +failure. Add either a v1 diagnostic check for active VPN/default-route/DNS +ownership with a "deferred repair" outcome, or explicitly state that VPN-routed +failures are out of scope and may be misclassified. This is blocking if Craig +expects the module to diagnose normal daily-driver network failures while VPN +tooling remains separate. + +Disposition: accept (chose the detect-and-classify option) — v1 detects an active +VPN / non-NM default route / non-NM DNS owner and classifies =deferred/vpn= ("link +is fine; internet is VPN-routed"), distinct from a WiFi failure. v1 does not +repair it (VPN management is Phase 5); it names the VPN as the likely owner and +stops. Added to the Doctor section + the coverage table + a doctor-classification +test. + +** DONE Remove stale GPG-store references from the resolved spec +The spec now decides "no separate credential store; secrets live in +NetworkManager", but the Testing plan still mentions =gpg round-trip= and =GPG +store= tests, and the panel-state list still mentions a no-GPG-key state. Remove +those stale references and replace them with NM-secret/no-secret-leak tests. +This is non-blocking for product behavior but blocking for implementation +clarity: otherwise tests will be written for a credential store that no longer +exists. + +Disposition: accept — replaced the Testing-plan =gpg round-trip= / =GPG store= +bullets with an "NM secrets / no-leak" test (add/edit writes the secret via nmcli; +assert no PSK/EAP in any JSON/log/error; no store to round-trip) and dropped the +=no-GPG-key= panel state. Residue from the cj-comment pass that dropped the store. + +** DONE Reconcile status, goal, and task text before implementation :blocking: +The spec status says "Implementation-ready with caveats" and "Phase 1 ready to +build", but the body still has an unresolved enterprise add/edit VERIFY, the +Goal still says "optional GPG-encrypted secret store", and the unified task title +still names "GPG-stored secrets" even though the accepted design removed the +store. Before implementation, make the top-level status, goal, scope, task +mapping, and resolved decisions agree with the current design. This blocks +readiness because a developer starting from the top of the file would still build +or plan around abandoned GPG-store behavior. + +Disposition: accept — fixed the Goal ("secrets stay in NM's own store"), the +=[#B]= task-mapping line (notes the "GPG-stored secrets" framing is superseded by +decision 5), the enterprise VERIFY (now resolved → Status updated), and corrected +the stale =pytest= mentions to =unittest= (the repo's actual harness). Top-of-file +status/goal/scope/decisions now agree with the design. + +** DONE Resolve enterprise add/edit scope or make the caveat explicit :blocking: +The spec still says "One open question for Craig: pull enterprise add/edit into +v1?" and points to a VERIFY in =todo.org=. That is a real product-scope decision: +if enterprise add/edit is in v1, panel forms, nmcli command sequences, tests, +error messages, and docs change materially; if it is out, the UI must consistently +show activate-only with "edit in nmtui/nmcli". Decide it in the spec before +implementation, or downgrade the status to =Ready with caveats= with this exact +accepted caveat. As written, the spec cannot be plain =Ready=. + +Disposition: accept — Craig decided (2026-06-29): enterprise add/edit is vNext, +activate-only in v1. Settled in the Status line, the Scope/Out bullet, decision 9, +and the VERIFY (now DONE in todo.org). The UI shows activate-only with "edit in +nmtui/nmcli" consistently. Evidence: 24 saved profiles, 0 enterprise. + +** DONE Define the concrete test harness and coverage gate :blocking: +The spec says TDD, fake binaries on PATH, and benchmark tests, but it does not +define the actual harness contract: pytest vs unittest for the =net= package, +where fake =nmcli=/=curl=/=speedtest-go=/=rfkill=/=resolvectl= live, how test +fixtures encode command histories, how subprocess timeouts are simulated, how +Waybar scripts are executed end-to-end, and how coverage is run. Add the exact +Makefile targets (=test=, =test-unit= or package-local =pytest=), pytest config, +coverage command (e.g. branch coverage over =net/= and =waybar-net= wrappers), +minimum threshold, and the rule for reading the coverage report to add missing +tests before declaring a phase done. This blocks readiness because "what is the +test harness?" is still answerable only by analogy to older suites. + +Disposition: accept — added the "Harness + coverage gate" section. Corrected the +premise: the repo is =unittest= (=make test= → =python3 -m unittest=, 33 suites), +not pytest. Pinned the fake-binary stub convention (=tests//fake-*= on a +temp PATH), the fixture command→output map, timeout simulation, the end-to-end +=waybar-net= subprocess run, and coverage via a throwaway venv (coverage.py is +absent system-wide) with a ≥90% branch target on the pure modules. + +** DONE Use coverage to find missing behavior, not just report a percentage :blocking: +The spec does not say how coverage findings affect implementation. For this +feature, line coverage alone can miss the important holes: doctor classification +branches, cleanup-unverified paths, redaction paths, degraded hot-path fallbacks, +timeout branches, and auth/upstream/VPN terminal states. Define coverage review +criteria per phase: branch coverage for pure classifiers and parsers, named +untested branches allowed only with comments or manual-check entries, and a +required "coverage gap pass" after the first green test run that maps uncovered +logic back to tests or consciously excluded live-only behavior. This blocks +readiness because the current test plan is broad but does not force the suite to +expose missing edge tests. + +Disposition: accept — added the "Coverage as a gap-finder, not a number (per +phase)" subsection: branch coverage required for the doctor classifier (every +outcome), cleanup-unverified, redaction, degraded-fallback, timeout, and the +parsers; a mandatory coverage-gap pass after the first green run mapping each +uncovered branch to a test or a named live-only exclusion; a phase isn't done +until that pass is recorded. + +** DONE Convert error classes into exact user-facing strings and evidence fields :blocking: +The failure table and doctor outcomes classify errors well, but many messages +are still templates or descriptions rather than final text. Add exact strings +for indicator tooltip, notification, CLI stderr, JSON =error.message=, and panel +banner/step text for every failure-mode row, including cases doctor cannot fix: +wrong password, missing secret, enterprise cert failure, upstream/AP/provider +failure, VPN-routed failure, hard rfkill block, DNS cleanup failure, speedtest +missing, and HTTP interception without parseable URL. For each string, specify +the redacted evidence included and the next action. This blocks UX readiness +because "useful error" is only testable once the actual text and evidence are +defined. + +Disposition: accept — rewrote the Failure states section: each row now carries the +exact final string (with == evidence), the evidence field, and the +next action, plus a per-surface rendering rule (indicator tooltip / notify / +CLI+JSON error.message+detail+code / panel banner all render the one canonical +string). Added the missing doctor-unfixable rows: hard rfkill, wrong password / +missing secret, enterprise cert failure, upstream/AP/provider, VPN-routed, HTTP +interception without a parseable URL, and DNS cleanup-unverified. + +** DONE Add an enhancement disposition table +The spec captures several good enhancements (doctor, Makefile recovery, rfkill, +airplane absorption, VPN phase), but it does not show that low-cost adjacent +enhancements were considered and accepted/deferred/rejected. Add a small radar +table for likely affordances: copy redacted doctor report, open/copy portal URL, +retry with hardware MAC, forget network, rescan now, pin speedtest server, show +last good network/result, watch mode for =net doctor=, desktop notification +actions, QR-code/share WiFi import/export, and keyboard picker. Mark each +=v1=, =vNext=, or =rejected= with a one-line reason. This is non-blocking, but it +prevents accidental loss of cheap UX wins and keeps the v1 panel focused. + +Disposition: accept — added the "Enhancement radar" table dispositioning all the +named affordances: open/copy portal URL, forget network, rescan, hardware-MAC +retry, pin speedtest server, copy redacted doctor report = v1; last-good +network/result, doctor watch mode, actionable notifications, keyboard picker = +vNext; QR-share = rejected (low value for a 2-machine personal setup). + +** DONE Tighten the panel UX flow before Phase 2 +The panel has sections and state machines, but not a concrete interaction flow: +default focused section, row content, primary/secondary buttons, disabled-state +rules, confirmation wording for reset/bounce/DNS override, how "Get me online" +reports each escalation, what stays visible after the panel closes, and keyboard +navigation. Add a short UX flow spec or wire-level outline before Phase 2. This +is non-blocking for Phase 1, but it blocks Phase 2 implementation because a GTK +panel can easily become noisy or surprising if these defaults are invented while +coding. + +Disposition: accept — added the "Panel UX flow (settle before Phase 2)" +subsection: default focus (Connections, or Diagnose when opened from a captive +state), row content, one primary button per section, disabled-state rules, exact +confirmation wording for reset/bounce/DNS-override/remove, the live "Get me +online" escalation reporting, what survives panel close, and keyboard nav. + +* Review and iteration history + +** 2026-06-29 Mon @ 17:00:39 -0400 — Codex — reviewer + +- *What changed or was recommended:* Rubric: =Not ready=. Applied the + spec-review workflow and added blocking findings for diagnostics structure, + user-facing errors, observability, nmcli contracts, cache concurrency, + hot-path performance, Waybar click actions, mutation rollback, credential + security, unsupported WiFi types, panel states, speed-test dependency + verification, install boundaries, and test coverage. +- *Why:* The spec has the right broad shape, but this feature combines a status + indicator, NetworkManager mutation, sudo-backed remediation, captive-portal + probing, secret storage, and GTK UI. Too many failure modes were still left for + implementers to invent during coding. +- *Artifacts:* Findings recorded in =Review findings=. Local code read included + =~/.dotfiles/hyprland/.local/bin/waybar-netspeed=, + =~/.dotfiles/common/.local/bin/captive=, + =~/.dotfiles/tests/waybar-netspeed/test_waybar_netspeed.py=, + =~/.dotfiles/tests/captive/test_captive.py=, pocketbook's GTK layer-shell + scaffold, current Waybar config, and the linked =todo.org= tasks. External + checks included the Arch nmcli man page and upstream LibreSpeed CLI page. + +** 2026-06-29 Mon @ 17:35:00 -0400 — Claude Code (archsetup) — responder + +- *What changed:* Processed all 18 findings — 13 accepted, 5 modified, 0 + rejected. Accepts added new sections: Repository + dependencies, Performance + budgets, nmcli contract, JSON schemas, Diagnostics contract (+ Diagnose vs + Repair), Failure states/messages/recovery, Observability, cache + concurrency/atomicity/staleness, mutation safety + rollback, the rewritten + credential security model, the expanded test plan, JSON schemas, the renamed + Implementation phases with per-phase acceptance criteria, and a Status line. + Modifies: deferred the =net debug-bundle= command (log + =diagnose --json= + cover it), made click actions phase-aware + non-blocking, scoped enterprise + WiFi to activate-only (add/edit vNext), trimmed the panel state set to states + that occur on the two-machine fleet, and switched the speed-test backend to the + already-installed =speedtest-go= (verified live) instead of a new + librespeed-cli dependency. +- *Why:* Codex's review was strong and mostly right; the modifies are scope calls + for a single-user, two-machine tool (avoid gold-plating, use installed deps, + don't model impossible states) and one phasing reconciliation (the panel is + Phase 2, so Phase-1 clicks can't open it). +- *Artifacts:* All 18 findings completed =DONE= in place with dispositions. Live + check: =pacman -Qq | grep speed= (speedtest-go-bin installed; =speedtest-go + --json= confirmed). Status now "review incorporated (Codex, 2026-06-29)", + implementation-ready with the named Phase-2/3 caveats; Phase 1 ready to build. + +** 2026-06-29 Mon @ 17:37:58 -0400 — Claude Code (archsetup) — responder (cj comments) + +- *What changed:* Folded in Craig's 10 cj comments on the spec. Design changes: + (1) dropped the separate credential store entirely — secrets stay in NM's own + =system-connections= (root =0600=, inline), touched via nmcli; no GPG, no + gpg-agent (rewrote the secrets section, decision 5, dropped Phase 4 + the gpg + dep). (2) Added =net doctor [--fix]= + Makefile console-recovery targets + (=make online= etc.) as a first-class TTY path; reversed the earlier + defer-the-doctor call (decision 10). (3) Added a full-stack =bounce= repair and + an =rfkill= unblock repair + indicator state — the rfkill one recovers the + framework-laptop post-power-loss soft-block Craig hit. (4) =custom/net= absorbs + the airplane module; the standalone airplane scripts/tests/module are deleted on + ship. (5) Moved VPN/WireGuard from "out" to a planned Phase 5. (6) Added a + "Help + documentation" section (CLI help / panel help / user guide). Answered + the enterprise-defer rationale and the captive-auto-login explanation inline. +- *Why:* Craig's comments simplify (NM-only secrets, fewer deps) and harden the + recovery story (doctor + make targets + rfkill/bounce reachable from a dead-GUI + console — his stated need). Enterprise add/edit in v1 is the one open call, + raised as a VERIFY. +- *Artifacts:* All 10 cj blocks removed. Live checks: airplane scripts/tests + present (confirmed deletion targets); =/etc/NetworkManager/system-connections/= + is root =0600= with inline secrets (confirms the NM-only secret model); + =rfkill= present. VERIFY filed in =todo.org= for the enterprise-scope decision. + +** 2026-06-29 Mon @ 17:49:24 -0400 — Codex — reviewer + +- *What changed or was recommended:* Added a second feedback batch from the + common network failure-mode review: require a diagnose/doctor coverage table, + pin DNS repair semantics, make auth failures terminal user-action states, + define upstream/AP/provider terminal states, decide how active VPN/policy + routing affects v1 diagnosis, and remove stale GPG-store references. +- *Why:* The design now has a solid recovery shape, but the field question is + sharper: for each common failure, does =diagnose= catch it, does =doctor --fix= + fix it, or does the user need to switch networks/provide credentials/contact + the venue? That mapping needs to live in the spec, not in chat. +- *Artifacts:* Findings appended under =Review findings= after reviewing the + current spec's Diagnostics, Failure states, Doctor, Testing plan, and + responder history. + +** 2026-06-29 Mon @ 18:06:22 -0400 — Codex — reviewer + +- *What changed or was recommended:* Added a readiness pass focused on + development-time questions, missed enhancements, UX cleanliness, error wording, + and test/coverage mechanics. New findings require reconciling stale status/goal + text, resolving or explicitly caveating enterprise add/edit, defining the + concrete harness and coverage gate, making coverage actionable, turning failure + classes into exact user-facing strings, adding an enhancement disposition + table, and tightening the Phase-2 panel UX flow. +- *Why:* The network failure model is now strong, but implementation would still + force developers to invent test infrastructure, coverage expectations, final + error copy, and several UX details. The spec should decide those up front or + explicitly mark them as caveats. +- *Artifacts:* Findings appended under =Review findings= after reading the + updated spec, current archsetup Makefile test targets, pocketbook's pyproject, + and the linked =todo.org= enterprise VERIFY. + +** 2026-06-29 Mon @ 17:55:08 -0400 — Claude Code (archsetup) — responder (2nd review) + +- *What changed:* Processed all 6 second-batch findings — 6 accepted, 0 modified, + 0 rejected. Added a "Doctor: escalation, classification, terminal states" + section with four outcomes (=fixable= escalating lightest-first; =needs-user- + action=, =upstream-not-local=, =deferred/vpn= terminal) and explicit per-class + DNS handling. Added a "Failure-mode coverage" table (22 rows: detect / doctor-fix + / terminal action, org-table-standard conformant, ≤120). Cleaned the last + GPG-store residue (Testing plan → NM-secret/no-leak test; dropped the no-GPG-key + panel state). Added a doctor-classification test class. +- *Why:* The findings sharpen the doctor from "escalate repairs" to "classify, + then stop at the right terminal state" — the safety property that keeps + =doctor --fix= from looping destructive repairs against a wrong password or an + upstream outage. The VPN finding got the detect-and-classify option (not repair; + that's Phase 5) so v1 doesn't misread a VPN-routed failure as WiFi down. The + coverage table is the single artifact Craig asked for to trust the field + coverage. +- *Artifacts:* All 6 findings =DONE= in place; cookie =[24/24]=. Table conformed + via =wrap-org-table.el= (max line 120). No new findings introduced; no scope + expansion past what the findings asked. Spec remains implementation-ready, the + enterprise add/edit VERIFY still the one open decision. + +** 2026-06-29 Mon @ 18:24:51 -0400 — Claude Code (archsetup) — responder (3rd review) + +- *What changed:* Processed the third review (7 findings). The enterprise-scope + finding closed first on Craig's call (vNext, activate-only). The other 6, all + accepted: (1) reconciled the top-of-file text — fixed the Goal's GPG-store + wording, the =[#B]= task-mapping line, the resolved enterprise VERIFY, and + corrected the stale =pytest= mentions to =unittest= (the repo's real harness). + (2) Added a "Harness + coverage gate" section (unittest, fake-binary stubs on a + temp PATH, venv coverage, ≥90% branch on pure modules). (3) Added a per-phase + "coverage as a gap-finder" pass. (4) Rewrote the Failure states section to exact + final strings + evidence fields + a per-surface rendering rule, and added the + missing doctor-unfixable rows. (5) Added the "Enhancement radar" table + (v1/vNext/rejected). (6) Added the "Panel UX flow" subsection. +- *Why:* The findings close the gap between "design decided" and "a developer can + start": the harness/coverage contract, the exact UX strings, and the panel flow + are the things otherwise invented mid-code. The =pytest=→=unittest= correction + was a real defect — the spec contradicted the repo's actual test convention. +- *Artifacts:* All 31 findings =DONE=; cookie =[31/31]=. Both new tables conformed + via =wrap-org-table.el= (coverage 120, radar 110). Harness verified against the + live repo (33 unittest suites, =make test=, coverage.py absent → venv). Status + raised to "Ready for Phase 1; Ready-with-caveats overall" — no open decisions + remain. -- cgit v1.2.3