author: Craig Jennings <c@cjennings.net> 2026-06-30 16:51:21 -0400
committer: Craig Jennings <c@cjennings.net> 2026-06-30 16:51:21 -0400
commit: 07ffe1d961e55fc4758b32269a27855a1075ea14
tree: c88386baf39af54327b2a0f21084660ee1860a2e /docs
parent: f9c9df6dc079f39b304938fb7fd795df7c319120
docs: bring network module spec current + add diagnostic verbose-capture
The spec had drifted behind the code and the redesign. Marked Phases 1-3 shipped, recorded the native captive-login engine and the live-testing portal UX fixes, and folded in the V2 redesign: no terminals, the passwordless sudo-helper, verify-every-action, the Connections/Diagnostics/Performance nav, and the full failure-mode catalog moving to the task.

Added the automatic diagnostic verbose-capture feature. On a failing diagnose it elevates the underlying stack (NetworkManager, resolved, wpa_supplicant) to debug, captures the journal and dmesg window, restores with a guaranteed crash-guarded path, and writes a redacted bundle. A manual Debug on/off toggle covers intermittent failures. The redesign task gains a child for it.
Diffstat (limited to 'docs')
-rw-r--r-- docs/design/2026-06-29-waybar-network-module-spec.org | 266
1 files changed, 222 insertions, 44 deletions
diff --git a/docs/design/2026-06-29-waybar-network-module-spec.org b/docs/design/2026-06-29-waybar-network-module-spec.org
index db9657d..b4ba1f2 100644
--- a/docs/design/2026-06-29-waybar-network-module-spec.org
+++ b/docs/design/2026-06-29-waybar-network-module-spec.org
@@ -4,26 +4,51 @@
* Status
-*Phase 1 SHIPPED* (2026-06-29, dotfiles =5254bd8=..=c095a22=, 10 commits): engine
-(=net status/probe/diagnose/repair/doctor/portal=) + =waybar-net= indicator +
-split-cadence cache + redacted event log + Makefile recovery targets + airplane
-absorption. 160 net tests; pure modules ≥90% branch. One as-built deviation from
-this spec: airplane absorption is *display-only* (Craig's call, option 1) — net
-shows the airplane state but the =airplane-mode= low-power toggle is KEPT (it does
-radios + CPU + brightness + services, not a network concern); only =waybar-airplane=
-+ =custom/airplane= + =waybar-netspeed= were deleted. See decision 12. Phases 2-5
-remain. Live waybar eyeball is under todo.org "Manual testing and validation".
-
-Ready for Phase 1; Ready-with-caveats overall. Three Codex review rounds + Craig's
-cj comments are all incorporated — every finding has a disposition and the findings
-cookie reads complete ([31/31]), with no open decisions (enterprise scope settled:
-open + WPA-PSK in v1, 802.1X add/edit vNext, activate-only). The cj comments
-reshaped several decisions (no separate credential store — use NM's own; =net
-doctor= + Makefile console-recovery in v1; rfkill + full-stack-bounce repair;
-airplane module absorbed; VPN a later Phase 5). The only remaining caveats are
-Phase-2/3 build unknowns named under Open items (gtk4-layer-shell anchoring, the
-=captive= =--json= refactor) — not Phase-1 blockers. Phase 1 (indicator + console
-recovery) is ready to build.
+*Phases 1-3 SHIPPED* (2026-06-29 → 2026-06-30, dotfiles). The core module is live:
+the =net= engine (=status/probe/list/up/down/add/edit/remove/rescan/diagnose/repair/
+doctor/portal/speedtest=), the =waybar-net= indicator (split-cadence cache, redacted
+event log, display-only airplane absorption per decision 12), and the GTK4
+layer-shell panel (Connections / Diagnose / Repair / Speed test) with the settled bar
+clicks (left = panel, middle = =net portal=, right = =net-fix=; airplane on
+Super+Shift+A). 230+ net tests; full dotfiles suite green. Live-verified on velox.
+
+Built on top since the original spec:
+- *Captive-portal login engine* (2026-06-30, dotfiles =a7d7559=) — =net portal= now
+ runs a native =portal-login= repair tier (drop DoT → recover the portal URL from
+ the redirect → open a throwaway browser profile → auto-restore DoT once online),
+ replacing the old shell-out to =captive= for the force-portal flow. =net portal
+ --restore= is the manual fallback.
+- *Portal UX fixes from live testing* (2026-06-30, dotfiles =eef6b0b=) — removed a
+ polkit-gated =resolvectl flush-caches= that popped an auth dialog (the DoT-drop
+ restart already clears the cache); added an already-online short-circuit so a
+ forced run on a working connection opens nothing; suppressed Chrome's first-run
+ wizard; moved =net portal= off the terminal into the panel status line; hardened
+ the portal-URL extractor against Firefox's detection page.
+- *Panel auto-hide + Close button* (2026-06-30, dotfiles =450b7f0=) — the panel
+ closes on focus-out (popup behavior, suppressed while a child dialog holds focus)
+ and carries a Close button bottom-right.
+
+*V2 redesign in flight* (designed 2026-06-30, not yet built — see todo.org "Network
+panel redesign — no terminals, verify-everything, full failure coverage"). It
+reverses two earlier choices and widens coverage:
+- *No terminals anywhere.* =net-popup= is removed; every action and result renders
+ in the panel. This depends on a passwordless privileged path — a root-owned helper
+ plus a narrow NOPASSWD sudoers rule, archsetup-installed — because an in-panel
+ worker thread has no tty to prompt for a password. Reverses decision 11's
+ "privileged tiers run in a terminal".
+- *New navigation* — top tabs Connections | Diagnostics | Performance. Diagnostics
+ merges Diagnose + Repair (sub-row Diagnose | Get Me Online | Advanced; a shared
+ area below shows diagnose items and streams repair progress; Advanced reveals the
+ individual repair buttons, renamed with tooltips). Speed test lives under
+ Performance.
+- *Verify every action* (each mutating op confirms its effect before reporting
+ success) and *detect + respond to every failure mode* — the full ~44-mode catalog,
+ edge cases included, lives in the redesign task and supersedes the table below.
+
+Phase 4 (docs / rollout) and Phase 5 (VPN) remain. The three Codex review rounds +
+Craig's cj comments from the original spec are all incorporated ([31/31], no open
+decisions). Phases 1-3's manual live checks are under todo.org "Manual testing and
+validation".
* Goal
@@ -127,12 +152,13 @@ Three layers. Keep the bar cheap, the panel rich, the logic in one tested place.
speed test.
How the existing pieces map in:
-- =captive= (bash, shipped) — the engine shells out to it for the heavy,
- interactive portal-force flow (sudo reset, DNS override, browser launch). Its
- cheap portal-detection logic is mirrored natively in the engine for the fast
- status path so the bar never blocks on a subprocess. =captive= stays a usable
- standalone CLI. The refactor (below) extracts its probe + reset into functions
- the engine can call non-interactively.
+- =captive= (bash, shipped) — its cheap portal-detection logic is mirrored natively
+ in the engine for the fast status path so the bar never blocks on a subprocess,
+ and it still exposes a =--probe-json= mode the engine reuses. *As built (2026-06-30):
+ the force-portal flow is now native too* — =repair.py='s =portal-login= tier does
+ the DoT drop, portal-URL recovery, clean-browser launch, and auto-restore in
+ Python, so =net portal= no longer shells out to =captive= for it. =captive= stays a
+ usable standalone CLI.
- =waybar-netspeed= (sh, shipped) — retired; its throughput sampling moves into
the engine's status output and renders in the indicator tooltip only.
- =nmcli= — the connection backend for every op.
@@ -142,6 +168,16 @@ wrapper over =net status --json=. The bar path must stay fast (see Performance
budgets), so the indicator does no network I/O itself — it reads link state and
the cached connectivity result.
+Privileged-path model (v2, planned): repairs that need root (rfkill unblock, nmcli
+modify/up, networking off/on, =systemctl restart NetworkManager/systemd-resolved=,
+resolvectl dns/revert, the DoT toggle) go through a single root-owned helper
+installed by archsetup, with a narrow NOPASSWD sudoers rule scoped to that helper
+only (never a blanket =mv=/=systemctl= rule). =repair.py= calls =sudo <helper>
+<verb>=. This is what lets every action run in-panel with no terminal: a GTK worker
+thread has no tty, so without a passwordless path it can't prompt. It also fixes a
+latent bug in the shipped portal flow — the detached DoT-restore watcher runs with
+no tty and silently fails to restore encrypted DNS when sudo creds aren't cached.
+
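Editor's sketch (not part of the commit): one way the helper's verb dispatch could look, assuming a fixed allow-list that maps each verb to an exact argv. The verb names, the `VERBS` table, and `resolve_verb` are hypothetical; the spec only fixes the `sudo <helper> <verb>` shape and the list of privileged operations. The commands themselves (`rfkill unblock all`, `nmcli networking off|on`, `systemctl restart ...`) are the standard CLIs.

```python
import subprocess
import sys

# Hypothetical verb table: each verb maps to one fixed argv, so a caller
# holding the NOPASSWD rule can never pass extra arguments or reach an
# arbitrary command through the helper.
VERBS = {
    "rfkill-unblock": ["rfkill", "unblock", "all"],
    "networking-off": ["nmcli", "networking", "off"],
    "networking-on": ["nmcli", "networking", "on"],
    "restart-nm": ["systemctl", "restart", "NetworkManager"],
    "restart-resolved": ["systemctl", "restart", "systemd-resolved"],
}


def resolve_verb(argv):
    """Return the fixed command for one helper invocation, or raise ValueError."""
    if len(argv) != 1:
        raise ValueError("usage: <helper> <verb>")
    verb = argv[0]
    if verb not in VERBS:
        raise ValueError(f"unknown verb: {verb!r}")
    return list(VERBS[verb])  # copy, so callers cannot mutate the table


def main(argv):
    """Entry point the root-owned helper would call with sys.argv[1:]."""
    try:
        cmd = resolve_verb(argv)
    except ValueError as exc:
        print(exc, file=sys.stderr)
        return 2
    # No shell, no caller-supplied arguments: only the table's argv runs.
    return subprocess.run(cmd, check=False).returncode
```

Keeping the verb-to-argv mapping inside the helper, rather than in the sudoers rule, is what lets the rule stay scoped to a single binary.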
* Repository + dependencies
- *Code lives in the dotfiles repo* (=~/.dotfiles=), not archsetup. The =net=
@@ -258,8 +294,10 @@ exits non-zero with a JSON error envelope (see JSON schemas) on failure.
rfkill, reset, bounce, open portal) — read-only without =--fix=, acting with
it. The TTY recovery path when waybar/the GUI is down (see the Makefile
targets).
-- =net portal= — run =captive='s portal-force flow (reset if needed, extract +
- open the portal page).
+- =net portal [--restore]= — the native captive-login flow (=repair.py= =portal-login=
+ tier): short-circuits if already online, else drops DoT to plain DNS, recovers the
+ portal URL from the redirect, opens it in a throwaway browser profile, and spawns a
+ detached watcher that restores DoT once online. =--restore= forces the restore now.
- =net speedtest [--json]= — =speedtest-go --json= run; down/up/ping.
* nmcli contract
@@ -433,15 +471,27 @@ and the doctor stops at any terminal one:
** DNS handling in doctor (explicit per class)
- *Captive DNS hijack* — open the portal (the hijack clears on login). No DNS
mutation.
-- *Broken resolver, 1.1.1.1 works* — doctor offers an explicit *temporary* 1.1.1.1
- override as a repair with cleanup verification (auto-revert, =cleanup_verified=);
- without =--fix= it only recommends the command. It does not leave a permanent
- resolver change.
+- *Broken resolver, 1.1.1.1 works* — the shipped =dns-test= repair is *diagnostic*:
+ it sets 1.1.1.1, confirms the venue resolver is the culprit, then auto-reverts
+ (=cleanup_verified=). Because it reverts, =doctor --fix= does not currently leave
+ you online in this case — it falls through to =upstream-not-local=, which
+ misreports a locally-fixable problem. *V2 fix (planned):* on a dns-test *pass*
+ (public DNS works), set a PERSISTENT resolver override and verify online, with an
+ offered revert — and classify it as its own outcome rather than upstream.
- *Port-53 / egress blocked* (even 1.1.1.1 fails) — terminal =upstream-not-local=;
doctor stops, since it's not locally fixable.
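Editor's sketch (not part of the commit): the three DNS classes above as a small decision function, under the planned V2 behavior. The function name, its boolean inputs, and the outcome labels other than `upstream-not-local` are hypothetical; in particular the spec does not name the new outcome for the persistent-override case.

```python
def classify_dns(portal_detected, venue_resolver_ok, public_dns_ok):
    """Return (outcome, mutate_dns) for one doctor pass.

    portal_detected   -- the probe saw a captive redirect (DNS hijack class)
    venue_resolver_ok -- the network's own resolver answers correctly
    public_dns_ok     -- a lookup through 1.1.1.1 succeeds
    """
    if portal_detected:
        # Captive hijack: open the portal, never touch DNS.
        return ("open-portal", False)
    if venue_resolver_ok:
        return ("dns-ok", False)
    if public_dns_ok:
        # V2 (planned): keep a persistent override and report it as its own
        # outcome instead of falling through to upstream-not-local.
        return ("dns-override-applied", True)
    # Even 1.1.1.1 fails: port 53 or egress is blocked, not locally fixable.
    return ("upstream-not-local", False)
```

The portal check comes first because a hijacking resolver can look "broken" to the other two tests even though logging in is the right fix.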
* Failure-mode coverage
+*V2 note (2026-06-30):* the authoritative, exhaustive catalog (~44 modes across 10
+connectivity layers, edge cases included, each tagged fix-and-verify or report-text)
+now lives in the redesign task (todo.org "Network panel redesign"). The table below is
+the v1 baseline; two rows reflect intent the shipped code doesn't yet match, and the
+v2 catalog closes them: =gateway unreachable= claims a bounce that doctor never
+actually reaches (a no-route failure goes straight to =upstream-not-local=), and
+=broken DNS, 1.1.1.1 works= auto-reverts so the user is left offline and misreported
+as upstream (the v2 persistent-override fix closes this).
+
For each common field failure: does =net diagnose= detect it, can =net doctor
--fix= repair it, and what terminal user action remains when it can't. (The
=needs-user-action= / =upstream-not-local= / =deferred/vpn= outcomes are defined
@@ -519,6 +569,50 @@ above.)
- *Secret-leak tests*: assert no PSK/EAP/portal-token ever appears in any JSON
output, log line, or error message.
+** Automatic diagnostic verbose-capture (V2)
+
+A distinct layer from the event log above: that log records what =net= did;
+this captures what the *underlying stack* did at debug verbosity during a run, so a
+failed diagnosis leaves real ground-truth instead of relying on memory. Two triggers,
+one mechanism:
+
+- *Automatic — on a failing diagnose.* When =net diagnose= ends =overall: fail=, the
+ next escalation (or =Get Me Online=) runs inside a verbose-capture session.
+- *Manual — a debug on/off toggle in the panel's Advanced section.* "Debug on"
+ elevates and leaves it elevated (with a visible "debug capturing" indicator) so the
+ user can reproduce an intermittent problem over time; "Debug off" restores and
+ writes the bundle. Useful when the failure doesn't reproduce inside one diagnose.
+
+Mechanism (shared):
+1. *Snapshot* the current log levels (=nmcli general logging=, resolved's level,
+ wpa_supplicant's).
+2. *Elevate* the relevant components to debug at runtime, no restarts, scoped to the
+ domains that matter (NM: =WIFI,DHCP,DNS,CORE=; resolved; wpa_supplicant).
+3. *Run* the diagnostics / repair.
+4. *Capture the window*: =journalctl= for NetworkManager + systemd-resolved +
+ wpa_supplicant since the run started, a =dmesg= tail (driver / firmware / rfkill),
+ and any =curl -v= probe output.
+5. *Restore* every level to its snapshot.
+6. *Write a redacted support bundle* to =$XDG_STATE_HOME/net/bundles/<ts>/= and
+ surface it in the panel.
+
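Editor's sketch (not part of the commit): the snapshot / elevate / run / restore steps above as a Python context manager with a marker file for crash recovery. All names here (`verbose_capture`, `recover_stranded`, the marker path, the level dictionaries) are hypothetical, and the real level getters and setters (for example `nmcli general logging` through the helper) are injected as callables so the shape can be exercised without root.

```python
import json
from contextlib import contextmanager
from pathlib import Path


def recover_stranded(marker, set_levels):
    """Crash guard: if a prior run left a marker, restore its snapshot."""
    if not marker.exists():
        return False
    set_levels(json.loads(marker.read_text()))
    marker.unlink()
    return True


@contextmanager
def verbose_capture(marker, get_levels, set_levels, debug_levels):
    """Elevate log levels for the with-body and always put them back."""
    recover_stranded(marker, set_levels)     # heal a previous crash first
    snapshot = get_levels()                  # step 1: snapshot
    marker.write_text(json.dumps(snapshot))  # persisted before elevating
    try:
        set_levels(debug_levels)             # step 2: elevate
        yield snapshot                       # steps 3-4 run in the with-body
    finally:
        set_levels(snapshot)                 # step 5: restore, even on error
        marker.unlink(missing_ok=True)       # only cleared after a restore
```

Writing the marker before elevating means a crash at any later point leaves enough on disk for the next run to restore; if the restore itself raises, the marker survives for the same reason.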
+Hard requirements:
+- *Restore is guaranteed and idempotent.* A =try/finally= restores even on error,
+ and a crash-recovery guard detects "a prior run left NM/resolved/wpa_supplicant
+ elevated" on the next run and puts it back — the same shape as the DoT-restore
+ watcher. A crash must never strand the stack at debug verbosity.
+- *Redaction before anything leaves.* Raw wpa_supplicant and NM debug logs carry the
+ PSK and EAP credentials in cleartext. The captured journal is scrubbed before the
+ bundle is written, shown, or shared; the secret-leak test asserts no passphrase or
+ EAP secret survives into a bundle.
+- *Privilege via the V2 sudo-helper.* The log-level toggles need root, so they become
+ verbs on the passwordless helper (decision 16) — no extra prompt.
+
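Editor's sketch (not part of the commit): a minimal redaction pass plus the leak assertion the second requirement describes. The key names in the pattern and the `key=value` / `key: value` log shapes are assumptions for illustration; real wpa_supplicant and NetworkManager debug output would need its own audited pattern set.

```python
import re

# Assumed secret-bearing keys; matches key=value or key: value, where the
# value is either a double-quoted string or a run of non-space characters.
_SECRET = re.compile(
    r'(?i)\b(psk|password|passphrase|wep-key\d?)(\s*[=:]\s*)("[^"]*"|\S+)'
)


def redact(line):
    """Replace the value of any recognised secret key with a placeholder."""
    return _SECRET.sub(lambda m: m.group(1) + m.group(2) + "<redacted>", line)


def assert_no_secret(bundle_text, secrets):
    """Secret-leak check: fail if any known secret survives in the bundle."""
    leaked = sum(1 for s in secrets if s and s in bundle_text)
    if leaked:
        # Report a count only, so the failure message cannot leak the secret.
        raise AssertionError(f"{leaked} secret(s) found in bundle")
```

Pattern-based scrubbing only catches the shapes it knows, which is why the leak test checks against the actual configured secrets rather than trusting the pattern.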
+Bonus — this closes a real detection gap, not just observability: the spec notes live
+auth-failure detection is a v1 limit (it leans on a one-shot NM state-120 snapshot).
+wpa_supplicant at debug during the run is exactly how a wrong-password or EAP failure
+is caught properly, so the capture feeds back into classification.
+
* Indicator (task #C — Phase 1, the fast win)
** States (internet sub-state on top of link state)
@@ -545,17 +639,16 @@ captive / no-internet / degraded overlay glyph.
SSID + signal + IPv4 + gateway + the throughput readout (absorbed from
netspeed) + the last probe result and its age (stale/expired hinted).
-** Interactions (phase-aware; no keyboard-modifier clicks — waybar can't qualify
-clicks by modifier, so the rich actions live in the panel, not ctrl/super-click)
-Clicks never block the bar: each dispatches a detached background job and reports
-via =notify=, single-flight per action.
-- *Phase 1 (no panel yet)*: left-click runs =net probe= + notify (refreshes the
- state on demand) and keeps the existing =pypr toggle network= scratchpad as the
- interim manager; right-click runs =net repair reset= in the background +
- notify; middle-click runs =net portal=.
-- *Phase 2+ (panel exists)*: left-click opens the panel (focused on the relevant
- section — diagnostics when captive); right/middle keep the background
- reset/portal shortcuts.
+** Interactions (no keyboard-modifier clicks — waybar can't qualify clicks by
+modifier, so the rich actions live in the panel, not ctrl/super-click)
+Clicks never block the bar: each dispatches a detached background job, single-flight
+per action. *As built (settled live with Craig, 2026-06-29):*
+- *left* — =net-panel= toggle (pkill-or-launch the GTK panel).
+- *middle* — =net portal= (the captive-login flow).
+- *right* — =net-fix= (=net doctor= with =--notify=: reports the result when the
+ outcome is one-way, opens a terminal only when it's fixable; the v2 redesign moves
+ even that into the panel).
+- airplane toggle moved off the bar to Super+Shift+A.
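Editor's sketch (not part of the commit): one way to get "single-flight per action" for the detached click jobs, assuming a per-action lock file and a non-blocking `flock`. The function name and lock-path scheme are hypothetical; the spec does not say how single-flight is implemented.

```python
import fcntl
import os


def try_single_flight(lock_path):
    """Return an open fd holding the lock, or None if a job is already running.

    The caller keeps the fd open for the life of the job; closing it (or the
    process exiting) releases the lock.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None  # another click on the same action is still in flight
    return fd
```

Because the kernel drops the lock when the holder exits, a crashed job cannot leave the action permanently blocked.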
* Panel (tasks #B + #C diagnostics — Phases 2-3)
@@ -563,7 +656,7 @@ GTK4 + gtk4-layer-shell, pocketbook scaffold (src-layout package, unittest,
Makefile, gtk4-layer-shell anchored dropdown under the bar). One panel shell,
reused by the future desktop-settings panel.
-Sections:
+Sections as built (Phases 1-3, a four-page stack switcher):
1. *Connections* — list, MRU-first, active marked, live signal bars for in-range
wifi; row click switches; buttons for add / edit / remove; a rescan control.
2. *Diagnose* (read-only) — Probe (204/captive, shows portal URL + Open), Gateway
@@ -572,6 +665,16 @@ Sections:
(fresh MAC), Bounce (full stack), DNS override test, Force portal. A "Get me
online" button runs =net doctor --fix= (the auto-escalating sequence).
4. *Speed test* — Run button, progress, down/up/ping result + last-run line.
+As built, the panel also auto-hides on focus-out (popup behavior, suppressed while a
+child dialog holds focus) and carries a Close button bottom-right (2026-06-30).
+
+*V2 nav (planned):* three top tabs — Connections | Diagnostics | Performance.
+Diagnostics merges the Diagnose and Repair pages into one: a sub-row
+=Diagnose= | =Get Me Online= | =Advanced= over a shared area that shows diagnose
+items and streams repair progress in-panel (no terminal). =Advanced= reveals the
+individual repair tiers (renamed, with tooltips) plus a *Debug capture on/off*
+toggle (the manual side of the verbose-capture feature; a failing diagnose triggers
+it automatically). Speed test moves under Performance.
** Panel state, cancellation, permissions
State machines for: connection-list loading, rescan-in-progress,
@@ -882,8 +985,37 @@ a *coverage-gap pass*, not just a percentage:
14. VPN / WireGuard is a planned Phase 5 (Craig's call, cj), not a permanent
exclusion.
+V2 redesign decisions (Craig, 2026-06-30):
+
+15. *No terminals anywhere in the module* — =net-popup= is removed; every action and
+ result renders in the panel. Reverses the part of decision 11 that ran privileged
+ repairs in a terminal "so sudo/polkit can prompt".
+16. *Passwordless privileged path* — a root-owned helper + a narrow NOPASSWD sudoers
+ rule scoped to it, archsetup-installed, run as =sudo <helper> <verb>=. This gates
+ decision 15 (a worker thread can't prompt). Absorbs the earlier DoT-toggle
+ follow-up and fixes the detached-restore-watcher bug.
+17. *Verify every action* — each mutating op (repair, connect, forget, add, DNS
+ override) re-checks its effect and surfaces pass/fail in the panel.
+18. *Detect + respond to every failure mode, edges included* — the full ~44-mode
+ catalog (todo.org redesign task) is the contract; auto-fix where safe, else report
+ the exact in-panel text. Includes IPv6-only awareness and multi-homing, which need
+ diagnose to stop being IPv4-only and single-iface.
+19. *Navigation* — top tabs Connections | Diagnostics | Performance; Diagnostics
+ merges Diagnose + Repair (Diagnose | Get Me Online | Advanced over a shared
+ streaming area); Speed test under Performance.
+20. *Automatic diagnostic verbose-capture* (Craig, 2026-06-30) — on a failing
+ diagnose, elevate the underlying stack (NM / resolved / wpa_supplicant) to debug,
+ capture the journal + dmesg window, restore (guaranteed + crash-guarded), and
+ write a redacted bundle. Plus a manual Debug on/off toggle in Advanced. Restore
+ bulletproof, secrets scrubbed before the bundle, log-level toggles via the V2
+ helper. See Observability.
+
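Editor's sketch (not part of the commit): decision 17 as an act-then-verify wrapper. `run_verified` and its parameters are hypothetical; the action and the check are injected callables (for example "bring the connection up" and "is it now active"), and the retry count allows for effects that take a moment to land.

```python
import time


def run_verified(action, verify, attempts=3, delay=0.5, sleep=time.sleep):
    """Run a mutating action, then confirm its effect before reporting success.

    Returns True only if verify() passes within `attempts` checks; the exit
    status of the action alone is never treated as success.
    """
    action()
    for i in range(attempts):
        if verify():
            return True
        if i < attempts - 1:
            sleep(delay)  # give the stack time to settle before re-checking
    return False
```

The panel would surface the boolean as the pass/fail result for that action.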
* Implementation phases
+*Phases 1-3 are SHIPPED* (2026-06-29 → 2026-06-30, dotfiles); their acceptance
+criteria passed and the work is live on velox. Phase 4 (docs/rollout) and Phase 5
+(VPN) remain. The V2 redesign phases at the end are designed, not yet built.
+
- *Phase 1 — Indicator + console recovery (task #C).* =net status= + =net probe=
(native cheap probe, reusing captive's logic) + the =captive= probe refactor +
=waybar-net= + the split-cadence cache (single-flight, atomic, stale classes) +
@@ -921,6 +1053,29 @@ a *coverage-gap pass*, not just a percentage:
tooling into the same panel + CLI (=net vpn ...=). Out of the v1 milestone;
specced separately when picked up.
+V2 redesign phases (designed 2026-06-30, dependency order):
+- *V2.1 — Sudo helper + NOPASSWD sudoers (gates everything).* Root-owned helper
+ dispatching net's fixed privileged verbs, archsetup-installed, narrow sudoers.
+ Also fixes the detached DoT-restore-watcher bug.
+ - *Acceptance*: every repair runs passwordless in-panel on a non-NOPASSWD machine;
+ the sudoers rule is scoped to the helper only.
+- *V2.2 — Merged Diagnostics panel + nav restructure.* Connections | Diagnostics |
+ Performance; the Diagnostics sub-row + shared streaming area; Advanced reveal +
+ tooltips; delete =net-popup=.
+ - *Acceptance*: no terminal opens for any action; repair progress streams in the
+ panel; Speed test lives under Performance.
+- *V2.3 — IPv6-aware and multi-homing-aware diagnose.* Stop treating no-IPv4 as a
+ failure when online over IPv6; identify which interface owns the default route.
+- *V2.4 — Close every detect/correct gap in the catalog, with post-action
+ verification.* Work the redesign-task catalog mode by mode.
+- *V2.5 — Automatic diagnostic verbose-capture.* Snapshot/elevate/capture/restore
+ around a failing diagnose + the Advanced Debug on/off toggle; guaranteed +
+ crash-guarded restore; redacted support bundle; helper log-level verbs.
+ - *Acceptance*: a failing diagnose leaves a redacted bundle (NM/resolved/
+ wpa_supplicant journal + dmesg) and restores every log level; a crash mid-capture
+ is detected and restored on the next run; the secret-leak test finds no PSK/EAP in
+ a bundle; the toggle elevates and restores on demand.
+
* Open items / risks
- gtk4-layer-shell dropdown anchoring under a waybar module needs the same
@@ -1599,3 +1754,26 @@ online" escalation reporting, what survives panel close, and keyboard nav.
live repo (33 unittest suites, =make test=, coverage.py absent → venv). Status
raised to "Ready for Phase 1; Ready-with-caveats overall" — no open decisions
remain.
+
+** 2026-06-30 Tue @ 16:32:07 -0400 — Claude Code (archsetup) — responder (build + V2 redesign)
+
+- *What changed:* Brought the spec current with what shipped and what got decided
+ across 2026-06-29/30. Recorded Phases 1-3 as SHIPPED (engine, indicator, GTK4
+ panel, bar clicks). Added the native captive-login engine (=portal-login= repair
+ tier replacing the =captive= shell-out), the live-testing portal UX fixes (removed
+ the polkit-gated flush, already-online short-circuit, Chrome first-run suppression,
+ in-panel portal, extractor hardening), and the panel auto-hide + Close button.
+ Then folded in the V2 redesign Craig directed: no terminals anywhere, a passwordless
+ root-helper + NOPASSWD sudoers as the enabler, verify-every-action, the full
+ ~44-mode failure catalog (now the authoritative coverage, in todo.org), and the
+ Connections | Diagnostics | Performance nav. Added decisions 15-19, V2 phases
+ 2.1-2.4, and corrected the two failure-table rows the shipped code never matched
+ (gateway-bounce, venue-DNS auto-revert).
+- *Why:* The spec had drifted behind the code (still said "Phase 1 shipped, terminal
+ for privileged ops") and behind Craig's redesign calls. The doc is the contract for
+ the V2 build, so it has to state the no-terminal reversal, the sudo-helper
+ dependency, and where coverage now lives.
+- *Artifacts:* Updated Status, Architecture, Engine CLI surface, Doctor DNS handling,
+ Failure-mode coverage, Interactions, Panel, Resolved decisions, Implementation
+ phases. Shipped commits referenced: dotfiles =a7d7559=, =eef6b0b=, =450b7f0=. Full
+ catalog + redesign children in todo.org "Network panel redesign".