diff options
Diffstat (limited to 'docs/design')
| -rw-r--r-- | docs/design/2026-06-29-waybar-network-module-spec.org | 266 |
1 files changed, 222 insertions, 44 deletions
diff --git a/docs/design/2026-06-29-waybar-network-module-spec.org b/docs/design/2026-06-29-waybar-network-module-spec.org index db9657d..b4ba1f2 100644 --- a/docs/design/2026-06-29-waybar-network-module-spec.org +++ b/docs/design/2026-06-29-waybar-network-module-spec.org @@ -4,26 +4,51 @@ * Status -*Phase 1 SHIPPED* (2026-06-29, dotfiles =5254bd8=..=c095a22=, 10 commits): engine -(=net status/probe/diagnose/repair/doctor/portal=) + =waybar-net= indicator + -split-cadence cache + redacted event log + Makefile recovery targets + airplane -absorption. 160 net tests; pure modules ≥90% branch. One as-built deviation from -this spec: airplane absorption is *display-only* (Craig's call, option 1) — net -shows the airplane state but the =airplane-mode= low-power toggle is KEPT (it does -radios + CPU + brightness + services, not a network concern); only =waybar-airplane= -+ =custom/airplane= + =waybar-netspeed= were deleted. See decision 12. Phases 2-5 -remain. Live waybar eyeball is under todo.org "Manual testing and validation". - -Ready for Phase 1; Ready-with-caveats overall. Three Codex review rounds + Craig's -cj comments are all incorporated — every finding has a disposition and the findings -cookie reads complete ([31/31]), with no open decisions (enterprise scope settled: -open + WPA-PSK in v1, 802.1X add/edit vNext, activate-only). The cj comments -reshaped several decisions (no separate credential store — use NM's own; =net -doctor= + Makefile console-recovery in v1; rfkill + full-stack-bounce repair; -airplane module absorbed; VPN a later Phase 5). The only remaining caveats are -Phase-2/3 build unknowns named under Open items (gtk4-layer-shell anchoring, the -=captive= =--json= refactor) — not Phase-1 blockers. Phase 1 (indicator + console -recovery) is ready to build. +*Phases 1-3 SHIPPED* (2026-06-29 → 2026-06-30, dotfiles). The core module is live: +the =net= engine (=status/probe/list/up/down/add/edit/remove/rescan/diagnose/repair/ +doctor/portal/speedtest=), the =waybar-net= indicator (split-cadence cache, redacted +event log, display-only airplane absorption per decision 12), and the GTK4 +layer-shell panel (Connections / Diagnose / Repair / Speed test) with the settled bar +clicks (left = panel, middle = =net portal=, right = =net-fix=; airplane on +Super+Shift+A). 230+ net tests; full dotfiles suite green. Live-verified on velox. + +Built on top since the original spec: +- *Captive-portal login engine* (2026-06-30, dotfiles =a7d7559=) — =net portal= now + runs a native =portal-login= repair tier (drop DoT → recover the portal URL from + the redirect → open a throwaway browser profile → auto-restore DoT once online), + replacing the old shell-out to =captive= for the force-portal flow. =net portal + --restore= is the manual fallback. +- *Portal UX fixes from live testing* (2026-06-30, dotfiles =eef6b0b=) — removed a + polkit-gated =resolvectl flush-caches= that popped an auth dialog (the DoT-drop + restart already clears the cache); added an already-online short-circuit so a + forced run on a working connection opens nothing; suppressed Chrome's first-run + wizard; moved =net portal= off the terminal into the panel status line; hardened + the portal-URL extractor against Firefox's detection page. +- *Panel auto-hide + Close button* (2026-06-30, dotfiles =450b7f0=) — the panel + closes on focus-out (popup behavior, suppressed while a child dialog holds focus) + and carries a Close button bottom-right. + +*V2 redesign in flight* (designed 2026-06-30, not yet built — see todo.org "Network +panel redesign — no terminals, verify-everything, full failure coverage"). It +reverses two earlier choices and widens coverage: +- *No terminals anywhere.* =net-popup= is removed; every action and result renders + in the panel. This depends on a passwordless privileged path — a root-owned helper + plus a narrow NOPASSWD sudoers rule, archsetup-installed — because an in-panel + worker thread has no tty to prompt for a password. Reverses decision 11's + "privileged tiers run in a terminal". +- *New navigation* — top tabs Connections | Diagnostics | Performance. Diagnostics + merges Diagnose + Repair (sub-row Diagnose | Get Me Online | Advanced; a shared + area below shows diagnose items and streams repair progress; Advanced reveals the + individual repair buttons, renamed with tooltips). Speed test lives under + Performance. +- *Verify every action* (each mutating op confirms its effect before reporting + success) and *detect + respond to every failure mode* — the full ~44-mode catalog, + edge cases included, lives in the redesign task and supersedes the table below. + +Phase 4 (docs / rollout) and Phase 5 (VPN) remain. The three Codex review rounds + +Craig's cj comments from the original spec are all incorporated ([31/31], no open +decisions). Phases 1-3's manual live checks are under todo.org "Manual testing and +validation". * Goal @@ -127,12 +152,13 @@ Three layers. Keep the bar cheap, the panel rich, the logic in one tested place. speed test. How the existing pieces map in: -- =captive= (bash, shipped) — the engine shells out to it for the heavy, - interactive portal-force flow (sudo reset, DNS override, browser launch). Its - cheap portal-detection logic is mirrored natively in the engine for the fast - status path so the bar never blocks on a subprocess. =captive= stays a usable - standalone CLI. The refactor (below) extracts its probe + reset into functions - the engine can call non-interactively. +- =captive= (bash, shipped) — its cheap portal-detection logic is mirrored natively + in the engine for the fast status path so the bar never blocks on a subprocess, + and it still exposes a =--probe-json= mode the engine reuses. *As built (2026-06-30): + the force-portal flow is now native too* — =repair.py='s =portal-login= tier does + the DoT drop, portal-URL recovery, clean-browser launch, and auto-restore in + Python, so =net portal= no longer shells out to =captive= for it. =captive= stays a + usable standalone CLI. - =waybar-netspeed= (sh, shipped) — retired; its throughput sampling moves into the engine's status output and renders in the indicator tooltip only. - =nmcli= — the connection backend for every op. @@ -142,6 +168,16 @@ wrapper over =net status --json=. The bar path must stay fast (see Performance budgets), so the indicator does no network I/O itself — it reads link state and the cached connectivity result. +Privileged-path model (v2, planned): repairs that need root (rfkill unblock, nmcli +modify/up, networking off/on, =systemctl restart NetworkManager/systemd-resolved=, +resolvectl dns/revert, the DoT toggle) go through a single root-owned helper +installed by archsetup, with a narrow NOPASSWD sudoers rule scoped to that helper +only (never a blanket =mv=/=systemctl= rule). =repair.py= calls =sudo <helper> +<verb>=. This is what lets every action run in-panel with no terminal: a GTK worker +thread has no tty, so without a passwordless path it can't prompt. It also fixes a +latent bug in the shipped portal flow — the detached DoT-restore watcher runs with +no tty and silently fails to restore encrypted DNS when sudo creds aren't cached. + * Repository + dependencies - *Code lives in the dotfiles repo* (=~/.dotfiles=), not archsetup. The =net= @@ -258,8 +294,10 @@ exits non-zero with a JSON error envelope (see JSON schemas) on failure. rfkill, reset, bounce, open portal) — read-only without =--fix=, acting with it. The TTY recovery path when waybar/the GUI is down (see the Makefile targets). -- =net portal= — run =captive='s portal-force flow (reset if needed, extract + - open the portal page). +- =net portal [--restore]= — the native captive-login flow (=repair.py= =portal-login= + tier): short-circuits if already online, else drops DoT to plain DNS, recovers the + portal URL from the redirect, opens it in a throwaway browser profile, and spawns a + detached watcher that restores DoT once online. =--restore= forces the restore now. - =net speedtest [--json]= — =speedtest-go --json= run; down/up/ping. * nmcli contract @@ -433,15 +471,27 @@ and the doctor stops at any terminal one: ** DNS handling in doctor (explicit per class) - *Captive DNS hijack* — open the portal (the hijack clears on login). No DNS mutation. -- *Broken resolver, 1.1.1.1 works* — doctor offers an explicit *temporary* 1.1.1.1 - override as a repair with cleanup verification (auto-revert, =cleanup_verified=); - without =--fix= it only recommends the command. It does not leave a permanent - resolver change. +- *Broken resolver, 1.1.1.1 works* — the shipped =dns-test= repair is *diagnostic*: + it sets 1.1.1.1, confirms the venue resolver is the culprit, then auto-reverts + (=cleanup_verified=). Because it reverts, =doctor --fix= does not currently leave + you online in this case — it falls through to =upstream-not-local=, which + misreports a locally-fixable problem. *V2 fix (planned):* on a dns-test *pass* + (public DNS works), set a PERSISTENT resolver override and verify online, with an + offered revert — and classify it as its own outcome rather than upstream. - *Port-53 / egress blocked* (even 1.1.1.1 fails) — terminal =upstream-not-local=; doctor stops, since it's not locally fixable. * Failure-mode coverage +*V2 note (2026-06-30):* the authoritative, exhaustive catalog (~44 modes across 10 +connectivity layers, edge cases included, each tagged fix-and-verify or report-text) +now lives in the redesign task (todo.org "Network panel redesign"). The table below is +the v1 baseline; two rows reflect intent the shipped code doesn't yet match, and the +v2 catalog closes them: =gateway unreachable= claims a bounce that doctor never +actually reaches (a no-route failure goes straight to =upstream-not-local=), and +=broken DNS, 1.1.1.1 works= auto-reverts so the user is left offline and misreported +as upstream (the v2 persistent-override fix closes this). + For each common field failure: does =net diagnose= detect it, can =net doctor --fix= repair it, and what terminal user action remains when it can't. (The =needs-user-action= / =upstream-not-local= / =deferred/vpn= outcomes are defined @@ -519,6 +569,50 @@ above.) - *Secret-leak tests*: assert no PSK/EAP/portal-token ever appears in any JSON output, log line, or error message. +** Automatic diagnostic verbose-capture (V2) + +A distinct layer from the event log above: that log records what =net= did; +this captures what the *underlying stack* did at debug verbosity during a run, so a +failed diagnosis leaves real ground-truth instead of relying on memory. Two triggers, +one mechanism: + +- *Automatic — on a failing diagnose.* When =net diagnose= ends =overall: fail=, the + next escalation (or =Get Me Online=) runs inside a verbose-capture session. +- *Manual — a debug on/off toggle in the panel's Advanced section.* "Debug on" + elevates and leaves it elevated (with a visible "debug capturing" indicator) so the + user can reproduce an intermittent problem over time; "Debug off" restores and + writes the bundle. Useful when the failure doesn't reproduce inside one diagnose. + +Mechanism (shared): +1. *Snapshot* the current log levels (=nmcli general logging=, resolved's level, + wpa_supplicant's). +2. *Elevate* the relevant components to debug at runtime, no restarts, scoped to the + domains that matter (NM: =WIFI,DHCP,DNS,CORE=; resolved; wpa_supplicant). +3. *Run* the diagnostics / repair. +4. *Capture the window*: =journalctl= for NetworkManager + systemd-resolved + + wpa_supplicant since the run started, a =dmesg= tail (driver / firmware / rfkill), + and any =curl -v= probe output. +5. *Restore* every level to its snapshot. +6. *Write a redacted support bundle* to =$XDG_STATE_HOME/net/bundles/<ts>/= and + surface it in the panel. + +Hard requirements: +- *Restore is guaranteed and idempotent.* A =try/finally= restores even on error, + and a crash-recovery guard detects "a prior run left NM/resolved/wpa_supplicant + elevated" on the next run and puts it back — the same shape as the DoT-restore + watcher. A crash must never strand the stack at debug verbosity. +- *Redaction before anything leaves.* Raw wpa_supplicant and NM debug logs carry the + PSK and EAP credentials in cleartext. The captured journal is scrubbed before the + bundle is written, shown, or shared; the secret-leak test asserts no passphrase or + EAP secret survives into a bundle. +- *Privilege via the V2 sudo-helper.* The log-level toggles need root, so they become + verbs on the passwordless helper (decision 16) — no extra prompt. + +Bonus — this closes a real detection gap, not just observability: the spec notes live +auth-failure detection is a v1 limit (it leans on a one-shot NM state-120 snapshot). +wpa_supplicant at debug during the run is exactly how a wrong-password or EAP failure +is caught properly, so the capture feeds back into classification. + * Indicator (task #C — Phase 1, the fast win) ** States (internet sub-state on top of link state) @@ -545,17 +639,16 @@ captive / no-internet / degraded overlay glyph. SSID + signal + IPv4 + gateway + the throughput readout (absorbed from netspeed) + the last probe result and its age (stale/expired hinted). -** Interactions (phase-aware; no keyboard-modifier clicks — waybar can't qualify -clicks by modifier, so the rich actions live in the panel, not ctrl/super-click) -Clicks never block the bar: each dispatches a detached background job and reports -via =notify=, single-flight per action. -- *Phase 1 (no panel yet)*: left-click runs =net probe= + notify (refreshes the - state on demand) and keeps the existing =pypr toggle network= scratchpad as the - interim manager; right-click runs =net repair reset= in the background + - notify; middle-click runs =net portal=. -- *Phase 2+ (panel exists)*: left-click opens the panel (focused on the relevant - section — diagnostics when captive); right/middle keep the background - reset/portal shortcuts. +** Interactions (no keyboard-modifier clicks — waybar can't qualify clicks by +modifier, so the rich actions live in the panel, not ctrl/super-click) +Clicks never block the bar: each dispatches a detached background job, single-flight +per action. *As built (settled live with Craig, 2026-06-29):* +- *left* — =net-panel= toggle (pkill-or-launch the GTK panel). +- *middle* — =net portal= (the captive-login flow). +- *right* — =net-fix= (=net doctor= with =--notify=: reports the result when the + outcome is one-way, opens a terminal only when it's fixable; the v2 redesign moves + even that into the panel). +- airplane toggle moved off the bar to Super+Shift+A. * Panel (tasks #B + #C diagnostics — Phases 2-3) @@ -563,7 +656,7 @@ GTK4 + gtk4-layer-shell, pocketbook scaffold (src-layout package, unittest, Makefile, gtk4-layer-shell anchored dropdown under the bar). One panel shell, reused by the future desktop-settings panel. -Sections: +Sections as built (Phases 1-3, a four-page stack switcher): 1. *Connections* — list, MRU-first, active marked, live signal bars for in-range wifi; row click switches; buttons for add / edit / remove; a rescan control. 2. *Diagnose* (read-only) — Probe (204/captive, shows portal URL + Open), Gateway @@ -572,6 +665,16 @@ Sections: (fresh MAC), Bounce (full stack), DNS override test, Force portal. A "Get me online" button runs =net doctor --fix= (the auto-escalating sequence). 4. *Speed test* — Run button, progress, down/up/ping result + last-run line. +As built, the panel also auto-hides on focus-out (popup behavior, suppressed while a +child dialog holds focus) and carries a Close button bottom-right (2026-06-30). + +*V2 nav (planned):* three top tabs — Connections | Diagnostics | Performance. +Diagnostics merges the Diagnose and Repair pages into one: a sub-row +=Diagnose= | =Get Me Online= | =Advanced= over a shared area that shows diagnose +items and streams repair progress in-panel (no terminal). =Advanced= reveals the +individual repair tiers (renamed, with tooltips) plus a *Debug capture on/off* +toggle (the manual side of the verbose-capture feature; a failing diagnose triggers +it automatically). Speed test moves under Performance. ** Panel state, cancellation, permissions State machines for: connection-list loading, rescan-in-progress, @@ -882,8 +985,37 @@ a *coverage-gap pass*, not just a percentage: 14. VPN / WireGuard is a planned Phase 5 (Craig's call, cj), not a permanent exclusion. +V2 redesign decisions (Craig, 2026-06-30): + +15. *No terminals anywhere in the module* — =net-popup= is removed; every action and + result renders in the panel. Reverses the part of decision 11 that ran privileged + repairs in a terminal "so sudo/polkit can prompt". +16. *Passwordless privileged path* — a root-owned helper + a narrow NOPASSWD sudoers + rule scoped to it, archsetup-installed, run as =sudo <helper> <verb>=. This gates + decision 15 (a worker thread can't prompt). Absorbs the earlier DoT-toggle + follow-up and fixes the detached-restore-watcher bug. +17. *Verify every action* — each mutating op (repair, connect, forget, add, DNS + override) re-checks its effect and surfaces pass/fail in the panel. +18. *Detect + respond to every failure mode, edges included* — the full ~44-mode + catalog (todo.org redesign task) is the contract; auto-fix where safe, else report + the exact in-panel text. Includes IPv6-only awareness and multi-homing, which need + diagnose to stop being IPv4-only and single-iface. +19. *Navigation* — top tabs Connections | Diagnostics | Performance; Diagnostics + merges Diagnose + Repair (Diagnose | Get Me Online | Advanced over a shared + streaming area); Speed test under Performance. +20. *Automatic diagnostic verbose-capture* (Craig, 2026-06-30) — on a failing + diagnose, elevate the underlying stack (NM / resolved / wpa_supplicant) to debug, + capture the journal + dmesg window, restore (guaranteed + crash-guarded), and + write a redacted bundle. Plus a manual Debug on/off toggle in Advanced. Restore + bulletproof, secrets scrubbed before the bundle, log-level toggles via the V2 + helper. See Observability. + * Implementation phases +*Phases 1-3 are SHIPPED* (2026-06-29 → 2026-06-30, dotfiles); their acceptance +criteria passed and the work is live on velox. Phase 4 (docs/rollout) and Phase 5 +(VPN) remain. The V2 redesign phases at the end are designed, not yet built. + - *Phase 1 — Indicator + console recovery (task #C).* =net status= + =net probe= (native cheap probe, reusing captive's logic) + the =captive= probe refactor + =waybar-net= + the split-cadence cache (single-flight, atomic, stale classes) + @@ -921,6 +1053,29 @@ a *coverage-gap pass*, not just a percentage: tooling into the same panel + CLI (=net vpn ...=). Out of the v1 milestone; specced separately when picked up. +V2 redesign phases (designed 2026-06-30, dependency order): +- *V2.1 — Sudo helper + NOPASSWD sudoers (gates everything).* Root-owned helper + dispatching net's fixed privileged verbs, archsetup-installed, narrow sudoers. + Also fixes the detached DoT-restore-watcher bug. + - *Acceptance*: every repair runs passwordless in-panel on a non-NOPASSWD machine; + the sudoers rule is scoped to the helper only. +- *V2.2 — Merged Diagnostics panel + nav restructure.* Connections | Diagnostics | + Performance; the Diagnostics sub-row + shared streaming area; Advanced reveal + + tooltips; delete =net-popup=. + - *Acceptance*: no terminal opens for any action; repair progress streams in the + panel; Speed test lives under Performance. +- *V2.3 — IPv6-aware and multi-homing-aware diagnose.* Stop treating no-IPv4 as a + failure when online over IPv6; identify which interface owns the default route. +- *V2.4 — Close every detect/correct gap in the catalog, with post-action + verification.* Work the redesign-task catalog mode by mode. +- *V2.5 — Automatic diagnostic verbose-capture.* Snapshot/elevate/capture/restore + around a failing diagnose + the Advanced Debug on/off toggle; guaranteed + + crash-guarded restore; redacted support bundle; helper log-level verbs. + - *Acceptance*: a failing diagnose leaves a redacted bundle (NM/resolved/ + wpa_supplicant journal + dmesg) and restores every log level; a crash mid-capture + is detected and restored on the next run; the secret-leak test finds no PSK/EAP in + a bundle; the toggle elevates and restores on demand. + * Open items / risks - gtk4-layer-shell dropdown anchoring under a waybar module needs the same @@ -1599,3 +1754,26 @@ online" escalation reporting, what survives panel close, and keyboard nav. live repo (33 unittest suites, =make test=, coverage.py absent → venv). Status raised to "Ready for Phase 1; Ready-with-caveats overall" — no open decisions remain. + +** 2026-06-30 Tue @ 16:32:07 -0400 — Claude Code (archsetup) — responder (build + V2 redesign) + +- *What changed:* Brought the spec current with what shipped and what got decided + across 2026-06-29/30. Recorded Phases 1-3 as SHIPPED (engine, indicator, GTK4 + panel, bar clicks). Added the native captive-login engine (=portal-login= repair + tier replacing the =captive= shell-out), the live-testing portal UX fixes (removed + the polkit-gated flush, already-online short-circuit, Chrome first-run suppression, + in-panel portal, extractor hardening), and the panel auto-hide + Close button. + Then folded in the V2 redesign Craig directed: no terminals anywhere, a passwordless + root-helper + NOPASSWD sudoers as the enabler, verify-every-action, the full + ~44-mode failure catalog (now the authoritative coverage, in todo.org), and the + Connections | Diagnostics | Performance nav. Added decisions 15-19, V2 phases + 2.1-2.4, and corrected the two failure-table rows the shipped code never matched + (gateway-bounce, venue-DNS auto-revert). +- *Why:* The spec had drifted behind the code (still said "Phase 1 shipped, terminal + for privileged ops") and behind Craig's redesign calls. The doc is the contract for + the V2 build, so it has to state the no-terminal reversal, the sudo-helper + dependency, and where coverage now lives. +- *Artifacts:* Updated Status, Architecture, Engine CLI surface, Doctor DNS handling, + Failure-mode coverage, Interactions, Panel, Resolved decisions, Implementation + phases. Shipped commits referenced: dotfiles =a7d7559=, =eef6b0b=, =450b7f0=. Full + catalog + redesign children in todo.org "Network panel redesign". |
