4 files changed, 2195 insertions, 0 deletions
diff --git a/docs/design/2026-06-25-testinfra-validation.org b/docs/design/2026-06-25-testinfra-validation.org
new file mode 100644
index 0000000..5c82aa2
--- /dev/null
+++ b/docs/design/2026-06-25-testinfra-validation.org
@@ -0,0 +1,238 @@
+#+TITLE: Design: Testinfra Post-Install Validation for archsetup
+#+AUTHOR: Craig Jennings
+#+DATE: 2026-06-25
+#+STATUS: Accepted (2026-06-25)
+
+* Problem
+
+The VM integration harness (=scripts/testing/run-test.sh=) runs archsetup in a
+QEMU VM, then verifies the result two ways:
+
+1. Parses archsetup's own install log for its Error Summary and the
+   =ARCHSETUP_EXECUTION_COMPLETE= marker (did the script finish, did it log
+   errors).
+2. Runs =run_all_validations= from =scripts/testing/lib/validation.sh= — a
+   hand-rolled, shell-based post-install assertion sweep of ~26 checks over SSH.
+
+The shell sweep works, but each check is 6-40 lines of =ssh_cmd= +
+=validation_pass/fail= + =attribute_issue= boilerplate, the pass/fail counters
+are hand-maintained globals, and the reporting is bespoke. Adding or reading a
+check is heavier than it should be, and growing the suite (archsetup configures
+far more than the 26 checks cover) compounds that weight.
+
+This doc proposes porting the post-install validation to Testinfra (Python +
+pytest) for more expressive checks and better reporting, then growing coverage.
+
+* Decision
+
+Port the post-install validation layer to Testinfra + pytest, reaching parity
+with the existing =validation.sh= sweep, then expand coverage. Recorded
+rationale: the up-front port cost (parity rewrite + a test-only dependency) is
+an accepted trade — the priority is a robust, well-reported, growing validation
+suite over feature speed. The framework swap alone buys ergonomics and
+reporting, not coverage, so it is paired with real new coverage (below).
+
+This replaces the shell sweep; it does not touch archsetup's own install-log
+parsing (that stays as a separate signal). The full coverage expansion (P4)
+lands in this task too, sequenced strictly after the parity cutover so the
+parity verification stays clean.
+
+* Current harness (what exists today)
+
+** Flow (run-test.sh)
+1. Revert VM to base snapshot, boot, wait for SSH.
+2. =capture_pre_install_state=.
+3. Bundle + copy archsetup + dotfiles into the VM, run archsetup in background,
+   poll to completion.
+4. =capture_post_install_state=.
+5. =run_all_validations= (the shell sweep).
+6. =analyze_log_diff= + =generate_issue_report= (issue attribution).
+7. Explicit pass/fail exit code; cleanup.
+
+** The shell sweep (validation.sh)
+~26 checks under =run_all_validations=: user created / shell / groups, dotfiles,
+yay, pacman working, window manager, firewall, DNS, avahi, fail2ban,
+NetworkManager, emacs, git config, dev tools, zfs, boot config, autologin,
+gnome-keyring, terminus font, mkinitcpio hooks, initramfs consolefont, nvme
+module, archsetup log, state markers.
+
+** Issue attribution
+=attribute_issue <msg> <bucket>= sorts each failure into one of three arrays —
+=ARCHSETUP_ISSUES=, =BASE_INSTALL_ISSUES=, =UNKNOWN_ISSUES= — and
+=generate_issue_report= writes them out (base-install issues route to the
+archzfs inbox). This is domain logic Testinfra has no equivalent for; the port
+must preserve it.
+
+** Connection
+=ssh_cmd= uses =sshpass -p "$ROOT_PASSWORD" ssh ... -p "$SSH_PORT" root@$VM_IP=,
+with =VM_IP=localhost=, =SSH_PORT=2222=, =ROOT_PASSWORD=archsetup=.
+
+* Design
+
+** Where Testinfra fits
+Replace the =run_all_validations= call (step 5) with a pytest invocation against
+the running VM. Steps 1-4 and 6-7 are unchanged; =analyze_log_diff= stays.
+Testinfra connects over the same SSH the harness already exposes.
+
+** Connection model
+Testinfra's paramiko/ssh backend targets the live VM via its host spec:
+
+#+begin_src sh
+pytest scripts/testing/tests/ \
+  --hosts="ssh://root@localhost:2222" \
+  --ssh-config=<generated> \
+  --json-report --json-report-file="$TEST_RESULTS_DIR/testinfra.json"
+#+end_src
+
+Password auth: generate a throwaway ssh-config (or reuse sshpass via a
+=--ssh-identity= once archsetup drops the key, but at validation time we only
+have the root password). Simplest: a tiny generated ssh config + sshpass
+wrapper, or switch the test VM to a known test key injected pre-run. Open
+question below.
+
+** Test layout
+#+begin_example
+scripts/testing/tests/
+  conftest.py            # host fixture, markers, attribution hook, report glue
+  test_users.py          # user created / shell / groups
+  test_dotfiles.py       # stow symlinks, readable by user
+  test_packages.py       # yay, pacman working, dev tools, key packages
+  test_services.py       # firewall, dns, avahi, fail2ban, networkmanager
+  test_boot.py           # zfs, mkinitcpio hooks, nvme, consolefont, terminus
+  test_desktop.py        # window manager, autologin, gnome-keyring
+  test_archsetup.py      # install log, state markers
+  test_hardening.py      # NEW: sshd drop-in, sysctl, /etc fstab perms, backups
+#+end_example
+
+** Example tests (parity)
+#+begin_src python
+def test_ufw_enabled(host):
+    assert host.service("ufw").is_enabled
+
+def test_user_cjennings_exists(host):
+    u = host.user("cjennings")
+    assert u.exists
+    assert u.shell == "/usr/bin/zsh"
+
+def test_zshrc_stowed_and_readable(host):
+    f = host.file("/home/cjennings/.zshrc")
+    assert f.is_symlink
+    assert ".dotfiles/" in f.linked_to
+    assert f.exists                       # not broken
+    assert host.run("sudo -u cjennings test -r %s" % f.path).rc == 0
+
+def test_mkinitcpio_systemd_hook(host):
+    # non-ZFS systems delegate fsck from udev to systemd
+    conf = host.file("/etc/mkinitcpio.conf").content_string
+    assert "systemd" in conf
+#+end_src
+
+Compare =test_ufw_enabled= (1 line) to the current =validate_firewall= (8 lines
+of ssh_cmd + branch + counters).
+
+** Preserving issue attribution
+Map the three buckets to pytest markers and collect them in a =conftest.py=
+hook:
+
+#+begin_src python
+@pytest.mark.attribution("archsetup")   # or "base_install" / "unknown"
+def test_ufw_enabled(host): ...
+#+end_src
+
+A =pytest_runtest_makereport= hook records each failure under its marker's
+bucket and writes the same three-way report =generate_issue_report= produces
+(base-install failures still route to the archzfs inbox). Default bucket =
+archsetup when unmarked.
+
+** Tiered strategy
+Markers =@pytest.mark.smoke= (user, key packages, dotfiles present) and
+=@pytest.mark.integration= (services, configs, boot). =pytest -m smoke= for a
+fast gate, full run otherwise. Drop the task's original X11/startx end-to-end
+slice — the fleet is Wayland/Hyprland and headless GUI e2e is flaky and
+expensive; a Wayland-session smoke check can be reconsidered later as its own
+task.
+
+** Reporting
+=pytest-json-report= (or junit-xml) → =$TEST_RESULTS_DIR/=, surfaced in the
+test report alongside the install-log analysis. pytest's own per-test
+pass/fail/skip output replaces the hand-maintained counters.
+
+* Coverage
+
+** Parity (port all current checks)
+All ~26 =validation.sh= checks, grouped per the layout above.
+
+** Expansion (new — the coverage win)
+archsetup configures much that isn't validated today. Candidates:
+- sshd hardening drop-in (=/etc/ssh/sshd_config.d/10-hardening.conf=,
+  PermitRootLogin prohibit-password).
+- =backup_system_file= behavior — assert =.archsetup.bak= exists for files
+  archsetup edited in place (fstab, mkinitcpio.conf, sudoers, …).
+- pacman.conf (ParallelDownloads, Color, multilib) and makepkg.conf (MAKEFLAGS,
+  OPTIONS) settings actually applied.
+- systemd-resolved DNS-over-TLS drop-in; NetworkManager wifi-privacy.
+- fail2ban jail.local present; reflector config; sysctl printk; /etc/issue
+  emptied; vconsole font; fstab /efi fmask/dmask perms.
+- sanoid / zfs-replicate units (ZFS hosts).
+
+* Dependencies
+
+Add =python-pytest=, =python-pytest-testinfra= (pulls paramiko), and a JSON
+reporter to =make deps= (test host only — not installed by archsetup itself).
+Note: the existing unit suites run under =python3 -m unittest=; the integration
+layer runs under pytest. Two runners, both Python; =make test-unit= unchanged,
+=make test= gains the pytest step.
+
+* Goss comparison (the task asked)
+
+- *Goss* — YAML-declarative health specs, a single Go binary executed *on the
+  target*. Fast, no Python. But the spec must be pushed into the VM and run
+  there, the assertions are less programmable, and it adds a Go binary to the
+  flow.
+- *Testinfra* — Python, runs *on the host* over SSH (nothing installed in the
+  VM), assertions are full Python with rich built-in modules
+  (File/Package/Service/User/Command), integrates with pytest's tooling.
+
+Choose Testinfra: it runs from the host (the VM stays clean), it's far more
+programmable for the conditional checks archsetup needs (DESKTOP_ENV branches,
+ZFS-vs-not), and it aligns with the repo's existing Python test tooling.
+
+* Migration plan (phased, TDD where the helper logic is ours)
+
+- *P1 — Scaffold.* conftest.py (host fixture + connection), the attribution
+  marker + report hook, and 3 parity checks (firewall, user, dotfiles). Wire a
+  pytest step into run-test.sh behind a flag so the shell sweep still runs.
+- *P2 — Full parity.* Port all ~26 checks; diff a real VM run's results against
+  the shell sweep to confirm no check was lost.
+- *P3 — Cut over.* Make pytest the primary sweep in run-test.sh; keep
+  =analyze_log_diff= and the install-log signal.
+- *P4 — Expand.* Add the new coverage (hardening, backups, applied settings).
+- *P5 — Retire.* Remove =run_all_validations= from validation.sh (keep the
+  capture/analyze helpers that pytest doesn't replace).
+
+* Acceptance criteria
+
+- =make test= runs archsetup in a VM, then a pytest sweep over SSH, and a real
+  run reports parity with (or a superset of) the current shell checks.
+- Failures still sort into archsetup / base-install / unknown, with base-install
+  issues routed to the archzfs inbox as today.
+- =make deps= installs the test dependencies; the VM has nothing extra installed.
+- A documented =pytest -m smoke= fast path exists.
+
+* Resolved decisions (2026-06-25)
+
+1. *Auth at validation time — inject a throwaway test key.* Pre-run, generate
+   an ephemeral keypair, push the pubkey into the VM's
+   =/root/.ssh/authorized_keys= over the existing sshpass channel, and point
+   Testinfra at the private key via a generated ssh-config. No password in the
+   pytest invocation; paramiko key auth just works; the keypair is discarded
+   after the run. (Chosen over wrapping sshpass around Testinfra, which is
+   awkward since Testinfra spawns its own ssh connections.)
+2. *Cut over — run both through parity, then switch.* Keep the shell sweep
+   running alongside pytest through P2 so a real VM run can diff pytest's
+   results against the shell sweep and prove no check was dropped. pytest
+   becomes primary at P3; =run_all_validations= is deleted at P5 after the
+   expanded suite proves out.
+3. *Expansion scope — full, in this task, after cutover.* All of P4 lands here,
+   sequenced strictly after the P3 parity cutover so the parity diff is clean
+   before new checks are added.
diff --git a/docs/design/2026-06-25-zfs-vm-test-coverage.org b/docs/design/2026-06-25-zfs-vm-test-coverage.org
new file mode 100644
index 0000000..d9625e0
--- /dev/null
+++ b/docs/design/2026-06-25-zfs-vm-test-coverage.org
@@ -0,0 +1,139 @@
+#+TITLE: Design: ZFS VM Test Coverage + Bare-Metal Runner Migration
+#+AUTHOR: Craig Jennings
+#+DATE: 2026-06-25
+#+STATUS: Draft — for review
+
+* Problem
+
+Two gaps, one root:
+
+1. *The ZFS install path is untested in automation.* The VM harness
+   (=make test=) uses a single non-ZFS base image, so every ZFS-conditional
+   check skips (mkinitcpio udev hook on ZFS, sanoid, zfs-scrub timer, the whole
+   ZFS branch of archsetup). ZFS is exercised *only* by =run-test-baremetal.sh=
+   against real hardware.
+
+2. *=run-test-baremetal.sh= is latently broken by the sshd hardening.* It SSHes
+   to the target as root *by password* throughout the run, exactly the pattern
+   archsetup's =PermitRootLogin prohibit-password= (shipped 2026-06-24) kills
+   mid-install. The VM runner already hit and fixed this (=inject_root_key= +
+   key auth, commit f50fc1d); the bare-metal runner never got that fix, so it
+   almost certainly aborts mid-install now, the same way the VM runner did.
+
+The fix for both is the same shape: a ZFS base VM gives a safe, repeatable,
+snapshot-rollback ZFS target (no sacrificial hardware), which both fills the
+coverage gap *and* provides a target to migrate + validate the bare-metal
+runner against. This also unblocks P5 (deleting the dead shell-sweep functions
+from validation.sh), which is gated on the bare-metal runner leaving the shell
+sweep.
+
+* Decision
+
+Build a ZFS base VM via archangel, add a filesystem-profile selector to the VM
+harness so =make test= can target zfs or non-zfs, then migrate
+=run-test-baremetal.sh= to key auth + the Testinfra sweep and validate it
+against the ZFS VM. Finish by deleting the now-dead shell-sweep functions (P5).
+
+Explicitly rejected: loosening =PermitRootLogin= (or adding a skip-hardening
+test flag). That trades a real security feature for harness convenience and
+would mean never validating the hardened config. Key auth is the correct fix,
+already proven in the VM runner.
+
+* Current state (grounded)
+
+- =create-base-vm.sh= boots an =archangel-*.iso=, copies =archsetup-test.conf=
+  into the live env, runs =archangel --config-file /root/archsetup-test.conf=
+  (the base-OS install — partitioning/filesystem live here), powers off, and
+  snapshots =clean-install= onto =vm-images/archsetup-base.qcow2=.
+- =run-test.sh= hardcodes that one image + snapshot, and copies
+  =scripts/testing/archsetup-vm.conf= (DESKTOP_ENV=hyprland, non-ZFS) into the
+  VM as the archsetup config.
+- =run-test-baremetal.sh= takes =--host= / =--password=, SSHes as root by
+  password, rolls back ZFS =@genesis= snapshots, transfers + runs archsetup,
+  then calls =run_all_validations= / =validate_all_services= (overriding
+  =VM_IP= to the target). It is the only remaining caller of the shell sweep.
+- Key auth machinery already exists and is reusable: =inject_root_key= and
+  =SSH_KEY_OPT= in =vm-utils.sh=, and =run_testinfra_validation= in
+  =testinfra.sh= (drives connection from a generated ssh-config keyed on
+  =VM_IP= / =SSH_PORT=).
+
+* Design
+
+** A. ZFS archangel base
+Add a ZFS archangel config (a =archsetup-test-zfs.conf= or equivalent) that
+installs a ZFS root. Confirm archangel supports a ZFS-root config (it's a
+separate project — verify its config options first). Unencrypted ZFS for the
+test VM (skip the passphrase prompt; encryption isn't what we're validating).
+
+** B. Per-profile base images + selector
+- =create-base-vm.sh= takes a profile (e.g. =FS_PROFILE=zfs|ext4=, default
+  current/non-ZFS), picks the matching archangel config, and writes a
+  profile-named image: =vm-images/archsetup-base.qcow2= (default) vs
+  =vm-images/archsetup-base-zfs.qcow2=. Same =clean-install= snapshot name.
+- =run-test.sh= + Makefile take the same =FS_PROFILE= and select the image (via
+  =init_vm_paths=). The archsetup run config (=archsetup-vm.conf=) is *shared* —
+  archsetup auto-detects ZFS from the live root, so no per-profile run config is
+  needed. =make test FS_PROFILE=zfs=.
+
+** C. Bare-metal runner migration
+Mirror the VM runner's fix in =run-test-baremetal.sh=:
+- After the first successful SSH to =TARGET_HOST=, call =inject_root_key= (it
+  authorizes a key over the password session; set =VM_IP=TARGET_HOST=,
+  =SSH_PORT=22= so the helpers + ssh-config target the real host).
+- Replace =run_all_validations= / =validate_all_services= with
+  =run_testinfra_validation= (now authoritative).
+- Everything downstream already routes through =$SSH_KEY_OPT= (the vm-utils
+  helpers) and the ssh-config, so it survives the hardening.
+
+** D. Validate
+- =make test FS_PROFILE=zfs= → the ZFS-conditional pytest checks now *run*
+  (not skip): mkinitcpio uses the udev hook, sanoid installed, zfs-scrub timer,
+  zfs root. Fix any real ZFS-path findings archsetup has.
+- Point =run-test-baremetal.sh= at the ZFS VM (or real hardware) → confirm the
+  key-auth migration carries it through the hardening to a green pytest sweep.
+
+** E. Delete the shell sweep (P5)
+Once both runners use =run_testinfra_validation=, delete the dead functions from
+=validation.sh= (run_all_validations, validate_all_services, the ~26 validate_*
+checks, validate_service*, run_full_validation, validation_pass/fail/warn/skip).
+Keep the live helpers: ssh_cmd, attribute_issue, capture_pre/post_install_state,
+analyze_log_diff, categorize_errors, generate_issue_report, VALIDATION_*.
+
+* Phases
+- *P-A* archangel ZFS config (verify archangel ZFS support first).
+- *P-B* create-base-vm.sh + run-test.sh + Makefile profile selector; build the
+  ZFS base image + snapshot.
+- *P-C* =make test FS_PROFILE=zfs= green (ZFS-conditional tests run; fix
+  findings). VM-validatable here.
+- *P-D* migrate run-test-baremetal.sh to key auth + Testinfra; validate against
+  the ZFS VM.
+- *P-E* delete the dead shell-sweep functions (the standing P5 follow-up).
+
+* Open questions
+1. *Does archangel support a ZFS-root config out of the box?* RESOLVED (yes).
+   ZFS is archangel's *default* filesystem (=FILESYSTEM=zfs=, validated by
+   =installer/lib/config.sh:validate_filesystem=), with =NO_ENCRYPT=yes= for an
+   unattended unencrypted install and a ready =installer/velox-zfs.conf.example=
+   to model. No archangel work needed.
+2. *Two images vs one image + two snapshots?* RESOLVED — two images. ZFS vs
+   btrfs are different on-disk layouts; cleaner than juggling snapshots on one
+   disk. =btrfs= keeps the legacy unsuffixed =archsetup-base.qcow2=; =zfs= gets
+   =archsetup-base-zfs.qcow2=.
+3. *Profile on run-test.sh vs a separate run-test-zfs.sh?* RESOLVED —
+   =FS_PROFILE= env param on the existing runner + Makefile, no duplicate
+   harness.
+4. *Disk size / RAM for the ZFS VM* — start at the 4G RAM / 50G disk defaults;
+   bump =VM_RAM= only if the ZFS install OOMs (decide at P-C build time).
+5. *Should the bare-metal runner stay at all once a ZFS VM exists*, or does the
+   ZFS VM profile make it redundant for everything except real-hardware smoke?
+   Defer until after P-D.
+
+* Design corrections (found during P-A/P-B grounding)
+- The "non-ZFS" base is *btrfs*, not ext4 — =archsetup-test.conf= sets
+  =FILESYSTEM=btrfs=. The profile axis is zfs vs btrfs throughout.
+- *No =archsetup-vm-zfs.conf= is needed.* archsetup reads no filesystem key; it
+  auto-detects ZFS from the live root via =is_zfs_root()= (=findmnt -n -o FSTYPE
+  /=, archsetup:688). The ZFS branch (sanoid, zfs-scrub timer, mkinitcpio udev
+  hook, docker zfs storage driver) fires whenever the running root is ZFS. So
+  only the *archangel* base config and the base *image* differ per profile; the
+  archsetup run config (=archsetup-vm.conf=) is shared.
diff --git a/docs/design/2026-06-29-waybar-network-module-spec.org b/docs/design/2026-06-29-waybar-network-module-spec.org
new file mode 100644
index 0000000..db9657d
--- /dev/null
+++ b/docs/design/2026-06-29-waybar-network-module-spec.org
@@ -0,0 +1,1601 @@
+#+TITLE: Waybar Network Module — Design Spec
+#+AUTHOR: Craig Jennings & Claude
+#+DATE: 2026-06-29
+
+* Status
+
+*Phase 1 SHIPPED* (2026-06-29, dotfiles =5254bd8=..=c095a22=, 10 commits): engine
+(=net status/probe/diagnose/repair/doctor/portal=) + =waybar-net= indicator +
+split-cadence cache + redacted event log + Makefile recovery targets + airplane
+absorption. 160 net tests; pure modules ≥90% branch. One as-built deviation from
+this spec: airplane absorption is *display-only* (Craig's call, option 1) — net
+shows the airplane state but the =airplane-mode= low-power toggle is KEPT (it does
+radios + CPU + brightness + services, not a network concern); only =waybar-airplane=
++ =custom/airplane= + =waybar-netspeed= were deleted. See decision 12. Phases 2-5
+remain. Live waybar eyeball is under todo.org "Manual testing and validation".
+
+Ready for Phase 1; Ready-with-caveats overall. Three Codex review rounds + Craig's
+cj comments are all incorporated — every finding has a disposition and the findings
+cookie reads complete ([31/31]), with no open decisions (enterprise scope settled:
+open + WPA-PSK in v1, 802.1X add/edit vNext, activate-only). The cj comments
+reshaped several decisions (no separate credential store — use NM's own; =net
+doctor= + Makefile console-recovery in v1; rfkill + full-stack-bounce repair;
+airplane module absorbed; VPN a later Phase 5). The only remaining caveats are
+Phase-2/3 build unknowns named under Open items (gtk4-layer-shell anchoring, the
+=captive= =--json= refactor) — not Phase-1 blockers. Phase 1 (indicator + console
+recovery) is ready to build.
+
+* Goal
+
+One waybar network component that does the whole job: shows connection state
+(including the missing "associated but no internet / captive portal" state),
+manages connections from a dropdown (nmcli-backed; secrets stay in
+NetworkManager's own store, no separate credential file), and runs the network
+diagnostics and remediation off the same place
+(captive-portal detection + forcing, bounce/reset, gateway/DNS checks, speed
+test).
+
+It unifies three todo tasks that are really one feature:
+- =[#C]= "archsetup Waybar Wi-Fi module should show no-internet state" — the
+  indicator state plus the 2026-06-22 roam expansion (bounce, diagnostics, speed
+  test off the component).
+- =[#B]= "Network-manager dropdown, nmcli-backed" — the management dropdown. (The
+  todo task's original "GPG-stored secrets" framing is superseded: secrets stay in
+  NM's own store, decision 5.)
+- The network diagnostics already shipped in =captive= (the hotel/captive-portal
+  tool, formerly =login-page=) become this module's diagnostics engine rather
+  than a standalone CLI.
+
+* Scope
+
+** In
+- *Indicator* — wifi/ethernet icon + signal + SSID, plus an internet sub-state:
+  online / captive / no-internet / connecting / disconnected / airplane.
+- *Absorbs the airplane module* — the airplane state + toggle move into
+  =custom/net= (airplane is a network concern). Once this ships, the standalone
+  =custom/airplane= module, the =waybar-airplane= + =airplane-mode= scripts, their
+  =tests/=, and the css are deleted (listed under Files touched). The
+  desktop-settings panel (sibling =[#B]=) no longer needs an airplane row.
+- *Interface-correct* — targets the wifi (or chosen) device, not the
+  default-route interface, so an active USB tether or wired link can't mask
+  wifi state. (Same lesson =captive= fixed; the current =custom/netspeed= keys
+  off the default route and has the bug.)
+- *Connection management (panel)* — list saved connections most-recently-used
+  first, live signal for in-range wifi, click to switch; add / edit / remove for
+  open + WPA-PSK; activate any existing saved profile (including enterprise ones
+  NM already stores); ethernet↔wifi and wifi↔wifi switching even when a link
+  appears mid-session.
+- *Diagnostics (panel)* — read-only Diagnose (captive probe 204-vs-portal with
+  the extracted portal URL, gateway ping, DNS config) separated from mutating
+  Repair. Repair has tiers, lightest first: rfkill-unblock, per-connection reset
+  (fresh MAC), full-stack bounce (=nmcli networking off/on=, then restart
+  NetworkManager if that fails), and the temporary 1.1.1.1 override test. Each
+  Repair action confirms and verifies cleanup.
+- *Speed test (panel)* — down/up/ping with a progress indicator and last-result
+  shown, via the already-installed =speedtest-go --json=.
+- *Connection secrets* — none of our own. Settings and passwords live where NM
+  already keeps them: =/etc/NetworkManager/system-connections/*.nmconnection=
+  (root-only =0600=, the PSK/EAP secret stored inline). We read/write them through
+  nmcli, which handles the privilege. No separate file, no GPG, no gpg-agent — one
+  fewer dependency, and NM's store is already the secure-at-rest source of truth.
+- *Persistence* — connectivity probe result cached in the runtime dir so the
+  bar reads it cheaply between probes.
+- *Observability* — a redacted JSONL event log so a post-failure session can
+  diagnose without re-running destructive actions.
+
+** Out (v1, note for later)
+- No replacement of NetworkManager's connection engine. NM stays the thing that
+  connects; we drive it via nmcli.
+- No add/edit *form* for WPA-Enterprise / 802.1X in v1. The reason is effort vs
+  payoff: 802.1X has many interdependent fields (CA cert, client cert, identity,
+  anonymous identity, phase-2 auth) where a wrong entry silently fails auth, so a
+  trustworthy form is a lot of UI for connections Craig rarely adds (open +
+  WPA-PSK covers home, hotels, and phone hotspots). v1 still *activates* existing
+  saved enterprise profiles and points editing at =nmtui=/=nmcli=. Settled
+  (Craig, 2026-06-29): enterprise add/edit is vNext — 24 saved profiles on velox,
+  0 enterprise, so the form would be unused UI; if one ever appears nmtui adds it
+  once and the module activates it thereafter.
+- No per-connection captive-portal *auto-login* in v1. (That would mean storing a
+  portal's login form answers — room number, surname, a checkbox — and replaying
+  them automatically when a known portal is detected, so the page never appears.
+  Out for v1 because every portal's form differs and it means storing per-venue
+  answers; v1 just opens the portal for you.)
+- No graphing/history of speed-test results beyond the last run.
+- No static-IP / proxy / metered / MAC-randomization editing in v1 (activate
+  existing, edit elsewhere).
+- No VPN / WireGuard management in v1, but it's a planned later phase (Phase 5),
+  not a permanent exclusion — it folds the existing archsetup wireguard tooling
+  into the same panel/CLI.
+- The desktop-settings dropdown (sibling =[#B]=) is a separate module, but it
+  shares the GTK4 layer-shell panel shell built here.
+
+* Architecture
+
+Three layers. Keep the bar cheap, the panel rich, the logic in one tested place.
+
+1. *Engine* — a =net= Python package (src-layout, unittest), exposing a CLI. Wraps
+   every nmcli op and owns the diagnostics. Emits JSON. This is the testable
+   core (fake =nmcli= / =curl= / =speedtest-go= on PATH, like the existing
+   =waybar-netspeed= and =waybar-sysmon= test harnesses). Precedent: pocketbook is
+   Python in the dotfiles repo; =wtimer= is Python for the same testability
+   reason.
+2. *Indicator* — a thin =waybar-net= script that calls =net status --json= and
+   renders icon + signal + state + tooltip. Replaces =custom/netspeed=
+   (throughput folds into the tooltip).
+3. *Panel* — a GTK4 + gtk4-layer-shell app (mirrors pocketbook's structure)
+   that imports the engine. Hosts connection management, diagnostics, and the
+   speed test.
+
+How the existing pieces map in:
+- =captive= (bash, shipped) — the engine shells out to it for the heavy,
+  interactive portal-force flow (sudo reset, DNS override, browser launch). Its
+  cheap portal-detection logic is mirrored natively in the engine for the fast
+  status path so the bar never blocks on a subprocess. =captive= stays a usable
+  standalone CLI. The refactor (below) extracts its probe + reset into functions
+  the engine can call non-interactively.
+- =waybar-netspeed= (sh, shipped) — retired; its throughput sampling moves into
+  the engine's status output and renders in the indicator tooltip only.
+- =nmcli= — the connection backend for every op.
+
+Language note: the engine is Python; the indicator is a thin Python or sh
+wrapper over =net status --json=. The bar path must stay fast (see Performance
+budgets), so the indicator does no network I/O itself — it reads link state and
+the cached connectivity result.
+
+* Repository + dependencies
+
+- *Code lives in the dotfiles repo* (=~/.dotfiles=), not archsetup. The =net=
+  package sits in-tree like pocketbook (src-layout, unittest, Makefile target);
+  =waybar-net= and the =net= CLI entry live in the hyprland tier
+  (=hyprland/.local/bin/=). Tests under =tests/net/= and =tests/waybar-net/=.
+  archsetup owns only the *dependency install*, not the code.
+- *archsetup installs the deps* in its Hyprland step: =gtk4-layer-shell=,
+  =python-gobject=, plus =nmcli=/=curl=/=resolvectl=/=rfkill= (already present via
+  NetworkManager/curl/systemd/util-linux). Speed test uses =speedtest-go= (AUR
+  =speedtest-go-bin=, already installed on velox); archsetup adds it to the AUR
+  list. librespeed-cli is the documented fallback if a self-hosted LibreSpeed
+  server is ever wanted. No =gpg= dependency (secrets live in NM's own store).
+- *Daily-drivers*: a stowed-script + AUR-dep feature, so ratio needs the same
+  =git pull= + stow + the archsetup-added deps. Note the manual dep step in the
+  rollout.
+
+** Makefile targets (console recovery is a first-class path)
+=net doctor= and the diagnostics are reachable from a bare TTY when waybar and
+the GUI are down — that's the case where you most need them. The dotfiles
+Makefile carries targets that wrap the =net= CLI so "get back online" is one make
+command from the console:
+- =make online= — =net doctor --fix= (diagnose, then apply the lightest repair:
+  rfkill-unblock → reset → bounce → open portal). The headline recovery target.
+- =make net-doctor= — =net doctor= (read-only diagnose + recommendation).
+- =make net-status= / =make net-diagnose= / =make net-portal= / =make net-reset=
+  / =make net-bounce= — the individual ops.
+- =make test= — already runs =tests/*=; the =net= package's unittest suites are
+  collected the same way.
+These intentionally need only nmcli/curl/rfkill (no GUI, no waybar, no Python
+GTK), so they work from a TTY on a broken graphical session.
+
+* Connectivity model — split cadence
+
+The indicator polls every ~2s, but a real internet/captive probe every 2s wastes
+battery and can re-trigger a captive portal. So split it:
+
+- *Fast path (every poll, cheap, no network)* — interface, type, SSID, signal,
+  IPv4 presence, throughput sample. From nmcli / sysfs only. No network I/O.
+- *Slow path (cached, TTL ~45s)* — the actual internet/captive probe (the 204
+  check + meta-refresh portal extraction). Result cached at
+  =$XDG_RUNTIME_DIR/waybar/net-connectivity.json= with a timestamp.
+
+The indicator reads the cache each poll. When the cache is older than the TTL,
+=net status= kicks =net probe= in the background (spawn + detach, never awaited)
+and renders the last cached sub-state meanwhile. A user-triggered
+diagnose/reconnect refreshes the cache immediately. This keeps the bar
+responsive and the portal un-poked.
+
+** Concurrency, atomicity, staleness
+- *Single-flight* — =net probe= takes a lock file at
+  =$XDG_RUNTIME_DIR/waybar/net-probe.lock= (flock, non-blocking). A second probe
+  while one runs is a no-op, so a flapping 2s poll can't pile up overlapping
+  probes.
+- *Atomic writes* — the cache is written to a temp file + =os.replace= (atomic
+  rename), so a reader never sees a half-written cache. Same pattern as =wtimer=.
+- *Max probe runtime* — the probe has a hard timeout (≤ 6s total: curl
+  =--max-time 5= + slack). On timeout it writes an =unknown= result, never hangs.
+- *Stale classes* the indicator distinguishes: fresh (< TTL), stale (TTL..3×TTL,
+  shown with a subdued/aging hint), expired (> 3×TTL → treat as unknown),
+  unknown (no cache / probe failed). The bar never shows a confident "online"
+  past the expired threshold.
+- *Invalidation* — the cache records the iface + SSID + active-connection UUID it
+  was taken under; a change in any of them invalidates it immediately (a
+  reconnect must not show the old network's verdict).
+- *Crash cleanup* — a stale lock older than the max runtime is ignored/reclaimed.
+
+* Performance budgets (hot path)
+
+The bar exec path (=waybar-net= → =net status=) must stay responsive:
+- *Budget*: =net status= returns in < 100ms typical, < 250ms worst case.
+- *No sleeping in the bar path.* Throughput is sampled from two reads of
+  =/sys/class/net/<iface>/statistics/{rx,tx}_bytes= across the *waybar poll
+  interval itself* (delta since the last cached sample + timestamp), not via an
+  in-process =sleep= like the old =waybar-netspeed=. The cache holds the prior
+  counters.
+- *Subprocess cap*: at most one =nmcli= invocation on the hot path (a single
+  =nmcli -t -f ...= multi-field query), plus sysfs reads. Never a per-field
+  nmcli call.
+- *Every subprocess has a timeout* (=nmcli --wait 2=, =subprocess timeout=). On
+  timeout or error the indicator emits a degraded JSON state (class
+  =net-degraded=, a neutral glyph) rather than blocking or crashing waybar.
+- *Benchmark test*: a fake slow =nmcli= asserts =net status= still returns within
+  budget by falling back to the degraded state.
+
+* Engine — =net= CLI surface
+
+All subcommands take =--json= where a machine reads them. Pure formatting/state
+functions under the CLI; IO (nmcli, curl, file) at the edges. Every subcommand
+exits non-zero with a JSON error envelope (see JSON schemas) on failure.
+
+- =net status [--json] [--iface IF]= — fast link state + cached connectivity
+  sub-state + throughput. The indicator's source. Never does network I/O.
+- =net probe [--iface IF]= — run the connectivity/captive probe now, update the
+  cache (single-flight, atomic), print online | captive (+ portal URL) |
+  no-internet | unknown. Mirrors =captive='s cheap detection natively.
+- =net list [--json]= — saved connections, MRU order, active flag, plus in-range
+  wifi with signal.
+- =net up <uuid>= / =net down [--iface IF]= — switch / disconnect. Operates on
+  UUID, not name (see nmcli contract).
+- =net add= / =net edit <uuid>= / =net remove <uuid>= — manage connections
+  (open + WPA-PSK) through nmcli; the secret lands in NM's own
+  =.nmconnection=. Enterprise profiles are activate-only.
+- =net rescan [--iface IF]= — wifi rescan.
+- =net diagnose [--json]= — read-only report: gateway ping, DNS config, captive
+  probe. The structured contract below. Doubles as the post-failure snapshot.
+- =net repair <action> [--json]= — mutating remediation, lightest first:
+  =rfkill= (unblock + radio on), =reset= (fresh MAC), =bounce= (full-stack:
+  =nmcli networking off/on=, escalating to =systemctl restart NetworkManager=),
+  =dns-test= (temporary 1.1.1.1 override, auto-reverted). Each confirms via the
+  caller and verifies cleanup.
+- =net doctor [--json] [--fix]= — one-shot "get me online" mode for the console:
+  runs the full diagnose, then applies the lightest repair that fits (unblock
+  rfkill, reset, bounce, open portal) — read-only without =--fix=, acting with
+  it. The TTY recovery path when waybar/the GUI is down (see the Makefile
+  targets).
+- =net portal= — run =captive='s portal-force flow (reset if needed, extract +
+  open the portal page).
+- =net speedtest [--json]= — =speedtest-go --json= run; down/up/ping.
+
+* nmcli contract
+
+The command wrapper is the reliability boundary; SSIDs and connection names
+contain spaces, colons, duplicates, hidden names, and non-ASCII. Rules:
+
+- *Terse, field-selected output*: =nmcli -t -f <fields> --escape yes ...= and
+  =nmcli -g <fields> ...= (get-values) for single-value reads. Parse with the
+  documented escaping (=\:= and =\\=); never naive =cut -d:=.
+- *UUID is the handle.* Every saved-profile op (=up=, =down=, =modify=, =delete=)
+  uses the connection UUID, never the display name — names duplicate and contain
+  separators. =net list= surfaces UUIDs; the panel maps row → UUID.
+- *Wait budgets*: activation/deactivation use =nmcli --wait <n>= with an explicit
+  budget (hot-path reads =--wait 2=; activation =--wait 30=). No unbounded waits.
+- *Connectivity*: NM's own =nmcli networking connectivity= can return
+  =none/portal/limited/full/unknown=. Use it as a *cheap hint* on the fast path
+  when present, but the authoritative captive verdict is still our own probe
+  (NM's portal detection is coarser and config-dependent).
+- *Parser tests* (fake nmcli fixtures): escaped colons and backslashes in SSIDs,
+  embedded newlines, duplicate connection names, hidden SSID (empty name),
+  non-ASCII SSID, the wired-appears-mid-session case, and the multi-active case
+  (wifi + tether both up).
+
+* JSON schemas
+
+Versioned (="v": 1=) envelopes so tests lock the contract. Sketches (fields
+nullable unless noted):
+
+- =status=: ={v, iface, type: wifi|ethernet|none, ssid, signal, ipv4,
+  gateway, throughput: {rx_bps, tx_bps}, connectivity: online|captive|no-internet|unknown,
+  connectivity_age_s, connectivity_class: fresh|stale|expired|unknown, state:
+  online|captive|no-internet|connecting|disconnected|airplane|wired|degraded}=.
+- =probe=: ={v, result: online|captive|no-internet|unknown, portal_url, http_code,
+  redirect_host, elapsed_ms, ts}=.
+- =list=: ={v, connections: [{uuid, name, type, active, last_used, signal,
+  in_range, security}]}=.
+- =diagnose=: ={v, steps: [<diagnostic step, see contract>], overall:
+  ok|warn|fail}=.
+- =speedtest=: ={v, down_mbps, up_mbps, ping_ms, server, elapsed_ms, ts}=.
+- error envelope (any command): ={v, error: {code, message, detail, partial:
+  bool}}= with a non-zero exit.
+
+* Diagnostics contract
+
+=net diagnose --json= returns an ordered list of steps. Each step is the unit the
+panel renders and the log records:
+
+- =id= — stable identifier (e.g. =link=, =dhcp=, =gateway-ping=, =dns-config=,
+  =dns-resolve=, =http-probe=, =portal=).
+- =status= — =pending | running | pass | warn | fail | skipped=.
+- =title= — short human label.
+- =evidence= — redacted detail (the value seen), per the redaction rules.
+- =elapsed_ms=.
+- =safety= — =read-only= or =mutating= (diagnose steps are all read-only).
+- =next_action= — what the user/agent should do on warn/fail (e.g. "open portal",
+  "reset connection", "switch network").
+
+Repair actions (=net repair=) carry the same shape but =safety: mutating=, plus a
+=cleanup_verified: bool= field (e.g. the DNS override was reverted) and a
+terminal =cleanup-unverified= status when revert can't be confirmed.
+
+** Diagnose vs Repair (read-only vs mutating)
+The panel separates them visually and behaviorally:
+- *Diagnose* — probe, gateway ping, DNS config read, captive check. No state
+  change, no sudo, runnable freely.
+- *Repair* — reset (fresh MAC, deletes+recreates the NM profile), DNS override
+  test (mutates resolver, auto-reverts), portal force. Each needs an explicit
+  confirm, shows that it's privacy/state-changing, and verifies cleanup. A
+  Repair whose cleanup can't be verified ends in a visible =cleanup-unverified=
+  state, never a silent success.
+
+* Failure states, messages, recovery
+
+Each row below gives the *exact, final* user-facing string (not a template) with
+=<placeholders>= for redacted evidence, plus the evidence field included and the
+next action. The string is canonical: every surface renders the same text, so
+there's one source of truth.
+
+Per-surface rendering of the canonical string:
+- *Indicator* — the matching glyph + CSS class; the string is the tooltip
+  (untruncated).
+- *Notification* (=notify=) — title = the failure label, body = the string.
+- *CLI* — the string on stderr; =--json= puts it in =error.message= with the
+  evidence in =error.detail= and a stable =error.code=.
+- *Panel* — the string as the section banner, with the diagnostic step's evidence
+  shown beneath.
+Evidence is always redacted per the redaction rules (SSID/host shown; PSK/EAP/
+portal tokens never).
+
+- *associated, no DHCP* — "Connected to <SSID>, no IP (DHCP failed)" →
+  evidence: SSID, iface → reset / reconnect.
+- *no-internet* — "On <SSID>, no internet (gateway reachable, no route out)" →
+  diagnose / switch network.
+- *captive* — "Captive portal at <host> — login required" → Open portal.
+- *DNS hijack* — "DNS is being redirected (portal)" → Open portal.
+- *DNS broken* — "DNS not resolving (hotel DNS down); 1.1.1.1 works" → use
+  override / report.
+- *HTTP intercepted* — "Traffic is being intercepted before it leaves" → Open
+  portal.
+- *sudo declined* — "Reset needs admin; it was declined — nothing changed" →
+  retry with auth.
+- *command timed out* — "<op> timed out; the system was left unchanged" → retry.
+- *partial mutation* — "<op> partially applied: <what>; rolled back to <state>"
+  → review.
+- *missing speedtest-go* — "speedtest-go not installed" → install hint.
+- *no wifi hardware* (desktop) — wifi rows hidden; ethernet-only view.
+- *wifi rfkill-blocked* — "WiFi is blocked (rfkill)" → unblock. The indicator
+  detects a soft-blocked radio (=rfkill list= shows the radio off though hardware
+  is present) and shows this distinct from disconnected. =net repair rfkill= (and
+  =net doctor --fix= as its first step) runs =rfkill unblock wifi= + =nmcli radio
+  wifi on= and reconnects. This is the framework-laptop case: an out-of-power
+  shutdown sometimes leaves wifi soft-blocked at next boot, and yes — the module
+  recovers it (the rfkill state is the indicator; the rfkill repair / doctor is
+  the one-step fix). A *hard* block (physical switch) is reported as
+  not-recoverable-in-software with that message.
+- *wifi rfkill hard-blocked* — "WiFi is blocked by the hardware switch" →
+  evidence: rfkill hard state → flip the physical switch.
+- *wrong password / missing secret* — "Saved password for <SSID> was rejected" →
+  evidence: SSID, NM auth-failure reason → re-enter the password.
+- *enterprise auth/cert failure* — "Enterprise login failed for <SSID> (802.1X)"
+  → evidence: SSID, EAP failure reason → edit the profile in nmtui/nmcli.
+- *upstream / AP / provider* — "On <SSID>, link is fine but the network has no
+  uplink" → evidence: gateway reachable, no route out, not a portal → switch
+  network or contact the venue.
+- *VPN-routed* — "Connected; internet is routed through a VPN (<dev>)" →
+  evidence: default route on a tun/wg device or non-NM DNS owner → check the VPN,
+  not WiFi.
+- *HTTP interception, no parseable portal URL* — "A portal is intercepting
+  traffic but didn't give a login link" → evidence: HTTP code, redirect host →
+  opens neverssl + the gateway page to log in manually.
+- *DNS override cleanup unverified* — "Couldn't confirm DNS was restored after the
+  test" → evidence: iface, attempted revert → revert DNS manually
+  (=resolvectl revert <iface>=).
+
+Each message names whether the system was left unchanged, partially changed (with
+what), or fully changed, so the user knows the residue.
+
+* Doctor: escalation, classification, terminal states
+
+=net doctor= diagnoses, classifies the failure, then (with =--fix=) applies the
+*lightest* repair that fits and re-checks — it never loops destructive repairs
+against a failure they can't fix. Each failure resolves to one of four outcomes,
+and the doctor stops at any terminal one:
+
+- =fixable= — a local repair should help. Escalate lightest-first: rfkill-unblock
+  → reset (fresh MAC) → bounce (full stack) → portal, re-probing after each, and
+  stop as soon as the probe returns online.
+- =needs-user-action= (terminal) — no reset/bounce will help; doctor stops and
+  names the exact next step. Covers: wrong WPA password / missing NM secret
+  (enter the password), locked keyring or polkit denial (retry with auth),
+  enterprise 802.1X cert/identity failure (edit the profile in =nmtui=/=nmcli=),
+  captive portal login-required (open the portal + accept terms). Doctor must not
+  delete/recreate the profile against these — that loses the saved password and
+  makes things worse.
+- =upstream-not-local= (terminal) — the local link is up but the problem is past
+  it: AP has no uplink, gateway down/dropping traffic, DHCP server broken, ISP
+  outage, portal backend failing. =diagnose= proves it (link up + IP + gateway
+  reachable, but no route out and not a captive redirect), and =doctor --fix=
+  stops after local repairs are exhausted with "local repairs tried; likely
+  upstream/AP/provider" + the evidence. Next action: switch networks or contact
+  the venue.
+- =deferred/vpn= (terminal for v1) — an active VPN / policy route / non-NM
+  resolver owns the default route or DNS, so "no internet" may be the VPN's fault,
+  not WiFi's. v1 *detects* this (default route on a =tun/wg= device, or DNS owned
+  by something other than the NM link) and classifies it separately — "link is
+  fine; internet is VPN-routed" — rather than misclassifying it as a WiFi failure.
+  v1 does not repair it (VPN management is Phase 5); it names the VPN as the likely
+  owner and stops.
+
+** DNS handling in doctor (explicit per class)
+- *Captive DNS hijack* — open the portal (the hijack clears on login). No DNS
+  mutation.
+- *Broken resolver, 1.1.1.1 works* — doctor offers an explicit *temporary* 1.1.1.1
+  override as a repair with cleanup verification (auto-revert, =cleanup_verified=);
+  without =--fix= it only recommends the command. It does not leave a permanent
+  resolver change.
+- *Port-53 / egress blocked* (even 1.1.1.1 fails) — terminal =upstream-not-local=;
+  doctor stops, since it's not locally fixable.
+
+* Failure-mode coverage
+
+For each common field failure: does =net diagnose= detect it, can =net doctor
+--fix= repair it, and what terminal user action remains when it can't. (The
+=needs-user-action= / =upstream-not-local= / =deferred/vpn= outcomes are defined
+above.)
+
+| Failure mode               | diagnose detects            | doctor --fix                | terminal user action        |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| rfkill soft block          | yes                         | yes (unblock)               | none                        |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| rfkill hard block          | yes                         | no                          | flip the physical switch    |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| no wifi hardware           | yes                         | n/a                         | use ethernet                |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| associated, no DHCP        | yes                         | yes (reset/bounce)          | none, else switch network   |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| gateway unreachable        | yes                         | yes (bounce)                | switch network if it        |
+|                            |                             |                             | persists                    |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| captive DNS hijack         | yes                         | opens portal                | log in at the portal        |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| broken DNS, 1.1.1.1 works  | yes                         | yes (temp override,         | report the venue's DNS      |
+|                            |                             | auto-reverted)              |                             |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| HTTP captive portal        | yes                         | opens portal                | log in at the portal        |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| HTTP interception, no      | yes                         | opens neverssl + gateway    | log in manually             |
+| parseable URL              |                             |                             |                             |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| upstream / AP outage       | yes (link up, no route out) | no (stops after local)      | switch network / contact    |
+|                            |                             |                             | venue                       |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| wrong WPA password /       | yes                         | no                          | enter the password          |
+| missing secret             |                             |                             |                             |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| enterprise auth / cert     | yes                         | no                          | edit the profile in         |
+| failure                    |                             |                             | nmtui/nmcli                 |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| duplicate SSID /           | yes (UUID-keyed)            | yes (activate by UUID)      | none                        |
+| connection-name            |                             |                             |                             |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| hidden SSID                | yes                         | yes (connect by name)       | enter SSID + password       |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| multiple active links      | yes                         | n/a                         | pick the interface          |
+| (wifi+tether)              |                             |                             |                             |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| wedged NetworkManager      | yes                         | yes (bounce → restart NM)   | none, else reboot           |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| slow / hung command        | yes (degraded)              | retries within budget       | retry                       |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| stale / corrupt cache      | yes                         | self-heals (atomic +        | none                        |
+|                            |                             | invalidation)               |                             |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| DNS cleanup failure        | yes                         | flags cleanup-unverified    | revert DNS manually         |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| missing speedtest backend  | yes                         | n/a                         | install speedtest-go        |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+| VPN / policy-routing       | yes (route/DNS ownership)   | no (deferred to Phase 5)    | check the VPN               |
+| interference               |                             |                             |                             |
+|----------------------------+-----------------------------+-----------------------------+-----------------------------|
+
+* Observability — logging + redaction
+
+- *Event log*: JSONL at =$XDG_STATE_HOME/net/events.jsonl= (fallback
+  =~/.local/state/net/events.jsonl=), size-rotated (e.g. 1 MB × 3). Every
+  mutating op and probe appends an event: =ts, op, argv (redacted), exit_code,
+  stderr_tail, elapsed_ms, iface, nm_uuid, probe_url_class, http_code,
+  redirect_host, cache_event=.
+- *Redaction (always on)*: PSKs, EAP identities/passwords, NM secrets, and
+  portal query tokens are never logged. MAC addresses, full IPs, and SSID are
+  redacted when configured (=redact_mac=, =redact_ip=, =redact_ssid= in config).
+- *Post-failure diagnosis*: =net doctor --json= is the snapshot + recommendation
+  (diagnose plus the suggested repair), =net diagnose --json= the raw report, and
+  the event log the history. =net doctor= is the console-recoverable entry point
+  (reachable as =make online= / =make net-doctor=).
+- *Secret-leak tests*: assert no PSK/EAP/portal-token ever appears in any JSON
+  output, log line, or error message.
+
+* Indicator (task #C — Phase 1, the fast win)
+
+** States (internet sub-state on top of link state)
+- online — associated and the probe returned 204. Normal icon.
+- captive — associated, probe hit a portal. Distinct glyph + warning CSS class;
+  tooltip names the portal host; left-click opens diagnostics with the portal
+  ready to open (Phase 2+; see interactions for the Phase-1 interim).
+- no-internet — associated, probe failed (no portal, no 204). Distinct glyph +
+  warning class.
+- degraded — =net status= couldn't read link state within budget (slow/failed
+  nmcli). Neutral glyph, =net-degraded= class. Never blocks the bar.
+- rfkill-blocked — the radio is soft-blocked (=rfkill=), distinct from
+  disconnected. Distinct glyph; the fix is =net repair rfkill= / =net doctor=.
+- connecting / disconnected / airplane / wired — as today, plus wired shown
+  correctly even when it appears after session start. (airplane is now this
+  module's state, absorbed from the retired airplane module.)
+
+** Glyphs
+Nerd-font codepoints, final values verified live before merge (same discipline
+as wtimer). Reuse the signal-strength ramp already in =waybar-netspeed=; add a
+captive / no-internet / degraded overlay glyph.
+
+** Tooltip
+SSID + signal + IPv4 + gateway + the throughput readout (absorbed from
+netspeed) + the last probe result and its age (stale/expired hinted).
+
+** Interactions (phase-aware; no keyboard-modifier clicks — waybar can't qualify
+clicks by modifier, so the rich actions live in the panel, not ctrl/super-click)
+Clicks never block the bar: each dispatches a detached background job and reports
+via =notify=, single-flight per action.
+- *Phase 1 (no panel yet)*: left-click runs =net probe= + notify (refreshes the
+  state on demand) and keeps the existing =pypr toggle network= scratchpad as the
+  interim manager; right-click runs =net repair reset= in the background +
+  notify; middle-click runs =net portal=.
+- *Phase 2+ (panel exists)*: left-click opens the panel (focused on the relevant
+  section — diagnostics when captive); right/middle keep the background
+  reset/portal shortcuts.
+
+* Panel (tasks #B + #C diagnostics — Phases 2-3)
+
+GTK4 + gtk4-layer-shell, pocketbook scaffold (src-layout package, unittest,
+Makefile, gtk4-layer-shell anchored dropdown under the bar). One panel shell,
+reused by the future desktop-settings panel.
+
+Sections:
+1. *Connections* — list, MRU-first, active marked, live signal bars for in-range
+   wifi; row click switches; buttons for add / edit / remove; a rescan control.
+2. *Diagnose* (read-only) — Probe (204/captive, shows portal URL + Open), Gateway
+   ping, DNS config. Streaming step output (the diagnostics contract).
+3. *Repair* (mutating, confirmed) — tiered lightest-first: Unblock rfkill, Reset
+   (fresh MAC), Bounce (full stack), DNS override test, Force portal. A "Get me
+   online" button runs =net doctor --fix= (the auto-escalating sequence).
+4. *Speed test* — Run button, progress, down/up/ping result + last-run line.
+
+** Panel state, cancellation, permissions
+State machines for: connection-list loading, rescan-in-progress,
+activation-in-progress, diagnose-running, repair-running, speedtest-running. Plus
+the real terminal states on this two-machine fleet: no-wifi-hardware (desktop →
+ethernet-only view) and missing speedtest-go. (No GPG-key state — there's no
+credential store; secrets live in NM.) ("No NetworkManager" is not a modeled
+state — NM is always present
+on these machines; if nmcli is somehow absent the panel shows a single hard-error
+and exits.) Long operations show elapsed time and are cancellable where the
+underlying op allows (rescan, speedtest, probe); clearly non-cancellable ones
+(an in-flight activation) show elapsed + a disabled control. Permission-denied
+(sudo/polkit declined) is a first-class outcome with the "nothing changed"
+message, never a silent failure.
+
+Interaction-pattern catalog (=~/code/rulesets/patterns/=) principles that apply:
+- transient-state-buttons — all the network levers in one place, reachable by
+  one chord (the bar click), state visible.
+- default-most-common-friction-proportional — connections MRU-ordered so the
+  common pick is first; destructive ops (remove) and privacy-changing ones
+  (reset, override) get a confirm, switching does not.
+- one-prompt-picker-typed-prefix — if the connection picker ever goes
+  keyboard-driven, kind (wifi/eth/saved/in-range) + name in one typed picker.
+
+** Panel UX flow (settle before Phase 2)
+The concrete interaction defaults, so the GTK build isn't inventing them:
+- *Default focus*: the Connections section, current connection's row selected. If
+  the indicator opened the panel because of a captive/no-internet state, focus
+  Diagnose instead with the relevant action highlighted.
+- *Row content*: glyph (signal bars / wired / active check) + name + a secondary
+  line (security type, "active"/last-used). The active row is visually pinned at
+  top of its group.
+- *Buttons*: one *primary* per section (Connections: Connect to the selected row;
+  Diagnose: Run diagnose; Repair: "Get me online"; Speed test: Run). Secondary
+  actions (add / edit / remove / rescan; individual repair tiers) are smaller and
+  grouped.
+- *Disabled rules*: Connect disabled on the already-active row; Repair tiers
+  disabled while one runs; Speed test disabled while running; add/edit disabled
+  for enterprise (with the "edit in nmtui/nmcli" hint).
+- *Confirmations* (exact wording): Reset → "Reset <SSID>? This drops the
+  connection and reconnects with a new MAC."; Bounce → "Restart networking? All
+  links drop briefly."; DNS override → "Temporarily set DNS to 1.1.1.1 for the
+  test? It reverts automatically."; Remove → "Forget <SSID>? The saved password is
+  deleted."
+- *"Get me online" reporting*: shows each escalation step live (Unblock rfkill →
+  Reset → Bounce → Portal) with per-step pass/fail and stops at the first that
+  restores internet or at a terminal state, naming the next action.
+- *After close*: the bar reflects the new state immediately (signal/refresh on
+  next poll); a running speedtest/diagnose keeps running and notifies on finish
+  (panel close doesn't cancel it).
+- *Keyboard*: Esc closes; Tab moves between sections; arrows move rows; Enter
+  fires the section primary; the connection list is type-to-filter.
+
+* Connection management (nmcli)
+
+- Every op via nmcli per the nmcli contract above (terse, escaped, UUID-keyed,
+  bounded =--wait=).
+- MRU ordering from NM's =connection.timestamp= (last activated), descending.
+- Ethernet appears in the list whenever a wired device is present, selectable at
+  any time; switching just brings the chosen connection up.
+- *Mutation safety + rollback*: switching keeps the current connection up until
+  the new one activates successfully (=nmcli --wait 30=); on failure it does not
+  tear down the working link, surfaces the failure, and leaves the prior
+  connection active. =net down= notes that NM may auto-reactivate a profile and
+  reports the post-op active connection so the user isn't surprised. A switch that
+  needs a password it doesn't have prompts (or fails with "password required"),
+  never silently strands. The exact NM command sequence (preflight active-state
+  read → activate target → verify default route → on failure, confirm prior
+  still up) is pinned in the engine and tested against fake nmcli.
+- *Add/edit scope*: open + WPA-PSK only in v1. Existing saved profiles of any
+  type (including enterprise) can be *activated*; editing an enterprise profile
+  shows "edit via nmtui/nmcli" rather than a broken partial form.
+
+* Connection secrets (no separate store)
+
+Per Craig's call: don't build a parallel credential store. Settings and secrets
+live where NetworkManager already keeps them, so there's one source of truth and
+no extra dependency (no GPG, no gpg-agent, no =~/.config/net/connections=).
+
+- *Where secrets live*: =/etc/NetworkManager/system-connections/<name>.nmconnection=,
+  root-owned =0600=, with the PSK/EAP secret stored inline (the default
+  =secret-flags=0= "owned by NM"). That's already secure-at-rest (root-only) and
+  is what =nmcli= reads/writes.
+- *How we touch them*: every add/edit/remove goes through =nmcli= (=connection add
+  / modify / delete=), which writes the =.nmconnection= with the right ownership
+  and perms. We never read or write =system-connections= files directly (root) and
+  never copy a secret out of them.
+- *No export / import / sync* — there's nothing to sync. A new machine gets its
+  connections the way it always has (the user joins, or restores NM profiles),
+  not from a tool-specific vault.
+- *config file*: =~/.config/net/config= still exists, but only for non-secret
+  preferences (speedtest server, redaction flags, probe TTL). It holds no
+  credentials.
+- *No secret leakage*: PSK/EAP never appear in =net=' =--json= output, the event
+  log, or error text (tested) — even though NM is the store, our surfaces must not
+  echo a secret =nmcli= happens to return.
+
+* Speed test
+
+- Backend: *=speedtest-go=* (=--json=, =--server=, =--no-download/--no-upload=),
+  already installed on velox (AUR =speedtest-go-bin=). No new dependency for v1.
+  librespeed-cli is the documented fallback for a self-hosted LibreSpeed server.
+- =net speedtest --json= parses speedtest-go's JSON into the =speedtest= schema.
+- *Server policy*: auto-select nearest by default; allow a pinned server id in
+  =~/.config/net/config=.
+- *Timeout + cancellation*: a hard run timeout (e.g. 60s); the panel run is
+  cancellable (kills the child). Offline / rate-limited / no-server errors map to
+  the failure-message table.
+- *Tests*: fixture JSON (success) and fixture stderr (offline, no server,
+  malformed output) drive =net speedtest= parsing without touching the network.
+
+* Help + documentation
+
+In-app help has three layers, each reachable in the situation it's needed:
+
+- *CLI help (works from a dead-GUI TTY)*: =net --help= lists the subcommands in
+  one screen; =net <cmd> --help= documents each (flags, what it mutates, the
+  console-recovery targets). The Makefile targets are self-describing (=make help=
+  lists =online= / =net-doctor= / etc. with one-line descriptions). This is the
+  layer that matters most when you're at a console with no network.
+- *Panel help (in the GUI)*: a small =?= affordance in the panel header opens an
+  inline help pane — what each section does, which Repair actions mutate state,
+  what the indicator glyphs/colors mean. Per-control tooltips on the less-obvious
+  buttons (rfkill, bounce, DNS override). No external help browser.
+- *User guide (the durable doc)*: a README / docs page covering every command,
+  the indicator states + glyphs, the panel sections, the config file keys, the
+  recovery make targets, troubleshooting (the failure-message table), and
+  rollback. Written so a future session — or Craig six months out — can operate
+  and recover the module from the doc alone.
+
+The failure-message table above is the single source of truth for the
+troubleshooting text; the guide and the panel help both render from it rather
+than restating it.
+
+* Enhancement radar
+
+Low-cost adjacent affordances, each dispositioned so cheap wins aren't lost and
+the v1 panel stays focused. (Several are already in v1 by virtue of other
+sections; marked here so the consideration is visible.)
+
+| Enhancement                         | Disposition | Reason                                                 |
+|-------------------------------------+-------------+--------------------------------------------------------|
+| Open / copy portal URL              | v1          | already in the captive flow; trivial Open + Copy       |
+|-------------------------------------+-------------+--------------------------------------------------------|
+| Forget network                      | v1          | it's the remove op, already specced                    |
+|-------------------------------------+-------------+--------------------------------------------------------|
+| Rescan now                          | v1          | already a Connections control                          |
+|-------------------------------------+-------------+--------------------------------------------------------|
+| Retry with hardware MAC             | v1          | captive already has --hardware-mac; expose in Repair   |
+|-------------------------------------+-------------+--------------------------------------------------------|
+| Pin speedtest server                | v1          | already a config key                                   |
+|-------------------------------------+-------------+--------------------------------------------------------|
+| Copy redacted doctor report         | v1          | cheap, serves the observability/support goal           |
+|-------------------------------------+-------------+--------------------------------------------------------|
+| Show last good network / result     | vNext       | needs small history persistence                        |
+|-------------------------------------+-------------+--------------------------------------------------------|
+| Watch mode for net doctor           | vNext       | a --watch loop; handy at a TTY, not v1-critical        |
+|-------------------------------------+-------------+--------------------------------------------------------|
+| Actionable desktop notifications    | vNext       | dunst supports actions; extra wiring                   |
+|-------------------------------------+-------------+--------------------------------------------------------|
+| Keyboard connection picker (fuzzel) | vNext       | the typed-prefix pattern; panel covers v1              |
+|-------------------------------------+-------------+--------------------------------------------------------|
+| QR-code share / import WiFi         | rejected    | low value for a personal 2-machine setup; phones do QR |
+|-------------------------------------+-------------+--------------------------------------------------------|
+
+* Waybar wiring
+
+- Replace =custom/netspeed= with =custom/net= in the bar's module list (same
+  slot).
+- Module def: =exec: waybar-net=, =return-type: json=, =interval: 2=, a =signal=
+  for on-demand refresh (next free signal after wtimer's 14), =on-click=,
+  =on-click-right=, =on-click-middle= per the phase-aware interactions (each
+  dispatches a detached job, never blocks).
+- Remove the old =on-click: pypr toggle network= scratchpad only once the panel
+  replaces it (Phase 2); Phase 1 keeps it as the interim manager.
+
+* Testing plan (TDD)
+
+- *Engine (normal)* — fake =nmcli= + =curl= + =speedtest-go= on PATH; assert
+  command sequences and parsed/emitted JSON for status, list, up/down,
+  add/edit/remove, probe, diagnose, repair, speedtest. Pure state/format
+  functions tested directly. JSON schemas locked by example.
+- *Portal parser* — already covered in =tests/captive= (Normal/Boundary/Error +
+  the real SONIFI body). The engine's native probe reuses the same cases.
+- *nmcli parsing* — escaped colon/backslash/newline in SSID, duplicate names,
+  hidden SSID, non-ASCII, wired-mid-session, multi-active (wifi+tether).
+- *Failure + concurrency (the risky classes)* — slow/hung nmcli/curl/speedtest
+  (degraded state within budget), concurrent =net status= probe refresh
+  (single-flight), corrupt cache (recovered), stale cache after SSID change
+  (invalidated), permission denied / sudo declined, DNS-override cleanup failure
+  (=cleanup-unverified=), NM partial activation (rollback keeps prior link),
+  secret redaction, missing speedtest-go, no wifi hardware, rfkill soft/hard
+  block.
+- *Doctor classification* — fixture-driven =net doctor= over fake nmcli/curl
+  asserting the right terminal classification + that =--fix= stops before
+  destructive repairs: auth failures (=needs-user-action=), upstream/AP failure
+  (=upstream-not-local=), VPN-routed failure (=deferred/vpn=), and the DNS classes
+  (hijack → portal, broken-but-1.1.1.1-works → offered override, egress-blocked →
+  upstream). Assert the failure-mode coverage table's "detects / repairs / terminal
+  action" holds for each row.
+- *Indicator* — drive =net status --json= through =waybar-net=, assert the JSON
+  per state (online / captive / no-internet / degraded / wired / disconnected /
+  rfkill), iface override via env.
+- *Panel* — pocketbook-style: backing logic (list ordering, op dispatch,
+  state-machine transitions), not GTK widgets.
+- *NM secrets / no-leak* — add/edit writes the secret into NM via nmcli (asserted
+  against fake nmcli, never to a tool-owned file); assert no PSK/EAP appears in any
+  =--json=, log line, or error (there is no credential store to round-trip).
+- *Live checklist (gated out of the suite)* — a "Manual testing and validation"
+  task per phase for the real-network states (captive at a hotel, no-internet,
+  switch under load, reset, speedtest) that can't be faked.
+
+** Harness + coverage gate
+The concrete contract, matching the repo's existing convention (not pytest — the
+dotfiles suites are =unittest=, run by =make test= as =python3 -m unittest= over
+=tests/*/test_*.py=; 33 suites today):
+- *Framework*: =unittest=. Each suite is =tests/<name>/test_<name>.py=
+  (=tests/net/=, =tests/waybar-net/=), collected by the existing =make test= loop
+  — no new runner, no pytest dependency.
+- *Fakes on a temp PATH*: =fake-nmcli=, =fake-curl=, =fake-speedtest-go=,
+  =fake-rfkill=, =fake-resolvectl= live as executable stubs in =tests/<name>/=
+  (the =tests/layout-navigate/fake-hyprctl= pattern). A fixture file encodes the
+  command→canned-output map and the stub appends each invocation to a log the test
+  asserts against. Subprocess timeouts are simulated by a stub that sleeps past the
+  budget; =net status= must still return the degraded state.
+- *Waybar wrappers end-to-end*: =waybar-net= is run as a subprocess with the fake
+  PATH and the env overrides (iface, cache path), asserting the emitted JSON — same
+  as =tests/waybar-netspeed=.
+- *Coverage*: coverage.py is absent system-wide (and not importable), so coverage
+  runs in a throwaway venv (=python3 -m venv=, =pip install coverage=, =coverage
+  run -m unittest=, =coverage report=) — the method the wtimer suite used (95%).
+  Target: *branch* coverage over =net/= and the wrapper, ≥ 90% on the pure
+  classifier/parser modules.
+
+** Coverage as a gap-finder, not a number (per phase)
+Line coverage alone misses the branches that matter here, so each phase ends with
+a *coverage-gap pass*, not just a percentage:
+- After the first green run, read the branch report and map every uncovered branch
+  to either a new test or a consciously-excluded live-only behavior (with a comment
+  or a Manual-testing entry naming it).
+- *Branch coverage is required* for the pure logic: the doctor classifier (every
+  outcome — fixable / needs-user-action / upstream-not-local / deferred-vpn), the
+  cleanup-unverified path, the redaction paths, the degraded hot-path fallback, the
+  timeout branches, and the portal/nmcli parsers.
+- A phase isn't "done" until its coverage-gap pass is recorded — uncovered logic is
+  either tested or explicitly excused, never silently uncovered.
+
+* Files touched (planned, all in =~/.dotfiles=)
+
+- =net/= package (src-layout, like pocketbook) — engine + panel.
+- =hyprland/.local/bin/waybar-net= — the indicator (replaces =waybar-netspeed=).
+- =hyprland/.local/bin/net= — engine CLI entry (console-script shim).
+- =hyprland/.config/waybar/config= — swap =custom/netspeed= → =custom/net=;
+  remove =custom/airplane=.
+- =hyprland/.config/waybar/style.css= — captive / no-internet / degraded /
+  rfkill classes; remove airplane classes.
+- =tests/net/=, =tests/waybar-net/= — suites.
+- =captive= — refactor: extract probe + reset into functions callable
+  non-interactively (a =--json= probe mode) so the engine reuses them.
+- =~/.config/net/config= — seed config (probe TTL, speedtest server, redaction
+  flags). No secrets; not a credential store.
+- dotfiles =Makefile= — add the console-recovery targets (=online=, =net-doctor=,
+  =net-status=, =net-diagnose=, =net-portal=, =net-reset=, =net-bounce=).
+- *Deletions once net ships* (the airplane module is absorbed):
+  =hyprland/.local/bin/waybar-airplane=, =hyprland/.local/bin/airplane-mode=,
+  =tests/waybar-airplane/=, =tests/airplane-mode/=, and the =custom/airplane=
+  module + its css.
+- archsetup Hyprland step — add =gtk4-layer-shell=, =python-gobject=,
+  =speedtest-go-bin= to the install lists (the only archsetup change; no =gpg=
+  added, secrets stay in NM's store).
+
+* Resolved decisions (Craig's calls + this response)
+
+1. Panel UI tech → GTK4 + gtk4-layer-shell, shared pocketbook scaffold (one
+   panel shell, reused by the desktop-settings sibling).
+2. Engine language → Python =net= package; shells out to =captive= for the
+   portal-force flow, native cheap probe for the bar path.
+3. Connectivity probe → split cadence (fast link poll every 2s + slow cached
+   internet/captive probe, TTL ~45s) with single-flight + atomic cache.
+4. No keyboard-modifier clicks (waybar can't qualify them) — the panel hosts the
+   rich actions; bar clicks dispatch detached jobs (phase-aware).
+5. No separate credential store (Craig's call, cj). Secrets live in NM's own
+   =system-connections= (root =0600=, inline), touched via nmcli. No GPG, no
+   gpg-agent, no =~/.config/net/connections=. Supersedes the earlier GPG-store
+   design.
+6. =custom/netspeed= absorbed into =custom/net=; throughput moves to the tooltip.
+7. Speed-test backend → =speedtest-go= (already installed), not a new
+   librespeed-cli dependency; librespeed-cli is the self-hosted fallback.
+8. Code lives in the dotfiles repo; archsetup only installs deps.
+9. v1 add/edit scope = open + WPA-PSK; enterprise/802.1X is activate-only,
+   add/edit is vNext (settled by Craig 2026-06-29 — no enterprise networks in his
+   history, so the form would be unused UI).
+10. =net doctor= is in v1 (Craig's call, cj) — a one-shot diagnose+fix mode,
+    reachable from a TTY via =make online= / =make net-doctor=. (The earlier
+    "defer the doctor/bundle command" decision is reversed.)
+11. Diagnose (read-only) and Repair (mutating, confirmed) are separated in the
+    panel and the CLI; Repair is tiered lightest-first (rfkill → reset → bounce).
+12. =custom/net= absorbs the airplane module (Craig's call, cj). *As built
+    (2026-06-29, option 1): display-only.* net shows the airplane state (reads
+    the airplane-mode state file); the =airplane-mode= low-power toggle is kept
+    (radios + CPU + brightness + services is not a network concern) and moved to
+    =custom/net='s right-click + signal 15. Only the redundant display pieces —
+    =waybar-airplane=, =custom/airplane=, and the retired =waybar-netspeed= —
+    plus their tests/css were deleted. The earlier "delete airplane-mode" framing
+    is superseded.
+13. Repair includes a full-stack bounce and an rfkill-unblock (Craig's calls,
+    cj) — the latter recovers the framework-laptop post-power-loss soft-block.
+14. VPN / WireGuard is a planned Phase 5 (Craig's call, cj), not a permanent
+    exclusion.
+
+* Implementation phases
+
+- *Phase 1 — Indicator + console recovery (task #C).* =net status= + =net probe=
+  (native cheap probe, reusing captive's logic) + the =captive= probe refactor +
+  =waybar-net= + the split-cadence cache (single-flight, atomic, stale classes) +
+  CSS states (incl. rfkill) + performance budget. Plus the CLI-only recovery path:
+  =net repair= tiers (rfkill / reset / bounce), =net doctor [--fix]=, and the
+  Makefile targets (=make online= etc.) — all testable without the GTK panel.
+  Absorbs the airplane state and removes the standalone airplane module. Interim
+  left-click keeps the existing scratchpad until the panel lands.
+  - *Acceptance*: fresh-login waybar smoke test shows correct state on
+    online/captive/no-internet/wired/rfkill; =net status= stays within budget
+    under a fake slow nmcli (degraded state); =net doctor --fix= recovers a
+    soft-blocked radio from a TTY; the live captive checklist passes at a real
+    portal; the airplane state works and the old airplane module is gone;
+    reverting = swap =custom/netspeed= + =custom/airplane= back.
+- *Phase 2 — Panel shell + connection management (task #B core).* GTK4
+  layer-shell scaffold + =net list/up/down/add/edit/remove/rescan= + MRU list +
+  mutation safety/rollback + panel state machines.
+  - *Acceptance*: switch wifi↔wifi and ethernet↔wifi without stranding; a failed
+    switch leaves the prior link up; add/edit open + WPA-PSK writes the secret to
+    NM; remove confirms; panel states render for loading/rescan/activation.
+- *Phase 3 — Diagnostics + speed test in the panel.* Wire =net diagnose= /
+  =net repair= / =net doctor= / =net portal= / =net speedtest= into the Diagnose
+  vs Repair sections; the "Get me online" button; portal Open button; speedtest
+  progress + cancel.
+  - *Acceptance*: diagnose runs read-only; each repair tier confirms + verifies
+    cleanup (DNS override reverts, shown); speedtest result parses from
+    speedtest-go and a fixture-driven failure shows the right message.
+- *Phase 4 — Docs + rollout.* In-app help (=net --help= / per-command help, the
+  panel help affordance), README/user-guide (commands, panel, config,
+  troubleshooting, the make targets, rollback), and the manual dep step on ratio.
+  - *Acceptance*: =net --help= and each subcommand's help are complete; the
+    user-guide covers every command + the recovery targets; ratio rollout
+    documented.
+- *Phase 5 — VPN / WireGuard (future).* Fold the existing archsetup wireguard
+  tooling into the same panel + CLI (=net vpn ...=). Out of the v1 milestone;
+  specced separately when picked up.
+
+* Open items / risks
+
+- gtk4-layer-shell dropdown anchoring under a waybar module needs the same
+  positioning work pocketbook solved; reuse it. (Phase 2.)
+- The =captive= refactor must keep the standalone CLI behavior identical while
+  exposing a non-interactive =--json= probe; covered by the existing
+  =tests/captive= suite plus new probe-mode tests. (Phase 1.)
+- speedtest-go server selection variance (nearest-server flor) — pin a server in
+  config if results are noisy. (Phase 3.)
+- The background-probe kick from =net status= must be truly non-blocking (spawn +
+  detach); enforced by the single-flight lock and the performance benchmark test.
+
+* Rollback
+
+Each phase is independent. The indicator (Phase 1) is a drop-in replacement for
+=custom/netspeed= (and =custom/airplane=); reverting is swapping those modules
+back in the config and restoring their scripts. The panel is additive — not
+wiring its clicks leaves the bar working as before. No credential store to roll
+back (secrets stay in NM throughout).
+
+* Review findings [31/31]
+
+** DONE Define the structured diagnostics contract :blocking:
+The spec says the engine "emits JSON" and that diagnostics "reuse =captive=
+verbatim", but the current =~/.dotfiles/common/.local/bin/captive= flow is a
+human-readable bash script that mixes diagnostics, sudo prompts, DNS mutation,
+browser launch, and terminal prose. A GTK panel cannot reliably turn that into
+clear state, progress, cancellation, or useful error messages. Define the
+machine contract before implementation: every diagnostic step should have a
+stable id, status (=pending/running/pass/warn/fail/skipped=), redacted evidence,
+elapsed time, safety outcome, and next action. Keep =captive= as the interactive
+CLI, but either refactor reusable probe/reset functions behind =net diagnose
+--json= or make =captive= expose a non-interactive JSON mode. This blocks the
+panel and logging work because otherwise the implementer must invent the
+boundary.
+
+Disposition: accept — added the "Diagnostics contract" section (per-step id /
+status / evidence / elapsed / safety / next_action) and the =captive= =--json=
+probe-mode refactor under Architecture + Files touched.
+
+** DONE Specify user-facing failure messages and recovery actions :blocking:
+The spec names failure states like =no-internet=, =captive=, failed probe,
+failed reset, missing DNS, and missing speed-test backend, but it does not define
+the messages the user sees or what each message tells them to do next. For this
+feature, "error" is not enough: a user needs to know whether WiFi is associated,
+whether DHCP succeeded, whether DNS is hijacked/broken, whether HTTP is
+intercepted, whether sudo was declined, whether a command timed out, and whether
+the system was left unchanged or partially changed. Add a message table for the
+indicator, panel, and CLI with: failure class, visible text, evidence included,
+redaction rule, and next action. This is blocking because UX quality here is the
+product, not an implementation detail.
+
+Disposition: accept — added the "Failure states, messages, recovery" section
+covering each class, the visible message, the "what changed" residue note, and
+the next action across indicator/panel/CLI.
+
+** DONE Define the debug log and redacted support bundle :blocking:
+There is no observability section. When this fails in a hotel or cafe, an agent
+needs enough evidence to diagnose it without rerunning destructive actions. Add
+log location, rotation/retention, JSONL event schema, command argv logging,
+exit-code/stderr capture, elapsed time, selected iface, NM active connection
+UUID, probe URL class, HTTP code, redirect host, DNS servers, and cache
+read/write events. Also define a =net doctor --json= or =net debug-bundle=
+command that emits redacted status, recent log events, dependency versions, and
+a reproduction command. Redact SSID if configured, MAC addresses, portal query
+tokens, PSKs, EAP identities/passwords, IPs when requested, and all GPG/NM
+secrets. This blocks implementation readiness because post-failure diagnosis is
+currently left to ad hoc terminal spelunking.
+
+Disposition: modify — accepted the JSONL event log, the schema, and the redaction
+rules in full (new "Observability" section). Deferred the dedicated =net
+debug-bundle= / =net doctor= command to vNext: for a single-user tool =net
+diagnose --json= (the snapshot) plus the event log (the history) cover
+post-failure diagnosis; a bundle command is gold-plating for v1. Recorded under
+Out + Resolved decision 10.
+
+** DONE Pin the nmcli parsing and timeout contract :blocking:
+The spec lists nmcli operations but not the exact fields, output modes, escaping
+rules, ID semantics, or timeouts. This is risky because SSIDs and connection
+names can contain spaces, colons, duplicates, hidden names, and non-ASCII; the
+current =waybar-netspeed= already had an SSID parsing bug. The nmcli manual
+documents =--terse=, =--get-values=, =--escape=, =--wait=, ID/UUID/path
+selection, =passwd-file=, and built-in connectivity states
+(=none/portal/limited/full/unknown=) at
+https://man.archlinux.org/man/nmcli.1.en. The spec should require UUIDs for
+saved-profile operations, explicit =--wait= budgets, parser tests for escaped
+colons/backslashes/newlines/duplicate names/hidden SSIDs, and a decision on when
+to use or ignore =nmcli networking connectivity [check]=. This is blocking
+because the command wrapper is the core reliability boundary.
+
+Disposition: accept — added the "nmcli contract" section: terse + =--escape= +
+=--get-values=, UUID-keyed ops, explicit =--wait= budgets, NM connectivity as a
+cheap hint (our probe authoritative), and the parser test matrix.
+
+** DONE Define cache concurrency, atomicity, and stale-state behavior :blocking:
+=net status= may spawn =net probe= whenever the cache is stale, but the spec
+does not define locking, process coalescing, atomic writes, crash cleanup, or
+what happens when the probe hangs. With a 2s Waybar interval, a bad network could
+start overlapping probes, corrupt the runtime cache, or keep showing stale
+"online" while the link is gone. Add a single-flight lock under
+=$XDG_RUNTIME_DIR/waybar=, atomic write+rename for cache updates, max probe
+runtime, stale age classes (fresh/stale/expired/unknown), cache invalidation on
+iface/SSID/connection UUID change, and tests for concurrent =net status= calls.
+This blocks the fast-path design because it is the main performance and
+correctness risk.
+
+Disposition: accept — added "Concurrency, atomicity, staleness" under the
+Connectivity model: flock single-flight, temp+rename atomic write, ≤6s probe
+timeout, fresh/stale/expired/unknown classes, iface/SSID/UUID invalidation, stale
+lock reclaim, plus concurrency tests in the test plan.
+
+** DONE Bound hot-path performance with measured budgets :blocking:
+The spec says the cheap poll should be sub-100ms, but the proposed fast path
+still may call multiple =nmcli= commands every two seconds, read sysfs, parse
+throughput, and maybe spawn a background probe. The existing =waybar-netspeed=
+had a deliberate sleep for throughput sampling; replacing it must define how
+throughput is sampled without sleeping in the bar path. Add a per-command budget
+for =waybar-net= and =net status=, a maximum number of subprocesses on the hot
+path, a timeout for every subprocess, benchmark tests with fake slow =nmcli=,
+and a rule that the indicator emits a degraded JSON state rather than blocking.
+This is blocking because Waybar custom modules can visibly freeze or lag when
+their exec path stalls.
+
+Disposition: accept — added the "Performance budgets" section: <100ms typical /
+<250ms worst, throughput sampled across the poll interval (no in-process sleep),
+one nmcli call max on the hot path, timeouts on every subprocess, the degraded
+state, and a fake-slow-nmcli benchmark test.
+
+** DONE Make click actions non-blocking and visible :blocking:
+Waybar right-click runs =net reset= and middle-click runs =net portal= directly.
+Those operations can require sudo, open browsers, mutate DNS, delete/recreate NM
+profiles, or hang on network commands, but Waybar click handlers provide no
+panel, terminal, progress, or cancellation surface by default. Define whether
+right/middle click instead opens the panel focused on the action, dispatches a
+background job with notifications, or is removed from v1. If kept, specify
+single-flight behavior, how sudo/polkit prompts surface, how success/failure is
+reported, and how the user can inspect logs. This blocks UX readiness because
+the fastest remediation path is currently the easiest place to hide failure.
+
+Disposition: modify — accepted the concern; made the interactions phase-aware and
+non-blocking. Every click dispatches a detached, single-flight background job and
+reports via =notify=; sudo surfaces through polkit/the normal prompt; failures go
+to the notify + the event log. In Phase 1 (no panel) left-click runs probe +
+notify and keeps the scratchpad; from Phase 2 left-click opens the panel focused
+on the action. Recorded in the Indicator "Interactions" subsection.
+
+** DONE Specify connection mutation safety and rollback :blocking:
+The spec says row click switches connections and remove gets a confirm, but it
+does not define what happens when a switch partially succeeds, disconnects the
+current working link, needs a password, loses the default route, or triggers
+auto-activation. The nmcli manual warns that =connection down= does not prevent
+future auto-activation and may internally block a profile until user action.
+Define preflight, the exact NM command sequence, whether the old active
+connection is kept until the new one proves usable, when rollback is attempted,
+how long activation waits, and what the panel says when rollback fails. This is
+blocking because the module can strand the user offline.
+
+Disposition: accept — added "Mutation safety + rollback" under Connection
+management: keep the prior link up until the target activates (=--wait 30=), no
+teardown on failure, password-required surfaced not stranded, =net down= reports
+post-op active state + the auto-reactivation caveat, and the pinned NM command
+sequence is tested against fake nmcli.
+
+** DONE Define the credential-store security model :blocking:
+The GPG store is described as optional and default-unencrypted, but the spec does
+not define file modes, schema, secret-source rules, import/export prompts,
+recipient verification, stale secret handling, or what is logged. It also says
+NM remains source of truth while the user-owned store contains PSK/EAP secrets,
+which creates two truth sources for sensitive data. Add a precise schema,
+=0600= file creation with parent-dir permissions, encrypted-recipient checks,
+plaintext warning text, explicit opt-in flow, redaction requirements, behavior
+when NM has a secret not in the store, behavior when the store has a secret NM
+rejects, and tests for no secret leakage in JSON/logs/errors. This blocks Phase
+4 and the full spec because otherwise the implementer must make security
+decisions mid-code.
+
+Disposition: accept — rewrote "Credential storage" with the versioned schema,
+=0600= file / =0700= dir, recipient verification on opt-in, the plaintext
+warning, secret-source rule (entered/exported, never harvested from root store),
+the two-source reconciliation policy (NM wins live, store wins for what NM
+lacks, stale-secret flagging), and the no-leak tests.
+
+** DONE Define EAP, enterprise WiFi, and unsupported connection behavior :blocking:
+The store says "PSK/EAP" and connection management says add/edit, but there is
+no v1 contract for WPA-Enterprise fields, certificates, identity vs anonymous
+identity, hidden networks, static IP, proxy settings, metered flags, MAC
+randomization, or 802.1X prompt behavior. Either scope v1 to open/WPA-PSK plus
+existing saved-profile activation, or define the minimum EAP form and the
+unsupported-state messages. This blocks add/edit/import because enterprise WiFi
+is too sensitive to hand-wave.
+
+Disposition: modify (scope) — scoped v1 to open + WPA-PSK add/edit, with
+*activation* of any existing saved profile (including enterprise). Enterprise /
+802.1X add/edit, static-IP, proxy, metered, and MAC-randomization editing are
+vNext, shown as "edit via nmtui/nmcli". Recorded in Scope/Out, Connection
+management, and Resolved decision 9.
+
+** DONE Split read-only diagnostics from mutating remediation :blocking:
+The panel's diagnostics section includes probe, bounce/reset, gateway ping, and
+DNS override test in one area, while =captive= currently performs resets and
+temporary DNS changes as part of its flow. Users need to know which buttons are
+read-only and which mutate NM profiles, MAC mode, DNS, or browser state. Add
+separate "Diagnose" and "Repair" actions, confirmations for destructive or
+privacy-changing operations, explicit cleanup verification for DNS override, and
+a terminal state when cleanup is unverified. This blocks readiness because
+network repair must not surprise the user or leave hidden residue.
+
+Disposition: accept — split the panel into a read-only Diagnose section and a
+confirmed, mutating Repair section (and split the CLI into =net diagnose= vs =net
+repair=). Added =cleanup_verified= + a terminal =cleanup-unverified= state to the
+diagnostics contract.
+
+** DONE Define panel state, cancellation, and permissions UX :blocking:
+The panel sections list buttons and a streaming output area, but not loading
+states, disabled states, empty states, keyboard/focus behavior, cancellation, or
+permission-denied handling. Add panel state machines for connection list loading,
+rescan in progress, activation in progress, diagnostics running, speedtest
+running, and no NetworkManager/no WiFi/no permissions/no GPG key/no
+librespeed-cli. Each long operation should be cancellable where possible or
+clearly non-cancellable with an elapsed-time display. This blocks the GTK work
+because without it the implementer must invent the user flow.
+
+Disposition: modify — accepted the state-machine requirement (added "Panel state,
+cancellation, permissions"), but scoped the state set to what can actually occur
+on the two-machine fleet: dropped "no NetworkManager" as a modeled state (NM is
+always present; a missing nmcli is a single hard-error exit) and kept
+no-wifi-hardware, missing speedtest-go, no-GPG-key, plus the in-progress states
+with elapsed-time + cancellation where the op allows.
+
+** DONE Verify speed-test dependency, server choice, and failure contract :blocking:
+The spec chooses =librespeed-cli= and notes availability/default-server research
+as an open risk, but Phase 3 still depends on parsing its JSON and showing
+progress. I checked the upstream project page
+(https://github.com/librespeed/speedtest-cli) and the AUR URL named by search is
+not sufficient as a verified package/install contract in this spec. Add the
+exact package name/source to install, command version expected, JSON shape,
+server-selection policy, timeout, cancellation behavior, offline/rate-limited
+messages, and tests with fixture JSON and fixture stderr. This blocks Phase 3
+because speed-test failure modes are otherwise undefined.
+
+Disposition: modify — verified live and changed the backend: =speedtest-go= (AUR
+=speedtest-go-bin=, 1.x) is already installed on velox and supports =--json=,
+=--server=, =--no-download/--no-upload=, so v1 needs no new dependency.
+librespeed-cli (AUR =librespeed-cli= / =-bin=) is the documented self-hosted
+fallback. Added the "Speed test" section with server policy, timeout,
+cancellation, the failure-message mapping, and fixture-JSON/stderr tests.
+
+** DONE Define dependency installation and repo boundaries :blocking:
+The files touched section alternates between archsetup paths and the external
+dotfiles repo, while pocketbook has been folded into this repo and its previous
+archsetup provisioning was intentionally removed. The spec should state where
+the =net= package actually lives, which repository owns the scripts/tests,
+whether =gtk4-layer-shell=, =python-gobject=, =librespeed-cli=, =gpg=, =nmcli=,
+=curl=, and =resolvectl= are installed by archsetup or assumed present, and the
+Makefile targets for test/lint/install. This blocks implementation because the
+current path plan can produce code that is not installed on a fresh machine.
+
+Disposition: accept — added the "Repository + dependencies" section: all code in
+=~/.dotfiles= (=net/= package in-tree like pocketbook, scripts in the hyprland
+tier, tests under =tests/=), archsetup owns only the dep install
+(=gtk4-layer-shell=, =python-gobject=, =speedtest-go-bin=; nmcli/curl/resolvectl
+already present), Makefile =make test= collects the package suite, and a
+daily-drivers note for ratio. Rewrote Files touched to match.
+
+** DONE Expand the test plan for failure, concurrency, and live verification :blocking:
+The testing plan covers normal parsing and fake command sequences, but it misses
+the riskiest behaviors: slow/hung =nmcli=/=curl=/=librespeed=, concurrent
+=net status= cache refresh, corrupt cache, stale cache after SSID change,
+permission denied, sudo declined, DNS override cleanup failure, NM partial
+activation, duplicate connection names, secret redaction, missing optional
+dependencies, no WiFi hardware, wired+tether+WiFi ambiguity, portal redirect
+tokens, and Waybar click handlers. Add unit/fixture tests for each class plus a
+manual/live checklist gated out of the normal suite. This is blocking because
+the current plan would leave the exact "things that can go wrong here" mostly
+untested.
+
+Disposition: accept — rewrote the Testing plan with the "Failure + concurrency"
+class (slow/hung commands, single-flight, corrupt/stale cache, perm-denied,
+cleanup-failure, partial activation, redaction, missing deps, no-wifi,
+multi-active) and a per-phase live checklist gated out of the suite.
+
+** DONE Define status JSON schemas and compatibility rules
+The spec says all subcommands take =--json= but does not define schemas. Add
+versioned JSON examples for =status=, =probe=, =list=, =diagnose=, =speedtest=,
+and error envelopes, including nullable fields and unknown/degraded states. This
+is non-blocking for product direction but should be fixed before code so tests
+can lock the CLI contract.
+
+Disposition: accept — added the "JSON schemas" section with versioned (=v:1=)
+envelopes for status / probe / list / diagnose / speedtest and a shared error
+envelope, including the degraded/unknown states.
+
+** DONE Rename or alias the phasing section for workflow compatibility
+The spec has a usable =Phasing= section, but the spec-review workflow expects an
+=Implementation phases= section that can be lifted into =todo.org=. Rename it or
+add an alias heading during response. This is non-blocking because the existing
+phase decomposition is understandable, but aligning the heading prevents future
+workflow friction.
+
+Disposition: accept — renamed =Phasing= → =Implementation phases= and added
+per-phase acceptance criteria.
+
+** DONE Add documentation and rollout acceptance checks
+Rollback is described, but docs and rollout are thin. Add README/user-guide
+updates for commands, panel behavior, config file, GPG opt-in, troubleshooting,
+and rollback; add acceptance checks for each phase, including a fresh-login
+Waybar smoke test and restoring =custom/netspeed=. This is non-blocking but
+important for handing the feature to a future session without re-discovery.
+
+Disposition: accept — added per-phase acceptance criteria under Implementation
+phases (incl. the fresh-login waybar smoke test and the =custom/netspeed=
+restore), a Phase 4 "Docs + rollout", and (answering Craig's cj follow-up) a
+dedicated "Help + documentation" section with the three help layers (CLI help,
+panel help affordance, user guide).
+
+** DONE Add a failure-mode coverage table :blocking:
+The spec now names many individual network failures, but it still does not carry
+one compact coverage matrix that says, for each common failure mode, whether
+=net diagnose= detects it, whether =net doctor --fix= can repair it, and what
+terminal user action remains when it cannot. Add a table covering at least:
+rfkill soft block, rfkill hard block, no WiFi hardware, associated/no DHCP,
+gateway unreachable, captive DNS hijack, broken DNS where 1.1.1.1 works, HTTP
+portal, HTTP interception without a parseable portal URL, upstream/AP outage,
+wrong WPA password or missing secret, enterprise auth/cert failure, duplicate
+SSID/connection-name ambiguity, hidden SSID, multiple active links, wedged
+NetworkManager, slow/hung command, stale/corrupt cache, DNS cleanup failure,
+missing speedtest backend, and VPN/routing interference. This blocks because
+Craig asked for confidence that the diagnostics and doctor cover the real field
+failures, and prose scattered across sections is too easy to misread.
+
+Disposition: accept — added the "Failure-mode coverage" section: a 22-row table
+(every mode the finding named) with detect / doctor-fix / terminal-action
+columns, conformed to the org-table standard (rules under every row, ≤120).
+
+** DONE Pin DNS repair semantics in doctor :blocking:
+The spec diagnoses DNS hijack, broken hotel DNS, and the temporary 1.1.1.1
+override test, but =net doctor --fix= does not say whether it merely recommends
+the override, applies a temporary override during recovery, or leaves DNS alone
+after diagnosis. Define the exact behavior for each DNS class: captive hijack
+should open the portal, broken DNS where 1.1.1.1 works should either offer an
+explicit temporary repair with cleanup verification or recommend the command,
+and port-53/egress blocking should stop as upstream/not locally fixable. This is
+blocking because DNS is one of the most common "connected but unusable" failures
+and the current doctor contract is ambiguous.
+
+Disposition: accept — added "DNS handling in doctor (explicit per class)" under
+the new Doctor section: hijack → open portal (no DNS mutation); broken-but-1.1.1.1
+→ explicit temporary override with cleanup verification under =--fix=, recommend
+otherwise; egress-blocked → terminal =upstream-not-local=.
+
+** DONE Make auth failures terminal user-action states :blocking:
+Wrong WPA password, missing NM secret, locked keyring/polkit denial, enterprise
+802.1X certificate/identity failure, and portal login-required are not fixed by
+resetting or bouncing NetworkManager. The doctor sequence should classify these
+as =needs-user-action= terminal states, stop before looping through destructive
+repairs, and tell the user the exact next action (enter password, edit profile in
+=nmtui=/=nmcli=, accept portal terms, provide cert/identity, or retry with
+admin auth). This blocks because repeated reset/bounce against auth failures is
+slow, noisy, and can make the network state worse without helping.
+
+Disposition: accept — added the =needs-user-action= terminal outcome to the
+Doctor section: wrong password / missing secret / keyring-or-polkit denial /
+802.1X cert-or-identity failure / portal-login-required all stop the doctor before
+any destructive repair and name the exact next step.
+
+** DONE Define upstream/AP/provider failure terminal states :blocking:
+Some failures are not client-repairable: AP has no uplink, hotel gateway is
+down, DHCP server is broken, gateway drops traffic, ISP outage, or captive
+portal backend is failing. The spec should define how =diagnose= proves "local
+link is up but upstream is broken" and how =doctor --fix= stops after local
+repairs are exhausted with a clear message like "local repairs tried; likely
+upstream/AP/provider" plus the evidence. This blocks because users need to know
+when to stop poking the laptop and switch networks or contact the venue.
+
+Disposition: accept — added the =upstream-not-local= terminal outcome: diagnose
+proves link-up + IP + gateway-reachable but no route out and no captive redirect;
+=doctor --fix= stops after local repairs with "local repairs tried; likely
+upstream/AP/provider" + evidence → switch network / contact venue.
+
+** DONE Decide how VPN and policy routing affect v1 diagnosis
+VPN/WireGuard management is Phase 5, but active VPNs, policy routes, DNS
+overrides, and firewall killswitches can break apparent internet access in v1.
+The current spec does not say whether v1 detects active VPN/policy routing and
+classifies "network is fine, VPN route/DNS is broken" separately from WiFi
+failure. Add either a v1 diagnostic check for active VPN/default-route/DNS
+ownership with a "deferred repair" outcome, or explicitly state that VPN-routed
+failures are out of scope and may be misclassified. This is blocking if Craig
+expects the module to diagnose normal daily-driver network failures while VPN
+tooling remains separate.
+
+Disposition: accept (chose the detect-and-classify option) — v1 detects an active
+VPN / non-NM default route / non-NM DNS owner and classifies =deferred/vpn= ("link
+is fine; internet is VPN-routed"), distinct from a WiFi failure. v1 does not
+repair it (VPN management is Phase 5); it names the VPN as the likely owner and
+stops. Added to the Doctor section + the coverage table + a doctor-classification
+test.
+
+** DONE Remove stale GPG-store references from the resolved spec
+The spec now decides "no separate credential store; secrets live in
+NetworkManager", but the Testing plan still mentions =gpg round-trip= and =GPG
+store= tests, and the panel-state list still mentions a no-GPG-key state. Remove
+those stale references and replace them with NM-secret/no-secret-leak tests.
+This is non-blocking for product behavior but blocking for implementation
+clarity: otherwise tests will be written for a credential store that no longer
+exists.
+
+Disposition: accept — replaced the Testing-plan =gpg round-trip= / =GPG store=
+bullets with an "NM secrets / no-leak" test (add/edit writes the secret via nmcli;
+assert no PSK/EAP in any JSON/log/error; no store to round-trip) and dropped the
+=no-GPG-key= panel state. Residue from the cj-comment pass that dropped the store.
+
+** DONE Reconcile status, goal, and task text before implementation :blocking:
+The spec status says "Implementation-ready with caveats" and "Phase 1 ready to
+build", but the body still has an unresolved enterprise add/edit VERIFY, the
+Goal still says "optional GPG-encrypted secret store", and the unified task title
+still names "GPG-stored secrets" even though the accepted design removed the
+store. Before implementation, make the top-level status, goal, scope, task
+mapping, and resolved decisions agree with the current design. This blocks
+readiness because a developer starting from the top of the file would still build
+or plan around abandoned GPG-store behavior.
+
+Disposition: accept — fixed the Goal ("secrets stay in NM's own store"), the
+=[#B]= task-mapping line (notes the "GPG-stored secrets" framing is superseded by
+decision 5), the enterprise VERIFY (now resolved → Status updated), and corrected
+the stale =pytest= mentions to =unittest= (the repo's actual harness). Top-of-file
+status/goal/scope/decisions now agree with the design.
+
+** DONE Resolve enterprise add/edit scope or make the caveat explicit :blocking:
+The spec still says "One open question for Craig: pull enterprise add/edit into
+v1?" and points to a VERIFY in =todo.org=. That is a real product-scope decision:
+if enterprise add/edit is in v1, panel forms, nmcli command sequences, tests,
+error messages, and docs change materially; if it is out, the UI must consistently
+show activate-only with "edit in nmtui/nmcli". Decide it in the spec before
+implementation, or downgrade the status to =Ready with caveats= with this exact
+accepted caveat. As written, the spec cannot be plain =Ready=.
+
+Disposition: accept — Craig decided (2026-06-29): enterprise add/edit is vNext,
+activate-only in v1. Settled in the Status line, the Scope/Out bullet, decision 9,
+and the VERIFY (now DONE in todo.org). The UI shows activate-only with "edit in
+nmtui/nmcli" consistently. Evidence: 24 saved profiles, 0 enterprise.
+
+** DONE Define the concrete test harness and coverage gate :blocking:
+The spec says TDD, fake binaries on PATH, and benchmark tests, but it does not
+define the actual harness contract: pytest vs unittest for the =net= package,
+where fake =nmcli=/=curl=/=speedtest-go=/=rfkill=/=resolvectl= live, how test
+fixtures encode command histories, how subprocess timeouts are simulated, how
+Waybar scripts are executed end-to-end, and how coverage is run. Add the exact
+Makefile targets (=test=, =test-unit= or package-local =pytest=), pytest config,
+coverage command (e.g. branch coverage over =net/= and =waybar-net= wrappers),
+minimum threshold, and the rule for reading the coverage report to add missing
+tests before declaring a phase done. This blocks readiness because "what is the
+test harness?" is still answerable only by analogy to older suites.
+
+Disposition: accept — added the "Harness + coverage gate" section. Corrected the
+premise: the repo is =unittest= (=make test= → =python3 -m unittest=, 33 suites),
+not pytest. Pinned the fake-binary stub convention (=tests/<name>/fake-*= on a
+temp PATH), the fixture command→output map, timeout simulation, the end-to-end
+=waybar-net= subprocess run, and coverage via a throwaway venv (coverage.py is
+absent system-wide) with a ≥90% branch target on the pure modules.
+
+** DONE Use coverage to find missing behavior, not just report a percentage :blocking:
+The spec does not say how coverage findings affect implementation. For this
+feature, line coverage alone can miss the important holes: doctor classification
+branches, cleanup-unverified paths, redaction paths, degraded hot-path fallbacks,
+timeout branches, and auth/upstream/VPN terminal states. Define coverage review
+criteria per phase: branch coverage for pure classifiers and parsers, named
+untested branches allowed only with comments or manual-check entries, and a
+required "coverage gap pass" after the first green test run that maps uncovered
+logic back to tests or consciously excluded live-only behavior. This blocks
+readiness because the current test plan is broad but does not force the suite to
+expose missing edge tests.
+
+Disposition: accept — added the "Coverage as a gap-finder, not a number (per
+phase)" subsection: branch coverage required for the doctor classifier (every
+outcome), cleanup-unverified, redaction, degraded-fallback, timeout, and the
+parsers; a mandatory coverage-gap pass after the first green run mapping each
+uncovered branch to a test or a named live-only exclusion; a phase isn't done
+until that pass is recorded.
+
+** DONE Convert error classes into exact user-facing strings and evidence fields :blocking:
+The failure table and doctor outcomes classify errors well, but many messages
+are still templates or descriptions rather than final text. Add exact strings
+for indicator tooltip, notification, CLI stderr, JSON =error.message=, and panel
+banner/step text for every failure-mode row, including cases doctor cannot fix:
+wrong password, missing secret, enterprise cert failure, upstream/AP/provider
+failure, VPN-routed failure, hard rfkill block, DNS cleanup failure, speedtest
+missing, and HTTP interception without parseable URL. For each string, specify
+the redacted evidence included and the next action. This blocks UX readiness
+because "useful error" is only testable once the actual text and evidence are
+defined.
+
+Disposition: accept — rewrote the Failure states section: each row now carries the
+exact final string (with =<placeholder>= evidence), the evidence field, and the
+next action, plus a per-surface rendering rule (indicator tooltip / notify /
+CLI+JSON error.message+detail+code / panel banner all render the one canonical
+string). Added the missing doctor-unfixable rows: hard rfkill, wrong password /
+missing secret, enterprise cert failure, upstream/AP/provider, VPN-routed, HTTP
+interception without a parseable URL, and DNS cleanup-unverified.
+
+** DONE Add an enhancement disposition table
+The spec captures several good enhancements (doctor, Makefile recovery, rfkill,
+airplane absorption, VPN phase), but it does not show that low-cost adjacent
+enhancements were considered and accepted/deferred/rejected. Add a small radar
+table for likely affordances: copy redacted doctor report, open/copy portal URL,
+retry with hardware MAC, forget network, rescan now, pin speedtest server, show
+last good network/result, watch mode for =net doctor=, desktop notification
+actions, QR-code/share WiFi import/export, and keyboard picker. Mark each
+=v1=, =vNext=, or =rejected= with a one-line reason. This is non-blocking, but it
+prevents accidental loss of cheap UX wins and keeps the v1 panel focused.
+
+Disposition: accept — added the "Enhancement radar" table dispositioning all the
+named affordances: open/copy portal URL, forget network, rescan, hardware-MAC
+retry, pin speedtest server, copy redacted doctor report = v1; last-good
+network/result, doctor watch mode, actionable notifications, keyboard picker =
+vNext; QR-share = rejected (low value for a 2-machine personal setup).
+
+** DONE Tighten the panel UX flow before Phase 2
+The panel has sections and state machines, but not a concrete interaction flow:
+default focused section, row content, primary/secondary buttons, disabled-state
+rules, confirmation wording for reset/bounce/DNS override, how "Get me online"
+reports each escalation, what stays visible after the panel closes, and keyboard
+navigation. Add a short UX flow spec or wire-level outline before Phase 2. This
+is non-blocking for Phase 1, but it blocks Phase 2 implementation because a GTK
+panel can easily become noisy or surprising if these defaults are invented while
+coding.
+
+Disposition: accept — added the "Panel UX flow (settle before Phase 2)"
+subsection: default focus (Connections, or Diagnose when opened from a captive
+state), row content, one primary button per section, disabled-state rules, exact
+confirmation wording for reset/bounce/DNS-override/remove, the live "Get me
+online" escalation reporting, what survives panel close, and keyboard nav.
+
+* Review and iteration history
+
+** 2026-06-29 Mon @ 17:00:39 -0400 — Codex — reviewer
+
+- *What changed or was recommended:* Rubric: =Not ready=. Applied the
+  spec-review workflow and added blocking findings for diagnostics structure,
+  user-facing errors, observability, nmcli contracts, cache concurrency,
+  hot-path performance, Waybar click actions, mutation rollback, credential
+  security, unsupported WiFi types, panel states, speed-test dependency
+  verification, install boundaries, and test coverage.
+- *Why:* The spec has the right broad shape, but this feature combines a status
+  indicator, NetworkManager mutation, sudo-backed remediation, captive-portal
+  probing, secret storage, and GTK UI. Too many failure modes were still left for
+  implementers to invent during coding.
+- *Artifacts:* Findings recorded in =Review findings=. Local code read included
+  =~/.dotfiles/hyprland/.local/bin/waybar-netspeed=,
+  =~/.dotfiles/common/.local/bin/captive=,
+  =~/.dotfiles/tests/waybar-netspeed/test_waybar_netspeed.py=,
+  =~/.dotfiles/tests/captive/test_captive.py=, pocketbook's GTK layer-shell
+  scaffold, current Waybar config, and the linked =todo.org= tasks. External
+  checks included the Arch nmcli man page and upstream LibreSpeed CLI page.
+
+** 2026-06-29 Mon @ 17:35:00 -0400 — Claude Code (archsetup) — responder
+
+- *What changed:* Processed all 18 findings — 13 accepted, 5 modified, 0
+  rejected. Accepts added new sections: Repository + dependencies, Performance
+  budgets, nmcli contract, JSON schemas, Diagnostics contract (+ Diagnose vs
+  Repair), Failure states/messages/recovery, Observability, cache
+  concurrency/atomicity/staleness, mutation safety + rollback, the rewritten
+  credential security model, the expanded test plan, JSON schemas, the renamed
+  Implementation phases with per-phase acceptance criteria, and a Status line.
+  Modifies: deferred the =net debug-bundle= command (log + =diagnose --json=
+  cover it), made click actions phase-aware + non-blocking, scoped enterprise
+  WiFi to activate-only (add/edit vNext), trimmed the panel state set to states
+  that occur on the two-machine fleet, and switched the speed-test backend to the
+  already-installed =speedtest-go= (verified live) instead of a new
+  librespeed-cli dependency.
+- *Why:* Codex's review was strong and mostly right; the modifies are scope calls
+  for a single-user, two-machine tool (avoid gold-plating, use installed deps,
+  don't model impossible states) and one phasing reconciliation (the panel is
+  Phase 2, so Phase-1 clicks can't open it).
+- *Artifacts:* All 18 findings completed =DONE= in place with dispositions. Live
+  check: =pacman -Qq | grep speed= (speedtest-go-bin installed; =speedtest-go
+  --json= confirmed). Status now "review incorporated (Codex, 2026-06-29)",
+  implementation-ready with the named Phase-2/3 caveats; Phase 1 ready to build.
+
+** 2026-06-29 Mon @ 17:37:58 -0400 — Claude Code (archsetup) — responder (cj comments)
+
+- *What changed:* Folded in Craig's 10 cj comments on the spec. Design changes:
+  (1) dropped the separate credential store entirely — secrets stay in NM's own
+  =system-connections= (root =0600=, inline), touched via nmcli; no GPG, no
+  gpg-agent (rewrote the secrets section, decision 5, dropped Phase 4 + the gpg
+  dep). (2) Added =net doctor [--fix]= + Makefile console-recovery targets
+  (=make online= etc.) as a first-class TTY path; reversed the earlier
+  defer-the-doctor call (decision 10). (3) Added a full-stack =bounce= repair and
+  an =rfkill= unblock repair + indicator state — the rfkill one recovers the
+  framework-laptop post-power-loss soft-block Craig hit. (4) =custom/net= absorbs
+  the airplane module; the standalone airplane scripts/tests/module are deleted on
+  ship. (5) Moved VPN/WireGuard from "out" to a planned Phase 5. (6) Added a
+  "Help + documentation" section (CLI help / panel help / user guide). Answered
+  the enterprise-defer rationale and the captive-auto-login explanation inline.
+- *Why:* Craig's comments simplify (NM-only secrets, fewer deps) and harden the
+  recovery story (doctor + make targets + rfkill/bounce reachable from a dead-GUI
+  console — his stated need). Enterprise add/edit in v1 is the one open call,
+  raised as a VERIFY.
+- *Artifacts:* All 10 cj blocks removed. Live checks: airplane scripts/tests
+  present (confirmed deletion targets); =/etc/NetworkManager/system-connections/=
+  is root =0600= with inline secrets (confirms the NM-only secret model);
+  =rfkill= present. VERIFY filed in =todo.org= for the enterprise-scope decision.
+
+** 2026-06-29 Mon @ 17:49:24 -0400 — Codex — reviewer
+
+- *What changed or was recommended:* Added a second feedback batch from the
+  common network failure-mode review: require a diagnose/doctor coverage table,
+  pin DNS repair semantics, make auth failures terminal user-action states,
+  define upstream/AP/provider terminal states, decide how active VPN/policy
+  routing affects v1 diagnosis, and remove stale GPG-store references.
+- *Why:* The design now has a solid recovery shape, but the field question is
+  sharper: for each common failure, does =diagnose= catch it, does =doctor --fix=
+  fix it, or does the user need to switch networks/provide credentials/contact
+  the venue? That mapping needs to live in the spec, not in chat.
+- *Artifacts:* Findings appended under =Review findings= after reviewing the
+  current spec's Diagnostics, Failure states, Doctor, Testing plan, and
+  responder history.
+
+** 2026-06-29 Mon @ 18:06:22 -0400 — Codex — reviewer
+
+- *What changed or was recommended:* Added a readiness pass focused on
+  development-time questions, missed enhancements, UX cleanliness, error wording,
+  and test/coverage mechanics. New findings require reconciling stale status/goal
+  text, resolving or explicitly caveating enterprise add/edit, defining the
+  concrete harness and coverage gate, making coverage actionable, turning failure
+  classes into exact user-facing strings, adding an enhancement disposition
+  table, and tightening the Phase-2 panel UX flow.
+- *Why:* The network failure model is now strong, but implementation would still
+  force developers to invent test infrastructure, coverage expectations, final
+  error copy, and several UX details. The spec should decide those up front or
+  explicitly mark them as caveats.
+- *Artifacts:* Findings appended under =Review findings= after reading the
+  updated spec, current archsetup Makefile test targets, pocketbook's pyproject,
+  and the linked =todo.org= enterprise VERIFY.
+
+** 2026-06-29 Mon @ 17:55:08 -0400 — Claude Code (archsetup) — responder (2nd review)
+
+- *What changed:* Processed all 6 second-batch findings — 6 accepted, 0 modified,
+  0 rejected. Added a "Doctor: escalation, classification, terminal states"
+  section with four outcomes (=fixable= escalating lightest-first; =needs-user-
+  action=, =upstream-not-local=, =deferred/vpn= terminal) and explicit per-class
+  DNS handling. Added a "Failure-mode coverage" table (22 rows: detect / doctor-fix
+  / terminal action, org-table-standard conformant, ≤120). Cleaned the last
+  GPG-store residue (Testing plan → NM-secret/no-leak test; dropped the no-GPG-key
+  panel state). Added a doctor-classification test class.
+- *Why:* The findings sharpen the doctor from "escalate repairs" to "classify,
+  then stop at the right terminal state" — the safety property that keeps
+  =doctor --fix= from looping destructive repairs against a wrong password or an
+  upstream outage. The VPN finding got the detect-and-classify option (not repair;
+  that's Phase 5) so v1 doesn't misread a VPN-routed failure as WiFi down. The
+  coverage table is the single artifact Craig asked for to trust the field
+  coverage.
+- *Artifacts:* All 6 findings =DONE= in place; cookie =[24/24]=. Table conformed
+  via =wrap-org-table.el= (max line 120). No new findings introduced; no scope
+  expansion past what the findings asked. Spec remains implementation-ready, the
+  enterprise add/edit VERIFY still the one open decision.
+
+** 2026-06-29 Mon @ 18:24:51 -0400 — Claude Code (archsetup) — responder (3rd review)
+
+- *What changed:* Processed the third review (7 findings). The enterprise-scope
+  finding closed first on Craig's call (vNext, activate-only). The other 6, all
+  accepted: (1) reconciled the top-of-file text — fixed the Goal's GPG-store
+  wording, the =[#B]= task-mapping line, the resolved enterprise VERIFY, and
+  corrected the stale =pytest= mentions to =unittest= (the repo's real harness).
+  (2) Added a "Harness + coverage gate" section (unittest, fake-binary stubs on a
+  temp PATH, venv coverage, ≥90% branch on pure modules). (3) Added a per-phase
+  "coverage as a gap-finder" pass. (4) Rewrote the Failure states section to exact
+  final strings + evidence fields + a per-surface rendering rule, and added the
+  missing doctor-unfixable rows. (5) Added the "Enhancement radar" table
+  (v1/vNext/rejected). (6) Added the "Panel UX flow" subsection.
+- *Why:* The findings close the gap between "design decided" and "a developer can
+  start": the harness/coverage contract, the exact UX strings, and the panel flow
+  are the things otherwise invented mid-code. The =pytest=→=unittest= correction
+  was a real defect — the spec contradicted the repo's actual test convention.
+- *Artifacts:* All 31 findings =DONE=; cookie =[31/31]=. Both new tables conformed
+  via =wrap-org-table.el= (coverage 120, radar 110). Harness verified against the
+  live repo (33 unittest suites, =make test=, coverage.py absent → venv). Status
+  raised to "Ready for Phase 1; Ready-with-caveats overall" — no open decisions
+  remain.
diff --git a/docs/design/2026-06-29-waybar-timer-module-spec.org b/docs/design/2026-06-29-waybar-timer-module-spec.org
new file mode 100644
index 0000000..4b0ed0e
--- /dev/null
+++ b/docs/design/2026-06-29-waybar-timer-module-spec.org
@@ -0,0 +1,217 @@
+#+TITLE: Waybar Timer Module (wtimer) — Design Spec
+#+AUTHOR: Craig Jennings & Claude
+#+DATE: 2026-06-29
+
+* Goal
+
+One always-visible waybar module that keeps time four ways — countdown timer,
+wall-clock alarm, count-up stopwatch, and pomodoro — with several items running
+at once. The bar shows the most urgent item with a per-type glyph; the tooltip
+lists them all. Backed by a single =wtimer= script over a small JSON state file.
+notify fires on completion. fuzzel drives creation. No GTK app.
+
+Source task: archsetup =todo.org= "Waybar timer module" (=:waybar:=), including
+the folded roam-capture scope expansion (mode-selectable single panel,
+stopwatch, multiple simultaneous, per-mode hover text).
+
+* Scope
+
+** In
+- *Timer* — count down a duration, notify on elapse, then remove.
+- *Alarm* — fire at a wall-clock time, notify, then remove.
+- *Stopwatch* — count up from start; pause/resume; manual stop.
+- *Pomodoro* — work/break cycles (25/5, long break 15 after 4 works), auto-advance with a notify at each phase change, runs until cancelled.
+- *Multiple simultaneous* — N items of any mix held in state. Bar shows one primary item plus a =+N= badge; tooltip lists every item with its remaining/elapsed and label.
+- *Pause / resume* per item; *cancel* one or all.
+- *Interactions* — click to create (fuzzel), middle-click pause/resume primary, right-click cancel (fuzzel pick), scroll to cycle which item is primary.
+- *Per-type glyph + CSS state classes* (running / paused / urgent / break).
+- *Persistence across waybar restarts* (state file in the runtime dir).
+
+** Out (v1, note for later)
+- No GTK panel — waybar module + tooltip + fuzzel only.
+- No persistence across *reboot* (runtime-dir state clears). Alarms set before a reboot won't survive. Acceptable v1; revisit with =~/.local/state= + a catch-up-on-boot pass if wanted.
+- No sound selection per item (uses notify's type sound).
+- No history/stats of completed pomodoros beyond the current run's cycle count.
+
+* Architecture
+
+- =wtimer= — a single executable Python script in =hyprland/.local/bin/=. Chosen over POSIX sh (the other waybar backings) deliberately: the multi-item state machine, time arithmetic, pomodoro FSM, and JSON I/O are cleaner in Python, and it gives real line/branch *coverage numbers* (Craig asked for them). Precedent: pocketbook is Python in this repo.
+- *Pure core + thin IO shell.* All logic is pure functions taking =now= as a parameter (dependency-injected clock — satisfies testing.md: no recursion, no scope-shadowing, production reads =time.time()=, tests pass an explicit instant). The CLI layer does the IO: read state, call pure fns, write state, emit JSON, shell out to notify/fuzzel.
+- *State file*: =$XDG_RUNTIME_DIR/waybar/wtimer.json= (env override =WTIMER_STATE= for tests). Same runtime-dir convention as =sysmon-metric=.
+- *Heartbeat*: waybar calls =wtimer render= every 1s. =render= runs the tick logic first (detect elapsed items, fire notify, advance pomodoro, drop finished timers/alarms), then prints the waybar JSON. One entry point waybar polls; no separate daemon.
+- *Concurrency (BLOCKER from review).* The 1s =render= and the click/scroll handlers (=add=, =toggle=, =cancel=, =cycle=) are separate processes doing read-modify-write on the same state file. Without serialization, last-writer-wins drops a click's =add=, or clobbers render's "item removed/advanced" write so the same item ticks and notifies again next second. So every read-modify-write takes an exclusive =flock= on the state file for the whole cycle, and writes go through a temp file + =os.replace= (atomic), so a concurrent render never reads a half-written file. This is what actually makes "notify fires once" true — the mutation is only authoritative under the lock.
+- *State dir*: =render= and the mutating commands =mkdir -p= the state dir first (=$XDG_RUNTIME_DIR/waybar/= may not exist on a fresh boot).
+- *Clock injection everywhere*: =now= comes from =WTIMER_NOW= (epoch) if set, else =time.time()=. Pure fns take =now= as a parameter; the CLI seeds it from the env. This lets the CLI integration tests hit boundary instants (exactly-at-target), not just the pure-fn tests.
+- *Instant refresh*: after any mutating command, send waybar =SIGRTMIN+14= (the module's signal) so the bar updates immediately instead of lagging up to 1s. Faked in tests (=WTIMER_REFRESH= override, default =pkill -RTMIN+14 waybar=).
+
+* State model
+
+#+begin_src json
+{
+  "items": [
+    {"id": "1", "type": "timer",     "label": "tea",  "target": 1751240400, "duration": 300, "paused_left": null},
+    {"id": "2", "type": "alarm",     "label": "",     "target": 1751251200, "paused_left": null},
+    {"id": "3", "type": "stopwatch", "label": "",     "start": 1751240000,  "paused_elapsed": null},
+    {"id": "4", "type": "pomodoro",  "label": "",     "target": 1751241900, "phase": "work",
+     "cycle": 1, "work": 1500, "short": 300, "long": 900, "interval": 4, "paused_left": null}
+  ],
+  "primary": "1",
+  "seq": 4
+}
+#+end_src
+
+- =seq= is the monotonic id source (string ids).
+- *Paused* timer/pomodoro: =paused_left= holds seconds remaining; =target= ignored while paused; resume sets =target = now + paused_left=, =paused_left = null=.
+- *Paused* stopwatch: =paused_elapsed= holds elapsed seconds; resume sets =start = now - paused_elapsed=.
+- =primary= is the id the bar shows; =null= or stale → auto-select (below).
+
+* Display logic
+
+** Primary selection (bar text)
+1. If =primary= names a live item, show it.
+2. Else the running countdown (timer/alarm/pomodoro) with the smallest remaining.
+3. Else the first running stopwatch.
+4. Else idle (no items).
+
+** Bar text
+- =<glyph> <time>= for the primary, plus = +N= when N other items exist.
+- Idle: a dim timer glyph alone (or empty — decide at render; lean dim glyph so the module has a stable click target).
+- =time= formatting: =M:SS= under 1h, =H:MM:SS= at/over 1h. Stopwatch counts up; timer/alarm/pomodoro count down to target.
+- Paused item: prefix a pause glyph or rely on the =paused= class (CSS dims it).
+
+** Glyphs (nerd font; final codepoints verified live before merge)
+- timer 󰔛, alarm 󰀠, stopwatch 󱎫, pomodoro-work , pomodoro-break  (coffee), paused 󰏤, idle 󰔛 (dim).
+- One glyph table at the top of the script so a live-render tweak is one edit.
+
+** Tooltip (all items)
+One line per item: =<glyph> <label-or-type>  <remaining/elapsed>  (<state>)=. Pomodoro line shows phase + cycle (e.g. =work 2/4=). Header line summarizes count. Empty state: "No timers".
+
+** CSS classes (the =alt=/=class= field)
+=timer= / =alarm= / =stopwatch= / =pomodoro-work= / =pomodoro-break=, plus =paused= and =urgent= (remaining < 60s). Drives color in style.css + both themes.
+
+* Commands (CLI)
+
+| Command                         | Effect                                                              |
+|---------------------------------+---------------------------------------------------------------------|
+| =wtimer render=                 | tick + emit waybar JSON (the heartbeat)                             |
+| =wtimer add timer <dur> [label]=| add a countdown (=dur= like =25m=, =90s=, =1h30m=, =5= → minutes)   |
+| =wtimer add alarm <HH:MM> [lbl]=| add a wall-clock alarm (next occurrence of that time)               |
+| =wtimer add stopwatch [label]=  | start a count-up                                                    |
+| =wtimer add pomodoro [label]=   | start a pomodoro at work phase                                      |
+| =wtimer new=                    | fuzzel: pick type, prompt value, dispatch to =add= (thin wrapper)   |
+| =wtimer toggle [id]=            | pause/resume the item (default: primary)                            |
+| =wtimer cancel <id>=            | remove one item                                                     |
+| =wtimer pick-cancel=            | fuzzel: choose an item to cancel (right-click handler)              |
+| =wtimer cancel-all=             | clear all                                                           |
+| =wtimer cycle [next|prev]=      | move the primary pointer across all items (incl. paused), state-list order, wrapping |
+
+Duration parse: =Nh=, =Nm=, =Ns= combos, or a bare integer = minutes. Reject
+unparseable input (exit non-zero, notify nothing). Alarm parse: =HH:MM= 24h; if
+that time today already passed, target tomorrow.
+
+* Notifications
+
+- Timer elapse: =notify alarm "Timer" "<label or duration> done" --persist=.
+- Alarm fire: =notify alarm "Alarm" "<HH:MM><, label>" --persist=.
+- Pomodoro phase change: =notify info "Pomodoro" "Work → short break (3/4)"= (no =--persist=; phase nudges shouldn't pile up), long-break and work-resume worded accordingly.
+- notify is faked on PATH in tests; assert type + that it fired once per event.
+
+* Pomodoro semantics
+
+- Defaults: work 25m, short 5m, long 15m, interval 4 (long break after every 4th work).
+- FSM: work → short → work → short → work → short → work → long → work …
+- =cycle= counts completed works in the current set (1..interval); resets after a long break.
+- Each phase elapse advances =phase=, recomputes =target=, fires the phase notify. Pomodoro never auto-removes; cancel ends it.
+
+* Waybar wiring
+
+** Module def (config) — signal 14 (next free; 8–13 used)
+#+begin_src json
+"custom/timer": {
+    "exec": "wtimer render",
+    "return-type": "json",
+    "interval": 1,
+    "signal": 14,
+    "on-click": "wtimer new",
+    "on-click-middle": "wtimer toggle",
+    "on-click-right": "wtimer pick-cancel",
+    "on-scroll-up": "wtimer cycle next",
+    "on-scroll-down": "wtimer cycle prev"
+}
+#+end_src
+
+** Position — right of the sysmon (battery/resource) module
+Insert =custom/timer= into =modules-right= immediately after =custom/sysmon=
+(between =custom/sysmon= and =custom/netspeed=). On screen that places it just
+right of the battery/resource readout.
+
+** Not collapsible — survives the right-side collapse
+The module *definition* lives in the canonical config object, and =waybar-collapse=
+only swaps the =modules-right= *array* in the runtime copy (which it seeds from
+canonical, so the def is always present). So making the timer non-collapsible is
+purely an array-membership change: add =custom/timer= to the =waybar-collapse=
+right *base set* so it stays listed when the right side collapses:
+- laptop: =["custom/arrow-right","custom/sysmon","custom/timer","tray","custom/date","custom/worldclock"]=
+- desktop: =["custom/arrow-right","custom/timer","tray","custom/date","custom/worldclock"]=
+Update the =tests/waybar-collapse= base-set expectations to match (TDD the change).
+
+* CSS
+
+Add =#custom-timer= plus the state classes to all three stylesheets. Keep the
+*selectors and structure* parallel across the three (what the theme-drift test
+checks); the actual color *values* are per-theme (dupre vs hudson) and differ by
+design, so this is structural parity, not byte-identity. Confirm against the real
+CSS files what the drift test compares before editing.
+- =hyprland/.config/waybar/style.css=
+- =hyprland/.config/themes/dupre/...= waybar css
+- =hyprland/.config/themes/hudson/...= waybar css
+Colors: normal = foreground; =urgent= = a warning hue (reuse the sysmon
+warn/crit palette); =paused= = dimmed; =pomodoro-break= = a calmer accent.
+
+* Testing plan (TDD)
+
+- Suite: =tests/wtimer/test_wtimer.py= (auto-discovered by =make test='s =tests/*/test_*.py= glob — no enumeration gap).
+- *Pure-function tests* (fast, the bulk), explicit injected =now=:
+  - =parse_duration=: =25m=, =90s=, =1h30m=, =5= (→min), =0=, negative, garbage, empty (Normal/Boundary/Error).
+  - =parse_alarm=: future today, already-passed-today → tomorrow, =00:00=, =23:59=, =24:00=/=12:60= invalid, non-=HH:MM=.
+  - =format_time=: 0, 59s, 60s, 3599s, 3600s, multi-hour, negative clamps to 0.
+  - =add_item= for each type; =seq= increments; ids unique.
+  - =tick=: timer not-yet-elapsed (no change), exactly-at-target, past-target (fires once, removed); alarm same; pomodoro work→short→…→long→work advance + cycle counting + the 4th-work→long boundary; paused items never tick; multiple items in one tick.
+  - =select_primary=: explicit primary, stale primary falls back, soonest-remaining rule, stopwatch-only, empty.
+  - =render_payload=: text/tooltip/class for each type + paused + urgent + =+N= badge + idle.
+  - =toggle= pause then resume round-trips remaining/elapsed exactly; =cycle= wraps; =cancel= / =cancel-all=.
+- *CLI integration tests* (subprocess, fakes on PATH, =WTIMER_NOW= to hit boundaries): =add= then =render= round-trip; =render= fires the faked =notify= once on an elapsed item and drops it; state file created if absent; *missing parent dir* created (fresh-boot case); corrupt/empty state file → treated as empty, no crash; mutating command sends the faked refresh signal.
+- *Concurrency test*: spawn overlapping =render= + a mutating command against one state file; assert no lost update (the added item survives) and exactly-once notify (no double-fire from a clobbered tick). This is the regression guard for the flock/atomic-write fix.
+- *Mocking boundary*: fake =notify=, =fuzzel=, =killall= on PATH (record calls); never mock the wtimer logic. Clock injected as a parameter.
+- *Coverage*: measure with =coverage.py= if present (target 90%+ on the logic per testing.md business-logic bar); report the actual number. If =coverage= is absent, report per-command/per-branch case coverage explicitly and flag the tool gap (verification.md).
+- =tests/waybar-collapse= base-set expectations updated for the new module.
+- =tests/= theme-drift check stays green (CSS parity).
+
+* Files touched
+
+dotfiles branch =waybar-timer-module=:
+- =hyprland/.local/bin/wtimer= (new, executable)
+- =tests/wtimer/test_wtimer.py= (new)
+- =hyprland/.config/waybar/config= (module def + modules-right position)
+- =hyprland/.local/bin/waybar-collapse= (base-set) + =tests/waybar-collapse/...= (expectations)
+- =hyprland/.config/waybar/style.css= + dupre + hudson waybar css (CSS)
+
+archsetup (main, at the end):
+- this spec
+- =todo.org= task closure
+
+* Resolved decisions (no approvals — my calls)
+
+- Python, not sh — testability + coverage; pocketbook precedent.
+- One =render= heartbeat (no daemon) — simplest, waybar already polls.
+- notify fires from =render='s tick, mutation guarantees once-only.
+- Primary = user-cycled, else soonest-remaining; =+N= badge for the rest.
+- Multiple simultaneous via tooltip list + badge (not a GTK panel) — keeps it "cool yet simple".
+- Pomodoro is one self-advancing item, not four chained timers.
+- Runtime-dir state (waybar-restart durable, not reboot durable) — v1.
+
+* Rollback
+
+All code on the dotfiles =waybar-timer-module= branch off =09815f3=. Squash-merge
+at the end; =git switch main && git branch -D waybar-timer-module= reverts cleanly
+if it goes sideways.