1 files changed, 238 insertions, 0 deletions
diff --git a/docs/design/2026-06-25-testinfra-validation.org b/docs/design/2026-06-25-testinfra-validation.org
new file mode 100644
index 0000000..5c82aa2
--- /dev/null
+++ b/docs/design/2026-06-25-testinfra-validation.org
@@ -0,0 +1,238 @@
+#+TITLE: Design: Testinfra Post-Install Validation for archsetup
+#+AUTHOR: Craig Jennings
+#+DATE: 2026-06-25
+#+STATUS: Accepted (2026-06-25)
+
+* Problem
+
+The VM integration harness (=scripts/testing/run-test.sh=) runs archsetup in a
+QEMU VM, then verifies the result two ways:
+
+1. Parses archsetup's own install log for its Error Summary and the
+   =ARCHSETUP_EXECUTION_COMPLETE= marker (did the script finish, did it log
+   errors).
+2. Runs =run_all_validations= from =scripts/testing/lib/validation.sh= — a
+   hand-rolled, shell-based post-install assertion sweep of ~26 checks over SSH.
+
+The shell sweep works, but each check is 6-40 lines of =ssh_cmd= +
+=validation_pass/fail= + =attribute_issue= boilerplate, the pass/fail counters
+are hand-maintained globals, and the reporting is bespoke. Adding or reading a
+check is heavier than it should be, and growing the suite (archsetup configures
+far more than the 26 checks cover) compounds that weight.
+
+This doc proposes porting the post-install validation to Testinfra (Python +
+pytest) for more expressive checks and better reporting, then growing coverage.
+
+* Decision
+
+Port the post-install validation layer to Testinfra + pytest, reaching parity
+with the existing =validation.sh= sweep, then expand coverage. Recorded
+rationale: the up-front port cost (parity rewrite + a test-only dependency) is
+an accepted trade — the priority is a robust, well-reported, growing validation
+suite over feature speed. The framework swap alone buys ergonomics and
+reporting, not coverage, so it is paired with real new coverage (below).
+
+This replaces the shell sweep; it does not touch archsetup's own install-log
+parsing (that stays as a separate signal). The full coverage expansion (P4)
+lands in this task too, sequenced strictly after the parity cutover so the
+parity verification stays clean.
+
+* Current harness (what exists today)
+
+** Flow (run-test.sh)
+1. Revert VM to base snapshot, boot, wait for SSH.
+2. =capture_pre_install_state=.
+3. Bundle + copy archsetup + dotfiles into the VM, run archsetup in background,
+   poll to completion.
+4. =capture_post_install_state=.
+5. =run_all_validations= (the shell sweep).
+6. =analyze_log_diff= + =generate_issue_report= (issue attribution).
+7. Explicit pass/fail exit code; cleanup.
+
+** The shell sweep (validation.sh)
+~26 checks under =run_all_validations=: user created / shell / groups, dotfiles,
+yay, pacman working, window manager, firewall, DNS, avahi, fail2ban,
+NetworkManager, emacs, git config, dev tools, zfs, boot config, autologin,
+gnome-keyring, terminus font, mkinitcpio hooks, initramfs consolefont, nvme
+module, archsetup log, state markers.
+
+** Issue attribution
+=attribute_issue <msg> <bucket>= sorts each failure into one of three arrays —
+=ARCHSETUP_ISSUES=, =BASE_INSTALL_ISSUES=, =UNKNOWN_ISSUES= — and
+=generate_issue_report= writes them out (base-install issues route to the
+archzfs inbox). This is domain logic Testinfra has no equivalent for; the port
+must preserve it.
+
+** Connection
+=ssh_cmd= uses =sshpass -p "$ROOT_PASSWORD" ssh ... -p "$SSH_PORT" root@$VM_IP=,
+with =VM_IP=localhost=, =SSH_PORT=2222=, =ROOT_PASSWORD=archsetup=.
+
+* Design
+
+** Where Testinfra fits
+Replace the =run_all_validations= call (step 5) with a pytest invocation against
+the running VM. Steps 1-4 and 6-7 are unchanged; =analyze_log_diff= stays.
+Testinfra connects over the same SSH the harness already exposes.
+
+** Connection model
+Testinfra's =ssh= backend (OpenSSH client) targets the VM via its host spec:
+
+#+begin_src sh
+pytest scripts/testing/tests/ \
+  --hosts="ssh://root@localhost:2222" \
+  --ssh-config=<generated> \
+  --json-report --json-report-file="$TEST_RESULTS_DIR/testinfra.json"
+#+end_src
+
+Password auth: at validation time the harness only has the root password, and
+Testinfra's ssh backend takes a key (=--ssh-identity-file=) or an ssh-config
+(=--ssh-config=), not a password. Options: wrap the connection in sshpass, or
+inject a known test key pre-run and point a generated ssh-config at it. See
+Resolved decisions (item 1) for the choice.
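+
+For the key-based option, the generated ssh-config could be rendered roughly
+as follows. This is a minimal sketch: =render_ssh_config= and the key path are
+hypothetical names, and the defaults mirror the harness values above
+(=localhost=, port 2222, root). The option names are standard OpenSSH client
+settings.
+
+#+begin_src python
+def render_ssh_config(identity_file, host="localhost", port=2222, user="root"):
+    """Return ssh-config text for a throwaway test VM (hypothetical helper)."""
+    options = {
+        "HostName": host,
+        "Port": port,
+        "User": user,
+        "IdentityFile": identity_file,
+        "IdentitiesOnly": "yes",            # offer only the test key
+        "StrictHostKeyChecking": "no",      # the VM is reverted every run
+        "UserKnownHostsFile": "/dev/null",  # never persist its host key
+    }
+    lines = ["Host %s" % host]
+    lines += ["    %s %s" % (key, value) for key, value in options.items()]
+    return "\n".join(lines) + "\n"
+
+
+if __name__ == "__main__":
+    # Assumed key path; the harness would pass its own temp file.
+    print(render_ssh_config("/tmp/archsetup-test-key"), end="")
+#+end_src
+
+The harness would write this text to a temp file and pass that path to
+=--ssh-config=.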
+
+** Test layout
+#+begin_example
+scripts/testing/tests/
+  conftest.py            # host fixture, markers, attribution hook, report glue
+  test_users.py          # user created / shell / groups
+  test_dotfiles.py       # stow symlinks, readable by user
+  test_packages.py       # yay, pacman working, dev tools, key packages
+  test_services.py       # firewall, dns, avahi, fail2ban, networkmanager
+  test_boot.py           # zfs, mkinitcpio hooks, nvme, consolefont, terminus
+  test_desktop.py        # window manager, autologin, gnome-keyring
+  test_archsetup.py      # install log, state markers
+  test_hardening.py      # NEW: sshd drop-in, sysctl, /efi fstab perms, backups
+#+end_example
+
+** Example tests (parity)
+#+begin_src python
+def test_ufw_enabled(host):
+    assert host.service("ufw").is_enabled
+
+def test_user_cjennings_exists(host):
+    u = host.user("cjennings")
+    assert u.exists
+    assert u.shell == "/usr/bin/zsh"
+
+def test_zshrc_stowed_and_readable(host):
+    f = host.file("/home/cjennings/.zshrc")
+    assert f.is_symlink
+    assert ".dotfiles/" in f.linked_to
+    assert f.exists                       # not broken
+    # pass the path as an argument so Testinfra shell-quotes it
+    assert host.run("sudo -u cjennings test -r %s", f.path).rc == 0
+
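+def test_mkinitcpio_systemd_hook(host):
+    # non-ZFS systems delegate fsck from udev to systemd; check the active
+    # HOOKS= line only, since commented examples also mention systemd
+    conf = host.file("/etc/mkinitcpio.conf").content_string
+    hooks = [line for line in conf.splitlines() if line.startswith("HOOKS=")]
+    assert hooks and "systemd" in hooks[-1]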
+#+end_src
+
+Compare =test_ufw_enabled= (1 line) to the current =validate_firewall= (8 lines
+of ssh_cmd + branch + counters).
+
+** Preserving issue attribution
+Map the three buckets to pytest markers and collect them in a =conftest.py=
+hook:
+
+#+begin_src python
+import pytest
+
+@pytest.mark.attribution("archsetup")   # or "base_install" / "unknown"
+def test_ufw_enabled(host): ...
+#+end_src
+
+A =pytest_runtest_makereport= hook records each failure under its marker's
+bucket and writes the same three-way report =generate_issue_report= produces
+(base-install failures still route to the archzfs inbox). Unmarked tests
+default to the =archsetup= bucket.
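+
+One way the =conftest.py= side might look, as a sketch under stated
+assumptions: the =attribution.json= output name, the =bucket_for= helper, and
+mapping an unrecognized bucket name to =unknown= are all illustrative choices,
+not the existing report format. Routing to the archzfs inbox is left out.
+
+#+begin_src python
+import json
+import os
+
+import pytest
+
+BUCKETS = ("archsetup", "base_install", "unknown")
+failures = {bucket: [] for bucket in BUCKETS}
+
+
+def pytest_configure(config):
+    # Register the custom marker so pytest does not warn about it.
+    config.addinivalue_line(
+        "markers", "attribution(bucket): issue bucket for a failing check")
+
+
+def bucket_for(item):
+    """Return the bucket named by the test's marker, else "archsetup"."""
+    marker = item.get_closest_marker("attribution")
+    bucket = marker.args[0] if marker and marker.args else "archsetup"
+    return bucket if bucket in BUCKETS else "unknown"
+
+
+@pytest.hookimpl(hookwrapper=True)
+def pytest_runtest_makereport(item, call):
+    outcome = yield
+    report = outcome.get_result()
+    bucket = failures[bucket_for(item)]
+    # Setup, call, and teardown each produce a report; record a test once.
+    if report.failed and item.nodeid not in bucket:
+        bucket.append(item.nodeid)
+
+
+def pytest_sessionfinish(session, exitstatus):
+    # Assumed output name and format, for illustration only.
+    out_dir = os.environ.get("TEST_RESULTS_DIR", ".")
+    with open(os.path.join(out_dir, "attribution.json"), "w") as fh:
+        json.dump(failures, fh, indent=2)
+#+end_src
+
+Recording on any failed phase (not only =call=) means a broken SSH fixture
+also lands in a bucket instead of disappearing from the report.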
+
+** Tiered strategy
+Two markers: =@pytest.mark.smoke= (user, key packages, dotfiles present) and
+=@pytest.mark.integration= (services, configs, boot). Run =pytest -m smoke= as
+a fast gate and the full suite otherwise. Drop the task's original X11/startx
+end-to-end slice: the fleet is Wayland/Hyprland, and headless GUI e2e is flaky
+and expensive. A Wayland-session smoke check can be reconsidered later as its
+own task.
+
+** Reporting
+=pytest-json-report= (or junit-xml) → =$TEST_RESULTS_DIR/=, surfaced in the
+test report alongside the install-log analysis. pytest's own per-test
+pass/fail/skip output replaces the hand-maintained counters.
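+
+To surface failures in the harness's own test report, the JSON file can be
+read back after the run. A sketch, assuming =pytest-json-report='s layout (a
+top-level =tests= list whose entries carry =nodeid= and =outcome=);
+=failed_tests= is a hypothetical helper name.
+
+#+begin_src python
+import json
+
+
+def failed_tests(report_path):
+    """Return nodeids of tests whose outcome is "failed" or "error"."""
+    with open(report_path) as fh:
+        report = json.load(fh)
+    return [test["nodeid"] for test in report.get("tests", [])
+            if test.get("outcome") in ("failed", "error")]
+#+end_src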
+
+* Coverage
+
+** Parity (port all current checks)
+All ~26 =validation.sh= checks, grouped per the layout above.
+
+** Expansion (new — the coverage win)
+archsetup configures much that isn't validated today. Candidates:
+- sshd hardening drop-in (=/etc/ssh/sshd_config.d/10-hardening.conf=,
+  PermitRootLogin prohibit-password).
+- =backup_system_file= behavior — assert =.archsetup.bak= exists for files
+  archsetup edited in place (fstab, mkinitcpio.conf, sudoers, …).
+- pacman.conf (ParallelDownloads, Color, multilib) and makepkg.conf (MAKEFLAGS,
+  OPTIONS) settings actually applied.
+- systemd-resolved DNS-over-TLS drop-in; NetworkManager wifi-privacy.
+- fail2ban jail.local present; reflector config; sysctl printk; /etc/issue
+  emptied; vconsole font; fstab /efi fmask/dmask perms.
+- sanoid / zfs-replicate units (ZFS hosts).
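+
+Two of these candidates sketched as Testinfra checks. The backup path scheme
+(original path plus =.archsetup.bak=) and the full paths of the edited files
+are assumptions; confirm them against =backup_system_file= before relying on
+this.
+
+#+begin_src python
+def test_sshd_hardening_dropin(host):
+    conf = host.file("/etc/ssh/sshd_config.d/10-hardening.conf")
+    assert conf.exists
+    # Compare whole directives so a commented-out line does not match.
+    settings = [line.split() for line in conf.content_string.splitlines()]
+    assert ["PermitRootLogin", "prohibit-password"] in settings
+
+
+# Assumed paths for the files the list above names as edited in place.
+EDITED_IN_PLACE = ["/etc/fstab", "/etc/mkinitcpio.conf", "/etc/sudoers"]
+
+
+def test_backups_exist(host):
+    missing = [path for path in EDITED_IN_PLACE
+               if not host.file(path + ".archsetup.bak").exists]
+    assert not missing, "no backup for: %s" % ", ".join(missing)
+#+end_src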
+
+* Dependencies
+
+Add =python-pytest=, =python-pytest-testinfra=, and a JSON reporter to
+=make deps= (test host only; not installed by archsetup itself). The =ssh://=
+backend uses the system OpenSSH client, so paramiko is not required.
+Note: the existing unit suites run under =python3 -m unittest=; the integration
+layer runs under pytest. Two runners, both Python; =make test-unit= unchanged,
+=make test= gains the pytest step.
+
+* Goss comparison (requested by the task)
+
+- *Goss* — YAML-declarative health specs, a single Go binary executed *on the
+  target*. Fast, no Python. But the spec must be pushed into the VM and run
+  there, the assertions are less programmable, and it adds a Go binary to the
+  flow.
+- *Testinfra* — Python, runs *on the host* over SSH (nothing installed in the
+  VM), assertions are full Python with rich built-in modules
+  (File/Package/Service/User/Command), integrates with pytest's tooling.
+
+Choose Testinfra: it runs from the host (the VM stays clean), it's far more
+programmable for the conditional checks archsetup needs (DESKTOP_ENV branches,
+ZFS-vs-not), and it aligns with the repo's existing Python test tooling.
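+
+A sketch of what a ZFS-vs-not conditional check might look like. Detecting
+ZFS via =findmnt= on =/= and the =sanoid.timer= unit name are assumptions made
+for illustration; =root_is_zfs= is a hypothetical helper.
+
+#+begin_src python
+import pytest
+
+
+def root_is_zfs(host):
+    """True when / is mounted from ZFS (one possible detection method)."""
+    return host.run("findmnt -n -o FSTYPE /").stdout.strip() == "zfs"
+
+
+def test_sanoid_timer_enabled(host):
+    if not root_is_zfs(host):
+        pytest.skip("not a ZFS host")
+    # Assumed unit name; confirm against what archsetup enables.
+    assert host.service("sanoid.timer").is_enabled
+#+end_src
+
+Skipping, rather than passing, keeps non-ZFS runs honest: the report shows
+the check did not apply instead of counting it as verified.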
+
+* Migration plan (phased, TDD where the helper logic is ours)
+
+- *P1 — Scaffold.* conftest.py (host fixture + connection), the attribution
+  marker + report hook, and 3 parity checks (firewall, user, dotfiles). Wire a
+  pytest step into run-test.sh behind a flag so the shell sweep still runs.
+- *P2 — Full parity.* Port all ~26 checks; diff a real VM run's results against
+  the shell sweep to confirm no check was lost.
+- *P3 — Cut over.* Make pytest the primary sweep in run-test.sh; keep
+  =analyze_log_diff= and the install-log signal.
+- *P4 — Expand.* Add the new coverage (hardening, backups, applied settings).
+- *P5 — Retire.* Remove =run_all_validations= from validation.sh (keep the
+  capture/analyze helpers that pytest doesn't replace).
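+
+The P2 parity diff could be as small as the sketch below. It assumes both
+sweeps can be reduced to a name-to-passed mapping for the same VM run and
+that a hand-written map links each shell check to its pytest test.
+=parity_report=, =validate_dns=, and =test_dns= are hypothetical names.
+
+#+begin_src python
+def parity_report(shell_results, pytest_results, name_map):
+    """Compare one VM run's shell sweep against the pytest sweep.
+
+    shell_results and pytest_results map a check name to passed (bool);
+    name_map maps each shell check name to its pytest test name.
+    """
+    missing, mismatched = [], []
+    for check, passed in sorted(shell_results.items()):
+        test = name_map.get(check)
+        if test not in pytest_results:
+            missing.append(check)       # no pytest counterpart ran
+        elif pytest_results[test] != passed:
+            mismatched.append(check)    # the two sweeps disagree
+    return {"missing": missing, "mismatched": mismatched}
+
+
+if __name__ == "__main__":
+    shell = {"validate_firewall": True, "validate_dns": True}
+    tests = {"test_ufw_enabled": True}
+    names = {"validate_firewall": "test_ufw_enabled",
+             "validate_dns": "test_dns"}
+    print(parity_report(shell, tests, names))
+#+end_src
+
+An empty =missing= list and an empty =mismatched= list on a real run would be
+the evidence that P3 can proceed.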
+
+* Acceptance criteria
+
+- =make test= runs archsetup in a VM, then a pytest sweep over SSH, and a real
+  run reports parity with (or a superset of) the current shell checks.
+- Failures still sort into archsetup / base-install / unknown, with base-install
+  issues routed to the archzfs inbox as today.
+- =make deps= installs the test dependencies; the VM has nothing extra installed.
+- A documented =pytest -m smoke= fast path exists.
+
+* Resolved decisions (2026-06-25)
+
+1. *Auth at validation time — inject a throwaway test key.* Pre-run, generate
+   an ephemeral keypair, push the pubkey into the VM's
+   =/root/.ssh/authorized_keys= over the existing sshpass channel, and point
+   Testinfra at the private key via a generated ssh-config. No password in the
+   pytest invocation; the =ssh://= backend authenticates with the key; the
+   keypair is discarded after the run. (Chosen over wrapping sshpass around
+   Testinfra, which is awkward since Testinfra spawns its own ssh connections.)
+2. *Cut over — run both through parity, then switch.* Keep the shell sweep
+   running alongside pytest through P2 so a real VM run can diff pytest's
+   results against the shell sweep and prove no check was dropped. pytest
+   becomes primary at P3; =run_all_validations= is deleted at P5 after the
+   expanded suite proves out.
+3. *Expansion scope — full, in this task, after cutover.* All of P4 lands here,
+   sequenced strictly after the P3 parity cutover so the parity diff is clean
+   before new checks are added.