Diffstat (limited to 'docs/design')
| -rw-r--r-- | docs/design/2026-06-25-testinfra-validation.org | 238 |
1 files changed, 238 insertions, 0 deletions
diff --git a/docs/design/2026-06-25-testinfra-validation.org b/docs/design/2026-06-25-testinfra-validation.org
new file mode 100644
index 0000000..5c82aa2
--- /dev/null
+++ b/docs/design/2026-06-25-testinfra-validation.org
@@ -0,0 +1,238 @@
+#+TITLE: Design: Testinfra Post-Install Validation for archsetup
+#+AUTHOR: Craig Jennings
+#+DATE: 2026-06-25
+#+STATUS: Accepted (2026-06-25)
+
+* Problem
+
+The VM integration harness (=scripts/testing/run-test.sh=) runs archsetup in a
+QEMU VM, then verifies the result two ways:
+
+1. Parses archsetup's own install log for its Error Summary and the
+   =ARCHSETUP_EXECUTION_COMPLETE= marker (did the script finish, did it log
+   errors).
+2. Runs =run_all_validations= from =scripts/testing/lib/validation.sh= — a
+   hand-rolled, shell-based post-install assertion sweep of ~26 checks over SSH.
+
+The shell sweep works, but each check is 6-40 lines of =ssh_cmd= +
+=validation_pass/fail= + =attribute_issue= boilerplate, the pass/fail counters
+are hand-maintained globals, and the reporting is bespoke. Adding or reading a
+check is heavier than it should be, and growing the suite (archsetup configures
+far more than the 26 checks cover) compounds that weight.
+
+This doc proposes porting the post-install validation to Testinfra (Python +
+pytest) for more expressive checks and better reporting, then growing coverage.
+
+* Decision
+
+Port the post-install validation layer to Testinfra + pytest, reaching parity
+with the existing =validation.sh= sweep, then expand coverage. Recorded
+rationale: the up-front port cost (parity rewrite + a test-only dependency) is
+an accepted trade — the priority is a robust, well-reported, growing validation
+suite over feature speed. The framework swap alone buys ergonomics and
+reporting, not coverage, so it is paired with real new coverage (below).
+
+This replaces the shell sweep; it does not touch archsetup's own install-log
+parsing (that stays as a separate signal). The full coverage expansion (P4)
+lands in this task too, sequenced strictly after the parity cutover so the
+parity verification stays clean.
+
+* Current harness (what exists today)
+
+** Flow (run-test.sh)
+1. Revert VM to base snapshot, boot, wait for SSH.
+2. =capture_pre_install_state=.
+3. Bundle + copy archsetup + dotfiles into the VM, run archsetup in background,
+   poll to completion.
+4. =capture_post_install_state=.
+5. =run_all_validations= (the shell sweep).
+6. =analyze_log_diff= + =generate_issue_report= (issue attribution).
+7. Explicit pass/fail exit code; cleanup.
+
+** The shell sweep (validation.sh)
+~26 checks under =run_all_validations=: user created / shell / groups, dotfiles,
+yay, pacman working, window manager, firewall, DNS, avahi, fail2ban,
+NetworkManager, emacs, git config, dev tools, zfs, boot config, autologin,
+gnome-keyring, terminus font, mkinitcpio hooks, initramfs consolefont, nvme
+module, archsetup log, state markers.
+
+** Issue attribution
+=attribute_issue <msg> <bucket>= sorts each failure into one of three arrays —
+=ARCHSETUP_ISSUES=, =BASE_INSTALL_ISSUES=, =UNKNOWN_ISSUES= — and
+=generate_issue_report= writes them out (base-install issues route to the
+archzfs inbox). This is domain logic Testinfra has no equivalent for; the port
+must preserve it.
+
+** Connection
+=ssh_cmd= uses =sshpass -p "$ROOT_PASSWORD" ssh ... -p "$SSH_PORT" root@$VM_IP=,
+with =VM_IP=localhost=, =SSH_PORT=2222=, =ROOT_PASSWORD=archsetup=.
+
+* Design
+
+** Where Testinfra fits
+Replace the =run_all_validations= call (step 5) with a pytest invocation against
+the running VM. Steps 1-4 and 6-7 are unchanged; =analyze_log_diff= stays.
+Testinfra connects over the same SSH the harness already exposes.
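Editorial sketch (not part of the committed file): the three-bucket attribution described under "Issue attribution" is the piece the port has to carry across, so it may help to see it modeled outside shell. The names =IssueReport=, =attribute_issue= (as a method), and the bucket strings are hypothetical; only the three-way split and the archsetup default come from the design text. Filing an unrecognized bucket name under "unknown" is an assumption, not documented behavior of =validation.sh=.

```python
from dataclasses import dataclass, field

# Bucket names mirror the three shell arrays; the exact strings are assumed.
BUCKETS = ("archsetup", "base_install", "unknown")


@dataclass
class IssueReport:
    """Groups validation failures by the layer presumed responsible."""

    issues: dict = field(default_factory=lambda: {b: [] for b in BUCKETS})

    def attribute_issue(self, msg: str, bucket: str = "archsetup") -> None:
        # Assumption: an unrecognized bucket name is filed under "unknown".
        key = bucket if bucket in self.issues else "unknown"
        self.issues[key].append(msg)

    def summary(self) -> dict:
        # Count of recorded failures per bucket.
        return {b: len(msgs) for b, msgs in self.issues.items()}


report = IssueReport()
report.attribute_issue("ufw not enabled")                      # default bucket
report.attribute_issue("zfs module missing", "base_install")
report.attribute_issue("unexplained failure", "not-a-bucket")  # falls to unknown
```

Keeping the grouping in a plain object like this would let a pytest hook and any remaining shell glue feed the same report structure.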
+
+** Connection model
+Testinfra's paramiko/ssh backend targets the live VM via its host spec:
+
+#+begin_src sh
+pytest scripts/testing/tests/ \
+  --hosts="ssh://root@localhost:2222" \
+  --ssh-config=<generated> \
+  --json-report --json-report-file="$TEST_RESULTS_DIR/testinfra.json"
+#+end_src
+
+Password auth: generate a throwaway ssh-config (or reuse sshpass via a
+=--ssh-identity= once archsetup drops the key, but at validation time we only
+have the root password). Simplest: a tiny generated ssh config + sshpass
+wrapper, or switch the test VM to a known test key injected pre-run. Open
+question below.
+
+** Test layout
+#+begin_example
+scripts/testing/tests/
+  conftest.py          # host fixture, markers, attribution hook, report glue
+  test_users.py        # user created / shell / groups
+  test_dotfiles.py     # stow symlinks, readable by user
+  test_packages.py     # yay, pacman working, dev tools, key packages
+  test_services.py     # firewall, dns, avahi, fail2ban, networkmanager
+  test_boot.py         # zfs, mkinitcpio hooks, nvme, consolefont, terminus
+  test_desktop.py      # window manager, autologin, gnome-keyring
+  test_archsetup.py    # install log, state markers
+  test_hardening.py    # NEW: sshd drop-in, sysctl, /etc fstab perms, backups
+#+end_example
+
+** Example tests (parity)
+#+begin_src python
+def test_ufw_enabled(host):
+    assert host.service("ufw").is_enabled
+
+def test_user_cjennings_exists(host):
+    u = host.user("cjennings")
+    assert u.exists
+    assert u.shell == "/usr/bin/zsh"
+
+def test_zshrc_stowed_and_readable(host):
+    f = host.file("/home/cjennings/.zshrc")
+    assert f.is_symlink
+    assert ".dotfiles/" in f.linked_to
+    assert f.exists  # not broken
+    assert host.run("sudo -u cjennings test -r %s" % f.path).rc == 0
+
+def test_mkinitcpio_systemd_hook(host):
+    # non-ZFS systems delegate fsck from udev to systemd
+    conf = host.file("/etc/mkinitcpio.conf").content_string
+    assert "systemd" in conf
+#+end_src
+
+Compare =test_ufw_enabled= (1 line) to the current =validate_firewall= (8 lines
+of ssh_cmd + branch + counters).
+
+** Preserving issue attribution
+Map the three buckets to pytest markers and collect them in a =conftest.py=
+hook:
+
+#+begin_src python
+@pytest.mark.attribution("archsetup")  # or "base_install" / "unknown"
+def test_ufw_enabled(host): ...
+#+end_src
+
+A =pytest_runtest_makereport= hook records each failure under its marker's
+bucket and writes the same three-way report =generate_issue_report= produces
+(base-install failures still route to the archzfs inbox). Default bucket =
+archsetup when unmarked.
+
+** Tiered strategy
+Markers =@pytest.mark.smoke= (user, key packages, dotfiles present) and
+=@pytest.mark.integration= (services, configs, boot). =pytest -m smoke= for a
+fast gate, full run otherwise. Drop the task's original X11/startx end-to-end
+slice — the fleet is Wayland/Hyprland and headless GUI e2e is flaky and
+expensive; a Wayland-session smoke check can be reconsidered later as its own
+task.
+
+** Reporting
+=pytest-json-report= (or junit-xml) → =$TEST_RESULTS_DIR/=, surfaced in the
+test report alongside the install-log analysis. pytest's own per-test
+pass/fail/skip output replaces the hand-maintained counters.
+
+* Coverage
+
+** Parity (port all current checks)
+All ~26 =validation.sh= checks, grouped per the layout above.
+
+** Expansion (new — the coverage win)
+archsetup configures much that isn't validated today. Candidates:
+- sshd hardening drop-in (=/etc/ssh/sshd_config.d/10-hardening.conf=,
+  PermitRootLogin prohibit-password).
+- =backup_system_file= behavior — assert =.archsetup.bak= exists for files
+  archsetup edited in place (fstab, mkinitcpio.conf, sudoers, …).
+- pacman.conf (ParallelDownloads, Color, multilib) and makepkg.conf (MAKEFLAGS,
+  OPTIONS) settings actually applied.
+- systemd-resolved DNS-over-TLS drop-in; NetworkManager wifi-privacy.
+- fail2ban jail.local present; reflector config; sysctl printk; /etc/issue
+  emptied; vconsole font; fstab /efi fmask/dmask perms.
+- sanoid / zfs-replicate units (ZFS hosts).
+
+* Dependencies
+
+Add =python-pytest=, =python-pytest-testinfra= (pulls paramiko), and a JSON
+reporter to =make deps= (test host only — not installed by archsetup itself).
+Note: the existing unit suites run under =python3 -m unittest=; the integration
+layer runs under pytest. Two runners, both Python; =make test-unit= unchanged,
+=make test= gains the pytest step.
+
+* Goss comparison (the task asked)
+
+- *Goss* — YAML-declarative health specs, a single Go binary executed *on the
+  target*. Fast, no Python. But the spec must be pushed into the VM and run
+  there, the assertions are less programmable, and it adds a Go binary to the
+  flow.
+- *Testinfra* — Python, runs *on the host* over SSH (nothing installed in the
+  VM), assertions are full Python with rich built-in modules
+  (File/Package/Service/User/Command), integrates with pytest's tooling.
+
+Choose Testinfra: it runs from the host (the VM stays clean), it's far more
+programmable for the conditional checks archsetup needs (DESKTOP_ENV branches,
+ZFS-vs-not), and it aligns with the repo's existing Python test tooling.
+
+* Migration plan (phased, TDD where the helper logic is ours)
+
+- *P1 — Scaffold.* conftest.py (host fixture + connection), the attribution
+  marker + report hook, and 3 parity checks (firewall, user, dotfiles). Wire a
+  pytest step into run-test.sh behind a flag so the shell sweep still runs.
+- *P2 — Full parity.* Port all ~26 checks; diff a real VM run's results against
+  the shell sweep to confirm no check was lost.
+- *P3 — Cut over.* Make pytest the primary sweep in run-test.sh; keep
+  =analyze_log_diff= and the install-log signal.
+- *P4 — Expand.* Add the new coverage (hardening, backups, applied settings).
+- *P5 — Retire.* Remove =run_all_validations= from validation.sh (keep the
+  capture/analyze helpers that pytest doesn't replace).
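Editorial sketch (not part of the committed file): the P2 step, diffing a real VM run's pytest results against the shell sweep, could be reduced to a small comparison like the one below. The function name =parity_gaps=, the dict shapes, and the shell check name =validate_dns= are hypothetical; =validate_firewall= and =test_ufw_enabled= are the only names taken from the design text. How the shell sweep's per-check outcomes would be exported is left open here.

```python
def parity_gaps(shell_checks, pytest_results, name_map):
    """Compare shell sweep outcomes with pytest outcomes.

    shell_checks:   {shell_check_name: passed_bool}
    pytest_results: {pytest_test_name: passed_bool}
    name_map:       {shell_check_name: pytest_test_name}

    Returns (missing, mismatched): shell checks with no pytest counterpart,
    and shell checks whose pytest counterpart reported a different outcome.
    """
    missing, mismatched = [], []
    for check, shell_passed in sorted(shell_checks.items()):
        test = name_map.get(check)
        if test is None or test not in pytest_results:
            missing.append(check)
        elif pytest_results[test] != shell_passed:
            mismatched.append(check)
    return missing, mismatched


# Hypothetical run: one check ported, one not yet.
shell = {"validate_firewall": True, "validate_dns": True}
ported = {"test_ufw_enabled": True}
mapping = {"validate_firewall": "test_ufw_enabled"}
missing, mismatched = parity_gaps(shell, ported, mapping)
```

An empty =missing= list on a real run is one way to express "no check was lost"; a non-empty =mismatched= list would point at checks whose port changed behavior.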
+
+* Acceptance criteria
+
+- =make test= runs archsetup in a VM, then a pytest sweep over SSH, and a real
+  run reports parity with (or a superset of) the current shell checks.
+- Failures still sort into archsetup / base-install / unknown, with base-install
+  issues routed to the archzfs inbox as today.
+- =make deps= installs the test dependencies; the VM has nothing extra installed.
+- A documented =pytest -m smoke= fast path exists.
+
+* Resolved decisions (2026-06-25)
+
+1. *Auth at validation time — inject a throwaway test key.* Pre-run, generate
+   an ephemeral keypair, push the pubkey into the VM's
+   =/root/.ssh/authorized_keys= over the existing sshpass channel, and point
+   Testinfra at the private key via a generated ssh-config. No password in the
+   pytest invocation; paramiko key auth just works; the keypair is discarded
+   after the run. (Chosen over wrapping sshpass around Testinfra, which is
+   awkward since Testinfra spawns its own ssh connections.)
+2. *Cut over — run both through parity, then switch.* Keep the shell sweep
+   running alongside pytest through P2 so a real VM run can diff pytest's
+   results against the shell sweep and prove no check was dropped. pytest
+   becomes primary at P3; =run_all_validations= is deleted at P5 after the
+   expanded suite proves out.
+3. *Expansion scope — full, in this task, after cutover.* All of P4 lands here,
+   sequenced strictly after the P3 parity cutover so the parity diff is clean
+   before new checks are added.
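Editorial sketch (not part of the committed file): resolved decision 1 points Testinfra at an ephemeral key through a generated ssh-config. One way that file might be rendered is shown below. The function name =render_ssh_config=, the host alias, and the key path are hypothetical; host, port, and user come from the "Connection" section. The keywords are standard OpenSSH =ssh_config= options, but whether the chosen Testinfra backend honors each of them is an assumption to verify, as is disabling host-key checking for a VM that is reverted to a snapshot on every run.

```python
def render_ssh_config(alias, hostname, port, user, identity_file):
    """Render a minimal ssh_config stanza for key-only auth to the test VM."""
    lines = [
        f"Host {alias}",
        f"    HostName {hostname}",
        f"    Port {port}",
        f"    User {user}",
        f"    IdentityFile {identity_file}",
        # Offer only the ephemeral key, not whatever an agent happens to hold.
        "    IdentitiesOnly yes",
        # Assumption: the VM's host key is not worth pinning because the VM
        # is rebuilt from a snapshot; skip known_hosts for this host only.
        "    StrictHostKeyChecking no",
        "    UserKnownHostsFile /dev/null",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical alias and key path; localhost:2222 and root are from the doc.
config_text = render_ssh_config(
    "archsetup-vm", "localhost", 2222, "root", "/tmp/archsetup-test/id_ed25519"
)
```

Writing =config_text= to a temporary file and deleting it together with the keypair after the run would match the "discarded after the run" intent.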
