From 3ff0ca70b9b7333ec100bd4d4212923a077553c5 Mon Sep 17 00:00:00 2001 From: Craig Jennings Date: Thu, 25 Jun 2026 11:58:04 -0400 Subject: docs(design): plan ZFS VM test coverage + bare-metal runner migration Adds a design note for building a ZFS base VM via archangel with a filesystem profile selector (so make test covers the ZFS install path, currently only exercised on bare metal), migrating run-test-baremetal.sh to key auth and the Testinfra sweep, and then deleting the dead shell-sweep functions. Links it from the bare-metal migration follow-up. --- docs/design/2026-06-25-zfs-vm-test-coverage.org | 120 ++++++++++++++++++++++++ 1 file changed, 120 insertions(+) create mode 100644 docs/design/2026-06-25-zfs-vm-test-coverage.org (limited to 'docs/design') diff --git a/docs/design/2026-06-25-zfs-vm-test-coverage.org b/docs/design/2026-06-25-zfs-vm-test-coverage.org new file mode 100644 index 0000000..694478f --- /dev/null +++ b/docs/design/2026-06-25-zfs-vm-test-coverage.org @@ -0,0 +1,120 @@ +#+TITLE: Design: ZFS VM Test Coverage + Bare-Metal Runner Migration +#+AUTHOR: Craig Jennings +#+DATE: 2026-06-25 +#+STATUS: Draft — for review + +* Problem + +Two gaps, one root: + +1. *The ZFS install path is untested in automation.* The VM harness + (=make test=) uses a single non-ZFS base image, so every ZFS-conditional + check skips (mkinitcpio udev hook on ZFS, sanoid, zfs-scrub timer, the whole + ZFS branch of archsetup). ZFS is exercised *only* by =run-test-baremetal.sh= + against real hardware. + +2. *=run-test-baremetal.sh= is latently broken by the sshd hardening.* It SSHes + to the target as root *by password* throughout the run, exactly the pattern + archsetup's =PermitRootLogin prohibit-password= (shipped 2026-06-24) kills + mid-install. The VM runner already hit and fixed this (=inject_root_key= + + key auth, commit f50fc1d); the bare-metal runner never got that fix, so it + almost certainly aborts mid-install now, the same way the VM runner did. + +The fix for both is the same shape: a ZFS base VM gives a safe, repeatable, +snapshot-rollback ZFS target (no sacrificial hardware), which both fills the +coverage gap *and* provides a target to migrate + validate the bare-metal +runner against. This also unblocks P5 (deleting the dead shell-sweep functions +from validation.sh), which is gated on the bare-metal runner leaving the shell +sweep. + +* Decision + +Build a ZFS base VM via archangel, add a filesystem-profile selector to the VM +harness so =make test= can target zfs or non-zfs, then migrate +=run-test-baremetal.sh= to key auth + the Testinfra sweep and validate it +against the ZFS VM. Finish by deleting the now-dead shell-sweep functions (P5). + +Explicitly rejected: loosening =PermitRootLogin= (or adding a skip-hardening +test flag). That trades a real security feature for harness convenience and +would mean never validating the hardened config. Key auth is the correct fix, +already proven in the VM runner. + +* Current state (grounded) + +- =create-base-vm.sh= boots an =archangel-*.iso=, copies =archsetup-test.conf= + into the live env, runs =archangel --config-file /root/archsetup-test.conf= + (the base-OS install — partitioning/filesystem live here), powers off, and + snapshots =clean-install= onto =vm-images/archsetup-base.qcow2=. +- =run-test.sh= hardcodes that one image + snapshot, and copies + =scripts/testing/archsetup-vm.conf= (DESKTOP_ENV=hyprland, non-ZFS) into the + VM as the archsetup config. +- =run-test-baremetal.sh= takes =--host= / =--password=, SSHes as root by + password, rolls back ZFS =@genesis= snapshots, transfers + runs archsetup, + then calls =run_all_validations= / =validate_all_services= (overriding + =VM_IP= to the target). It is the only remaining caller of the shell sweep. +- Key auth machinery already exists and is reusable: =inject_root_key= and + =SSH_KEY_OPT= in =vm-utils.sh=, and =run_testinfra_validation= in + =testinfra.sh= (drives connection from a generated ssh-config keyed on + =VM_IP= / =SSH_PORT=). + +* Design + +** A. ZFS archangel base +Add a ZFS archangel config (a =archsetup-test-zfs.conf= or equivalent) that +installs a ZFS root. Confirm archangel supports a ZFS-root config (it's a +separate project — verify its config options first). Unencrypted ZFS for the +test VM (skip the passphrase prompt; encryption isn't what we're validating). + +** B. Per-profile base images + selector +- =create-base-vm.sh= takes a profile (e.g. =FS_PROFILE=zfs|ext4=, default + current/non-ZFS), picks the matching archangel config, and writes a + profile-named image: =vm-images/archsetup-base.qcow2= (default) vs + =vm-images/archsetup-base-zfs.qcow2=. Same =clean-install= snapshot name. +- =run-test.sh= + Makefile take the same =FS_PROFILE= and select the image + + the matching archsetup config (=archsetup-vm.conf= vs =archsetup-vm-zfs.conf=, + the latter with the ZFS filesystem settings). =make test FS_PROFILE=zfs=. + +** C. Bare-metal runner migration +Mirror the VM runner's fix in =run-test-baremetal.sh=: +- After the first successful SSH to =TARGET_HOST=, call =inject_root_key= (it + authorizes a key over the password session; set =VM_IP=TARGET_HOST=, + =SSH_PORT=22= so the helpers + ssh-config target the real host). +- Replace =run_all_validations= / =validate_all_services= with + =run_testinfra_validation= (now authoritative). +- Everything downstream already routes through =$SSH_KEY_OPT= (the vm-utils + helpers) and the ssh-config, so it survives the hardening. + +** D. Validate +- =make test FS_PROFILE=zfs= → the ZFS-conditional pytest checks now *run* + (not skip): mkinitcpio uses the udev hook, sanoid installed, zfs-scrub timer, + zfs root. Fix any real ZFS-path findings archsetup has. +- Point =run-test-baremetal.sh= at the ZFS VM (or real hardware) → confirm the + key-auth migration carries it through the hardening to a green pytest sweep. + +** E. Delete the shell sweep (P5) +Once both runners use =run_testinfra_validation=, delete the dead functions from +=validation.sh= (run_all_validations, validate_all_services, the ~26 validate_* +checks, validate_service*, run_full_validation, validation_pass/fail/warn/skip). +Keep the live helpers: ssh_cmd, attribute_issue, capture_pre/post_install_state, +analyze_log_diff, categorize_errors, generate_issue_report, VALIDATION_*. + +* Phases +- *P-A* archangel ZFS config (verify archangel ZFS support first). +- *P-B* create-base-vm.sh + run-test.sh + Makefile profile selector; build the + ZFS base image + snapshot. +- *P-C* =make test FS_PROFILE=zfs= green (ZFS-conditional tests run; fix + findings). VM-validatable here. +- *P-D* migrate run-test-baremetal.sh to key auth + Testinfra; validate against + the ZFS VM. +- *P-E* delete the dead shell-sweep functions (the standing P5 follow-up). + +* Open questions +1. *Does archangel support a ZFS-root config out of the box?* Verify before P-A; + if not, that's its own sub-task (or a feature request to archangel). +2. *Two images vs one image + two snapshots?* Lean two images — ZFS vs ext4 are + different on-disk layouts; cleaner than juggling snapshots on one disk. +3. *Profile on run-test.sh vs a separate run-test-zfs.sh?* Lean a =FS_PROFILE= + param on the existing runner — avoids duplicating the harness. +4. *Disk size / RAM for the ZFS VM* (ZFS wants more RAM than the 4G default?). +5. *Should the bare-metal runner stay at all once a ZFS VM exists*, or does the + ZFS VM profile make it redundant for everything except real-hardware smoke? -- cgit v1.2.3