aboutsummaryrefslogtreecommitdiff
path: root/docs/design
diff options
context:
space:
mode:
Diffstat (limited to 'docs/design')
-rw-r--r--docs/design/2026-06-25-zfs-vm-test-coverage.org120
1 files changed, 120 insertions, 0 deletions
diff --git a/docs/design/2026-06-25-zfs-vm-test-coverage.org b/docs/design/2026-06-25-zfs-vm-test-coverage.org
new file mode 100644
index 0000000..694478f
--- /dev/null
+++ b/docs/design/2026-06-25-zfs-vm-test-coverage.org
@@ -0,0 +1,120 @@
+#+TITLE: Design: ZFS VM Test Coverage + Bare-Metal Runner Migration
+#+AUTHOR: Craig Jennings
+#+DATE: 2026-06-25
+#+STATUS: Draft — for review
+
+* Problem
+
+Two gaps, one root:
+
+1. *The ZFS install path is untested in automation.* The VM harness
+ (=make test=) uses a single non-ZFS base image, so every ZFS-conditional
+ check skips (mkinitcpio udev hook on ZFS, sanoid, zfs-scrub timer, the whole
+ ZFS branch of archsetup). ZFS is exercised *only* by =run-test-baremetal.sh=
+ against real hardware.
+
+2. *=run-test-baremetal.sh= is latently broken by the sshd hardening.* It SSHes
+ to the target as root *by password* throughout the run, exactly the pattern
+ archsetup's =PermitRootLogin prohibit-password= (shipped 2026-06-24) kills
+ mid-install. The VM runner already hit and fixed this (=inject_root_key= +
+ key auth, commit f50fc1d); the bare-metal runner never got that fix, so it
+ almost certainly aborts mid-install now, the same way the VM runner did.
+
+The fix for both is the same shape: a ZFS base VM gives a safe, repeatable,
+snapshot-rollback ZFS target (no sacrificial hardware), which both fills the
+coverage gap *and* provides a target to migrate + validate the bare-metal
+runner against. This also unblocks P5 (deleting the dead shell-sweep functions
+from validation.sh), which is gated on the bare-metal runner leaving the shell
+sweep.
+
+* Decision
+
+Build a ZFS base VM via archangel, add a filesystem-profile selector to the VM
+harness so =make test= can target zfs or non-zfs, then migrate
+=run-test-baremetal.sh= to key auth + the Testinfra sweep and validate it
+against the ZFS VM. Finish by deleting the now-dead shell-sweep functions (P5).
+
+Explicitly rejected: loosening =PermitRootLogin= (or adding a skip-hardening
+test flag). That trades a real security feature for harness convenience and
+would mean never validating the hardened config. Key auth is the correct fix,
+already proven in the VM runner.
+
+* Current state (grounded)
+
+- =create-base-vm.sh= boots an =archangel-*.iso=, copies =archsetup-test.conf=
+ into the live env, runs =archangel --config-file /root/archsetup-test.conf=
+ (the base-OS install — partitioning/filesystem live here), powers off, and
+ snapshots =clean-install= onto =vm-images/archsetup-base.qcow2=.
+- =run-test.sh= hardcodes that one image + snapshot, and copies
+ =scripts/testing/archsetup-vm.conf= (DESKTOP_ENV=hyprland, non-ZFS) into the
+ VM as the archsetup config.
+- =run-test-baremetal.sh= takes =--host= / =--password=, SSHes as root by
+ password, rolls back ZFS =@genesis= snapshots, transfers + runs archsetup,
+ then calls =run_all_validations= / =validate_all_services= (overriding
+ =VM_IP= to the target). It is the only remaining caller of the shell sweep.
+- Key auth machinery already exists and is reusable: =inject_root_key= and
+ =SSH_KEY_OPT= in =vm-utils.sh=, and =run_testinfra_validation= in
+ =testinfra.sh= (drives connection from a generated ssh-config keyed on
+ =VM_IP= / =SSH_PORT=).
+
+* Design
+
+** A. ZFS archangel base
+Add a ZFS archangel config (a =archsetup-test-zfs.conf= or equivalent) that
+installs a ZFS root. Confirm archangel supports a ZFS-root config (it's a
+separate project — verify its config options first). Unencrypted ZFS for the
+test VM (skip the passphrase prompt; encryption isn't what we're validating).
+
+** B. Per-profile base images + selector
+- =create-base-vm.sh= takes a profile (e.g. =FS_PROFILE=zfs|ext4=, default
+ current/non-ZFS), picks the matching archangel config, and writes a
+ profile-named image: =vm-images/archsetup-base.qcow2= (default) vs
+ =vm-images/archsetup-base-zfs.qcow2=. Same =clean-install= snapshot name.
+- =run-test.sh= + Makefile take the same =FS_PROFILE= and select the image +
+ the matching archsetup config (=archsetup-vm.conf= vs =archsetup-vm-zfs.conf=,
+ the latter with the ZFS filesystem settings). =make test FS_PROFILE=zfs=.
+
+** C. Bare-metal runner migration
+Mirror the VM runner's fix in =run-test-baremetal.sh=:
+- After the first successful SSH to =TARGET_HOST=, call =inject_root_key= (it
+ authorizes a key over the password session; set =VM_IP=TARGET_HOST=,
+ =SSH_PORT=22= so the helpers + ssh-config target the real host).
+- Replace =run_all_validations= / =validate_all_services= with
+ =run_testinfra_validation= (now authoritative).
+- Everything downstream already routes through =$SSH_KEY_OPT= (the vm-utils
+ helpers) and the ssh-config, so it survives the hardening.
+
+** D. Validate
+- =make test FS_PROFILE=zfs= → the ZFS-conditional pytest checks now *run*
+ (not skip): mkinitcpio uses the udev hook, sanoid installed, zfs-scrub timer,
+ zfs root. Fix any real ZFS-path findings archsetup has.
+- Point =run-test-baremetal.sh= at the ZFS VM (or real hardware) → confirm the
+ key-auth migration carries it through the hardening to a green pytest sweep.
+
+** E. Delete the shell sweep (P5)
+Once both runners use =run_testinfra_validation=, delete the dead functions from
+=validation.sh= (run_all_validations, validate_all_services, the ~26 validate_*
+checks, validate_service*, run_full_validation, validation_pass/fail/warn/skip).
+Keep the live helpers: ssh_cmd, attribute_issue, capture_pre/post_install_state,
+analyze_log_diff, categorize_errors, generate_issue_report, VALIDATION_*.
+
+* Phases
+- *P-A* archangel ZFS config (verify archangel ZFS support first).
+- *P-B* create-base-vm.sh + run-test.sh + Makefile profile selector; build the
+ ZFS base image + snapshot.
+- *P-C* =make test FS_PROFILE=zfs= green (ZFS-conditional tests run; fix
+ findings). VM-validatable here.
+- *P-D* migrate run-test-baremetal.sh to key auth + Testinfra; validate against
+ the ZFS VM.
+- *P-E* delete the dead shell-sweep functions (the standing P5 follow-up).
+
+* Open questions
+1. *Does archangel support a ZFS-root config out of the box?* Verify before P-A;
+ if not, that's its own sub-task (or a feature request to archangel).
+2. *Two images vs one image + two snapshots?* Lean two images — ZFS vs ext4 are
+ different on-disk layouts; cleaner than juggling snapshots on one disk.
+3. *Profile on run-test.sh vs a separate run-test-zfs.sh?* Lean a =FS_PROFILE=
+ param on the existing runner — avoids duplicating the harness.
+4. *Disk size / RAM for the ZFS VM* (ZFS wants more RAM than the 4G default?).
+5. *Should the bare-metal runner stay at all once a ZFS VM exists*, or does the
+ ZFS VM profile make it redundant for everything except real-hardware smoke?