#+TITLE: Design: ZFS VM Test Coverage + Bare-Metal Runner Migration
#+AUTHOR: Craig Jennings
#+DATE: 2026-06-25
#+STATUS: Draft (for review)

* Problem

Two gaps, one root:

1. *The ZFS install path is untested in automation.* The VM harness (=make test=) uses a single non-ZFS base image, so every ZFS-conditional check skips (mkinitcpio udev hook on ZFS, sanoid, zfs-scrub timer, the whole ZFS branch of archsetup). ZFS is exercised *only* by =run-test-baremetal.sh= against real hardware.
2. *=run-test-baremetal.sh= is latently broken by the sshd hardening.* It SSHes to the target as root *by password* throughout the run, exactly the pattern archsetup's =PermitRootLogin prohibit-password= (shipped 2026-06-24) kills mid-install. The VM runner already hit and fixed this (=inject_root_key= + key auth, commit f50fc1d); the bare-metal runner never got that fix, so it almost certainly aborts mid-install now, the same way the VM runner did.

The fix for both is the same shape: a ZFS base VM gives a safe, repeatable, snapshot-rollback ZFS target (no sacrificial hardware), which both fills the coverage gap *and* provides a target to migrate + validate the bare-metal runner against. This also unblocks P5 (deleting the dead shell-sweep functions from validation.sh), which is gated on the bare-metal runner leaving the shell sweep.

* Decision

Build a ZFS base VM via archangel, add a filesystem-profile selector to the VM harness so =make test= can target zfs or non-zfs, then migrate =run-test-baremetal.sh= to key auth + the Testinfra sweep and validate it against the ZFS VM. Finish by deleting the now-dead shell-sweep functions (P5).

Explicitly rejected: loosening =PermitRootLogin= (or adding a skip-hardening test flag). That trades a real security feature for harness convenience and would mean never validating the hardened config. Key auth is the correct fix, already proven in the VM runner.
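The key-auth fix depends on authorizing a key while the password session still works, before archsetup applies =PermitRootLogin prohibit-password=. A minimal sketch of that step follows. The helper name =authorize_pubkey= is hypothetical: it is not the real =inject_root_key= from =vm-utils.sh=, whose implementation is not shown in this document. The sketch only illustrates an idempotent append to an =authorized_keys= file.

```shell
#!/usr/bin/env bash
# Hypothetical helper, NOT the real inject_root_key from vm-utils.sh.
# Idempotently adds one public-key line to an authorized_keys file.
authorize_pubkey() {
    local pubkey="$1" auth_file="$2"
    local ssh_dir
    ssh_dir="$(dirname "$auth_file")"

    # sshd ignores keys when the directory or file is too permissive.
    mkdir -p "$ssh_dir"
    chmod 700 "$ssh_dir"
    touch "$auth_file"
    chmod 600 "$auth_file"

    # -x matches whole lines, -F treats the key as a fixed string:
    # append only when the exact key line is not already present.
    grep -qxF -- "$pubkey" "$auth_file" || printf '%s\n' "$pubkey" >> "$auth_file"
}
```

Run once on the target over the password session, a step like this would let every later connection use the key, so the run no longer depends on root password login surviving the hardening.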
* Current state (grounded)

- =create-base-vm.sh= boots an =archangel-*.iso=, copies =archsetup-test.conf= into the live env, runs =archangel --config-file /root/archsetup-test.conf= (the base-OS install; partitioning/filesystem live here), powers off, and snapshots =clean-install= onto =vm-images/archsetup-base.qcow2=.
- =run-test.sh= hardcodes that one image + snapshot, and copies =scripts/testing/archsetup-vm.conf= (DESKTOP_ENV=hyprland, non-ZFS) into the VM as the archsetup config.
- =run-test-baremetal.sh= takes =--host= / =--password=, SSHes as root by password, rolls back ZFS =@genesis= snapshots, transfers + runs archsetup, then calls =run_all_validations= / =validate_all_services= (overriding =VM_IP= to the target). It is the only remaining caller of the shell sweep.
- Key auth machinery already exists and is reusable: =inject_root_key= and =SSH_KEY_OPT= in =vm-utils.sh=, and =run_testinfra_validation= in =testinfra.sh= (drives connection from a generated ssh-config keyed on =VM_IP= / =SSH_PORT=).

* Design

** A. ZFS archangel base

Add a ZFS archangel config (an =archsetup-test-zfs.conf= or equivalent) that installs a ZFS root. Confirm archangel supports a ZFS-root config (it's a separate project, so verify its config options first). Use unencrypted ZFS for the test VM (skip the passphrase prompt; encryption isn't what we're validating).

** B. Per-profile base images + selector

- =create-base-vm.sh= takes a profile (e.g. =FS_PROFILE=zfs|ext4=, default current/non-ZFS), picks the matching archangel config, and writes a profile-named image: =vm-images/archsetup-base.qcow2= (default) vs =vm-images/archsetup-base-zfs.qcow2=. Same =clean-install= snapshot name.
- =run-test.sh= + Makefile take the same =FS_PROFILE= and select the image (via =init_vm_paths=). The archsetup run config (=archsetup-vm.conf=) is *shared*: archsetup auto-detects ZFS from the live root, so no per-profile run config is needed. Invocation: =make test FS_PROFILE=zfs=.

** C. Bare-metal runner migration

Mirror the VM runner's fix in =run-test-baremetal.sh=:

- After the first successful SSH to =TARGET_HOST=, call =inject_root_key= (it authorizes a key over the password session; set =VM_IP=TARGET_HOST=, =SSH_PORT=22= so the helpers + ssh-config target the real host).
- Replace =run_all_validations= / =validate_all_services= with =run_testinfra_validation= (now authoritative).
- Everything downstream already routes through =$SSH_KEY_OPT= (the vm-utils helpers) and the ssh-config, so it survives the hardening.

** D. Validate

- =make test FS_PROFILE=zfs= → the ZFS-conditional pytest checks now *run* (not skip): mkinitcpio uses the udev hook, sanoid installed, zfs-scrub timer, zfs root. Fix any real ZFS-path findings archsetup has.
- Point =run-test-baremetal.sh= at the ZFS VM (or real hardware) → confirm the key-auth migration carries it through the hardening to a green pytest sweep.

** E. Delete the shell sweep (P5)

Once both runners use =run_testinfra_validation=, delete the dead functions from =validation.sh= (run_all_validations, validate_all_services, the ~26 validate_* checks, validate_service*, run_full_validation, validation_pass/fail/warn/skip). Keep the live helpers: ssh_cmd, attribute_issue, capture_pre/post_install_state, analyze_log_diff, categorize_errors, generate_issue_report, VALIDATION_*.

* Phases

- *P-A* archangel ZFS config (verify archangel ZFS support first).
- *P-B* create-base-vm.sh + run-test.sh + Makefile profile selector; build the ZFS base image + snapshot.
- *P-C* =make test FS_PROFILE=zfs= green (ZFS-conditional tests run; fix findings). VM-validatable here.
- *P-D* migrate run-test-baremetal.sh to key auth + Testinfra; validate against the ZFS VM.
- *P-E* delete the dead shell-sweep functions (the standing P5 follow-up).

* Open questions

1. *Does archangel support a ZFS-root config out of the box?* RESOLVED (yes). ZFS is archangel's *default* filesystem (=FILESYSTEM=zfs=, validated by =installer/lib/config.sh:validate_filesystem=), with =NO_ENCRYPT=yes= for an unattended unencrypted install and a ready =installer/velox-zfs.conf.example= to model. No archangel work needed.
2. *Two images vs one image + two snapshots?* RESOLVED: two images. ZFS and btrfs have different on-disk layouts, so separate images are cleaner than juggling snapshots on one disk. =btrfs= keeps the legacy unsuffixed =archsetup-base.qcow2=; =zfs= gets =archsetup-base-zfs.qcow2=.
3. *Profile on run-test.sh vs a separate run-test-zfs.sh?* RESOLVED: =FS_PROFILE= env param on the existing runner + Makefile, no duplicate harness.
4. *Disk size / RAM for the ZFS VM:* start at the 4G RAM / 50G disk defaults; bump =VM_RAM= only if the ZFS install OOMs (decide at P-C build time).
5. *Should the bare-metal runner stay at all once a ZFS VM exists*, or does the ZFS VM profile make it redundant for everything except real-hardware smoke? Defer until after P-D.

* Design corrections (found during P-A/P-B grounding)

- The "non-ZFS" base is *btrfs*, not ext4: =archsetup-test.conf= sets =FILESYSTEM=btrfs=. The profile axis is zfs vs btrfs throughout.
- *No =archsetup-vm-zfs.conf= is needed.* archsetup reads no filesystem key; it auto-detects ZFS from the live root via =is_zfs_root()= (=findmnt -n -o FSTYPE /=, archsetup:688). The ZFS branch (sanoid, zfs-scrub timer, mkinitcpio udev hook, docker zfs storage driver) fires whenever the running root is ZFS. So only the *archangel* base config and the base *image* differ per profile; the archsetup run config (=archsetup-vm.conf=) is shared.
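
The two per-profile pieces that remain after these corrections (image selection in the harness, root detection in archsetup) can be sketched as follows. =base_image_for_profile= is a hypothetical name; the design routes image selection through =init_vm_paths=, whose code is not shown here. The detection function is reconstructed only from the =findmnt= call quoted above, so the real =is_zfs_root()= in archsetup may differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map FS_PROFILE to the base image path named in this
# design. btrfs is the default and keeps the legacy unsuffixed image.
base_image_for_profile() {
    case "${1:-btrfs}" in
        btrfs) printf '%s\n' "vm-images/archsetup-base.qcow2" ;;
        zfs)   printf '%s\n' "vm-images/archsetup-base-zfs.qcow2" ;;
        *)     printf 'unknown FS_PROFILE: %s\n' "$1" >&2
               return 1 ;;
    esac
}

# Sketch of root-filesystem detection, assuming only the findmnt call
# quoted in this document; succeeds when / is mounted from ZFS.
is_zfs_root() {
    [ "$(findmnt -n -o FSTYPE /)" = "zfs" ]
}
```

Rejecting unknown profiles instead of falling back to the default keeps a typo such as =FS_PROFILE=ext4= from silently testing the btrfs image.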