author Craig Jennings <c@cjennings.net> 2026-06-25 11:58:04 -0400
committer Craig Jennings <c@cjennings.net> 2026-06-25 11:58:04 -0400
commit 3ff0ca70b9b7333ec100bd4d4212923a077553c5
tree 100905cd2f6bba64a7c0b37ffdefd15ef6ed184b
parent eed3f5ee29f0099d5510b249923558b1301ad889
docs(design): plan ZFS VM test coverage + bare-metal runner migration
Adds a design note for building a ZFS base VM via archangel with a filesystem profile selector (so make test covers the ZFS install path, currently only exercised on bare metal), migrating run-test-baremetal.sh to key auth and the Testinfra sweep, and then deleting the dead shell-sweep functions. Links it from the bare-metal migration follow-up.
-rw-r--r-- docs/design/2026-06-25-zfs-vm-test-coverage.org 120
-rw-r--r-- todo.org 1
2 files changed, 121 insertions, 0 deletions
diff --git a/docs/design/2026-06-25-zfs-vm-test-coverage.org b/docs/design/2026-06-25-zfs-vm-test-coverage.org
new file mode 100644
index 0000000..694478f
--- /dev/null
+++ b/docs/design/2026-06-25-zfs-vm-test-coverage.org
@@ -0,0 +1,120 @@
+#+TITLE: Design: ZFS VM Test Coverage + Bare-Metal Runner Migration
+#+AUTHOR: Craig Jennings
+#+DATE: 2026-06-25
+#+STATUS: Draft — for review
+
+* Problem
+
+Two gaps, one root:
+
+1. *The ZFS install path is untested in automation.* The VM harness
+ (=make test=) uses a single non-ZFS base image, so every ZFS-conditional
+ check skips (mkinitcpio udev hook on ZFS, sanoid, zfs-scrub timer, the whole
+ ZFS branch of archsetup). ZFS is exercised *only* by =run-test-baremetal.sh=
+ against real hardware.
+
+2. *=run-test-baremetal.sh= is latently broken by the sshd hardening.* It SSHes
+ to the target as root *by password* throughout the run, exactly the pattern
+ archsetup's =PermitRootLogin prohibit-password= (shipped 2026-06-24) kills
+ mid-install. The VM runner already hit and fixed this (=inject_root_key= +
+ key auth, commit f50fc1d); the bare-metal runner never got that fix, so it
+ almost certainly aborts mid-install now, the same way the VM runner did.
+
+The fix for both is the same shape: a ZFS base VM gives a safe, repeatable,
+snapshot-rollback ZFS target (no sacrificial hardware), which both fills the
+coverage gap *and* provides a target to migrate + validate the bare-metal
+runner against. This also unblocks P5 (deleting the dead shell-sweep functions
+from validation.sh), which is gated on the bare-metal runner leaving the shell
+sweep.
+
+* Decision
+
+Build a ZFS base VM via archangel, add a filesystem-profile selector to the VM
+harness so =make test= can target zfs or non-zfs, then migrate
+=run-test-baremetal.sh= to key auth + the Testinfra sweep and validate it
+against the ZFS VM. Finish by deleting the now-dead shell-sweep functions (P5).
+
+Explicitly rejected: loosening =PermitRootLogin= (or adding a skip-hardening
+test flag). That trades a real security feature for harness convenience and
+would mean never validating the hardened config. Key auth is the correct fix,
+already proven in the VM runner.
+
+* Current state (grounded)
+
+- =create-base-vm.sh= boots an =archangel-*.iso=, copies =archsetup-test.conf=
+ into the live env, runs =archangel --config-file /root/archsetup-test.conf=
+ (the base-OS install — partitioning/filesystem live here), powers off, and
+ snapshots =clean-install= onto =vm-images/archsetup-base.qcow2=.
+- =run-test.sh= hardcodes that one image + snapshot, and copies
+ =scripts/testing/archsetup-vm.conf= (DESKTOP_ENV=hyprland, non-ZFS) into the
+ VM as the archsetup config.
+- =run-test-baremetal.sh= takes =--host= / =--password=, SSHes as root by
+ password, rolls back ZFS =@genesis= snapshots, transfers + runs archsetup,
+ then calls =run_all_validations= / =validate_all_services= (overriding
+ =VM_IP= to the target). It is the only remaining caller of the shell sweep.
+- Key auth machinery already exists and is reusable: =inject_root_key= and
+ =SSH_KEY_OPT= in =vm-utils.sh=, and =run_testinfra_validation= in
+ =testinfra.sh= (drives connection from a generated ssh-config keyed on
+ =VM_IP= / =SSH_PORT=).
+
+* Design
+
+** A. ZFS archangel base
+Add a ZFS archangel config (an =archsetup-test-zfs.conf= or equivalent) that
+installs a ZFS root. Confirm archangel supports a ZFS-root config (it's a
+separate project — verify its config options first). Unencrypted ZFS for the
+test VM (skip the passphrase prompt; encryption isn't what we're validating).
+
+** B. Per-profile base images + selector
+- =create-base-vm.sh= takes a profile (e.g. =FS_PROFILE=zfs|ext4=, default
+ current/non-ZFS), picks the matching archangel config, and writes a
+ profile-named image: =vm-images/archsetup-base.qcow2= (default) vs
+ =vm-images/archsetup-base-zfs.qcow2=. Same =clean-install= snapshot name.
+- =run-test.sh= + Makefile take the same =FS_PROFILE= and select the image +
+ the matching archsetup config (=archsetup-vm.conf= vs =archsetup-vm-zfs.conf=,
+ the latter with the ZFS filesystem settings). Invoke as =make test FS_PROFILE=zfs=.
+
+** C. Bare-metal runner migration
+Mirror the VM runner's fix in =run-test-baremetal.sh=:
+- After the first successful SSH to =TARGET_HOST=, call =inject_root_key=, which
+ authorizes a key over the password session. Set =VM_IP=$TARGET_HOST= and
+ =SSH_PORT=22= so the helpers and the ssh-config target the real host.
+- Replace =run_all_validations= / =validate_all_services= with
+ =run_testinfra_validation= (now authoritative).
+- Everything downstream already routes through =$SSH_KEY_OPT= (the vm-utils
+ helpers) and the ssh-config, so it survives the hardening.
+
+** D. Validate
+- =make test FS_PROFILE=zfs= → the ZFS-conditional pytest checks now *run*
+ (not skip): mkinitcpio uses the udev hook, sanoid installed, zfs-scrub timer,
+ zfs root. Fix any real ZFS-path findings archsetup has.
+- Point =run-test-baremetal.sh= at the ZFS VM (or real hardware) → confirm the
+ key-auth migration carries it through the hardening to a green pytest sweep.
+
+** E. Delete the shell sweep (P5)
+Once both runners use =run_testinfra_validation=, delete the dead functions from
+=validation.sh=: =run_all_validations=, =validate_all_services=, the ~26 =validate_*=
+checks, =validate_service*=, =run_full_validation=, =validation_pass/fail/warn/skip=.
+Keep the live helpers: =ssh_cmd=, =attribute_issue=, =capture_pre/post_install_state=,
+=analyze_log_diff=, =categorize_errors=, =generate_issue_report=, =VALIDATION_*=.
+
+* Phases
+- *P-A* archangel ZFS config (verify archangel ZFS support first).
+- *P-B* create-base-vm.sh + run-test.sh + Makefile profile selector; build the
+ ZFS base image + snapshot.
+- *P-C* =make test FS_PROFILE=zfs= green (ZFS-conditional tests run; fix
+ findings). This phase can be validated in the VM.
+- *P-D* migrate run-test-baremetal.sh to key auth + Testinfra; validate against
+ the ZFS VM.
+- *P-E* delete the dead shell-sweep functions (the standing P5 follow-up).
+
+* Open questions
+1. *Does archangel support a ZFS-root config out of the box?* Verify before P-A;
+ if not, that's its own sub-task (or a feature request to archangel).
+2. *Two images vs one image + two snapshots?* Lean two images — ZFS vs ext4 are
+ different on-disk layouts; cleaner than juggling snapshots on one disk.
+3. *Profile on run-test.sh vs a separate run-test-zfs.sh?* Lean a =FS_PROFILE=
+ param on the existing runner — avoids duplicating the harness.
+4. *Disk size / RAM for the ZFS VM* (ZFS wants more RAM than the 4G default?).
+5. *Should the bare-metal runner stay at all once a ZFS VM exists*, or does the
+ ZFS VM profile make it redundant for everything except real-hardware smoke?
diff --git a/todo.org b/todo.org
index ed9ac7d..a57d672 100644
--- a/todo.org
+++ b/todo.org
@@ -533,6 +533,7 @@ If modifications fail or are incorrect, difficult to recover - should backup fil
Done 2026-06-25: added a =backup_system_file <path>= helper next to =safe_rm_rf= — it snapshots a pre-existing file to =<path>.archsetup.bak= before an in-place edit, idempotent (never clobbers an existing backup, so the pristine original survives repeated edits and re-runs), =cp -p= to preserve mode/ownership, no-op when the file is absent. Took the narrow scope (Craig's call): route only the in-place =sed -i= / append edits to *pre-existing* files through it — locale.gen, makepkg.conf, pacman.conf, sudoers, conf.d/wireless-regdom, geoclue.conf, conf.d/pacman-contrib, fstab, mkinitcpio.conf, vconsole.conf — and skip the brand-new drop-in files archsetup fully owns (nothing to back up; recovery is just deleting them). Tests: =tests/backup-system-file/= (7 Normal/Boundary/Error, incl. mode-preserved, existing-backup-not-overwritten, missing-target no-op, cp-failure). =make test-unit= green across all 5 suites; =bash -n= clean; only shellcheck note is the known SC2329 false positive (indirect STEPS dispatch). Integration verification is the next VM run.
** TODO [#B] Migrate bare-metal test runner to Testinfra, then delete the shell sweep :test:
+Plan + ZFS-coverage expansion: [[file:docs/design/2026-06-25-zfs-vm-test-coverage.org]] (build a ZFS base VM via archangel + a =FS_PROFILE= selector so =make test= covers the ZFS path, then migrate this runner to key auth + Testinfra against it, then delete the dead =validation.sh= functions, which is phase E of that plan).
=run-test.sh= (VM) now uses the Testinfra/pytest sweep as its authoritative validator, but =run-test-baremetal.sh= (lines ~243-244) still calls the old =run_all_validations= / =validate_all_services= from =scripts/testing/lib/validation.sh=. Migrate the bare-metal runner to =run_testinfra_validation= too (same key + ssh-config approach, adapted for a real host), then delete the now-dead shell-sweep functions from =validation.sh=. Keep the live helpers: =ssh_cmd=, =attribute_issue=, =capture_pre/post_install_state=, =analyze_log_diff=, =categorize_errors=, =generate_issue_report=, and the =VALIDATION_*= counters/arrays. Deferred from the Testinfra cutover because it needs a bare-metal test loop to validate, out of scope for the VM-only autonomous run.
** DONE [#B] Implement Testinfra test suite for archsetup