#+TITLE: Design: ZFS VM Test Coverage + Bare-Metal Runner Migration
#+AUTHOR: Craig Jennings
#+DATE: 2026-06-25
#+STATUS: Draft — for review
* Problem
Two gaps, one root:
1. *The ZFS install path is untested in automation.* The VM harness
(=make test=) uses a single non-ZFS base image, so every ZFS-conditional
check skips (mkinitcpio udev hook on ZFS, sanoid, zfs-scrub timer, the whole
ZFS branch of archsetup). ZFS is exercised *only* by =run-test-baremetal.sh=
against real hardware.
2. *=run-test-baremetal.sh= is latently broken by the sshd hardening.* It SSHes
to the target as root *by password* throughout the run, exactly the pattern
archsetup's =PermitRootLogin prohibit-password= (shipped 2026-06-24) kills
mid-install. The VM runner already hit and fixed this (=inject_root_key= +
key auth, commit f50fc1d); the bare-metal runner never got that fix, so it
almost certainly aborts mid-install now, the same way the VM runner did.
The fix for both is the same shape: a ZFS base VM gives a safe, repeatable,
snapshot-rollback ZFS target (no sacrificial hardware), which both fills the
coverage gap *and* provides a target to migrate + validate the bare-metal
runner against. This also unblocks P5 (deleting the dead shell-sweep functions
from validation.sh), which is gated on the bare-metal runner leaving the shell
sweep.
* Decision
Build a ZFS base VM via archangel, add a filesystem-profile selector to the VM
harness so =make test= can target zfs or non-zfs, then migrate
=run-test-baremetal.sh= to key auth + the Testinfra sweep and validate it
against the ZFS VM. Finish by deleting the now-dead shell-sweep functions (P5).
Explicitly rejected: loosening =PermitRootLogin= (or adding a skip-hardening
test flag). That trades a real security feature for harness convenience and
would mean never validating the hardened config. Key auth is the correct fix,
already proven in the VM runner.
* Current state (grounded)
- =create-base-vm.sh= boots an =archangel-*.iso=, copies =archsetup-test.conf=
into the live env, runs =archangel --config-file /root/archsetup-test.conf=
(the base-OS install — partitioning/filesystem live here), powers off, and
snapshots =clean-install= onto =vm-images/archsetup-base.qcow2=.
- =run-test.sh= hardcodes that one image + snapshot, and copies
=scripts/testing/archsetup-vm.conf= (DESKTOP_ENV=hyprland, non-ZFS) into the
VM as the archsetup config.
- =run-test-baremetal.sh= takes =--host= / =--password=, SSHes as root by
password, rolls back ZFS =@genesis= snapshots, transfers + runs archsetup,
then calls =run_all_validations= / =validate_all_services= (overriding
=VM_IP= to the target). It is the only remaining caller of the shell sweep.
- Key auth machinery already exists and is reusable: =inject_root_key= and
=SSH_KEY_OPT= in =vm-utils.sh=, and =run_testinfra_validation= in
=testinfra.sh= (drives connection from a generated ssh-config keyed on
=VM_IP= / =SSH_PORT=).
* Design
** A. ZFS archangel base
Add a ZFS archangel config (an =archsetup-test-zfs.conf= or equivalent) that
installs a ZFS root. Confirm archangel supports a ZFS-root config (it's a
separate project — verify its config options first). Unencrypted ZFS for the
test VM (skip the passphrase prompt; encryption isn't what we're validating).
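A minimal sketch of what the ZFS archangel config might contain, limited to the two keys this document confirms under Open questions (=FILESYSTEM= and =NO_ENCRYPT=). The file name is the one proposed above; every other key (disk, hostname, users) is omitted here and would be modeled on the existing =archsetup-test.conf= or =installer/velox-zfs.conf.example=.

```shell
# archsetup-test-zfs.conf (sketch): only the keys confirmed in this document.
# All remaining keys are assumed to mirror archsetup-test.conf.
FILESYSTEM=zfs     # ZFS root; this is archangel's default filesystem
NO_ENCRYPT=yes     # unattended, unencrypted install (no passphrase prompt)
```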
** B. Per-profile base images + selector
- =create-base-vm.sh= takes a profile (e.g. =FS_PROFILE=zfs|btrfs=, default
current/non-ZFS), picks the matching archangel config, and writes a
profile-named image: =vm-images/archsetup-base.qcow2= (default) vs
=vm-images/archsetup-base-zfs.qcow2=. Same =clean-install= snapshot name.
- =run-test.sh= + Makefile take the same =FS_PROFILE= and select the image (via
=init_vm_paths=). The archsetup run config (=archsetup-vm.conf=) is *shared* —
archsetup auto-detects ZFS from the live root, so no per-profile run config is
needed. Invocation: =make test FS_PROFILE=zfs=.
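The profile-to-image mapping could be sketched as below. The function names =image_for_profile= and =archangel_conf_for_profile= are hypothetical (the real selection is expected to live in =init_vm_paths= and =create-base-vm.sh=), and the profile values follow the zfs vs btrfs axis recorded under Design corrections.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names are illustrative, not taken from the harness.

# Map FS_PROFILE to the base image path.
# btrfs (the default) keeps the legacy unsuffixed image name.
image_for_profile() {
    local profile="${1:-btrfs}"
    case "$profile" in
        btrfs) echo "vm-images/archsetup-base.qcow2" ;;
        zfs)   echo "vm-images/archsetup-base-zfs.qcow2" ;;
        *)     echo "unknown FS_PROFILE: $profile" >&2; return 1 ;;
    esac
}

# Map FS_PROFILE to the archangel config used to build that image.
archangel_conf_for_profile() {
    local profile="${1:-btrfs}"
    case "$profile" in
        btrfs) echo "archsetup-test.conf" ;;
        zfs)   echo "archsetup-test-zfs.conf" ;;
        *)     echo "unknown FS_PROFILE: $profile" >&2; return 1 ;;
    esac
}
```

A runner would call =image_for_profile "${FS_PROFILE:-btrfs}"= once at startup. Rejecting unknown profiles up front avoids silently booting the wrong image.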
** C. Bare-metal runner migration
Mirror the VM runner's fix in =run-test-baremetal.sh=:
- After the first successful SSH to =TARGET_HOST=, call =inject_root_key= (it
authorizes a key over the password session; set =VM_IP=TARGET_HOST=,
=SSH_PORT=22= so the helpers + ssh-config target the real host).
- Replace =run_all_validations= / =validate_all_services= with
=run_testinfra_validation= (now authoritative).
- Everything downstream already routes through =$SSH_KEY_OPT= (the vm-utils
helpers) and the ssh-config, so it survives the hardening.
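To illustrate why the key-based path survives =PermitRootLogin prohibit-password=, here is a sketch of the kind of ssh-config stanza the runner could generate for the bare-metal target. =render_ssh_config=, the =archsetup-target= host alias, and the key path are all hypothetical; the actual generator in =testinfra.sh= may emit different options.

```shell
#!/usr/bin/env bash
# Hypothetical: render an ssh-config stanza for the target host.
# Arguments: host, port (default 22), path to the private key.
render_ssh_config() {
    local host="$1" port="${2:-22}" key="$3"
    cat <<EOF
Host archsetup-target
    HostName ${host}
    Port ${port}
    User root
    IdentityFile ${key}
    IdentitiesOnly yes
    PasswordAuthentication no
EOF
}
```

For the bare-metal case this would be called as =render_ssh_config "$TARGET_HOST" 22 /path/to/key=. With =PasswordAuthentication no= on the client side, a missing key fails immediately instead of falling back to the password login that the hardening disables mid-install.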
** D. Validate
- =make test FS_PROFILE=zfs= → the ZFS-conditional pytest checks now *run*
(not skip): mkinitcpio uses the udev hook, sanoid installed, zfs-scrub timer,
zfs root. Fix any real ZFS-path findings archsetup has.
- Point =run-test-baremetal.sh= at the ZFS VM (or real hardware) → confirm the
key-auth migration carries it through the hardening to a green pytest sweep.
** E. Delete the shell sweep (P5)
Once both runners use =run_testinfra_validation=, delete the dead functions from
=validation.sh=: =run_all_validations=, =validate_all_services=, the ~26
=validate_*= checks, =validate_service*=, =run_full_validation=, and
=validation_pass/fail/warn/skip=.
Keep the live helpers: =ssh_cmd=, =attribute_issue=,
=capture_pre/post_install_state=, =analyze_log_diff=, =categorize_errors=,
=generate_issue_report=, and =VALIDATION_*=.
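Before deleting, a quick search can confirm nothing still calls the sweep entry points. This is a sketch only: =remaining_sweep_callers= is a hypothetical helper, and the pattern covers just the three entry points named above.

```shell
#!/usr/bin/env bash
# Hypothetical pre-deletion check: print every line under a directory that
# still references a shell-sweep entry point. Exits non-zero when none remain.
remaining_sweep_callers() {
    local dir="${1:-.}"
    grep -rnE 'run_all_validations|validate_all_services|run_full_validation' "$dir"
}
```

Run as =remaining_sweep_callers scripts/testing=. The output will also include the definitions in =validation.sh= itself, so the deletion is safe once those are the only hits.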
* Phases
- *P-A* archangel ZFS config (verify archangel ZFS support first).
- *P-B* create-base-vm.sh + run-test.sh + Makefile profile selector; build the
ZFS base image + snapshot.
- *P-C* =make test FS_PROFILE=zfs= green (ZFS-conditional tests run; fix
findings). VM-validatable here.
- *P-D* migrate run-test-baremetal.sh to key auth + Testinfra; validate against
the ZFS VM.
- *P-E* delete the dead shell-sweep functions (the standing P5 follow-up).
* Open questions
1. *Does archangel support a ZFS-root config out of the box?* RESOLVED (yes).
ZFS is archangel's *default* filesystem (=FILESYSTEM=zfs=, validated by
=installer/lib/config.sh:validate_filesystem=), with =NO_ENCRYPT=yes= for an
unattended unencrypted install and a ready =installer/velox-zfs.conf.example=
to model. No archangel work needed.
2. *Two images vs one image + two snapshots?* RESOLVED: two images. ZFS and
btrfs use different on-disk layouts, so separate images are cleaner than
juggling snapshots on one disk. =btrfs= keeps the legacy unsuffixed
=archsetup-base.qcow2=; =zfs= gets =archsetup-base-zfs.qcow2=.
3. *Profile on run-test.sh vs a separate run-test-zfs.sh?* RESOLVED:
=FS_PROFILE= env param on the existing runner + Makefile, no duplicate
harness.
4. *Disk size / RAM for the ZFS VM* — start at the 4G RAM / 50G disk defaults;
bump =VM_RAM= only if the ZFS install OOMs (decide at P-C build time).
5. *Should the bare-metal runner stay at all once a ZFS VM exists*, or does the
ZFS VM profile make it redundant for everything except real-hardware smoke?
Defer until after P-D.
* Design corrections (found during P-A/P-B grounding)
- The "non-ZFS" base is *btrfs*, not ext4 — =archsetup-test.conf= sets
=FILESYSTEM=btrfs=. The profile axis is zfs vs btrfs throughout.
- *No =archsetup-vm-zfs.conf= is needed.* archsetup reads no filesystem key; it
auto-detects ZFS from the live root via =is_zfs_root()= (=findmnt -n -o FSTYPE
/=, archsetup:688). The ZFS branch (sanoid, zfs-scrub timer, mkinitcpio udev
hook, docker zfs storage driver) fires whenever the running root is ZFS. So
only the *archangel* base config and the base *image* differ per profile; the
archsetup run config (=archsetup-vm.conf=) is shared.
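The detection described above can be sketched as follows. It uses the same =findmnt= call cited here, but the split into a separate =root_fstype= helper is an assumption made for testability; archsetup's own =is_zfs_root()= may be written differently.

```shell
#!/usr/bin/env bash
# Print the filesystem type of the running root (the findmnt call cited above).
root_fstype() {
    findmnt -n -o FSTYPE /
}

# True when the running root is ZFS. Sketch only; archsetup's real
# is_zfs_root() may differ in detail.
is_zfs_root() {
    [ "$(root_fstype)" = "zfs" ]
}
```

Because the check reads the live root rather than a config key, the same =archsetup-vm.conf= drives both profiles: booting the ZFS base image is enough to take the ZFS branch.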