archangel/scripts, branch main

Arch Linux installer ISO: ZFS-on-root or BTRFS, doubles as rescue disk
https://git.cjennings.net/archangel/atom?h=main
2026-05-23T01:58:01+00:00

fix(test): run the ZFS-encryption check on the booted system
2026-05-23T01:58:01+00:00 Craig Jennings c@cjennings.net 2026-05-23T01:58:01+00:00 urn:sha1:3165c50fed266fef0b388190296c149c0ae0ee47

The ZFS native-encryption assertion lived in verify_install, which runs in the live ISO before reboot. But archangel exports zroot at the end of the install, so verify_install bails at "ZFS pool not found" and never reaches the check. It was dead code: the encrypted-config tests passed on the reboot path (entering the passphrase at ZFSBootMenu and booting is itself proof), while the explicit aes-256-gcm assertion gave false confidence by never running.

I moved it into verify_reboot_survival, which ssh's into the booted system where zroot is imported, so `zfs get encryption zroot/ROOT` actually returns aes-256-gcm and the assertion fires. Confirmed on a zfs-encrypt VM run: "ZFS encryption (aes-256-gcm) verified on running system."

fix(build): clear stale archzfs from the pacoloco cache too
2026-05-23T01:28:15+00:00 Craig Jennings c@cjennings.net 2026-05-23T01:28:15+00:00 urn:sha1:bed054f46e3b41aae0d599ed7fbc3e1e42d6ddd7

archzfs re-uploads its GitHub release assets under the same filename, so pacoloco keeps serving a zfs-dkms/zfs-utils it cached earlier while pacman fetches a fresh archzfs.db with a new checksum. The two mismatch and pacstrap aborts with "invalid or corrupted package." build.sh already drops the stale packages from the host pacman cache, but it never cleared the pacoloco layer, which the VM test installs route through too, so test-install.sh kept hitting the corruption (four times in one session). build.sh runs as root, so it now clears /var/cache/pacoloco/pkgs/archzfs/zfs-* alongside the host cache, which makes the build-then-test flow self-healing.
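A minimal sketch of the pacoloco cache-clearing step described above. The helper name clear_stale_archzfs and its optional directory argument are hypothetical (build.sh's actual code does not appear in this log); the argument exists only so the function can be tried against a scratch directory instead of the real root-owned cache.

```shell
#!/usr/bin/env bash
# Hypothetical helper: the name and the argument are illustrative, not build.sh's.
# Removes cached zfs-* packages so the next fetch matches the fresh archzfs.db.
clear_stale_archzfs() {
    cache_dir="${1:-/var/cache/pacoloco/pkgs/archzfs}"
    # Nothing to do when pacoloco has not cached any archzfs packages yet.
    [ -d "$cache_dir" ] || return 0
    # -f keeps this quiet (and successful) when the glob matches nothing.
    rm -f -- "$cache_dir"/zfs-*
}
```

Called with no argument, it targets the path named in the commit message and needs root to succeed there.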
The pacoloco cache is root-owned and test-install.sh runs as the user, so it can't clear it unattended. Instead, test-install.sh now recognizes the corruption (is_archzfs_cache_corruption) and prints how to clear it, the way it already names the SSH_PORT override on a port collision. A retry alone won't help since it hits the same cached file, so this fails fast with the hint rather than retrying.

fix(test): fail clearly when the VM forward port is taken
2026-05-23T01:12:14+00:00 Craig Jennings c@cjennings.net 2026-05-23T01:12:14+00:00 urn:sha1:0f8bbc7c1e2c2f6fec0b17753ac0d9c4a3ad4317

A test run launched qemu without first checking the SSH forward port, so a collision with another VM already holding it surfaced only as an opaque "Failed to start VM," with qemu unable to bind and no hint why.

I added a port_in_use check in run_test before the launch: it errors with the port number and the SSH_PORT override to set, records the failure, and moves on. The check lives in run_test, not start_vm, because start_vm runs in a command substitution (vm_pid=$(start_vm ...)) where this harness's non-exiting error() would be captured as the PID instead of failing the run. The pure half, port_listening_in, takes an `ss -tln` snapshot as a string so it's unit-testable.

test: make SSH_PORT overridable in test-install.sh
2026-05-22T23:09:50+00:00 Craig Jennings c@cjennings.net 2026-05-22T23:09:50+00:00 urn:sha1:0a57d75d3947fddd6c6ab62924c52a456f4776b0

The port was hardcoded, so a test run collided with any other VM already forwarding 2222. It now defaults to 2222, so existing invocations are unchanged. `SSH_PORT=2223 scripts/test-install.sh` picks a free port to run alongside another VM.
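The port handling from the two commits above could be sketched roughly as follows. The names SSH_PORT, port_in_use, and port_listening_in come from the commit messages; the argument order, the awk matching rule, and the split between the two functions are assumptions, not the harness's actual code.

```shell
#!/usr/bin/env bash
# Sketch only: signatures and matching rules are assumptions.
SSH_PORT="${SSH_PORT:-2222}"   # default keeps existing invocations unchanged

# port_listening_in PORT SNAPSHOT
# Pure half: succeeds if SNAPSHOT (text captured from `ss -tln`) contains a
# row whose Local Address:Port column (field 4) ends in :PORT.
port_listening_in() {
    printf '%s\n' "$2" | awk -v p="$1" '
        { n = split($4, a, ":"); if (n > 1 && a[n] == p) found = 1 }
        END { exit !found }'
}

# port_in_use PORT
# Impure half: takes the snapshot from the live system.
port_in_use() {
    port_listening_in "$1" "$(ss -tln 2>/dev/null)"
}
```

Because port_listening_in only reads its arguments, a unit test can hand it a canned snapshot and never touch the network, while run_test would call port_in_use "$SSH_PORT" before launching qemu.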
feat(test): retry pacstrap through transient mirror flakes
2026-05-20T14:58:01+00:00 Craig Jennings c@cjennings.net 2026-05-20T14:58:01+00:00 urn:sha1:4ef30e5c84ab22ba1724608009093d6725a1ceda

test-install.sh aborts a whole 5-minute VM run when pacstrap hits a transient mirror blip, and the suite reports a failure indistinguishable from a real install regression. run_test now retries the install up to twice, but only when the in-VM log shows both pacstrap's "Failed to install packages to new root" marker and a download/network indicator. A deterministic failure like "target not found" carries the marker without a network indicator, so it still fails fast. archangel's failure trap exports the pool and unmounts on abort, so each retry re-partitions and re-pacstraps from a clean state.

Wiring the predicate up needed a source-guard so bats can source the harness, which had none. With that in place I unit-covered the pure helpers (is_transient_install_failure, char_to_qemu_key, get_disk_count, get_disk_args) and lifted char_to_qemu_key out of monitor_sendkeys so the QEMU keymap is testable on its own.

The keymap test found a dead branch. The backslash case pattern was '\\', which never matches a lone backslash because bash matches one against '\', so a passphrase containing a backslash would have sent an invalid QEMU keyname instead of "backslash". No test passphrase uses one, so it never bit. I fixed the pattern.

feat(build): route VM-internal pacstrap through host pacoloco
2026-05-19T18:12:56+00:00 Craig Jennings c@cjennings.net 2026-05-19T18:12:56+00:00 urn:sha1:21b745d7634cf8e743020b591df101b439883511

The build-host pacoloco routing from e2eb958 only covered mkarchiso's pacstrap. VMs spawned by scripts/test-install.sh ran their own pacstrap inside the guest, fetching ~600 packages per config from upstream and re-hitting the same archzfs corruption that bites the build host. A full 12-config test-install run exposed 7200+ package downloads to upstream flake.
I added a routing step to run_install() in test-install.sh, after the config file gets SCP'd to the VM and before archangel runs. It detects pacoloco on the host (port 9129, same probe as build.sh's) and rewrites the live system's /etc/pacman.conf over SSH. [core] and [extra] swap their Include lines for Server lines pointing at 10.0.2.2:9129/repo/archlinux/$repo/os/$arch. A preempt [archzfs] block lands ahead of archangel's default insertion.

10.0.2.2 is QEMU's SLIRP default gateway as seen from the guest, so the host's localhost:9129 maps to that address inside the VM. Pacoloco binds 0.0.0.0:9129, reachable from there without firewall changes.

The preempt matters because archangel's install_base checks for an existing [archzfs] block in /etc/pacman.conf and skips its own insertion when one is already there. Writing the pacoloco-routed [archzfs] up front means archangel keeps the routed version. The installed system's $MNTPOINT/etc/pacman.conf isn't touched: it gets upstream URLs like before, since the installed system shouldn't depend on the test host's proxy.

The status message uses a plain echo rather than test-install.sh's info() function. run_install() runs inside a bash -c subshell at line 864 that only exports ssh_cmd and run_install via declare -f. A bare info call there resolves to /usr/bin/info (the GNU info reader) and prints a confusing "No menu item" error. An inline comment in the code records the pitfall.

Verified end-to-end with scripts/test-install.sh single-disk: pacoloco's cache grew from 77MB (post-build) to 953MB (post-VM-install), the VM's pacstrap completed cleanly, and the install verified. Bats: still 181.
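The [core]/[extra] rewrite described in this commit might look roughly like the filter below. It is a sketch under stated assumptions: the function name route_through_pacoloco is hypothetical, the http:// scheme is assumed, and the [archzfs] preempt block is left out because its URL is not given in this log. The real step runs over SSH against the live system's /etc/pacman.conf; this version just transforms stdin to stdout so it can be checked locally.

```shell
#!/usr/bin/env bash
# Sketch only: the function name and the http:// scheme are assumptions.
# Reads a pacman.conf on stdin and writes the pacoloco-routed version to stdout.
route_through_pacoloco() {
    awk -v base="http://10.0.2.2:9129/repo/archlinux" '
        /^\[/ { section = $1 }    # remember which [section] we are in
        (section == "[core]" || section == "[extra]") && /^Include[ \t]*=/ {
            # $repo and $arch stay literal here; pacman expands them itself.
            print "Server = " base "/$repo/os/$arch"
            next
        }
        { print }'
}
```

Sections other than [core] and [extra] pass through unchanged, so an Include under any other repo keeps pointing at its original mirrorlist.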
test(install): exercise zfssnapshot wrapper in VM verification
2026-05-14T23:08:40+00:00 Craig Jennings c@cjennings.net 2026-05-14T23:08:40+00:00 urn:sha1:33579ee72ed97a671a898267555a50fb8411144b

The wrapper had no runtime coverage: bats tests pin pure helpers and arg parsing only, and verify_rollback bypassed it by calling zfs snapshot / zfs rollback directly via SSH. A regression in cmd_create, cmd_rollback, or cmd_delete would only have surfaced in production.

verify_zfssnapshot_wrapper runs after verify_rollback for ZFS configs (no-op for Btrfs) and exercises:

- list confirms @genesis baseline
- create runtime-test: recursive snapshot across all datasets
- echo no | delete --name: confirms the gate aborts (catches the -n vs = regression class)
- echo yes | delete --name: destroys across all datasets, list confirms gone
- create wrapper-rollback + drop sentinel + rollback --name: round-trip restores the sentinel

The function scps the working-tree wrapper to the VM before testing so the run reflects current source rather than what the ISO froze at build time. A regression here fails the test (no warn-only path); it's the wrapper's only runtime check.

feat: consolidate zfssnapshot and zfsrollback into one subcommand-driven script
2026-04-27T23:33:03+00:00 Craig Jennings c@cjennings.net 2026-04-27T23:33:03+00:00 urn:sha1:422d1098cd89beaeed81cc40488252233e2ca0ad

Problem: zfssnapshot and zfsrollback were two separate scripts with overlapping pre-flight checks (zfs / fzf / root) and parallel UX patterns (description sanitization in one, fzf selection in the other). Users had to remember which script was for which operation, and a "list" view meant typing the raw `zfs list -t snapshot` command. There was no path to destroy individual snapshots short of `zfs destroy` directly, which is dangerous without a confirmation flow.

Solution: rewrite zfssnapshot as a single multi-subcommand script (list, create, rollback, delete). Drop installer/zfsrollback.
The new script uses a source-guard at the bottom (`if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then main "$@"; fi`) so bats can source it without triggering the install-time pre-flight checks, matching the pattern in installer/archangel. Pure helpers (sanitize_description, validate_description, format_snapshot_name) get extracted as named functions so they're testable in isolation.

The destructive flows (rollback, delete) keep the explicit "yes" confirmation prompt, the genesis-snapshot warning, and the recursive-rollback-destroys-newer-snapshots warning. Delete uses fzf --multi so the user can pick several snapshot names at once.

Updated build.sh to copy only the consolidated script. Dropped the zfsrollback profiledef permission line. Updated Makefile, README, scripts/sanity-test.sh, and testing-strategy.org to reflect the single-script layout.

Bats: 147 → 168 (+21). Coverage spans sanitize_description (normal / boundary / error), validate_description (alphanumerics, hyphens, underscores accepted; spaces, slashes, shell metacharacters, empty rejected), format_snapshot_name (timestamp + description composition), and main subcommand dispatch (list / create / rollback / delete / help / unknown). Lint clean. The zfs-, fzf-, and arch-chroot-shelling subcommand bodies stay VM-tested per testing-strategy.org.

fix: verify_rollback sentinel must live on the rolled-back dataset
2026-04-22T01:11:15+00:00 Craig Jennings c@cjennings.net 2026-04-22T01:11:15+00:00 urn:sha1:4c5af5c5bf2112c301efc3e4da1cf8812051692a

/root is mounted on a separate dataset (zroot/home/root, created by archangel:create_datasets), but verify_rollback was snapshotting zroot/ROOT/default. The rollback was a no-op for the sentinel file, so the post-rollback existence check failed. The visible symptom was a PASSED test with a soft-failure warning ("Rollback failed - test file not restored" → "Rollback verification had issues") that persisted across ZFS configs for weeks.
Move the sentinel to /etc/archangel-rollback-test. /etc has no child dataset mounted there, so the file lives on zroot/ROOT/default, the dataset actually being snapshotted and rolled back. Defensively single-quote $test_file at the five ssh_cmd call-sites so future path changes (whitespace, special chars) stay correct without touching each call again.

The 2026-04-21 VM run logged "Rollback verified - test file restored" on zfs-mirror-encrypt, confirming the fix.

fix: bump INSTALL_TIMEOUT from 600 to 1800 for kernel 6.18+ DKMS builds
2026-04-13T09:53:01+00:00 Craig Jennings c@cjennings.net 2026-04-13T09:53:01+00:00 urn:sha1:6a63c74e60bd13f84bd4f5f9503f82b5b73ad9df

ZFS DKMS compile + depmod against kernel 6.18.22 in a 4-CPU VM under host load exceeds 10 minutes. With INSTALL_TIMEOUT=600, all 6 ZFS test configs timed out during the DKMS install step after pacstrap. The one ZFS config that passed ('custom-locale', first ZFS config alphabetically) squeaked in just under the deadline.

Bumped to 1800s (30 min). Session notes from 2026-02-12 mention this bump but the change never made it into git.