archangel, branch main

archangel, branch main Arch Linux installer ISO — ZFS-on-root or BTRFS, doubles as rescue disk https://git.cjennings.net/archangel/atom?h=main 2026-05-23T14:49:32+00:00 chore: gitignore emacs backup, autosave, and lockfiles 2026-05-23T14:49:32+00:00 Craig Jennings c@cjennings.net 2026-05-23T14:49:32+00:00 urn:sha1:2536ac6965eadc54e29a1d7269cdf7f98515d689 Batch and interactive emacs edits of the gitignored todo.org leave todo.org~ (and would leave #todo.org# / .#todo.org) in the working tree. Ignore the patterns so they stop showing up in git status. refactor: drop the dead duplicate disk_in_use from common.sh 2026-05-23T14:40:30+00:00 Craig Jennings c@cjennings.net 2026-05-23T14:40:30+00:00 urn:sha1:c82761fe3352294ef644a9b614cc2677f8f2e339 common.sh and disk.sh both defined disk_in_use. archangel sources common.sh first, then disk.sh, so disk.sh's thorough version (mount, active swap, imported zpool, md array) won at runtime everywhere — including list_available_disks, the common.sh function that calls it. common.sh's older mount-and-holders version was dead. I deleted it. list_available_disks now resolves disk_in_use to disk.sh's, which is what already happened at runtime. The disk.sh unit tests cover the surviving version. Suite stays at 245, lint clean. test: cover disk_in_use and network_available failure paths 2026-05-23T09:08:53+00:00 Craig Jennings c@cjennings.net 2026-05-23T09:08:53+00:00 urn:sha1:98fc424f7edb26314ffe124d3bc24549146a06d5 These two boundary functions backed the pre-flight guards from #215 but had no unit coverage of their own. The VM harness exercised them instead. I added 7 bats tests that mock the system commands they query, so the real branching logic runs. test_disk.bats covers disk_in_use across mountpoint, active swap, imported-zpool member, and idle — that's the gate that refuses to wipe an already-mounted disk. test_archangel.bats covers network_available for DNS failure, TCP-connect failure, and success, the check that fails the install before pacstrap. The /proc/mdstat-positive branch and the live probes stay in the VM harness, since neither drives cleanly without writing to /proc or hitting the network. Suite 238 to 245, lint clean. fix(test): run the ZFS-encryption check on the booted system 2026-05-23T01:58:01+00:00 Craig Jennings c@cjennings.net 2026-05-23T01:58:01+00:00 urn:sha1:3165c50fed266fef0b388190296c149c0ae0ee47 The ZFS native-encryption assertion lived in verify_install, which runs in the live ISO before reboot. But archangel exports zroot at the end of the install, so verify_install bails at "ZFS pool not found" and never reaches the check. It was dead code: the encrypted-config tests passed on the reboot path (entering the passphrase at ZFSBootMenu and booting is itself proof), while the explicit aes-256-gcm assertion gave false confidence by never running. I moved it into verify_reboot_survival, which ssh's into the booted system where zroot is imported, so zfs get encryption zroot/ROOT actually returns aes-256-gcm and the assertion fires. Confirmed on a zfs-encrypt VM run: "ZFS encryption (aes-256-gcm) verified on running system." fix(build): clear stale archzfs from the pacoloco cache too 2026-05-23T01:28:15+00:00 Craig Jennings c@cjennings.net 2026-05-23T01:28:15+00:00 urn:sha1:bed054f46e3b41aae0d599ed7fbc3e1e42d6ddd7 archzfs re-uploads its GitHub release assets under the same filename, so pacoloco keeps serving a zfs-dkms/zfs-utils it cached earlier while pacman fetches a fresh archzfs.db with a new checksum. The two mismatch and pacstrap aborts with "invalid or corrupted package." build.sh already drops the stale packages from the host pacman cache, but it never cleared the pacoloco layer, which the VM test installs route through too, so test-install.sh kept hitting the corruption (four times in one session). build.sh runs as root, so it now clears /var/cache/pacoloco/pkgs/archzfs/zfs-* alongside the host cache, which makes the build-then-test flow self-healing. The pacoloco cache is root-owned and test-install.sh runs as the user, so it can't clear it unattended. Instead, test-install.sh now recognizes the corruption (is_archzfs_cache_corruption) and prints how to clear it, the way it already names the SSH_PORT override on a port collision. A retry alone won't help since it hits the same cached file, so this fails fast with the hint rather than retrying. fix(test): fail clearly when the VM forward port is taken 2026-05-23T01:12:14+00:00 Craig Jennings c@cjennings.net 2026-05-23T01:12:14+00:00 urn:sha1:0f8bbc7c1e2c2f6fec0b17753ac0d9c4a3ad4317 A test run launched qemu without first checking the SSH forward port, so a collision with another VM already holding it surfaced only as an opaque "Failed to start VM," with qemu unable to bind and no hint why. I added a port_in_use check in run_test before the launch: it errors with the port number and the SSH_PORT override to set, records the failure, and moves on. The check lives in run_test, not start_vm, because start_vm runs in a command substitution (vm_pid=$(start_vm ...)) where this harness's non-exiting error() would be captured as the PID instead of failing the run. The pure half, port_listening_in, takes an `ss -tln` snapshot as a string so it's unit-testable. test: make SSH_PORT overridable in test-install.sh 2026-05-22T23:09:50+00:00 Craig Jennings c@cjennings.net 2026-05-22T23:09:50+00:00 urn:sha1:0a57d75d3947fddd6c6ab62924c52a456f4776b0 The port was hardcoded, so a test run collided with any other VM already forwarding 2222. It now defaults to 2222, so existing invocations are unchanged. SSH_PORT=2223 scripts/test-install.sh picks a free port to run alongside another VM. feat(install): add pre-flight environment and disk-target validation 2026-05-22T23:03:40+00:00 Craig Jennings c@cjennings.net 2026-05-22T23:03:40+00:00 urn:sha1:b6525a50fabf3aedf41eee70c164519b00d27704 archangel went straight from filesystem selection into a destructive install behind only a root check and a ZFS module load. A missing tool, a BIOS boot, a too-small or in-use disk, or a dead network surfaced as a confusing abort partway through, sometimes after partitioning had already run. Two gates now fail fast. validate_environment runs after filesystem selection, before any disk is touched: it confirms UEFI boot mode and that every required command is present, with the list coming from a new required_commands helper built like pacstrap_packages. validate_install_targets runs after disk selection, before the first wipe: it refuses a target that's mounted, holds active swap, or belongs to an imported pool or md array, rejects disks under 20 GB, and confirms a mirror is reachable via DNS plus a TCP probe (no ICMP, since some networks drop it). I folded the install_failure_cleanup hardening into the same change. It now falls back to lazy unmounts, so a pacstrap-interrupted target with busy bind mounts still releases the pool and unmounts the EFI partition. Without that, the disk-in-use guard would block the very retry the cleanup exists to enable. "Re-run to retry" only holds if the disk is genuinely freed first. The 20 GB floor is decimal on purpose. It reads as the natural minimum and clears a 20 GiB disk image with headroom instead of sitting on the boundary. feat(test): retry pacstrap through transient mirror flakes 2026-05-20T14:58:01+00:00 Craig Jennings c@cjennings.net 2026-05-20T14:58:01+00:00 urn:sha1:4ef30e5c84ab22ba1724608009093d6725a1ceda test-install.sh aborts a whole 5-minute VM run when pacstrap hits a transient mirror blip, and the suite reports a failure indistinguishable from a real install regression. run_test now retries the install up to twice, but only when the in-VM log shows both pacstrap's "Failed to install packages to new root" marker and a download/network indicator. A deterministic failure like "target not found" carries the marker without a network indicator, so it still fails fast. archangel's failure trap exports the pool and unmounts on abort, so each retry re-partitions and re-pacstraps from a clean state. Wiring the predicate up needed a source-guard so bats can source the harness, which had none. With that in place I unit-covered the pure helpers — is_transient_install_failure, char_to_qemu_key, get_disk_count, get_disk_args — and lifted char_to_qemu_key out of monitor_sendkeys so the QEMU keymap is testable on its own. The keymap test found a dead branch. The backslash case pattern was '\\', which never matches a lone backslash because bash matches one against '\', so a passphrase containing a backslash would have sent an invalid QEMU keyname instead of "backslash". No test passphrase uses one, so it never bit. I fixed the pattern. feat(build): route VM-internal pacstrap through host pacoloco 2026-05-19T18:12:56+00:00 Craig Jennings c@cjennings.net 2026-05-19T18:12:56+00:00 urn:sha1:21b745d7634cf8e743020b591df101b439883511 The build-host pacoloco routing from e2eb958 only covered mkarchiso's pacstrap. VMs spawned by scripts/test-install.sh ran their own pacstrap inside the guest, fetching ~600 packages per config from upstream and re-hitting the same archzfs corruption that bites the build host. A full 12-config test-install run exposed 7200+ package downloads to upstream flake. I added a routing step to run_install() in test-install.sh, after the config file gets SCP'd to the VM and before archangel runs. It detects pacoloco on the host (port 9129, same probe as build.sh's) and rewrites the live system's /etc/pacman.conf over SSH. [core] and [extra] swap their Include lines for Server lines pointing at 10.0.2.2:9129/repo/archlinux/$repo/os/$arch. A preempt [archzfs] block lands ahead of archangel's default insertion. 10.0.2.2 is QEMU's SLIRP default gateway as seen from the guest, so the host's localhost:9129 maps to that address inside the VM. Pacoloco binds 0.0.0.0:9129, reachable from there without firewall changes. The preempt matters because archangel's install_base checks for an existing [archzfs] block in /etc/pacman.conf and skips its own insertion when one is already there. Writing the pacoloco-routed [archzfs] up front means archangel keeps the routed version. The installed system's $MNTPOINT/etc/pacman.conf isn't touched: it gets upstream URLs like before, since the installed system shouldn't depend on the test host's proxy. The status message uses a plain echo rather than test-install.sh's info() function. run_install() runs inside a bash -c subshell at line 864 that only exports ssh_cmd and run_install via declare -f. A bare info call there resolves to /usr/bin/info (the GNU info reader) and prints a confusing "No menu item" error. An inline comment in the code records the pitfall. Verified end-to-end with scripts/test-install.sh single-disk: pacoloco's cache grew from 77MB (post-build) to 953MB (post-VM-install), the VM's pacstrap completed cleanly, and the install verified. Bats: still 181.