<feed xmlns='http://www.w3.org/2005/Atom'>
<title>archangel/scripts, branch main</title>
<subtitle>Arch Linux installer ISO — ZFS-on-root or BTRFS, doubles as rescue disk
</subtitle>
<id>https://git.cjennings.net/archangel/atom?h=main</id>
<link rel='self' href='https://git.cjennings.net/archangel/atom?h=main'/>
<link rel='alternate' type='text/html' href='https://git.cjennings.net/archangel/'/>
<updated>2026-05-23T01:58:01+00:00</updated>
<entry>
<title>fix(test): run the ZFS-encryption check on the booted system</title>
<updated>2026-05-23T01:58:01+00:00</updated>
<author>
<name>Craig Jennings</name>
<email>c@cjennings.net</email>
</author>
<published>2026-05-23T01:58:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.cjennings.net/archangel/commit/?id=3165c50fed266fef0b388190296c149c0ae0ee47'/>
<id>urn:sha1:3165c50fed266fef0b388190296c149c0ae0ee47</id>
<content type='text'>
The ZFS native-encryption assertion lived in verify_install, which runs in the live ISO before reboot. But archangel exports zroot at the end of the install, so verify_install bails at "ZFS pool not found" and never reaches the check. It was dead code: the encrypted-config tests passed on the reboot path (entering the passphrase at ZFSBootMenu and booting is itself proof), while the explicit aes-256-gcm assertion gave false confidence by never running.

I moved it into verify_reboot_survival, which connects over SSH to the booted system, where zroot is imported, so zfs get encryption zroot/ROOT actually returns aes-256-gcm and the assertion fires. Confirmed on a zfs-encrypt VM run: "ZFS encryption (aes-256-gcm) verified on running system."
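
A minimal sketch of the comparison behind the assertion, assuming the value is read on the booted system with zfs get -H -o value encryption zroot/ROOT. The helper name encryption_matches is hypothetical and is not taken from test-install.sh.

```shell
# Hypothetical helper, not the harness's own code.
# Succeeds only when the reported value equals the expected cipher.
encryption_matches() {
    local actual="$1"
    local expected="${2:-aes-256-gcm}"
    [ "$actual" = "$expected" ]
}

# On the booted system the value would come from something like:
#   actual="$(zfs get -H -o value encryption zroot/ROOT)"
if encryption_matches "aes-256-gcm"; then
    echo "ZFS encryption (aes-256-gcm) verified"
fi
```

Keeping the comparison separate from the zfs call means it can be unit-tested without a pool.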
</content>
</entry>
<entry>
<title>fix(build): clear stale archzfs from the pacoloco cache too</title>
<updated>2026-05-23T01:28:15+00:00</updated>
<author>
<name>Craig Jennings</name>
<email>c@cjennings.net</email>
</author>
<published>2026-05-23T01:28:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.cjennings.net/archangel/commit/?id=bed054f46e3b41aae0d599ed7fbc3e1e42d6ddd7'/>
<id>urn:sha1:bed054f46e3b41aae0d599ed7fbc3e1e42d6ddd7</id>
<content type='text'>
archzfs re-uploads its GitHub release assets under the same filename, so pacoloco keeps serving a zfs-dkms/zfs-utils it cached earlier while pacman fetches a fresh archzfs.db with a new checksum. The two mismatch and pacstrap aborts with "invalid or corrupted package." build.sh already drops the stale packages from the host pacman cache, but it never cleared the pacoloco layer, which the VM test installs route through too, so test-install.sh kept hitting the corruption (four times in one session).

build.sh runs as root, so it now clears /var/cache/pacoloco/pkgs/archzfs/zfs-* alongside the host cache, which makes the build-then-test flow self-healing. The pacoloco cache is root-owned and test-install.sh runs as the user, so it can't clear it unattended. Instead, test-install.sh now recognizes the corruption (is_archzfs_cache_corruption) and prints how to clear it, the way it already names the SSH_PORT override on a port collision. A retry alone won't help since it hits the same cached file, so this fails fast with the hint rather than retrying.
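
The detection could look roughly like this. It is a sketch only: looks_like_archzfs_corruption is a hypothetical stand-in, the real is_archzfs_cache_corruption may key on different markers, and the sample log line is made up.

```shell
# Hypothetical stand-in for the detection in test-install.sh.
looks_like_archzfs_corruption() {
    local log_text="$1"
    # pacman's wording for a package that fails its integrity check.
    case "$log_text" in
        *"invalid or corrupted package"*) ;;
        *) return 1 ;;
    esac
    # Only flag it when a ZFS package is named (assumed package names).
    case "$log_text" in
        *zfs-dkms*|*zfs-utils*) return 0 ;;
        *) return 1 ;;
    esac
}

# Made-up sample log text, for illustration only.
sample="error: zfs-utils: invalid or corrupted package"
if looks_like_archzfs_corruption "$sample"; then
    echo "stale archzfs package: clear /var/cache/pacoloco/pkgs/archzfs/zfs-* as root"
fi
```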
</content>
</entry>
<entry>
<title>fix(test): fail clearly when the VM forward port is taken</title>
<updated>2026-05-23T01:12:14+00:00</updated>
<author>
<name>Craig Jennings</name>
<email>c@cjennings.net</email>
</author>
<published>2026-05-23T01:12:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.cjennings.net/archangel/commit/?id=0f8bbc7c1e2c2f6fec0b17753ac0d9c4a3ad4317'/>
<id>urn:sha1:0f8bbc7c1e2c2f6fec0b17753ac0d9c4a3ad4317</id>
<content type='text'>
A test run launched qemu without first checking the SSH forward port, so a collision with another VM already holding it surfaced only as an opaque "Failed to start VM," with qemu unable to bind and no hint why. I added a port_in_use check in run_test before the launch: it errors with the port number and the SSH_PORT override to set, records the failure, and moves on.

The check lives in run_test, not start_vm, because start_vm runs in a command substitution (vm_pid=$(start_vm ...)) where this harness's non-exiting error() would be captured as the PID instead of failing the run. The pure half, port_listening_in, takes an `ss -tln` snapshot as a string so it's unit-testable.
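
A sketch of what a string-based port check can look like. The name port_in_snapshot and the awk body are assumptions rather than the harness's port_listening_in, and the snapshot is a hand-written sample in the layout `ss -tln` prints.

```shell
# Hypothetical sketch; the harness's port_listening_in may differ.
# $1: text captured from `ss -tln`   $2: TCP port to look for
port_in_snapshot() {
    local snapshot="$1" port="$2"
    printf '%s\n' "$snapshot" | awk -v port="$port" '
        # Column 4 is Local Address:Port; the port follows the last colon.
        $1 == "LISTEN" {
            n = split($4, parts, ":")
            if (parts[n] == port) found = 1
        }
        END { exit(found ? 0 : 1) }
    '
}

# Hand-written sample snapshot.
snapshot='State  Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0      128          0.0.0.0:22        0.0.0.0:*
LISTEN 0      4096       127.0.0.1:2222      0.0.0.0:*'

if port_in_snapshot "$snapshot" 2222; then
    echo "port 2222 is taken; set SSH_PORT to a free port"
fi
```

Because the function never calls ss itself, a test can feed it any snapshot, including IPv6 rows such as [::]:2222.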
</content>
</entry>
<entry>
<title>test: make SSH_PORT overridable in test-install.sh</title>
<updated>2026-05-22T23:09:50+00:00</updated>
<author>
<name>Craig Jennings</name>
<email>c@cjennings.net</email>
</author>
<published>2026-05-22T23:09:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.cjennings.net/archangel/commit/?id=0a57d75d3947fddd6c6ab62924c52a456f4776b0'/>
<id>urn:sha1:0a57d75d3947fddd6c6ab62924c52a456f4776b0</id>
<content type='text'>
The port was hardcoded, so a test run collided with any other VM already forwarding 2222. SSH_PORT is now read from the environment and defaults to 2222, so existing invocations are unchanged. Running SSH_PORT=2223 scripts/test-install.sh moves the run to port 2223 so it can run alongside another VM, provided that port is free.
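
A minimal sketch of the default-with-override pattern. resolve_ssh_port is a hypothetical helper introduced here so the fallback can be tested, and guest port 22 is an assumption; the script itself may simply assign the variable inline.

```shell
# Hypothetical helper: print the port to use, falling back to 2222.
resolve_ssh_port() {
    # ${1:-2222} expands to 2222 when $1 is unset or empty.
    printf '%s\n' "${1:-2222}"
}

# Keep a caller-supplied SSH_PORT, otherwise use the default.
SSH_PORT="$(resolve_ssh_port "${SSH_PORT:-}")"
echo "forwarding host port $SSH_PORT to guest port 22"
```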
</content>
</entry>
<entry>
<title>feat(test): retry pacstrap through transient mirror flakes</title>
<updated>2026-05-20T14:58:01+00:00</updated>
<author>
<name>Craig Jennings</name>
<email>c@cjennings.net</email>
</author>
<published>2026-05-20T14:58:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.cjennings.net/archangel/commit/?id=4ef30e5c84ab22ba1724608009093d6725a1ceda'/>
<id>urn:sha1:4ef30e5c84ab22ba1724608009093d6725a1ceda</id>
<content type='text'>
test-install.sh aborts a whole 5-minute VM run when pacstrap hits a transient mirror blip, and the suite reports a failure indistinguishable from a real install regression. run_test now retries the install up to twice, but only when the in-VM log shows both pacstrap's "Failed to install packages to new root" marker and a download/network indicator. A deterministic failure like "target not found" carries the marker without a network indicator, so it still fails fast. archangel's failure trap exports the pool and unmounts on abort, so each retry re-partitions and re-pacstraps from a clean state.
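
The two-condition predicate can be sketched as follows. looks_transient is a hypothetical name, and the list of network indicators is an assumption; is_transient_install_failure may match other text.

```shell
# Hypothetical sketch of the two-condition check.
looks_transient() {
    local log_text="$1"
    # Without pacstrap's marker this is not an install failure at all.
    case "$log_text" in
        *"Failed to install packages to new root"*) ;;
        *) return 1 ;;
    esac
    # Assumed download/network indicators.
    case "$log_text" in
        *"failed retrieving file"*|*"Could not resolve host"*|*"Connection timed out"*)
            return 0 ;;
        *) return 1 ;;
    esac
}

# Deterministic failure: marker present, no network indicator.
log='error: target not found: nosuchpkg
==> ERROR: Failed to install packages to new root'
if looks_transient "$log"; then
    echo "retry"
else
    echo "fail fast"
fi
```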

Wiring the predicate up needed a source-guard so bats can source the harness, which had none. With that in place I unit-covered the pure helpers — is_transient_install_failure, char_to_qemu_key, get_disk_count, get_disk_args — and lifted char_to_qemu_key out of monitor_sendkeys so the QEMU keymap is testable on its own.

The keymap test found a dead branch. The backslash case pattern was '\\', which inside single quotes is two literal backslashes, so it never matched a lone backslash; the pattern that matches one is '\'. A passphrase containing a backslash would have sent an invalid QEMU keyname instead of "backslash". No test passphrase uses one, so it never bit. I fixed the pattern.
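
The quoting difference is easy to reproduce outside the harness. Both functions below are hypothetical reductions of the case statement, not the body of char_to_qemu_key.

```shell
# Old pattern: two literal backslashes, so one backslash never matches.
classify_old() {
    case "$1" in
        '\\') echo backslash ;;
        *) echo other ;;
    esac
}

# Fixed pattern: a single literal backslash.
classify_fixed() {
    case "$1" in
        '\') echo backslash ;;
        *) echo other ;;
    esac
}

classify_old '\'      # prints "other"
classify_fixed '\'    # prints "backslash"
```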
</content>
</entry>
<entry>
<title>feat(build): route VM-internal pacstrap through host pacoloco</title>
<updated>2026-05-19T18:12:56+00:00</updated>
<author>
<name>Craig Jennings</name>
<email>c@cjennings.net</email>
</author>
<published>2026-05-19T18:12:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.cjennings.net/archangel/commit/?id=21b745d7634cf8e743020b591df101b439883511'/>
<id>urn:sha1:21b745d7634cf8e743020b591df101b439883511</id>
<content type='text'>
The build-host pacoloco routing from e2eb958 only covered mkarchiso's pacstrap. VMs spawned by scripts/test-install.sh ran their own pacstrap inside the guest, fetching ~600 packages per config from upstream and re-hitting the same archzfs corruption that bites the build host. A full 12-config test-install run exposed 7200+ package downloads to upstream flake.

I added a routing step to run_install() in test-install.sh, after the config file gets SCP'd to the VM and before archangel runs. It detects pacoloco on the host (port 9129, the same probe build.sh uses) and rewrites the live system's /etc/pacman.conf over SSH. [core] and [extra] swap their Include lines for Server lines pointing at 10.0.2.2:9129/repo/archlinux/$repo/os/$arch. A preemptive [archzfs] block is written ahead of archangel's default insertion.
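
The [core] and [extra] swap can be sketched as a stdin-to-stdout filter. route_core_extra is a hypothetical name, the http scheme is an assumption, and the real step may build the file differently.

```shell
# Hypothetical filter: reads a pacman.conf on stdin and writes the
# routed version on stdout.
route_core_extra() {
    # Single quotes keep $repo and $arch literal for pacman to expand.
    local server='Server = http://10.0.2.2:9129/repo/archlinux/$repo/os/$arch'
    # Each range runs from the section header to the next "[" line,
    # so Include lines in other sections are left alone.
    sed -e "/^\[core\]/,/^\[/ s|^Include = .*|$server|" \
        -e "/^\[extra\]/,/^\[/ s|^Include = .*|$server|"
}

# Example: cat /etc/pacman.conf | route_core_extra
```

Writing it as a filter keeps the rewrite testable on a sample file before it is applied over SSH.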

10.0.2.2 is QEMU's SLIRP default gateway as seen from the guest, so the host's localhost:9129 maps to that address inside the VM. Pacoloco binds 0.0.0.0:9129, reachable from there without firewall changes.

The preemptive block matters because archangel's install_base checks for an existing [archzfs] block in /etc/pacman.conf and skips its own insertion when one is already there. Writing the pacoloco-routed [archzfs] up front means archangel keeps the routed version. The installed system's $MNTPOINT/etc/pacman.conf isn't touched: it gets upstream URLs as before, since the installed system shouldn't depend on the test host's proxy.

The status message uses a plain echo rather than test-install.sh's info() function. run_install() runs inside a bash -c subshell at line 864 that only exports ssh_cmd and run_install via declare -f. A bare info call there resolves to /usr/bin/info (the GNU info reader) and prints a confusing "No menu item" error. An inline comment in the code records the pitfall.

Verified end-to-end with scripts/test-install.sh single-disk: pacoloco's cache grew from 77MB (post-build) to 953MB (post-VM-install), the VM's pacstrap completed cleanly, and the install verified. Bats: still 181.
</content>
</entry>
<entry>
<title>test(install): exercise zfssnapshot wrapper in VM verification</title>
<updated>2026-05-14T23:08:40+00:00</updated>
<author>
<name>Craig Jennings</name>
<email>c@cjennings.net</email>
</author>
<published>2026-05-14T23:08:40+00:00</published>
<link rel='alternate' type='text/html' href='https://git.cjennings.net/archangel/commit/?id=33579ee72ed97a671a898267555a50fb8411144b'/>
<id>urn:sha1:33579ee72ed97a671a898267555a50fb8411144b</id>
<content type='text'>
The wrapper had no runtime coverage — bats tests pin pure helpers and arg parsing only, and verify_rollback bypassed it by calling zfs snapshot / zfs rollback directly via SSH. A regression in cmd_create, cmd_rollback, or cmd_delete would only have surfaced in production.

verify_zfssnapshot_wrapper runs after verify_rollback for ZFS configs (no-op for Btrfs) and exercises:
- list confirms @genesis baseline
- create runtime-test — recursive snapshot across all datasets
- echo no | delete --name — confirms the gate aborts (catches the -n vs = regression class)
- echo yes | delete --name — destroys across all datasets, list confirms gone
- create wrapper-rollback + drop sentinel + rollback --name — round-trip restores the sentinel

The function scps the working-tree wrapper to the VM before testing so the run reflects current source rather than what the ISO froze at build time. A regression here fails the test (no warn-only path) — it's the wrapper's only runtime check.
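
The gate that the echo no / echo yes steps exercise can be sketched like this. confirm_yes is a hypothetical reduction, not the wrapper's own prompt code.

```shell
# Hypothetical gate: reads one line from stdin and succeeds only when
# the answer is exactly "yes".
confirm_yes() {
    local answer=""
    read -r answer || true
    [ "$answer" = "yes" ]
}

if echo no | confirm_yes; then
    echo "would delete"
else
    echo "aborted"
fi
```

Piping the answer in, as the verification does, keeps the destructive path scriptable without weakening the prompt.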
</content>
</entry>
<entry>
<title>feat: consolidate zfssnapshot and zfsrollback into one subcommand-driven script</title>
<updated>2026-04-27T23:33:03+00:00</updated>
<author>
<name>Craig Jennings</name>
<email>c@cjennings.net</email>
</author>
<published>2026-04-27T23:33:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.cjennings.net/archangel/commit/?id=422d1098cd89beaeed81cc40488252233e2ca0ad'/>
<id>urn:sha1:422d1098cd89beaeed81cc40488252233e2ca0ad</id>
<content type='text'>
Problem: zfssnapshot and zfsrollback were two separate scripts with overlapping pre-flight checks (zfs / fzf / root) and parallel UX patterns (description sanitization in one, fzf selection in the other). Users had to remember which script was for which operation, and a "list" view meant typing the raw `zfs list -t snapshot` command. There was no path to destroy individual snapshots short of `zfs destroy` directly, which is dangerous without a confirmation flow.

Solution: rewrite zfssnapshot as a single multi-subcommand script (list, create, rollback, delete). Drop installer/zfsrollback. The new script uses a source-guard at the bottom (`if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then main "$@"; fi`) so bats can source it without triggering the install-time pre-flight checks, matching the pattern in installer/archangel.

Pure helpers (sanitize_description, validate_description, format_snapshot_name) get extracted as named functions so they're testable in isolation. The destructive flows (rollback, delete) keep the explicit "yes" confirmation prompt, the genesis-snapshot warning, and the recursive-rollback-destroys-newer-snapshots warning. Delete uses fzf --multi so the user can pick several snapshot names at once.

Updated build.sh to copy only the consolidated script. Dropped the zfsrollback profiledef permission line. Updated Makefile, README, scripts/sanity-test.sh, and testing-strategy.org to reflect the single-script layout.

Bats: 147 → 168 (+21). Coverage spans sanitize_description (normal / boundary / error), validate_description (alphanumerics, hyphens, underscores accepted; spaces, slashes, shell metacharacters, empty rejected), format_snapshot_name (timestamp + description composition), and main subcommand dispatch (list / create / rollback / delete / help / unknown). Lint clean. The zfs-, fzf-, and arch-chroot-shelling subcommand bodies stay VM-tested per testing-strategy.org.
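
For illustration, the acceptance rule described for validate_description can be written as a single case statement. description_is_valid is a hypothetical name and this is a sketch of the rule as stated above, not the script's implementation.

```shell
# Hypothetical sketch: accept only non-empty strings made of
# alphanumerics, hyphens, and underscores.
description_is_valid() {
    case "$1" in
        "") return 1 ;;                    # empty is rejected
        *[!A-Za-z0-9_-]*) return 1 ;;      # any other character is rejected
        *) return 0 ;;
    esac
}

for candidate in "pre-upgrade_1" "pre upgrade" "a/b"; do
    if description_is_valid "$candidate"; then
        echo "accepted: $candidate"
    else
        echo "rejected: $candidate"
    fi
done
```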
</content>
</entry>
<entry>
<title>fix: verify_rollback sentinel must live on the rolled-back dataset</title>
<updated>2026-04-22T01:11:15+00:00</updated>
<author>
<name>Craig Jennings</name>
<email>c@cjennings.net</email>
</author>
<published>2026-04-22T01:11:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.cjennings.net/archangel/commit/?id=4c5af5c5bf2112c301efc3e4da1cf8812051692a'/>
<id>urn:sha1:4c5af5c5bf2112c301efc3e4da1cf8812051692a</id>
<content type='text'>
/root is mounted on a separate dataset (zroot/home/root, created by
archangel:create_datasets), but verify_rollback was snapshotting
zroot/ROOT/default. The rollback was a no-op for the sentinel file,
so the post-rollback existence check failed — the visible symptom
was a PASSED test with a soft-failure warning
("Rollback failed - test file not restored" →
"Rollback verification had issues") that persisted across ZFS
configs for weeks.

Move the sentinel to /etc/archangel-rollback-test. /etc has no child
dataset mounted there, so the file lives on zroot/ROOT/default —
the dataset actually being snapshotted and rolled back.

Defensively single-quote $test_file at the five ssh_cmd call-sites
so future path changes (whitespace, special chars) stay correct
without touching each call again.

The 2026-04-21 VM run logged "Rollback verified - test file restored"
on zfs-mirror-encrypt, confirming the fix.
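
A small sketch of why the single quotes matter once a second shell
parses the command. sh -c stands in for the remote shell that ssh
starts; remote_argc and the path with spaces are hypothetical and
exist only for this illustration.

```shell
# Counts the words a second shell sees in its argument list.
remote_argc() {
    sh -c "set -- $1; echo \$#"
}

test_file="/etc/archangel rollback test"   # made-up path with spaces
remote_argc "$test_file"      # unquoted for the second shell: prints 3
remote_argc "'$test_file'"    # single-quoted for it: prints 1
```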
</content>
</entry>
<entry>
<title>fix: bump INSTALL_TIMEOUT from 600 to 1800 for kernel 6.18+ DKMS builds</title>
<updated>2026-04-13T09:53:01+00:00</updated>
<author>
<name>Craig Jennings</name>
<email>c@cjennings.net</email>
</author>
<published>2026-04-13T09:53:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.cjennings.net/archangel/commit/?id=6a63c74e60bd13f84bd4f5f9503f82b5b73ad9df'/>
<id>urn:sha1:6a63c74e60bd13f84bd4f5f9503f82b5b73ad9df</id>
<content type='text'>
ZFS DKMS compile + depmod against kernel 6.18.22 in a 4-CPU VM under
host load exceeds 10 minutes. With INSTALL_TIMEOUT=600, all 6 ZFS test
configs timed out during the DKMS install step after pacstrap. The one
ZFS config that passed ('custom-locale', first ZFS config alphabetically)
squeaked in just under the deadline.

Bumped to 1800s (30 min). Session notes from 2026-02-12 mention this
bump but the change never made it into git.
</content>
</entry>
</feed>
