archangel - Arch Linux installer ISO — ZFS-on-root or BTRFS, doubles as rescue disk

	Commit message (Collapse)	Author	Age	Files	Lines
*	feat: consolidate zfssnapshot and zfsrollback into one subcommand-driven script	Craig Jennings	2026-04-27	1	-0/+153
\| \| \| \| \| \| \| \| \| \| \| \|	Problem: zfssnapshot and zfsrollback were two separate scripts with overlapping pre-flight checks (zfs / fzf / root) and parallel UX patterns (description sanitization in one, fzf selection in the other). Users had to remember which script was for which operation, and a "list" view meant typing the raw `zfs list -t snapshot` command. There was no path to destroy individual snapshots short of `zfs destroy` directly, which is dangerous without a confirmation flow. Solution: rewrite zfssnapshot as a single multi-subcommand script (list, create, rollback, delete). Drop installer/zfsrollback. The new script uses a source-guard at the bottom (`if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then main "$@"; fi`) so bats can source it without triggering the install-time pre-flight checks, matching the pattern in installer/archangel. Pure helpers (sanitize_description, validate_description, format_snapshot_name) get extracted as named functions so they're testable in isolation. The destructive flows (rollback, delete) keep the explicit "yes" confirmation prompt, the genesis-snapshot warning, and the recursive-rollback-destroys-newer-snapshots warning. Delete uses fzf --multi so the user can pick several snapshot names at once. Updated build.sh to copy only the consolidated script. Dropped the zfsrollback profiledef permission line. Updated Makefile, README, scripts/sanity-test.sh, and testing-strategy.org to reflect the single-script layout. Bats: 147 → 168 (+21). Coverage spans sanitize_description (normal / boundary / error), validate_description (alphanumerics, hyphens, underscores accepted; spaces, slashes, shell metacharacters, empty rejected), format_snapshot_name (timestamp + description composition), and main subcommand dispatch (list / create / rollback / delete / help / unknown). Lint clean. The zfs-, fzf-, and arch-chroot-shelling subcommand bodies stay VM-tested per testing-strategy.org.
*	refactor: extract MNTPOINT constant for the install chroot mount point	Craig Jennings	2026-04-27	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \|	Last on the tech-debt drain. The installer hardcoded /mnt at 50+ sites: pacstrap, arch-chroot, mount/umount, fstab writes, and every host-side write into the chroot's /etc, /usr, /var, /boot, /tmp. Same magic-string smell as /mnt/efi but at much larger scale. Add MNTPOINT="/mnt" to lib/common.sh next to EFI_DIR. Replace literal /mnt/... with $MNTPOINT/... across installer/archangel, installer/lib/btrfs.sh, and installer/lib/common.sh. Replace bare /mnt (mount target, arch-chroot root, umount target, install_dropin parameter) with $MNTPOINT. EFI_DIR's own definition becomes EFI_DIR="$MNTPOINT/efi" for the natural composition. Folded in the related ticket: /mnt${chroot_efi_dir} in btrfs.sh:install_grub_all_efi becomes ${MNTPOINT}${chroot_efi_dir}. Was filed as a separate item but the ticket said it should ship with the MNTPOINT extraction, since the composition pattern is unusual and easy to miss in a global sed. Three /mnt references kept literal in comments where the comment describes the string concept rather than the mount point ("Remove /mnt prefix - config is used inside chroot where root is /", etc.). Substituting to $MNTPOINT in those comments would obscure the documentation. Bats: 146 → 147. One new test in test_common.bats pins MNTPOINT="/mnt". Lint clean (one shellcheck SC2295 warning fixed by quoting the parameter expansion: ${isp_firmware#"$MNTPOINT"}). VM verification deferred to a single full make test-install run after all three tech-debt commits land.
*	refactor: verify GRUB_CMDLINE_LINUX seds via prepend_grub_cmdline_linux helper	Craig Jennings	2026-04-27	1	-0/+64
\| \| \| \| \| \| \| \| \| \| \| \|	Audited the ~10 silent sed -i sites in the installer against the verification-after pattern that landed for sshd_config last session. Triaged each by failure mode. The two GRUB_CMDLINE_LINUX seds in lib/btrfs.sh have a real silent-failure risk. If /etc/default/grub is missing or malformed and the sed pattern doesn't match, nothing happens. The kernel boots without cryptdevice=. The system can't unlock LUKS at boot. Added prepend_grub_cmdline_linux to lib/common.sh. Same shape as enable_sshd_root_login (sed, then grep, then error if the line wasn't modified). Replaced the two inline seds with helper calls. The HOOKS= seds in installer/archangel and lib/btrfs.sh (six total) don't need verification. A missing HOOKS= line makes mkinitcpio -P fail loudly downstream, so silent-replace failure can't reach a booted system. Added a one-line audit-rationale comment at each of the three locations so the next reader doesn't re-litigate the decision. The FILES= sed at lib/btrfs.sh:213 already self-heals via a sed-then-grep-then-append pattern, so no behavior change there. Filed a separate follow-up to lift that pattern into a named helper for clarity. Bats: 142 → 146. Four new tests in test_common.bats cover normal (empty cmdline, existing cmdline preserved, other lines preserved) and error (missing GRUB_CMDLINE_LINUX line). Lint clean.
*	refactor: consolidate installer defaults and FILESYSTEM validation into ↵	Craig Jennings	2026-04-27	2	-39/+55
\| \| \| \| \| \| \| \| \| \| \| \|	config.sh The installer had three sites touching FILESYSTEM: a top-level default in the monolith, a re-default block in gather_input, and a runtime validation block also in gather_input. The same scattering existed for LOCALE, KEYMAP, ENABLE_SSH, and NO_ENCRYPT. A future contributor changing one site wouldn't have known the other two existed. Move all five defaults into the lib/config.sh declarations so config.sh is the single source of truth. Add validate_filesystem() in lib/config.sh and call it from main() between check_config and gather_input, so a typo in a config file's FILESYSTEM= fails fast before any install action runs. The behavior change is stricter. An empty FILESYSTEM in a config file used to be silently defaulted to zfs, now it errors. Interactive mode is unaffected. select_filesystem still controls the value and already errored on cancellation. Bats: 140 → 142. Five tests added in test_config.bats for the defaults pinning and validate_filesystem coverage. Three removed from test_archangel.bats for behavior that moved out of gather_input. Lint clean.
*	refactor: collapse sshd_config seds into enable_sshd_root_login	Craig Jennings	2026-04-26	1	-0/+73
\| \| \| \| \| \| \| \| \| \| \| \|	The two sed -i invocations in configure_ssh worked on stock Arch sshd_config but had a real silent-failure mode. If neither the commented (#PermitRootLogin) nor the uncommented form was present, both seds did nothing and the install shipped without root SSH. The user discovered it at first ssh attempt, not at install time. The second sed was also redundant. By the time it ran, the first sed had produced a line matching the second sed's pattern. The new enable_sshd_root_login helper in lib/common.sh combines both substitutions into one sed -i -e ..., then verifies PermitRootLogin yes is present in the file. If the verification fails, it calls error rather than silently appending. Silent appending would mask a corrupted starting file, which is exactly the failure mode worth flagging loudly. The helper takes the config path as an argument so the bats tests in commit 7486abb can run unprivileged against tempfiles. configure_ssh passes /mnt/etc/ssh/sshd_config and is now a single call instead of two seds. Verified: bats 135 → 140 (+5 covering normal/boundary/error). Lint clean. Helper smoke-tested against current Arch sshd_config. The loud-error path can't be exercised against the live default but is covered by the bats error case. Filed as a follow-up :techdebt: item: ~10 other sed -i sites in installer/archangel and lib/btrfs.sh follow the same silent-replace pattern. The FILES= site for LUKS is the worst (silent failure means LUKS prompts on every boot). Triage each per this same recipe in a future session.
*	refactor: extract EFI_DIR constant for the install-time EFI mount point	Craig Jennings	2026-04-26	1	-0/+8
\| \| \| \| \| \| \| \| \| \|	The literal /mnt/efi appeared at 17 sites across installer/archangel and installer/lib/btrfs.sh. Renaming it (or pointing tests at a different mount) meant touching every site and risking incomplete sweeps. One canonical name in installer/lib/common.sh now backs every reference. EFI_DIR has no trailing slash so the three expansion patterns in the codebase compose cleanly. Bare ($EFI_DIR), sub-path ($EFI_DIR/EFI/ZBM), and the index-suffix used by install_grub_all_efi for secondary EFI mounts (${EFI_DIR}${i}). The sync_efi_partitions staging path also moves from the literal /mnt/efi_sync to ${EFI_DIR}_sync, so it follows EFI_DIR if anyone ever changes the base. Two follow-ups filed as separate :techdebt: items. MNTPOINT=/mnt extraction across the 50+ /mnt/... sites (pacstrap, arch-chroot, fstab writes), and the related /mnt${chroot_efi_dir} composition pattern at btrfs.sh:681-682. Both ship together when MNTPOINT lands. Verified: bats 134 → 135 (+1 pinning EFI_DIR=/mnt/efi). Lint clean. All four expansion patterns smoke-tested at runtime and produce the original literal byte-for-byte. VM run skipped, pure constant substitution with zero behavior change.
*	refactor: unify partition_disks across ZFS and Btrfs install paths	Craig Jennings	2026-04-26	1	-44/+146
\| \| \| \| \| \| \| \| \| \| \| \|	The monolith's partition_disks() at installer/archangel was ZFS-only and silently shadowed lib/disk.sh:partition_disks(), which had been dead code since the Btrfs install path was added. install_btrfs was assembling partitioning manually via partition_disk (singular) plus a separate format_efi_partitions call. Two parallel implementations meant fixes had to land in two places and the lib version drifted with no visible warning. The unified partition_disks now lives in lib/disk.sh. It reads SELECTED_DISKS, dispatches the per-disk layout on FILESYSTEM (BF00 for ZFS, 8300 for Btrfs), populates EFI_PARTS + ROOT_PARTS, and formats each EFI partition with EFI0, EFI1, ... labels. Folded in two pre-existing divergences while consolidating. wipefs -af now runs on every disk, not just the ZFS path. The Btrfs path was missing this defense against non-GPT signatures (LVM, mdadm, ext) that sgdisk --zap-all alone won't touch. EFI labels standardized on EFI0, EFI1, ... across both paths. The lib version was producing asymmetric EFI / EFI2 labels. No consumer in the repo references the labels after format, so that change is cosmetic. ZFS_PARTS renamed to ROOT_PARTS for symmetry with get_root_partition. Deleted four orphaned helpers: format_efi, format_efi_partitions (only caller was the collapsed install_btrfs section), get_efi_partitions, get_root_partitions (test-only callers after install_btrfs simplified). Verified end to end: bats 134/134, make test-install passing 12/12 configs across both install paths.
*	fix: clean up partial installs via ERR/INT/TERM trap	Craig Jennings	2026-04-26	1	-0/+105
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Failed installs left the system in an inconsistent state: /mnt mounted, the zpool imported, possibly LUKS containers open. The user had to manually unmount, export the pool, and clean up partitions before re-running the installer. The existing trap at the top of archangel was `trap 'error "Installation interrupted!"' INT TERM`. It just printed a message and exited. There was no ERR trap, so `set -e`-aborted commands ran no cleanup either. I added `install_failure_cleanup` next to the existing `cleanup`. It captures `$?` first, disarms the trap to prevent recursion, clears sensitive variables (ROOT_PASSWORD, ZFS_PASSPHRASE, LUKS_PASSPHRASE), and dispatches on FILESYSTEM: - ZFS path: unmount /mnt/efi, recursive umount /mnt, export the pool (with `-f` fallback) if it's still imported. - Btrfs path: unmount /mnt/efi, call the existing btrfs_cleanup and btrfs_close_encryption helpers. All cleanup steps swallow their own errors with `\|\| true` since partial state is expected when this fires mid-install. `install_zfs` and `install_btrfs` now both arm the trap as their first action and disarm it just before the success-path cleanup. The disarm matters because the success cleanup calls `zpool export` (or btrfs_close_encryption) directly, and those can produce non-zero exit codes that we don't want to interpret as "installation failed". The note in notes.org described this as "install_zfs has no mid-step recovery." The framing was off. Both paths were exposed: install_btrfs's `btrfs_cleanup` only runs on the success path, same as `cleanup` for ZFS. Both paths now have the same recovery shape. Added 4 bats tests for `install_failure_cleanup` that mock the system tools (umount, zpool, btrfs_cleanup, btrfs_close_encryption) via function override and track invocations through a CALLS array. Array assignment isn't affected by the production code's `>/dev/null 2>&1` redirects on `zpool list`, so we capture the call regardless of where the mock's stdout would have gone. Verified end-to-end on the dev box: sourced archangel, set FILESYSTEM=zfs, armed the trap, ran `false` to trigger `set -e`. The trap fired with exit code 1, dispatched to the ZFS cleanup path, called `umount /mnt/efi` and `umount -R /mnt`, checked `zpool list` (returned non-zero since no pool exists on the dev box), skipped the export, and exited via `error`. No behavior change on the success path. The existing `cleanup` and `btrfs_cleanup` stay unchanged.
*	test: expand bats coverage across installer modules	Craig Jennings	2026-04-26	4	-0/+449
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Added unit tests for `disk.sh`, `btrfs.sh`, the archangel monolith's `gather_input` unattended branch, and filled gap cases in `config.sh`. The suite grew from 71 to 110 tests. `installer/lib/disk.sh` was completely uncovered. New `tests/unit/test_disk.bats` covers the four pure partition-path helpers (`get_efi_partition`, `get_root_partition`, `get_efi_partitions`, `get_root_partitions`) across SATA, virtio, and NVMe inputs, mixed arrays, and the empty-input behavior. Side-effecting functions in the same file (sgdisk, mkfs.fat, partprobe, and fzf wrappers) stay deliberately VM-tested. `installer/lib/btrfs.sh` had no bats coverage. New `tests/unit/test_btrfs.bats` covers `get_luks_devices`, the only pure helper in the file. It pins the asymmetric naming convention where the first device gets the bare `LUKS_MAPPER_NAME` and subsequent devices append the index. The archangel monolith was un-source-able for tests because its top-level code created a /tmp log file and redirected stdout via `exec > >(tee...)`, plus called `main "$@"` unconditionally at the bottom. I extracted the logging setup into an `init_logging` function called from `main`, and wrapped the main call in a `[[ "${BASH_SOURCE[0]}" == "${0}" ]]` guard. Sourcing the script now loads function definitions silently, with no log file and no banner. Running it directly works exactly as before. Verified both paths. That refactor unlocks `tests/unit/test_archangel.bats`, which covers `gather_input` in unattended mode. Required-field validation for HOSTNAME, TIMEZONE, ROOT_PASSWORD, and DISKS. Optional-field defaulting (FILESYSTEM to zfs, LOCALE to en_US.UTF-8, KEYMAP to us, ENABLE_SSH to yes). Filesystem-specific encryption checks (ZFS_PASSPHRASE required when not NO_ENCRYPT, same for LUKS_PASSPHRASE on Btrfs). Filesystem validity. RAID_LEVEL defaulting for multi-disk installs. The interactive branch stays out of scope per the testing-strategy policy. `tests/unit/test_config.bats` got five gap tests: `check_config` when CONFIG_FILE is set, `validate_config` against a non-block-device entry (e.g. /dev/null) and a missing path, and `parse_args` accepting `--color` and `--config-file` together in either order. `testing-strategy.org` got an expanded "What bats does NOT cover" section. The doc previously named six tools (mkfs, cryptsetup, zpool create, pacstrap, arch-chroot, grub-install). The new list adds sgdisk, partprobe, blkid, mkfs.fat, mkfs.btrfs, snapper, efibootmgr, mount, umount, findmnt, mountpoint, and fzf. It also names the conditions (root needed, real /dev or /sys state) that make a function VM-only. The coverage table at the top now lists the three new test files. No behavior change in production code. The init_logging extraction preserves the existing log path and banner format byte-for-byte.
*	refactor: extract get_raid_level fzf preview into raid.sh helper	Craig Jennings	2026-04-26	1	-0/+69
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	`get_raid_level()` carried a 98-line inline fzf `--preview` shell snippet that contained five nearly-parallel `case` branches emitting per-level preview text, plus three exported variables (`RAID_DISK_COUNT`, `RAID_TOTAL_GB`, `RAID_SMALLEST_GB`) just to pass values into the fzf preview subshell. The inline shell-in-shell had no syntax highlighting, no shellcheck on the inner snippet, and any edit to preview copy meant editing inside a single-quoted argument. I extracted the per-level text into a new `raid_preview(level, disk_count, total_gb, smallest_gb)` helper in `lib/raid.sh`. It reuses the existing `raid_fault_tolerance` and `raid_usable_bytes` primitives for the data lines instead of redoing the arithmetic inline. That keeps the math in one place. The fzf `--preview` argument is now a one-liner that calls `raid_preview` with the sizing values, and the env-var exports are gone. `export -f raid_preview raid_fault_tolerance raid_usable_bytes` makes the functions visible in fzf's preview subshell. I verified this against a fresh `bash -c` subshell, which is what fzf spawns internally. `get_raid_level()` shrinks from 144 to 49 lines. Preview text is now bats-tested. Added 8 unit tests across the 5 RAID levels (headline, fault-tolerance line, computed usable space), mixed-size handling for mirror (smallest disk, not average), unknown level returning 1 with empty output, and a sanity loop confirming every valid level produces non-empty output. No behavior change. The preview pane shows the same text, the same level options, the same selected output. The pure logic in `lib/raid.sh` is unchanged.
*	fix: fail fast on missing ZFSBootMenu efibootmgr entry	Craig Jennings	2026-04-26	1	-0/+78
\| \| \| \| \| \| \| \| \| \|	The post-bootloader boot-order step in `configure_zfsbootmenu` parsed `efibootmgr` output through a `grep \| head \| grep -oP` chain with no null guards. If any link returned empty (the entry wasn't created, the label was different, or efibootmgr itself failed), the surrounding `if [[ -n "$bootnum" ]]` silently skipped, the install reported success, and the user rebooted into a machine that wouldn't boot ZFSBootMenu by default. I replaced the chain with two pure helpers in `lib/common.sh`, `parse_efibootmgr_entry` and `parse_efibootmgr_bootorder`. The caller in `archangel` invokes them with explicit `\|\| error` guards on each parse stage. The helpers capture `efibootmgr` output once and reuse it (it was called twice before). The same hardening covers the BootOrder lookup at the adjacent line. It used to rely on the now-removed `bootnum` guard for safety. The helpers are stdin-driven and use bash regex, so they're easy to test in bats without exercising the real efibootmgr binary. Added 9 unit tests across normal cases, hex-character boot numbers, multi-match selection, missing label, missing BootOrder line, empty input, and an empty label argument. The empty-label case would otherwise falsely match `BootCurrent` via the hex regex, capturing "C". The helper now guards it explicitly. Verified manually against real efibootmgr output (GRUB entry at Boot0001, BootOrder 0006,0001,2001,2002,2003). Both helpers parsed correctly. VM integration not re-run for this small post-bootloader change. The next scheduled `make test-install` exercises the green path.
*	feat: PrivateTmp=yes drop-in for systemd-tmpfiles on ZFS-root	Craig Jennings	2026-04-21	1	-0/+68
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	On ZFS-on-root, statx() across sibling services' /var/tmp/systemd-private-*/tmp mounts returns errno 132 (ENOTNAM). This produces 10-30 journal errors per boot and causes systemd-tmpfiles-clean.service to fail every periodic run (exit 73 / CANTCREAT). Running tmpfiles inside its own mount namespace avoids traversing sibling private-tmp paths. install_zfs() now calls configure_tmpfiles_private_tmp() between configure_zfs_tools and sync_efi_partitions, so the genesis snapshot captures the drop-ins. Btrfs path is untouched — errno 132 is ZFS-specific. The drop-in file-writing is factored into install_dropin() in lib/common.sh (service, name, root; body from stdin). Six bats tests exercise path, content, directory permissions, idempotent overwrite, empty content, and special-character preservation. Full root-cause write-up and verification steps in docs/zfs-tmpfiles-private-tmp-fix.md.
*	refactor: merge install_base and install_base_btrfs	Craig Jennings	2026-04-13	1	-0/+63
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Extract the pacstrap package list into pacstrap_packages(filesystem) in lib/common.sh (common + filesystem-specific). install_base() now dispatches on FILESYSTEM for both the archzfs-repo-append and the package list. install_base_btrfs() deleted; install_btrfs() call site updated to invoke install_base. Old: 49 + 38 lines of ~95% copy-paste. New: 32 lines + a 20-line pure helper. 7 bats tests cover: zfs has zfs-dkms/zfs-utils, btrfs has btrfs-progs + grub + grub-btrfs + snapper + snap-pac, each flavor excludes the other's specifics, common packages are in both, unknown filesystem returns status 1, output is one-per-line. make test: 65/65.
*	refactor: unify get_{luks,zfs}_passphrase and get_root_password	Craig Jennings	2026-04-13	1	-0/+29
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Extract the prompt/confirm/min-length loop into prompt_password() in lib/common.sh using a nameref for the output variable, so UI output stays on the terminal (no command-substitution capture) and the three callers collapse from ~30 lines each to a single helper call. - get_luks_passphrase() — min 8 chars - get_zfs_passphrase() — min 8 chars - get_root_password() — no min (was unchecked before; preserved) 5 bats tests added: match+min-ok path, length-retry loop, mismatch-retry loop, min_len=0 disables check, empty passphrase when min_len=0. make test: 58/58.
*	refactor: extract pure RAID logic to lib/raid.sh with bats coverage	Craig Jennings	2026-04-12	1	-0/+199
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Peel the testable pieces of get_raid_level() out of the 1600-line installer monolith into installer/lib/raid.sh: - raid_valid_levels_for_count(count) — replaces the inline option-list builder in get_raid_level() - raid_is_valid(level, count) — useful for unattended-config validation - raid_usable_bytes(level, count, smallest, total) — usable-space math - raid_fault_tolerance(level, count) — max tolerable disk failures archangel now sources lib/raid.sh and uses raid_valid_levels_for_count for the fzf option list. Fzf preview subshell still inlines its own usable-bytes arithmetic (calling exported lib functions across preview subshells is fragile; left for a later pass). 30 bats tests in tests/unit/test_raid.bats cover the full enumeration table, every valid/invalid level-vs-count combo from 2 to 5 disks, mixed-size mirror, and unknown-level error paths. make test: 53/53.
*	test: add bats unit tests for common.sh and config.sh	Craig Jennings	2026-04-12	2	-0/+199
	23 bats tests covering the pure logic in installer/lib/common.sh (command_exists, require_command, info/warn/error, enable_color, require_root, log) and installer/lib/config.sh (parse_args, load_config, validate_config, check_config). Makefile adds a 'bats' target; 'test' now runs lint + bats (VM integration tests remain under test-install).