Diffstat (limited to 'docs')
| -rw-r--r-- | docs/2026-01-22-mkinitcpio-config-boot-failure.org | 159 |
| -rw-r--r-- | docs/2026-01-22-ratio-boot-fix-session.org | 175 |
2 files changed, 334 insertions, 0 deletions
diff --git a/docs/2026-01-22-mkinitcpio-config-boot-failure.org b/docs/2026-01-22-mkinitcpio-config-boot-failure.org
new file mode 100644
index 0000000..ba5bc72
--- /dev/null
+++ b/docs/2026-01-22-mkinitcpio-config-boot-failure.org
@@ -0,0 +1,159 @@
+#+TITLE: install-archzfs leaves broken mkinitcpio configuration
+#+DATE: 2026-01-22
+
+* Problem Summary
+
+After installing Arch Linux with ZFS via install-archzfs, the system has incorrect mkinitcpio configuration that can cause boot failures. The configuration issues are latent - the system may boot initially but will fail after any mkinitcpio regeneration (kernel updates, manual rebuilds, etc.).
+
+* Root Cause
+
+The install-archzfs script does not properly configure mkinitcpio for a ZFS boot environment. Three issues were identified:
+
+** Issue 1: Wrong HOOKS in mkinitcpio.conf
+
+The installed system had:
+#+begin_example
+HOOKS=(base systemd autodetect microcode modconf kms keyboard keymap sd-vconsole block filesystems fsck)
+#+end_example
+
+This is wrong for ZFS because:
+- Uses =systemd= init hook, but ZFS hook is busybox-based and incompatible with systemd init
+- Missing =zfs= hook entirely
+- Has =fsck= hook which is unnecessary/wrong for ZFS
+
+Correct HOOKS for ZFS:
+#+begin_example
+HOOKS=(base udev autodetect microcode modconf kms keyboard keymap consolefont block zfs filesystems)
+#+end_example
+
+** Issue 2: Leftover archiso.conf drop-in
+
+The file =/etc/mkinitcpio.conf.d/archiso.conf= was left over from the live ISO:
+#+begin_example
+HOOKS=(base udev microcode modconf kms memdisk archiso archiso_loop_mnt archiso_pxe_common archiso_pxe_nbd archiso_pxe_http archiso_pxe_nfs block filesystems keyboard)
+COMPRESSION="xz"
+COMPRESSION_OPTIONS=(-9e)
+#+end_example
+
+This drop-in OVERRIDES the HOOKS setting in mkinitcpio.conf, so even if mkinitcpio.conf were correct, this file would break it.
+
+** Issue 3: Wrong preset file
+
+The file =/etc/mkinitcpio.d/linux-lts.preset= contained archiso-specific configuration:
+#+begin_example
+# mkinitcpio preset file for the 'linux-lts' package on archiso
+
+PRESETS=('archiso')
+
+ALL_kver='/boot/vmlinuz-linux-lts'
+archiso_config='/etc/mkinitcpio.conf.d/archiso.conf'
+
+archiso_image="/boot/initramfs-linux-lts.img"
+#+end_example
+
+Should be:
+#+begin_example
+# mkinitcpio preset file for linux-lts
+
+PRESETS=(default fallback)
+
+ALL_kver="/boot/vmlinuz-linux-lts"
+
+default_image="/boot/initramfs-linux-lts.img"
+
+fallback_image="/boot/initramfs-linux-lts-fallback.img"
+fallback_options="-S autodetect"
+#+end_example
+
+* How This Manifests
+
+1. Fresh install appears to work (initramfs built during install has ZFS support somehow)
+2. System boots fine initially
+3. Kernel update or manual =mkinitcpio -P= rebuilds initramfs
+4. New initramfs lacks ZFS support due to wrong config
+5. Next reboot fails with "cannot import pool" or "failed to mount /sysroot"
+
+* Fix Required in install-archzfs
+
+The script needs to, after arch-chroot setup:
+
+1. *Set correct mkinitcpio.conf HOOKS*:
+   #+begin_src bash
+   sed -i 's/^HOOKS=.*/HOOKS=(base udev autodetect microcode modconf kms keyboard keymap consolefont block zfs filesystems)/' /mnt/etc/mkinitcpio.conf
+   #+end_src
+
+2. *Remove archiso drop-in*:
+   #+begin_src bash
+   rm -f /mnt/etc/mkinitcpio.conf.d/archiso.conf
+   #+end_src
+
+3. *Create proper preset file*:
+   #+begin_src bash
+   cat > /mnt/etc/mkinitcpio.d/linux-lts.preset << 'EOF'
+   # mkinitcpio preset file for linux-lts
+
+   PRESETS=(default fallback)
+
+   ALL_kver="/boot/vmlinuz-linux-lts"
+
+   default_image="/boot/initramfs-linux-lts.img"
+
+   fallback_image="/boot/initramfs-linux-lts-fallback.img"
+   fallback_options="-S autodetect"
+   EOF
+   #+end_src
+
+4. *Rebuild initramfs after fixing config*:
+   #+begin_src bash
+   arch-chroot /mnt mkinitcpio -P
+   #+end_src
+
+* Recovery Procedure (for affected systems)
+
+Boot from archzfs live ISO, then:
+
+#+begin_src bash
+# Import the pool under /mnt and mount the installed system
+zpool import -f -R /mnt zroot
+zfs mount zroot/ROOT/default
+mount /dev/nvme0n1p1 /mnt/boot # adjust device as needed
+
+# Fix mkinitcpio.conf
+sed -i 's/^HOOKS=.*/HOOKS=(base udev autodetect microcode modconf kms keyboard keymap consolefont block zfs filesystems)/' /mnt/etc/mkinitcpio.conf
+
+# Remove archiso drop-in
+rm -f /mnt/etc/mkinitcpio.conf.d/archiso.conf
+
+# Fix preset (adjust for your kernel: linux, linux-lts, linux-zen, etc.)
+cat > /mnt/etc/mkinitcpio.d/linux-lts.preset << 'EOF'
+PRESETS=(default fallback)
+ALL_kver="/boot/vmlinuz-linux-lts"
+default_image="/boot/initramfs-linux-lts.img"
+fallback_image="/boot/initramfs-linux-lts-fallback.img"
+fallback_options="-S autodetect"
+EOF
+
+# Mount system directories for chroot
+mount --rbind /dev /mnt/dev
+mount --rbind /sys /mnt/sys
+mount --rbind /proc /mnt/proc
+mount --rbind /run /mnt/run
+
+# Rebuild initramfs inside the installed system
+chroot /mnt mkinitcpio -P
+
+# Reboot
+reboot
+#+end_src
+
+* Machine Details (ratio)
+
+- Two NVMe drives in ZFS mirror (nvme0n1, nvme1n1)
+- Pool: zroot
+- Root dataset: zroot/ROOT/default
+- Kernel: linux-lts 6.12.66-1
+- Boot partition: /dev/nvme0n1p1 (FAT32, mounted at /boot)
+
+* Related Information
+
+The immediate trigger for discovering this was a system freeze during mkinitcpio regeneration. That freeze was caused by the AMD GPU VPE power gating bug (separate issue - see archsetup NOTES.org for details). However, the system's inability to boot afterward exposed these latent mkinitcpio configuration problems.
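The three misconfigurations described in the first document (HOOKS without =zfs=, the leftover archiso drop-in, and an archiso preset) can be detected before a rebuild ever happens. The following is a minimal sketch, not part of install-archzfs: the function name `check_mkinitcpio_root` and its messages are hypothetical, and it assumes the target root is mounted at the path given as the first argument (defaulting to `/mnt`).

```shell
#!/usr/bin/env bash
# Sketch only: check_mkinitcpio_root is a hypothetical helper, not an
# existing install-archzfs function. It reports the three issues from the
# write-up for a system mounted under the given root.
check_mkinitcpio_root() {
    local root="${1:-/mnt}"
    local problems=0
    local hooks preset

    # Issue 1: the last HOOKS= line must contain zfs and must not use systemd init.
    hooks="$(grep -E '^HOOKS=' "$root/etc/mkinitcpio.conf" 2>/dev/null | tail -n 1 || true)"
    if ! grep -qw 'zfs' <<<"$hooks"; then
        echo "HOOKS lacks zfs: ${hooks:-<no HOOKS line found>}"
        problems=$((problems + 1))
    fi
    if grep -qw 'systemd' <<<"$hooks"; then
        echo "HOOKS uses the systemd init hook"
        problems=$((problems + 1))
    fi

    # Issue 2: the live-ISO drop-in overrides HOOKS and must be gone.
    if [[ -e "$root/etc/mkinitcpio.conf.d/archiso.conf" ]]; then
        echo "leftover drop-in: $root/etc/mkinitcpio.conf.d/archiso.conf"
        problems=$((problems + 1))
    fi

    # Issue 3: no preset may still reference archiso.
    for preset in "$root"/etc/mkinitcpio.d/*.preset; do
        [[ -e "$preset" ]] || continue
        if grep -q 'archiso' "$preset"; then
            echo "archiso preset: $preset"
            problems=$((problems + 1))
        fi
    done

    # Succeed only when nothing was flagged.
    [[ "$problems" -eq 0 ]]
}
```

A plausible place to call it is just before `arch-chroot /mnt mkinitcpio -P`, aborting the install if it returns non-zero.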
diff --git a/docs/2026-01-22-ratio-boot-fix-session.org b/docs/2026-01-22-ratio-boot-fix-session.org
new file mode 100644
index 0000000..463774f
--- /dev/null
+++ b/docs/2026-01-22-ratio-boot-fix-session.org
@@ -0,0 +1,175 @@
+#+TITLE: Ratio Boot Fix Session - 2026-01-22
+#+DATE: 2026-01-22
+#+AUTHOR: Craig Jennings + Claude
+
+* Summary
+
+Successfully diagnosed and fixed boot failures on ratio (Framework Desktop with AMD Ryzen AI Max 300 / Strix Halo GPU). The primary issue was outdated/missing linux-firmware causing the amdgpu driver to hang during boot.
+
+* Hardware
+
+- Machine: Framework Desktop
+- CPU/GPU: AMD Ryzen AI Max 300 (Strix Halo APU, codenamed GFX 1151)
+- Storage: 2x NVMe in ZFS mirror (zroot)
+- Installed via: install-archzfs script from this project
+
+* Initial Symptoms
+
+1. System froze at "triggering udev events..." during boot
+2. Only visible message before freeze: "RDSEED32 is broken" (red herring - just a warning)
+3. Freeze occurred with both linux-lts (6.12.66) and linux (6.15.2) kernels
+4. Blacklisting amdgpu allowed boot to proceed but caused kernel panic (no display = init killed)
+
+* Root Cause
+
+The linux-firmware package was either missing or outdated. Specifically:
+- linux-firmware 20251125 is known to break AMD Strix Halo (GFX 1151)
+- linux-firmware 20260110 contains fixes for Strix Halo stability
+
+Source: Donato Capitella video "ROCm+Linux Support on Strix Halo: It's finally stable in 2026!"
+- Firmware 20251125 completely broke ROCm/GPU on Strix Halo
+- Firmware 20260110+ restores functionality
+
+* Troubleshooting Timeline
+
+** Phase 1: Initial Diagnosis
+
+- SSH'd to ratio via archzfs live ISO
+- Found mkinitcpio configuration issues (separate bug)
+- Fixed mkinitcpio.conf HOOKS and removed archiso.conf drop-in
+- System still froze after fixes
+
+** Phase 2: Kernel Investigation
+
+- Researched AMD Strix Halo issues on Framework community forums
+- Found reports of VPE (Video Processing Engine) idle timeout bug
+- Attempted kernel parameter workarounds:
+  - amdgpu.pg_mask=0 (disable power gating) - didn't help
+  - amdgpu.cwsr_enable=0 (disable compute wavefront save/restore) - not tested
+- Installed kernel 6.15.2 from Arch Linux Archive (has VPE fix)
+- Installed matching zfs-linux package for 6.15.2
+- System still froze
+
+** Phase 3: ZFS Rollback Complications
+
+- Rolled back to pre-kernel-switch ZFS snapshot
+- Discovered /boot is NOT on ZFS (EFI partition)
+- Rollback caused mismatch: root filesystem rolled back, but /boot kept newer kernels
+- Kernel 6.15.2 panicked because its modules didn't exist on rolled-back root
+- Documented this as a fundamental ZFS-on-root issue (see todo.org)
+
+** Phase 4: Firmware Discovery
+
+- Found video transcript explaining Strix Halo firmware requirements
+- Discovered linux-firmware package was not installed (orphaned files from rollback)
+- Repo had linux-firmware 20260110-1 (the fixed version)
+- Installed linux-firmware 20260110-1
+
+** Phase 5: Boot Success with Secondary Issues
+
+After firmware install, encountered additional issues:
+
+1. *Hostid mismatch*: Pool showed "previously in use from another system"
+   - Fix: Clean export from live ISO (zpool export zroot)
+
+2. *ZFS mountpoint=legacy*: Root dataset had legacy mountpoint from chroot work
+   - Fix: zfs set mountpoint=/ zroot/ROOT/default
+
+3. *ZFS mountpoints with /mnt prefix*: All child datasets had /mnt prefix
+   - Cause: zpool import -R /mnt persisted mountpoint changes
+   - Fix: Reset all mountpoints (zfs set mountpoint=/home zroot/home, etc.)
+
+* Final Working Configuration
+
+#+BEGIN_SRC
+Kernel: linux-lts 6.12.66-1-lts
+Firmware: linux-firmware 20260110-1
+ZFS: zfs-linux-lts (DKMS built for 6.12.66)
+Boot: GRUB with spl.spl_hostid=0x564478f3
+#+END_SRC
+
+* Key Learnings
+
+** 1. Firmware Matters for AMD APUs
+
+The linux-firmware package is critical for AMD integrated GPUs. Strix Halo specifically requires firmware 20260110 or newer. The kernel version (6.12 vs 6.15) was less important than having correct firmware.
+
+** 2. ZFS Rollback + Separate /boot = Danger
+
+When /boot is on a separate EFI partition (not ZFS):
+- ZFS rollback doesn't affect /boot
+- Kernel images remain at newer version
+- Modules on root get rolled back
+- Result: Boot failure or kernel panic
+
+Solutions:
+- Use ZFSBootMenu (stores kernel/initramfs on ZFS)
+- Put /boot on ZFS (GRUB can read it)
+- Always rebuild initramfs after rollback
+- Sync /boot backups with ZFS snapshots
+
+** 3. zpool import -R Persists Mountpoints
+
+Using =zpool import -R /mnt= for chroot work can permanently change dataset mountpoints. The -R flag sets altroot, but if you then modify datasets, those changes persist with the /mnt prefix.
+
+Fix after chroot work:
+#+BEGIN_SRC bash
+zfs set mountpoint=/ zroot/ROOT/default
+zfs set mountpoint=/home zroot/home
+# ... etc for all datasets
+#+END_SRC
+
+** 4. Hostid Consistency Required
+
+ZFS pools track which system last accessed them. If hostid changes (e.g., between live ISO and installed system), import fails with "pool was previously in use from another system."
+
+Solutions:
+- Clean export before switching systems (zpool export)
+- Force import (zpool import -f)
+- Ensure consistent hostid via /etc/hostid and spl.spl_hostid kernel parameter
+
+* Files Modified on Ratio
+
+- /etc/mkinitcpio.conf - Fixed HOOKS
+- /etc/mkinitcpio.conf.d/archiso.conf - Removed (was overriding HOOKS)
+- /etc/default/grub - GRUB_TIMEOUT=5 (was 0)
+- /boot/grub/grub.cfg - Regenerated, added TEST label to mainline kernel
+- /etc/hostid - Regenerated to match GRUB hostid parameter
+- ZFS dataset mountpoints - Reset from /mnt/* to /*
+
+* Packages Installed
+
+- linux-firmware 20260110-1 (critical fix)
+- linux 6.15.2 + zfs-linux (available as TEST kernel, not needed for boot)
+- Various system packages updated during troubleshooting
+
+* Resources Referenced
+
+** Framework Community Posts
+- https://community.frame.work/t/attn-critical-bugs-in-amdgpu-driver-included-with-kernel-6-18-x-6-19-x/79221
+- https://community.frame.work/t/fyi-linux-firmware-amdgpu-20251125-breaks-rocm-on-ai-max-395-8060s/78554
+- https://github.com/FrameworkComputer/SoftwareFirmwareIssueTracker/issues/162
+
+** Donato Capitella Video
+- Title: "ROCm+Linux Support on Strix Halo: It's finally stable in 2026!"
+- Key info: Firmware 20260110+ required, kernel 6.18.4+ for ROCm stability
+- Transcript saved: assets/Donato Capitella-ROCm+Linux Support on Strix Halo...txt
+
+** Other
+- Arch Linux Archive (for kernel 6.15.2 package)
+- Jeff Geerling blog (VRAM allocation on AMD APUs)
+
+* TODO Items Created
+
+Added to todo.org:
+- [#A] Fix ZFS rollback breaking boot (/boot not on ZFS)
+- Links to existing [#A] Integrate ZFSBootMenu task
+
+* Test Kernel Available
+
+The TEST kernel (linux 6.15.2) is installed and available in GRUB Advanced menu. It has matching zfs-linux modules and should work if needed. The mainline kernel may be useful for:
+- ROCm/AI workloads (combined with ROCm 7.2+ when released)
+- Future GPU stability improvements
+- Testing newer kernel features
+
+Current recommendation: Use linux-lts for stability, TEST kernel for experimentation.
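The session notes reset the `/mnt`-prefixed mountpoints one dataset at a time and elide the rest with "etc". A dry-run helper can generate the full list of `zfs set` commands for review. This is a sketch under stated assumptions: `strip_altroot_prefix` is a hypothetical name, it expects tab-separated `name` and `mountpoint` columns on stdin (the format `zfs list -H -o name,mountpoint` produces), and it only prints commands rather than running them. It should be fed output from a pool imported without an altroot, since an active altroot is prepended to every displayed mountpoint.

```shell
#!/usr/bin/env bash
# Sketch only: strip_altroot_prefix is a hypothetical helper. It reads
# "name<TAB>mountpoint" lines and prints a `zfs set` command for each
# dataset whose mountpoint still carries the given prefix.
strip_altroot_prefix() {
    local prefix="${1:-/mnt}"
    local name mountpoint fixed

    while IFS=$'\t' read -r name mountpoint; do
        case "$mountpoint" in
            "$prefix")   fixed="/" ;;                          # the root dataset itself
            "$prefix"/*) fixed="${mountpoint#"$prefix"}" ;;    # /mnt/home -> /home
            *)           continue ;;                           # already correct, legacy, or none
        esac
        printf 'zfs set mountpoint=%s %s\n' "$fixed" "$name"
    done
}
```

One way to use it, assuming the pool is named `zroot` as in the notes, is `zfs list -H -o name,mountpoint -r zroot | strip_altroot_prefix /mnt`, then inspecting the printed commands before running any of them.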
