Diffstat (limited to 'docs')
-rw-r--r-- docs/2026-01-22-mkinitcpio-config-boot-failure.org | 159
-rw-r--r-- docs/2026-01-22-ratio-boot-fix-session.org | 175
2 files changed, 334 insertions, 0 deletions
diff --git a/docs/2026-01-22-mkinitcpio-config-boot-failure.org b/docs/2026-01-22-mkinitcpio-config-boot-failure.org
new file mode 100644
index 0000000..ba5bc72
--- /dev/null
+++ b/docs/2026-01-22-mkinitcpio-config-boot-failure.org
@@ -0,0 +1,159 @@
+#+TITLE: install-archzfs leaves broken mkinitcpio configuration
+#+DATE: 2026-01-22
+
+* Problem Summary
+
+After installing Arch Linux with ZFS via install-archzfs, the system is left with an incorrect mkinitcpio configuration that can cause boot failures. The problems are latent: the system may boot at first, but it will fail to boot after any mkinitcpio regeneration (kernel update, manual rebuild, etc.).
+
+* Root Cause
+
+The install-archzfs script does not properly configure mkinitcpio for a ZFS boot environment. Three issues were identified:
+
+** Issue 1: Wrong HOOKS in mkinitcpio.conf
+
+The installed system had:
+#+begin_example
+HOOKS=(base systemd autodetect microcode modconf kms keyboard keymap sd-vconsole block filesystems fsck)
+#+end_example
+
+This is wrong for ZFS because:
+- It uses the =systemd= hook, but the =zfs= hook is busybox-based and does not work in a systemd-based initramfs
+- The =zfs= hook is missing entirely
+- It includes the =fsck= hook, which does not apply to a ZFS root (ZFS has no fsck)
+
+Correct HOOKS for ZFS:
+#+begin_example
+HOOKS=(base udev autodetect microcode modconf kms keyboard keymap consolefont block zfs filesystems)
+#+end_example
+
+** Issue 2: Leftover archiso.conf drop-in
+
+The file =/etc/mkinitcpio.conf.d/archiso.conf= was left over from the live ISO:
+#+begin_example
+HOOKS=(base udev microcode modconf kms memdisk archiso archiso_loop_mnt archiso_pxe_common archiso_pxe_nbd archiso_pxe_http archiso_pxe_nfs block filesystems keyboard)
+COMPRESSION="xz"
+COMPRESSION_OPTIONS=(-9e)
+#+end_example
+
+This drop-in OVERRIDES the HOOKS setting in mkinitcpio.conf, so even if mkinitcpio.conf were correct, this file would break it.
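+
+One way to spot such an override is to list every =HOOKS== assignment that mkinitcpio may read. This is a minimal sketch; the function name =list_hooks_settings= and its directory parameter are made up for illustration.
+
+#+begin_src bash
+# Print every HOOKS= line from mkinitcpio.conf and its drop-in directory.
+# conf_dir defaults to /etc; pass /mnt/etc to inspect a system mounted under /mnt.
+list_hooks_settings() {
+    local conf_dir="${1:-/etc}"
+    grep -H '^HOOKS=' "$conf_dir/mkinitcpio.conf" \
+        "$conf_dir"/mkinitcpio.conf.d/*.conf 2>/dev/null
+}
+#+end_src
+
+More than one line of output means a drop-in is also setting HOOKS and deserves a look.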
+
+** Issue 3: Wrong preset file
+
+The file =/etc/mkinitcpio.d/linux-lts.preset= contained archiso-specific configuration:
+#+begin_example
+# mkinitcpio preset file for the 'linux-lts' package on archiso
+
+PRESETS=('archiso')
+
+ALL_kver='/boot/vmlinuz-linux-lts'
+archiso_config='/etc/mkinitcpio.conf.d/archiso.conf'
+
+archiso_image="/boot/initramfs-linux-lts.img"
+#+end_example
+
+Should be:
+#+begin_example
+# mkinitcpio preset file for linux-lts
+
+PRESETS=(default fallback)
+
+ALL_kver="/boot/vmlinuz-linux-lts"
+
+default_image="/boot/initramfs-linux-lts.img"
+
+fallback_image="/boot/initramfs-linux-lts-fallback.img"
+fallback_options="-S autodetect"
+#+end_example
+
+* How This Manifests
+
+1. Fresh install appears to work (the initramfs built during install does include ZFS support; how it gets there has not been traced)
+2. System boots fine initially
+3. Kernel update or manual =mkinitcpio -P= rebuilds initramfs
+4. New initramfs lacks ZFS support due to wrong config
+5. Next reboot fails with "cannot import pool" or "failed to mount /sysroot"
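+
+Step 4 can be caught before rebooting by listing the rebuilt image and looking for the ZFS module. The sketch below assumes =lsinitcpio= (shipped with mkinitcpio) is available; the helper name =has_zfs_module= is hypothetical.
+
+#+begin_src bash
+# Succeeds if the file listing on stdin contains a zfs kernel module,
+# compressed (zfs.ko.zst, zfs.ko.xz, ...) or not (zfs.ko).
+has_zfs_module() {
+    grep -Eq '(^|/)zfs\.ko(\.[a-z0-9]+)?$'
+}
+
+# Usage on an installed system:
+#   lsinitcpio /boot/initramfs-linux-lts.img | has_zfs_module || echo "no ZFS in initramfs"
+#+end_src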
+
+* Fix Required in install-archzfs
+
+After setting up the chroot, the script needs to:
+
+1. *Set correct mkinitcpio.conf HOOKS*:
+ #+begin_src bash
+ sed -i 's/^HOOKS=.*/HOOKS=(base udev autodetect microcode modconf kms keyboard keymap consolefont block zfs filesystems)/' /mnt/etc/mkinitcpio.conf
+ #+end_src
+
+2. *Remove archiso drop-in*:
+ #+begin_src bash
+ rm -f /mnt/etc/mkinitcpio.conf.d/archiso.conf
+ #+end_src
+
+3. *Create proper preset file*:
+ #+begin_src bash
+ cat > /mnt/etc/mkinitcpio.d/linux-lts.preset << 'EOF'
+ # mkinitcpio preset file for linux-lts
+
+ PRESETS=(default fallback)
+
+ ALL_kver="/boot/vmlinuz-linux-lts"
+
+ default_image="/boot/initramfs-linux-lts.img"
+
+ fallback_image="/boot/initramfs-linux-lts-fallback.img"
+ fallback_options="-S autodetect"
+ EOF
+ #+end_src
+
+4. *Rebuild initramfs after fixing config*:
+ #+begin_src bash
+ arch-chroot /mnt mkinitcpio -P
+ #+end_src
+
+* Recovery Procedure (for affected systems)
+
+Boot from archzfs live ISO, then:
+
+#+begin_src bash
+# Import the pool under /mnt and mount the installed system
+zpool import -f -N -R /mnt zroot
+zfs mount zroot/ROOT/default
+zfs mount -a
+mount /dev/nvme0n1p1 /mnt/boot # adjust device as needed
+
+# Fix mkinitcpio.conf
+sed -i 's/^HOOKS=.*/HOOKS=(base udev autodetect microcode modconf kms keyboard keymap consolefont block zfs filesystems)/' /mnt/etc/mkinitcpio.conf
+
+# Remove archiso drop-in
+rm -f /mnt/etc/mkinitcpio.conf.d/archiso.conf
+
+# Fix preset (adjust for your kernel: linux, linux-lts, linux-zen, etc.)
+cat > /mnt/etc/mkinitcpio.d/linux-lts.preset << 'EOF'
+PRESETS=(default fallback)
+ALL_kver="/boot/vmlinuz-linux-lts"
+default_image="/boot/initramfs-linux-lts.img"
+fallback_image="/boot/initramfs-linux-lts-fallback.img"
+fallback_options="-S autodetect"
+EOF
+
+# Rebuild initramfs inside the installed system
+# (arch-chroot sets up /dev, /sys, /proc and /run for the chroot)
+arch-chroot /mnt mkinitcpio -P
+
+# Unmount and export cleanly so the installed system can import the pool
+umount /mnt/boot
+zpool export zroot
+
+# Reboot
+reboot
+#+end_src
+
+* Machine Details (ratio)
+
+- Two NVMe drives in ZFS mirror (nvme0n1, nvme1n1)
+- Pool: zroot
+- Root dataset: zroot/ROOT/default
+- Kernel: linux-lts 6.12.66-1
+- Boot partition: /dev/nvme0n1p1 (FAT32, mounted at /boot)
+
+* Related Information
+
+The immediate trigger for discovering this was a system freeze during mkinitcpio regeneration. That freeze was caused by the AMD GPU VPE power gating bug (separate issue - see archsetup NOTES.org for details). However, the system's inability to boot afterward exposed these latent mkinitcpio configuration problems.
diff --git a/docs/2026-01-22-ratio-boot-fix-session.org b/docs/2026-01-22-ratio-boot-fix-session.org
new file mode 100644
index 0000000..463774f
--- /dev/null
+++ b/docs/2026-01-22-ratio-boot-fix-session.org
@@ -0,0 +1,175 @@
+#+TITLE: Ratio Boot Fix Session - 2026-01-22
+#+DATE: 2026-01-22
+#+AUTHOR: Craig Jennings + Claude
+
+* Summary
+
+Successfully diagnosed and fixed boot failures on ratio (Framework Desktop with AMD Ryzen AI Max 300 / Strix Halo GPU). The primary issue was outdated/missing linux-firmware causing the amdgpu driver to hang during boot.
+
+* Hardware
+
+- Machine: Framework Desktop
+- CPU/GPU: AMD Ryzen AI Max 300 (APU codenamed Strix Halo, GPU target gfx1151)
+- Storage: 2x NVMe in ZFS mirror (zroot)
+- Installed via: install-archzfs script from this project
+
+* Initial Symptoms
+
+1. System froze at "triggering udev events..." during boot
+2. Only visible message before freeze: "RDSEED32 is broken" (red herring - just a warning)
+3. Freeze occurred with both linux-lts (6.12.66) and linux (6.15.2) kernels
+4. Blacklisting amdgpu allowed boot to proceed but caused kernel panic (no display = init killed)
+
+* Root Cause
+
+The linux-firmware package was either missing or outdated. Specifically:
+- linux-firmware 20251125 is known to break AMD Strix Halo (GFX 1151)
+- linux-firmware 20260110 contains fixes for Strix Halo stability
+
+Source: Donato Capitella video "ROCm+Linux Support on Strix Halo: It's finally stable in 2026!"
+- Firmware 20251125 completely broke ROCm/GPU on Strix Halo
+- Firmware 20260110+ restores functionality
+
+* Troubleshooting Timeline
+
+** Phase 1: Initial Diagnosis
+
+- SSH'd to ratio via archzfs live ISO
+- Found mkinitcpio configuration issues (separate bug)
+- Fixed mkinitcpio.conf HOOKS and removed archiso.conf drop-in
+- System still froze after fixes
+
+** Phase 2: Kernel Investigation
+
+- Researched AMD Strix Halo issues on Framework community forums
+- Found reports of VPE (Video Processing Engine) idle timeout bug
+- Attempted kernel parameter workarounds:
+ - amdgpu.pg_mask=0 (disable power gating) - didn't help
+ - amdgpu.cwsr_enable=0 (disable compute wavefront save/restore) - not tested
+- Installed kernel 6.15.2 from Arch Linux Archive (has VPE fix)
+- Installed matching zfs-linux package for 6.15.2
+- System still froze
+
+** Phase 3: ZFS Rollback Complications
+
+- Rolled back to pre-kernel-switch ZFS snapshot
+- Discovered /boot is NOT on ZFS (EFI partition)
+- Rollback caused mismatch: root filesystem rolled back, but /boot kept newer kernels
+- Kernel 6.15.2 panicked because its modules didn't exist on rolled-back root
+- Documented this as a fundamental ZFS-on-root issue (see todo.org)
+
+** Phase 4: Firmware Discovery
+
+- Found video transcript explaining Strix Halo firmware requirements
+- Discovered linux-firmware package was not installed (orphaned files from rollback)
+- Repo had linux-firmware 20260110-1 (the fixed version)
+- Installed linux-firmware 20260110-1
+
+** Phase 5: Boot Success with Secondary Issues
+
+After firmware install, encountered additional issues:
+
+1. *Hostid mismatch*: Pool showed "previously in use from another system"
+ - Fix: Clean export from live ISO (zpool export zroot)
+
+2. *ZFS mountpoint=legacy*: Root dataset had legacy mountpoint from chroot work
+ - Fix: zfs set mountpoint=/ zroot/ROOT/default
+
+3. *ZFS mountpoints with /mnt prefix*: All child datasets had /mnt prefix
+ - Cause: zpool import -R /mnt persisted mountpoint changes
+ - Fix: Reset all mountpoints (zfs set mountpoint=/home zroot/home, etc.)
+
+* Final Working Configuration
+
+#+BEGIN_EXAMPLE
+Kernel: linux-lts 6.12.66-1-lts
+Firmware: linux-firmware 20260110-1
+ZFS: zfs-linux-lts (DKMS built for 6.12.66)
+Boot: GRUB with spl.spl_hostid=0x564478f3
+#+END_EXAMPLE
+
+* Key Learnings
+
+** 1. Firmware Matters for AMD APUs
+
+The linux-firmware package is critical for AMD integrated GPUs. Strix Halo specifically requires firmware 20260110 or newer. The kernel version (6.12 vs 6.15) was less important than having correct firmware.
+
+** 2. ZFS Rollback + Separate /boot = Danger
+
+When /boot is on a separate EFI partition (not ZFS):
+- ZFS rollback doesn't affect /boot
+- Kernel images remain at newer version
+- Modules on root get rolled back
+- Result: Boot failure or kernel panic
+
+Solutions:
+- Use ZFSBootMenu (stores kernel/initramfs on ZFS)
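+- Put /boot on ZFS (GRUB can read ZFS, but only pools restricted to the feature flags it supports)
+- Always reinstall the kernel package (which also rebuilds the initramfs) after a rollback, so /boot matches the modules on root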
+- Sync /boot backups with ZFS snapshots
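+
+The mismatch can be detected before rebooting. The sketch below assumes the standard Arch kernel package layout, where each =/usr/lib/modules/<kver>/= directory holds a =pkgbase= file and a copy of the kernel image named =vmlinuz=; the function name is hypothetical.
+
+#+BEGIN_SRC bash
+# For every kernel image in boot_dir, look for a modules directory whose
+# pkgbase matches the image name and whose vmlinuz copy is byte-identical.
+check_boot_matches_modules() {
+    local modules_dir="${1:-/usr/lib/modules}" boot_dir="${2:-/boot}"
+    local status=0 image pkgbase dir found
+    for image in "$boot_dir"/vmlinuz-*; do
+        [ -f "$image" ] || continue
+        pkgbase="${image##*/vmlinuz-}"
+        found=no
+        for dir in "$modules_dir"/*/; do
+            if [ "$(cat "$dir/pkgbase" 2>/dev/null)" = "$pkgbase" ] &&
+                cmp -s "$dir/vmlinuz" "$image"; then
+                found=yes
+            fi
+        done
+        if [ "$found" = no ]; then
+            echo "MISMATCH: $image has no matching modules under $modules_dir"
+            status=1
+        fi
+    done
+    return "$status"
+}
+#+END_SRC
+
+Run with no arguments on the installed system, or as =check_boot_matches_modules /mnt/usr/lib/modules /mnt/boot= from the live ISO after a rollback.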
+
+** 3. zpool import -R Persists Mountpoints
+
+Using =zpool import -R /mnt= for chroot work can permanently change dataset mountpoints. The -R flag sets altroot, but if you then modify datasets, those changes persist with the /mnt prefix.
+
+Fix after chroot work:
+#+BEGIN_SRC bash
+zfs set mountpoint=/ zroot/ROOT/default
+zfs set mountpoint=/home zroot/home
+# ... etc for all datasets
+#+END_SRC
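+
+To find datasets that still carry the prefix, the tab-separated output of =zfs list -H= can be filtered. A minimal sketch; =datasets_under_prefix= is an illustrative name, not an existing tool.
+
+#+BEGIN_SRC bash
+# Read "name<TAB>mountpoint" lines on stdin and print the names of datasets
+# whose mountpoint is the prefix itself or lies below it (default: /mnt).
+datasets_under_prefix() {
+    local prefix="${1:-/mnt}"
+    awk -F'\t' -v p="$prefix" '$2 == p || index($2, p "/") == 1 { print $1 }'
+}
+
+# Usage:
+#   zfs list -H -o name,mountpoint -r zroot | datasets_under_prefix /mnt
+#+END_SRC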
+
+** 4. Hostid Consistency Required
+
+ZFS pools track which system last accessed them. If hostid changes (e.g., between live ISO and installed system), import fails with "pool was previously in use from another system."
+
+Solutions:
+- Clean export before switching systems (zpool export)
+- Force import (zpool import -f)
+- Ensure consistent hostid via /etc/hostid and spl.spl_hostid kernel parameter
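+
+The last point can be checked by comparing the kernel parameter with what =hostid= reports. This is a sketch under the assumption that the parameter is passed on the kernel command line as =spl.spl_hostid=<value>=; the helper name =cmdline_hostid= is made up for illustration.
+
+#+BEGIN_SRC bash
+# Extract spl.spl_hostid from a kernel command line string and print it as
+# eight lowercase hex digits, the same format hostid(1) uses.
+cmdline_hostid() {
+    local arg value
+    for arg in $1; do
+        case "$arg" in
+            spl.spl_hostid=*)
+                value="${arg#spl.spl_hostid=}"
+                printf '%08x\n' "$((value))"
+                return 0
+                ;;
+        esac
+    done
+    return 1   # parameter not present
+}
+
+# Usage on the installed system:
+#   [ "$(cmdline_hostid "$(cat /proc/cmdline)")" = "$(hostid)" ] || echo "hostid mismatch"
+#+END_SRC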
+
+* Files Modified on Ratio
+
+- /etc/mkinitcpio.conf - Fixed HOOKS
+- /etc/mkinitcpio.conf.d/archiso.conf - Removed (was overriding HOOKS)
+- /etc/default/grub - GRUB_TIMEOUT=5 (was 0)
+- /boot/grub/grub.cfg - Regenerated, added TEST label to mainline kernel
+- /etc/hostid - Regenerated to match GRUB hostid parameter
+- ZFS dataset mountpoints - Reset from /mnt/* to /*
+
+* Packages Installed
+
+- linux-firmware 20260110-1 (critical fix)
+- linux 6.15.2 + zfs-linux (available as TEST kernel, not needed for boot)
+- Various system packages updated during troubleshooting
+
+* Resources Referenced
+
+** Framework Community Posts
+- https://community.frame.work/t/attn-critical-bugs-in-amdgpu-driver-included-with-kernel-6-18-x-6-19-x/79221
+- https://community.frame.work/t/fyi-linux-firmware-amdgpu-20251125-breaks-rocm-on-ai-max-395-8060s/78554
+- https://github.com/FrameworkComputer/SoftwareFirmwareIssueTracker/issues/162
+
+** Donato Capitella Video
+- Title: "ROCm+Linux Support on Strix Halo: It's finally stable in 2026!"
+- Key info: Firmware 20260110+ required, kernel 6.18.4+ for ROCm stability
+- Transcript saved: assets/Donato Capitella-ROCm+Linux Support on Strix Halo...txt
+
+** Other
+- Arch Linux Archive (for kernel 6.15.2 package)
+- Jeff Geerling blog (VRAM allocation on AMD APUs)
+
+* TODO Items Created
+
+Added to todo.org:
+- [#A] Fix ZFS rollback breaking boot (/boot not on ZFS)
+- Links to existing [#A] Integrate ZFSBootMenu task
+
+* Test Kernel Available
+
+The TEST kernel (linux 6.15.2) is installed and available in GRUB Advanced menu. It has matching zfs-linux modules and should work if needed. The mainline kernel may be useful for:
+- ROCm/AI workloads (combined with ROCm 7.2+ when released)
+- Future GPU stability improvements
+- Testing newer kernel features
+
+Current recommendation: Use linux-lts for stability, TEST kernel for experimentation.