Diffstat (limited to 'docs/2026-01-22-ratio-boot-fix-session.org')
| -rw-r--r-- | docs/2026-01-22-ratio-boot-fix-session.org | 241 |
1 files changed, 0 insertions, 241 deletions
diff --git a/docs/2026-01-22-ratio-boot-fix-session.org b/docs/2026-01-22-ratio-boot-fix-session.org
deleted file mode 100644
index 56563d9..0000000
--- a/docs/2026-01-22-ratio-boot-fix-session.org
+++ /dev/null
@@ -1,241 +0,0 @@
-#+TITLE: Ratio Boot Fix Session - 2026-01-22
-#+DATE: 2026-01-22
-#+AUTHOR: Craig Jennings + Claude
-
-* Summary
-
-Successfully diagnosed and fixed boot failures on ratio (Framework Desktop with AMD Ryzen AI Max 300 / Strix Halo GPU). The primary issue was outdated/missing linux-firmware causing the amdgpu driver to hang during boot.
-
-* Hardware
-
-- Machine: Framework Desktop
-- CPU/GPU: AMD Ryzen AI Max 300 (Strix Halo APU, codenamed GFX 1151)
-- Storage: 2x NVMe in ZFS mirror (zroot)
-- Installed via: install-archzfs script from this project
-
-* Initial Symptoms
-
-1. System froze at "triggering udev events..." during boot
-2. Only visible message before freeze: "RDSEED32 is broken" (red herring - just a warning)
-3. Freeze occurred with both linux-lts (6.12.66) and linux (6.15.2) kernels
-4. Blacklisting amdgpu allowed boot to proceed but caused kernel panic (no display = init killed)
-
-* Root Cause
-
-The linux-firmware package was either missing or outdated. Specifically:
-- linux-firmware 20251125 is known to break AMD Strix Halo (GFX 1151)
-- linux-firmware 20260110 contains fixes for Strix Halo stability
-
-Source: Donato Capitella video "ROCm+Linux Support on Strix Halo: It's finally stable in 2026!"
-- Firmware 20251125 completely broke ROCm/GPU on Strix Halo
-- Firmware 20260110+ restores functionality
-
-* Troubleshooting Timeline
-
-** Phase 1: Initial Diagnosis
-
-- SSH'd to ratio via archzfs live ISO
-- Found mkinitcpio configuration issues (separate bug)
-- Fixed mkinitcpio.conf HOOKS and removed archiso.conf drop-in
-- System still froze after fixes
-
-** Phase 2: Kernel Investigation
-
-- Researched AMD Strix Halo issues on Framework community forums
-- Found reports of VPE (Video Processing Engine) idle timeout bug
-- Attempted kernel parameter workarounds:
-  - amdgpu.pg_mask=0 (disable power gating) - didn't help
-  - amdgpu.cwsr_enable=0 (disable compute wavefront save/restore) - not tested
-- Installed kernel 6.15.2 from Arch Linux Archive (has VPE fix)
-- Installed matching zfs-linux package for 6.15.2
-- System still froze
-
-** Phase 3: ZFS Rollback Complications
-
-- Rolled back to pre-kernel-switch ZFS snapshot
-- Discovered /boot is NOT on ZFS (EFI partition)
-- Rollback caused mismatch: root filesystem rolled back, but /boot kept newer kernels
-- Kernel 6.15.2 panicked because its modules didn't exist on rolled-back root
-- Documented this as a fundamental ZFS-on-root issue (see todo.org)
-
-** Phase 4: Firmware Discovery
-
-- Found video transcript explaining Strix Halo firmware requirements
-- Discovered linux-firmware package was not installed (orphaned files from rollback)
-- Repo had linux-firmware 20260110-1 (the fixed version)
-- Installed linux-firmware 20260110-1
-
-** Phase 5: Boot Success with Secondary Issues
-
-After firmware install, encountered additional issues:
-
-1. *Hostid mismatch*: Pool showed "previously in use from another system"
-   - Fix: Clean export from live ISO (zpool export zroot)
-
-2. *ZFS mountpoint=legacy*: Root dataset had legacy mountpoint from chroot work
-   - Fix: zfs set mountpoint=/ zroot/ROOT/default
-
-3. *ZFS mountpoints with /mnt prefix*: All child datasets had /mnt prefix
-   - Cause: zpool import -R /mnt persisted mountpoint changes
-   - Fix: Reset all mountpoints (zfs set mountpoint=/home zroot/home, etc.)
-
-* Final Working Configuration
-
-#+BEGIN_SRC
-Kernel: linux-lts 6.12.66-1-lts
-Firmware: linux-firmware 20260110-1
-ZFS: zfs-linux-lts (DKMS built for 6.12.66)
-Boot: GRUB with spl.spl_hostid=0x564478f3
-#+END_SRC
-
-* Key Learnings
-
-** 1. Firmware Matters for AMD APUs
-
-The linux-firmware package is critical for AMD integrated GPUs. Strix Halo specifically requires firmware 20260110 or newer. The kernel version (6.12 vs 6.15) was less important than having correct firmware.
-
-** 2. ZFS Rollback + Separate /boot = Danger
-
-When /boot is on a separate EFI partition (not ZFS):
-- ZFS rollback doesn't affect /boot
-- Kernel images remain at newer version
-- Modules on root get rolled back
-- Result: Boot failure or kernel panic
-
-Solutions:
-- Use ZFSBootMenu (stores kernel/initramfs on ZFS)
-- Put /boot on ZFS (GRUB can read it)
-- Always rebuild initramfs after rollback
-- Sync /boot backups with ZFS snapshots
-
-** 3. zpool import -R Persists Mountpoints
-
-Using =zpool import -R /mnt= for chroot work can permanently change dataset mountpoints. The -R flag sets altroot, but if you then modify datasets, those changes persist with the /mnt prefix.
-
-Fix after chroot work:
-#+BEGIN_SRC bash
-zfs set mountpoint=/ zroot/ROOT/default
-zfs set mountpoint=/home zroot/home
-# ... etc for all datasets
-#+END_SRC
-
-** 4. Hostid Consistency Required
-
-ZFS pools track which system last accessed them. If hostid changes (e.g., between live ISO and installed system), import fails with "pool was previously in use from another system."
-
-Solutions:
-- Clean export before switching systems (zpool export)
-- Force import (zpool import -f)
-- Ensure consistent hostid via /etc/hostid and spl.spl_hostid kernel parameter
-
-* Files Modified on Ratio
-
-- /etc/mkinitcpio.conf - Fixed HOOKS
-- /etc/mkinitcpio.conf.d/archiso.conf - Removed (was overriding HOOKS)
-- /etc/default/grub - GRUB_TIMEOUT=5 (was 0)
-- /boot/grub/grub.cfg - Regenerated, added TEST label to mainline kernel
-- /etc/hostid - Regenerated to match GRUB hostid parameter
-- ZFS dataset mountpoints - Reset from /mnt/* to /*
-
-* Packages Installed
-
-- linux-firmware 20260110-1 (critical fix)
-- linux 6.15.2 + zfs-linux (available as TEST kernel, not needed for boot)
-- Various system packages updated during troubleshooting
-
-* Resources Referenced
-
-** Framework Community Posts
-- https://community.frame.work/t/attn-critical-bugs-in-amdgpu-driver-included-with-kernel-6-18-x-6-19-x/79221
-- https://community.frame.work/t/fyi-linux-firmware-amdgpu-20251125-breaks-rocm-on-ai-max-395-8060s/78554
-- https://github.com/FrameworkComputer/SoftwareFirmwareIssueTracker/issues/162
-
-** Donato Capitella Video
-- Title: "ROCm+Linux Support on Strix Halo: It's finally stable in 2026!"
-- Key info: Firmware 20260110+ required, kernel 6.18.4+ for ROCm stability
-- Transcript saved: assets/Donato Capitella-ROCm+Linux Support on Strix Halo...txt
-
-** Other
-- Arch Linux Archive (for kernel 6.15.2 package)
-- Jeff Geerling blog (VRAM allocation on AMD APUs)
-
-* TODO Items Created
-
-Added to todo.org:
-- [#A] Fix ZFS rollback breaking boot (/boot not on ZFS)
-- Links to existing [#A] Integrate ZFSBootMenu task
-
-* Test Kernel Available
-
-The TEST kernel (linux 6.15.2) is installed and available in GRUB Advanced menu. It has matching zfs-linux modules and should work if needed. The mainline kernel may be useful for:
-- ROCm/AI workloads (combined with ROCm 7.2+ when released)
-- Future GPU stability improvements
-- Testing newer kernel features
-
-Current recommendation: Use linux-lts for stability, TEST kernel for experimentation.
-
-* Post-Fix Configuration (Phase 6)
-
-After boot was working, made kernel 6.15.2 the default with a clean GRUB menu.
-
-** Made 6.15.2 Default
-
-1. Created custom GRUB script /etc/grub.d/09_custom_kernels
-2. Added clean menu entries:
-   - "Linux 6.15.2 (default)"
-   - "Linux LTS 6.12.66 (fallback)"
-3. Set GRUB_DEFAULT="linux-6.15.2"
-
-** Pinned Kernel
-
-Added to /etc/pacman.conf:
-#+BEGIN_SRC
-IgnorePkg = linux
-#+END_SRC
-
-This prevents pacman from upgrading linux package until manually unpinned.
-
-** GRUB UUID Issue
-
-Initial custom script used wrong boot partition UUID:
-- nvme0n1p1: 6A4B-47A4 (wrong - got this from lsblk on first NVMe)
-- nvme1n1p1: 6A4A-93B1 (correct - actually mounted at /boot)
-
-Fix: Updated /etc/grub.d/09_custom_kernels to use 6A4A-93B1
-
-** Final GRUB Menu
-
-#+BEGIN_SRC
-1. Linux 6.15.2 (default) <- Boots automatically
-2. Linux LTS 6.12.66 (fallback)
-3. Arch Linux (ZFS) Linux <- Auto-generated (ignored)
-4. Advanced options...
-5. UEFI Firmware Settings
-6. ZFS Snapshots
-#+END_SRC
-
-** SSH Keys
-
-Configured SSH key authentication for cjennings@ratio.local to simplify remote access.
-Password auth (sshpass) was unreliable from Claude's session.
-
-* Final System State
-
-#+BEGIN_SRC
-Hostname: ratio.local
-Kernel: 6.15.2-arch1-1 (default)
-Fallback: 6.12.66-1-lts
-Firmware: linux-firmware 20260110-1
-ZFS: All pools healthy, 11 datasets mounted
-Boot: Custom GRUB menu with clean entries
-Kernel pinned: Yes (IgnorePkg = linux)
-#+END_SRC
-
-* When to Unpin Kernel
-
-Unpin linux package when linux-lts version >= 6.15:
-#+BEGIN_SRC bash
-sudo sed -i 's/^IgnorePkg = linux/#IgnorePkg = linux/' /etc/pacman.conf
-#+END_SRC
-
-Then optionally switch back to LTS as default for stability.
