Diffstat (limited to 'docs/2026-01-22-ratio-boot-fix-session.org')
-rw-r--r-- docs/2026-01-22-ratio-boot-fix-session.org | 241
1 files changed, 0 insertions, 241 deletions
diff --git a/docs/2026-01-22-ratio-boot-fix-session.org b/docs/2026-01-22-ratio-boot-fix-session.org
deleted file mode 100644
index 56563d9..0000000
--- a/docs/2026-01-22-ratio-boot-fix-session.org
+++ /dev/null
@@ -1,241 +0,0 @@
-#+TITLE: Ratio Boot Fix Session - 2026-01-22
-#+DATE: 2026-01-22
-#+AUTHOR: Craig Jennings + Claude
-
-* Summary
-
-Successfully diagnosed and fixed boot failures on ratio (Framework Desktop with AMD Ryzen AI Max 300 / Strix Halo GPU). The primary cause was a missing or outdated linux-firmware package, which made the amdgpu driver hang during boot.
-
-* Hardware
-
-- Machine: Framework Desktop
-- CPU/GPU: AMD Ryzen AI Max 300 (APU codenamed Strix Halo; GPU is GFX 1151)
-- Storage: 2x NVMe in ZFS mirror (zroot)
-- Installed via: install-archzfs script from this project
-
-* Initial Symptoms
-
-1. System froze at "triggering udev events..." during boot
-2. Only visible message before freeze: "RDSEED32 is broken" (red herring - just a warning)
-3. Freeze occurred with both linux-lts (6.12.66) and linux (6.15.2) kernels
-4. Blacklisting amdgpu allowed boot to proceed but caused kernel panic (no display = init killed)
-
-* Root Cause
-
-The linux-firmware package was either missing or outdated. Specifically:
-- linux-firmware 20251125 is known to break AMD Strix Halo (GFX 1151)
-- linux-firmware 20260110 contains fixes for Strix Halo stability
-
-Source: Donato Capitella video "ROCm+Linux Support on Strix Halo: It's finally stable in 2026!"
-- Firmware 20251125 completely broke ROCm/GPU on Strix Halo
-- Firmware 20260110+ restores functionality
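-
-A minimal sketch of how the installed firmware could be checked against the fixed release. The function name =firmware_is_fixed= is hypothetical, and the sketch assumes the package version begins with a YYYYMMDD date (as in 20260110-1), optionally preceded by a pacman epoch:
-#+BEGIN_SRC bash
-# Hypothetical helper: succeed if a linux-firmware version string is at or
-# past a minimum release date (default 20260110, the fixed release noted above).
-firmware_is_fixed() {
-    local version="$1" min="${2:-20260110}"
-    version="${version#*:}"            # drop an optional pacman epoch ("1:")
-    local date="${version%%[!0-9]*}"   # keep only the leading digits
-    [[ -n "$date" ]] && (( 10#$date >= 10#$min ))
-}
-
-# Example use on the installed system (pacman -Q prints "name version"):
-#   firmware_is_fixed "$(pacman -Q linux-firmware | awk '{print $2}')" && echo "firmware OK"
-#+END_SRC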
-
-* Troubleshooting Timeline
-
-** Phase 1: Initial Diagnosis
-
-- SSH'd to ratio via archzfs live ISO
-- Found mkinitcpio configuration issues (separate bug)
-- Fixed mkinitcpio.conf HOOKS and removed archiso.conf drop-in
-- System still froze after fixes
-
-** Phase 2: Kernel Investigation
-
-- Researched AMD Strix Halo issues on Framework community forums
-- Found reports of VPE (Video Processing Engine) idle timeout bug
-- Attempted kernel parameter workarounds:
- - amdgpu.pg_mask=0 (disable power gating) - didn't help
- - amdgpu.cwsr_enable=0 (disable compute wavefront save/restore) - not tested
-- Installed kernel 6.15.2 from Arch Linux Archive (has VPE fix)
-- Installed matching zfs-linux package for 6.15.2
-- System still froze
-
-** Phase 3: ZFS Rollback Complications
-
-- Rolled back to pre-kernel-switch ZFS snapshot
-- Discovered /boot is NOT on ZFS (EFI partition)
-- Rollback caused mismatch: root filesystem rolled back, but /boot kept newer kernels
-- Kernel 6.15.2 panicked because its modules didn't exist on rolled-back root
-- Documented this as a fundamental ZFS-on-root issue (see todo.org)
-
-** Phase 4: Firmware Discovery
-
-- Found video transcript explaining Strix Halo firmware requirements
-- Discovered linux-firmware package was not installed (orphaned files from rollback)
-- Repo had linux-firmware 20260110-1 (the fixed version)
-- Installed linux-firmware 20260110-1
-
-** Phase 5: Boot Success with Secondary Issues
-
-After firmware install, encountered additional issues:
-
-1. *Hostid mismatch*: Pool showed "previously in use from another system"
- - Fix: Clean export from live ISO (zpool export zroot)
-
-2. *ZFS mountpoint=legacy*: Root dataset had legacy mountpoint from chroot work
- - Fix: zfs set mountpoint=/ zroot/ROOT/default
-
-3. *ZFS mountpoints with /mnt prefix*: All child datasets had /mnt prefix
- - Cause: zpool import -R /mnt persisted mountpoint changes
- - Fix: Reset all mountpoints (zfs set mountpoint=/home zroot/home, etc.)
-
-* Final Working Configuration
-
-#+BEGIN_SRC
-Kernel: linux-lts 6.12.66-1-lts
-Firmware: linux-firmware 20260110-1
-ZFS: zfs-linux-lts (DKMS built for 6.12.66)
-Boot: GRUB with spl.spl_hostid=0x564478f3
-#+END_SRC
-
-* Key Learnings
-
-** 1. Firmware Matters for AMD APUs
-
-The linux-firmware package is critical for AMD integrated GPUs. Strix Halo specifically requires firmware 20260110 or newer. The kernel version (6.12 vs 6.15) was less important than having correct firmware.
-
-** 2. ZFS Rollback + Separate /boot = Danger
-
-When /boot is on a separate EFI partition (not ZFS):
-- ZFS rollback doesn't affect /boot
-- Kernel images remain at newer version
-- Modules on root get rolled back
-- Result: Boot failure or kernel panic
-
-Solutions:
-- Use ZFSBootMenu (stores kernel/initramfs on ZFS)
-- Put /boot on ZFS (GRUB can read ZFS, but only pools restricted to the features it supports)
-- Always rebuild initramfs after rollback
-- Sync /boot backups with ZFS snapshots
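-
-A minimal sketch of a post-rollback sanity check. It assumes the Arch kernel package layout, in which each directory under /usr/lib/modules holds a =pkgbase= file and a =vmlinuz= that gets copied to /boot/vmlinuz-PKGBASE. The function name =check_boot_sync= is hypothetical:
-#+BEGIN_SRC bash
-# Hypothetical helper: for every kernel image in /boot, look for a modules
-# directory on root whose pkgbase matches and whose vmlinuz is byte-identical.
-check_boot_sync() {
-    local modules_root="${1:-/usr/lib/modules}" boot_dir="${2:-/boot}"
-    local image dir pkgbase found status=0
-    for image in "$boot_dir"/vmlinuz-*; do
-        [[ -f "$image" ]] || continue
-        pkgbase="${image##*/vmlinuz-}"
-        found=0
-        for dir in "$modules_root"/*/; do
-            [[ -f "${dir}pkgbase" && "$(<"${dir}pkgbase")" == "$pkgbase" ]] || continue
-            cmp -s "${dir}vmlinuz" "$image" && found=1
-        done
-        if (( found )); then
-            printf 'ok: %s\n' "$pkgbase"
-        else
-            printf 'MISMATCH: %s has no matching modules on root\n' "$pkgbase"
-            status=1
-        fi
-    done
-    return "$status"
-}
-#+END_SRC
-
-Running =check_boot_sync= with no arguments after a rollback would flag a kernel left behind in /boot, such as the 6.15.2 image in Phase 3, before the next reboot.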
-
-** 3. zpool import -R Persists Mountpoints
-
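-Using =zpool import -R /mnt= for chroot work can permanently change dataset mountpoints. The -R flag sets altroot, which is itself temporary, but a mountpoint set explicitly with =zfs set= during that session is stored on the dataset and persists. Here the datasets kept the /mnt prefix after the pool was exported.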
-
-Fix after chroot work:
-#+BEGIN_SRC bash
-zfs set mountpoint=/ zroot/ROOT/default
-zfs set mountpoint=/home zroot/home
-# ... etc for all datasets
-#+END_SRC
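-
-Rather than typing each dataset by hand, the commands could be generated. A minimal sketch with hypothetical function names (=strip_prefix=, =print_mountpoint_fixes=): it reads tab-separated name/mountpoint pairs, as produced by =zfs list -H -o name,mountpoint=, and prints the corresponding =zfs set= commands without running them. It assumes the listing is taken with no altroot active:
-#+BEGIN_SRC bash
-# Hypothetical helper: remove a leading directory prefix from a path.
-strip_prefix() {
-    local prefix="$1" path="$2"
-    case "$path" in
-        "$prefix")   printf '/\n' ;;
-        "$prefix"/*) printf '%s\n' "${path#"$prefix"}" ;;
-        *)           printf '%s\n' "$path" ;;
-    esac
-}
-
-# Hypothetical helper: print (do not run) a zfs set command for every
-# dataset whose mountpoint still carries the prefix.
-print_mountpoint_fixes() {
-    local prefix="$1" name mountpoint fixed
-    while IFS=$'\t' read -r name mountpoint; do
-        fixed="$(strip_prefix "$prefix" "$mountpoint")"
-        [[ "$fixed" != "$mountpoint" ]] && printf 'zfs set mountpoint=%s %s\n' "$fixed" "$name"
-    done
-    return 0
-}
-
-# Example use; review the output before running any of it:
-#   zfs list -H -o name,mountpoint -r zroot | print_mountpoint_fixes /mnt
-#+END_SRC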
-
-** 4. Hostid Consistency Required
-
-ZFS pools track which system last accessed them. If hostid changes (e.g., between live ISO and installed system), import fails with "pool was previously in use from another system."
-
-Solutions:
-- Clean export before switching systems (zpool export)
-- Force import (zpool import -f)
-- Ensure consistent hostid via /etc/hostid and spl.spl_hostid kernel parameter
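-
-A minimal sketch of a consistency check between the two hostid sources. The function name =cmdline_hostid= is hypothetical; it extracts the =spl.spl_hostid= value from a kernel command line string and normalizes it to eight hex digits, the form the coreutils =hostid= command prints:
-#+BEGIN_SRC bash
-# Hypothetical helper: print the spl.spl_hostid value from a kernel command
-# line as 8 lowercase hex digits; fail if the parameter is absent.
-cmdline_hostid() {
-    local -a args
-    local arg value
-    read -r -a args <<< "$1"
-    for arg in "${args[@]}"; do
-        [[ "$arg" == spl.spl_hostid=* ]] || continue
-        value="${arg#spl.spl_hostid=}"
-        [[ "$value" == 0x* ]] && value=$(( 16#${value#0x} ))
-        printf '%08x\n' "$value"
-        return 0
-    done
-    return 1
-}
-
-# Example use on the booted system:
-#   [[ "$(cmdline_hostid "$(</proc/cmdline)")" == "$(hostid)" ]] && echo "hostid consistent"
-#+END_SRC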
-
-* Files Modified on Ratio
-
-- /etc/mkinitcpio.conf - Fixed HOOKS
-- /etc/mkinitcpio.conf.d/archiso.conf - Removed (was overriding HOOKS)
-- /etc/default/grub - GRUB_TIMEOUT=5 (was 0)
-- /boot/grub/grub.cfg - Regenerated, added TEST label to mainline kernel
-- /etc/hostid - Regenerated to match GRUB hostid parameter
-- ZFS dataset mountpoints - Reset from /mnt/* to /*
-
-* Packages Installed
-
-- linux-firmware 20260110-1 (critical fix)
-- linux 6.15.2 + zfs-linux (available as TEST kernel, not needed for boot)
-- Various system packages updated during troubleshooting
-
-* Resources Referenced
-
-** Framework Community Posts
-- https://community.frame.work/t/attn-critical-bugs-in-amdgpu-driver-included-with-kernel-6-18-x-6-19-x/79221
-- https://community.frame.work/t/fyi-linux-firmware-amdgpu-20251125-breaks-rocm-on-ai-max-395-8060s/78554
-- https://github.com/FrameworkComputer/SoftwareFirmwareIssueTracker/issues/162
-
-** Donato Capitella Video
-- Title: "ROCm+Linux Support on Strix Halo: It's finally stable in 2026!"
-- Key info: Firmware 20260110+ required, kernel 6.18.4+ for ROCm stability
-- Transcript saved: assets/Donato Capitella-ROCm+Linux Support on Strix Halo...txt
-
-** Other
-- Arch Linux Archive (for kernel 6.15.2 package)
-- Jeff Geerling blog (VRAM allocation on AMD APUs)
-
-* TODO Items Created
-
-Added to todo.org:
-- [#A] Fix ZFS rollback breaking boot (/boot not on ZFS)
-- Links to existing [#A] Integrate ZFSBootMenu task
-
-* Test Kernel Available
-
-The TEST kernel (linux 6.15.2) is installed and available in GRUB Advanced menu. It has matching zfs-linux modules and should work if needed. The mainline kernel may be useful for:
-- ROCm/AI workloads (combined with ROCm 7.2+ when released)
-- Future GPU stability improvements
-- Testing newer kernel features
-
-Current recommendation: Use linux-lts for stability, TEST kernel for experimentation.
-
-* Post-Fix Configuration (Phase 6)
-
-After boot was working, made kernel 6.15.2 the default with a clean GRUB menu.
-
-** Made 6.15.2 Default
-
-1. Created custom GRUB script /etc/grub.d/09_custom_kernels
-2. Added clean menu entries:
- - "Linux 6.15.2 (default)"
- - "Linux LTS 6.12.66 (fallback)"
-3. Set GRUB_DEFAULT="linux-6.15.2"
-
-** Pinned Kernel
-
-Added to /etc/pacman.conf:
-#+BEGIN_SRC
-IgnorePkg = linux
-#+END_SRC
-
-This prevents pacman from upgrading the linux package until it is manually unpinned.
-
-** GRUB UUID Issue
-
-Initial custom script used wrong boot partition UUID:
-- nvme0n1p1: 6A4B-47A4 (wrong - got this from lsblk on first NVMe)
-- nvme1n1p1: 6A4A-93B1 (correct - actually mounted at /boot)
-
-Fix: Updated /etc/grub.d/09_custom_kernels to use 6A4A-93B1
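-
-With an EFI partition on each NVMe, lsblk alone does not show which one backs /boot. A minimal sketch of a check using findmnt (util-linux) to ask for the mounted partition directly. Both function names (=boot_uuid=, =script_uses_uuid=) are hypothetical:
-#+BEGIN_SRC bash
-# Hypothetical helper: print the filesystem UUID of whatever is mounted at
-# the given mountpoint (default /boot).
-boot_uuid() {
-    findmnt --noheadings --output UUID --mountpoint "${1:-/boot}"
-}
-
-# Hypothetical helper: succeed only if the UUID is non-empty and appears
-# literally in the given script.
-script_uses_uuid() {
-    local uuid="$1" script="$2"
-    [[ -n "$uuid" ]] && grep -qF -- "$uuid" "$script"
-}
-
-# Example use:
-#   script_uses_uuid "$(boot_uuid)" /etc/grub.d/09_custom_kernels || echo "wrong boot UUID in script"
-#+END_SRC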
-
-** Final GRUB Menu
-
-#+BEGIN_SRC
-1. Linux 6.15.2 (default) <- Boots automatically
-2. Linux LTS 6.12.66 (fallback)
-3. Arch Linux (ZFS) Linux <- Auto-generated (ignored)
-4. Advanced options...
-5. UEFI Firmware Settings
-6. ZFS Snapshots
-#+END_SRC
-
-** SSH Keys
-
-Configured SSH key authentication for cjennings@ratio.local to simplify remote access.
-Password auth (sshpass) was unreliable from Claude's session.
-
-* Final System State
-
-#+BEGIN_SRC
-Hostname: ratio.local
-Kernel: 6.15.2-arch1-1 (default)
-Fallback: 6.12.66-1-lts
-Firmware: linux-firmware 20260110-1
-ZFS: All pools healthy, 11 datasets mounted
-Boot: Custom GRUB menu with clean entries
-Kernel pinned: Yes (IgnorePkg = linux)
-#+END_SRC
-
-* When to Unpin Kernel
-
-Unpin the linux package once the linux-lts version is >= 6.15:
-#+BEGIN_SRC bash
-sudo sed -i 's/^IgnorePkg = linux/#IgnorePkg = linux/' /etc/pacman.conf
-#+END_SRC
-
-Then optionally switch back to LTS as default for stability.