| author | Craig Jennings <c@cjennings.net> | 2026-01-22 23:21:18 -0600 |
|---|---|---|
| committer | Craig Jennings <c@cjennings.net> | 2026-01-22 23:21:18 -0600 |
| commit | 0ffe7a85a1b024b88e4ddc3305c5f805edd6e8e1 (patch) | |
| tree | ccd6c610630cce9eef268ab692999cdfe3bb5a1b /inbox/instructions.txt | |
| parent | 197a8036af21232276cfbd9624d9eeeebe722df6 (diff) | |
| download | archangel-0ffe7a85a1b024b88e4ddc3305c5f805edd6e8e1.tar.gz archangel-0ffe7a85a1b024b88e4ddc3305c5f805edd6e8e1.zip | |
Replace GRUB with ZFSBootMenu bootloader
This is a major change that replaces the GRUB bootloader with ZFSBootMenu,
providing native ZFS boot environment support.
Key changes:
- EFI partition reduced from 1GB to 512MB (only holds ZFSBootMenu)
- EFI now mounts at /efi instead of /boot
- Kernel and initramfs live on ZFS root (enables snapshot boot with matching kernel)
- Downloads pre-built ZFSBootMenu EFI binary from get.zfsbootmenu.org
- Creates EFI boot entries for all disks in multi-disk configurations
- Syncs ZFSBootMenu to all EFI partitions for redundancy
- Sets org.zfsbootmenu:commandline on zroot/ROOT for kernel cmdline inheritance
- Sets bootfs pool property for default boot environment
- AMD GPU workarounds (pg_mask, cwsr_enable) added to kernel cmdline when AMD detected
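As an illustration of the command-line handling described in the last three items, here is a minimal sketch rather than the installer's actual code. The helper name `build_cmdline`, the base command line `rw`, and the `yes`/`no` detection flag are assumptions made for the example; the pool `zroot`, the dataset `zroot/ROOT`, and the two `amdgpu` parameters come from this commit.

```shell
#!/bin/sh
# Hypothetical helper (not from the commit): build the kernel command line.
#   $1 = base command line (assumed value in the usage below: "rw")
#   $2 = "yes" if an AMD GPU was detected, anything else otherwise
build_cmdline() {
    base=$1
    amd_detected=$2
    if [ "$amd_detected" = "yes" ]; then
        # Append the Strix Halo workarounds named in the commit message.
        printf '%s amdgpu.pg_mask=0 amdgpu.cwsr_enable=0\n' "$base"
    else
        printf '%s\n' "$base"
    fi
}

# Usage sketch (needs root and an existing pool, so it is left commented out).
# zroot/ROOT/default is an assumed boot environment name:
#   cmdline=$(build_cmdline "rw" yes)
#   zfs set org.zfsbootmenu:commandline="$cmdline" zroot/ROOT
#   zpool set bootfs=zroot/ROOT/default zroot
```

Setting the property on zroot/ROOT rather than on each boot environment means child datasets inherit it through normal ZFS property inheritance, which is the inheritance the commit message refers to.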
Deleted GRUB snapshot tooling (no longer needed):
- custom/grub-zfs-snap
- custom/40_zfs_snapshots
- custom/zz-grub-zfs-snap.hook
- custom/zfs-snap-prune
Updated helper scripts:
- zfssnapshot: removed grub-zfs-snap call, shows ZFSBootMenu tip
- zfsrollback: removed grub-zfs-snap call, notes auto-detection
Tested configurations:
- Single disk installation
- 2-disk mirror (mirror-0)
- 3-disk RAIDZ1 (raidz1-0)
- All boot correctly with ZFSBootMenu
Diffstat (limited to 'inbox/instructions.txt')
| -rw-r--r-- | inbox/instructions.txt | 224 |
1 files changed, 0 insertions, 224 deletions
diff --git a/inbox/instructions.txt b/inbox/instructions.txt
deleted file mode 100644
index d6b8461..0000000
--- a/inbox/instructions.txt
+++ /dev/null
@@ -1,224 +0,0 @@
-AMD Strix Halo VPE/CWSR Freeze Fix Instructions
-===============================================
-Created: 2026-01-22
-Machine: ratio (Framework Desktop, AMD Ryzen AI Max 300)
-
-PROBLEM SUMMARY
----------------
-Two AMD GPU bugs cause random freezes on Strix Halo:
-
-1. VPE Power Gating Bug
- - VPE (Video Processing Engine) tries to power gate after 1 second idle
- - SMU hangs, system freezes
- - Fix: amdgpu.pg_mask=0 (disables power gating)
-
-2. CWSR Bug (Compute Wavefront Save/Restore)
- - MES firmware hang under compute loads
- - Causes GPU reset loops and crashes
- - Fix: amdgpu.cwsr_enable=0
-
-Current state on ratio:
-- pg_mask = 4294967295 (power gating ENABLED - bad)
-- cwsr_enable = 1 (CWSR ENABLED - bad)
-- Neither workaround is applied
-
-
-PART 1: GRUB CMDLINE FIX (Quick, can do now)
-============================================
-This adds the parameters to the kernel command line via GRUB.
-Can be done on the running system, takes effect on next boot.
-
-Step 1: Edit GRUB defaults
---------------------------
-sudo nano /etc/default/grub
-
-Find the line:
-GRUB_CMDLINE_LINUX_DEFAULT="..."
-
-Add these parameters (keep existing ones):
-GRUB_CMDLINE_LINUX_DEFAULT="... amdgpu.pg_mask=0 amdgpu.cwsr_enable=0"
-
-Example - if current line is:
-GRUB_CMDLINE_LINUX_DEFAULT="loglevel=2 rd.systemd.show_status=auto rd.udev.log_level=2 nvme.noacpi=1 mem_sleep_default=deep nowatchdog random.trust_cpu=off quiet splash"
-
-Change to:
-GRUB_CMDLINE_LINUX_DEFAULT="loglevel=2 rd.systemd.show_status=auto rd.udev.log_level=2 nvme.noacpi=1 mem_sleep_default=deep nowatchdog random.trust_cpu=off quiet splash amdgpu.pg_mask=0 amdgpu.cwsr_enable=0"
-
-Step 2: Regenerate GRUB config
-------------------------------
-sudo grub-mkconfig -o /boot/grub/grub.cfg
-
-Step 3: Reboot and verify
--------------------------
-sudo reboot
-
-After reboot, verify:
-cat /sys/module/amdgpu/parameters/pg_mask
-# Should show: 0
-
-cat /sys/module/amdgpu/parameters/cwsr_enable
-# Should show: 0
-
-cat /proc/cmdline | grep -oE "(pg_mask|cwsr_enable)=[^ ]+"
-# Should show:
-# pg_mask=0
-# cwsr_enable=0
-
-
-PART 2: MODPROBE.D FIX (Permanent, requires live ISO)
-=====================================================
-This embeds the parameters in the initramfs so they're always applied.
-MUST be done from live ISO because mkinitcpio triggers the freeze.
-
-Step 1: Boot archzfs live ISO
------------------------------
-- Boot from USB with archzfs ISO
-- Get to root shell
-
-Step 2: Import and mount ZFS
-----------------------------
-zpool import -f zroot
-zfs mount zroot/ROOT/default
-mount /dev/nvme1n1p1 /mnt/boot # Note: nvme1n1p1, not nvme0n1p1!
-
-Verify:
-ls /mnt/boot/vmlinuz*
-# Should show kernel images
-
-Step 3: Create modprobe config
-------------------------------
-cat > /mnt/etc/modprobe.d/amdgpu.conf << 'EOF'
-# Workarounds for AMD Strix Halo GPU bugs
-# Created: 2026-01-22
-# Remove when kernel has proper fixes (check linux-lts >= 6.18 with fixes)
-
-# Disable power gating to prevent VPE freeze
-# VPE tries to power gate after 1s idle, causing SMU hang
-options amdgpu pg_mask=0
-
-# Disable Compute Wavefront Save/Restore to prevent MES hang
-# CWSR causes MES firmware 0x80 hang under compute loads
-options amdgpu cwsr_enable=0
-EOF
-
-Step 4: Chroot and rebuild initramfs
-------------------------------------
-# Mount system directories
-mount --rbind /dev /mnt/dev
-mount --rbind /sys /mnt/sys
-mount --rbind /proc /mnt/proc
-mount --rbind /run /mnt/run
-
-# Chroot
-arch-chroot /mnt
-
-# Rebuild initramfs (this is safe from live ISO)
-mkinitcpio -P
-
-# Verify amdgpu.conf is in initramfs
-lsinitcpio /boot/initramfs-linux.img | grep amdgpu
-# Should show: etc/modprobe.d/amdgpu.conf
-
-# Exit chroot
-exit
-
-Step 5: Clean up and reboot
----------------------------
-# Unmount everything
-umount -R /mnt/dev /mnt/sys /mnt/proc /mnt/run
-zfs unmount -a
-zpool export zroot
-
-# Reboot
-reboot
-
-Step 6: Verify after reboot
----------------------------
-cat /sys/module/amdgpu/parameters/pg_mask
-# Should show: 0
-
-cat /sys/module/amdgpu/parameters/cwsr_enable
-# Should show: 0
-
-lsinitcpio /boot/initramfs-linux.img | grep amdgpu.conf
-# Should show: etc/modprobe.d/amdgpu.conf
-
-
-VERIFICATION CHECKLIST
-======================
-After applying fixes, verify:
-
-[ ] pg_mask shows 0 (not 4294967295)
-[ ] cwsr_enable shows 0 (not 1)
-[ ] Parameters visible in /proc/cmdline (if using GRUB method)
-[ ] amdgpu.conf in initramfs (if using modprobe.d method)
-[ ] System stable - no freezes during idle
-[ ] mkinitcpio -P completes without freeze (test after fix applied)
-
-
-IMPORTANT NOTES
-===============
-
-1. Boot partition UUID
- ratio has mirrored NVMe drives. The boot partition is on nvme1n1p1:
- - nvme0n1p1: 6A4B-47A4 (NOT the boot partition)
- - nvme1n1p1: 6A4A-93B1 (THIS is /boot)
-
-2. Kernel is pinned
- /etc/pacman.conf has: IgnorePkg = linux
- This prevents upgrading from 6.15.2 until manually unpinned.
- DO NOT upgrade to 6.18.x - it has worse bugs for Strix Halo.
-
-3. When to remove workarounds
- Monitor Framework Community and AMD-gfx mailing list for proper fixes.
- When linux-lts has confirmed VPE and CWSR fixes, can try removing.
- Test by commenting out lines in amdgpu.conf, rebuild initramfs, test.
-
-4. If system freezes during mkinitcpio
- This means the fix isn't active yet. Must do from live ISO.
- The modconf hook reads /etc/modprobe.d/ at build time, but the
- running kernel still has the old parameters until reboot.
-
-
-TROUBLESHOOTING
-===============
-
-System still freezes after GRUB fix:
-- Check /proc/cmdline - are parameters there?
-- Check /sys/module/amdgpu/parameters/* - are values correct?
-- If cmdline has them but sysfs doesn't, driver may have loaded before
- parsing. Need modprobe.d method instead.
-
-Can't import zpool from live ISO:
-- Try: zpool import -f zroot
-- If "pool was previously in use": zpool import -f zroot
-- Check hostid: cat /etc/hostid on installed system
-
-mkinitcpio says "Preset not found":
-- Check /etc/mkinitcpio.d/*.preset files exist
-- For linux kernel: linux.preset
-- For linux-lts: linux-lts.preset
-
-After chroot, wrong mountpoints:
-- Reset mountpoints after any chroot work:
- zfs set mountpoint=/ zroot/ROOT/default
- zfs set mountpoint=/home zroot/home
- (etc. for all datasets)
-
-
-REFERENCES
-==========
-
-VPE timeout patch (not merged):
-https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg127724.html
-
-Framework Community - critical 6.18 bugs:
-https://community.frame.work/t/attn-critical-bugs-in-amdgpu-driver-included-with-kernel-6-18-x-6-19-x/79221
-
-CWSR workaround:
-https://github.com/ROCm/ROCm/issues/5590
-
-Session documentation:
-- docs/2026-01-22-ratio-boot-fix-session.org
-- docs/2026-01-22-mkinitcpio-config-boot-failure.org
-- assets/2026-01-22-mkinitcpio-freeze-during-rebuild.org
