| author | Craig Jennings <c@cjennings.net> | 2026-02-22 23:20:56 -0600 |
|---|---|---|
| committer | Craig Jennings <c@cjennings.net> | 2026-02-22 23:20:56 -0600 |
| commit | 5e6877e8f3fb552fce3367ff273167d2cf6af75f (patch) | |
| tree | 909f98edbbb940aafb95de02457d4d6f7db3cba4 /docs/2026-01-22-ratio-amd-gpu-freeze-fix-instructions.org | |
| parent | b104dde43fcc717681a8733a977eb528c60eb13f (diff) | |
chore: add docs/ to .gitignore and untrack personal files
docs/ contains session history, personal workflows, and private
protocols that shouldn't be in a public repository.
Diffstat (limited to 'docs/2026-01-22-ratio-amd-gpu-freeze-fix-instructions.org')
| -rw-r--r-- | docs/2026-01-22-ratio-amd-gpu-freeze-fix-instructions.org | 224 |
1 files changed, 0 insertions, 224 deletions
diff --git a/docs/2026-01-22-ratio-amd-gpu-freeze-fix-instructions.org b/docs/2026-01-22-ratio-amd-gpu-freeze-fix-instructions.org
deleted file mode 100644
index d6b8461..0000000
--- a/docs/2026-01-22-ratio-amd-gpu-freeze-fix-instructions.org
+++ /dev/null
@@ -1,224 +0,0 @@
-AMD Strix Halo VPE/CWSR Freeze Fix Instructions
-===============================================
-Created: 2026-01-22
-Machine: ratio (Framework Desktop, AMD Ryzen AI Max 300)
-
-PROBLEM SUMMARY
----------------
-Two AMD GPU bugs cause random freezes on Strix Halo:
-
-1. VPE Power Gating Bug
-   - VPE (Video Processing Engine) tries to power gate after 1 second idle
-   - SMU hangs, system freezes
-   - Fix: amdgpu.pg_mask=0 (disables power gating)
-
-2. CWSR Bug (Compute Wavefront Save/Restore)
-   - MES firmware hang under compute loads
-   - Causes GPU reset loops and crashes
-   - Fix: amdgpu.cwsr_enable=0
-
-Current state on ratio:
-- pg_mask = 4294967295 (power gating ENABLED - bad)
-- cwsr_enable = 1 (CWSR ENABLED - bad)
-- Neither workaround is applied
-
-
-PART 1: GRUB CMDLINE FIX (Quick, can do now)
-============================================
-This adds the parameters to the kernel command line via GRUB.
-Can be done on the running system, takes effect on next boot.
-
-Step 1: Edit GRUB defaults
---------------------------
-sudo nano /etc/default/grub
-
-Find the line:
-GRUB_CMDLINE_LINUX_DEFAULT="..."
-
-Add these parameters (keep existing ones):
-GRUB_CMDLINE_LINUX_DEFAULT="... amdgpu.pg_mask=0 amdgpu.cwsr_enable=0"
-
-Example - if current line is:
-GRUB_CMDLINE_LINUX_DEFAULT="loglevel=2 rd.systemd.show_status=auto rd.udev.log_level=2 nvme.noacpi=1 mem_sleep_default=deep nowatchdog random.trust_cpu=off quiet splash"
-
-Change to:
-GRUB_CMDLINE_LINUX_DEFAULT="loglevel=2 rd.systemd.show_status=auto rd.udev.log_level=2 nvme.noacpi=1 mem_sleep_default=deep nowatchdog random.trust_cpu=off quiet splash amdgpu.pg_mask=0 amdgpu.cwsr_enable=0"
-
-Step 2: Regenerate GRUB config
-------------------------------
-sudo grub-mkconfig -o /boot/grub/grub.cfg
-
-Step 3: Reboot and verify
--------------------------
-sudo reboot
-
-After reboot, verify:
-cat /sys/module/amdgpu/parameters/pg_mask
-# Should show: 0
-
-cat /sys/module/amdgpu/parameters/cwsr_enable
-# Should show: 0
-
-cat /proc/cmdline | grep -oE "(pg_mask|cwsr_enable)=[^ ]+"
-# Should show:
-# pg_mask=0
-# cwsr_enable=0
-
-
-PART 2: MODPROBE.D FIX (Permanent, requires live ISO)
-=====================================================
-This embeds the parameters in the initramfs so they're always applied.
-MUST be done from live ISO because mkinitcpio triggers the freeze.
-
-Step 1: Boot archzfs live ISO
------------------------------
-- Boot from USB with archzfs ISO
-- Get to root shell
-
-Step 2: Import and mount ZFS
-----------------------------
-zpool import -f zroot
-zfs mount zroot/ROOT/default
-mount /dev/nvme1n1p1 /mnt/boot # Note: nvme1n1p1, not nvme0n1p1!
-
-Verify:
-ls /mnt/boot/vmlinuz*
-# Should show kernel images
-
-Step 3: Create modprobe config
-------------------------------
-cat > /mnt/etc/modprobe.d/amdgpu.conf << 'EOF'
-# Workarounds for AMD Strix Halo GPU bugs
-# Created: 2026-01-22
-# Remove when kernel has proper fixes (check linux-lts >= 6.18 with fixes)
-
-# Disable power gating to prevent VPE freeze
-# VPE tries to power gate after 1s idle, causing SMU hang
-options amdgpu pg_mask=0
-
-# Disable Compute Wavefront Save/Restore to prevent MES hang
-# CWSR causes MES firmware 0x80 hang under compute loads
-options amdgpu cwsr_enable=0
-EOF
-
-Step 4: Chroot and rebuild initramfs
-------------------------------------
-# Mount system directories
-mount --rbind /dev /mnt/dev
-mount --rbind /sys /mnt/sys
-mount --rbind /proc /mnt/proc
-mount --rbind /run /mnt/run
-
-# Chroot
-arch-chroot /mnt
-
-# Rebuild initramfs (this is safe from live ISO)
-mkinitcpio -P
-
-# Verify amdgpu.conf is in initramfs
-lsinitcpio /boot/initramfs-linux.img | grep amdgpu
-# Should show: etc/modprobe.d/amdgpu.conf
-
-# Exit chroot
-exit
-
-Step 5: Clean up and reboot
----------------------------
-# Unmount everything
-umount -R /mnt/dev /mnt/sys /mnt/proc /mnt/run
-zfs unmount -a
-zpool export zroot
-
-# Reboot
-reboot
-
-Step 6: Verify after reboot
----------------------------
-cat /sys/module/amdgpu/parameters/pg_mask
-# Should show: 0
-
-cat /sys/module/amdgpu/parameters/cwsr_enable
-# Should show: 0
-
-lsinitcpio /boot/initramfs-linux.img | grep amdgpu.conf
-# Should show: etc/modprobe.d/amdgpu.conf
-
-
-VERIFICATION CHECKLIST
-======================
-After applying fixes, verify:
-
-[ ] pg_mask shows 0 (not 4294967295)
-[ ] cwsr_enable shows 0 (not 1)
-[ ] Parameters visible in /proc/cmdline (if using GRUB method)
-[ ] amdgpu.conf in initramfs (if using modprobe.d method)
-[ ] System stable - no freezes during idle
-[ ] mkinitcpio -P completes without freeze (test after fix applied)
-
-
-IMPORTANT NOTES
-===============
-
-1. Boot partition UUID
-   ratio has mirrored NVMe drives. The boot partition is on nvme1n1p1:
-   - nvme0n1p1: 6A4B-47A4 (NOT the boot partition)
-   - nvme1n1p1: 6A4A-93B1 (THIS is /boot)
-
-2. Kernel is pinned
-   /etc/pacman.conf has: IgnorePkg = linux
-   This prevents upgrading from 6.15.2 until manually unpinned.
-   DO NOT upgrade to 6.18.x - it has worse bugs for Strix Halo.
-
-3. When to remove workarounds
-   Monitor Framework Community and AMD-gfx mailing list for proper fixes.
-   When linux-lts has confirmed VPE and CWSR fixes, can try removing.
-   Test by commenting out lines in amdgpu.conf, rebuild initramfs, test.
-
-4. If system freezes during mkinitcpio
-   This means the fix isn't active yet. Must do from live ISO.
-   The modconf hook reads /etc/modprobe.d/ at build time, but the
-   running kernel still has the old parameters until reboot.
-
-
-TROUBLESHOOTING
-===============
-
-System still freezes after GRUB fix:
-- Check /proc/cmdline - are parameters there?
-- Check /sys/module/amdgpu/parameters/* - are values correct?
-- If cmdline has them but sysfs doesn't, driver may have loaded before
-  parsing. Need modprobe.d method instead.
-
-Can't import zpool from live ISO:
-- Try: zpool import -f zroot
-- If "pool was previously in use": zpool import -f zroot
-- Check hostid: cat /etc/hostid on installed system
-
-mkinitcpio says "Preset not found":
-- Check /etc/mkinitcpio.d/*.preset files exist
-- For linux kernel: linux.preset
-- For linux-lts: linux-lts.preset
-
-After chroot, wrong mountpoints:
-- Reset mountpoints after any chroot work:
-  zfs set mountpoint=/ zroot/ROOT/default
-  zfs set mountpoint=/home zroot/home
-  (etc. for all datasets)
-
-
-REFERENCES
-==========
-
-VPE timeout patch (not merged):
-https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg127724.html
-
-Framework Community - critical 6.18 bugs:
-https://community.frame.work/t/attn-critical-bugs-in-amdgpu-driver-included-with-kernel-6-18-x-6-19-x/79221
-
-CWSR workaround:
-https://github.com/ROCm/ROCm/issues/5590
-
-Session documentation:
-- docs/2026-01-22-ratio-boot-fix-session.org
-- docs/2026-01-22-mkinitcpio-config-boot-failure.org
-- assets/2026-01-22-mkinitcpio-freeze-during-rebuild.org
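
The removed document verifies both workarounds by reading the amdgpu module parameters from sysfs and expecting `0` for each. That check could be scripted roughly as below. This is only a sketch: the names `check_amdgpu_param`, `check_all`, and the `PARAM_DIR` override are hypothetical, introduced here for illustration and so the check can be pointed at a test directory. They do not appear in the removed file.

```shell
#!/bin/sh
# Sketch: confirm the amdgpu workarounds read back as expected.
# PARAM_DIR, check_amdgpu_param and check_all are illustrative names
# (assumptions); the default path is the sysfs directory the document reads.
PARAM_DIR="${PARAM_DIR:-/sys/module/amdgpu/parameters}"

# Print OK/FAIL for one parameter; return 0 only if it equals the expected value.
check_amdgpu_param() {
    name="$1"
    expected="$2"
    file="$PARAM_DIR/$name"
    if [ ! -r "$file" ]; then
        echo "FAIL: $name not readable at $file"
        return 1
    fi
    actual="$(cat "$file")"
    if [ "$actual" = "$expected" ]; then
        echo "OK: $name=$actual"
        return 0
    fi
    echo "FAIL: $name=$actual (expected $expected)"
    return 1
}

# Both parameters are expected to be 0 once either fix is active.
check_all() {
    rc=0
    check_amdgpu_param pg_mask 0 || rc=1
    check_amdgpu_param cwsr_enable 0 || rc=1
    return "$rc"
}
```

Calling `check_all` after a reboot and inspecting its exit status would cover the first two items of the verification checklist. The remaining items (cmdline, initramfs contents, stability) still need the manual steps described in the file.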
