From 197a8036af21232276cfbd9624d9eeeebe722df6 Mon Sep 17 00:00:00 2001
From: Craig Jennings
Date: Thu, 22 Jan 2026 15:45:07 -0600
Subject: Document Strix Halo VPE/CWSR freeze issues and workarounds

- Add instructions for applying pg_mask=0 and cwsr_enable=0 workarounds
- Document that kernel 6.18.x has critical bugs, stay on 6.15.x-6.17.x
- Add session docs, mkinitcpio fixes, and Donato Capitella video transcript
- Add PRINCIPLES.org for behavioral lessons learned
- Update protocols.org from template
---
 inbox/instructions.txt | 224 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 224 insertions(+)
 create mode 100644 inbox/instructions.txt

(limited to 'inbox')

diff --git a/inbox/instructions.txt b/inbox/instructions.txt
new file mode 100644
index 0000000..d6b8461
--- /dev/null
+++ b/inbox/instructions.txt
@@ -0,0 +1,224 @@
+AMD Strix Halo VPE/CWSR Freeze Fix Instructions
+===============================================
+Created: 2026-01-22
+Machine: ratio (Framework Desktop, AMD Ryzen AI Max 300)
+
+PROBLEM SUMMARY
+---------------
+Two AMD GPU bugs cause random freezes on Strix Halo:
+
+1. VPE Power Gating Bug
+   - VPE (Video Processing Engine) tries to power gate after 1 second idle
+   - SMU hangs, system freezes
+   - Fix: amdgpu.pg_mask=0 (disables power gating)
+
+2. CWSR Bug (Compute Wavefront Save/Restore)
+   - MES firmware hang under compute loads
+   - Causes GPU reset loops and crashes
+   - Fix: amdgpu.cwsr_enable=0
+
+Current state on ratio:
+- pg_mask = 4294967295 (power gating ENABLED - bad)
+- cwsr_enable = 1 (CWSR ENABLED - bad)
+- Neither workaround is applied
+
+
+PART 1: GRUB CMDLINE FIX (Quick, can do now)
+============================================
+This adds the parameters to the kernel command line via GRUB.
+Can be done on the running system, takes effect on next boot.
+
+Step 1: Edit GRUB defaults
+--------------------------
+sudo nano /etc/default/grub
+
+Find the line:
+GRUB_CMDLINE_LINUX_DEFAULT="..."
+
+Add these parameters (keep existing ones):
+GRUB_CMDLINE_LINUX_DEFAULT="... amdgpu.pg_mask=0 amdgpu.cwsr_enable=0"
+
+Example - if current line is:
+GRUB_CMDLINE_LINUX_DEFAULT="loglevel=2 rd.systemd.show_status=auto rd.udev.log_level=2 nvme.noacpi=1 mem_sleep_default=deep nowatchdog random.trust_cpu=off quiet splash"
+
+Change to:
+GRUB_CMDLINE_LINUX_DEFAULT="loglevel=2 rd.systemd.show_status=auto rd.udev.log_level=2 nvme.noacpi=1 mem_sleep_default=deep nowatchdog random.trust_cpu=off quiet splash amdgpu.pg_mask=0 amdgpu.cwsr_enable=0"
+
+Step 2: Regenerate GRUB config
+------------------------------
+sudo grub-mkconfig -o /boot/grub/grub.cfg
+
+Step 3: Reboot and verify
+-------------------------
+sudo reboot
+
+After reboot, verify:
+cat /sys/module/amdgpu/parameters/pg_mask
+# Should show: 0
+
+cat /sys/module/amdgpu/parameters/cwsr_enable
+# Should show: 0
+
+grep -oE "(pg_mask|cwsr_enable)=[^ ]+" /proc/cmdline
+# Should show:
+# pg_mask=0
+# cwsr_enable=0
+
+
+PART 2: MODPROBE.D FIX (Permanent, requires live ISO)
+=====================================================
+This embeds the parameters in the initramfs so they're always applied.
+MUST be done from live ISO because mkinitcpio triggers the freeze.
+
+Step 1: Boot archzfs live ISO
+-----------------------------
+- Boot from USB with archzfs ISO
+- Get to root shell
+
+Step 2: Import and mount ZFS
+----------------------------
+zpool import -f -R /mnt zroot # -R /mnt sets an altroot so datasets mount under /mnt
+zfs mount zroot/ROOT/default
+mount /dev/nvme1n1p1 /mnt/boot # Note: nvme1n1p1, not nvme0n1p1!
+
+Verify:
+ls /mnt/boot/vmlinuz*
+# Should show kernel images
+
+Step 3: Create modprobe config
+------------------------------
+cat > /mnt/etc/modprobe.d/amdgpu.conf << 'EOF'
+# Workarounds for AMD Strix Halo GPU bugs
+# Created: 2026-01-22
+# Remove when kernel has proper fixes (check linux-lts >= 6.18 with fixes)
+
+# Disable power gating to prevent VPE freeze
+# VPE tries to power gate after 1s idle, causing SMU hang
+options amdgpu pg_mask=0
+
+# Disable Compute Wavefront Save/Restore to prevent MES hang
+# CWSR causes MES firmware 0x80 hang under compute loads
+options amdgpu cwsr_enable=0
+EOF
+
+Step 4: Chroot and rebuild initramfs
+------------------------------------
+# Mount system directories
+mount --rbind /dev /mnt/dev
+mount --rbind /sys /mnt/sys
+mount --rbind /proc /mnt/proc
+mount --rbind /run /mnt/run
+
+# Chroot
+arch-chroot /mnt
+
+# Rebuild initramfs (this is safe from live ISO)
+mkinitcpio -P
+
+# Verify amdgpu.conf is in initramfs
+lsinitcpio /boot/initramfs-linux.img | grep amdgpu
+# Should show: etc/modprobe.d/amdgpu.conf
+
+# Exit chroot
+exit
+
+Step 5: Clean up and reboot
+---------------------------
+# Unmount everything (including /mnt/boot, or the ZFS unmount reports busy)
+umount -R /mnt/boot /mnt/dev /mnt/sys /mnt/proc /mnt/run
+zfs unmount -a
+zpool export zroot
+
+# Reboot
+reboot
+
+Step 6: Verify after reboot
+---------------------------
+cat /sys/module/amdgpu/parameters/pg_mask
+# Should show: 0
+
+cat /sys/module/amdgpu/parameters/cwsr_enable
+# Should show: 0
+
+lsinitcpio /boot/initramfs-linux.img | grep amdgpu.conf
+# Should show: etc/modprobe.d/amdgpu.conf
+
+
+VERIFICATION CHECKLIST
+======================
+After applying fixes, verify:
+
+[ ] pg_mask shows 0 (not 4294967295)
+[ ] cwsr_enable shows 0 (not 1)
+[ ] Parameters visible in /proc/cmdline (if using GRUB method)
+[ ] amdgpu.conf in initramfs (if using modprobe.d method)
+[ ] System stable - no freezes during idle
+[ ] mkinitcpio -P completes without freeze (test after fix applied)
+
+
+IMPORTANT NOTES
+===============
+
+1. Boot partition UUID
+   ratio has mirrored NVMe drives. The boot partition is on nvme1n1p1:
+   - nvme0n1p1: 6A4B-47A4 (NOT the boot partition)
+   - nvme1n1p1: 6A4A-93B1 (THIS is /boot)
+
+2. Kernel is pinned
+   /etc/pacman.conf has: IgnorePkg = linux
+   This prevents upgrading from 6.15.2 until manually unpinned.
+   DO NOT upgrade to 6.18.x - it has worse bugs for Strix Halo.
+
+3. When to remove workarounds
+   Monitor Framework Community and AMD-gfx mailing list for proper fixes.
+   When linux-lts has confirmed VPE and CWSR fixes, can try removing.
+   Test by commenting out lines in amdgpu.conf, rebuild initramfs, test.
+
+4. If system freezes during mkinitcpio
+   This means the fix isn't active yet. Must do from live ISO.
+   The modconf hook reads /etc/modprobe.d/ at build time, but the
+   running kernel still has the old parameters until reboot.
+
+
+TROUBLESHOOTING
+===============
+
+System still freezes after GRUB fix:
+- Check /proc/cmdline - are parameters there?
+- Check /sys/module/amdgpu/parameters/* - are values correct?
+- If cmdline has them but sysfs doesn't, driver may have loaded before
+  parsing. Need modprobe.d method instead.
+
+Can't import zpool from live ISO:
+- Try: zpool import -f zroot
+- If "pool was previously in use": the -f flag above overrides that check
+- Check hostid: cat /etc/hostid on installed system
+
+mkinitcpio says "Preset not found":
+- Check /etc/mkinitcpio.d/*.preset files exist
+- For linux kernel: linux.preset
+- For linux-lts: linux-lts.preset
+
+After chroot, wrong mountpoints:
+- Reset mountpoints after any chroot work:
+  zfs set mountpoint=/ zroot/ROOT/default
+  zfs set mountpoint=/home zroot/home
+  (etc. for all datasets)
+
+
+REFERENCES
+==========
+
+VPE timeout patch (not merged):
+https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg127724.html
+
+Framework Community - critical 6.18 bugs:
+https://community.frame.work/t/attn-critical-bugs-in-amdgpu-driver-included-with-kernel-6-18-x-6-19-x/79221
+
+CWSR workaround:
+https://github.com/ROCm/ROCm/issues/5590
+
+Session documentation:
+- docs/2026-01-22-ratio-boot-fix-session.org
+- docs/2026-01-22-mkinitcpio-config-boot-failure.org
+- assets/2026-01-22-mkinitcpio-freeze-during-rebuild.org
-- 
cgit v1.2.3
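
The first two items of the verification checklist in inbox/instructions.txt
can be scripted. What follows is a minimal sketch and is not part of the
commit: the function name check_amdgpu_params and its optional directory
argument are assumptions made for illustration. The expected values
(pg_mask=0, cwsr_enable=0) and the default sysfs path are taken from the
instructions.

```shell
#!/bin/sh
# Sketch: report whether the amdgpu workaround parameters are active.
# check_amdgpu_params is a hypothetical helper name, not an existing tool.
# $1 (optional): directory holding the parameter files; defaults to the
# sysfs path used in the verification steps.
check_amdgpu_params() {
    param_dir="${1:-/sys/module/amdgpu/parameters}"
    status=0
    for pair in pg_mask=0 cwsr_enable=0; do
        name="${pair%%=*}"   # text before the first '='
        want="${pair#*=}"    # text after the first '='
        file="$param_dir/$name"
        if [ ! -r "$file" ]; then
            echo "MISSING $name ($file not readable)"
            status=1
            continue
        fi
        got="$(cat "$file")"
        if [ "$got" = "$want" ]; then
            echo "OK      $name=$got"
        else
            echo "BAD     $name=$got (expected $want)"
            status=1
        fi
    done
    return "$status"
}
```

Calling check_amdgpu_params with no argument reads the live sysfs values.
It returns non-zero if either parameter is missing or differs from the
expected value, so it could gate a later mkinitcpio -P run.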