From bf6eef6183df6051b2423c7850c230406861f927 Mon Sep 17 00:00:00 2001
From: Craig Jennings
Date: Sat, 31 Jan 2026 16:23:00 -0600
Subject: docs: add new workflows and AMD GPU workaround

- Add email workflow (msmtp direct sending)
- Add assemble-email workflow (document gathering for manual send)
- Add retrospective workflow
- Add AMD GPU suspend workaround notes
---
 ...2026-01-27-ratio-amd-gpu-suspend-workaround.org | 217 +++++++++++++++++++++
 1 file changed, 217 insertions(+)
 create mode 100644 docs/2026-01-27-ratio-amd-gpu-suspend-workaround.org

(limited to 'docs/2026-01-27-ratio-amd-gpu-suspend-workaround.org')

diff --git a/docs/2026-01-27-ratio-amd-gpu-suspend-workaround.org b/docs/2026-01-27-ratio-amd-gpu-suspend-workaround.org
new file mode 100644
index 0000000..46e403d
--- /dev/null
+++ b/docs/2026-01-27-ratio-amd-gpu-suspend-workaround.org
@@ -0,0 +1,217 @@
+#+TITLE: Ratio AMD GPU Suspend Freeze - Workaround & Fix Tracking
+#+DATE: 2026-01-27
+
+* Summary
+
+Ratio (Framework Desktop, AMD Ryzen AI Max / Strix Halo) freezes hard on
+resume from suspend due to a VPE power gating race condition in the amdgpu
+driver. The freeze requires a hard power cycle, which causes journal
+corruption and can leave the btrfs filesystem read-only.
+
+As of 2026-01-27, the proper kernel fix exists (merged in 6.18) but is
+unusable due to separate CWSR bugs in 6.18+. Ratio runs kernel 6.12 LTS,
+which does not have the fix and will not receive a backport.
+
+A systemd suspend mask is applied as a workaround to prevent the system from
+ever entering the suspend/resume path.
+
+* The Bug
+
+** What Happens
+
+~8% of suspend/resume cycles on Strix Halo result in a hard system freeze
+approximately 1 second after the screen turns on during resume.
+
+** Root Cause: VPE Power Gating Race Condition
+
+The freeze is caused by a race condition in the amdgpu driver's VPE (Video
+Processing Engine) power management during resume:
+
+1. System resumes from suspend.
+2. amdgpu schedules =amdgpu_device_delayed_init_work_handler= (2s delay) to
+   run self-tests, including =vpe_ring_test_ib= which briefly powers on VPE.
+3. The ring buffer test is very short. VPE goes idle.
+4. After 1 second of idle, =vpe_idle_work_handler= fires and tells the SMU
+   (System Management Unit) to power gate (shut down) VPE.
+5. *But VPE is still at a high DPM level.* Newer VPE firmware only drops DPM
+   back to the lowest level (DPM0) after a workload has run for 2+ seconds.
+   The ring buffer test was too short to trigger that drop.
+6. The SMU tries to power gate VPE while it's at a high DPM level. On Strix
+   Halo, this hangs the SMU.
+7. The SMU hang cascades -- VCN, JPEG, and other GPU IPs can't be managed.
+   Half the GPU is frozen.
+8. The thread that issued the SMU command is stuck. System is locked up.
+   No further logging is possible.
+
+It only triggers on resume because that's when the driver runs the ring
+buffer self-test. During normal operation, VPE either isn't used or has had
+enough time to settle its DPM level before power gating.
+
+** Error Messages (if visible before freeze)
+
+#+begin_example
+SMU: I'm not done with your previous command
+Failed to power gate VPE!
+Dpm disable vpe failed, ret = -62
+Failed to power gate JPEG
+Failed to power gate VCN instance 0
+Dpm disable uvd failed
+#+end_example
+
+** References
+
+- [[https://lkml.org/lkml/2025/8/24/139][Original VPE_IDLE_TIMEOUT patch (LKML, Aug 2025)]]
+- [[https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg130657.html][VPE DPM0 fix v5 (amd-gfx, Oct 2025)]]
+- [[https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg130804.html][Follow-up: missing return statement fix]]
+- [[https://gitlab.freedesktop.org/drm/amd/-/issues/4615][Freedesktop bug #4615]]
+- [[https://community.frame.work/t/attn-critical-bugs-in-amdgpu-driver-included-with-kernel-6-18-x-6-19-x/79221][Framework Community: Critical 6.18/6.19 CWSR bugs]]
+
+* Kernel Fix Status
+
+** The Proper Fix
+
+Mario Limonciello (AMD) wrote =drm/amd: Check that VPE has reached DPM0 in
+idle handler= -- makes the idle handler check that VPE has actually reached
+DPM0 before attempting the power gate. Targets VPE 6.1.1 (Strix Halo) with
+firmware versions below =0x0a640500=.
+
+Merged into Linux 6.18 during the RC phase (drm-fixes-6.18, Oct 29, 2025).
+Closes freedesktop bug #4615.
+
+** Why We Can't Use 6.18
+
+Kernel 6.18.x and 6.19.x have critical CWSR (Compute Wavefront Save/Restore)
+bugs that cause hard GPU hangs on RDNA3/RDNA4 during compute workloads. The
+Framework Community recommends staying on 6.15-6.17 for Strix Halo until
+AMD resolves both VPE and CWSR issues in the same kernel.
+
+** Backport Status
+
+The fix was tagged =Cc: stable@vger.kernel.org= for backport but has NOT
+appeared in any 6.12 LTS release as of 6.12.67. It likely won't be
+backported to 6.12 due to infrastructure differences.
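The "has the fix landed in 6.12.x yet" check can be made repeatable with a small filter. This is a minimal sketch, not a definitive test: it assumes a stable backport would keep the upstream commit subject, and that kernel.org publishes one changelog per point release at =ChangeLog-<version>= under the v6.x directory. The function name =changelog_has_vpe_fix= and the =ver= variable are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: succeeds if the changelog text on stdin mentions the VPE DPM0 fix.
# Assumption: a backport keeps the upstream subject
# "drm/amd: Check that VPE has reached DPM0 in idle handler".
changelog_has_vpe_fix() {
    # -q: quiet, -i: case-insensitive, -F: fixed string rather than a regex
    grep -qiF 'VPE has reached DPM0'
}

# Hypothetical usage (needs network access; the caller sets ver to the
# point release being checked):
#   curl -fsSL "https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-${ver}" \
#       | changelog_has_vpe_fix && echo "fix mentioned in ${ver}"
```

Each changelog covers a single point release, so every release after 6.12.67 needs its own check. A miss in one changelog says nothing about later releases.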
+
+** When to Check Again
+
+Monitor these for resolution:
+- Arch =linux-lts= package updates (=pacman -Si linux-lts=)
+- [[https://cdn.kernel.org/pub/linux/kernel/v6.x/][Kernel.org changelogs]] for 6.12.x stable releases
+- [[https://community.frame.work/t/attn-critical-bugs-in-amdgpu-driver-included-with-kernel-6-18-x-6-19-x/79221][Framework Community thread]] for CWSR resolution status
+- [[https://gitlab.freedesktop.org/drm/amd/-/issues/4615][Freedesktop #4615]] for any further developments
+
+* What We Applied (2026-01-27)
+
+** Workaround: Disable Suspend via systemd
+
+Prevents the system from entering the suspend/resume path entirely.
+The GPU bug is still present but never triggered.
+
+#+begin_src bash
+# Applied 2026-01-27:
+sudo systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target
+#+end_src
+
+Effects:
+- hypridle can no longer suspend the system
+- Screen stays on at idle (active power draw)
+- No more freeze → hard reboot → filesystem corruption cycle
+
+** Kernel Parameters NOT Applied
+
+The following parameters were identified as fixes but caused boot failures
+on ratio when previously attempted (twice):
+
+#+begin_example
+amdgpu.pg_mask=0      # Disables all GPU power gating
+amdgpu.cwsr_enable=0  # Disables Compute Wavefront Save/Restore
+#+end_example
+
+It is unclear whether the boot failures were caused by the parameters
+themselves or by a corrupted initramfs from running mkinitcpio while the
+GPU was in a bad state. Testing via the GRUB =e= key (temporary, no
+permanent change) is planned but deferred.
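When the temporary GRUB =e= test is eventually run, it may help to confirm that the parameters actually reached the kernel before drawing conclusions about them. The sketch below checks a command line string for an exact parameter match. The helper name =cmdline_has= is hypothetical, and reading =/sys/module/amdgpu/parameters/= assumes the amdgpu module is loaded and exposes these parameters on this kernel.

```bash
#!/usr/bin/env bash
# Sketch only: succeeds if the kernel command line passed as $1 contains
# the exact parameter passed as $2 (whole-word match, not a substring).
cmdline_has() {
    local param="$2" word
    local -a words
    # Split the command line on whitespace without glob expansion.
    read -r -a words <<< "$1"
    for word in "${words[@]}"; do
        if [ "$word" = "$param" ]; then
            return 0
        fi
    done
    return 1
}

# Hypothetical usage after booting with the temporary edit:
#   cmdline_has "$(cat /proc/cmdline)" amdgpu.pg_mask=0 && echo "pg_mask=0 is on the command line"
#   cat /sys/module/amdgpu/parameters/pg_mask   # value the driver was loaded with
```

Matching whole words avoids a false positive on a longer value such as =amdgpu.pg_mask=0x10=.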
+
+** Current Kernel Command Line (for reference)
+
+#+begin_example
+BOOT_IMAGE=/@/boot/vmlinuz-linux-lts root=UUID=5b9f7f7f-2477-488f-8fb1-52b5c7d90e98
+rw rootflags=subvol=@ console=tty0 console=ttyS0,115200 rw loglevel=2
+rd.systemd.show_status=auto rd.udev.log_level=2 nvme.noacpi=1
+mem_sleep_default=deep nowatchdog random.trust_cpu=off quiet splash
+#+end_example
+
+* How to Undo When a Fixed Kernel Arrives
+
+** Step 1: Verify the Fix is in the New Kernel
+
+Check that the VPE DPM0 fix is present:
+
+#+begin_src bash
+# Check kernel version
+uname -r
+
+# Search for the fix in the changelog
+# Look for "VPE" or "DPM0" or "vpe_idle" in the relevant changelog:
+# https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-
+
+# Or check the source directly:
+grep -r "vpe_need_dpm0_at_power_down\|vpe_get_dpm_level" /usr/src/linux/drivers/gpu/drm/amd/ 2>/dev/null
+#+end_src
+
+Also verify that CWSR bugs are resolved (check Framework Community thread).
+
+** Step 2: Unmask Suspend Targets
+
+#+begin_src bash
+sudo systemctl unmask sleep.target suspend.target hibernate.target hybrid-sleep.target
+#+end_src
+
+** Step 3: Test Suspend/Resume
+
+#+begin_src bash
+# Test a single suspend/resume cycle
+sudo systemctl suspend
+
+# If system resumes cleanly, test a few more times
+# The original bug had ~8% failure rate, so test at least 20 cycles
+#+end_src
+
+** Step 4: If Kernel Parameters Were Applied
+
+If =amdgpu.pg_mask=0= and =amdgpu.cwsr_enable=0= were added to GRUB, remove
+them once the kernel fix is confirmed working:
+
+#+begin_src bash
+# Edit GRUB config
+sudo vim /etc/default/grub
+# Remove amdgpu.pg_mask=0 and amdgpu.cwsr_enable=0 from GRUB_CMDLINE_LINUX_DEFAULT
+
+# Rebuild GRUB config
+sudo grub-mkconfig -o /boot/grub/grub.cfg
+
+# Reboot and test suspend
+#+end_src
+
+* Log Evidence (2026-01-27 Investigation)
+
+** System Info
+
+- Machine: Framework Desktop (AMD Ryzen AI Max 300 Series)
+- Hostname: ratio
+- Kernel: 6.12.67-1-lts
+- Filesystem: btrfs RAID1 on 2x NVMe (nvme0n1p2 + nvme1n1p2)
+- GPU: AMD Strix Halo (RDNA 3.5)
+
+** Findings
+
+- 13 boots between Jan 25-27, most ending in suspend then hard freeze
+- Journal corruption on boots -5, -3, and -7 (unclean shutdown)
+- =mc= (Midnight Commander) stuck in D state (uninterruptible I/O) during
+  failed freeze attempts, in =io_schedule → folio_wait_bit_common →
+  filemap_read= path
+- Suspend freeze pattern: =PM: suspend entry (deep)= → =PM: suspend exit= →
+  =PM: suspend entry (s2idle)= → no more logs → hard reboot required
+- =mu= database corruption (error 121) from repeated unclean shutdowns
+- btrfs device stats: zero errors on both NVMe drives
+- No explicit BTRFS read-only event logged (freeze kills logging before it
+  can be recorded)
--
cgit v1.2.3