| author | Craig Jennings <c@cjennings.net> | 2026-02-22 23:20:56 -0600 |
|---|---|---|
| committer | Craig Jennings <c@cjennings.net> | 2026-02-22 23:20:56 -0600 |
| commit | 5e6877e8f3fb552fce3367ff273167d2cf6af75f (patch) | |
| tree | 909f98edbbb940aafb95de02457d4d6f7db3cba4 /docs/2026-01-27-ratio-amd-gpu-suspend-workaround.org | |
| parent | b104dde43fcc717681a8733a977eb528c60eb13f (diff) | |
chore: add docs/ to .gitignore and untrack personal files
docs/ contains session history, personal workflows, and private
protocols that shouldn't be in a public repository.
Diffstat (limited to 'docs/2026-01-27-ratio-amd-gpu-suspend-workaround.org')
| -rw-r--r-- | docs/2026-01-27-ratio-amd-gpu-suspend-workaround.org | 217 |
1 files changed, 0 insertions, 217 deletions
diff --git a/docs/2026-01-27-ratio-amd-gpu-suspend-workaround.org b/docs/2026-01-27-ratio-amd-gpu-suspend-workaround.org
deleted file mode 100644
index 46e403d..0000000
--- a/docs/2026-01-27-ratio-amd-gpu-suspend-workaround.org
+++ /dev/null
@@ -1,217 +0,0 @@
-#+TITLE: Ratio AMD GPU Suspend Freeze - Workaround & Fix Tracking
-#+DATE: 2026-01-27
-
-* Summary
-
-Ratio (Framework Desktop, AMD Ryzen AI Max / Strix Halo) freezes hard on
-resume from suspend due to a VPE power gating race condition in the amdgpu
-driver. The freeze requires a hard power cycle, which causes journal
-corruption and can leave the btrfs filesystem read-only.
-
-As of 2026-01-27, the proper kernel fix exists (merged in 6.18) but is
-unusable due to separate CWSR bugs in 6.18+. Ratio runs kernel 6.12 LTS,
-which does not have the fix and will not receive a backport.
-
-A systemd suspend mask is applied as a workaround to prevent the system from
-ever entering the suspend/resume path.
-
-* The Bug
-
-** What Happens
-
-~8% of suspend/resume cycles on Strix Halo result in a hard system freeze
-approximately 1 second after the screen turns on during resume.
-
-** Root Cause: VPE Power Gating Race Condition
-
-The freeze is caused by a race condition in the amdgpu driver's VPE (Video
-Processing Engine) power management during resume:
-
-1. System resumes from suspend.
-2. amdgpu schedules =amdgpu_device_delayed_init_work_handler= (2s delay) to
-   run self-tests, including =vpe_ring_test_ib= which briefly powers on VPE.
-3. The ring buffer test is very short. VPE goes idle.
-4. After 1 second of idle, =vpe_idle_work_handler= fires and tells the SMU
-   (System Management Unit) to power gate (shut down) VPE.
-5. *But VPE is still at a high DPM level.* Newer VPE firmware only drops DPM
-   back to the lowest level (DPM0) after a workload has run for 2+ seconds.
-   The ring buffer test was too short to trigger that drop.
-6. The SMU tries to power gate VPE while it's at a high DPM level. On Strix
-   Halo, this hangs the SMU.
-7. The SMU hang cascades -- VCN, JPEG, and other GPU IPs can't be managed.
-   Half the GPU is frozen.
-8. The thread that issued the SMU command is stuck. System is locked up.
-   No further logging is possible.
-
-It only triggers on resume because that's when the driver runs the ring
-buffer self-test. During normal operation, VPE either isn't used or has had
-enough time to settle its DPM level before power gating.
-
-** Error Messages (if visible before freeze)
-
-#+begin_example
-SMU: I'm not done with your previous command
-Failed to power gate VPE!
-Dpm disable vpe failed, ret = -62
-Failed to power gate JPEG
-Failed to power gate VCN instance 0
-Dpm disable uvd failed
-#+end_example
-
-** References
-
-- [[https://lkml.org/lkml/2025/8/24/139][Original VPE_IDLE_TIMEOUT patch (LKML, Aug 2025)]]
-- [[https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg130657.html][VPE DPM0 fix v5 (amd-gfx, Oct 2025)]]
-- [[https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg130804.html][Follow-up: missing return statement fix]]
-- [[https://gitlab.freedesktop.org/drm/amd/-/issues/4615][Freedesktop bug #4615]]
-- [[https://community.frame.work/t/attn-critical-bugs-in-amdgpu-driver-included-with-kernel-6-18-x-6-19-x/79221][Framework Community: Critical 6.18/6.19 CWSR bugs]]
-
-* Kernel Fix Status
-
-** The Proper Fix
-
-Mario Limonciello (AMD) wrote =drm/amd: Check that VPE has reached DPM0 in
-idle handler= -- makes the idle handler check that VPE has actually reached
-DPM0 before attempting the power gate. Targets VPE 6.1.1 (Strix Halo) with
-firmware versions below =0x0a640500=.
-
-Merged into Linux 6.18 during the RC phase (drm-fixes-6.18, Oct 29, 2025).
-Closes freedesktop bug #4615.
-
-** Why We Can't Use 6.18
-
-Kernel 6.18.x and 6.19.x have critical CWSR (Compute Wavefront Save/Restore)
-bugs that cause hard GPU hangs on RDNA3/RDNA4 during compute workloads. The
-Framework Community recommends staying on 6.15-6.17 for Strix Halo until
-AMD resolves both VPE and CWSR issues in the same kernel.
-
-** Backport Status
-
-The fix was tagged =Cc: stable@vger.kernel.org= for backport but has NOT
-appeared in any 6.12 LTS release as of 6.12.67. It likely won't be
-backported to 6.12 due to infrastructure differences.
-
-** When to Check Again
-
-Monitor these for resolution:
-- Arch =linux-lts= package updates (=pacman -Si linux-lts=)
-- [[https://cdn.kernel.org/pub/linux/kernel/v6.x/][Kernel.org changelogs]] for 6.12.x stable releases
-- [[https://community.frame.work/t/attn-critical-bugs-in-amdgpu-driver-included-with-kernel-6-18-x-6-19-x/79221][Framework Community thread]] for CWSR resolution status
-- [[https://gitlab.freedesktop.org/drm/amd/-/issues/4615][Freedesktop #4615]] for any further developments
-
-* What We Applied (2026-01-27)
-
-** Workaround: Disable Suspend via systemd
-
-Prevents the system from entering the suspend/resume path entirely.
-The GPU bug is still present but never triggered.
-
-#+begin_src bash
-# Applied 2026-01-27:
-sudo systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target
-#+end_src
-
-Effects:
-- hypridle can no longer suspend the system
-- Screen stays on at idle (active power draw)
-- No more freeze → hard reboot → filesystem corruption cycle
-
-** Kernel Parameters NOT Applied
-
-The following parameters were identified as fixes but caused boot failures
-on ratio when previously attempted (twice):
-
-#+begin_example
-amdgpu.pg_mask=0 # Disables all GPU power gating
-amdgpu.cwsr_enable=0 # Disables Compute Wavefront Save/Restore
-#+end_example
-
-It is unclear whether the boot failures were caused by the parameters
-themselves or by a corrupted initramfs from running mkinitcpio while the
-GPU was in a bad state. Testing via the GRUB =e= key (temporary, no
-permanent change) is planned but deferred.
-
-** Current Kernel Command Line (for reference)
-
-#+begin_example
-BOOT_IMAGE=/@/boot/vmlinuz-linux-lts root=UUID=5b9f7f7f-2477-488f-8fb1-52b5c7d90e98
-rw rootflags=subvol=@ console=tty0 console=ttyS0,115200 rw loglevel=2
-rd.systemd.show_status=auto rd.udev.log_level=2 nvme.noacpi=1
-mem_sleep_default=deep nowatchdog random.trust_cpu=off quiet splash
-#+end_example
-
-* How to Undo When a Fixed Kernel Arrives
-
-** Step 1: Verify the Fix is in the New Kernel
-
-Check that the VPE DPM0 fix is present:
-
-#+begin_src bash
-# Check kernel version
-uname -r
-
-# Search for the fix in the changelog
-# Look for "VPE" or "DPM0" or "vpe_idle" in the relevant changelog:
-# https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-<version>
-
-# Or check the source directly:
-grep -r "vpe_need_dpm0_at_power_down\|vpe_get_dpm_level" /usr/src/linux/drivers/gpu/drm/amd/ 2>/dev/null
-#+end_src
-
-Also verify that CWSR bugs are resolved (check Framework Community thread).
-
-** Step 2: Unmask Suspend Targets
-
-#+begin_src bash
-sudo systemctl unmask sleep.target suspend.target hibernate.target hybrid-sleep.target
-#+end_src
-
-** Step 3: Test Suspend/Resume
-
-#+begin_src bash
-# Test a single suspend/resume cycle
-sudo systemctl suspend
-
-# If system resumes cleanly, test a few more times
-# The original bug had ~8% failure rate, so test at least 20 cycles
-#+end_src
-
-** Step 4: If Kernel Parameters Were Applied
-
-If =amdgpu.pg_mask=0= and =amdgpu.cwsr_enable=0= were added to GRUB, remove
-them once the kernel fix is confirmed working:
-
-#+begin_src bash
-# Edit GRUB config
-sudo vim /etc/default/grub
-# Remove amdgpu.pg_mask=0 and amdgpu.cwsr_enable=0 from GRUB_CMDLINE_LINUX_DEFAULT
-
-# Rebuild GRUB config
-sudo grub-mkconfig -o /boot/grub/grub.cfg
-
-# Reboot and test suspend
-#+end_src
-
-* Log Evidence (2026-01-27 Investigation)
-
-** System Info
-
-- Machine: Framework Desktop (AMD Ryzen AI Max 300 Series)
-- Hostname: ratio
-- Kernel: 6.12.67-1-lts
-- Filesystem: btrfs RAID1 on 2x NVMe (nvme0n1p2 + nvme1n1p2)
-- GPU: AMD Strix Halo (RDNA 3.5)
-
-** Findings
-
-- 13 boots between Jan 25-27, most ending in suspend then hard freeze
-- Journal corruption on boots -5, -3, and -7 (unclean shutdown)
-- =mc= (Midnight Commander) stuck in D state (uninterruptible I/O) during
-  failed freeze attempts, in =io_schedule → folio_wait_bit_common →
-  filemap_read= path
-- Suspend freeze pattern: =PM: suspend entry (deep)= → =PM: suspend exit= →
-  =PM: suspend entry (s2idle)= → no more logs → hard reboot required
-- =mu= database corruption (error 121) from repeated unclean shutdowns
-- btrfs device stats: zero errors on both NVMe drives
-- No explicit BTRFS read-only event logged (freeze kills logging before it
-  can be recorded)
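
A rough check on the "at least 20 cycles" figure in Step 3 of the removed file. This is a sketch that assumes each suspend/resume cycle fails independently with probability p ≈ 0.08. The independence assumption and the 95% target are introduced here for illustration; the file itself states only the ~8% rate.

```latex
% Assumption: cycles fail independently with p = 0.08 (the ~8% rate from the file).
\[
P(\text{no freeze in } n \text{ cycles}) = (1 - p)^{n},
\qquad
(1 - 0.08)^{20} = 0.92^{20} \approx 0.19
\]
% Number of clean cycles needed before an unfixed kernel is unlikely (under 5%):
\[
(1 - p)^{n} \le 0.05
\;\Longrightarrow\;
n \ge \frac{\ln 0.05}{\ln 0.92} \approx 35.9
\]
```

Under that assumption, an unfixed kernel would still pass 20 consecutive cycles about 19% of the time, so 20 is best read as a floor. Roughly 36 clean cycles would be needed for about 95% confidence that the freeze is gone.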
