path: root/docs/2026-01-27-ratio-amd-gpu-suspend-workaround.org
author Craig Jennings <c@cjennings.net> 2026-02-22 23:20:56 -0600
committer Craig Jennings <c@cjennings.net> 2026-02-22 23:20:56 -0600
commit 3a2445080c880544985f50fb0d916534698cc073
tree 909f98edbbb940aafb95de02457d4d6f7db3cba4 /docs/2026-01-27-ratio-amd-gpu-suspend-workaround.org
parent 3595aa8a8122da543676717fb5825044eee99a9d
chore: add docs/ to .gitignore and untrack personal files
docs/ contains session history, personal workflows, and private protocols that shouldn't be in a public repository.
Diffstat (limited to 'docs/2026-01-27-ratio-amd-gpu-suspend-workaround.org')
-rw-r--r-- docs/2026-01-27-ratio-amd-gpu-suspend-workaround.org | 217
1 files changed, 0 insertions, 217 deletions
diff --git a/docs/2026-01-27-ratio-amd-gpu-suspend-workaround.org b/docs/2026-01-27-ratio-amd-gpu-suspend-workaround.org
deleted file mode 100644
index 46e403d..0000000
--- a/docs/2026-01-27-ratio-amd-gpu-suspend-workaround.org
+++ /dev/null
@@ -1,217 +0,0 @@
-#+TITLE: Ratio AMD GPU Suspend Freeze - Workaround & Fix Tracking
-#+DATE: 2026-01-27
-
-* Summary
-
-Ratio (Framework Desktop, AMD Ryzen AI Max / Strix Halo) freezes hard on
-resume from suspend due to a VPE power gating race condition in the amdgpu
-driver. The freeze requires a hard power cycle, which causes journal
-corruption and can leave the btrfs filesystem read-only.
-
-As of 2026-01-27, the proper kernel fix exists (merged in 6.18) but is
-unusable due to separate CWSR bugs in 6.18+. Ratio runs kernel 6.12 LTS,
-which does not have the fix and is unlikely to receive a backport.
-
-A systemd suspend mask is applied as a workaround to prevent the system from
-ever entering the suspend/resume path.
-
-* The Bug
-
-** What Happens
-
-~8% of suspend/resume cycles on Strix Halo result in a hard system freeze
-approximately 1 second after the screen turns on during resume.
-
-** Root Cause: VPE Power Gating Race Condition
-
-The freeze is caused by a race condition in the amdgpu driver's VPE (Video
-Processing Engine) power management during resume:
-
-1. System resumes from suspend.
-2. amdgpu schedules =amdgpu_device_delayed_init_work_handler= (2s delay) to
- run self-tests, including =vpe_ring_test_ib= which briefly powers on VPE.
-3. The ring buffer test is very short. VPE goes idle.
-4. After 1 second of idle, =vpe_idle_work_handler= fires and tells the SMU
- (System Management Unit) to power gate (shut down) VPE.
-5. *But VPE is still at a high DPM level.* Newer VPE firmware only drops DPM
- back to the lowest level (DPM0) after a workload has run for 2+ seconds.
- The ring buffer test was too short to trigger that drop.
-6. The SMU tries to power gate VPE while it's at a high DPM level. On Strix
- Halo, this hangs the SMU.
-7. The SMU hang cascades -- VCN, JPEG, and other GPU IPs can't be managed.
- Half the GPU is frozen.
-8. The thread that issued the SMU command is stuck. System is locked up.
- No further logging is possible.
-
-It only triggers on resume because that's when the driver runs the ring
-buffer self-test. During normal operation, VPE either isn't used or has had
-enough time to settle its DPM level before power gating.
-
-** Error Messages (if visible before freeze)
-
-#+begin_example
-SMU: I'm not done with your previous command
-Failed to power gate VPE!
-Dpm disable vpe failed, ret = -62
-Failed to power gate JPEG
-Failed to power gate VCN instance 0
-Dpm disable uvd failed
-#+end_example
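-
-The messages above can be searched for in the previous boot's kernel log.
-A minimal sketch, assuming the journal from that boot survived the hard
-power cycle; =vpe_hang_signatures= is a hypothetical helper name, not an
-existing tool:
-
-#+begin_src bash
-# Hypothetical helper: filter kernel log text on stdin for the SMU/VPE
-# failure messages listed above.
-vpe_hang_signatures() {
-    grep -E "SMU: I'm not done with your previous command|Failed to power gate (VPE|JPEG|VCN)|Dpm disable (vpe|uvd) failed"
-}
-
-# Usage: scan the kernel messages of the previous boot.
-# journalctl -k -b -1 --no-pager | vpe_hang_signatures
-#+end_src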
-
-** References
-
-- [[https://lkml.org/lkml/2025/8/24/139][Original VPE_IDLE_TIMEOUT patch (LKML, Aug 2025)]]
-- [[https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg130657.html][VPE DPM0 fix v5 (amd-gfx, Oct 2025)]]
-- [[https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg130804.html][Follow-up: missing return statement fix]]
-- [[https://gitlab.freedesktop.org/drm/amd/-/issues/4615][Freedesktop bug #4615]]
-- [[https://community.frame.work/t/attn-critical-bugs-in-amdgpu-driver-included-with-kernel-6-18-x-6-19-x/79221][Framework Community: Critical 6.18/6.19 CWSR bugs]]
-
-* Kernel Fix Status
-
-** The Proper Fix
-
-Mario Limonciello (AMD) wrote =drm/amd: Check that VPE has reached DPM0 in
-idle handler= -- makes the idle handler check that VPE has actually reached
-DPM0 before attempting the power gate. Targets VPE 6.1.1 (Strix Halo) with
-firmware versions below =0x0a640500=.
-
-Merged into Linux 6.18 during the RC phase (drm-fixes-6.18, Oct 29, 2025).
-Closes freedesktop bug #4615.
-
-** Why We Can't Use 6.18
-
-Kernel 6.18.x and 6.19.x have critical CWSR (Compute Wavefront Save/Restore)
-bugs that cause hard GPU hangs on RDNA3/RDNA4 during compute workloads. The
-Framework Community recommends staying on 6.15-6.17 for Strix Halo until
-AMD resolves both VPE and CWSR issues in the same kernel.
-
-** Backport Status
-
-The fix was tagged =Cc: stable@vger.kernel.org= for backport but has NOT
-appeared in any 6.12 LTS release as of 6.12.67. It likely won't be
-backported to 6.12 due to infrastructure differences.
-
-** When to Check Again
-
-Monitor these for resolution:
-- Arch =linux-lts= package updates (=pacman -Si linux-lts=)
-- [[https://cdn.kernel.org/pub/linux/kernel/v6.x/][Kernel.org changelogs]] for 6.12.x stable releases
-- [[https://community.frame.work/t/attn-critical-bugs-in-amdgpu-driver-included-with-kernel-6-18-x-6-19-x/79221][Framework Community thread]] for CWSR resolution status
-- [[https://gitlab.freedesktop.org/drm/amd/-/issues/4615][Freedesktop #4615]] for any further developments
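-
-The first item can be narrowed to just the version string. A small sketch;
-=repo_version= is a hypothetical helper that reads =pacman -Si= output on
-stdin, and it assumes pacman's usual "Version : value" line format:
-
-#+begin_src bash
-# Hypothetical helper: print the Version field from `pacman -Si` output.
-repo_version() {
-    awk '/^Version *:/ { sub(/^Version *: */, ""); print; exit }'
-}
-
-# Usage: compare the packaged LTS kernel with the running one.
-# pacman -Si linux-lts | repo_version
-# uname -r
-#+end_src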
-
-* What We Applied (2026-01-27)
-
-** Workaround: Disable Suspend via systemd
-
-Prevents the system from entering the suspend/resume path entirely.
-The GPU bug is still present but never triggered.
-
-#+begin_src bash
-# Applied 2026-01-27:
-sudo systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target
-#+end_src
-
-Effects:
-- hypridle can no longer suspend the system
-- Screen stays on at idle (active power draw)
-- No more freeze → hard reboot → filesystem corruption cycle
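-
-To confirm the mask took effect, each target can be queried. A minimal
-sketch; =report_sleep_mask= is a hypothetical helper name. The
-=systemctl is-enabled= command prints "masked" for a masked unit and exits
-non-zero, which is why the exit status is ignored here:
-
-#+begin_src bash
-# Hypothetical helper: print the enablement state of each sleep target.
-report_sleep_mask() {
-    local t state
-    for t in sleep.target suspend.target hibernate.target hybrid-sleep.target; do
-        state=$(systemctl is-enabled "$t" 2>/dev/null || true)
-        printf '%s: %s\n' "$t" "${state:-unknown}"
-    done
-}
-
-# Usage: every line should end in "masked" while the workaround is active.
-# report_sleep_mask
-#+end_src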
-
-** Kernel Parameters NOT Applied
-
-The following parameters were identified as possible fixes, but ratio
-failed to boot on both previous attempts to apply them:
-
-#+begin_example
-amdgpu.pg_mask=0 # Disables all GPU power gating
-amdgpu.cwsr_enable=0 # Disables Compute Wavefront Save/Restore
-#+end_example
-
-It is unclear whether the boot failures were caused by the parameters
-themselves or by a corrupted initramfs from running mkinitcpio while the
-GPU was in a bad state. Testing via the GRUB =e= key (temporary, no
-permanent change) is planned but deferred.
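-
-After a one-off boot edited via the GRUB =e= key, it is worth confirming
-that the parameters actually reached the kernel. A minimal sketch;
-=cmdline_has= is a hypothetical helper that checks =/proc/cmdline= (or a
-string passed as the second argument) for an exact parameter:
-
-#+begin_src bash
-# Hypothetical helper: succeed if the exact parameter is on the command line.
-cmdline_has() {
-    local param=$1
-    local cmdline=${2:-$(cat /proc/cmdline)}
-    # Pad with spaces so only whole, space-separated words can match.
-    case " $cmdline " in
-        *" $param "*) return 0 ;;
-    esac
-    return 1
-}
-
-# Usage:
-# cmdline_has amdgpu.pg_mask=0 && echo "pg_mask=0 is active"
-# cmdline_has amdgpu.cwsr_enable=0 && echo "cwsr_enable=0 is active"
-#+end_src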
-
-** Current Kernel Command Line (for reference)
-
-#+begin_example
-BOOT_IMAGE=/@/boot/vmlinuz-linux-lts root=UUID=5b9f7f7f-2477-488f-8fb1-52b5c7d90e98
-rw rootflags=subvol=@ console=tty0 console=ttyS0,115200 rw loglevel=2
-rd.systemd.show_status=auto rd.udev.log_level=2 nvme.noacpi=1
-mem_sleep_default=deep nowatchdog random.trust_cpu=off quiet splash
-#+end_example
-
-* How to Undo When a Fixed Kernel Arrives
-
-** Step 1: Verify the Fix is in the New Kernel
-
-Check that the VPE DPM0 fix is present:
-
-#+begin_src bash
-# Check kernel version
-uname -r
-
-# Search for the fix in the changelog
-# Look for "VPE" or "DPM0" or "vpe_idle" in the relevant changelog:
-# https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-<version>
-
-# Or check the source directly:
-grep -r "vpe_need_dpm0_at_power_down\|vpe_get_dpm_level" /usr/src/linux/drivers/gpu/drm/amd/ 2>/dev/null
-#+end_src
-
-Also verify that CWSR bugs are resolved (check Framework Community thread).
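-
-A quick version gate can precede the manual checks above. A minimal sketch
-using GNU =sort -V=; =kernel_at_least= is a hypothetical helper, and a
-passing version check alone does not prove that the fix is present or that
-the CWSR bugs are resolved:
-
-#+begin_src bash
-# Hypothetical helper: succeed if version $2 (default: running kernel)
-# sorts at or above the minimum version $1.
-kernel_at_least() {
-    local want=$1
-    local have=${2:-$(uname -r)}
-    [ "$(printf '%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
-}
-
-# Usage: 6.18 is the release the fix was merged into.
-# kernel_at_least 6.18 && echo "kernel is 6.18 or newer"
-#+end_src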
-
-** Step 2: Unmask Suspend Targets
-
-#+begin_src bash
-sudo systemctl unmask sleep.target suspend.target hibernate.target hybrid-sleep.target
-#+end_src
-
-** Step 3: Test Suspend/Resume
-
-#+begin_src bash
-# Test a single suspend/resume cycle
-sudo systemctl suspend
-
-# If system resumes cleanly, test a few more times
-# The original bug had a ~8% failure rate, so test at least 20 cycles
-#+end_src
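-
-Whether 20 cycles is enough can be estimated from the failure rate. A
-back-of-the-envelope sketch, assuming each cycle fails independently with
-probability 0.08 (the roughly 8% figure above); =detect_prob= and
-=cycles_for= are hypothetical helper names:
-
-#+begin_src bash
-# Probability of seeing at least one freeze in N cycles: 1 - (1 - p)^N
-detect_prob() {
-    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.3f\n", 1 - (1 - p) ^ n }'
-}
-
-# Smallest N whose detection probability reaches the given confidence.
-cycles_for() {
-    awk -v p="$1" -v c="$2" 'BEGIN {
-        x = log(1 - c) / log(1 - p)
-        n = int(x); if (x > n) n++
-        print n
-    }'
-}
-
-detect_prob 0.08 20    # prints 0.811
-cycles_for 0.08 0.95   # prints 36
-#+end_src
-
-Under that assumption, 20 clean cycles still leave roughly a 19% chance
-that the bug is present but did not trigger, so a longer run gives more
-confidence.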
-
-** Step 4: If Kernel Parameters Were Applied
-
-If =amdgpu.pg_mask=0= and =amdgpu.cwsr_enable=0= were added to GRUB, remove
-them once the kernel fix is confirmed working:
-
-#+begin_src bash
-# Edit GRUB config
-sudo vim /etc/default/grub
-# Remove amdgpu.pg_mask=0 and amdgpu.cwsr_enable=0 from GRUB_CMDLINE_LINUX_DEFAULT
-
-# Rebuild GRUB config
-sudo grub-mkconfig -o /boot/grub/grub.cfg
-
-# Reboot and test suspend
-#+end_src
-
-* Log Evidence (2026-01-27 Investigation)
-
-** System Info
-
-- Machine: Framework Desktop (AMD Ryzen AI Max 300 Series)
-- Hostname: ratio
-- Kernel: 6.12.67-1-lts
-- Filesystem: btrfs RAID1 on 2x NVMe (nvme0n1p2 + nvme1n1p2)
-- GPU: AMD Strix Halo (RDNA 3.5)
-
-** Findings
-
-- 13 boots between Jan 25-27, most ending in suspend then hard freeze
-- Journal corruption on boots -7, -5, and -3 (unclean shutdown)
-- =mc= (Midnight Commander) stuck in D state (uninterruptible I/O) during
- failed freeze attempts, in =io_schedule → folio_wait_bit_common →
- filemap_read= path
-- Suspend freeze pattern: =PM: suspend entry (deep)= → =PM: suspend exit= →
- =PM: suspend entry (s2idle)= → no more logs → hard reboot required
-- =mu= database corruption (error 121) from repeated unclean shutdowns
-- btrfs device stats: zero errors on both NVMe drives
-- No explicit BTRFS read-only event logged (freeze kills logging before it
- can be recorded)
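-
-The suspend freeze pattern can be pulled from an earlier boot's kernel log.
-A minimal sketch; =pm_suspend_events= is a hypothetical helper, and the
-boot offset (=-1= here) is only an example:
-
-#+begin_src bash
-# Hypothetical helper: keep only the PM suspend entry/exit lines from stdin.
-pm_suspend_events() {
-    grep -E 'PM: suspend (entry|exit)'
-}
-
-# Usage: per the pattern above, a frozen boot's last PM line is a
-# suspend entry with no later exit.
-# journalctl --list-boots
-# journalctl -k -b -1 --no-pager | pm_suspend_events | tail -n 3
-#+end_src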