#+TITLE: Ratio AMD GPU Suspend Freeze - Workaround & Fix Tracking
#+DATE: 2026-01-27

* Summary

Ratio (Framework Desktop, AMD Ryzen AI Max / Strix Halo) freezes hard on
resume from suspend due to a VPE power gating race condition in the amdgpu
driver. The freeze requires a hard power cycle, which causes journal
corruption and can leave the btrfs filesystem read-only.

As of 2026-01-27, the proper kernel fix exists (merged in 6.18) but is
unusable due to separate CWSR bugs in 6.18+. Ratio runs kernel 6.12 LTS,
which does not have the fix and is unlikely to receive a backport.

A systemd suspend mask is applied as a workaround to prevent the system from
ever entering the suspend/resume path.

* The Bug

** What Happens

~8% of suspend/resume cycles on Strix Halo result in a hard system freeze
approximately 1 second after the screen turns on during resume.

** Root Cause: VPE Power Gating Race Condition

The freeze is caused by a race condition in the amdgpu driver's VPE (Video
Processing Engine) power management during resume:

1. System resumes from suspend.
2. amdgpu schedules =amdgpu_device_delayed_init_work_handler= (2s delay) to
   run self-tests, including =vpe_ring_test_ib= which briefly powers on VPE.
3. The ring buffer test is very short. VPE goes idle.
4. After 1 second of idle, =vpe_idle_work_handler= fires and tells the SMU
   (System Management Unit) to power gate (shut down) VPE.
5. *But VPE is still at a high DPM level.* Newer VPE firmware only drops DPM
   back to the lowest level (DPM0) after a workload has run for 2+ seconds.
   The ring buffer test was too short to trigger that drop.
6. The SMU tries to power gate VPE while it's at a high DPM level. On Strix
   Halo, this hangs the SMU.
7. The SMU hang cascades -- VCN, JPEG, and other GPU IPs can't be managed.
   Half the GPU is frozen.
8. The thread that issued the SMU command is stuck. System is locked up.
   No further logging is possible.

It only triggers on resume because that's when the driver runs the ring
buffer self-test. During normal operation, VPE either isn't used or has had
enough time to settle its DPM level before power gating.

** Error Messages (if visible before freeze)

#+begin_example
SMU: I'm not done with your previous command
Failed to power gate VPE!
Dpm disable vpe failed, ret = -62
Failed to power gate JPEG
Failed to power gate VCN instance 0
Dpm disable uvd failed
#+end_example
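
To check whether any of these messages made it into the journal before a
freeze, the previous boot's kernel log can be filtered. This is a minimal
sketch: =match_smu_errors= is a hypothetical helper name, and its patterns are
taken from the messages listed above. If the freeze stopped logging first, the
output is simply empty.

#+begin_src bash
# Hypothetical helper: keep only lines that match the SMU/VPE power-gating
# errors listed above. Reads log text on stdin.
match_smu_errors() {
    grep -E "SMU: I'm not done with your previous command|Failed to power gate (VPE|JPEG|VCN)|Dpm disable (vpe|uvd) failed"
}

# Usage (previous boot, kernel messages only):
#   journalctl -k -b -1 --no-pager | match_smu_errors
#+end_src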

** References

- [[https://lkml.org/lkml/2025/8/24/139][Original VPE_IDLE_TIMEOUT patch (LKML, Aug 2025)]]
- [[https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg130657.html][VPE DPM0 fix v5 (amd-gfx, Oct 2025)]]
- [[https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg130804.html][Follow-up: missing return statement fix]]
- [[https://gitlab.freedesktop.org/drm/amd/-/issues/4615][Freedesktop bug #4615]]
- [[https://community.frame.work/t/attn-critical-bugs-in-amdgpu-driver-included-with-kernel-6-18-x-6-19-x/79221][Framework Community: Critical 6.18/6.19 CWSR bugs]]

* Kernel Fix Status

** The Proper Fix

Mario Limonciello (AMD) wrote the patch =drm/amd: Check that VPE has reached
DPM0 in idle handler=, which makes the idle handler verify that VPE has
actually reached DPM0 before attempting the power gate. It targets VPE 6.1.1
(Strix Halo) with firmware versions below =0x0a640500=.

Merged into Linux 6.18 during the RC phase (drm-fixes-6.18, Oct 29, 2025).
Closes freedesktop bug #4615.

** Why We Can't Use 6.18

Kernel 6.18.x and 6.19.x have critical CWSR (Compute Wavefront Save/Restore)
bugs that cause hard GPU hangs on RDNA3/RDNA4 during compute workloads. The
Framework Community recommends staying on 6.15-6.17 for Strix Halo until
AMD resolves both VPE and CWSR issues in the same kernel.

** Backport Status

The fix was tagged =Cc: stable@vger.kernel.org= for backport but has NOT
appeared in any 6.12 LTS release as of 6.12.67. It likely won't be
backported to 6.12 due to infrastructure differences.

** When to Check Again

Monitor these for resolution:
- Arch =linux-lts= package updates (=pacman -Si linux-lts=)
- [[https://cdn.kernel.org/pub/linux/kernel/v6.x/][Kernel.org changelogs]] for 6.12.x stable releases
- [[https://community.frame.work/t/attn-critical-bugs-in-amdgpu-driver-included-with-kernel-6-18-x-6-19-x/79221][Framework Community thread]] for CWSR resolution status
- [[https://gitlab.freedesktop.org/drm/amd/-/issues/4615][Freedesktop #4615]] for any further developments
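
The changelog check can be scripted roughly as follows. This is a sketch, not
a definitive test: =changelog_url= and =mentions_vpe_fix= are hypothetical
helper names, the URL pattern is the kernel.org one linked above, and the
search string assumes a backport would keep the original commit subject. Each
stable changelog lists only the commits new in that release, so every new
6.12.x release needs its own check.

#+begin_src bash
# Hypothetical helper: build the kernel.org changelog URL for a 6.x release.
changelog_url() {
    # $1 = full stable version, e.g. 6.12.67
    printf 'https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-%s\n' "$1"
}

# Hypothetical helper: succeed if the changelog text on stdin mentions the
# commit subject of the VPE DPM0 fix.
mentions_vpe_fix() {
    grep -i "Check that VPE has reached DPM0" >/dev/null
}

# Usage:
#   curl -fsSL "$(changelog_url 6.12.67)" | mentions_vpe_fix && echo "fix listed"
#+end_src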

* What We Applied (2026-01-27)

** Workaround: Disable Suspend via systemd

Prevents the system from entering the suspend/resume path entirely.
The GPU bug is still present but never triggered.

#+begin_src bash
# Applied 2026-01-27:
sudo systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target
#+end_src

Effects:
- hypridle can no longer suspend the system
- Screen stays on at idle (active power draw)
- No more freeze → hard reboot → filesystem corruption cycle
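
To confirm the mask is in place, each target can be queried with
=systemctl is-enabled=, which prints =masked= for a masked unit. The helper
below is a small sketch; =check_sleep_masked= is a hypothetical name, not part
of systemd.

#+begin_src bash
# Hypothetical helper: print each sleep-related target with its state and
# fail if any of them is not masked.
check_sleep_masked() {
    local unit state rc=0
    for unit in sleep.target suspend.target hibernate.target hybrid-sleep.target; do
        # is-enabled exits non-zero for masked units, so ignore its exit status
        state=$(systemctl is-enabled "$unit" 2>/dev/null || true)
        printf '%s: %s\n' "$unit" "${state:-unknown}"
        [ "$state" = "masked" ] || rc=1
    done
    return "$rc"
}

# Usage:
#   check_sleep_masked && echo "suspend is fully disabled"
#+end_src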

** Kernel Parameters NOT Applied

The following parameters were identified as possible workarounds, but ratio
failed to boot both times they were tried:

#+begin_example
amdgpu.pg_mask=0         # Disables all GPU power gating
amdgpu.cwsr_enable=0     # Disables Compute Wavefront Save/Restore
#+end_example

It is unclear whether the boot failures were caused by the parameters
themselves or by a corrupted initramfs from running mkinitcpio while the
GPU was in a bad state. Testing via the GRUB =e= key (temporary, no
permanent change) is planned but deferred.

** Current Kernel Command Line (for reference)

#+begin_example
BOOT_IMAGE=/@/boot/vmlinuz-linux-lts root=UUID=5b9f7f7f-2477-488f-8fb1-52b5c7d90e98
rw rootflags=subvol=@ console=tty0 console=ttyS0,115200 rw loglevel=2
rd.systemd.show_status=auto rd.udev.log_level=2 nvme.noacpi=1
mem_sleep_default=deep nowatchdog random.trust_cpu=off quiet splash
#+end_example
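
A quick way to confirm which parameters are active on the running kernel is to
scan =/proc/cmdline=. The sketch below uses a hypothetical helper,
=cmdline_has=, and matches names literally (it does not treat =-= and =_= as
equivalent the way the kernel does).

#+begin_src bash
# Hypothetical helper: succeed if parameter name $1 appears (as "name" or
# "name=value") in the command line string $2 (default: the running kernel's).
cmdline_has() {
    local name="$1"
    local cmdline="${2:-$(cat /proc/cmdline)}"
    local word
    for word in $cmdline; do
        case $word in
            "$name"|"$name"=*) return 0 ;;
        esac
    done
    return 1
}

# Usage:
#   for p in amdgpu.pg_mask amdgpu.cwsr_enable mem_sleep_default; do
#       if cmdline_has "$p"; then echo "$p is set"; else echo "$p is not set"; fi
#   done
#+end_src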

* How to Undo When a Fixed Kernel Arrives

** Step 1: Verify the Fix is in the New Kernel

Check that the VPE DPM0 fix is present:

#+begin_src bash
# Check kernel version
uname -r

# Search for the fix in the changelog
# Look for "VPE" or "DPM0" or "vpe_idle" in the relevant changelog:
# https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-<version>

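# Or check the source directly (needs a kernel source tree at this path;
# errors are hidden, so no output can also mean the tree is missing):
grep -r "vpe_need_dpm0_at_power_down\|vpe_get_dpm_level" /usr/src/linux/drivers/gpu/drm/amd/ 2>/dev/null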
#+end_src
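
Since the fix was merged during the 6.18 cycle, a coarse first check is whether
the running kernel is 6.18 or newer. The sketch below relies on GNU
=sort -V=; =version_ge= is a hypothetical helper name. It says nothing about
backports to older series, which still need the changelog or source check.

#+begin_src bash
# Hypothetical helper: succeed if version $1 is >= version $2, using GNU
# sort's version ordering (the smaller of the two sorts first).
version_ge() {
    [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage:
#   if version_ge "$(uname -r)" 6.18; then
#       echo "6.18 or newer: mainline fix expected"
#   else
#       echo "older series: fix only present if backported"
#   fi
#+end_src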

Also verify that CWSR bugs are resolved (check Framework Community thread).

** Step 2: Unmask Suspend Targets

#+begin_src bash
sudo systemctl unmask sleep.target suspend.target hibernate.target hybrid-sleep.target
#+end_src

** Step 3: Test Suspend/Resume

#+begin_src bash
# Test a single suspend/resume cycle
sudo systemctl suspend

# If system resumes cleanly, test a few more times
# The original bug had ~8% failure rate, so test at least 20 cycles
#+end_src
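
How many clean cycles are enough depends on how much residual risk is
acceptable. Assuming each cycle fails independently with probability 0.08 (an
idealization of the observed rate), the chance that n cycles all pass while
the bug is still present is 0.92^n. The helpers below are a sketch with
hypothetical names; they only do this arithmetic. Under that assumption, 20
clean cycles still leave roughly a 19% chance of a false pass, and about 36
are needed for 95% confidence.

#+begin_src bash
# Hypothetical helpers. p = per-cycle failure probability (0.08 here).

# Chance (in percent) that n cycles all pass even though the bug is present.
false_pass_pct() {   # $1 = p, $2 = n
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.1f\n", 100 * (1 - p) ^ n }'
}

# Smallest n for which n clean cycles give at least the requested confidence.
cycles_needed() {    # $1 = p, $2 = confidence (e.g. 0.95)
    awk -v p="$1" -v c="$2" 'BEGIN {
        x = log(1 - c) / log(1 - p)
        n = int(x); if (n < x) n++   # round up to a whole cycle
        print n
    }'
}

# Usage:
#   false_pass_pct 0.08 20    # 18.9
#   cycles_needed 0.08 0.95   # 36
#+end_src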

** Step 4: If Kernel Parameters Were Applied

If =amdgpu.pg_mask=0= and =amdgpu.cwsr_enable=0= were added to GRUB, remove
them once the kernel fix is confirmed working:

#+begin_src bash
# Edit GRUB config
sudo vim /etc/default/grub
# Remove amdgpu.pg_mask=0 and amdgpu.cwsr_enable=0 from GRUB_CMDLINE_LINUX_DEFAULT

# Rebuild GRUB config
sudo grub-mkconfig -o /boot/grub/grub.cfg

# Reboot and test suspend
#+end_src

* Log Evidence (2026-01-27 Investigation)

** System Info

- Machine: Framework Desktop (AMD Ryzen AI Max 300 Series)
- Hostname: ratio
- Kernel: 6.12.67-1-lts
- Filesystem: btrfs RAID1 on 2x NVMe (nvme0n1p2 + nvme1n1p2)
- GPU: AMD Strix Halo (RDNA 3.5)

** Findings

- 13 boots between Jan 25-27, most ending in suspend then hard freeze
- Journal corruption on boots -3, -5, and -7 (unclean shutdown)
- =mc= (Midnight Commander) stuck in D state (uninterruptible I/O) when the
  kernel failed to freeze tasks for suspend, in the =io_schedule →
  folio_wait_bit_common → filemap_read= path
- Suspend freeze pattern: =PM: suspend entry (deep)= → =PM: suspend exit= →
  =PM: suspend entry (s2idle)= → no more logs → hard reboot required
- =mu= database corruption (error 121) from repeated unclean shutdowns
- btrfs device stats: zero errors on both NVMe drives
- No explicit BTRFS read-only event logged (freeze kills logging before it
  can be recorded)
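
The suspend freeze pattern above can be looked for across boots with a small
filter. This is a sketch: =ended_in_suspend= is a hypothetical helper, and it
assumes the journal for the boot in question is still readable, which may not
hold for the corrupted ones.

#+begin_src bash
# Hypothetical helper: given one boot's kernel log on stdin, succeed if the
# last "PM: suspend" message is an entry with no exit after it, i.e. the log
# went silent somewhere in the suspend/resume path.
ended_in_suspend() {
    local last
    last=$(grep -E 'PM: suspend (entry|exit)' | tail -n 1) || true
    case $last in
        *'PM: suspend entry'*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage: walk the last 13 boots and flag the ones matching the pattern.
#   for b in $(seq 1 13); do
#       journalctl -k -b "-$b" --no-pager | ended_in_suspend \
#           && echo "boot -$b: log ends in the suspend path"
#   done
#+end_src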