#+TITLE: System freezes during mkinitcpio -P rebuild
#+DATE: 2026-01-22
* Problem Summary
After fixing the mkinitcpio configuration issues (see 2026-01-22-mkinitcpio-config-boot-failure.org), the system successfully booted. However, running =mkinitcpio -P= again caused the system to freeze, requiring a power cycle.
This indicates the mkinitcpio config fix was correct, but there's a separate issue causing freezes during initramfs rebuilds.
* Timeline
1. System wouldn't boot due to broken mkinitcpio config (wrong HOOKS, missing zfs)
2. Booted from archzfs live ISO
3. Fixed mkinitcpio.conf, preset file, removed archiso.conf drop-in
4. Rebuilt initramfs via chroot - completed successfully
5. Rebooted - system booted successfully
6. Ran =mkinitcpio -P= again - system froze
7. Had to power cycle, now back on live ISO
* What This Tells Us
The mkinitcpio configuration fix was correct (system booted). But something about running mkinitcpio itself is triggering a system freeze.
* Suspected Cause: AMD GPU Power Gating Bug
ratio has an AMD Strix Halo GPU (RDNA 3.5) with a known VPE power gating bug. When the VPE (Video Processing Engine) tries to power gate after 1 second of idle, the SMU hangs and the system freezes.
Symptoms before freeze:
#+begin_example
amdgpu: SMU: I'm not done with your previous command
amdgpu: Failed to power gate VPE!
[drm:vpe_set_powergating_state] *ERROR* Dpm disable vpe failed, ret = -62
#+end_example
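To confirm that a given freeze was preceded by these messages, the previous boot's kernel log can be filtered for them. This is a minimal sketch, assuming the journal is persistent (otherwise nothing from the previous boot is kept) and that the last messages were flushed to disk before the hang. The helper name =vpe_symptoms= is made up for these notes.
#+begin_src bash
# Filter kernel log text on stdin for the VPE/SMU messages listed above.
# vpe_symptoms is a hypothetical helper name, not part of any tool.
vpe_symptoms() {
    grep -E "SMU: I'm not done with your previous command|Failed to power gate VPE|vpe_set_powergating_state"
}

# Usage (previous boot, kernel messages only):
#   journalctl -k -b -1 --no-pager | vpe_symptoms
#+end_src
No output means the messages were not logged, which does not rule out the bug: a hard freeze can lose the tail of the journal.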
The fix is to disable power gating via =/etc/modprobe.d/amdgpu.conf=:
#+begin_example
options amdgpu pg_mask=0
#+end_example
*CRITICAL*: After creating this file, you must run =mkinitcpio -P= so that it is included in the initramfs (the =modconf= hook reads =/etc/modprobe.d/= at build time).
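Since the fix only reaches the initramfs through the =modconf= hook, it is worth checking that the hook is actually listed in HOOKS. A rough sketch follows; =has_hook= is a hypothetical helper, it only understands a single-line =HOOKS=(...)= array, and it checks only the file it is given (drop-ins under =/etc/mkinitcpio.conf.d/= can override HOOKS and would need the same check).
#+begin_src bash
# Succeed if HOOK is listed in the last uncommented HOOKS=(...) line of CONF.
# has_hook is a hypothetical helper; multi-line HOOKS arrays are not handled.
has_hook() {
    local hook=$1 conf=$2
    grep -E '^[[:space:]]*HOOKS=' "$conf" | tail -n 1 \
        | sed -E 's/^[[:space:]]*HOOKS=//' \
        | tr -s '() ' '\n\n\n' \
        | grep -Fx -- "$hook" > /dev/null
}

# Usage:
#   has_hook modconf /etc/mkinitcpio.conf && echo "modconf is enabled"
#+end_src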
* The Chicken-and-Egg Problem
1. Need to run =mkinitcpio -P= to apply the GPU fix (include amdgpu.conf in initramfs)
2. But running =mkinitcpio -P= triggers the GPU freeze
3. The fix can't be applied because applying it causes the problem it's meant to fix
* Possible Solutions to Investigate
** Option 1: Apply GPU fix at runtime before mkinitcpio
Before running mkinitcpio, manually set pg_mask at runtime:
#+begin_src bash
echo 0 | sudo tee /sys/module/amdgpu/parameters/pg_mask
#+end_src
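Then run mkinitcpio while power gating is disabled. This might prevent the freeze. Caveat: many amdgpu module parameters are exported read-only in sysfs, in which case this write is rejected and the value can only be set at module load time.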
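A quick way to tell whether the runtime write can work at all is to look at the parameter file's mode bits before trying. The sketch below assumes GNU =stat=; =param_is_writable= is a hypothetical helper. It inspects the mode rather than using =test -w=, because =test -w= run as root can report a file as writable even when its mode has no write bit.
#+begin_src bash
# Succeed if any write bit is set in the mode of the given file (GNU stat).
# param_is_writable is a hypothetical helper name.
param_is_writable() {
    case "$(stat -c '%A' "$1" 2>/dev/null)" in
        *w*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Usage:
#   f=/sys/module/amdgpu/parameters/pg_mask
#   if param_is_writable "$f"; then echo 0 | sudo tee "$f"; else echo "$f is read-only"; fi
#+end_src
If the file turns out to be read-only, Option 1 is not available and the parameter has to be set when the module loads.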
** Option 2: Build initramfs from live ISO
Boot from archzfs live ISO (which doesn't have the GPU issue), mount the system, and rebuild initramfs from there. The live ISO uses a different GPU driver state.
We tried this and it worked - the rebuild completed. But then running mkinitcpio on the booted system froze.
** Option 3: Add amdgpu.conf before rebuilding from live ISO
When rebuilding from live ISO:
1. Create /etc/modprobe.d/amdgpu.conf with pg_mask=0
2. Rebuild initramfs
3. Boot - now the GPU fix should be in effect
4. Future mkinitcpio runs might not freeze
This might work because the initramfs would load with power gating disabled from the start.
** Option 4: Wait for kernel 6.18+
The upstream fix (VPE_IDLE_TIMEOUT increased from 1s to 2s) is in kernel 6.15+. When linux-lts reaches 6.18, the workaround won't be needed.
Current: linux-lts 6.12.66
Target: linux-lts 6.18
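To decide later whether the workaround can be dropped, the running kernel version can be compared against the target. A small sketch, assuming GNU =sort= (for =-V= version ordering); =version_ge= is a hypothetical helper and 6.18 is simply the target named above.
#+begin_src bash
# Succeed if version $1 is greater than or equal to version $2.
# Relies on sort -V: the smaller of the two versions sorts first.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage (strips the package suffix, e.g. 6.12.66-1-lts -> 6.12.66):
#   version_ge "$(uname -r | cut -d- -f1)" 6.18 && echo "workaround may no longer be needed"
#+end_src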
* Current State of ratio
- Booted to archzfs live ISO
- ZFS pool: zroot (mirror of nvme0n1p2 + nvme1n1p2)
- mkinitcpio.conf: FIXED (has correct HOOKS with zfs)
- /etc/mkinitcpio.conf.d/archiso.conf: REMOVED
- /etc/mkinitcpio.d/linux-lts.preset: FIXED
- /etc/modprobe.d/amdgpu.conf: EXISTS but may not be in initramfs
- Current pg_mask value on booted system: Unknown (need to check after boot)
* Verification Commands
Check if GPU fix is active:
#+begin_src bash
cat /sys/module/amdgpu/parameters/pg_mask
# Should return: 0
# If returns 4294967295 (0xFFFFFFFF), fix is NOT active
#+end_src
Check if amdgpu.conf is in initramfs:
#+begin_src bash
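# Match the config file itself; a bare "amdgpu" can also match the driver module and firmware
lsinitcpio /boot/initramfs-linux-lts.img | grep 'modprobe.d/amdgpu.conf'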
#+end_src
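Finding the file in the image does not show what it contains. One way to check the value is to extract the image into a scratch directory and read the =options= line back. A sketch under those assumptions: =modprobe_option= is a hypothetical helper that parses modprobe.d-style text on stdin.
#+begin_src bash
# Print the value of PARAM from "options MODULE ..." lines on stdin.
# Exits non-zero if the parameter is not set. Hypothetical helper name.
modprobe_option() {
    local module=$1 param=$2
    awk -v m="$module" -v p="$param" '
        $1 == "options" && $2 == m {
            for (i = 3; i <= NF; i++) {
                n = index($i, "=")
                if (n > 0 && substr($i, 1, n - 1) == p) val = substr($i, n + 1)
            }
        }
        END { if (val != "") print val; else exit 1 }
    '
}

# Usage (lsinitcpio -x extracts into the current directory):
#   cd "$(mktemp -d)" && lsinitcpio -x /boot/initramfs-linux-lts.img
#   modprobe_option amdgpu pg_mask < etc/modprobe.d/amdgpu.conf
#+end_src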
* Recovery Procedure (Option 3 - recommended)
From archzfs live ISO:
#+begin_src bash
# Import the pool under an alternate root and mount the system
# (/mnt is an assumed mount point; adjust if the system is mounted elsewhere)
zpool import -f -R /mnt zroot
zfs mount zroot/ROOT/default
mount /dev/nvme0n1p1 /mnt/boot
# Ensure GPU fix file exists on the installed system, not on the live ISO
cat > /mnt/etc/modprobe.d/amdgpu.conf << 'EOF'
# Disable power gating to prevent VPE freeze on Strix Halo GPUs
# Remove this file when linux-lts reaches 6.18+
options amdgpu pg_mask=0
EOF
# Mount system directories for chroot
mount --rbind /dev /mnt/dev
mount --rbind /sys /mnt/sys
mount --rbind /proc /mnt/proc
mount --rbind /run /mnt/run
# Rebuild initramfs (should include amdgpu.conf via modconf hook)
chroot /mnt mkinitcpio -P
# Verify amdgpu.conf is in initramfs
chroot /mnt lsinitcpio /boot/initramfs-linux-lts.img | grep 'modprobe.d/amdgpu.conf'
# Reboot and test
reboot
#+end_src
After reboot, verify pg_mask=0 is active, then test =mkinitcpio -P= again.
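To avoid repeating the freeze by accident, the test run can be wrapped in a guard that refuses to start unless the parameter reads 0. This is only a sketch: =run_if_pg_disabled= is a hypothetical helper, and it takes the parameter file as an argument so the check can be tried against an ordinary file first.
#+begin_src bash
# Run a command only if the given parameter file contains exactly 0.
# run_if_pg_disabled is a hypothetical helper name.
run_if_pg_disabled() {
    local file=$1; shift
    local value
    value=$(cat "$file" 2>/dev/null) || { echo "cannot read $file" >&2; return 2; }
    if [ "$value" != "0" ]; then
        echo "pg_mask is $value, not 0; refusing to run: $*" >&2
        return 1
    fi
    "$@"
}

# Usage:
#   run_if_pg_disabled /sys/module/amdgpu/parameters/pg_mask sudo mkinitcpio -P
#+end_src
A pg_mask of 0 only shows that the workaround is loaded; whether it actually prevents the freeze is what the test run is meant to find out.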
* Related Files
- [[file:2026-01-22-mkinitcpio-config-boot-failure.org]] - The config fix that was applied
- archsetup NOTES.org - AMD GPU freeze diagnosis details
* Machine Details
- Machine: ratio (desktop)
- CPU: AMD (Strix Halo)
- GPU: AMD RDNA 3.5 (integrated)
- Storage: Two NVMe in ZFS mirror
- Kernel: linux-lts 6.12.66-1