#+TITLE: Ratio Boot Fix Session - 2026-01-22
#+DATE: 2026-01-22
#+AUTHOR: Craig Jennings + Claude
* Summary
Successfully diagnosed and fixed boot failures on ratio (Framework Desktop with AMD Ryzen AI Max 300 / Strix Halo GPU). The primary issue was outdated/missing linux-firmware causing the amdgpu driver to hang during boot.
* Hardware
- Machine: Framework Desktop
- CPU/GPU: AMD Ryzen AI Max 300 series (APU codenamed Strix Halo; GPU architecture GFX 1151)
- Storage: 2x NVMe in ZFS mirror (zroot)
- Installed via: install-archzfs script from this project
* Initial Symptoms
1. System froze at "triggering udev events..." during boot
2. Only visible message before freeze: "RDSEED32 is broken" (red herring - just a warning)
3. Freeze occurred with both linux-lts (6.12.66) and linux (6.15.2) kernels
4. Blacklisting amdgpu allowed boot to proceed but caused kernel panic (no display = init killed)
* Root Cause
The linux-firmware package was either missing or outdated. Specifically:
- linux-firmware 20251125 is known to break AMD Strix Halo (GFX 1151)
- linux-firmware 20260110 contains fixes for Strix Halo stability
Source: Donato Capitella video "ROCm+Linux Support on Strix Halo: It's finally stable in 2026!"
- Firmware 20251125 completely broke ROCm/GPU on Strix Halo
- Firmware 20260110+ restores functionality
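Whether an installed firmware version falls on the fixed side of that boundary can be checked by comparing the date stamp in the package version. The following is a minimal sketch, assuming the version string begins with a YYYYMMDD date as in the versions above. =firmware_is_fixed= and =check_installed_firmware= are hypothetical helper names, not part of any existing tool.

#+BEGIN_SRC bash
# Minimum linux-firmware date stamp reported to work on Strix Halo (from these notes).
MIN_FIRMWARE_DATE=20260110

# Return 0 if a version string such as "20260110-1" is at or past the minimum.
firmware_is_fixed() {
    local version=$1
    local date_part=${version%%[!0-9]*}   # keep only the leading digits
    [ -n "$date_part" ] && [ "$date_part" -ge "$MIN_FIRMWARE_DATE" ]
}

# Ask pacman for the installed version and report the result.
check_installed_firmware() {
    local version
    version=$(pacman -Q linux-firmware 2>/dev/null | awk '{print $2}')
    if [ -z "$version" ]; then
        echo "linux-firmware is not installed"
        return 1
    fi
    if firmware_is_fixed "$version"; then
        echo "linux-firmware $version: OK"
    else
        echo "linux-firmware $version: older than $MIN_FIRMWARE_DATE"
        return 1
    fi
}
#+END_SRC

Running =check_installed_firmware= from the installed system or a chroot also catches the "package not installed" case seen in Phase 4.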
* Troubleshooting Timeline
** Phase 1: Initial Diagnosis
- SSH'd to ratio via archzfs live ISO
- Found mkinitcpio configuration issues (separate bug)
- Fixed mkinitcpio.conf HOOKS and removed archiso.conf drop-in
- System still froze after fixes
** Phase 2: Kernel Investigation
- Researched AMD Strix Halo issues on Framework community forums
- Found reports of VPE (Video Processing Engine) idle timeout bug
- Attempted kernel parameter workarounds:
  - amdgpu.pg_mask=0 (disable power gating) - didn't help
  - amdgpu.cwsr_enable=0 (disable compute wavefront save/restore) - not tested
- Installed kernel 6.15.2 from Arch Linux Archive (has VPE fix)
- Installed matching zfs-linux package for 6.15.2
- System still froze
** Phase 3: ZFS Rollback Complications
- Rolled back to pre-kernel-switch ZFS snapshot
- Discovered /boot is NOT on ZFS (EFI partition)
- Rollback caused mismatch: root filesystem rolled back, but /boot kept newer kernels
- Kernel 6.15.2 panicked because its modules didn't exist on rolled-back root
- Documented this as a fundamental limitation of ZFS-on-root when /boot lives outside the pool (see todo.org)
** Phase 4: Firmware Discovery
- Found video transcript explaining Strix Halo firmware requirements
- Discovered linux-firmware package was not installed (orphaned files from rollback)
- Repo had linux-firmware 20260110-1 (the fixed version)
- Installed linux-firmware 20260110-1
** Phase 5: Boot Success with Secondary Issues
After firmware install, encountered additional issues:
1. *Hostid mismatch*: Pool showed "previously in use from another system"
   - Fix: Clean export from live ISO (zpool export zroot)
2. *ZFS mountpoint=legacy*: Root dataset had legacy mountpoint from chroot work
   - Fix: zfs set mountpoint=/ zroot/ROOT/default
3. *ZFS mountpoints with /mnt prefix*: All child datasets had /mnt prefix
   - Cause: zpool import -R /mnt persisted mountpoint changes
   - Fix: Reset all mountpoints (zfs set mountpoint=/home zroot/home, etc.)
* Final Working Configuration
#+BEGIN_SRC
Kernel: linux-lts 6.12.66-1-lts
Firmware: linux-firmware 20260110-1
ZFS: zfs-linux-lts (DKMS built for 6.12.66)
Boot: GRUB with spl.spl_hostid=0x564478f3
#+END_SRC
* Key Learnings
** 1. Firmware Matters for AMD APUs
The linux-firmware package is critical for AMD integrated GPUs. Strix Halo specifically requires firmware 20260110 or newer. The kernel version (6.12 vs 6.15) was less important than having correct firmware.
** 2. ZFS Rollback + Separate /boot = Danger
When /boot is on a separate EFI partition (not ZFS):
- ZFS rollback doesn't affect /boot
- Kernel images remain at newer version
- Modules on root get rolled back
- Result: Boot failure or kernel panic
Solutions:
- Use ZFSBootMenu (stores kernel/initramfs on ZFS)
- Put /boot on ZFS (GRUB can read it)
- Always rebuild initramfs after rollback
- Sync /boot backups with ZFS snapshots
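The mismatch described above can be detected before rebooting. The sketch below relies on the Arch convention that each module tree contains a =pkgbase= file naming its package and that the image is installed as =/boot/vmlinuz-<pkgbase>=. =kernels_without_modules= is a hypothetical helper name. It checks only that a module tree exists for each image, not that the versions match exactly.

#+BEGIN_SRC bash
# List kernel images in BOOT_DIR that have no module tree under MODULES_DIR.
kernels_without_modules() {
    local boot_dir=${1:-/boot} modules_dir=${2:-/usr/lib/modules}
    local image name pkgbase_file found
    for image in "$boot_dir"/vmlinuz-*; do
        [ -e "$image" ] || continue              # glob matched nothing
        name=${image##*/vmlinuz-}                # e.g. "linux" or "linux-lts"
        found=no
        for pkgbase_file in "$modules_dir"/*/pkgbase; do
            [ -e "$pkgbase_file" ] || continue
            if [ "$(cat "$pkgbase_file")" = "$name" ]; then
                found=yes
                break
            fi
        done
        [ "$found" = yes ] || echo "$name"       # image with no modules on root
    done
    return 0
}
#+END_SRC

Running =kernels_without_modules= with no arguments right after a rollback prints the package name of any kernel in /boot whose modules are missing from the rolled-back root. Empty output means every image has a module tree.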
** 3. zpool import -R Persists Mountpoints
Using =zpool import -R /mnt= for chroot work can permanently change dataset mountpoints. The -R flag sets altroot, but if you then modify datasets, those changes persist with the /mnt prefix.
Fix after chroot work:
#+BEGIN_SRC bash
zfs set mountpoint=/ zroot/ROOT/default
zfs set mountpoint=/home zroot/home
# ... etc for all datasets
#+END_SRC
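Resetting each dataset by hand is error-prone when there are many of them. As a rough sketch, the prefix can be stripped programmatically and the resulting commands printed for review before anything is changed. =strip_altroot_prefix= and =print_mountpoint_fixes= are hypothetical helper names, and the pool name =zroot= and prefix =/mnt= are defaults taken from this session.

#+BEGIN_SRC bash
# Remove a leading altroot prefix from a mountpoint: /mnt/home -> /home, /mnt -> /.
strip_altroot_prefix() {
    local prefix=${1%/} mountpoint=$2
    case "$mountpoint" in
        "$prefix")   echo "/" ;;
        "$prefix"/*) echo "${mountpoint#"$prefix"}" ;;
        *)           echo "$mountpoint" ;;       # no prefix (or legacy/none): unchanged
    esac
}

# Print, but do not run, the zfs commands that would fix every prefixed dataset.
print_mountpoint_fixes() {
    local pool=${1:-zroot} prefix=${2:-/mnt} name mountpoint fixed
    zfs list -H -o name,mountpoint -r "$pool" |
    while IFS=$'\t' read -r name mountpoint; do
        fixed=$(strip_altroot_prefix "$prefix" "$mountpoint")
        if [ "$fixed" != "$mountpoint" ]; then
            echo "zfs set mountpoint=$fixed $name"
        fi
    done
    return 0
}
#+END_SRC

Printing the commands instead of running them keeps the step reviewable. The output can be piped to =sh= once it looks right.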
** 4. Hostid Consistency Required
ZFS pools track which system last accessed them. If hostid changes (e.g., between live ISO and installed system), import fails with "pool was previously in use from another system."
Solutions:
- Clean export before switching systems (zpool export)
- Force import (zpool import -f)
- Ensure consistent hostid via /etc/hostid and spl.spl_hostid kernel parameter
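Consistency between the two hostid sources can be checked from the running system. The sketch below assumes the kernel parameter is written in hexadecimal, as in this session's GRUB entry. =cmdline_hostid= and =check_hostid= are hypothetical helper names.

#+BEGIN_SRC bash
# Extract spl.spl_hostid from a kernel command line as 8 lowercase hex digits.
cmdline_hostid() {
    local args arg val
    read -ra args <<< "$1"
    for arg in "${args[@]}"; do
        case "$arg" in
            spl.spl_hostid=*)
                val=${arg#spl.spl_hostid=}
                val=${val#0x}                     # drop optional 0x prefix
                printf '%08x\n' "$((16#$val))"    # assumes a hexadecimal value
                return 0
                ;;
        esac
    done
    return 1                                      # parameter not present
}

# Compare the booted kernel's parameter with the hostid reported by the system.
check_hostid() {
    local expected actual
    expected=$(cmdline_hostid "$(cat /proc/cmdline)") || {
        echo "spl.spl_hostid is not set on the kernel command line"
        return 1
    }
    actual=$(hostid)
    if [ "$expected" = "$actual" ]; then
        echo "hostid OK ($actual)"
    else
        echo "hostid mismatch: cmdline=$expected system=$actual"
        return 1
    fi
}
#+END_SRC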
* Files Modified on Ratio
- /etc/mkinitcpio.conf - Fixed HOOKS
- /etc/mkinitcpio.conf.d/archiso.conf - Removed (was overriding HOOKS)
- /etc/default/grub - GRUB_TIMEOUT=5 (was 0)
- /boot/grub/grub.cfg - Regenerated, added TEST label to mainline kernel
- /etc/hostid - Regenerated to match GRUB hostid parameter
- ZFS dataset mountpoints - Reset from /mnt/* to /*
* Packages Installed
- linux-firmware 20260110-1 (critical fix)
- linux 6.15.2 + zfs-linux (available as TEST kernel, not needed for boot)
- Various system packages updated during troubleshooting
* Resources Referenced
** Framework Community Posts
- https://community.frame.work/t/attn-critical-bugs-in-amdgpu-driver-included-with-kernel-6-18-x-6-19-x/79221
- https://community.frame.work/t/fyi-linux-firmware-amdgpu-20251125-breaks-rocm-on-ai-max-395-8060s/78554
- https://github.com/FrameworkComputer/SoftwareFirmwareIssueTracker/issues/162
** Donato Capitella Video
- Title: "ROCm+Linux Support on Strix Halo: It's finally stable in 2026!"
- Key info: Firmware 20260110+ required, kernel 6.18.4+ for ROCm stability
- Transcript saved: assets/Donato Capitella-ROCm+Linux Support on Strix Halo...txt
** Other
- Arch Linux Archive (for kernel 6.15.2 package)
- Jeff Geerling blog (VRAM allocation on AMD APUs)
* TODO Items Created
Added to todo.org:
- [#A] Fix ZFS rollback breaking boot (/boot not on ZFS)
- Links to existing [#A] Integrate ZFSBootMenu task
* Test Kernel Available
The TEST kernel (linux 6.15.2) is installed and available in the GRUB Advanced menu. It has matching zfs-linux modules and should work if needed. The mainline kernel may be useful for:
- ROCm/AI workloads (combined with ROCm 7.2+ when released)
- Future GPU stability improvements
- Testing newer kernel features
Current recommendation: Use linux-lts for stability, TEST kernel for experimentation.
* Post-Fix Configuration (Phase 6)
After boot was working, made kernel 6.15.2 the default with a clean GRUB menu.
** Made 6.15.2 Default
1. Created custom GRUB script /etc/grub.d/09_custom_kernels
2. Added clean menu entries:
   - "Linux 6.15.2 (default)"
   - "Linux LTS 6.12.66 (fallback)"
3. Set GRUB_DEFAULT="linux-6.15.2"
** Pinned Kernel
Added to the [options] section of /etc/pacman.conf:
#+BEGIN_SRC
IgnorePkg = linux
#+END_SRC
This prevents pacman from upgrading the linux package until it is manually unpinned.
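To confirm that a pin is actually in effect, the active =IgnorePkg= lines can be parsed directly. This is a minimal sketch that matches exact package names only and ignores glob patterns and included config files. =is_pinned= is a hypothetical helper name.

#+BEGIN_SRC bash
# Return 0 if PKG appears on an uncommented IgnorePkg line in CONF.
is_pinned() {
    local conf=$1 pkg=$2 line names name
    while IFS= read -r line; do
        case "$line" in
            IgnorePkg*=*) ;;                 # active directive
            *) continue ;;                   # comments and everything else
        esac
        read -ra names <<< "${line#*=}"      # split the value on whitespace
        for name in "${names[@]}"; do
            [ "$name" = "$pkg" ] && return 0
        done
    done < "$conf"
    return 1
}
#+END_SRC

For example, =is_pinned /etc/pacman.conf linux && echo pinned= reports whether the kernel package is held back. The =pacman-conf IgnorePkg= command gives pacman's own view of the same setting.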
** GRUB UUID Issue
Initial custom script used wrong boot partition UUID:
- nvme0n1p1: 6A4B-47A4 (wrong - got this from lsblk on first NVMe)
- nvme1n1p1: 6A4A-93B1 (correct - actually mounted at /boot)
Fix: Updated /etc/grub.d/09_custom_kernels to use 6A4A-93B1
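With two NVMe drives each carrying an EFI partition, it is safer to read the UUID from the mounted filesystem than from a device listing. The sketch below compares the custom GRUB script against whatever is mounted at /boot. =script_uses_uuid= and =check_boot_uuid= are hypothetical helper names.

#+BEGIN_SRC bash
# Return 0 if the GRUB script references the given filesystem UUID.
script_uses_uuid() {
    local script=$1 uuid=$2
    grep -qF -- "$uuid" "$script"
}

# Compare the script against the filesystem actually mounted at /boot.
check_boot_uuid() {
    local script=${1:-/etc/grub.d/09_custom_kernels} uuid
    uuid=$(findmnt -n -o UUID /boot) || return 1
    if script_uses_uuid "$script" "$uuid"; then
        echo "$script references the mounted /boot UUID ($uuid)"
    else
        echo "$script does not reference $uuid"
        return 1
    fi
}
#+END_SRC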
** Final GRUB Menu
#+BEGIN_SRC
1. Linux 6.15.2 (default) <- Boots automatically
2. Linux LTS 6.12.66 (fallback)
3. Arch Linux (ZFS) Linux <- Auto-generated (ignored)
4. Advanced options...
5. UEFI Firmware Settings
6. ZFS Snapshots
#+END_SRC
** SSH Keys
Configured SSH key authentication for cjennings@ratio.local to simplify remote access.
Password auth (sshpass) was unreliable from Claude's session.
* Final System State
#+BEGIN_SRC
Hostname: ratio.local
Kernel: 6.15.2-arch1-1 (default)
Fallback: 6.12.66-1-lts
Firmware: linux-firmware 20260110-1
ZFS: All pools healthy, 11 datasets mounted
Boot: Custom GRUB menu with clean entries
Kernel pinned: Yes (IgnorePkg = linux)
#+END_SRC
* When to Unpin Kernel
Unpin the linux package once linux-lts reaches version 6.15 or later:
#+BEGIN_SRC bash
sudo sed -i 's/^IgnorePkg = linux/#IgnorePkg = linux/' /etc/pacman.conf
#+END_SRC
Then optionally switch back to LTS as default for stability.