author Craig Jennings <c@cjennings.net> 2026-01-22 14:27:49 -0600
committer Craig Jennings <c@cjennings.net> 2026-01-22 14:27:49 -0600
commit 0720a543d0eacf890ec99a6a5b337c85f896d647 (patch)
tree b8a40b3f3a02e1631f1e92c2b2207a558445d560 /assets/Donato Capitella-ROCm+Linux Support on Strix Halo: It's finally stable in 2026!.txt
parent 50a5f78c5a7be0f5e3d630efb10cd23902549667 (diff)
Fix ratio boot issues: firmware, mkinitcpio, and document ZFS rollback dangers
Root cause: Missing/outdated linux-firmware broke AMD Strix Halo GPU init.
Fixed by installing linux-firmware 20260110-1.

Changes:
- install-archzfs: Fix mkinitcpio config (remove archiso.conf, fix preset)
- todo.org: Add ZFS rollback + /boot mismatch issue, recommend ZFSBootMenu
- docs/2026-01-22-ratio-boot-fix-session.org: Full troubleshooting session
- docs/2026-01-22-mkinitcpio-config-boot-failure.org: Bug report
- assets/: Supporting documentation and video transcript

Key learnings:
- AMD Strix Halo requires linux-firmware 20260110+
- ZFS rollback with /boot on EFI partition can break boot
- zpool import -R can permanently change mountpoints
Diffstat (limited to 'assets/Donato Capitella-ROCm+Linux Support on Strix Halo: It's finally stable in 2026!.txt')
-rw-r--r-- assets/Donato Capitella-ROCm+Linux Support on Strix Halo: It's finally stable in 2026!.txt | 1
1 files changed, 1 insertions, 0 deletions
diff --git a/assets/Donato Capitella-ROCm+Linux Support on Strix Halo: It's finally stable in 2026!.txt b/assets/Donato Capitella-ROCm+Linux Support on Strix Halo: It's finally stable in 2026!.txt
new file mode 100644
index 0000000..322893a
--- /dev/null
+++ b/assets/Donato Capitella-ROCm+Linux Support on Strix Halo: It's finally stable in 2026!.txt
@@ -0,0 +1 @@
+Speaker A: In this video I want to give you an update on the current state of Linux support for Strix Halo. I know that most of my recent viewers either have a device with this AMD APU or are thinking of buying one. As a reminder, this is the integrated GPU inside the AMD Ryzen AI Max, and it's codenamed gfx1151. Over the last two months there's been a lot of confusion, broken setups and contradictory advice, and this video is meant to clarify what changed and what actually works. If you've been trying to run LLMs, ComfyUI or other ROCm-based AI workflows on Strix Halo and things broke depending on your distribution, kernel and ROCm version, this wasn't user error. The software stack itself was inconsistent, and only recently has it started to converge again on something stable. If you just want a working system without digging into the details, here's what you have to do. Use linux-firmware 20260110 or newer, and avoid 20251125: that firmware is broken for ROCm on Strix Halo. Use Linux kernel 6.18.4 or newer. Use my toolboxes that have the ROCm nightly builds from TheRock or, alternatively, the ROCm 7.2 builds once they are officially released. This is currently the only combination that includes the full stability fixes for gfx1151. Importantly, if you try to run older versions of ROCm on newer kernels, they won't work. If you want more details about what's been happening, keep watching, as essentially we had two major unrelated issues plaguing this iGPU. Before moving on, the usual ask: the research that goes into these videos is fun to do but also time consuming. I'd really appreciate it if you could take a second to support the channel in all the usual ways, like subscribing, liking and commenting on the video. Your support does make a difference. Back in November, AMD pushed a firmware update that got bundled into linux-firmware 20251125 and quickly made its way into major Linux distributions like Fedora. 
Unfortunately, that firmware completely broke ROCm support on Strix Halo. ROCm would simply fail to initialize and became unusable. AMD reverted that firmware fairly quickly, but several distributions never picked up the revert. Fedora is the most obvious example. For roughly two months a fully up-to-date Fedora system simply could not run ROCm on Strix Halo. I find the reluctance from the Fedora maintainers to push a fix for this hard to understand. This wasn't a corner case or an obscure configuration issue. It completely broke a flagship AMD platform for a whole cluster of users doing GPU compute. The only workaround during that period was to manually downgrade linux-firmware to 20251111, which was the last known working version. I documented this downgrade process, the link is in the description, and a lot of people ended up having to do that in order to run ROCm on their Strix Halo systems. But finally, in January 2026, a new firmware release, linux-firmware 20260110, started landing in mainstream distributions. That version restored ROCm functionality without requiring a downgrade. So from a firmware perspective this specific regression is now resolved if you update your system. However, that firmware update only addressed the immediate ROCm regression. It did not fix the underlying stability problems, which were caused by a separate issue elsewhere in the stack. Typical symptoms were GPU kernel crashes and resets, causing AI workflows to fail randomly, and ComfyUI is a good example here. It's the de facto standard software used for image and video generation and it really stresses the GPU. On Strix Halo it would often work briefly and then fall over, which exposed the ROCm stability issues we've been talking about. This is also why I didn't focus much on ComfyUI in my earlier video on image and video generation. 
At that time it simply wasn't stable enough on Strix Halo to recommend. AMD finally identified the underlying issue causing all this trouble. The fix turned out to require changes in two places at the same time: the amdgpu driver in the Linux kernel and ROCm itself. The core problem was a mismatch in how hardware resource limits were defined and communicated for gfx1151, specifically around something called VGPRs, vector general-purpose registers. For Strix Halo, the actual VGPR capacity is significantly higher than what ROCm had been assuming. All the ROCm versions were effectively using the wrong register limits for this GPU. This led to GPU kernels being scheduled with invalid assumptions about available registers. The result wasn't a clean failure, it was undefined behavior, often resulting in HIP kernel hangs and eventually GPU resets. AMD addressed this by changing both sides of the stack. The important point is that both sides must agree. If the kernel thinks more registers are available but ROCm still assumes the old limits, or vice versa, the runtime ends up scheduling work that doesn't line up with what the hardware expects, which leads to failures. These fixes landed in mainline Linux starting with kernel 6.18.4, with matching changes in ROCm, and this is the key point: the kernel fix and the ROCm fix must be used together. This is where most of the current confusion comes from. If you run kernel 6.18.4 or newer with an older ROCm version, for example 6.4.4 or 7.1.1, things will break immediately. This isn't a regression in those ROCm versions, it's a compatibility mismatch. The kernel now expects ROCm to behave differently. Older ROCm builds don't know about these changes, so the stack crashes. The first ROCm release that properly matches these kernel changes will be ROCm 7.2, but at the time of recording 7.2 hasn't been officially released yet. 
That's why, if you are on a newer kernel today, you need to use the ROCm nightly builds from TheRock, which already include this fix. All of my current toolboxes provide this option. The table on screen now summarizes which combinations actually work on gfx1151 and which ones are known to break. This is the part most people trip over. Right now there are two valid configurations. The first is the new kernel path, which means kernel 6.18.4 or newer, ROCm builds that already include the fixes (the nightly builds from TheRock), and toolboxes built against these ROCm nightly builds. The second is the old ROCm compatibility path, which means ROCm versions 6.4 and 7.1 with kernel 6.18.3 or older. Mixing these paths does not work. That's the key takeaway. If you update your kernel but keep using older ROCm toolboxes, you will hit crashes. If you want to stay on older ROCm versions for benchmarking or comparison, you must also stay on the older kernel. This is Donato from the future. As I edited this video, I realized I want to make an additional point. It is perfectly possible that in the next few months AMD decides to cherry-pick this stability patch back into older branches of ROCm. So we might have, for example, a 6.4.5 release which includes this fix, and the same might happen for 7.1: we might have a 7.1.2. I don't know this, but it is possible. Likewise, distributions like Fedora build ROCm from scratch, and this particular patch is incredibly easy to cherry-pick and backport. I think Fedora is currently looking at backporting this particular patch. So, long story short, it might become possible to use older versions of ROCm with newer kernels pretty soon. Up to now, I've kept multiple toolboxes around using different ROCm versions. That was intentional. ROCm performance on gfx1151 has been quite inconsistent, and in some cases older versions were genuinely faster. 
But with the kernel fixes in place and with AMD now clearly moving forward with Strix Halo as a supported AI platform, especially after the Ryzen AI Halo announcement at CES 2026, it no longer makes sense to anchor on old stacks. Performance improvements will eventually land in the latest ROCm versions, even if there are still some regressions today due to a mix of ROCm and, for example, llama.cpp changes. These are all being worked on as we speak, so over time I'll be retiring the older ROCm toolboxes and focusing on the latest stack. As a result, I can now release a Strix Halo toolbox focused on ComfyUI with proper benchmarks and stability, similar to what I did with the Radeon R9700. That toolbox is ready, and a dedicated video on ComfyUI performance and workflows on Strix Halo is coming next.
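
The version guidance in the transcript above can be sketched as a small compatibility check. This is only an illustration in Python, not part of the commit or of any ROCm tooling. The function names (`firmware_status`, `rocm_has_fix`, `stack_ok`) are hypothetical, and the thresholds are the ones stated in the video. Two points are assumptions rather than statements from the speaker: firmware dates strictly between 20251111 and 20260110 (other than 20251125) are reported as "unknown", and an old kernel paired with a fixed ROCm build is treated as broken, inferred from the remark that mixing the two paths does not work.

```python
# Illustrative sketch only. Thresholds are taken from the transcript; every
# name below is hypothetical and not part of ROCm, the kernel, or the toolboxes.

BROKEN_FIRMWARE = "20251125"    # linux-firmware release that broke ROCm on gfx1151
FIXED_FIRMWARE = "20260110"     # first release reported to restore ROCm
LAST_GOOD_FIRMWARE = "20251111" # last known working release before the regression
FIXED_KERNEL = (6, 18, 4)       # first kernel carrying the VGPR limit fix


def parse_version(text):
    """Turn '6.18.4' or '6.18.4-arch1-1' into a tuple of ints such as (6, 18, 4)."""
    core = text.split("-")[0]
    return tuple(int(part) for part in core.split("."))


def firmware_status(firmware):
    """Classify a linux-firmware version such as '20260110-1'.

    The versions are YYYYMMDD dates, so plain string comparison orders them.
    """
    date = firmware.split("-")[0]
    if date == BROKEN_FIRMWARE:
        return "broken"
    if date >= FIXED_FIRMWARE or date <= LAST_GOOD_FIRMWARE:
        return "ok"
    # Assumption: the video says nothing about releases in between.
    return "unknown"


def rocm_has_fix(rocm):
    """True for nightly builds or ROCm 7.2 and newer, per the transcript."""
    if rocm == "nightly":
        return True
    return parse_version(rocm) >= (7, 2)


def stack_ok(kernel, rocm):
    """Kernel and ROCm must agree: both carry the fix, or neither does.

    Assumption: old kernel + fixed ROCm is treated as broken, inferred from
    "mixing these paths does not work".
    """
    kernel_fixed = parse_version(kernel) >= FIXED_KERNEL
    return kernel_fixed == rocm_has_fix(rocm)


if __name__ == "__main__":
    print(firmware_status("20260110-1"))
    print(stack_ok("6.18.4", "nightly"))
    print(stack_ok("6.18.4", "7.1.1"))
```

If AMD or a distribution backports the patch into an older ROCm branch, as the speaker suggests might happen, `rocm_has_fix` would need to be adjusted to recognize those builds.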