Diffstat (limited to 'assets')
-rw-r--r-- assets/2026-01-22-mkinitcpio-fixes-applied-detail.org | 194
-rw-r--r-- assets/2026-01-22-mkinitcpio-freeze-during-rebuild.org | 152
-rw-r--r-- assets/Donato Capitella-ROCm+Linux Support on Strix Halo: It's finally stable in 2026!.txt | 1
-rw-r--r-- assets/cogito-hardware-specs.txt | 17
4 files changed, 0 insertions, 364 deletions
diff --git a/assets/2026-01-22-mkinitcpio-fixes-applied-detail.org b/assets/2026-01-22-mkinitcpio-fixes-applied-detail.org
deleted file mode 100644
index 68c6f0e..0000000
--- a/assets/2026-01-22-mkinitcpio-fixes-applied-detail.org
+++ /dev/null
@@ -1,194 +0,0 @@
-#+TITLE: Detailed mkinitcpio Fixes Applied to ratio
-#+DATE: 2026-01-22
-
-* Overview
-
-This documents the exact fixes applied to ratio's mkinitcpio configuration to make it bootable. These fixes worked - the system booted successfully after applying them. The install-archzfs script needs to be updated to apply these configurations during installation.
-
-* Fix 1: /etc/mkinitcpio.conf HOOKS
-
-** Problem
-
-The HOOKS line was configured for a systemd-based initramfs without ZFS support.
-
-** Before (broken)
-#+begin_example
-HOOKS=(base systemd autodetect microcode modconf kms keyboard keymap sd-vconsole block filesystems fsck)
-#+end_example
-
-** After (working)
-#+begin_example
-HOOKS=(base udev autodetect microcode modconf kms keyboard keymap consolefont block zfs filesystems)
-#+end_example
-
-** Changes Explained
-
-| Removed | Added/Changed | Reason |
-|----------------+----------------+-----------------------------------------------------------|
-| systemd | udev | ZFS hook is busybox-based, incompatible with systemd init |
-| sd-vconsole | consolefont | sd-vconsole is systemd-specific; consolefont is busybox |
-| fsck | (removed) | fsck is for ext4/xfs, not needed for ZFS |
-| (missing) | zfs | Required to import ZFS pool and mount root at boot |
-
-** Command Used
-#+begin_src bash
-sed -i "s/^HOOKS=.*/HOOKS=(base udev autodetect microcode modconf kms keyboard keymap consolefont block zfs filesystems)/" /etc/mkinitcpio.conf
-#+end_src
-
-* Fix 2: Remove /etc/mkinitcpio.conf.d/archiso.conf
-
-** Problem
-
-The archzfs live ISO uses a drop-in config file at =/etc/mkinitcpio.conf.d/archiso.conf=. This file was not removed during installation, and it *overrides* the HOOKS setting in mkinitcpio.conf.
-
-** Contents of archiso.conf (should not exist on installed system)
-#+begin_example
-HOOKS=(base udev microcode modconf kms memdisk archiso archiso_loop_mnt archiso_pxe_common archiso_pxe_nbd archiso_pxe_http archiso_pxe_nfs block filesystems keyboard)
-COMPRESSION="xz"
-COMPRESSION_OPTIONS=(-9e)
-#+end_example
-
-** Why This Breaks Things
-
-Even if mkinitcpio.conf has the correct HOOKS, this drop-in file overrides them with archiso-specific hooks (memdisk, archiso, archiso_loop_mnt, etc.) that are only for the live ISO environment. The =zfs= hook is notably absent.
-
-** Fix Applied
-#+begin_src bash
-rm -f /etc/mkinitcpio.conf.d/archiso.conf
-#+end_src
-
-** Note for install-archzfs
-
-The script should remove this file after arch-chroot setup:
-#+begin_src bash
-rm -f /mnt/etc/mkinitcpio.conf.d/archiso.conf
-#+end_src
-
-* Fix 3: /etc/mkinitcpio.d/linux-lts.preset
-
-** Problem
-
-The preset file was still configured for the archiso live environment, not a normal installed system.
-
-** Before (broken)
-#+begin_example
-# mkinitcpio preset file for the 'linux-lts' package on archiso
-
-PRESETS=('archiso')
-
-ALL_kver='/boot/vmlinuz-linux-lts'
-archiso_config='/etc/mkinitcpio.conf.d/archiso.conf'
-
-archiso_image="/boot/initramfs-linux-lts.img"
-#+end_example
-
-** After (working)
-#+begin_example
-# mkinitcpio preset file for linux-lts
-
-PRESETS=(default fallback)
-
-ALL_kver="/boot/vmlinuz-linux-lts"
-
-default_image="/boot/initramfs-linux-lts.img"
-
-fallback_image="/boot/initramfs-linux-lts-fallback.img"
-fallback_options="-S autodetect"
-#+end_example
-
-** Changes Explained
-
-| Before | After | Reason |
-|---------------------------------+------------------------+-----------------------------------------------------|
-| PRESETS=('archiso') | PRESETS=(default fallback) | Normal system needs default + fallback images |
-| archiso_config=... (drop-in) | (removed) | Don't use archiso drop-in config |
-| archiso_image=... | default_image=... | Use standard naming |
-| (missing) | fallback_image=... | Fallback image for recovery |
-| (missing) | fallback_options="-S autodetect" | Fallback skips autodetect for broader hardware support |
-
-** Command Used
-#+begin_src bash
-cat > /etc/mkinitcpio.d/linux-lts.preset << 'EOF'
-# mkinitcpio preset file for linux-lts
-
-PRESETS=(default fallback)
-
-ALL_kver="/boot/vmlinuz-linux-lts"
-
-default_image="/boot/initramfs-linux-lts.img"
-
-fallback_image="/boot/initramfs-linux-lts-fallback.img"
-fallback_options="-S autodetect"
-EOF
-#+end_src
-
-* Fix 4: Rebuild initramfs
-
-After applying the above fixes, the initramfs must be rebuilt:
-
-#+begin_src bash
-mkinitcpio -P
-#+end_src
-
-This regenerates both default and fallback images with the correct hooks.
-
-* Verification
-
-** Verify HOOKS are correct
-#+begin_src bash
-grep "^HOOKS" /etc/mkinitcpio.conf
-# Should show: HOOKS=(base udev autodetect microcode modconf kms keyboard keymap consolefont block zfs filesystems)
-#+end_src
-
-** Verify no archiso drop-in
-#+begin_src bash
-ls /etc/mkinitcpio.conf.d/
-# Should be empty or not contain archiso.conf
-#+end_src
-
-** Verify preset is correct
-#+begin_src bash
-grep "PRESETS" /etc/mkinitcpio.d/linux-lts.preset
-# Should show: PRESETS=(default fallback)
-#+end_src
-
-** Verify ZFS hook is in initramfs
-#+begin_src bash
-lsinitcpio /boot/initramfs-linux-lts.img | grep -E "^hooks/zfs|zfs.ko"
-# Should show:
-# hooks/zfs
-# usr/lib/modules/.../zfs.ko.zst
-#+end_src
-
-* Summary for install-archzfs Script
-
-The script needs to add these steps after installing packages and before running final mkinitcpio:
-
-#+begin_src bash
-# 1. Set correct HOOKS for ZFS boot
-sed -i "s/^HOOKS=.*/HOOKS=(base udev autodetect microcode modconf kms keyboard keymap consolefont block zfs filesystems)/" /mnt/etc/mkinitcpio.conf
-
-# 2. Remove archiso drop-in config
-rm -f /mnt/etc/mkinitcpio.conf.d/archiso.conf
-
-# 3. Create proper preset file (adjust kernel name if not linux-lts)
-cat > /mnt/etc/mkinitcpio.d/linux-lts.preset << 'EOF'
-# mkinitcpio preset file for linux-lts
-
-PRESETS=(default fallback)
-
-ALL_kver="/boot/vmlinuz-linux-lts"
-
-default_image="/boot/initramfs-linux-lts.img"
-
-fallback_image="/boot/initramfs-linux-lts-fallback.img"
-fallback_options="-S autodetect"
-EOF
-
-# 4. Rebuild initramfs with correct config
-arch-chroot /mnt mkinitcpio -P
-#+end_src
-
-* Result
-
-After applying these fixes and rebuilding initramfs from the live ISO, ratio booted successfully. The system froze on a subsequent =mkinitcpio -P= run, but that's a separate AMD GPU issue (see 2026-01-22-mkinitcpio-freeze-during-rebuild.org), not a configuration problem.
diff --git a/assets/2026-01-22-mkinitcpio-freeze-during-rebuild.org b/assets/2026-01-22-mkinitcpio-freeze-during-rebuild.org
deleted file mode 100644
index 1132ddd..0000000
--- a/assets/2026-01-22-mkinitcpio-freeze-during-rebuild.org
+++ /dev/null
@@ -1,152 +0,0 @@
-#+TITLE: System freezes during mkinitcpio -P rebuild
-#+DATE: 2026-01-22
-
-* Problem Summary
-
-After fixing the mkinitcpio configuration issues (see 2026-01-22-mkinitcpio-config-boot-failure.org), the system successfully booted. However, running =mkinitcpio -P= again caused the system to freeze, requiring a power cycle.
-
-This indicates the mkinitcpio config fix was correct, but there's a separate issue causing freezes during initramfs rebuilds.
-
-* Timeline
-
-1. System wouldn't boot due to broken mkinitcpio config (wrong HOOKS, missing zfs)
-2. Booted from archzfs live ISO
-3. Fixed mkinitcpio.conf, preset file, removed archiso.conf drop-in
-4. Rebuilt initramfs via chroot - completed successfully
-5. Rebooted - system booted successfully
-6. Ran =mkinitcpio -P= again - system froze
-7. Had to power cycle, now back on live ISO
-
-* What This Tells Us
-
-The mkinitcpio configuration fix was correct (system booted). But something about running mkinitcpio itself is triggering a system freeze.
-
-* Suspected Cause: AMD GPU Power Gating Bug
-
-ratio has an AMD Strix Halo GPU (RDNA 3.5) with a known VPE power gating bug. When the VPE (Video Processing Engine) tries to power gate after 1 second of idle, the SMU hangs and the system freezes.
-
-Symptoms before freeze:
-#+begin_example
-amdgpu: SMU: I'm not done with your previous command
-amdgpu: Failed to power gate VPE!
-[drm:vpe_set_powergating_state] *ERROR* Dpm disable vpe failed, ret = -62
-#+end_example
-
-The fix is to disable power gating via =/etc/modprobe.d/amdgpu.conf=:
-#+begin_example
-options amdgpu pg_mask=0
-#+end_example
-
-*CRITICAL*: After creating this file, you must run =mkinitcpio -P= to include it in the initramfs (the modconf hook reads /etc/modprobe.d/ at build time).
-
-* The Chicken-and-Egg Problem
-
-1. Need to run =mkinitcpio -P= to apply the GPU fix (include amdgpu.conf in initramfs)
-2. But running =mkinitcpio -P= triggers the GPU freeze
-3. The fix can't be applied because applying it causes the problem it's meant to fix
-
-* Possible Solutions to Investigate
-
-** Option 1: Apply GPU fix at runtime before mkinitcpio
-
-Before running mkinitcpio, manually set pg_mask at runtime:
-#+begin_src bash
-echo 0 | sudo tee /sys/module/amdgpu/parameters/pg_mask
-#+end_src
-
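-Then run mkinitcpio while power gating is disabled. This might prevent the freeze, but only if the parameter is writable and honored at runtime; many module parameters under /sys/module are read-only or are applied only at driver load, in which case this write fails or has no effect.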
-
-** Option 2: Build initramfs from live ISO
-
-Boot from archzfs live ISO (which doesn't have the GPU issue), mount the system, and rebuild initramfs from there. The live ISO uses a different GPU driver state.
-
-We tried this and it worked - the rebuild completed. But then running mkinitcpio on the booted system froze.
-
-** Option 3: Add amdgpu.conf before rebuilding from live ISO
-
-When rebuilding from live ISO:
-1. Create /etc/modprobe.d/amdgpu.conf with pg_mask=0
-2. Rebuild initramfs
-3. Boot - now the GPU fix should be in effect
-4. Future mkinitcpio runs might not freeze
-
-This might work because the initramfs would load with power gating disabled from the start.
-
-** Option 4: Wait for kernel 6.18+
-
-The upstream fix (VPE_IDLE_TIMEOUT increased from 1s to 2s) is in kernel 6.15+. When linux-lts reaches 6.18, the workaround won't be needed.
-
-Current: linux-lts 6.12.66
-Target: linux-lts 6.18
-
-* Current State of ratio
-
-- Booted to archzfs live ISO
-- ZFS pool: zroot (mirror of nvme0n1p2 + nvme1n1p2)
-- mkinitcpio.conf: FIXED (has correct HOOKS with zfs)
-- /etc/mkinitcpio.conf.d/archiso.conf: REMOVED
-- /etc/mkinitcpio.d/linux-lts.preset: FIXED
-- /etc/modprobe.d/amdgpu.conf: EXISTS but may not be in initramfs
-- Current pg_mask value on booted system: Unknown (need to check after boot)
-
-* Verification Commands
-
-Check if GPU fix is active:
-#+begin_src bash
-cat /sys/module/amdgpu/parameters/pg_mask
-# Should return: 0
-# If returns 4294967295 (0xFFFFFFFF), fix is NOT active
-#+end_src
-
-Check if amdgpu.conf is in initramfs:
-#+begin_src bash
-lsinitcpio /boot/initramfs-linux-lts.img | grep amdgpu
-#+end_src
-
-* Recovery Procedure (Option 3 - recommended)
-
-From archzfs live ISO:
-
-#+begin_src bash
-# Import the pool under /mnt (assumed altroot, as in install-archzfs) and mount root and boot
-zpool import -f -R /mnt zroot
-zfs mount zroot/ROOT/default
-mount /dev/nvme0n1p1 /mnt/boot
-
-# Ensure GPU fix file exists on the installed system (not on the live ISO)
-cat > /mnt/etc/modprobe.d/amdgpu.conf << 'EOF'
-# Disable power gating to prevent VPE freeze on Strix Halo GPUs
-# Remove this file when linux-lts reaches 6.18+
-options amdgpu pg_mask=0
-EOF
-
-# Mount system directories for chroot
-mount --rbind /dev /mnt/dev
-mount --rbind /sys /mnt/sys
-mount --rbind /proc /mnt/proc
-mount --rbind /run /mnt/run
-
-# Rebuild initramfs (should include amdgpu.conf via modconf hook)
-chroot /mnt mkinitcpio -P
-
-# Verify amdgpu.conf is in initramfs
-lsinitcpio /mnt/boot/initramfs-linux-lts.img | grep amdgpu
-
-# Reboot and test
-reboot
-#+end_src
-
-After reboot, verify pg_mask=0 is active, then test =mkinitcpio -P= again.
-
-* Related Files
-
-- [[file:2026-01-22-mkinitcpio-config-boot-failure.org]] - The config fix that was applied
-- archsetup NOTES.org - AMD GPU freeze diagnosis details
-
-* Machine Details
-
-- Machine: ratio (desktop)
-- CPU: AMD (Strix Halo)
-- GPU: AMD RDNA 3.5 (integrated)
-- Storage: Two NVMe in ZFS mirror
-- Kernel: linux-lts 6.12.66-1
diff --git a/assets/Donato Capitella-ROCm+Linux Support on Strix Halo: It's finally stable in 2026!.txt b/assets/Donato Capitella-ROCm+Linux Support on Strix Halo: It's finally stable in 2026!.txt
deleted file mode 100644
index 322893a..0000000
--- a/assets/Donato Capitella-ROCm+Linux Support on Strix Halo: It's finally stable in 2026!.txt
+++ /dev/null
@@ -1 +0,0 @@
-Speaker A: In this video I want to give you an update on the current state of Linux support for Strix Halo. I know that most of my recent viewers either have a device with this AMD APU or are thinking of buying one. As a reminder, this is the integrated GPU inside the AMD Ryzen AI Max, and it's codenamed gfx1151. Now, over the last two months there's been a lot of confusion, broken setups and contradictory advice, and this video is meant to clarify what changed and what actually works. Now, if you've been trying to run LLMs, ComfyUI or other ROCm-based AI workflows on Strix Halo and things broke depending on your distribution, kernel and ROCm version, this wasn't user error. The software stack itself was inconsistent, and only recently has it started to converge again to something stable. Now, if you just want a working system without digging into the details, here's what you have to do. Use linux-firmware 20260110 or newer; avoid 20251125, as that firmware is broken for ROCm on Strix Halo. Use Linux kernel 6.18.4 or newer. Use my toolboxes that have the ROCm nightly builds from TheRock or, alternatively, the ROCm 7.2 builds once they are officially released. This is currently the only combination that includes the full stability fixes for gfx1151. Importantly, if you try to run older versions of ROCm on newer kernels, these won't work. If you want more details about what's been happening, keep watching this video, as essentially we had two major unrelated issues plaguing this iGPU. Before moving on, the usual ask: the research that goes into these videos is fun to do but also time-consuming. I'd really appreciate it if you could take a second to support the channel in all the usual ways, like subscribing, liking and commenting on the video. Your support does make a difference. Back in November, AMD pushed a Linux firmware update that got bundled into linux-firmware 20251125 and quickly made its way into major Linux distributions like Fedora. 
Unfortunately, that firmware completely broke ROCm support on Strix Halo. ROCm would simply fail to initialize and became unusable. AMD reverted that firmware fairly quickly, but several distributions never picked up the revert. Fedora is the most obvious example. For roughly two months, a fully up-to-date Fedora system simply could not run ROCm on Strix Halo. I find the reluctance from the Fedora maintainers to push a fix for this hard to understand. This wasn't a corner case or an obscure configuration issue. It completely broke a flagship AMD platform for a whole cluster of users doing GPU compute. The only workaround during that period was to manually downgrade linux-firmware to 20251111, which was the last known working version. I documented this downgrade process, and the link is in the description, and a lot of people ended up having to do that in order to run ROCm on their Strix Halo systems. But finally, in January 2026, a new firmware release, linux-firmware 20260110, started landing in mainstream distributions. That version restored ROCm functionality without requiring a downgrade. So from a firmware perspective, this specific regression is now resolved if you update your system. However, that firmware update only addressed the immediate ROCm regression; it did not fix the underlying stability problems, which were caused by a separate issue elsewhere in the stack. Typical symptoms were GPU kernel crashes and resets, causing AI workflows to fail randomly, and ComfyUI is a good example here. It's the de facto standard software used for image and video generation, and it really stresses the GPU. On Strix Halo it would often work briefly and then fall over, which exposed the ROCm stability issues we've been talking about. This is also why I didn't focus much on ComfyUI in my earlier video on image and video generation. 
At that time it simply wasn't stable enough on Strix Halo to recommend. AMD finally identified the underlying issue causing all this trouble. The fix turned out to require changes in two places at the same time: the AMD GPU driver in the Linux kernel and ROCm itself. The core problem was a mismatch in how hardware resource limits were defined and communicated for gfx1151, specifically around something called VGPRs, vector general-purpose registers. For Strix Halo, the actual VGPR capacity is significantly higher than what ROCm had been assuming. All the ROCm versions were effectively using the wrong register limits for this GPU. This led to GPU kernels being scheduled with invalid assumptions about available registers. The result wasn't a clean failure; it was undefined behavior, often resulting in HIP kernel hangs and eventually GPU resets. AMD addressed this by changing both sides of the stack. The important point is that both sides must agree. If the kernel thinks more registers are available but ROCm still assumes the old limits, or vice versa, the runtime ends up scheduling work that doesn't line up with what is expected by the hardware, which leads to failures. These fixes landed in mainline Linux starting with kernel 6.18.4, with matching changes in ROCm, and this is the key point: the kernel fix and the ROCm fix must be used together. This is where most of the current confusion comes from. If you run kernel 6.18.4 or newer with an older ROCm version, for example 6.4.4 or 7.1.1, things will break immediately. This isn't a regression in those ROCm versions; it's a compatibility mismatch. The kernel now expects ROCm to behave differently. Older ROCm builds don't know about these changes, so the stack crashes. The first ROCm release that properly matches these kernel changes will be ROCm 7.2, but at the time of recording 7.2 hasn't been officially released yet. 
That's why, if you are on a newer kernel today, you need to use ROCm nightly builds from TheRock, which already include this fix. All of my current toolboxes provide this option. The table on screen now summarizes what combinations actually work on gfx1151 and which ones are known to break. This is the part most people trip over. Right now there are two validated configurations. The first is the new kernel path, which means kernel 6.18.4 or newer, ROCm builds that already include the fixes (the nightly builds from TheRock), and toolboxes built against these ROCm nightly builds. The second configuration is the old ROCm compatibility path, which means ROCm versions 6.4 and 7.1 with kernel 6.18.3 or older. Mixing these paths does not work. That's the key takeaway. If you update your kernel but keep using older ROCm toolboxes, you will hit crashes. If you want to stay on older ROCm versions for benchmarking or comparison, you must also stay on the older kernel. This is Donato from the future. As I was editing this video, I realized I want to make an additional point. It is perfectly possible that in the next few months AMD decides to cherry-pick this stability patch back into older branches of ROCm. So we might have, for example, a 6.4.5 release which includes this fix, and the same might happen for 7.1: we might have a 7.1.2. I don't know this, but it is possible. Likewise, distributions like Fedora build ROCm from scratch, and this particular patch is incredibly easy to cherry-pick and backport, and I think that Fedora is currently looking at backporting this particular patch. So, long story short, it might become possible to use older versions of ROCm with newer kernels pretty soon. Up to now, I've kept multiple toolboxes around using different ROCm versions. That was intentional. ROCm performance on gfx1151 has been quite inconsistent, and in some cases older versions were genuinely faster. 
But with the kernel fixes in place and with AMD now clearly moving forward with Strix Halo as a supported AI platform, especially after the Ryzen AI Halo announcement at CES 2026, it no longer makes sense to anchor on old stacks. Performance improvements will eventually land in the latest ROCm versions, even if there are still some regressions today due to a mix of ROCm and, for example, llama.cpp changes. These are all being worked on as we speak, so over time I'll be retiring the older ROCm toolboxes and focusing on the latest stack. As a result, I can now release a Strix Halo toolbox focused on ComfyUI with proper benchmarks and stability, similar to what I did with the Radeon R9700. That toolbox is ready, and a dedicated video on ComfyUI performance and workflows on Strix Halo is coming next.
diff --git a/assets/cogito-hardware-specs.txt b/assets/cogito-hardware-specs.txt
deleted file mode 100644
index 7a3b285..0000000
--- a/assets/cogito-hardware-specs.txt
+++ /dev/null
@@ -1,17 +0,0 @@
-Framework Desktop ML (cogito) Hardware Specifications
-
-Model: AMD Ryzen AI MAX+ 395 (Strix Halo)
-CPU: 16 cores Zen 5
-GPU: Radeon 8060S (40 RDNA 3.5 CUs)
-Total RAM: 128GB unified memory
-Max VRAM: 96GB (via AMD Variable Graphics Memory)
-NPU: XDNA 2, 50+ peak AI TOPS
-GPU Arch: gfx1151
-Peak Perf: 59.4 FP16/BF16 TFLOPS @ 2.9GHz
-
-Inference Performance (AMD testing with LM Studio 0.3.11 / llama.cpp 1.18):
-- Small models (1-3B): ~100+ tokens/sec
-- Medium models (7-8B): ~60-80 tokens/sec
-- Large models (20B): ~58 tokens/sec
-- Very large models (120B): ~38 tokens/sec
-- Context support: Up to 256K tokens with Flash Attention