From 8eb7eb600bc9e709b6ee754a337289382bbcea4f Mon Sep 17 00:00:00 2001 From: Craig Jennings Date: Thu, 22 Jan 2026 14:27:49 -0600 Subject: Fix ratio boot issues: firmware, mkinitcpio, and document ZFS rollback dangers Root cause: Missing/outdated linux-firmware broke AMD Strix Halo GPU init. Fixed by installing linux-firmware 20260110-1. Changes: - install-archzfs: Fix mkinitcpio config (remove archiso.conf, fix preset) - todo.org: Add ZFS rollback + /boot mismatch issue, recommend ZFSBootMenu - docs/2026-01-22-ratio-boot-fix-session.org: Full troubleshooting session - docs/2026-01-22-mkinitcpio-config-boot-failure.org: Bug report - assets/: Supporting documentation and video transcript Key learnings: - AMD Strix Halo requires linux-firmware 20260110+ - ZFS rollback with /boot on EFI partition can break boot - zpool import -R can permanently change mountpoints --- .../2026-01-22-mkinitcpio-fixes-applied-detail.org | 194 +++++++++++++++++++++ ...2026-01-22-mkinitcpio-freeze-during-rebuild.org | 152 ++++++++++++++++ ...o\357\274\232 It's finally stable in 2026!.txt" | 1 + custom/install-archzfs | 26 +++ docs/2026-01-22-mkinitcpio-config-boot-failure.org | 159 +++++++++++++++++ docs/2026-01-22-ratio-boot-fix-session.org | 175 +++++++++++++++++++ todo.org | 87 +++++++++ 7 files changed, 794 insertions(+) create mode 100644 assets/2026-01-22-mkinitcpio-fixes-applied-detail.org create mode 100644 assets/2026-01-22-mkinitcpio-freeze-during-rebuild.org create mode 100644 "assets/Donato Capitella-ROCm+Linux Support on Strix Halo\357\274\232 It's finally stable in 2026!.txt" create mode 100644 docs/2026-01-22-mkinitcpio-config-boot-failure.org create mode 100644 docs/2026-01-22-ratio-boot-fix-session.org diff --git a/assets/2026-01-22-mkinitcpio-fixes-applied-detail.org b/assets/2026-01-22-mkinitcpio-fixes-applied-detail.org new file mode 100644 index 0000000..68c6f0e --- /dev/null +++ b/assets/2026-01-22-mkinitcpio-fixes-applied-detail.org @@ -0,0 +1,194 @@ +#+TITLE: 
Detailed mkinitcpio Fixes Applied to ratio +#+DATE: 2026-01-22 + +* Overview + +This documents the exact fixes applied to ratio's mkinitcpio configuration to make it bootable. These fixes worked - the system booted successfully after applying them. The install-archzfs script needs to be updated to apply these configurations during installation. + +* Fix 1: /etc/mkinitcpio.conf HOOKS + +** Problem + +The HOOKS line was configured for a systemd-based initramfs without ZFS support. + +** Before (broken) +#+begin_example +HOOKS=(base systemd autodetect microcode modconf kms keyboard keymap sd-vconsole block filesystems fsck) +#+end_example + +** After (working) +#+begin_example +HOOKS=(base udev autodetect microcode modconf kms keyboard keymap consolefont block zfs filesystems) +#+end_example + +** Changes Explained + +| Removed | Added/Changed | Reason | +|----------------+----------------+-----------------------------------------------------------| +| systemd | udev | ZFS hook is busybox-based, incompatible with systemd init | +| sd-vconsole | consolefont | sd-vconsole is systemd-specific; consolefont is busybox | +| fsck | (removed) | fsck is for ext4/xfs, not needed for ZFS | +| (missing) | zfs | Required to import ZFS pool and mount root at boot | + +** Command Used +#+begin_src bash +sed -i "s/^HOOKS=.*/HOOKS=(base udev autodetect microcode modconf kms keyboard keymap consolefont block zfs filesystems)/" /etc/mkinitcpio.conf +#+end_src + +* Fix 2: Remove /etc/mkinitcpio.conf.d/archiso.conf + +** Problem + +The archzfs live ISO uses a drop-in config file at =/etc/mkinitcpio.conf.d/archiso.conf=. This file was not removed during installation, and it *overrides* the HOOKS setting in mkinitcpio.conf. 
+ +** Contents of archiso.conf (should not exist on installed system) +#+begin_example +HOOKS=(base udev microcode modconf kms memdisk archiso archiso_loop_mnt archiso_pxe_common archiso_pxe_nbd archiso_pxe_http archiso_pxe_nfs block filesystems keyboard) +COMPRESSION="xz" +COMPRESSION_OPTIONS=(-9e) +#+end_example + +** Why This Breaks Things + +Even if mkinitcpio.conf has the correct HOOKS, this drop-in file overrides them with archiso-specific hooks (memdisk, archiso, archiso_loop_mnt, etc.) that are only for the live ISO environment. The =zfs= hook is notably absent. + +** Fix Applied +#+begin_src bash +rm -f /etc/mkinitcpio.conf.d/archiso.conf +#+end_src + +** Note for install-archzfs + +The script should remove this file after arch-chroot setup: +#+begin_src bash +rm -f /mnt/etc/mkinitcpio.conf.d/archiso.conf +#+end_src + +* Fix 3: /etc/mkinitcpio.d/linux-lts.preset + +** Problem + +The preset file was still configured for the archiso live environment, not a normal installed system. + +** Before (broken) +#+begin_example +# mkinitcpio preset file for the 'linux-lts' package on archiso + +PRESETS=('archiso') + +ALL_kver='/boot/vmlinuz-linux-lts' +archiso_config='/etc/mkinitcpio.conf.d/archiso.conf' + +archiso_image="/boot/initramfs-linux-lts.img" +#+end_example + +** After (working) +#+begin_example +# mkinitcpio preset file for linux-lts + +PRESETS=(default fallback) + +ALL_kver="/boot/vmlinuz-linux-lts" + +default_image="/boot/initramfs-linux-lts.img" + +fallback_image="/boot/initramfs-linux-lts-fallback.img" +fallback_options="-S autodetect" +#+end_example + +** Changes Explained + +| Before | After | Reason | +|---------------------------------+------------------------+-----------------------------------------------------| +| PRESETS=('archiso') | PRESETS=(default fallback) | Normal system needs default + fallback images | +| archiso_config=... (drop-in) | (removed) | Don't use archiso drop-in config | +| archiso_image=... | default_image=... 
| Use standard naming | +| (missing) | fallback_image=... | Fallback image for recovery | +| (missing) | fallback_options="-S autodetect" | Fallback skips autodetect for broader hardware support | + +** Command Used +#+begin_src bash +cat > /etc/mkinitcpio.d/linux-lts.preset << 'EOF' +# mkinitcpio preset file for linux-lts + +PRESETS=(default fallback) + +ALL_kver="/boot/vmlinuz-linux-lts" + +default_image="/boot/initramfs-linux-lts.img" + +fallback_image="/boot/initramfs-linux-lts-fallback.img" +fallback_options="-S autodetect" +EOF +#+end_src + +* Fix 4: Rebuild initramfs + +After applying the above fixes, the initramfs must be rebuilt: + +#+begin_src bash +mkinitcpio -P +#+end_src + +This regenerates both default and fallback images with the correct hooks. + +* Verification + +** Verify HOOKS are correct +#+begin_src bash +grep "^HOOKS" /etc/mkinitcpio.conf +# Should show: HOOKS=(base udev autodetect microcode modconf kms keyboard keymap consolefont block zfs filesystems) +#+end_src + +** Verify no archiso drop-in +#+begin_src bash +ls /etc/mkinitcpio.conf.d/ +# Should be empty or not contain archiso.conf +#+end_src + +** Verify preset is correct +#+begin_src bash +grep "PRESETS" /etc/mkinitcpio.d/linux-lts.preset +# Should show: PRESETS=(default fallback) +#+end_src + +** Verify ZFS hook is in initramfs +#+begin_src bash +lsinitcpio /boot/initramfs-linux-lts.img | grep -E "^hooks/zfs|zfs.ko" +# Should show: +# hooks/zfs +# usr/lib/modules/.../zfs.ko.zst +#+end_src + +* Summary for install-archzfs Script + +The script needs to add these steps after installing packages and before running final mkinitcpio: + +#+begin_src bash +# 1. Set correct HOOKS for ZFS boot +sed -i "s/^HOOKS=.*/HOOKS=(base udev autodetect microcode modconf kms keyboard keymap consolefont block zfs filesystems)/" /mnt/etc/mkinitcpio.conf + +# 2. Remove archiso drop-in config +rm -f /mnt/etc/mkinitcpio.conf.d/archiso.conf + +# 3. 
Create proper preset file (adjust kernel name if not linux-lts) +cat > /mnt/etc/mkinitcpio.d/linux-lts.preset << 'EOF' +# mkinitcpio preset file for linux-lts + +PRESETS=(default fallback) + +ALL_kver="/boot/vmlinuz-linux-lts" + +default_image="/boot/initramfs-linux-lts.img" + +fallback_image="/boot/initramfs-linux-lts-fallback.img" +fallback_options="-S autodetect" +EOF + +# 4. Rebuild initramfs with correct config +arch-chroot /mnt mkinitcpio -P +#+end_src + +* Result + +After applying these fixes and rebuilding initramfs from the live ISO, ratio booted successfully. The system froze on a subsequent =mkinitcpio -P= run, but that's a separate AMD GPU issue (see 2026-01-22-mkinitcpio-freeze-during-rebuild.org), not a configuration problem. diff --git a/assets/2026-01-22-mkinitcpio-freeze-during-rebuild.org b/assets/2026-01-22-mkinitcpio-freeze-during-rebuild.org new file mode 100644 index 0000000..1132ddd --- /dev/null +++ b/assets/2026-01-22-mkinitcpio-freeze-during-rebuild.org @@ -0,0 +1,152 @@ +#+TITLE: System freezes during mkinitcpio -P rebuild +#+DATE: 2026-01-22 + +* Problem Summary + +After fixing the mkinitcpio configuration issues (see 2026-01-22-mkinitcpio-config-boot-failure.org), the system successfully booted. However, running =mkinitcpio -P= again caused the system to freeze, requiring a power cycle. + +This indicates the mkinitcpio config fix was correct, but there's a separate issue causing freezes during initramfs rebuilds. + +* Timeline + +1. System wouldn't boot due to broken mkinitcpio config (wrong HOOKS, missing zfs) +2. Booted from archzfs live ISO +3. Fixed mkinitcpio.conf, preset file, removed archiso.conf drop-in +4. Rebuilt initramfs via chroot - completed successfully +5. Rebooted - system booted successfully +6. Ran =mkinitcpio -P= again - system froze +7. Had to power cycle, now back on live ISO + +* What This Tells Us + +The mkinitcpio configuration fix was correct (system booted). 
But something about running mkinitcpio itself is triggering a system freeze. + +* Suspected Cause: AMD GPU Power Gating Bug + +ratio has an AMD Strix Halo GPU (RDNA 3.5) with a known VPE power gating bug. When the VPE (Video Processing Engine) tries to power gate after 1 second of idle, the SMU hangs and the system freezes. + +Symptoms before freeze: +#+begin_example +amdgpu: SMU: I'm not done with your previous command +amdgpu: Failed to power gate VPE! +[drm:vpe_set_powergating_state] *ERROR* Dpm disable vpe failed, ret = -62 +#+end_example + +The fix is to disable power gating via =/etc/modprobe.d/amdgpu.conf=: +#+begin_example +options amdgpu pg_mask=0 +#+end_example + +*CRITICAL*: After creating this file, must run =mkinitcpio -P= to include it in initramfs (the modconf hook reads /etc/modprobe.d/ at build time). + +* The Chicken-and-Egg Problem + +1. Need to run =mkinitcpio -P= to apply the GPU fix (include amdgpu.conf in initramfs) +2. But running =mkinitcpio -P= triggers the GPU freeze +3. The fix can't be applied because applying it causes the problem it's meant to fix + +* Possible Solutions to Investigate + +** Option 1: Apply GPU fix at runtime before mkinitcpio + +Before running mkinitcpio, manually set pg_mask at runtime: +#+begin_src bash +echo 0 | sudo tee /sys/module/amdgpu/parameters/pg_mask +#+end_src + +Then run mkinitcpio while power gating is disabled. This might prevent the freeze. + +** Option 2: Build initramfs from live ISO + +Boot from archzfs live ISO (which doesn't have the GPU issue), mount the system, and rebuild initramfs from there. The live ISO uses a different GPU driver state. + +We tried this and it worked - the rebuild completed. But then running mkinitcpio on the booted system froze. + +** Option 3: Add amdgpu.conf before rebuilding from live ISO + +When rebuilding from live ISO: +1. Create /etc/modprobe.d/amdgpu.conf with pg_mask=0 +2. Rebuild initramfs +3. Boot - now the GPU fix should be in effect +4. 
Future mkinitcpio runs might not freeze + +This might work because the initramfs would load with power gating disabled from the start. + +** Option 4: Wait for kernel 6.18+ + +The upstream fix (VPE_IDLE_TIMEOUT increased from 1s to 2s) is in kernel 6.15+. When linux-lts reaches 6.18, the workaround won't be needed. + +Current: linux-lts 6.12.66 +Target: linux-lts 6.18 + +* Current State of ratio + +- Booted to archzfs live ISO +- ZFS pool: zroot (mirror of nvme0n1p2 + nvme1n1p2) +- mkinitcpio.conf: FIXED (has correct HOOKS with zfs) +- /etc/mkinitcpio.conf.d/archiso.conf: REMOVED +- /etc/mkinitcpio.d/linux-lts.preset: FIXED +- /etc/modprobe.d/amdgpu.conf: EXISTS but may not be in initramfs +- Current pg_mask value on booted system: Unknown (need to check after boot) + +* Verification Commands + +Check if GPU fix is active: +#+begin_src bash +cat /sys/module/amdgpu/parameters/pg_mask +# Should return: 0 +# If returns 4294967295 (0xFFFFFFFF), fix is NOT active +#+end_src + +Check if amdgpu.conf is in initramfs: +#+begin_src bash +lsinitcpio /boot/initramfs-linux-lts.img | grep amdgpu +#+end_src + +* Recovery Procedure (Option 3 - recommended) + +From archzfs live ISO: + +#+begin_src bash +# Import and mount ZFS under /mnt (altroot) +zpool import -f -R /mnt zroot +zfs mount zroot/ROOT/default +mount /dev/nvme0n1p1 /mnt/boot + +# Ensure GPU fix file exists +cat > /mnt/etc/modprobe.d/amdgpu.conf << 'EOF' +# Disable power gating to prevent VPE freeze on Strix Halo GPUs +# Remove this file when linux-lts reaches 6.18+ +options amdgpu pg_mask=0 +EOF + +# Mount system directories for chroot +mount --rbind /dev /mnt/dev +mount --rbind /sys /mnt/sys +mount --rbind /proc /mnt/proc +mount --rbind /run /mnt/run + +# Rebuild initramfs (should include amdgpu.conf via modconf hook) +chroot /mnt mkinitcpio -P + +# Verify amdgpu.conf is in initramfs +lsinitcpio /mnt/boot/initramfs-linux-lts.img | grep amdgpu + +# Reboot and test +reboot +#+end_src + +After reboot, verify pg_mask=0 is active, then test =mkinitcpio -P= again. 
+ +* Related Files + +- [[file:2026-01-22-mkinitcpio-config-boot-failure.org]] - The config fix that was applied +- archsetup NOTES.org - AMD GPU freeze diagnosis details + +* Machine Details + +- Machine: ratio (desktop) +- CPU: AMD (Strix Halo) +- GPU: AMD RDNA 3.5 (integrated) +- Storage: Two NVMe in ZFS mirror +- Kernel: linux-lts 6.12.66-1 diff --git "a/assets/Donato Capitella-ROCm+Linux Support on Strix Halo\357\274\232 It's finally stable in 2026!.txt" "b/assets/Donato Capitella-ROCm+Linux Support on Strix Halo\357\274\232 It's finally stable in 2026!.txt" new file mode 100644 index 0000000..322893a --- /dev/null +++ "b/assets/Donato Capitella-ROCm+Linux Support on Strix Halo\357\274\232 It's finally stable in 2026!.txt" @@ -0,0 +1 @@ +Speaker A: In this video I want to give you an update on the current state of Linux support for Strix Halo. I know that most of my recent viewers either have a device with this AMD APU or are thinking of buying one. As a reminder, this is the integrated GPU inside AMD Ryzen AI Max, and it's codenamed GFX 1151. Now, over the last two months there's been a lot of confusion, broken setups and contradictory advice, and this video is meant to clarify what changed and what actually works. Now, if you've been trying to run LLMs, ComfyUI or other ROCm-based AI workflows on Strix Halo and things broke depending on your distribution, kernel and ROCm version, this wasn't user error. The software stack itself was inconsistent, and only recently has it started to converge again on something stable. Now, if you just want a working system without digging into the details, here's what you have to do. Use linux-firmware 20260110 or newer; avoid 20251125, as that firmware is broken for ROCm on Strix Halo. Use Linux kernel 6.18.4 or newer. Use my toolboxes that have the ROCm nightly builds from TheRock, or alternatively the ROCm 7.2 builds once they are officially released. 
This is currently the only combination that includes the full stability fixes for GFX 1151. Importantly, if you try to run older versions of ROCm on newer kernels, these won't work. If you want more details about what's been happening, keep watching this video, as essentially we had two major unrelated issues plaguing this iGPU. Before moving on, the usual ask: the research that goes into these videos is fun to do but also time consuming. I'd really appreciate it if you could take a second to support the channel in all the usual ways, like subscribing, liking and commenting on the video. Your support does make a difference. Back in November, AMD pushed a Linux firmware update that got bundled into linux-firmware 20251125 and quickly made its way into major Linux distributions like Fedora. Unfortunately, that firmware completely broke ROCm support on Strix Halo. ROCm would simply fail to initialize and became unusable. AMD reverted that firmware fairly quickly, but several distributions never picked up the revert. Fedora is the most obvious example. For roughly two months a fully up-to-date Fedora system simply could not run ROCm on Strix Halo. I find the reluctance from the Fedora maintainers to push a fix for this hard to understand. This wasn't a corner case or an obscure configuration issue. It completely broke a flagship AMD platform for a whole cluster of users doing GPU compute. The only workaround during that period was to manually downgrade linux-firmware to 20251111, which was the last known working version. I documented this downgrade process and the link is in the description, and a lot of people ended up having to do that in order to run ROCm on their Strix Halo systems. But finally, in January 2026, a new firmware release, linux-firmware 20260110, started landing in mainstream distributions. That version restored ROCm functionality without requiring a downgrade. 
So from a firmware perspective, this specific regression is now resolved if you update your system. However, that firmware update only addressed the immediate ROCm regression; it did not fix the underlying stability problems, which were caused by a separate issue elsewhere in the stack. Typical symptoms were GPU kernel crashes and resets, causing AI workflows to fail randomly, and ComfyUI is a good example here. It's the de facto standard software used for image and video generation and it really stresses the GPU. On Strix Halo it would often work briefly and then fall over, which exposed the ROCm stability issues we've been talking about. This is also why I didn't focus much on ComfyUI in my earlier video on image and video generation. At that time it simply wasn't stable enough on Strix Halo to recommend. AMD finally identified the underlying issue causing all this trouble. The fix turned out to require changes in two places at the same time: the amdgpu driver in the Linux kernel and ROCm itself. The core problem was a mismatch in how hardware resource limits were defined and communicated for GFX 1151, specifically around something called VGPRs, vector general-purpose registers. For Strix Halo, the actual VGPR capacity is significantly higher than what ROCm had been assuming. All the ROCm versions were effectively using the wrong register limits for this GPU. This led to GPU kernels being scheduled with invalid assumptions about available registers. The result wasn't a clean failure, it was undefined behavior, often resulting in heap kernel hangs and eventually GPU resets. AMD addressed this by changing both sides of the stack. The important point is that both sides must agree. If the kernel thinks more registers are available but ROCm still assumes the old limits, or vice versa, the runtime ends up scheduling work that doesn't line up with what is expected by the hardware, which leads to failures. 
These fixes landed in mainline Linux starting with kernel 6.18.4, with matching changes in ROCm, and this is the key point. The kernel fix and the ROCm fix must be used together. This is where most of the current confusion comes from. If you run kernel 6.18.4 or newer with an older ROCm version, for example 6.4.4 or 7.1.1, things will break immediately. This isn't a regression in those ROCm versions, it's a compatibility mismatch. The kernel now expects ROCm to behave differently. Older ROCm builds don't know about these changes, so the stack crashes. The first ROCm release that properly matches these kernel changes will be ROCm 7.2, but at the time of recording 7.2 hasn't been officially released yet. That's why, if you are on a newer kernel today, you need to use ROCm nightly builds from TheRock, which already include this fix. All of my current toolboxes provide this option. The table on screen now summarizes what combinations actually work on GFX 1151 and which ones are known to break. This is the part most people trip over. Right now there are two validated configurations. The first is the new kernel path, which means kernel 6.18.4 or newer, ROCm builds that already include the fixes, which is the nightly builds from TheRock, and toolboxes built against these ROCm nightly builds. The second configuration is the old ROCm compatibility path, which means ROCm versions 6.4 and 7.1 with kernel 6.18.3 or older. Mixing these paths does not work. That's the key takeaway. If you update your kernel but keep using older ROCm toolboxes, you will hit crashes. If you want to stay on older ROCm versions for benchmarking or comparison, you must also stay on the older kernel. This is Donato from the future. As I was editing this video, I realized I want to make an additional point. It is perfectly possible that in the next few months AMD decides to cherry-pick and include this stability patch back into older branches of ROCm. 
So we might have, for example, a 6.4.5 release which includes this fix, and the same might happen for 7.1: we might have a 7.1.2. I don't know this, but it is possible. Likewise, distributions like Fedora build ROCm from scratch, and this particular patch is incredibly easy to cherry-pick and backport, and I think that Fedora is currently doing that. It's currently looking at backporting this particular patch. So long story short, it might become possible to use older versions of ROCm with newer kernels pretty soon. Up to now, I've kept multiple toolboxes around using different ROCm versions. That was intentional. ROCm performance on GFX 1151 has been quite inconsistent, and in some cases older versions were genuinely faster. But with the kernel fixes in place and with AMD now clearly moving forward with Strix Halo as a supported AI platform, especially after the Ryzen AI Halo announcement at CES 2026, it no longer makes sense to anchor on old stacks. Performance improvements will eventually land in the latest ROCm versions, even if there are still some regressions today due to a mix of ROCm and, for example, llama.cpp changes. These are all being worked on as we speak, so over time I'll be retiring the older ROCm toolboxes and focusing on the latest stack. As a result, I can now release a Strix Halo toolbox focused on ComfyUI with proper benchmarks and stability, similar to what I did with the Radeon 9700. That toolbox is ready and a dedicated video on ComfyUI performance and workflows on Strix Halo is coming next. 
diff --git a/custom/install-archzfs b/custom/install-archzfs index 7081a5b..567a213 100755 --- a/custom/install-archzfs +++ b/custom/install-archzfs @@ -1023,6 +1023,30 @@ configure_initramfs() { cp /mnt/etc/mkinitcpio.conf /mnt/etc/mkinitcpio.conf.bak + # CRITICAL: Remove archiso drop-in that overrides mkinitcpio.conf HOOKS + # The archiso.conf contains live ISO-specific hooks that are incompatible with ZFS + # If not removed, it overrides our HOOKS setting and breaks boot after kernel updates + if [[ -f /mnt/etc/mkinitcpio.conf.d/archiso.conf ]]; then + info "Removing archiso drop-in config..." + rm -f /mnt/etc/mkinitcpio.conf.d/archiso.conf + fi + + # CRITICAL: Fix linux-lts preset file + # The preset from archiso uses archiso-specific config that breaks mkinitcpio -P + info "Creating proper linux-lts preset..." + cat > /mnt/etc/mkinitcpio.d/linux-lts.preset << 'PRESET_EOF' +# mkinitcpio preset file for linux-lts + +PRESETS=(default fallback) + +ALL_kver="/boot/vmlinuz-linux-lts" + +default_image="/boot/initramfs-linux-lts.img" + +fallback_image="/boot/initramfs-linux-lts-fallback.img" +fallback_options="-S autodetect" +PRESET_EOF + # Check for AMD ISP (Image Signal Processor) firmware needs # ISP is used for camera processing on AMD APUs (Strix, Strix Halo, etc.) 
# The firmware must be in initramfs since amdgpu loads before root is mounted @@ -1043,9 +1067,11 @@ EOF fi # Configure hooks for ZFS + # - Use udev (not systemd): ZFS hook is busybox-based and incompatible with systemd init # - Remove autodetect: it filters modules based on live ISO hardware, not target # This ensures NVMe, AHCI, and other storage drivers are always included # - Remove fsck: ZFS doesn't use it, avoids confusing error messages + # - Add zfs: required for ZFS root boot sed -i 's/^HOOKS=.*/HOOKS=(base udev microcode modconf kms keyboard keymap consolefont block zfs filesystems)/' /mnt/etc/mkinitcpio.conf # Get the installed kernel version (not the running kernel) diff --git a/docs/2026-01-22-mkinitcpio-config-boot-failure.org b/docs/2026-01-22-mkinitcpio-config-boot-failure.org new file mode 100644 index 0000000..ba5bc72 --- /dev/null +++ b/docs/2026-01-22-mkinitcpio-config-boot-failure.org @@ -0,0 +1,159 @@ +#+TITLE: install-archzfs leaves broken mkinitcpio configuration +#+DATE: 2026-01-22 + +* Problem Summary + +After installing Arch Linux with ZFS via install-archzfs, the system has incorrect mkinitcpio configuration that can cause boot failures. The configuration issues are latent - the system may boot initially but will fail after any mkinitcpio regeneration (kernel updates, manual rebuilds, etc.). + +* Root Cause + +The install-archzfs script does not properly configure mkinitcpio for a ZFS boot environment. 
Three issues were identified: + +** Issue 1: Wrong HOOKS in mkinitcpio.conf + +The installed system had: +#+begin_example +HOOKS=(base systemd autodetect microcode modconf kms keyboard keymap sd-vconsole block filesystems fsck) +#+end_example + +This is wrong for ZFS because: +- Uses =systemd= init hook, but ZFS hook is busybox-based and incompatible with systemd init +- Missing =zfs= hook entirely +- Has =fsck= hook which is unnecessary/wrong for ZFS + +Correct HOOKS for ZFS: +#+begin_example +HOOKS=(base udev autodetect microcode modconf kms keyboard keymap consolefont block zfs filesystems) +#+end_example + +** Issue 2: Leftover archiso.conf drop-in + +The file =/etc/mkinitcpio.conf.d/archiso.conf= was left over from the live ISO: +#+begin_example +HOOKS=(base udev microcode modconf kms memdisk archiso archiso_loop_mnt archiso_pxe_common archiso_pxe_nbd archiso_pxe_http archiso_pxe_nfs block filesystems keyboard) +COMPRESSION="xz" +COMPRESSION_OPTIONS=(-9e) +#+end_example + +This drop-in OVERRIDES the HOOKS setting in mkinitcpio.conf, so even if mkinitcpio.conf were correct, this file would break it. + +** Issue 3: Wrong preset file + +The file =/etc/mkinitcpio.d/linux-lts.preset= contained archiso-specific configuration: +#+begin_example +# mkinitcpio preset file for the 'linux-lts' package on archiso + +PRESETS=('archiso') + +ALL_kver='/boot/vmlinuz-linux-lts' +archiso_config='/etc/mkinitcpio.conf.d/archiso.conf' + +archiso_image="/boot/initramfs-linux-lts.img" +#+end_example + +Should be: +#+begin_example +# mkinitcpio preset file for linux-lts + +PRESETS=(default fallback) + +ALL_kver="/boot/vmlinuz-linux-lts" + +default_image="/boot/initramfs-linux-lts.img" + +fallback_image="/boot/initramfs-linux-lts-fallback.img" +fallback_options="-S autodetect" +#+end_example + +* How This Manifests + +1. Fresh install appears to work (initramfs built during install has ZFS support somehow) +2. System boots fine initially +3. 
Kernel update or manual =mkinitcpio -P= rebuilds initramfs +4. New initramfs lacks ZFS support due to wrong config +5. Next reboot fails with "cannot import pool" or "failed to mount /sysroot" + +* Fix Required in install-archzfs + +The script needs to, after arch-chroot setup: + +1. *Set correct mkinitcpio.conf HOOKS*: + #+begin_src bash + sed -i 's/^HOOKS=.*/HOOKS=(base udev autodetect microcode modconf kms keyboard keymap consolefont block zfs filesystems)/' /mnt/etc/mkinitcpio.conf + #+end_src + +2. *Remove archiso drop-in*: + #+begin_src bash + rm -f /mnt/etc/mkinitcpio.conf.d/archiso.conf + #+end_src + +3. *Create proper preset file*: + #+begin_src bash + cat > /mnt/etc/mkinitcpio.d/linux-lts.preset << 'EOF' + # mkinitcpio preset file for linux-lts + + PRESETS=(default fallback) + + ALL_kver="/boot/vmlinuz-linux-lts" + + default_image="/boot/initramfs-linux-lts.img" + + fallback_image="/boot/initramfs-linux-lts-fallback.img" + fallback_options="-S autodetect" + EOF + #+end_src + +4. *Rebuild initramfs after fixing config*: + #+begin_src bash + arch-chroot /mnt mkinitcpio -P + #+end_src + +* Recovery Procedure (for affected systems) + +Boot from archzfs live ISO, then: + +#+begin_src bash +# Import and mount ZFS under /mnt (altroot) +zpool import -f -R /mnt zroot +zfs mount zroot/ROOT/default +mount /dev/nvme0n1p1 /mnt/boot # adjust device as needed + +# Fix mkinitcpio.conf +sed -i 's/^HOOKS=.*/HOOKS=(base udev autodetect microcode modconf kms keyboard keymap consolefont block zfs filesystems)/' /mnt/etc/mkinitcpio.conf + +# Remove archiso drop-in +rm -f /mnt/etc/mkinitcpio.conf.d/archiso.conf + +# Fix preset (adjust for your kernel: linux, linux-lts, linux-zen, etc.) 
+cat > /mnt/etc/mkinitcpio.d/linux-lts.preset << 'EOF' +PRESETS=(default fallback) +ALL_kver="/boot/vmlinuz-linux-lts" +default_image="/boot/initramfs-linux-lts.img" +fallback_image="/boot/initramfs-linux-lts-fallback.img" +fallback_options="-S autodetect" +EOF + +# Mount system directories for chroot +mount --rbind /dev /mnt/dev +mount --rbind /sys /mnt/sys +mount --rbind /proc /mnt/proc +mount --rbind /run /mnt/run + +# Rebuild initramfs +chroot /mnt mkinitcpio -P + +# Reboot +reboot +#+end_src + +* Machine Details (ratio) + +- Two NVMe drives in ZFS mirror (nvme0n1, nvme1n1) +- Pool: zroot +- Root dataset: zroot/ROOT/default +- Kernel: linux-lts 6.12.66-1 +- Boot partition: /dev/nvme0n1p1 (FAT32, mounted at /boot) + +* Related Information + +The immediate trigger for discovering this was a system freeze during mkinitcpio regeneration. That freeze was caused by the AMD GPU VPE power gating bug (separate issue - see archsetup NOTES.org for details). However, the system's inability to boot afterward exposed these latent mkinitcpio configuration problems. diff --git a/docs/2026-01-22-ratio-boot-fix-session.org b/docs/2026-01-22-ratio-boot-fix-session.org new file mode 100644 index 0000000..463774f --- /dev/null +++ b/docs/2026-01-22-ratio-boot-fix-session.org @@ -0,0 +1,175 @@ +#+TITLE: Ratio Boot Fix Session - 2026-01-22 +#+DATE: 2026-01-22 +#+AUTHOR: Craig Jennings + Claude + +* Summary + +Successfully diagnosed and fixed boot failures on ratio (Framework Desktop with AMD Ryzen AI Max 300 / Strix Halo GPU). The primary issue was outdated/missing linux-firmware causing the amdgpu driver to hang during boot. + +* Hardware + +- Machine: Framework Desktop +- CPU/GPU: AMD Ryzen AI Max 300 (Strix Halo APU, codenamed GFX 1151) +- Storage: 2x NVMe in ZFS mirror (zroot) +- Installed via: install-archzfs script from this project + +* Initial Symptoms + +1. System froze at "triggering udev events..." during boot +2. 
Only visible message before freeze: "RDSEED32 is broken" (red herring - just a warning) +3. Freeze occurred with both linux-lts (6.12.66) and linux (6.15.2) kernels +4. Blacklisting amdgpu allowed boot to proceed but caused kernel panic (no display = init killed) + +* Root Cause + +The linux-firmware package was either missing or outdated. Specifically: +- linux-firmware 20251125 is known to break AMD Strix Halo (GFX 1151) +- linux-firmware 20260110 contains fixes for Strix Halo stability + +Source: Donato Capitella video "ROCm+Linux Support on Strix Halo: It's finally stable in 2026!" +- Firmware 20251125 completely broke ROCm/GPU on Strix Halo +- Firmware 20260110+ restores functionality + +* Troubleshooting Timeline + +** Phase 1: Initial Diagnosis + +- SSH'd to ratio via archzfs live ISO +- Found mkinitcpio configuration issues (separate bug) +- Fixed mkinitcpio.conf HOOKS and removed archiso.conf drop-in +- System still froze after fixes + +** Phase 2: Kernel Investigation + +- Researched AMD Strix Halo issues on Framework community forums +- Found reports of VPE (Video Processing Engine) idle timeout bug +- Attempted kernel parameter workarounds: + - amdgpu.pg_mask=0 (disable power gating) - didn't help + - amdgpu.cwsr_enable=0 (disable compute wavefront save/restore) - not tested +- Installed kernel 6.15.2 from Arch Linux Archive (has VPE fix) +- Installed matching zfs-linux package for 6.15.2 +- System still froze + +** Phase 3: ZFS Rollback Complications + +- Rolled back to pre-kernel-switch ZFS snapshot +- Discovered /boot is NOT on ZFS (EFI partition) +- Rollback caused mismatch: root filesystem rolled back, but /boot kept newer kernels +- Kernel 6.15.2 panicked because its modules didn't exist on rolled-back root +- Documented this as a fundamental ZFS-on-root issue (see todo.org) + +** Phase 4: Firmware Discovery + +- Found video transcript explaining Strix Halo firmware requirements +- Discovered linux-firmware package was not installed (orphaned 
files from rollback) +- Repo had linux-firmware 20260110-1 (the fixed version) +- Installed linux-firmware 20260110-1 + +** Phase 5: Boot Success with Secondary Issues + +After firmware install, encountered additional issues: + +1. *Hostid mismatch*: Pool showed "previously in use from another system" + - Fix: Clean export from live ISO (zpool export zroot) + +2. *ZFS mountpoint=legacy*: Root dataset had legacy mountpoint from chroot work + - Fix: zfs set mountpoint=/ zroot/ROOT/default + +3. *ZFS mountpoints with /mnt prefix*: All child datasets had /mnt prefix + - Cause: zpool import -R /mnt persisted mountpoint changes + - Fix: Reset all mountpoints (zfs set mountpoint=/home zroot/home, etc.) + +* Final Working Configuration + +#+BEGIN_SRC +Kernel: linux-lts 6.12.66-1-lts +Firmware: linux-firmware 20260110-1 +ZFS: zfs-linux-lts (DKMS built for 6.12.66) +Boot: GRUB with spl.spl_hostid=0x564478f3 +#+END_SRC + +* Key Learnings + +** 1. Firmware Matters for AMD APUs + +The linux-firmware package is critical for AMD integrated GPUs. Strix Halo specifically requires firmware 20260110 or newer. The kernel version (6.12 vs 6.15) was less important than having correct firmware. + +** 2. ZFS Rollback + Separate /boot = Danger + +When /boot is on a separate EFI partition (not ZFS): +- ZFS rollback doesn't affect /boot +- Kernel images remain at newer version +- Modules on root get rolled back +- Result: Boot failure or kernel panic + +Solutions: +- Use ZFSBootMenu (stores kernel/initramfs on ZFS) +- Put /boot on ZFS (GRUB can read it) +- Always rebuild initramfs after rollback +- Sync /boot backups with ZFS snapshots + +** 3. zpool import -R Persists Mountpoints + +Using =zpool import -R /mnt= for chroot work can permanently change dataset mountpoints. The -R flag sets altroot, but if you then modify datasets, those changes persist with the /mnt prefix. 
+
+Fix after chroot work:
+#+BEGIN_SRC bash
+zfs set mountpoint=/ zroot/ROOT/default
+zfs set mountpoint=/home zroot/home
+# ... etc for all datasets
+#+END_SRC
+
+** 4. Hostid Consistency Required
+
+ZFS pools track which system last accessed them. If hostid changes (e.g., between live ISO and installed system), import fails with "pool was previously in use from another system."
+
+Solutions:
+- Clean export before switching systems (zpool export)
+- Force import (zpool import -f)
+- Ensure consistent hostid via /etc/hostid and spl.spl_hostid kernel parameter
+
+* Files Modified on Ratio
+
+- /etc/mkinitcpio.conf - Fixed HOOKS
+- /etc/mkinitcpio.conf.d/archiso.conf - Removed (was overriding HOOKS)
+- /etc/default/grub - GRUB_TIMEOUT=5 (was 0)
+- /boot/grub/grub.cfg - Regenerated, added TEST label to mainline kernel
+- /etc/hostid - Regenerated to match GRUB hostid parameter
+- ZFS dataset mountpoints - Reset from /mnt/* to /*
+
+* Packages Installed
+
+- linux-firmware 20260110-1 (critical fix)
+- linux 6.15.2 + zfs-linux (available as TEST kernel, not needed for boot)
+- Various system packages updated during troubleshooting
+
+* Resources Referenced
+
+** Framework Community Posts
+- https://community.frame.work/t/attn-critical-bugs-in-amdgpu-driver-included-with-kernel-6-18-x-6-19-x/79221
+- https://community.frame.work/t/fyi-linux-firmware-amdgpu-20251125-breaks-rocm-on-ai-max-395-8060s/78554
+- https://github.com/FrameworkComputer/SoftwareFirmwareIssueTracker/issues/162
+
+** Donato Capitella Video
+- Title: "ROCm+Linux Support on Strix Halo: It's finally stable in 2026!"
+- Key info: Firmware 20260110+ required, kernel 6.18.4+ for ROCm stability
+- Transcript saved: assets/Donato Capitella-ROCm+Linux Support on Strix Halo...txt
+
+** Other
+- Arch Linux Archive (for kernel 6.15.2 package)
+- Jeff Geerling blog (VRAM allocation on AMD APUs)
+
+* TODO Items Created
+
+Added to todo.org:
+- [#A] Fix ZFS rollback breaking boot (/boot not on ZFS)
+- Links to existing [#A] Integrate ZFSBootMenu task
+
+* Test Kernel Available
+
+The TEST kernel (linux 6.15.2) is installed and available in GRUB Advanced menu. It has matching zfs-linux modules and should work if needed. The mainline kernel may be useful for:
+- ROCm/AI workloads (combined with ROCm 7.2+ when released)
+- Future GPU stability improvements
+- Testing newer kernel features
+
+Current recommendation: Use linux-lts for stability, TEST kernel for experimentation.
diff --git a/todo.org b/todo.org
index cae3dfa..9415a18 100644
--- a/todo.org
+++ b/todo.org
@@ -1,5 +1,39 @@
 * Open Work
 
+** TODO [#A] Fix mkinitcpio configuration in install-archzfs (causes boot failure)
+After kernel updates or mkinitcpio regeneration, systems fail to boot because install-archzfs
+leaves incorrect mkinitcpio configuration from the live ISO environment.
+
+See [[file:docs/2026-01-22-mkinitcpio-config-boot-failure.org][bug report]] for full details.
+
+*** Three issues to fix, then rebuild
+
+1. *Wrong HOOKS in mkinitcpio.conf* - uses systemd init (incompatible with ZFS hook), missing zfs hook
+   #+BEGIN_SRC bash
+   sed -i 's/^HOOKS=.*/HOOKS=(base udev autodetect microcode modconf kms keyboard keymap consolefont block zfs filesystems)/' /mnt/etc/mkinitcpio.conf
+   #+END_SRC
+
+2. *Leftover archiso.conf drop-in* - overrides HOOKS setting
+   #+BEGIN_SRC bash
+   rm -f /mnt/etc/mkinitcpio.conf.d/archiso.conf
+   #+END_SRC
+
+3. *Wrong preset file* - has archiso configuration instead of standard
+   #+BEGIN_SRC bash
+   cat > /mnt/etc/mkinitcpio.d/linux-lts.preset << 'EOF'
+   PRESETS=(default fallback)
+   ALL_kver="/boot/vmlinuz-linux-lts"
+   default_image="/boot/initramfs-linux-lts.img"
+   fallback_image="/boot/initramfs-linux-lts-fallback.img"
+   fallback_options="-S autodetect"
+   EOF
+   #+END_SRC
+
+4. *Rebuild initramfs after fixing*
+   #+BEGIN_SRC bash
+   arch-chroot /mnt mkinitcpio -P
+   #+END_SRC
+
 ** TODO [#A] Build AUR packages and include in ISO as local repository
 Build AUR packages during ISO creation and include them in a local pacman repository.
 This allows AUR software to work both in the live environment AND be installable to target systems.
@@ -144,6 +178,59 @@ fi
 - arch-wiki-lite: ~200MB (text only, smaller)
 - Could include both for ~600MB total
 
+** TODO [#A] Fix ZFS rollback breaking boot (/boot not on ZFS)
+ZFS rollbacks can leave the system unbootable because /boot is on a separate EFI partition
+that doesn't get rolled back with the ZFS root filesystem.
+
+*** The Problem
+When rolling back ZFS:
+- /usr/lib/modules/ (kernel modules) gets rolled back
+- /var/lib/pacman/ (package database) gets rolled back
+- Everything else on ZFS root gets rolled back
+
+But /boot (EFI partition) does NOT roll back:
+- Kernel images (vmlinuz-*) remain at newer version
+- Initramfs images remain (may reference missing modules)
+- GRUB config still lists kernels that may not have matching modules
+
+Result: After rollback, GRUB shows kernels that can't boot because their modules
+no longer exist on root. User gets kernel panic or missing module errors.
+
+*** Why This Matters
+- Kernel updates happen frequently and often go unnoticed
+- User does ZFS rollback for unrelated reason
+- System fails to boot with confusing errors
+- Defeats the purpose of ZFS snapshots for easy recovery
+
+*** Solutions
+
+**** Option 1: ZFSBootMenu (Recommended)
+Replace GRUB with ZFSBootMenu which is designed for ZFS boot environments.
+- Boots directly from ZFS snapshots
+- Kernel and initramfs stored on ZFS (rolled back together)
+- Can select boot environment from boot menu
+- See existing task below for implementation details
+
+**** Option 2: Put /boot on ZFS
+- GRUB can read ZFS (with limitations)
+- Requires careful GRUB configuration
+- May have issues with ZFS features GRUB doesn't support
+
+**** Option 3: Sync /boot snapshots with ZFS
+- Script to backup /boot before ZFS snapshot
+- Restore /boot when rolling back ZFS
+- More complex, error-prone
+
+**** Option 4: Always rebuild initramfs after rollback
+- Document this as required step
+- Add helper script to automate
+- Doesn't help if kernel package itself was rolled back
+
+*** References
+- https://zfsbootmenu.org/
+- https://wiki.archlinux.org/title/Install_Arch_Linux_on_ZFS
+- https://openzfs.github.io/openzfs-docs/Getting%20Started/Arch%20Linux/index.html
+
 ** TODO [#A] Integrate ZFSBootMenu as alternative boot manager
 Idea from: https://github.com/stevleibelt/arch-linux-live-cd-iso-with-zfs
 
-- 
cgit v1.2.3
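
Note on the "Fix ZFS rollback breaking boot" item in todo.org: the /boot versus root mismatch it describes could be detected before rebooting. The following is a minimal sketch, not part of this patch. It assumes the stock Arch kernel packaging, where each /usr/lib/modules/<kver>/ directory carries a pkgbase file (for example "linux-lts") and a vmlinuz copy of the image installed as /boot/vmlinuz-<pkgbase>. The function name check_boot_matches_root is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: report kernel images in /boot that have no byte-identical
# counterpart under the modules tree on the (possibly rolled back) root.
# Assumption: Arch layout with <modules>/<kver>/pkgbase and <modules>/<kver>/vmlinuz.
# check_boot_matches_root is a hypothetical helper name.
check_boot_matches_root() {
    local boot_dir="${1:-/boot}" modules_dir="${2:-/usr/lib/modules}"
    local image name dir found status=0

    for image in "$boot_dir"/vmlinuz-*; do
        [ -e "$image" ] || continue              # glob matched nothing
        name="${image##*/vmlinuz-}"              # e.g. "linux-lts"
        found=0
        for dir in "$modules_dir"/*/; do
            [ -f "${dir}pkgbase" ] || continue
            [ "$(cat "${dir}pkgbase")" = "$name" ] || continue
            # Same package name: the /boot image should equal the packaged copy.
            if cmp -s "$image" "${dir}vmlinuz"; then
                found=1
                break
            fi
        done
        if [ "$found" -eq 0 ]; then
            echo "MISMATCH: $image has no matching modules under $modules_dir"
            status=1
        fi
    done
    return "$status"
}
```

Calling `check_boot_matches_root` with no arguments checks /boot against /usr/lib/modules and returns non-zero if any image is unmatched, so it could gate a reboot after `zfs rollback`. Comparing against the packaged vmlinuz copy avoids parsing version strings out of the image, at the cost of depending on that packaging detail.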