Diffstat (limited to 'assets/2026-01-22-mkinitcpio-freeze-during-rebuild.org')
| -rw-r--r-- | assets/2026-01-22-mkinitcpio-freeze-during-rebuild.org | 152 |
1 files changed, 152 insertions, 0 deletions
diff --git a/assets/2026-01-22-mkinitcpio-freeze-during-rebuild.org b/assets/2026-01-22-mkinitcpio-freeze-during-rebuild.org
new file mode 100644
index 0000000..1132ddd
--- /dev/null
+++ b/assets/2026-01-22-mkinitcpio-freeze-during-rebuild.org
@@ -0,0 +1,152 @@
+#+TITLE: System freezes during mkinitcpio -P rebuild
+#+DATE: 2026-01-22
+
+* Problem Summary
+
+After fixing the mkinitcpio configuration issues (see 2026-01-22-mkinitcpio-config-boot-failure.org), the system successfully booted. However, running =mkinitcpio -P= again caused the system to freeze, requiring a power cycle.
+
+This indicates the mkinitcpio config fix was correct, but there's a separate issue causing freezes during initramfs rebuilds.
+
+* Timeline
+
+1. System wouldn't boot due to broken mkinitcpio config (wrong HOOKS, missing zfs)
+2. Booted from archzfs live ISO
+3. Fixed mkinitcpio.conf, preset file, removed archiso.conf drop-in
+4. Rebuilt initramfs via chroot - completed successfully
+5. Rebooted - system booted successfully
+6. Ran =mkinitcpio -P= again - system froze
+7. Had to power cycle, now back on live ISO
+
+* What This Tells Us
+
+The mkinitcpio configuration fix was correct (system booted). But something about running mkinitcpio itself is triggering a system freeze.
+
+* Suspected Cause: AMD GPU Power Gating Bug
+
+ratio has an AMD Strix Halo GPU (RDNA 3.5) with a known VPE power gating bug. When the VPE (Video Processing Engine) tries to power gate after 1 second of idle, the SMU hangs and the system freezes.
+
+Symptoms before freeze:
+#+begin_example
+amdgpu: SMU: I'm not done with your previous command
+amdgpu: Failed to power gate VPE!
+[drm:vpe_set_powergating_state] *ERROR* Dpm disable vpe failed, ret = -62
+#+end_example
+
+The fix is to disable power gating via =/etc/modprobe.d/amdgpu.conf=:
+#+begin_example
+options amdgpu pg_mask=0
+#+end_example
+
+*CRITICAL*: After creating this file, you must run =mkinitcpio -P= to include it in the initramfs (the modconf hook reads /etc/modprobe.d/ at build time).
+
+* The Chicken-and-Egg Problem
+
+1. Need to run =mkinitcpio -P= to apply the GPU fix (include amdgpu.conf in initramfs)
+2. But running =mkinitcpio -P= triggers the GPU freeze
+3. The fix can't be applied because applying it causes the problem it's meant to fix
+
+* Possible Solutions to Investigate
+
+** Option 1: Apply GPU fix at runtime before mkinitcpio
+
+Before running mkinitcpio, manually set pg_mask at runtime:
+#+begin_src bash
+echo 0 | sudo tee /sys/module/amdgpu/parameters/pg_mask
+#+end_src
+
+Then run mkinitcpio while power gating is disabled. This might prevent the freeze. Caveat: amdgpu usually exposes =pg_mask= read-only in sysfs (mode 0444), so this write may be rejected; in that case the value can only be set at module load time.
+
+** Option 2: Build initramfs from live ISO
+
+Boot from archzfs live ISO (which doesn't have the GPU issue), mount the system, and rebuild initramfs from there. The live ISO uses a different GPU driver state.
+
+We tried this and it worked - the rebuild completed. But then running mkinitcpio on the booted system froze.
+
+** Option 3: Add amdgpu.conf before rebuilding from live ISO
+
+When rebuilding from live ISO:
+1. Create /etc/modprobe.d/amdgpu.conf with pg_mask=0
+2. Rebuild initramfs
+3. Boot - now the GPU fix should be in effect
+4. Future mkinitcpio runs might not freeze
+
+This might work because the initramfs would load with power gating disabled from the start.
+
+** Option 4: Wait for kernel 6.18+
+
+The upstream fix (VPE_IDLE_TIMEOUT increased from 1s to 2s) is in kernel 6.15+. When linux-lts reaches 6.18, the workaround won't be needed.
+
+Current: linux-lts 6.12.66
+Target: linux-lts 6.18
+
+* Current State of ratio
+
+- Booted to archzfs live ISO
+- ZFS pool: zroot (mirror of nvme0n1p2 + nvme1n1p2)
+- mkinitcpio.conf: FIXED (has correct HOOKS with zfs)
+- /etc/mkinitcpio.conf.d/archiso.conf: REMOVED
+- /etc/mkinitcpio.d/linux-lts.preset: FIXED
+- /etc/modprobe.d/amdgpu.conf: EXISTS but may not be in initramfs
+- Current pg_mask value on booted system: Unknown (need to check after boot)
+
+* Verification Commands
+
+Check if GPU fix is active:
+#+begin_src bash
+cat /sys/module/amdgpu/parameters/pg_mask
+# Should return: 0
+# If it returns 4294967295 (0xFFFFFFFF), the fix is NOT active
+#+end_src
+
+Check if amdgpu.conf is in initramfs:
+#+begin_src bash
+lsinitcpio /boot/initramfs-linux-lts.img | grep 'modprobe.d/amdgpu.conf'
+#+end_src
+
+* Recovery Procedure (Option 3 - recommended)
+
+From archzfs live ISO:
+
+#+begin_src bash
+# Import the pool with /mnt as altroot and mount it (/mnt is an assumed mount point)
+zpool import -f -R /mnt zroot
+zfs mount zroot/ROOT/default
+mount /dev/nvme0n1p1 /mnt/boot
+
+# Ensure GPU fix file exists on the installed system (not the live ISO)
+cat > /mnt/etc/modprobe.d/amdgpu.conf << 'EOF'
+# Disable power gating to prevent VPE freeze on Strix Halo GPUs
+# Remove this file when linux-lts reaches 6.18+
+options amdgpu pg_mask=0
+EOF
+
+# Mount system directories for chroot
+mount --rbind /dev /mnt/dev
+mount --rbind /sys /mnt/sys
+mount --rbind /proc /mnt/proc
+mount --rbind /run /mnt/run
+
+# Rebuild initramfs (should include amdgpu.conf via modconf hook)
+chroot /mnt mkinitcpio -P
+
+# Verify amdgpu.conf is in initramfs (match the config file, not amdgpu modules/firmware)
+lsinitcpio /mnt/boot/initramfs-linux-lts.img | grep 'modprobe.d/amdgpu.conf'
+
+# Reboot and test
+reboot
+#+end_src
+
+After reboot, verify pg_mask=0 is active, then test =mkinitcpio -P= again.
+
+* Related Files
+
+- [[file:2026-01-22-mkinitcpio-config-boot-failure.org]] - The config fix that was applied
+- archsetup NOTES.org - AMD GPU freeze diagnosis details
+
+* Machine Details
+
+- Machine: ratio (desktop)
+- CPU: AMD (Strix Halo)
+- GPU: AMD RDNA 3.5 (integrated)
+- Storage: Two NVMe in ZFS mirror
+- Kernel: linux-lts 6.12.66-1
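
The check under "Verification Commands" could be turned into a small guard, so that =mkinitcpio -P= only runs on the booted system once power gating is confirmed off. The following is a minimal sketch, not part of the committed notes: the function names =pg_gating_disabled= and =run_if_gating_disabled= and the =PG_MASK_FILE= variable are hypothetical, and it assumes the sysfs file contains a plain decimal value as described above.

```bash
#!/usr/bin/env bash
# Sketch: refuse to run a command unless amdgpu power gating is disabled.
# PG_MASK_FILE and both function names are illustrative assumptions.

PG_MASK_FILE="${PG_MASK_FILE:-/sys/module/amdgpu/parameters/pg_mask}"

# Returns 0 if the file contains exactly 0, 1 if it contains anything else,
# and 2 if the file is unreadable (e.g. amdgpu is not loaded).
pg_gating_disabled() {
    local file="${1:-$PG_MASK_FILE}"
    local value
    [ -r "$file" ] || return 2
    value="$(tr -d '[:space:]' < "$file")"
    [ "$value" = "0" ]
}

# Runs the given command only when power gating is confirmed disabled.
run_if_gating_disabled() {
    if pg_gating_disabled; then
        "$@"
    else
        echo "pg_mask is not 0 (or unreadable); refusing to run: $*" >&2
        return 1
    fi
}
```

After sourcing the file, a guarded rebuild would be invoked as =run_if_gating_disabled sudo mkinitcpio -P=. The guard only reports the current parameter value; whether a zero =pg_mask= actually prevents the freeze is still the open question in the notes.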
