Diffstat (limited to 'assets/2026-01-22-mkinitcpio-freeze-during-rebuild.org')
-rw-r--r-- assets/2026-01-22-mkinitcpio-freeze-during-rebuild.org | 152
1 file changed, 152 insertions, 0 deletions
diff --git a/assets/2026-01-22-mkinitcpio-freeze-during-rebuild.org b/assets/2026-01-22-mkinitcpio-freeze-during-rebuild.org
new file mode 100644
index 0000000..1132ddd
--- /dev/null
+++ b/assets/2026-01-22-mkinitcpio-freeze-during-rebuild.org
@@ -0,0 +1,152 @@
+#+TITLE: System freezes during mkinitcpio -P rebuild
+#+DATE: 2026-01-22
+
+* Problem Summary
+
+After fixing the mkinitcpio configuration issues (see 2026-01-22-mkinitcpio-config-boot-failure.org), the system successfully booted. However, running =mkinitcpio -P= again caused the system to freeze, requiring a power cycle.
+
+This indicates the mkinitcpio config fix was correct, but there's a separate issue causing freezes during initramfs rebuilds.
+
+* Timeline
+
+1. System wouldn't boot due to broken mkinitcpio config (wrong HOOKS, missing zfs)
+2. Booted from archzfs live ISO
+3. Fixed mkinitcpio.conf, preset file, removed archiso.conf drop-in
+4. Rebuilt initramfs via chroot - completed successfully
+5. Rebooted - system booted successfully
+6. Ran =mkinitcpio -P= again - system froze
+7. Had to power cycle, now back on live ISO
+
+* What This Tells Us
+
+The mkinitcpio configuration fix was correct (system booted). But something about running mkinitcpio itself is triggering a system freeze.
+
+* Suspected Cause: AMD GPU Power Gating Bug
+
+The machine, ratio, has an AMD Strix Halo GPU (RDNA 3.5) with a known VPE power gating bug. When the VPE (Video Processing Engine) tries to power gate after 1 second of idle, the SMU (System Management Unit) hangs and the system freezes.
+
+Kernel log messages before the freeze (=ret = -62= is =-ETIME=, a timeout):
+#+begin_example
+amdgpu: SMU: I'm not done with your previous command
+amdgpu: Failed to power gate VPE!
+[drm:vpe_set_powergating_state] *ERROR* Dpm disable vpe failed, ret = -62
+#+end_example
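+
+To confirm after a power cycle that the previous boot ended in this failure, the kernel log from that boot can be searched for the same messages. This is a minimal sketch, assuming a persistent systemd journal; the function name =has_vpe_hang= is illustrative and not part of any tool. A hard freeze may also lose the final log lines before they reach disk, so an empty result does not rule the bug out.
+#+begin_src bash
+# Reads kernel log text on stdin; succeeds if any VPE power-gating error appears.
+has_vpe_hang() {
+    grep -Eq 'Failed to power gate VPE|Dpm disable vpe failed'
+}
+
+# Usage (assumes journald keeps logs across boots; -b -1 selects the previous boot):
+#   journalctl -k -b -1 --no-pager | has_vpe_hang && echo "VPE power gating hang logged"
+#+end_src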
+
+The fix is to disable power gating via =/etc/modprobe.d/amdgpu.conf=:
+#+begin_example
+options amdgpu pg_mask=0
+#+end_example
+
+*CRITICAL*: After creating this file, you must run =mkinitcpio -P= to include it in the initramfs (the modconf hook reads =/etc/modprobe.d/= at build time).
+
+* The Chicken-and-Egg Problem
+
+1. Need to run =mkinitcpio -P= to apply the GPU fix (include amdgpu.conf in initramfs)
+2. But running =mkinitcpio -P= triggers the GPU freeze
+3. The fix can't be applied because applying it causes the problem it's meant to fix
+
+* Possible Solutions to Investigate
+
+** Option 1: Apply GPU fix at runtime before mkinitcpio
+
+Before running mkinitcpio, manually set pg_mask at runtime:
+#+begin_src bash
+echo 0 | sudo tee /sys/module/amdgpu/parameters/pg_mask
+#+end_src
+
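+Then run mkinitcpio while power gating is disabled. This might prevent the freeze. Caveat: amdgpu applies =pg_mask= when the driver initializes, and the sysfs entry is normally read-only, so this write may be rejected or have no effect on an already-loaded driver. Re-read the value afterwards to confirm it changed.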
+
+** Option 2: Build initramfs from live ISO
+
+Boot from the archzfs live ISO, mount the system, and rebuild the initramfs from there. The freeze did not occur on the live ISO, presumably because its GPU driver state differs.
+
+We tried this and it worked: the rebuild completed. But running =mkinitcpio -P= on the booted system afterwards still froze, so this alone is not a lasting fix.
+
+** Option 3: Add amdgpu.conf before rebuilding from live ISO
+
+When rebuilding from live ISO:
+1. Create /etc/modprobe.d/amdgpu.conf with pg_mask=0
+2. Rebuild initramfs
+3. Boot - now the GPU fix should be in effect
+4. Future mkinitcpio runs might not freeze
+
+This might work because the initramfs would load with power gating disabled from the start.
+
+** Option 4: Wait for kernel 6.18+
+
+The upstream fix (VPE_IDLE_TIMEOUT increased from 1s to 2s) is in kernel 6.15+. linux-lts is still on the 6.12 series, so it will only pick up the fix when it moves to 6.18, at which point the workaround won't be needed.
+
+Current: linux-lts 6.12.66
+Target: linux-lts 6.18
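+
+Whether the installed kernel has reached the target can be checked with a version comparison. This is a minimal sketch using GNU =sort -V=; =kernel_at_least= is an illustrative helper name, and the 6.18 threshold is the target noted above.
+#+begin_src bash
+# Succeeds if version $1 is greater than or equal to version $2.
+kernel_at_least() {
+    [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
+}
+
+# Usage: compare the running kernel against the target series.
+if kernel_at_least "$(uname -r)" 6.18; then
+    echo "kernel is 6.18 or newer; the pg_mask workaround can be reconsidered"
+else
+    echo "kernel is older than 6.18; keep the pg_mask workaround"
+fi
+#+end_src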
+
+* Current State of ratio
+
+- Booted to archzfs live ISO
+- ZFS pool: zroot (mirror of nvme0n1p2 + nvme1n1p2)
+- mkinitcpio.conf: FIXED (has correct HOOKS with zfs)
+- /etc/mkinitcpio.conf.d/archiso.conf: REMOVED
+- /etc/mkinitcpio.d/linux-lts.preset: FIXED
+- /etc/modprobe.d/amdgpu.conf: EXISTS but may not be in initramfs
+- Current pg_mask value on booted system: Unknown (need to check after boot)
+
+* Verification Commands
+
+Check if GPU fix is active:
+#+begin_src bash
+cat /sys/module/amdgpu/parameters/pg_mask
+# Should return: 0
+# If it returns 4294967295 (0xFFFFFFFF), the fix is NOT active
+#+end_src
+
+Check if amdgpu.conf is in initramfs:
+#+begin_src bash
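+# Match the config file specifically; a bare "amdgpu" also matches the kernel module and firmware
+lsinitcpio /boot/initramfs-linux-lts.img | grep -F 'modprobe.d/amdgpu.conf'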
+#+end_src
+
+* Recovery Procedure (Option 3 - recommended)
+
+From archzfs live ISO:
+
+#+begin_src bash
+# Import the pool under an alternate root and mount the installed system
+# (/mnt is an assumed mount point; adjust if the system is mounted elsewhere)
+zpool import -f -R /mnt zroot
+zfs mount zroot/ROOT/default
+mount /dev/nvme0n1p1 /mnt/boot
+
+# Ensure GPU fix file exists on the installed system, not on the live ISO
+cat > /mnt/etc/modprobe.d/amdgpu.conf << 'EOF'
+# Disable power gating to prevent VPE freeze on Strix Halo GPUs
+# Remove this file when linux-lts reaches 6.18+
+options amdgpu pg_mask=0
+EOF
+
+# Mount system directories for chroot
+mount --rbind /dev /mnt/dev
+mount --rbind /sys /mnt/sys
+mount --rbind /proc /mnt/proc
+mount --rbind /run /mnt/run
+
+# Rebuild initramfs (should include amdgpu.conf via modconf hook)
+chroot /mnt mkinitcpio -P
+
+# Verify amdgpu.conf is in initramfs
+lsinitcpio /mnt/boot/initramfs-linux-lts.img | grep -F 'modprobe.d/amdgpu.conf'
+
+# Reboot and test
+reboot
+#+end_src
+
+After reboot, verify pg_mask=0 is active, then test =mkinitcpio -P= again.
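+
+One way to make that test safer is to refuse to rebuild unless the parameter already reads 0. This is a minimal sketch; =pg_mask_disabled= and =safe_rebuild= are illustrative names, and the optional path argument exists only so the check can be exercised against a plain file.
+#+begin_src bash
+# Succeeds only if the given parameter file (default: the amdgpu sysfs entry) contains 0.
+pg_mask_disabled() {
+    local param="${1:-/sys/module/amdgpu/parameters/pg_mask}"
+    [ -r "$param" ] && [ "$(cat "$param")" = "0" ]
+}
+
+# Runs the rebuild only when power gating is confirmed off.
+safe_rebuild() {
+    if pg_mask_disabled; then
+        sudo mkinitcpio -P
+    else
+        echo "pg_mask is not 0; not running mkinitcpio -P" >&2
+        return 1
+    fi
+}
+#+end_src
+
+Calling =safe_rebuild= instead of =mkinitcpio -P= directly means a boot where the fix did not take effect fails with a message rather than risking another freeze.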
+
+* Related Files
+
+- [[file:2026-01-22-mkinitcpio-config-boot-failure.org]] - The config fix that was applied
+- archsetup NOTES.org - AMD GPU freeze diagnosis details
+
+* Machine Details
+
+- Machine: ratio (desktop)
+- CPU: AMD (Strix Halo)
+- GPU: AMD RDNA 3.5 (integrated)
+- Storage: Two NVMe in ZFS mirror
+- Kernel: linux-lts 6.12.66-1