#+TITLE: Testing Strategy
#+AUTHOR: Craig Jennings
#+DATE: 2026-01-25

* Overview

This document describes the testing strategy for the archzfs installer project,
including automated VM testing and the rationale for key technical decisions.

* Running Tests

** Makefile Targets

| Target | Description |
|--------+-------------|
| =make test-install= | Run all 12 automated install tests (builds ISO first) |
| =make test-vm= | Boot ISO in a single-disk VM (interactive) |
| =make test-multi= | Boot ISO in a 2-disk VM for mirror/RAID testing |
| =make test-multi3= | Boot ISO in a 3-disk VM for raidz1 testing |
| =make test-boot= | Boot from installed disk (after running install in VM) |
| =make test-clean= | Remove VM disks and OVMF vars, start fresh |
| =make lint= | Run shellcheck on all scripts |
| =make test= | Run lint (alias) |

** Running a Single Automated Test

#+begin_src bash
./scripts/test-install.sh zfs-encrypt
#+end_src

** Running Multiple Specific Tests

#+begin_src bash
./scripts/test-install.sh zfs-encrypt zfs-mirror-encrypt btrfs-luks
#+end_src

** Listing Available Test Configs

#+begin_src bash
./scripts/test-install.sh --list
#+end_src

* Test Infrastructure

** Test Scripts

- =scripts/test-install.sh= - Main test runner
- =scripts/test-configs/= - Configuration files for different test scenarios

** Test Flow

1. Build ISO with =./build.sh=
2. Boot QEMU VM from ISO
3. Run unattended installation via config file
4. Verify installation (packages, services, filesystem)
5. Reboot from installed disk (no ISO)
6. Verify system survives reboot
7. Test rollback functionality (ZFS and Btrfs)

* LUKS Encryption Testing

** The Challenge

LUKS-encrypted systems require TWO passphrase prompts at boot:

1. *GRUB prompt* - GRUB must decrypt /boot to read kernel/initramfs
2. *Initramfs prompt* - encrypt hook must decrypt root to mount filesystem

This blocks automated testing because:
- SSH is unavailable until after both decryptions complete
- Both prompts require interactive passphrase entry

** Options Evaluated

*** Option A: Put /boot on EFI partition for testing

Move /boot to the unencrypted EFI partition when TESTING=yes, so GRUB
doesn't need to decrypt anything.

*Rejected* - Tests different code path than production. Bugs in GRUB
cryptodisk setup would not be caught. "Testing something different than
what ships defeats the purpose."

*** Option B: Accept limitation, enhance installation verification

Skip reboot tests for LUKS. Instead, verify configs before cleanup:
- Check crypttab, grub.cfg, mkinitcpio.conf are correct
- If configs are right, boot should work

*Rejected* - We already found bugs (empty grub.cfg from FAT32 sync) that
only manifested at boot time. Config inspection wouldn't catch everything.

*** Option C: Hybrid approach (Chosen)

Use TWO mechanisms to handle the two prompts:

1. *GRUB prompt* - QEMU monitor sendkey (timing is predictable)
2. *Initramfs prompt* - Keyfile in initramfs (deterministic)

The GRUB countdown provides a clear timing signal:
#+begin_example
The highlighted entry will be executed automatically in 0s.
Booting 'Arch Linux'
Enter passphrase for hd0,gpt2:
#+end_example

The prompt text appears in the serial output right after the countdown, so the
test runner knows when to send the passphrase. After sendkey handles GRUB, the
keyfile unlocks root in the initramfs without a prompt.

** Why Option C

- Tests actual production code path (critical requirement)
- GRUB timing is predictable (countdown visible in serial)
- Keyfile handles the harder timing problem (initramfs)
- Only one sendkey interaction needed (GRUB prompt)

** Implementation

*** GRUB Passphrase (sendkey)

1. Switch the VM's serial output from a log file to a real-time channel (socket or pty)
2. Monitor for "Enter passphrase for" text after GRUB countdown
3. Send passphrase via QEMU monitor: =sendkey= commands
4. Send Enter key to submit
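
A minimal sketch of steps 3 and 4, assuming the passphrase contains only ASCII
letters, digits, and spaces. The function name =passphrase_to_sendkeys= is
hypothetical and is not taken from =scripts/test-install.sh=; the key names
(=shift-<letter>=, =spc=, =ret=) are the ones the QEMU monitor =sendkey= command
accepts.

#+begin_src bash
#!/usr/bin/env bash
# Hypothetical helper: translate a passphrase into QEMU monitor "sendkey"
# commands, one per character, followed by Enter ("ret").
passphrase_to_sendkeys() {
    local LC_ALL=C          # make [a-z] / [A-Z] match ASCII only
    local text=$1 i ch
    for (( i = 0; i < ${#text}; i++ )); do
        ch=${text:i:1}
        case $ch in
            [a-z]|[0-9]) printf 'sendkey %s\n' "$ch" ;;
            [A-Z])       printf 'sendkey shift-%s\n' "${ch,,}" ;;
            ' ')         printf 'sendkey spc\n' ;;
            *)           # punctuation needs per-key names; out of scope here
                         printf 'unsupported character: %s\n' "$ch" >&2
                         return 1 ;;
        esac
    done
    printf 'sendkey ret\n'  # submit the passphrase
}
#+end_src

The output can then be written to the QEMU monitor socket, for example by piping
it through =socat= to the VM's monitor socket. Whether the monitor needs a short
pause between keys is not covered by this sketch.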

*** Initramfs Passphrase (keyfile)

When =TESTING=yes= is set in config:

1. Generate random 2KB keyfile at =/etc/cryptroot.key=
2. Add keyfile to LUKS slot 1 (passphrase remains in slot 0)
3. Set keyfile permissions to 000
4. Add keyfile to mkinitcpio FILES= array
5. Configure crypttab to use keyfile instead of "none"
6. Initramfs unlocks automatically (no prompt)
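
The keyfile steps can be sketched roughly as follows. This is an illustration
under stated assumptions, not the code in =custom/lib/btrfs.sh=: the function
name =setup_test_keyfile=, the install-root argument, and the device argument
are hypothetical, and the =cryptsetup= command is only printed so the sketch
can run without a LUKS device. The crypttab change (step 5) is left out.

#+begin_src bash
#!/usr/bin/env bash
# Hypothetical sketch: create the test keyfile under an install root and
# register it in mkinitcpio.conf. Requires GNU dd and GNU sed.
setup_test_keyfile() {
    local root=$1 luks_device=$2
    local keyfile="$root/etc/cryptroot.key"
    local conf="$root/etc/mkinitcpio.conf"

    # Step 1: 2 KiB of random data.
    dd if=/dev/urandom of="$keyfile" bs=1024 count=2 status=none

    # Step 2: enroll the keyfile in slot 1 (the passphrase stays in slot 0).
    # Printed rather than executed in this sketch.
    echo "cryptsetup luksAddKey --key-slot 1 $luks_device $keyfile"

    # Step 3: remove all permissions from the keyfile.
    chmod 000 "$keyfile"

    # Step 4: append the in-system path to the FILES=() array.
    sed -i 's|^FILES=(\(.*\))|FILES=(\1 /etc/cryptroot.key)|' "$conf"
}
#+end_src

Calling =setup_test_keyfile /mnt /dev/vda2= would operate on an install root
mounted at =/mnt=; both values are placeholders.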

** Security Mitigations

- Test-only flag: Only activates when TESTING=yes
- Separate key slot: Keyfile in slot 1, passphrase in slot 0
- Random per-build: Fresh keyfile generated each installation
- Never shipped: Keyfile only in test VMs, not in ISO
- Restricted permissions: chmod 000 on keyfile

** Files Modified

- =custom/lib/btrfs.sh= - setup_luks_testing_keyfile(), configure_crypttab(), configure_luks_initramfs()
- =custom/archangel= - Calls keyfile setup in LUKS flow
- =scripts/test-install.sh= - sendkey for GRUB, real-time serial monitoring
- =scripts/test-configs/btrfs-luks.conf= - TESTING=yes
- =scripts/test-configs/btrfs-mirror-luks.conf= - TESTING=yes

* Adding a New Test

** Step 1: Create a Config File

Add a =.conf= file in =scripts/test-configs/=. First line should be a comment
describing the test:

#+begin_example
# Test config: Description of what this tests
#+end_example

Required fields:
- =HOSTNAME= - unique per test
- =TIMEZONE=, =LOCALE=, =KEYMAP= - use =UTC=, =en_US.UTF-8=, =us= for defaults
- =DISKS= - =/dev/vda= for single, =/dev/vda,/dev/vdb= for 2-disk, etc.
- =ROOT_PASSWORD= - needed for SSH into installed system
- =ENABLE_SSH= - =yes= for full verification, =no= for console-only boot check

Filesystem choice:
- ZFS (default): no =FILESYSTEM= needed, or set =FILESYSTEM=zfs=
- Btrfs: =FILESYSTEM=btrfs=

Encryption:
- No encryption: =NO_ENCRYPT=yes=
- ZFS encryption: =ZFS_PASSPHRASE=testpass= (no =NO_ENCRYPT=)
- LUKS encryption: =LUKS_PASSPHRASE=testpassphrase=, =TESTING=yes= (no =NO_ENCRYPT=)

Multi-disk:
- Mirror: =RAID_LEVEL=mirror=
- RAIDZ1: =RAID_LEVEL=raidz1= (ZFS only, needs 3+ disks)
- Btrfs stripe: =RAID_LEVEL=stripe=
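
Putting these fields together, a config for a two-disk Btrfs mirror with LUKS
might look like the sketch below. The hostname and root password are made-up
placeholders, and the file is assumed to use plain shell =KEY=value= syntax;
check an existing file in =scripts/test-configs/= for the exact format.

#+begin_src bash
# Test config: Btrfs mirror on two disks with LUKS encryption (example)
HOSTNAME=test-btrfs-mirror-luks     # placeholder; must be unique per test
TIMEZONE=UTC
LOCALE=en_US.UTF-8
KEYMAP=us
DISKS=/dev/vda,/dev/vdb             # two disks for the mirror
ROOT_PASSWORD=testroot              # placeholder; used for SSH verification
ENABLE_SSH=yes
FILESYSTEM=btrfs
RAID_LEVEL=mirror
LUKS_PASSPHRASE=testpassphrase
TESTING=yes                         # enables the initramfs keyfile
#+end_src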

** Step 2: Run the Test

#+begin_src bash
./scripts/test-install.sh my-new-test
#+end_src

The test runner automatically:
- Counts disks from =DISKS= to create QCOW2 images
- Detects encryption type from =LUKS_PASSPHRASE= / =ZFS_PASSPHRASE=
- Adds QEMU monitor socket when encryption is detected
- Dispatches to =send_luks_passphrase()= or =send_zfs_passphrase()= at reboot
- Runs =verify_install()=, =verify_reboot_survival()=, =verify_rollback()=
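
The first two items could be implemented along these lines. This is a sketch
only: =count_disks= and =detect_encrypt_flag= are hypothetical names, the
config is assumed to be a sourceable shell file, and the =none= value for
unencrypted configs is an assumption (the document only names =luks= and
=zfs=).

#+begin_src bash
#!/usr/bin/env bash
# Hypothetical helper: count the comma-separated entries in a DISKS value.
count_disks() {
    local disks=$1 parts
    IFS=',' read -r -a parts <<< "$disks"
    echo "${#parts[@]}"
}

# Hypothetical helper: derive the encryption type from a config file.
detect_encrypt_flag() {
    local config=$1
    (
        # Subshell, so variables from the config do not leak into the caller.
        unset LUKS_PASSPHRASE ZFS_PASSPHRASE
        source "$config"
        if [[ -n "${LUKS_PASSPHRASE:-}" ]]; then
            echo luks
        elif [[ -n "${ZFS_PASSPHRASE:-}" ]]; then
            echo zfs
        else
            echo none   # assumed placeholder for "no encryption"
        fi
    )
}
#+end_src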

** Step 3: If Encryption Needs a New Prompt Handler

The encryption dispatch in =run_test()= uses =encrypt_flag=:

#+begin_example
if [[ "$encrypt_flag" == "luks" ]]; then
    send_luks_passphrase ...
elif [[ "$encrypt_flag" == "zfs" ]]; then
    send_zfs_passphrase ...
fi
#+end_example

To add a new encryption type:
1. Add detection logic that reads the config and sets =encrypt_flag=
2. Write a =send_<type>_passphrase()= function
3. Add an =elif= branch in the dispatch

** Key Functions in test-install.sh

| Function | Purpose |
|----------+---------|
| =monitor_sendkeys()= | Sends a string as QEMU sendkey commands (char-by-char + Enter) |
| =send_luks_passphrase()= | Detects GRUB prompt in serial, sends passphrase per disk |
| =send_zfs_passphrase()= | Timed delay (no serial), sends passphrase twice (ZBM + initramfs) |
| =start_vm_from_disk()= | Boots installed disk; adds monitor socket if encrypt mode is set |
| =verify_install()= | Checks filesystem, snapshots, encryption properties via SSH |
| =verify_reboot_survival()= | Checks pool/filesystem health after reboot |
| =verify_rollback()= | Creates file, snapshots, deletes, rolls back, verifies restore |

** Debugging a Failing Test

The serial log is saved to =test-logs/<name>-reboot-serial.log= on failure.

For VGA-only boot stages (ZFSBootMenu), take a screenshot via QEMU monitor:

#+begin_src bash
echo "screendump /tmp/screen.ppm" | socat -t 2 - UNIX-CONNECT:vm/monitor-<name>.sock
convert /tmp/screen.ppm /tmp/screen.png
#+end_src

This requires keeping the VM alive, so add the debugging commands before =stop_vm= in the failure path.

* Test Configurations

** Btrfs Tests

| Config            | Disks | LUKS | Status                  |
|-------------------+-------+------+-------------------------|
| btrfs-single      |     1 | No   | Pass                    |
| btrfs-luks        |     1 | Yes  | Pass (with TESTING=yes) |
| btrfs-mirror      |     2 | No   | Pass                    |
| btrfs-stripe      |     2 | No   | Pass                    |
| btrfs-mirror-luks |     2 | Yes  | Pass (with TESTING=yes) |

** ZFS Tests

| Config             | Disks | Encryption | Status |
|--------------------+-------+------------+--------|
| single-disk        |     1 | No         | Pass   |
| mirror             |     2 | No         | Pass   |
| raidz1             |     3 | No         | Pass   |
| zfs-encrypt        |     1 | Yes        | Pass   |
| zfs-mirror-encrypt |     2 | Yes        | Pass   |

* ZFS Native Encryption Testing

** The Challenge

ZFS native encryption (=keylocation=prompt=) requires TWO passphrase prompts
at boot, similar to LUKS but from different components:

1. *ZFSBootMenu prompt* - ZFSBootMenu must unlock the pool to enumerate boot environments
2. *Initramfs prompt* - mkinitcpio's =zfs= hook re-imports the pool after kexec

** Key Difference from LUKS

- *LUKS*: GRUB prompts once per encrypted disk; the initramfs uses a keyfile (no prompt)
- *ZFS*: Encryption is pool-level, so each boot stage prompts once regardless of disk
  count, but TWO stages prompt (ZFSBootMenu + initramfs)

** Key Difference: Serial Console

GRUB outputs to the serial console, so its passphrase prompt is detectable in the serial
log. ZFSBootMenu renders entirely to the VGA framebuffer, so neither its passphrase
prompt nor the initramfs prompt appears in serial output.

The serial log only shows the UEFI firmware loading ZFSBootMenu:
#+begin_example
BdsDxe: starting Boot0009 "ZFSBootMenu" from HD(...)
EFI stub: Loaded initrd from LINUX_EFI_INITRD_MEDIA_GUID device path
#+end_example

After that, the serial log shows nothing until the booted system's getty starts.

** Implementation: Timed Sendkey

Since prompt detection via serial is not possible, we use timed delays:

1. Detect UEFI firmware log line (=starting.*ZFSBootMenu=) in serial
2. Wait 15s for ZFSBootMenu to initialize and display passphrase prompt
3. Send passphrase via QEMU monitor sendkey
4. Wait 30s for ZFSBootMenu to boot kernel and mkinitcpio to reach zfs hook
5. Send passphrase again via sendkey
6. Wait for SSH to become available
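
Steps 1 through 5 can be sketched as below. This is not the actual
=send_zfs_passphrase()=; the names =wait_for_pattern= and =unlock_zfs_boot= are
hypothetical, the 120-second detection timeout is an assumption, and the
command that sends the passphrase is passed in by name so the timing logic can
be exercised without a VM.

#+begin_src bash
#!/usr/bin/env bash
# Hypothetical helper: poll a log file until a regex matches or the timeout
# (in seconds) expires. Returns 0 on match, 1 on timeout.
wait_for_pattern() {
    local file=$1 pattern=$2 timeout=$3 waited=0
    until grep -Eq -- "$pattern" "$file" 2>/dev/null; do
        (( waited >= timeout )) && return 1
        sleep 1
        waited=$(( waited + 1 ))
    done
    return 0
}

# Hypothetical sketch of the timed sequence. $2 names a command that sends
# the passphrase (for example via QEMU monitor sendkey).
unlock_zfs_boot() {
    local serial_log=$1 send_cmd=$2
    local zbm_delay=${3:-15} initramfs_delay=${4:-30}

    # Step 1: firmware log line announcing ZFSBootMenu (timeout is assumed).
    wait_for_pattern "$serial_log" 'starting.*ZFSBootMenu' 120 || return 1
    sleep "$zbm_delay"          # Step 2: let ZFSBootMenu show its prompt
    "$send_cmd"                 # Step 3: first passphrase (ZFSBootMenu)
    sleep "$initramfs_delay"    # Step 4: kernel boots, zfs hook prompts
    "$send_cmd"                 # Step 5: second passphrase (initramfs)
}
#+end_src

Keeping the delays as parameters makes it easy to lengthen them on a slow host,
which is the main weakness of a purely timed approach.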

** Why Not a Keyfile (Like LUKS)?

For LUKS, we embed a keyfile in the initramfs to avoid the second prompt. For ZFS,
this would require changing =keylocation= from =prompt= to =file:///path/to/key= and
embedding the key in the initramfs, which would test a different code path than
production. The timed sendkey approach tests the actual production passphrase flow.

** Files

- =scripts/test-install.sh= - =send_zfs_passphrase()= function
- =scripts/test-configs/zfs-encrypt.conf= - Single disk, TESTING=yes
- =scripts/test-configs/zfs-mirror-encrypt.conf= - Mirror, TESTING=yes

* References

- Arch Wiki: dm-crypt/System configuration
- HashiCorp Discuss: LUKS Encryption Key on Initial Reboot
- GitHub: tylert/packer-build Issue #31 (LUKS unattended builds)