From 663cec6520a72680609c0d803494fb0bde4ce765 Mon Sep 17 00:00:00 2001
From: Craig Jennings
Date: Sun, 17 May 2026 14:38:40 -0500
Subject: fix(testing): cleanup traps, arg validation, and two real bugs
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two real bugs and a sweep of hygiene across the harness. `make test`
passed cleanly on this branch with the same 52/0/5 profile as the
2026-05-11 run, so the wiring is verified end-to-end.

Real bugs:

- `lib/vm-utils.sh` `snapshot_exists` was running
  `qemu-img snapshot -l | grep -q "$snapshot_name"`, which matches the
  name as a substring anywhere in the output, including inside dates or
  filenames in other fields. Replaced with an awk field extraction on
  the TAG column plus `grep -Fxq` for a whole-line literal match.
- `run-test-baremetal.sh` was setting `VALIDATION_PASSED=true|false`
  after validation, but `validation.sh` already uses
  `VALIDATION_PASSED` as a pass counter. The test report then
  referenced `$VALIDATION_PASSED_COUNT`, which is defined nowhere.
  Renamed the boolean to `TEST_PASSED` (matching run-test.sh's pattern)
  and report the actual counter.

Cleanup traps and arg validation:

- `run-test.sh` now installs a top-level EXIT trap that, on abort,
  kills QEMU and restores the clean-install snapshot. A
  `CLEANUP_DONE=1` sentinel keeps the existing normal-path cleanup from
  double-firing. This is the recurring pain from 2026-05-11 where two
  failed runs left orphaned QEMU processes and dirty base disks behind.
- `create-base-vm.sh` and `debug-vm.sh` got the same kind of trap, plus
  `debug-vm.sh` now rejects non-`.qcow2` paths up front instead of
  letting QEMU fail later.
- `run-test.sh`, `run-test-baremetal.sh`, and `cleanup-tests.sh` now
  validate that options with required values actually receive one
  (`${var:?msg}` for `--script`/`--snapshot`/`--host`/`--password`,
  numeric check for `--keep`).
- `run-test-baremetal.sh` traps the temp git bundle for cleanup if the
  script aborts before its explicit `rm`. The ZFS rollback loop now
  uses `while IFS= read -r ds` and quotes `$ds` inside the ssh_cmd so
  dataset names with whitespace wouldn't break it.

Smaller hygiene:

- `vm-utils.sh` `check_ovmf` also checks `OVMF_VARS_TEMPLATE`;
  `start_qemu` validates disk and ISO paths before building the QEMU
  command; numeric tests quoted.
- `cleanup-tests.sh` find expression for temp disks wrapped in
  `\( ... -o ... \)`, all `while read` loops use `IFS= read -r`,
  orphaned QEMU cleanup tries SIGTERM with a 2s sleep before SIGKILL.
- `create-base-vm.sh` moved the "Copy an archangel-*.iso" info line
  before its `fatal` instead of after (unreachable), and added the
  serial-log path to the final summary.
- `lib/logging.sh` `stop_timer` no longer produces `$((end - ))` when
  the named timer was never started.
- `lib/network-diagnostics.sh` `read` → `IFS= read -r`.
- `setup-testing-env.sh` now installs all missing pacman packages in
  one transaction instead of one-at-a-time (avoids half-installed state
  if package N fails). KVM check also verifies the user has read+write
  on `/dev/kvm` and prints the `gpasswd -a $(id -un) kvm` fix if not.

A few items from the review I deliberately skipped: replacing the
codebase-wide unquoted `$SSH_OPTS` string with an array (cosmetic,
would need to be done everywhere at once), `set -e` adds where the
existing fall-through-on-failure is intentional, and a `--force` gate
on `create-base-vm.sh` (would break the expected workflow).
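
For reviewers, the `snapshot_exists` change listed under "Real bugs" can
be reproduced without a disk image. The sketch below is illustrative
only: `sample_listing`, `exists_substring`, and `exists_exact` are
hypothetical helpers that are not part of the harness, and the canned
listing only approximates the column layout of `qemu-img snapshot -l`
(two header lines, then ID and TAG as the first two fields).

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for `qemu-img snapshot -l "$disk"`; the exact
# spacing of real output may differ, only the field order matters here.
sample_listing() {
    cat <<'EOF'
Snapshot list:
ID        TAG                   VM SIZE                DATE       VM CLOCK     ICOUNT
1         clean-install-old         0 B 2026-05-11 10:00:00   00:00:00.000          0
2         post-test                 0 B 2026-05-17 14:00:00   00:00:00.000          0
EOF
}

# Old behaviour: regex substring match anywhere in the listing.
exists_substring() {
    sample_listing | grep -q "$1"
}

# New behaviour: skip the two header lines, keep only the TAG field,
# then require a fixed-string match on the whole line.
exists_exact() {
    sample_listing | awk 'NR > 2 { print $2 }' | grep -Fxq "$1"
}

for name in clean-install post-test 2026; do
    if exists_substring "$name"; then old=found; else old=missing; fi
    if exists_exact "$name"; then new=found; else new=missing; fi
    echo "$name: substring=$old exact=$new"
done
```

With this sample, the substring check reports `clean-install` and `2026`
as present even though no snapshot carries either tag, while the exact
check accepts only `post-test`. One caveat of the field-based approach:
awk splits on whitespace, so a tag containing spaces would not match.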
--- scripts/testing/cleanup-tests.sh | 24 +++++++++++++------ scripts/testing/create-base-vm.sh | 19 +++++++++++---- scripts/testing/debug-vm.sh | 18 +++++++++++++- scripts/testing/lib/logging.sh | 9 +++++-- scripts/testing/lib/network-diagnostics.sh | 2 +- scripts/testing/lib/vm-utils.sh | 24 ++++++++++++++++--- scripts/testing/run-test-baremetal.sh | 38 ++++++++++++++++++------------ scripts/testing/run-test.sh | 34 +++++++++++++++++++++----- scripts/testing/setup-testing-env.sh | 26 +++++++++++++------- 9 files changed, 146 insertions(+), 48 deletions(-) (limited to 'scripts/testing') diff --git a/scripts/testing/cleanup-tests.sh b/scripts/testing/cleanup-tests.sh index fd2f8de..5c0153b 100755 --- a/scripts/testing/cleanup-tests.sh +++ b/scripts/testing/cleanup-tests.sh @@ -20,6 +20,12 @@ FORCE=false while [[ $# -gt 0 ]]; do case $1 in --keep) + case "${2:-}" in + ''|*[!0-9]*) + echo "Error: --keep requires a non-negative integer (got: '${2:-}')" >&2 + exit 1 + ;; + esac KEEP_LAST="$2" shift 2 ;; @@ -68,13 +74,17 @@ QEMU_PIDS=$(pgrep -f "qemu-system.*archsetup-test" 2>/dev/null || true) if [ -n "$QEMU_PIDS" ]; then info "Found orphaned QEMU processes: $QEMU_PIDS" if $FORCE; then - echo "$QEMU_PIDS" | xargs kill -9 2>/dev/null || true + echo "$QEMU_PIDS" | xargs -r kill 2>/dev/null || true + sleep 2 + echo "$QEMU_PIDS" | xargs -r kill -9 2>/dev/null || true success "Orphaned processes killed" else read -p "Kill orphaned QEMU processes? 
[y/N] " -n 1 -r echo "" if [[ $REPLY =~ ^[Yy]$ ]]; then - echo "$QEMU_PIDS" | xargs kill -9 2>/dev/null || true + echo "$QEMU_PIDS" | xargs -r kill 2>/dev/null || true + sleep 2 + echo "$QEMU_PIDS" | xargs -r kill -9 2>/dev/null || true success "Orphaned processes killed" fi fi @@ -88,7 +98,7 @@ section "Cleaning Up Disk Images" step "Finding temporary disk images" if [ -d "$PROJECT_ROOT/vm-images" ]; then - TEMP_DISKS=$(find "$PROJECT_ROOT/vm-images" -name "debug-overlay-*.qcow2" -o -name "archsetup-test-*.qcow2" 2>/dev/null || true) + TEMP_DISKS=$(find "$PROJECT_ROOT/vm-images" \( -name "debug-overlay-*.qcow2" -o -name "archsetup-test-*.qcow2" \) 2>/dev/null || true) if [ -z "$TEMP_DISKS" ]; then info "No temporary disk images found" @@ -98,7 +108,7 @@ if [ -d "$PROJECT_ROOT/vm-images" ]; then info "Found $DISK_COUNT temporary disk image(s) totaling $DISK_SIZE" if $FORCE; then - echo "$TEMP_DISKS" | while read disk; do + echo "$TEMP_DISKS" | while IFS= read -r disk; do rm -f "$disk" done success "Temporary disk images deleted" @@ -109,7 +119,7 @@ if [ -d "$PROJECT_ROOT/vm-images" ]; then read -p "Delete these disk images? [y/N] " -n 1 -r echo "" if [[ $REPLY =~ ^[Yy]$ ]]; then - echo "$TEMP_DISKS" | while read disk; do + echo "$TEMP_DISKS" | while IFS= read -r disk; do rm -f "$disk" done success "Temporary disk images deleted" @@ -143,7 +153,7 @@ else info "Keeping last $KEEP_LAST, deleting $DELETE_COUNT old result(s)" if $FORCE; then - echo "$TO_DELETE" | while read dir; do + echo "$TO_DELETE" | while IFS= read -r dir; do rm -rf "$dir" done success "Old test results deleted" @@ -155,7 +165,7 @@ else read -p "Delete these test results? 
[y/N] " -n 1 -r echo "" if [[ $REPLY =~ ^[Yy]$ ]]; then - echo "$TO_DELETE" | while read dir; do + echo "$TO_DELETE" | while IFS= read -r dir; do rm -rf "$dir" done success "Old test results deleted" diff --git a/scripts/testing/create-base-vm.sh b/scripts/testing/create-base-vm.sh index 11c40a7..4ecf4d6 100755 --- a/scripts/testing/create-base-vm.sh +++ b/scripts/testing/create-base-vm.sh @@ -29,6 +29,14 @@ LOGFILE="$PROJECT_ROOT/test-results/create-base-vm-$(date +'%Y%m%d-%H%M%S').log" init_logging "$LOGFILE" init_vm_paths "$VM_IMAGES_DIR" +# Kill QEMU if we exit before reaching the controlled stop_qemu calls below. +cleanup_create_base() { + if declare -f vm_is_running >/dev/null 2>&1 && vm_is_running; then + kill_qemu 2>/dev/null || true + fi +} +trap cleanup_create_base EXIT + section "Creating Base VM for ArchSetup Testing" # ─── Prerequisites ──────────────────────────────────────────────────── @@ -45,8 +53,8 @@ fi # Find archangel ISO in vm-images/ ISO_PATH=$(find "$VM_IMAGES_DIR" -maxdepth 1 -name "archangel-*.iso" -type f 2>/dev/null | sort -V | tail -1) if [ -z "$ISO_PATH" ]; then - fatal "No archangel ISO found in $VM_IMAGES_DIR/" info "Copy an archangel-*.iso file to: $VM_IMAGES_DIR/" + fatal "No archangel ISO found in $VM_IMAGES_DIR/" fi info "Using ISO: $(basename "$ISO_PATH")" @@ -150,10 +158,11 @@ create_snapshot "$DISK_PATH" "$SNAPSHOT_NAME" || fatal "Failed to create snapsho section "Base VM Created Successfully" info "" -info " Disk: $DISK_PATH" -info " Snapshot: $SNAPSHOT_NAME" -info " Config: $(basename "$CONFIG_FILE")" -info " Log: $LOGFILE" +info " Disk: $DISK_PATH" +info " Snapshot: $SNAPSHOT_NAME" +info " Config: $(basename "$CONFIG_FILE")" +info " Log: $LOGFILE" +info " Serial log: $SERIAL_LOG" info "" info "Next step: Run ./scripts/testing/run-test.sh" info "" diff --git a/scripts/testing/debug-vm.sh b/scripts/testing/debug-vm.sh index 5b2b197..32f377c 100755 --- a/scripts/testing/debug-vm.sh +++ b/scripts/testing/debug-vm.sh @@ -25,7 
+25,13 @@ if [ $# -eq 0 ]; then elif [ "$1" = "--base" ]; then USE_BASE=true elif [ -f "$1" ]; then - VM_DISK="$1" + case "$1" in + *.qcow2) VM_DISK="$1" ;; + *) + echo "Error: disk image must be qcow2 (got: $1)" >&2 + exit 1 + ;; + esac else echo "Usage: $0 [disk-image.qcow2 | --base]" echo "" @@ -48,6 +54,16 @@ LOGFILE="/tmp/debug-vm-$TIMESTAMP.log" init_logging "$LOGFILE" init_vm_paths "$VM_IMAGES_DIR" +cleanup_debug() { + if declare -f vm_is_running >/dev/null 2>&1 && vm_is_running; then + kill_qemu 2>/dev/null || true + fi + if [ -n "$OVERLAY_DISK" ] && [ -f "$OVERLAY_DISK" ]; then + rm -f "$OVERLAY_DISK" + fi +} +trap cleanup_debug EXIT + section "Launching Debug VM" # Determine which disk to use diff --git a/scripts/testing/lib/logging.sh b/scripts/testing/lib/logging.sh index eda9eb1..ed20707 100755 --- a/scripts/testing/lib/logging.sh +++ b/scripts/testing/lib/logging.sh @@ -135,8 +135,13 @@ start_timer() { stop_timer() { local name="${1:-default}" - local start=${TIMERS[$name]} - local end=$(date +%s) + local start="${TIMERS[$name]:-}" + if [ -z "$start" ]; then + log "TIMER STOP: $name (never started, skipping)" + return 0 + fi + local end + end=$(date +%s) local duration=$((end - start)) local mins=$((duration / 60)) local secs=$((duration % 60)) diff --git a/scripts/testing/lib/network-diagnostics.sh b/scripts/testing/lib/network-diagnostics.sh index d73ffe5..674aeba 100644 --- a/scripts/testing/lib/network-diagnostics.sh +++ b/scripts/testing/lib/network-diagnostics.sh @@ -53,7 +53,7 @@ run_network_diagnostics() { # Show network info info "Network configuration:" - $ssh_base "ip addr show | grep 'inet ' | grep -v '127.0.0.1'" 2>/dev/null | while read line; do + $ssh_base "ip addr show | grep 'inet ' | grep -v '127.0.0.1'" 2>/dev/null | while IFS= read -r line; do info " $line" done diff --git a/scripts/testing/lib/vm-utils.sh b/scripts/testing/lib/vm-utils.sh index 47bd391..a8736a3 100755 --- a/scripts/testing/lib/vm-utils.sh +++ 
b/scripts/testing/lib/vm-utils.sh @@ -72,6 +72,11 @@ check_ovmf() { info "Install with: sudo pacman -S edk2-ovmf" return 1 fi + if [ ! -f "$OVMF_VARS_TEMPLATE" ]; then + error "OVMF vars template not found: $OVMF_VARS_TEMPLATE" + info "Install with: sudo pacman -S edk2-ovmf" + return 1 + fi return 0 } @@ -132,6 +137,15 @@ start_qemu() { local iso_path="${3:-}" local display="${4:-none}" + if [ -z "$disk" ] || [ ! -f "$disk" ]; then + error "Disk image not found: ${disk:-}" + return 1 + fi + if [ "$mode" = "iso" ] && { [ -z "$iso_path" ] || [ ! -f "$iso_path" ]; }; then + error "ISO not found: ${iso_path:-}" + return 1 + fi + # Stop any existing instance stop_qemu 2>/dev/null || true @@ -223,7 +237,7 @@ stop_qemu() { # Wait for graceful shutdown local elapsed=0 - while [ $elapsed -lt $timeout ]; do + while [ "$elapsed" -lt "$timeout" ]; do if ! vm_is_running; then success "VM stopped gracefully" _cleanup_qemu_files @@ -319,7 +333,11 @@ list_snapshots() { snapshot_exists() { local disk="${1:-$DISK_PATH}" local snapshot_name="${2:-clean-install}" - qemu-img snapshot -l "$disk" 2>/dev/null | grep -q "$snapshot_name" + # Match on the TAG column (field 2) so a name appearing inside a timestamp + # or filename elsewhere in the output can't false-positive the check. + qemu-img snapshot -l "$disk" 2>/dev/null \ + | awk 'NR > 2 { print $2 }' \ + | grep -Fxq "$snapshot_name" } # ─── SSH Operations ─────────────────────────────────────────────────── @@ -331,7 +349,7 @@ wait_for_ssh() { local elapsed=0 progress "Waiting for SSH on localhost:$SSH_PORT..." 
- while [ $elapsed -lt $timeout ]; do + while [ "$elapsed" -lt "$timeout" ]; do if sshpass -p "$password" ssh $SSH_OPTS -p "$SSH_PORT" root@localhost true 2>/dev/null; then success "SSH is available" return 0 diff --git a/scripts/testing/run-test-baremetal.sh b/scripts/testing/run-test-baremetal.sh index c108e6f..3beaefc 100755 --- a/scripts/testing/run-test-baremetal.sh +++ b/scripts/testing/run-test-baremetal.sh @@ -47,11 +47,11 @@ VALIDATE_ONLY=false while [[ $# -gt 0 ]]; do case $1 in --host) - TARGET_HOST="$2" + TARGET_HOST="${2:?--host requires a value}" shift 2 ;; --password) - ROOT_PASSWORD="$2" + ROOT_PASSWORD="${2:?--password requires a value}" shift 2 ;; --rollback-first) @@ -86,6 +86,12 @@ fi TIMESTAMP=$(date +'%Y%m%d-%H%M%S') TEST_RESULTS_DIR="$PROJECT_ROOT/test-results/baremetal-$TIMESTAMP" ARCHZFS_INBOX="$HOME/code/archzfs/inbox" +BUNDLE_FILE="" + +cleanup_baremetal() { + [ -n "$BUNDLE_FILE" ] && [ -f "$BUNDLE_FILE" ] && rm -f "$BUNDLE_FILE" +} +trap cleanup_baremetal EXIT # Override VM_IP for validation.sh ssh_cmd function VM_IP="$TARGET_HOST" @@ -121,12 +127,13 @@ if $ROLLBACK_FIRST; then DATASETS=$(ssh_cmd "zfs list -H -o name -t snapshot | grep '@genesis$' | sed 's/@genesis$//'") step "Rolling back all datasets to genesis" - for ds in $DATASETS; do + while IFS= read -r ds; do + [ -z "$ds" ] && continue info "Rolling back $ds@genesis" - if ! ssh_cmd "zfs rollback -r $ds@genesis" &>> "$LOGFILE"; then + if ! 
ssh_cmd "zfs rollback -r \"$ds@genesis\"" &>> "$LOGFILE"; then warn "Failed to rollback $ds@genesis" fi - done + done <<< "$DATASETS" success "Rollback complete" # Need to reconnect after rollback @@ -246,11 +253,11 @@ fi # Generate reports generate_issue_report "$TEST_RESULTS_DIR" "$ARCHZFS_INBOX" -# Set validation result -if [ $VALIDATION_FAILED -eq 0 ]; then - VALIDATION_PASSED=true +# Set validation result (TEST_PASSED is the boolean; VALIDATION_PASSED stays the counter) +if [ "$VALIDATION_FAILED" -eq 0 ]; then + TEST_PASSED=true else - VALIDATION_PASSED=false + TEST_PASSED=false fi # Generate test report @@ -269,10 +276,10 @@ Test Method: Bare Metal ZFS Results: ArchSetup Exit Code: $ARCHSETUP_EXIT_CODE - Validation: $(if $VALIDATION_PASSED; then echo "PASSED"; else echo "FAILED"; fi) + Validation: $(if $TEST_PASSED; then echo "PASSED"; else echo "FAILED"; fi) Validation Summary: - Passed: $VALIDATION_PASSED_COUNT + Passed: $VALIDATION_PASSED Failed: $VALIDATION_FAILED Warnings: $VALIDATION_WARNINGS @@ -290,17 +297,18 @@ if $ROLLBACK_AFTER; then section "Rolling Back to Genesis (cleanup)" DATASETS=$(ssh_cmd "zfs list -H -o name -t snapshot | grep '@genesis$' | sed 's/@genesis$//'") - for ds in $DATASETS; do + while IFS= read -r ds; do + [ -z "$ds" ] && continue info "Rolling back $ds@genesis" - ssh_cmd "zfs rollback -r $ds@genesis" &>> "$LOGFILE" || true - done + ssh_cmd "zfs rollback -r \"$ds@genesis\"" &>> "$LOGFILE" || true + done <<< "$DATASETS" success "Rollback complete" fi # Final summary section "Test Complete" -if [ $ARCHSETUP_EXIT_CODE -eq 0 ] && $VALIDATION_PASSED; then +if [ "$ARCHSETUP_EXIT_CODE" -eq 0 ] && $TEST_PASSED; then success "TEST PASSED" exit 0 else diff --git a/scripts/testing/run-test.sh b/scripts/testing/run-test.sh index c0eaf50..18f4fdf 100755 --- a/scripts/testing/run-test.sh +++ b/scripts/testing/run-test.sh @@ -36,11 +36,11 @@ while [[ $# -gt 0 ]]; do shift ;; --script) - ARCHSETUP_SCRIPT="$2" + ARCHSETUP_SCRIPT="${2:?--script 
requires a value}" shift 2 ;; --snapshot) - SNAPSHOT_NAME="$2" + SNAPSHOT_NAME="${2:?--snapshot requires a value}" shift 2 ;; *) @@ -53,6 +53,28 @@ while [[ $# -gt 0 ]]; do esac done +# Failure-path cleanup. Normal completion (further down) sets CLEANUP_DONE=1 +# so the trap becomes a no-op. If we abort partway through, the trap is the +# safety net that stops a leaked QEMU and reverts the base disk. +CLEANUP_DONE=0 +BUNDLE_FILE="" +cleanup_run_test() { + [ "$CLEANUP_DONE" = "1" ] && return 0 + CLEANUP_DONE=1 + [ -n "$BUNDLE_FILE" ] && [ -f "$BUNDLE_FILE" ] && rm -f "$BUNDLE_FILE" + if [ "$KEEP_VM" = "true" ]; then + return 0 + fi + if declare -f stop_qemu >/dev/null 2>&1; then + stop_qemu 2>/dev/null || true + fi + if [ -n "${DISK_PATH:-}" ] && [ -n "${SNAPSHOT_NAME:-}" ] \ + && declare -f restore_snapshot >/dev/null 2>&1; then + restore_snapshot "$DISK_PATH" "$SNAPSHOT_NAME" 2>/dev/null || true + fi +} +trap cleanup_run_test EXIT + # Configuration TIMESTAMP=$(date +'%Y%m%d-%H%M%S') VM_IMAGES_DIR="$PROJECT_ROOT/vm-images" @@ -79,8 +101,8 @@ fi # Check disk exists if [ ! -f "$DISK_PATH" ]; then - fatal "Base disk not found: $DISK_PATH" info "Create it first: ./scripts/testing/create-base-vm.sh" + fatal "Base disk not found: $DISK_PATH" fi # Check if snapshot exists @@ -88,11 +110,11 @@ section "Preparing Test Environment" step "Checking for snapshot: $SNAPSHOT_NAME" if ! 
snapshot_exists "$DISK_PATH" "$SNAPSHOT_NAME"; then - fatal "Snapshot '$SNAPSHOT_NAME' not found on $DISK_PATH" info "Available snapshots:" list_snapshots "$DISK_PATH" info "" info "Create base VM with: ./scripts/testing/create-base-vm.sh" + fatal "Snapshot '$SNAPSHOT_NAME' not found on $DISK_PATH" fi success "Snapshot $SNAPSHOT_NAME exists" @@ -127,7 +149,6 @@ section "Simulating Git Clone" step "Creating shallow git clone on VM" info "This simulates: git clone --depth 1 /home/cjennings/code/archsetup" -# Create a temporary git bundle from current repo BUNDLE_FILE=$(mktemp) git -C "$PROJECT_ROOT" bundle create "$BUNDLE_FILE" HEAD >> "$LOGFILE" 2>&1 @@ -276,7 +297,7 @@ analyze_log_diff "$TEST_RESULTS_DIR" generate_issue_report "$TEST_RESULTS_DIR" "$ARCHZFS_INBOX" # Set validation result based on failure count -if [ $VALIDATION_FAILED -eq 0 ]; then +if [ "$VALIDATION_FAILED" -eq 0 ]; then TEST_PASSED=true else TEST_PASSED=false @@ -350,6 +371,7 @@ else warn "Failed to revert snapshot - VM may be in modified state" fi fi +CLEANUP_DONE=1 # Final summary section "Test Complete" diff --git a/scripts/testing/setup-testing-env.sh b/scripts/testing/setup-testing-env.sh index f0e63aa..fb0628b 100755 --- a/scripts/testing/setup-testing-env.sh +++ b/scripts/testing/setup-testing-env.sh @@ -45,25 +45,35 @@ PACKAGES=( socat ) +to_install=() for pkg in "${PACKAGES[@]}"; do if pacman -Qi "$pkg" &>/dev/null; then info "$pkg is already installed" else - step "Installing $pkg" - if sudo pacman -S --noconfirm "$pkg" >> "$LOGFILE" 2>&1; then - success "$pkg installed" - else - error "Failed to install $pkg" - fatal "Package installation failed" - fi + to_install+=("$pkg") fi done +if [ "${#to_install[@]}" -gt 0 ]; then + step "Installing in one transaction: ${to_install[*]}" + if sudo pacman -S --needed --noconfirm "${to_install[@]}" >> "$LOGFILE" 2>&1; then + success "All required packages installed" + else + fatal "Package installation failed" + fi +fi + # Verify KVM support section 
"Verifying KVM Support" if [ -e /dev/kvm ]; then - success "KVM is available" + if [ -r /dev/kvm ] && [ -w /dev/kvm ]; then + success "KVM is available and accessible" + else + warn "KVM exists but is not readable/writable by user $(id -un)" + info "Add the user to the kvm group: sudo gpasswd -a $(id -un) kvm" + info "Then log out and back in so the new group takes effect" + fi else error "KVM is not available" info "Check if virtualization is enabled in BIOS" -- cgit v1.2.3