author    Craig Jennings <c@cjennings.net>  2026-04-28 13:56:06 -0500
committer Craig Jennings <c@cjennings.net>  2026-04-28 13:56:06 -0500
commit    71ccfdd0e6216356ec6cac90bc627fe02dbfdeb1 (patch)
tree      a75a387d0c2ec95bfc3850e01fb58067b56aa628 /.claude/rules/testing.md
chore: scaffold gloss package
Five layered files per the design at docs/design/gloss.org. gloss-core for the data layer, gloss-fetch for the network layer, gloss-display for the UI, gloss-drill for the spaced-repetition export, and gloss.el as the entry point. All five are skeletons. Implementation comes next. The Makefile delegates to ert with the usual unit, integration, and per-file targets. It also runs paren and lint passes. The package is licensed GPL-3.0-or-later. README is a placeholder pointing at the design doc.
Diffstat (limited to '.claude/rules/testing.md')
-rw-r--r--  .claude/rules/testing.md | 277
1 file changed, 277 insertions(+), 0 deletions(-)
diff --git a/.claude/rules/testing.md b/.claude/rules/testing.md
new file mode 100644
index 0000000..b91b76c
--- /dev/null
+++ b/.claude/rules/testing.md
@@ -0,0 +1,277 @@
+# Testing Standards
+
+Applies to: `**/*`
+
+Core TDD discipline and test quality rules. Language-specific patterns
+(frameworks, fixture idioms, mocking tools) live in per-language testing files
+under `languages/<lang>/claude/rules/`.
+
+## Test-Driven Development (Default)
+
+TDD is the default workflow for all code, including demos and prototypes. **Write tests first, before any implementation code.** Tests are how you prove you understand the problem — if you can't write a failing test, you don't yet understand what needs to change.
+
+1. **Red**: Write a failing test that defines the desired behavior
+2. **Green**: Write the minimal code to make the test pass
+3. **Refactor**: Clean up while keeping tests green
+
+Do not skip TDD for demo code. Demos build muscle memory — the habit carries into production.
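+
+A minimal sketch of one red-green pass in pytest, using a hypothetical
+`slugify` helper (the names are illustrative, not from this repo):
+
+```
+# Red: write the failing test first. It fails because slugify doesn't exist.
+def test_slugify_spaces_become_hyphens():
+    assert slugify("Hello World") == "hello-world"
+
+# Green: the minimal implementation that makes the test pass.
+def slugify(text):
+    return text.strip().lower().replace(" ", "-")
+
+# Refactor: rename, extract, simplify, while the test stays green.
+```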
+
+### Understand Before You Test
+
+Before writing tests, invest time in understanding the code:
+
+1. **Explore the codebase** — Read the module under test, its callers, and its dependencies. Understand the data flow end to end.
+2. **Identify the root cause** — If fixing a bug, trace the problem to its origin. Don't test (or fix) surface symptoms when the real issue is deeper in the call chain.
+3. **Reason through edge cases** — Consider boundary conditions, error states, concurrent access, and interactions with adjacent modules. Your tests should cover what could actually go wrong, not just the obvious happy path.
+
+### Adding Tests to Existing Untested Code
+
+When working in a codebase without tests:
+
+1. Write a **characterization test** that captures current behavior before making changes
+2. Use the characterization test as a safety net while refactoring
+3. Then follow normal TDD for the new change
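+
+A sketch of step 1, assuming a legacy `format_price` function whose intended
+behavior is undocumented (the expected values below would be captured by
+running the real function, not invented):
+
+```
+def test_format_price_characterization():
+    # Pins down *current* behavior, not necessarily correct behavior.
+    # These expected values were recorded from the live function's output.
+    assert format_price(1234.5) == "$1,234.50"
+    assert format_price(0) == "$0.00"
+```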
+
+## Test Categories (Required for All Code)
+
+Every unit under test requires coverage across three categories:
+
+### 1. Normal Cases (Happy Path)
+- Standard inputs and expected use cases
+- Common workflows and default configurations
+- Typical data volumes
+
+### 2. Boundary Cases
+- Minimum/maximum values (0, 1, -1, MAX_INT)
+- Empty vs null vs undefined (language-appropriate)
+- Single-element collections
+- Unicode and internationalization (emoji, RTL text, combining characters)
+- Very long strings, deeply nested structures
+- Timezone boundaries (midnight, DST transitions)
+- Date edge cases (leap years, month boundaries)
+
+### 3. Error Cases
+- Invalid inputs and type mismatches
+- Network failures and timeouts
+- Missing required parameters
+- Permission denied scenarios
+- Resource exhaustion
+- Malformed data
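+
+A sketch of all three categories for a hypothetical `parse_port` validator,
+in pytest:
+
+```
+import pytest
+
+# 1. Normal case: a typical port string parses to an int.
+def test_parse_port_typical_value_returns_int():
+    assert parse_port("8080") == 8080
+
+# 2. Boundary cases: the extremes of the valid range still parse.
+@pytest.mark.parametrize("raw,expected", [("1", 1), ("65535", 65535)])
+def test_parse_port_range_limits_accepted(raw, expected):
+    assert parse_port(raw) == expected
+
+# 3. Error cases: out-of-range and malformed inputs raise.
+@pytest.mark.parametrize("raw", ["", "0", "65536", "http"])
+def test_parse_port_invalid_input_raises(raw):
+    with pytest.raises(ValueError):
+        parse_port(raw)
+```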
+
+## Combinatorial Coverage
+
+For functions with 3+ parameters that each take multiple values (feature-flag
+combinations, config matrices, permission/role interactions, multi-field
+form validation, API parameter spaces), the exhaustive test count explodes
+(M^N for N parameters with M values each), while 3-5 ad-hoc cases miss pair
+interactions. Use **pairwise / combinatorial testing** — generate a minimal
+matrix that hits every 2-way combination of parameter values. Empirically,
+this catches 60-90% of combinatorial bugs with 80-99% fewer tests than the
+exhaustive matrix.
+
+Invoke `/pairwise-tests` on the offending function; continue using `/add-tests`
+and the Normal/Boundary/Error discipline for the rest. The two approaches
+complement each other: pairwise covers parameter *interactions*; category
+discipline covers each parameter's individual edge space.
+
+Skip pairwise when: the function has 1-2 parameters (just write the cases),
+the context requires *provably* exhaustive coverage (regulated systems — document
+in an ADR), or the testing target is non-parametric (single happy path,
+performance regression, a specific error).
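+
+As a sketch, a hand-built pairwise matrix for three hypothetical parameters
+with three values each: 9 rows instead of the exhaustive 27, yet every 2-way
+value pair appears at least once (the third column rotates Latin-square
+style):
+
+```
+import pytest
+
+# Across these 9 rows, all tier/region, tier/format, and region/format
+# value pairs occur at least once; the exhaustive matrix needs 27 rows.
+PAIRWISE = [
+    ("free", "us", "json"), ("free", "eu", "xml"),  ("free", "ap", "csv"),
+    ("pro",  "us", "xml"),  ("pro",  "eu", "csv"),  ("pro",  "ap", "json"),
+    ("team", "us", "csv"),  ("team", "eu", "json"), ("team", "ap", "xml"),
+]
+
+@pytest.mark.parametrize("tier,region,fmt", PAIRWISE)
+def test_export_report_pairwise(tier, region, fmt):
+    assert export_report(tier=tier, region=region, fmt=fmt).ok  # hypothetical
+```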
+
+## Test Organization
+
+Typical layout:
+
+```
+tests/
+ unit/ # One test file per source file
+ integration/ # Multi-component workflows
+ e2e/ # Full system tests
+```
+
+Per-language files may adjust this (e.g. Elisp collates ERT tests into
+`tests/test-<module>*.el` without subdirectories).
+
+### Testing Pyramid
+
+Rough proportions for most projects:
+- Unit tests: 70-80% (fast, isolated, granular)
+- Integration tests: 15-25% (component interactions, real dependencies)
+- E2E tests: 5-10% (full system, slowest)
+
+Don't duplicate coverage: if unit tests fully exercise a function's logic,
+integration tests should focus on *how* components interact — not repeat the
+function's case coverage.
+
+## Integration Tests
+
+Integration tests exercise multiple components together. Two rules:
+
+**The docstring names every component integrated** and marks which are real vs
+mocked. Integration failures are harder to pinpoint than unit failures;
+enumerating the participants up front tells you where to start looking.
+
+Example:
+
+```
+def test_integration_refund_during_sync_updates_ledger_atomically():
+ """Refund processed mid-sync updates order and ledger in one transaction.
+
+ Components integrated:
+ - OrderService.refund (entry point)
+ - PaymentGateway.reverse (MOCKED — returns success)
+ - Ledger.credit (real)
+ - db.transaction (real)
+
+ Validates:
+ - Refund rolls back if ledger write fails
+ - Both tables updated or neither
+ """
+```
+
+**Write an integration test when** multiple components must work together,
+state crosses function boundaries, or edge cases combine. **Don't** when
+single-function behavior suffices, or when mocking would erase the interaction
+you meant to test.
+
+## Naming Convention
+
+- Unit: `test_<module>_<function>_<scenario>_<expected>`
+- Integration: `test_integration_<workflow>_<scenario>_<outcome>`
+
+Examples:
+- `test_cart_apply_discount_expired_coupon_raises_error`
+- `test_integration_order_sync_network_timeout_retries_three_times`
+
+Languages that prefer camelCase, kebab-case, or other conventions keep the
+structure but use their idiom. Consistency within a project matters more than
+the specific case choice.
+
+## Test Quality
+
+### Independence
+- No shared mutable state between tests
+- Each test runs successfully in isolation
+- Explicit setup and teardown
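+
+A sketch of explicit setup and teardown as a pytest fixture (`tmp_path` is
+pytest's built-in temporary directory; the store itself is hypothetical):
+
+```
+import pytest
+
+@pytest.fixture
+def store(tmp_path):
+    # Setup: a fresh store per test, so no state leaks between tests.
+    handle = open(tmp_path / "store.db", "w+")
+    yield handle
+    # Teardown: runs even if the test fails.
+    handle.close()
+```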
+
+### Determinism
+- Never hardcode dates or times — generate them relative to `now()` (sketch below)
+- No reliance on test execution order
+- No flaky network calls in unit tests
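+
+For the date rule, a minimal sketch, assuming the code under test accepts a
+reference time as a parameter (a common seam for deterministic tests):
+
+```
+from datetime import datetime, timedelta, timezone
+
+def test_invoice_overdue_thirty_days_past_due():
+    now = datetime.now(timezone.utc)             # generated, never hardcoded
+    due = now - timedelta(days=31)
+    assert is_overdue(due_date=due, as_of=now)   # hypothetical function
+```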
+
+### Performance
+- Unit tests: <100ms each
+- Integration tests: <1s each
+- E2E tests: <10s each
+- Mark slow tests with appropriate decorators/tags
+
+### Mocking Boundaries
+Mock external dependencies at the system boundary:
+- Network calls (HTTP, gRPC, WebSocket)
+- File I/O and cloud storage
+- Time and dates
+- Third-party service clients
+
+Never mock:
+- The code under test
+- Internal domain logic
+- Framework behavior (ORM queries, middleware, hooks, buffer primitives)
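+
+A sketch of boundary mocking with `unittest.mock`, assuming a hypothetical
+`fetch_user` in `myapp.api` that wraps an HTTP helper:
+
+```
+from unittest.mock import patch
+
+def test_fetch_user_returns_parsed_name():
+    # Patch the network call (the boundary); fetch_user's own parsing runs
+    # for real against the canned payload.
+    payload = {"id": 7, "name": "Ada"}
+    with patch("myapp.api.http_get", return_value=payload) as http_get:
+        user = fetch_user(7)
+    http_get.assert_called_once_with("/users/7")
+    assert user.name == "Ada"
+```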
+
+### Signs of Overmocking
+
+Ask yourself:
+
+- Would this test still pass if I replaced the function body with `raise NotImplementedError` (or equivalent)? If yes, the mocks are doing the work — you're testing mocks, not code.
+- Is the mock more complex than the function being tested? Smell.
+- Am I mocking internal string / parsing / decoding helpers? Those aren't boundaries — they're the work.
+- Does the test break when I refactor without changing behavior? Good tests survive refactors; overmocked ones couple to implementation.
+
+When tests demand heavy internal mocking, the fix isn't better mocks — it's
+restructuring the code (see *If Tests Are Hard to Write* below).
+
+### Testing Code That Uses Frameworks
+
+When a function mostly delegates to framework or library code, test *your*
+integration logic:
+- ✓ "I call the library with the right arguments in the right context"
+- ✓ "I handle its return value correctly"
+- ✗ "The library works in 50 scenarios" — trust it; it has its own tests
+
+For polyglot behavior (e.g., comment handling across C/Java/Go/JS), test 2-3
+representative modes thoroughly plus a minimal smoke test in the others.
+Exhaustive permutations are diminishing returns.
+
+### Test Real Code, Not Copies
+
+Never inline or copy production code into test files. Always `require`/`import`
+the module under test. Copied code passes even when production breaks — the
+bug hides behind the duplicate.
+
+Mock dependencies at their boundary; exercise the real function body.
+
+### Error Behavior, Not Error Text
+
+Test that errors occur with the right type; don't assert exact wording:
+- ✓ Right exception type (`pytest.raises(ValueError)`, `(should-error ... :type 'user-error)`)
+- ✓ Regex on values the message *must* contain (e.g., the offending filename)
+- ✗ `assert str(e) == "File 'foo' not found"` — breaks when prose changes even though behavior is unchanged
+
+Production code should emit clear, contextual errors. Tests verify the
+behavior (raised, caught, returned nil) and values that must appear — not the
+prose.
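+
+A pytest sketch, assuming a hypothetical `load_config`:
+
+```
+import pytest
+
+def test_load_config_missing_file_raises_with_filename():
+    # Assert the type and the value the message must contain, not the prose.
+    with pytest.raises(FileNotFoundError, match=r"settings\.toml"):
+        load_config("settings.toml")
+```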
+
+## If Tests Are Hard to Write, Refactor the Code
+
+If a test needs extensive mocking of internal helpers, elaborate fixture
+scaffolding, or mocks that recreate the function's own logic, the production
+code needs restructuring — not the test.
+
+Signals:
+- Deep nesting (callbacks inside callbacks)
+- Long functions doing multiple things ("fetch AND parse AND decode AND save")
+- Tests that mock internal string / parsing / I/O helpers
+- Tests that break on refactors with no behavior change
+
+Fix: extract focused helpers (one responsibility each), test each in isolation
+with real inputs, compose them in a thin outer function. Several small unit
+tests plus one composition test beats one monster test behind a wall of mocks.
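+
+A sketch of that fix on a hypothetical "fetch AND parse AND save" function:
+
+```
+def parse_prices(raw):
+    # Pure helper: unit-tested with real inputs, no mocks required.
+    return {row["sku"]: float(row["price"]) for row in raw["items"]}
+
+def sync_prices(url, db):
+    # Thin composition: one line per step, covered by one composition test.
+    raw = fetch_prices(url)      # boundary: mocked in tests
+    prices = parse_prices(raw)   # pure logic: tested directly
+    save_prices(db, prices)     # boundary: tested against a temp database
+```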
+
+## Coverage Targets
+
+- Business logic and domain services: **90%+**
+- API endpoints and views: **80%+**
+- UI components: **70%+**
+- Utilities and helpers: **90%+**
+- Overall project minimum: **80%+**
+
+New code must not decrease coverage. PRs that lower coverage require justification.
+
+## TDD Discipline
+
+TDD is non-negotiable. These are the rationalizations agents use to skip it — don't fall for them:
+
+| Excuse | Why It's Wrong |
+|--------|----------------|
+| "This is too simple to need a test" | Simple code breaks too. The test takes 30 seconds. Write it. |
+| "I'll add tests after the implementation" | You won't, and even if you do, they'll test what you wrote rather than what was needed. Test-after validates implementation, not behavior. |
+| "Let me just get it working first" | That's not TDD. If you can't write a failing test, you don't understand the requirement yet. |
+| "This is just a refactor" | Refactors without tests are guesses. Write a characterization test first, then refactor while it stays green. |
+| "I'm only changing one line" | One-line changes cause production outages. Write a test that covers the line you're changing. |
+| "The existing code has no tests" | Start with a characterization test. Don't make the problem worse. |
+| "This is demo/prototype code" | Demos build habits. Untested demo code becomes untested production code. |
+| "I need to spike first" | Spikes are fine — then throw away the spike, write the test, and implement properly. |
+
+If you catch yourself thinking any of these, stop and write the test.
+
+## Anti-Patterns (Do Not Do)
+
+- Hardcoded dates or timestamps (they rot)
+- Testing implementation details instead of behavior
+- Mocking the thing you're testing
+- Mocking internal helpers (string ops, parsing, decoding) — those are the work
+- Inlining production code into test files — always `require` / `import` the real module
+- Asserting exact error-message text instead of type + key values
+- Shared mutable state between tests
+- Non-deterministic tests (random without seed, network in unit tests)
+- Testing framework behavior instead of your code
+- Ignoring or skipping failing tests without a tracking issue