| author | Craig Jennings <c@cjennings.net> | 2026-04-19 13:16:20 -0500 |
|---|---|---|
| committer | Craig Jennings <c@cjennings.net> | 2026-04-19 13:16:20 -0500 |
| commit | 7eb56084cc543d3455d277ef766302b1ad922b74 (patch) | |
| tree | 488d19aea1c05146fa1735bb897a6490f08e83e6 /.claude | |
| parent | 9c61a64a539bdff55b149dc0eee86366a0f694c7 (diff) | |
| download | chime-7eb56084cc543d3455d277ef766302b1ad922b74.tar.gz chime-7eb56084cc543d3455d277ef766302b1ad922b74.zip | |
chore: sync testing rules — pyramid, overmocking, refactor-for-testability, interactive/internal split
Diffstat (limited to '.claude')
| -rw-r--r-- | .claude/rules/elisp-testing.md | 26 |
| -rw-r--r-- | .claude/rules/testing.md | 104 |
2 files changed, 130 insertions, 0 deletions
diff --git a/.claude/rules/elisp-testing.md b/.claude/rules/elisp-testing.md
index 3883902..b5def78 100644
--- a/.claude/rules/elisp-testing.md
+++ b/.claude/rules/elisp-testing.md
@@ -43,6 +43,32 @@ Write the failing test first. A failing test proves you understand the change. A
 For untested code, write a **characterization test** that captures current
 behavior before you change anything. It becomes the safety net for the refactor.
 
+## Interactive vs Internal — Split for Testability
+
+When a function mixes business logic with user interaction, split it:
+
+- **Internal** (`cj/--foo`) — pure logic. All parameters explicit. No prompts,
+  no UI. Deterministic and trivially testable.
+- **Interactive wrapper** (`cj/foo`) — thin layer that reads user input and
+  delegates to the internal.
+
+```elisp
+(defun cj/--move-buffer-and-file (dir &optional ok-if-exists)
+  "Move the current buffer's file into DIR. Overwrite if OK-IF-EXISTS."
+  ...)
+
+(defun cj/move-buffer-and-file ()
+  "Interactive wrapper: prompt for DIR, delegate."
+  (interactive)
+  (let ((dir (read-directory-name "Move to: ")))
+    (cj/--move-buffer-and-file dir)))
+```
+
+Test the internal directly with parameter values — no `cl-letf` on
+`read-directory-name`, `yes-or-no-p`, etc. The wrapper gets a smoke test or
+nothing — Emacs already tests its own prompts. The internal also becomes
+reusable by other Elisp code without triggering UI.
+
 ## Mocking
 
 Mock at boundaries:
diff --git a/.claude/rules/testing.md b/.claude/rules/testing.md
index 42cc528..f67ace2 100644
--- a/.claude/rules/testing.md
+++ b/.claude/rules/testing.md
@@ -72,6 +72,48 @@ tests/
 Per-language files may adjust this (e.g. Elisp collates ERT tests into
 `tests/test-<module>*.el` without subdirectories).
 
+### Testing Pyramid
+
+Rough proportions for most projects:
+- Unit tests: 70-80% (fast, isolated, granular)
+- Integration tests: 15-25% (component interactions, real dependencies)
+- E2E tests: 5-10% (full system, slowest)
+
+Don't duplicate coverage: if unit tests fully exercise a function's logic,
+integration tests should focus on *how* components interact — not repeat the
+function's case coverage.
+
+## Integration Tests
+
+Integration tests exercise multiple components together. Two rules:
+
+**The docstring names every component integrated** and marks which are real vs
+mocked. Integration failures are harder to pinpoint than unit failures;
+enumerating the participants up front tells you where to start looking.
+
+Example:
+
+```
+def test_integration_refund_during_sync_updates_ledger_atomically():
+    """Refund processed mid-sync updates order and ledger in one transaction.
+
+    Components integrated:
+    - OrderService.refund (entry point)
+    - PaymentGateway.reverse (MOCKED — returns success)
+    - Ledger.credit (real)
+    - db.transaction (real)
+
+    Validates:
+    - Refund rolls back if ledger write fails
+    - Both tables updated or neither
+    """
+```
+
+**Write an integration test when** multiple components must work together,
+state crosses function boundaries, or edge cases combine. **Don't** when
+single-function behavior suffices, or when mocking would erase the interaction
+you meant to test.
+
 ## Naming Convention
 
 - Unit: `test_<module>_<function>_<scenario>_<expected>`
@@ -115,6 +157,65 @@ Never mock:
 - Internal domain logic
 - Framework behavior (ORM queries, middleware, hooks, buffer primitives)
 
+### Signs of Overmocking
+
+Ask yourself:
+
+- Would this test still pass if I replaced the function body with `raise NotImplementedError` (or equivalent)? If yes, the mocks are doing the work — you're testing mocks, not code.
+- Is the mock more complex than the function being tested? Smell.
+- Am I mocking internal string / parsing / decoding helpers? Those aren't boundaries — they're the work.
+- Does the test break when I refactor without changing behavior? Good tests survive refactors; overmocked ones couple to implementation.
+
+When tests demand heavy internal mocking, the fix isn't better mocks — it's
+restructuring the code (see *If Tests Are Hard to Write* below).
+
+### Testing Code That Uses Frameworks
+
+When a function mostly delegates to framework or library code, test *your*
+integration logic:
+- ✓ "I call the library with the right arguments in the right context"
+- ✓ "I handle its return value correctly"
+- ✗ "The library works in 50 scenarios" — trust it; it has its own tests
+
+For polyglot behavior (e.g., comment handling across C/Java/Go/JS), test 2-3
+representative modes thoroughly plus a minimal smoke test in the others.
+Exhaustive permutations are diminishing returns.
+
+### Test Real Code, Not Copies
+
+Never inline or copy production code into test files. Always `require`/`import`
+the module under test. Copied code passes even when production breaks — the
+bug hides behind the duplicate.
+
+Mock dependencies at their boundary; exercise the real function body.
+
+### Error Behavior, Not Error Text
+
+Test that errors occur with the right type; don't assert exact wording:
+- ✓ Right exception type (`pytest.raises(ValueError)`, `(should-error ... :type 'user-error)`)
+- ✓ Regex on values the message *must* contain (e.g., the offending filename)
+- ✗ `assert str(e) == "File 'foo' not found"` — breaks when prose changes even though behavior is unchanged
+
+Production code should emit clear, contextual errors. Tests verify the
+behavior (raised, caught, returned nil) and values that must appear — not the
+prose.
+
+## If Tests Are Hard to Write, Refactor the Code
+
+If a test needs extensive mocking of internal helpers, elaborate fixture
+scaffolding, or mocks that recreate the function's own logic, the production
+code needs restructuring — not the test.
+
+Signals:
+- Deep nesting (callbacks inside callbacks)
+- Long functions doing multiple things ("fetch AND parse AND decode AND save")
+- Tests that mock internal string / parsing / I/O helpers
+- Tests that break on refactors with no behavior change
+
+Fix: extract focused helpers (one responsibility each), test each in isolation
+with real inputs, compose them in a thin outer function. Several small unit
+tests plus one composition test beats one monster test behind a wall of mocks.
+
 ## Coverage Targets
 
 - Business logic and domain services: **90%+**
@@ -147,6 +248,9 @@ If you catch yourself thinking any of these, stop and write the test.
 - Hardcoded dates or timestamps (they rot)
 - Testing implementation details instead of behavior
 - Mocking the thing you're testing
+- Mocking internal helpers (string ops, parsing, decoding) — those are the work
+- Inlining production code into test files — always `require` / `import` the real module
+- Asserting exact error-message text instead of type + key values
 - Shared mutable state between tests
 - Non-deterministic tests (random without seed, network in unit tests)
 - Testing framework behavior instead of your code
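The interactive/internal split in the first hunk carries over to other languages. Below is a minimal Python sketch of the same rule, with hypothetical names that are not part of this commit: the internal function takes every input as an explicit parameter, and the test calls it directly rather than patching the prompt.

```python
from pathlib import Path

# Hypothetical Python analog of the Elisp pair above; names are illustrative,
# not from the chime repo.

def _move_file(src: Path, dest_dir: Path, ok_if_exists: bool = False) -> Path:
    """Internal: pure logic, every parameter explicit, no prompts or UI."""
    target = dest_dir / src.name
    if target.exists() and not ok_if_exists:
        raise FileExistsError(str(target))
    dest_dir.mkdir(parents=True, exist_ok=True)
    return src.rename(target)


def move_file() -> Path:
    """Interactive wrapper: read user input, then delegate to the internal."""
    src = Path(input("File to move: "))
    dest_dir = Path(input("Move to directory: "))
    return _move_file(src, dest_dir)


def test_move_file_internal_moves_into_directory(tmp_path):
    # Exercise the internal directly with parameter values; input() is never patched.
    src = tmp_path / "notes.txt"
    src.write_text("hello")
    moved = _move_file(src, tmp_path / "archive")
    assert moved == tmp_path / "archive" / "notes.txt"
    assert moved.read_text() == "hello"
    assert not src.exists()
```

As in the Elisp guidance above, the wrapper gets at most a smoke test; the internal carries the full case coverage.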
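The "Error Behavior, Not Error Text" rule from the second hunk, again as a small illustrative sketch with an invented function: assert the exception type and a value the message must contain, never the exact prose.

```python
import pytest


# Invented function, for illustration only.
def load_config(path: str) -> dict:
    if not path.endswith(".toml"):
        raise ValueError(f"unsupported config file: {path!r}")
    return {}


def test_load_config_rejects_non_toml_path():
    # Assert the type plus the offending filename; the wording may change freely.
    with pytest.raises(ValueError, match=r"settings\.yaml"):
        load_config("settings.yaml")
```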
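And the fix that "If Tests Are Hard to Write, Refactor the Code" prescribes, sketched with hypothetical names: extract focused helpers, test them with real inputs, and keep one thin composition where only the boundary is replaced.

```python
import json
from typing import Callable

# Hypothetical names, for illustration only.

def parse_payload(raw: str) -> dict:
    """Focused helper: pure parsing, testable with real strings."""
    data = json.loads(raw)
    return {"id": data["id"], "amount": float(data["amount"])}


def summarize(record: dict) -> str:
    """Focused helper: pure formatting."""
    return f"order {record['id']}: {record['amount']:.2f}"


def report(fetch: Callable[[], str]) -> str:
    """Thin composition: the fetch boundary is the only thing worth mocking."""
    return summarize(parse_payload(fetch()))


def test_parse_payload_reads_id_and_amount():
    # Real input, no mocks: the helper itself does the work.
    assert parse_payload('{"id": 7, "amount": "12.5"}') == {"id": 7, "amount": 12.5}


def test_report_composes_helpers():
    # One composition test; only the boundary (fetch) is replaced.
    assert report(lambda: '{"id": 7, "amount": "12.5"}') == "order 7: 12.50"
```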
