| author | Craig Jennings <c@cjennings.net> | 2026-06-21 03:19:08 -0400 |
|---|---|---|
| committer | Craig Jennings <c@cjennings.net> | 2026-06-21 03:19:08 -0400 |
| commit | 6d7a73e616b3111ad5bd46eeb56fdb579e7799bd (patch) | |
| tree | d733568e51521efa916ab9682aafd09e29394364 /docs/native-comp-subr-mocking.org | |
| parent | 0aa85dd219f4be8dbf3383661fd2b42370945b87 (diff) | |
docs: explain native-comp vs primitive-mocking, refine the insight
A reference for the native-comp + subr-mocking trap: the mechanism, the three failure modes, the research with URLs, and the decision (variadic mocks + a meta-test now, migrate off primitive-mocking long-term). Refines the CLAUDE.md codified insight, whose old 'don't mock subrs' framing was too broad, and points it at the new doc.
Diffstat (limited to 'docs/native-comp-subr-mocking.org')
| -rw-r--r-- | docs/native-comp-subr-mocking.org | 159 |
1 files changed, 159 insertions, 0 deletions
diff --git a/docs/native-comp-subr-mocking.org b/docs/native-comp-subr-mocking.org
new file mode 100644
index 000000000..f66e5d102
--- /dev/null
+++ b/docs/native-comp-subr-mocking.org
@@ -0,0 +1,159 @@
+#+TITLE: Native Compilation vs. Mocking C Primitives in Tests
+#+AUTHOR: Craig Jennings
+#+DATE: 2026-06-21
+
+* What this is
+
+A reference for a real, recurring trap: tests that redefine an Emacs C
+primitive (a "subr") with =cl-letf=, =fset=, =setf=, or =advice-add= behave
+differently once native compilation is enabled, and the failures are
+intermittent. We hit it head-on after re-enabling native-comp config-wide
+(early-init.el, commit 3fd28987, 2026-06-20). This document records the
+mechanism, the research, and the decision so we don't re-derive it.
+
+* The symptom
+
+After native-comp was re-enabled, tests that had been green for months started
+failing, with no change to their source. The errors looked like:
+
+: wrong-number-of-arguments #[nil (nil) (t)] 1
+
+That is a zero-argument mock lambda being called with one argument. The 8 tests
+that first tripped were in =test-dirvish-config-wrappers.el= and
+=test-calibredb-epub-config.el=, all mocking window primitives
+(=current-window-configuration=, =window-body-width=, =window-margins=,
+=get-buffer-window=).
+
+The failures were intermittent across the session: the same test passed, then
+crashed, then passed again. That non-determinism is the tell.
+
+* The mechanism
+
+Native-comp emits *direct* calls to primitives for speed. So when Lisp code
+redefines or advises a primitive (which is exactly what a test mock does),
+natively-compiled callers would normally bypass the redefinition entirely. To
+prevent that, Emacs generates a small per-primitive *trampoline* (a =.eln=
+under =eln-cache/=) the first time a primitive is redefined. The trampoline
+reroutes calls to the primitive through its Lisp function cell, where the mock
+lives.
+
+The trampoline is generated lazily and cached on disk, and that is the source
+of the non-determinism: whether a given mock "works" depends on whether the
+trampoline for that primitive has been compiled into the eln-cache yet. As
+native-comp compiles more in the background, more mocks start routing through
+trampolines.
+
+** Three distinct failure modes
+
+Because behavior depends on trampoline state, the same mock can fail three
+different ways:
+
+1. *Generation failure.* The trampoline =.eln= can't be built or loaded
+   (notably under =emacs --batch=), giving
+   =native-lisp-load-failed "... subr--trampoline-*.eln"=. This is the mode our
+   older CLAUDE.md insight first documented.
+2. *Silent bypass.* When a trampoline isn't available and can't be generated,
+   the manual states natively-compiled callers *ignore* the redefinition and
+   call the real primitive. The mock does nothing, so the test passes for the
+   wrong reason or asserts against real behavior.
+3. *Arity mismatch.* The trampoline *is* built and routes to the mock, but
+   calls it with the primitive's *maximum* arity (filling optionals with nil),
+   not the arity the source used. A fixed-arity mock narrower than the
+   primitive then throws =wrong-number-of-arguments=. This is the mode that bit
+   us this session (every one of the 8 was this).
+
+* Important: this is a test-only artifact
+
+Production code never redefines a C primitive, so these trampolines are never
+generated for this reason in normal use. Nothing here is a defect in the
+config. It is an incompatibility between *mocking primitives in tests* and
+native-comp, confined to the test suite.
+
+* What the wider community has found
+
+This is well known and genuinely hard. It is not us doing something wrong.
+
+- [[https://lists.gnu.org/archive/html/bug-gnu-emacs/2021-10/msg00971.html][bug#51140 (emacs-devel)]] — "cl-letf appears not to work with native-comp."
+  Redefining a built-in like =process-exit-status= via =cl-letf= breaks under
+  native compilation. Confirms the core problem.
+- [[https://github.com/jorgenschaefer/emacs-buttercup/issues/230][buttercup issue #230]] — the buttercup test framework's =spy-on= on primitives
+  (=file-exists-p=, =buffer-file-name=) fails with the
+  =native-lisp-load-failed ... subr--trampoline-*.eln= error (failure mode 1).
+  Our scenario exactly, in a mainstream test framework.
+- [[https://groups.google.com/g/linux.debian.bugs.dist/c/n9P2xhpruDE][Debian bug#1021842]] — buttercup's *own self-tests* hit the trampoline
+  compilation error. Even the test framework's maintainers run into it.
+- [[https://lists.gnu.org/archive/html/bug-gnu-emacs/2023-03/msg00076.html][bug#61880 (emacs-devel)]] — native compilation fails to generate trampolines
+  in certain sequential cases (failure mode 1, deterministic variant).
+- [[https://lists.gnu.org/archive/html/emacs-diffs/2023-03/msg00145.html][emacs-29 commit (bug-fix)]] — Emacs added a warning when you redefine a
+  primitive that the trampoline machinery itself depends on
+  ("Redefining '%s' might break trampoline native compilation"). Shows the
+  maintainers' stance: redefining primitives is discouraged.
+- [[https://www.gnu.org/software/emacs/manual/html_node/elisp/Native_002dCompilation-Variables.html][ELisp Manual: Native-Compilation Variables]] — documents
+  =native-comp-enable-subr-trampolines=. Default on; generates trampolines on
+  the fly. When *off* and no cached trampoline exists, "calls to that primitive
+  from natively-compiled Lisp will ignore redefinitions and advices" (this is
+  failure mode 2, and the catch in the common workaround below).
+
+** The two commonly-cited workarounds, and their costs
+
+- *Disable subr trampolines for tests* (=native-comp-enable-subr-trampolines
+  nil=). The most-cited quick fix. One line. But per the manual it makes
+  natively-compiled callers *ignore* the mock (failure mode 2). It only works
+  reliably when the code under test runs interpreted, not natively compiled.
+  With native-comp aggressively compiling our modules, the code under test is
+  increasingly native, so this risks silent mock-bypass: tests that pass while
+  asserting against the real primitive. Worse than a loud failure.
+- *Don't mock primitives at all.* The maintainers' and our own
+  =elisp-testing.md='s position: inject dependencies or test pure helpers
+  instead. The only fix immune to all three failure modes. Also the most work.
+
+* Our decision (2026-06-21)
+
+We chose a pragmatic middle path with a clear long-term direction.
+
+1. *Make subr mocks variadic.* The arity mode (3) is the only one we have
+   actually suffered. A mock written =(lambda (&rest _) VALUE)= tolerates the
+   trampoline's full-arity call. We swept every arity-narrow subr mock in the
+   suite to append =&rest _= to its arglist (preserving any named args the
+   body uses). This is deterministic and keeps trampolines on, so mocks still
+   route correctly (no silent bypass).
+2. *Enforce it with a meta-test.* =tests/test-meta-subr-mock-arity.el= statically
+   scans every test file for =symbol-function= / =fset= redefinitions of a
+   subr and fails =make test= if any mock can't accept the primitive's maximum
+   arity (=func-arity=). It is deterministic (a pure source read; no dependence
+   on eln-cache state), so a new arity-narrow mock can't merge silently. The
+   rule it enforces is NOT "never mock a subr" (the suite mocks subrs like
+   =message= and =completing-read= hundreds of times, all fine) but "a subr
+   mock must accept the primitive's arity."
+3. *Treat "migrate off primitive-mocking" as a long-term test-quality project.*
+   The variadic sweep fixes the mode we hit but leaves modes 1 and 2 latent
+   (we haven't hit them, but they exist). The durable fix the ecosystem points
+   to is restructuring tests to not redefine primitives at all. Filed as a
+   standalone TODO rather than forced now.
+
+** Why not just disable trampolines for tests?
+
+Because of failure mode 2 (silent bypass) above. In our native-comp-heavy
+setup, disabling trampolines would let natively-compiled code under test ignore
+the mocks, producing tests that pass while testing nothing. A loud
+=wrong-number-of-arguments= that the meta-test prevents up front is strictly
+safer than a quiet false pass.
+
+* Practical rule for writing tests (today)
+
+When you mock a C primitive (subr) in a test, make the replacement variadic:
+
+: (cl-letf (((symbol-function 'window-body-width) (lambda (&rest _) 200)))
+:   ...)
+
+not
+
+: (cl-letf (((symbol-function 'window-body-width) (lambda (_) 200))) ; breaks under native-comp
+:   ...)
+
+If the body needs the argument, keep it and append =&rest _=:
+
+: (lambda (cmd &rest _) (member cmd allowed))
+
+The meta-test will catch you if you forget. Better still, when practical, don't
+mock the primitive: pass the value in as a parameter, or test a pure helper.
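The arity rule the document states ("a subr mock must accept the primitive's arity") can be illustrated with `func-arity`, which returns a `(MIN . MAX)` pair where `MAX` may be the symbol `many`. The sketch below is not the contents of `tests/test-meta-subr-mock-arity.el`, which this commit does not show. It assumes a simpler runtime check on function objects instead of the static source scan the document describes, and the helper name `my/mock-accepts-subr-arity-p` is hypothetical.

```elisp
;; Hypothetical helper: the name and the runtime approach are assumptions
;; for illustration, not taken from the repository's meta-test.

(defun my/mock-accepts-subr-arity-p (subr-symbol mock)
  "Return non-nil if MOCK accepts the maximum arity of SUBR-SYMBOL.
SUBR-SYMBOL must name a C primitive that is not currently redefined."
  (let ((original (symbol-function subr-symbol)))
    ;; Refuse to compare against something that is already a mock or advice.
    (unless (subrp original)
      (error "%s is not (or is no longer) a primitive" subr-symbol))
    (let ((subr-max (cdr (func-arity original)))
          (mock-max (cdr (func-arity mock))))
      ;; A variadic mock (&rest) always qualifies; otherwise the mock's
      ;; maximum must be at least the primitive's maximum.
      (or (eq mock-max 'many)
          (and (integerp subr-max)
               (integerp mock-max)
               (>= mock-max subr-max))))))

;; A one-argument mock is narrower than `window-body-width', which takes
;; two optional arguments, so the check rejects it.
(my/mock-accepts-subr-arity-p 'window-body-width (lambda (_) 200))       ; => nil

;; A variadic mock tolerates a full-arity call.
(my/mock-accepts-subr-arity-p 'window-body-width (lambda (&rest _) 200)) ; => t
```

A runtime check like this only works while the primitive is still in its function cell, which is why the helper signals an error otherwise. A static scan of the source, as the document describes, avoids that ordering problem.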

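The closing advice (pass the value in as a parameter, or test a pure helper) might look like the following. This is a minimal sketch under stated assumptions: both function names and the 40-column figure are hypothetical and do not come from the repository.

```elisp
;; All names and constants here are hypothetical; they only show the shape
;; of splitting a pure computation from the primitive call.

(defun my/preview-columns (body-width)
  "Return how many 40-column previews fit in BODY-WIDTH columns, at least 1."
  (max 1 (/ body-width 40)))

(defun my/preview-columns-for-window (&optional window)
  "Return the preview column count for WINDOW (default: selected window)."
  ;; The only place that touches the primitive; kept thin so tests can
  ;; skip it and exercise `my/preview-columns' directly.
  (my/preview-columns (window-body-width window)))

;; Tests call the pure helper with plain numbers; nothing is redefined,
;; so no trampoline is involved.
(my/preview-columns 200) ; => 5
(my/preview-columns 10)  ; => 1
```

Because the test never redefines `window-body-width`, none of the three failure modes can apply to it. The cost is that the thin wrapper itself goes unexercised by such a test.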