| author | Craig Jennings <c@cjennings.net> | 2026-05-29 18:25:27 -0500 |
|---|---|---|
| committer | Craig Jennings <c@cjennings.net> | 2026-05-29 18:25:27 -0500 |
| commit | 7355d21d9a36716afbeeb7d63de58dff302f44c1 (patch) | |
| tree | 063dee01328b38827a0c926dd9b745c350155ce4 /scripts/tests/remove.bats | |
| parent | 1a795d6147ba15519e8cf934c1b764809f79567f (diff) | |
docs(voice): Phase 2 corpus findings: email + PR registers added to voice-profile.org
Phase 2 of the writing voice corpus adds four sub-corpora to the
profile: personal email (1139 messages, 283k words), work email (22
messages, small sample), PR descriptions (9 PRs), and PR review
comments (3 comments, very small sample). The data came from mu
queries on local maildirs and gh API calls on the cjennings GitHub
identity. Signatures, quoted replies, and forwarded blocks were
stripped before analysis. No corpus files were written to disk.
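
The stripping step is described above but not shown. A minimal Python sketch of how it might work, assuming plain-text message bodies: the marker strings, the reply-header pattern, and the name strip_email_body are illustrative assumptions, not the script actually used for the corpus.

```python
import re

# Hypothetical markers; the commit message does not say which ones were used.
FORWARD_MARKERS = (
    "---------- Forwarded message",
    "-----Original Message-----",
    "Begin forwarded message:",
)
# Attribution line that many mail clients put above a quoted reply.
REPLY_HEADER = re.compile(r"^On .+ wrote:$")


def strip_email_body(body: str) -> str:
    """Return only the text the sender wrote in this message."""
    kept = []
    for line in body.splitlines():
        # Conventional signature delimiter ("-- "); everything after it is dropped.
        if line.rstrip() == "--":
            break
        # A forwarded block runs to the end of the message, so stop here.
        if line.startswith(FORWARD_MARKERS):
            break
        # Skip quoted reply lines and the attribution line above them.
        if line.lstrip().startswith(">") or REPLY_HEADER.match(line.strip()):
            continue
        kept.append(line)
    return "\n".join(kept).strip()
```

Processing each message in memory this way is consistent with the note that no corpus files were written to disk.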
The headline finding is the register split. Phase 1's commit-prose
signal does not generalize cleanly. Em-dashes and semicolons are
concentrated in commit prose (3.49 and 3.16 per 1000). Conversational
and PR prose run an order of magnitude lower (em-dash 0.28 in
personal email, 0.00 in PR review comments). Semicolon shows the
same shape (0.64 in personal email, 0.00 in PR comments). The
personal-mode rules on those still earn their place, but the basis
shifts. The rules mostly enforce what is already true for non-commit
registers.
Contractions invert the pattern: commits 3.57 per 1000, personal
email 38.52, PR review comments 50.78. The Phase 1 curiosity (I'm
and I'll surprisingly rare relative to standalone I in commits) is
resolved as a register effect. Personal email shows I'm at 6.04 per
1000 vs standalone I at 36.91, near natural English. Craig's voice
is heavily contracted in conversational prose and uniquely suppresses
contractions in commit prose. The contraction rule is strongly
confirmed in the registers where contractions are most expected.
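
The per-1000 rates above could be computed along these lines. This is a sketch, not the measurement code behind the profile. It assumes "per 1000" means per 1000 words, and the tokenizer, the contraction pattern, and the names per_thousand and register_rates are all hypothetical. The contraction pattern is crude: it also counts possessives such as "Craig's".

```python
import re

# A word is a run of letters/digits, optionally with apostrophe-joined suffixes.
WORD = re.compile(r"[A-Za-z0-9]+(?:['’][A-Za-z]+)*")
# Crude contraction test: common suffixes after an apostrophe (also hits possessives).
CONTRACTION = re.compile(r"^[A-Za-z]+['’](?:m|re|ve|ll|d|s|t)$", re.IGNORECASE)


def per_thousand(count: int, total_words: int) -> float:
    """Scale a raw count to occurrences per 1000 words."""
    return 0.0 if total_words == 0 else 1000.0 * count / total_words


def register_rates(text: str) -> dict:
    """Compute the style-marker rates for one register's text."""
    words = WORD.findall(text)
    total = len(words)
    return {
        "words": total,
        "em_dash": per_thousand(text.count("\u2014"), total),
        "semicolon": per_thousand(text.count(";"), total),
        "contraction": per_thousand(
            sum(1 for w in words if CONTRACTION.match(w)), total
        ),
        "standalone_i": per_thousand(sum(1 for w in words if w == "I"), total),
    }
```

Running register_rates once per register (commits, personal email, work email, PR descriptions, PR review comments) would give the rows of a cross-register table like the one described below.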
Updates land in:
- Top-level Corpus section: a new Phase 2 subsection with the four
sub-corpora and a cross-register findings table.
- Curiosities (resolved by Phase 2) section: I'm/I'll rarity puzzle
answered.
- §7 (AI vocabulary) Basis: cross-register watch-word measurements.
Comprehensive is concentrated in commits. Leverage shows up
modestly in personal email.
- §13 (em-dash) Basis: register split documented.
- §32 (first-person) Basis: standalone I rates across all five
registers.
- §33 (semicolon) Basis: register split parallels em-dash.
- §34 (contractions) Basis: register inversion documented, Phase 1
curiosity resolved.
- §38 (terse cut) Basis: single-sentence-paragraph rate across
registers, highest in PR descriptions.
AI tells stay near zero across all five corpora. Leverage, at 18
occurrences in personal email, is the only non-zero hit on the
watch-list outside commits.
Diffstat (limited to 'scripts/tests/remove.bats')
0 files changed, 0 insertions, 0 deletions
