docs(voice): Phase 2 corpus findings: email + PR registers added to voice-profile.org

Phase 2 of the writing voice corpus adds four sub-corpora to the profile: personal email (1139 messages, 283k words), work email (22 messages, small sample), PR descriptions (9 PRs), and PR review comments (3 comments, very small sample). Mu queries on local maildirs and gh API calls on the cjennings github identity. Signatures, quoted replies, and forwarded blocks stripped before analysis. No corpus files written to disk.

The headline finding is the register split. Phase 1's commit-prose signal does not generalize cleanly. Em-dashes and semicolons are concentrated in commit prose (3.49 and 3.16 per 1000). Personal email and PR prose run far lower, close to an order of magnitude in most cases (em-dash 0.28 in personal email, 0.00 in PR review comments). Work email is the em-dash exception at 2.05, on a small sample. Semicolon shows the same shape (0.64 in personal email, 0.00 in PR comments). The personal-mode rules on those still earn their place, but the basis shifts. The rules mostly enforce what is already true for non-commit registers.

Contractions invert the pattern: commits 3.57 per 1000, personal email 38.52, PR review comments 50.78. The Phase 1 curiosity (I'm and I'll surprisingly rare relative to standalone I in commits) is resolved as a register effect. Personal email shows I'm at 6.04 per 1000 vs standalone I at 36.91, near natural English. Craig's voice is heavily contracted in conversational prose and uniquely suppresses contractions in commit prose. The contraction rule is strongly confirmed in the registers where contractions are most expected.

Updates land in:

- Top-level Corpus section: a new Phase 2 subsection with the four sub-corpora and a cross-register findings table.
- Curiosities (resolved by Phase 2) section: I'm/I'll rarity puzzle answered.
- §7 (AI vocabulary) Basis: cross-register watch-word measurements. Comprehensive is concentrated in commits. Leverage shows up modestly in personal email.
- §13 (em-dash) Basis: register split documented.
- §32 (first-person) Basis: standalone I rates across all five registers.
- §33 (semicolon) Basis: register split parallels em-dash.
- §34 (contractions) Basis: register inversion documented, Phase 1 curiosity resolved.
- §38 (terse cut) Basis: single-sentence-paragraph rate across registers, highest in PR descriptions.

AI tells stay near zero across all five corpora. Leverage, at 18 occurrences in personal email, is the only watch-list hit of any size outside commits.
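
The per-1000-word rates quoted in this message follow from raw counts and corpus word totals recorded in the profile. A minimal sketch of that normalization, assuming plain division and rounding. The helper name `per_thousand` is hypothetical and not part of the actual tooling; the counts and totals are the ones stated in voice-profile.org.

```python
def per_thousand(count: int, total_words: int, ndigits: int = 2) -> float:
    """Return occurrences per 1000 words, rounded to ndigits decimals.

    Hypothetical helper for illustration; the real analysis scripts
    are not shown in this commit.
    """
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return round(count / total_words * 1000, ndigits)


# Corpus sizes as recorded in the profile.
COMMIT_WORDS = 128_608   # Phase 1 git commit bodies
EMAIL_WORDS = 283_092    # Phase 2 personal email

rates = {
    "commits: contractions": per_thousand(459, COMMIT_WORDS),
    "commits: standalone I": per_thousand(495, COMMIT_WORDS),
    "commits: I'm": per_thousand(9, COMMIT_WORDS),
    "personal email: I'm": per_thousand(1710, EMAIL_WORDS),
    "personal email: I'll": per_thousand(865, EMAIL_WORDS),
    # Three decimals, because the rate is too small to read at two.
    "personal email: leverage": per_thousand(18, EMAIL_WORDS, ndigits=3),
}

for label, rate in rates.items():
    print(f"{label}: {rate} per 1000 words")
```

Normalizing by corpus size is what makes the registers comparable: 1710 occurrences of I'm in email against 9 in commits says little until each is divided by its own word total.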
author: Craig Jennings <c@cjennings.net> 2026-05-29 18:25:27 -0500
committer: Craig Jennings <c@cjennings.net> 2026-05-29 18:25:27 -0500
commit: 7355d21d9a36716afbeeb7d63de58dff302f44c1 (patch)
tree: 063dee01328b38827a0c926dd9b745c350155ce4 /voice
parent: 1a795d6147ba15519e8cf934c1b764809f79567f (diff)
1 files changed, 45 insertions, 9 deletions
diff --git a/voice/references/voice-profile.org b/voice/references/voice-profile.org
index d61abaa..51df8cb 100644
--- a/voice/references/voice-profile.org
+++ b/voice/references/voice-profile.org
@@ -14,13 +14,47 @@ When the agent runs =/voice=, it reads SKILL.md for the rules and consults this
 
 * Corpus
 
+** Phase 1 (2026-05-29): git commit bodies
+
 Git commit bodies authored by Craig Jennings across all repos under =~/code/= and =~/projects/=. After cleanup (subject lines, trailers, URL-only lines, AI-attribution lines, blank-run collapse):
 
 - 5355 raw commits, 1895 with non-trivial bodies
 - 128608 words, 912400 characters
 - 33 repos contributing; top sources: archsetup (703), rulesets (621), work (565), archangel (455), home (395)
 
-PRs deferred. Email + Slack deferred. This is one register (deliberate technical prose). The view is useful but narrow.
+One register (deliberate technical prose). The view is useful but narrow on its own.
+
+** Phase 2 (2026-05-29): email + GitHub PR bodies + PR review comments
+
+Four sub-corpora added so the rules can be tested across registers.
+
+- *Personal email* (gmail + cmail, sent-only, body ≥50 words after cleanup): 1139 messages, 283,092 words.
+- *Work email* (dmail, same filter): 22 messages, 3910 words. Small sample.
+- *PR descriptions* (github.com, author cjennings, body ≥100 chars after cleanup): 9 PRs, 1613 words. Small sample.
+- *PR review comments* (github.com, author cjennings, ≥20 words): 3 comments, 256 words. Tiny sample. Public GHE work isn't in this index.
+
+Signatures, quoted replies, and forwarded blocks stripped before analysis. Stats streamed; no corpus files written to disk.
+
+** Cross-register findings (the key result of Phase 2)
+
+The most important Phase 2 result is that *register splits matter*. Phase 1's signal from commit prose does not generalize cleanly to conversational prose.
+
+| Metric (per 1000 words)  | Commits | Personal email | Work email | PR bodies | PR comments |
+|--------------------------+---------+----------------+------------+-----------+-------------|
+| Em-dash                  |    3.49 |           0.28 |       2.05 |      0.62 |        0.00 |
+| Semicolon                |    3.16 |           0.64 |       0.26 |      0.62 |        0.00 |
+| Contractions             |    3.57 |          38.52 |      28.13 |     17.36 |       50.78 |
+| Standalone "I"           |    3.85 |          36.91 |      23.79 |      8.68 |       42.97 |
+| "we"                     |    0.22 |           8.18 |      14.83 |      1.24 |        0.00 |
+| "I'm"                    |    0.07 |           6.04 |       3.58 |      1.24 |        7.81 |
+
+Three observations:
+
+1. Em-dashes and semicolons are concentrated in commit prose, not conversational prose. The personal-mode rules on those (§13 and §33) hold up under Phase 2, but the basis shifts: the rules mostly enforce what is already true for email and PR comments. Commit prose is the outlier register that needs the rule, not the universal pattern.
+2. Contractions invert. Commits suppress contractions; email and PR-review prose use them heavily (38 to 50 per 1000). The Phase 1 contraction rule (§34) is strongly confirmed in the registers where contractions are most expected.
+3. The Phase 1 curiosity (I'm/I'll surprisingly rare relative to standalone "I") was a register effect, not a personal preference. In personal email, "I'm" runs 6.04 per 1000 vs standalone I at 36.91 — ratio close to natural English. Commit prose is the outlier where "I am" beats "I'm".
+
+AI-writing tells stay near zero across all five corpora. "leverage" surfaces 18 times in personal email (0.064 per 1000) — small but the only non-zero hit on the watch-list outside commits. All other watch-words clock 0 to 4 per corpus.
 
 * Findings against the 41 SKILL.md patterns
 
@@ -70,9 +104,9 @@ These two rules are still valuable. Em-dashes and semicolons both read cleaner w
 - *Pattern 39 (public-artifact scope).* The corpus IS the public artifacts. The check is circular. Defer.
 - *Pattern 40 (praise vs correction asymmetry).* Not detectable in commit bodies. Email or PR-review corpus needed.
 
-** Curiosities
+** Curiosities (resolved by Phase 2)
 
-- *=I'm=* (9 occurrences) and *=I'll=* (2 occurrences) are surprisingly rare relative to standalone =I= (495 occurrences). Either Craig prefers "I am" / "I will" in commit prose, or the regex missed contexts. Worth a quick sample check in Phase 2.
+- *=I'm=* (9 occurrences) and *=I'll=* (2 occurrences) were surprisingly rare in Phase 1 relative to standalone =I= (495 occurrences). Phase 2 resolved this. Personal email shows I'm at 1710 occurrences (6.04 per 1000), I'll at 865 (3.06), I've at 458 (1.62), I'd at 384 (1.36). The Phase 1 rarity was a register effect, not a personal preference. Commit prose uniquely suppresses contractions; conversational prose runs them at near-natural English rate.
 
 * Suggested deltas
 
@@ -282,7 +316,9 @@ Flag and rewrite around the high-frequency AI vocabulary list. Watch-list words:
 These words appear far more frequently in post-2023 text. They often co-occur.
 
 *** Basis
-Corpus-measured: 2026-05-29 commit corpus. "comprehensive" 42 occurrences; every other watch-word 0 or 1 (delve 0, embark 0, navigate the 0, in the realm of 0, seamless 0, moreover 0, furthermore 0, in conclusion 0, additionally 1, robust 1, leverage 1). "comprehensive" is genuine Craig vocabulary in technical contexts ("comprehensive test coverage", "comprehensive audit"). Craig has chosen to keep it on the watch-list because he is consciously trying to use it sparingly.
+Corpus-measured across registers (2026-05-29). Phase 1 git commits: "comprehensive" 42 occurrences, every other watch-word 0 or 1. Phase 2 conversational and PR corpora: "comprehensive" 1 in personal email, 0 in work email, PR descriptions, and PR review comments. "leverage" 18 in personal email, 0 to 1 elsewhere. Every other watch-word stays at 0 to 4 across all five corpora.
+
+Two takeaways. First, "comprehensive" is concentrated in commit prose (technical-doc register: "comprehensive test coverage", "comprehensive audit") and almost absent from conversational prose. Craig has chosen to keep it on the watch-list because he is consciously trying to use it sparingly. Second, "leverage" earns a soft watch in personal email even though the rest of the list stays clean. The two together suggest the rule should flag-and-suggest individual hits in technical prose without treating any single watch-word as automatic disqualification.
 
 *** Before
 #+begin_example
@@ -452,7 +488,7 @@ Replace em-dashes (=—=) with a comma, period, colon, or parentheses, whichever
 LLMs use em dashes more than the median human writer, mimicking "punchy" sales writing.
 
 *** Basis
-Corpus measurement (2026-05-29 git commits, 128k words): 3.49 em-dashes per 1000 words. Comparable to AI-generated prose. The zero-tolerance rule in prose and personal modes is self-discipline, not habit-reflection. Craig has decided his published voice drops em-dashes by choice because the result reads cleaner and avoids the most common AI tell, regardless of his pre-rule frequency.
+Phase 1 corpus (git commits, 128k words): 3.49 em-dashes per 1000 words. Comparable to AI-generated prose. Phase 2 corpus reveals a sharp register split: personal email 0.28 per 1000, work email 2.05, PR descriptions 0.62, PR review comments 0.00. Em-dashes are concentrated in commit prose, almost absent from email and PR review prose. The zero-tolerance rule in prose and personal modes mostly enforces what is already true for non-commit registers. The rule still earns its place because commit prose is the high-volume register where the AI-tell em-dash habit shows up. Self-discipline, not habit-reflection, for the commit register specifically.
 
 *** Before
 #+begin_example
@@ -1013,7 +1049,7 @@ Rewrite impersonal third-person publish-artifact bodies into first person ("I ad
 Impersonal third-person ("Add support for X", "The change adds Y") reads as press-release voice in a commit body or PR description. First-person ("I added X", "I kept Y because...") sounds like one engineer talking to another.
 
 *** Basis
-Corpus-measured: 2026-05-29 commit corpus shows standalone "I" at 3.85 per 1000 words. Craig writes first-person heavily in his commit prose. This is real, not aspirational.
+Corpus-measured across registers (2026-05-29): standalone "I" runs 3.85 per 1000 words in git commits, 36.91 in personal email, 23.79 in work email, 8.68 in PR descriptions, 42.97 in PR review comments. First-person density is roughly 10x higher in conversational registers than in commits. Craig writes first-person heavily across the board, but commit prose under-uses "I" relative to natural English. The rule strengthens the under-using register without overreaching: it asks the publish-artifact body to write the way the email body already does.
 
 *** Before
 #+begin_example
@@ -1044,7 +1080,7 @@ Replace semicolons with a period (split into two sentences) or a comma (when the
 Craig's voice avoids semicolons. They make the writing feel unnecessarily literary. The period-split usually reads better, and dropping semicolons removes one common AI tell.
 
 *** Basis
-Corpus-measured: 2026-05-29 commit corpus shows 3.16 semicolons per 1000 words, comparable to AI-generated prose. The rule is self-discipline, not habit-reflection. Craig has decided his published voice drops semicolons because the period-split usually reads better, not because semicolons are inherently an AI tell.
+Corpus-measured across registers (2026-05-29): semicolons run 3.16 per 1000 words in git commits, 0.64 in personal email, 0.26 in work email, 0.62 in PR descriptions, 0.00 in PR review comments. Same register split as em-dashes (§13). Semicolons are concentrated in commit prose; conversational prose almost never uses them. The rule mostly enforces what is already true for non-commit registers. It earns its place because commit prose is the register where Craig's habit and the AI-tell pattern overlap.
 
 *** Before
 #+begin_example
@@ -1076,7 +1112,7 @@ Prefer contractions in Craig's prose (it's, that's, don't, we're, I'd, won't) un
 Uncontracted English reads stiff in a short prose body unless a negation or emphasis needs the weight. Prefer contractions in his prose: emails, documents, commit and PR bodies.
 
 *** Basis
-Corpus-measured: 2026-05-29 commit corpus shows 459 contractions total (3.57 per 1000 words). Top hits: doesn't (92), don't (59), isn't (46), it's (43), can't (40), that's (34). Rule confirmed as established Craig habit.
+Corpus-measured across registers (2026-05-29). Contraction rate per 1000 words: git commits 3.57, personal email 38.52, work email 28.13, PR descriptions 17.36, PR review comments 50.78. Commit prose is the outlier register that suppresses contractions; conversational and PR-review prose use them heavily, near the natural-English rate. The Phase 1 curiosity (I'm 9 occurrences vs standalone I at 495 in commits) was a register effect, not a personal preference. Personal email runs I'm at 6.04 per 1000 vs standalone I at 36.91, ratio close to natural English. Top contractions in personal email: i'm 1710, it's 928, i'll 865, don't 632, you're 567, i've 458, that's 433, i'd 384, we're 307, didn't 299. The rule confirms across the board, with the strongest evidence from the conversational registers where contractions are most expected.
 
 *** Before
 #+begin_example
@@ -1200,7 +1236,7 @@ Strip soft rhetorical padding ("worth noting", "it's important to understand", "
 Tier 1 omit-needless-words (§26) catches the most rigid offenders ("the fact that", "in order to"). Prose and personal modes are more aggressive. They also strip soft padding like "worth noting" and "it's important to understand". Academic writing often retains these as transition markers, so the aggressive cut is prose and personal only because it conflicts with that register.
 
 *** Basis
-Corpus-measured: 2026-05-29 commit corpus shows 41.1% of paragraphs are single-sentence. Craig writes terse. Paragraph breaks land after one complete thought even when short. Confirmed indirectly via paragraph structure.
+Corpus-measured across registers (2026-05-29). Single-sentence-paragraph rate: git commits 41.1%, personal email 57.4%, work email 44.5%, PR descriptions 74.4%, PR review comments 50.0%. The terse-paragraph cadence is even more pronounced in conversational and PR-description prose than in commits. Craig writes terse across registers, with the highest density in deliberate PR descriptions where each paragraph carries one focused thought. Confirmed indirectly via paragraph structure across all five corpora.
 
 *** Before
 #+begin_example