path: root/scripts/sync-check.sh
author Craig Jennings <c@cjennings.net> 2026-05-29 18:25:27 -0500
committer Craig Jennings <c@cjennings.net> 2026-05-29 18:25:27 -0500
commit 7355d21d9a36716afbeeb7d63de58dff302f44c1
tree 063dee01328b38827a0c926dd9b745c350155ce4
parent 1a795d6147ba15519e8cf934c1b764809f79567f
docs(voice): Phase 2 corpus findings: email + PR registers added to voice-profile.org

Phase 2 of the writing voice corpus adds four sub-corpora to the profile: personal email (1139 messages, 283k words), work email (22 messages, small sample), PR descriptions (9 PRs), and PR review comments (3 comments, very small sample). Sources are mu queries on local maildirs and gh API calls on the cjennings GitHub identity. Signatures, quoted replies, and forwarded blocks were stripped before analysis. No corpus files were written to disk.

The headline finding is the register split. Phase 1's commit-prose signal does not generalize cleanly. Em-dashes and semicolons are concentrated in commit prose (3.49 and 3.16 per 1000). Conversational and PR prose run an order of magnitude lower (em-dash 0.28 in personal email, 0.00 in PR review comments). Semicolon shows the same shape (0.64 in personal email, 0.00 in PR comments). The personal-mode rules on those still earn their place, but the basis shifts. The rules mostly enforce what is already true for non-commit registers.

Contractions invert the pattern: commits 3.57 per 1000, personal email 38.52, PR review comments 50.78. The Phase 1 curiosity (I'm and I'll surprisingly rare relative to standalone I in commits) is resolved as a register effect. Personal email shows I'm at 6.04 per 1000 vs standalone I at 36.91, near natural English. Craig's voice is heavily contracted in conversational prose and uniquely suppresses contractions in commit prose. The contraction rule is strongly confirmed in the registers where contractions are most expected.

Updates land in:

- Top-level Corpus section: a new Phase 2 subsection with the four sub-corpora and a cross-register findings table.
- Curiosities (resolved by Phase 2) section: I'm/I'll rarity puzzle answered.
- §7 (AI vocabulary) Basis: cross-register watch-word measurements. Comprehensive is concentrated in commits. Leverage shows up modestly in personal email.
- §13 (em-dash) Basis: register split documented.
- §32 (first-person) Basis: standalone I rates across all five registers.
- §33 (semicolon) Basis: register split parallels em-dash.
- §34 (contractions) Basis: register inversion documented, Phase 1 curiosity resolved.
- §38 (terse cut) Basis: single-sentence-paragraph rate across registers, highest in PR descriptions.

AI tells stay near zero across all five corpora. Leverage, at 18 occurrences in personal email, is the only non-zero hit on the watch-list outside commits.
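
The per-1000 figures above are normalized frequencies. The commit does not say how they were computed, so the following is only a minimal sketch of one plausible way, assuming the unit is words. The function names and the tokenizer regex are hypothetical and are not taken from the repository.

```python
import re


def rate_per_1000(count: int, total_words: int) -> float:
    """Return occurrences per 1000 words."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 1000 * count / total_words


def count_words(text: str) -> int:
    # Hypothetical tokenizer: runs of letters, digits, and apostrophes.
    # The real analysis may have tokenized differently.
    return len(re.findall(r"[A-Za-z0-9']+", text))


def em_dash_rate(text: str) -> float:
    # U+2014 is the em-dash character.
    return rate_per_1000(text.count("\u2014"), count_words(text))


# Figures quoted in the message: 18 occurrences of "leverage" in a
# personal-email corpus of roughly 283k words.
leverage_rate = rate_per_1000(18, 283_000)
```

Under these assumptions, 18 occurrences in about 283k words works out to well under 0.1 per 1000, which fits the description of the word showing up only modestly in personal email.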
Diffstat (limited to 'scripts/sync-check.sh')
0 files changed, 0 insertions, 0 deletions