docs(voice): land Phase 1 voice profile derived from git-commit corpus

Phase 1 of the writing voice profile TODO (filed 7a861ed). The work covers corpus assembly, statistics, and a cross-check against the 41 SKILL.md patterns. Email, PR, Slack, and long-form sources deferred to Phase 2.

Corpus: 5355 commits, 1895 with non-trivial bodies, 128608 words across 33 repos.

Strong findings:

- Pattern 17 (no emojis), Pattern 7 (AI vocabulary), Pattern 22 (filler), Pattern 32 (first-person), Pattern 34 (contractions), and Pattern 38 (terse cut) are all confirmed by direct corpus measurement.
- Pattern 13 (em-dash zero-tolerance) and Pattern 33 (semicolons to period) contradict the corpus. Craig USES em-dashes at 3.49 per 1000 words and semicolons at 3.16 per 1000 words, rates comparable to AI-generated prose. The rules are self-discipline, not habit-reflection. SKILL.md should say so honestly.
- Pattern 7 watch-word "comprehensive" appears 42 times in the corpus while every other watch-word clocks zero or one. "comprehensive" is genuine Craig vocabulary. The rule should pull it from the watch-list or flag only when it co-occurs with other AI tells.

New patterns the corpus suggests adding: single-sentence-paragraph cadence (41.1% of paragraphs are exactly one sentence), parenthetical density (23 opening parens per 1000), declarative-default register (0.33 question marks per 1000).

Six concrete SKILL.md edits proposed in the doc, none applied. The deltas await Craig's call. Phase 2 sources are documented in the doc body.

author: Craig Jennings <c@cjennings.net> 2026-05-29 14:25:21 -0500
committer: Craig Jennings <c@cjennings.net> 2026-05-29 14:25:21 -0500
commit: 0870a61a28b89305ba0f0be887eb6c563c9ba3e6 (patch)
tree: 9d68dad1056974a13ea0aa00bc9c445fa1165a2b
parent: 7a861eda6dc785ea9767886e13cab1166b3f5d22 (diff)
download: rulesets-0870a61a28b89305ba0f0be887eb6c563c9ba3e6.tar.gz rulesets-0870a61a28b89305ba0f0be887eb6c563c9ba3e6.zip
1 files changed, 89 insertions, 0 deletions
diff --git a/voice/references/voice-profile.org b/voice/references/voice-profile.org
new file mode 100644
index 0000000..ec074b9
--- /dev/null
+++ b/voice/references/voice-profile.org
@@ -0,0 +1,89 @@
+#+TITLE: Voice profile Phase 1 — corpus-grounded delta proposal
+#+DATE: 2026-05-29
+#+SOURCE: rulesets session 2026-05-29
+
+* Corpus
+
+Git commit bodies authored by Craig Jennings across all repos under =~/code/= and =~/projects/=. After cleanup (subject lines, trailers, URL-only lines, AI-attribution lines, blank-run collapse):
+
+- 5355 raw commits, 1895 with non-trivial bodies
+- 128608 words, 912400 characters
+- 33 repos contributing; top sources: archsetup (703), rulesets (621), work (565), archangel (455), home (395)
+
+PRs deferred. Email + Slack deferred. This is one register (deliberate technical prose) — useful but narrow.
+
+* Findings against the 41 SKILL.md patterns
+
+** Strongly confirmed by the corpus
+
+*Pattern 17 (no emojis).* Zero emojis in corpus. Confirmed.
+
+*Pattern 7 (AI vocabulary).* "delve" 0. "embark" 0. "navigate the" 0. "in the realm of" 0. "seamless" 0. "moreover" 0. "furthermore" 0. "in conclusion" 0. "additionally" 1. "robust" 1. "leverage" 1. Rule confirmed for 11 of 12 watch-words. (One exception below.)
+
+*Pattern 22 (filler).* "moreover" / "furthermore" / "additionally" / "in conclusion": all zero or one occurrence. Filler-phrase avoidance confirmed.
+
+*Pattern 32 (first-person rewrite).* Standalone "I" at 3.85 per 1000 words. Craig writes first-person heavily — this is real, not aspirational.
+
+*Pattern 34 (contractions).* 459 contractions total (3.57 per 1000). Top hits: =doesn't= (92), =don't= (59), =isn't= (46), =it's= (43), =can't= (40), =that's= (34). Rule confirmed.
+
+*Pattern 38 (terse cut).* 41.1% of paragraphs are single-sentence. Craig writes terse — paragraph breaks land after one complete thought even when short. Confirmed indirectly via paragraph structure.
+
+** Aspirational (corpus contradicts, but the rule is intentional self-discipline)
+
+*Pattern 13 (em-dash zero-tolerance, personal mode).* Corpus rate: 3.49 em-dashes per 1000 words. Comparable to AI-generated prose. Craig USES em-dashes regularly in commit bodies — the rule overrides his habit, it doesn't reflect it. Suggested rewording: drop the "LLMs use em dashes more than humans" framing; keep the zero-tolerance directive but rationale becomes "Craig's published voice — commit messages going forward, PR bodies, emails — drops em-dashes by choice because it reads cleaner and avoids a common AI tell, regardless of his pre-rule habit." Honest about the source.
+
+*Pattern 33 (semicolons → period/comma).* Corpus rate: 3.16 semicolons per 1000. Craig uses semicolons regularly. Same shape as #13: rule is self-discipline, not habit-reflection. Suggested rewording: acknowledge the rule overrides habit rather than implying it codifies one.
+
+These two rules are still valuable — em-dashes and semicolons both read cleaner when absent from short imperative-leaning prose. But the SKILL.md should say "this is a rule I've decided to follow," not "this is how I already write."
+
+** Worth challenging
+
+*Pattern 7 watch-word "comprehensive".* 42 occurrences in corpus (~0.33 per 1000). All other AI-tell watch-words clock near zero. "comprehensive" appears to be genuine vocabulary for Craig in technical contexts ("comprehensive test coverage", "comprehensive audit"). Suggested change: pull "comprehensive" out of the watch-list, or carve out a "watch in clusters, not solo" note — flag only when "comprehensive" co-occurs with other AI-tell words.
+
+** Worth adding (corpus surfaces traits the rules don't capture)
+
+*Single-sentence paragraph cadence.* 41.1% of paragraphs are exactly one sentence. This is distinctive — most prose-style guides advise multi-sentence paragraphs. Suggested addition (prose + personal): a positive pattern noting "a one-sentence paragraph is a finished thought, not a fragment. Break paragraphs after one complete thought when the next thought shifts angle, even if both are short." Anti-rule against "merge short paragraphs into multi-sentence ones."
+
+*Parenthetical density.* 23.07 opening parens per 1000 words. Heavy parenthetical use — asides, clarifications, scope-narrowing in parens. Currently no rule addresses this either way. Could add a positive pattern: "parentheses for asides are part of the voice. Don't strip them in a 'clean prose' pass."
+
+*Question-mark rarity.* 0.33 per 1000. Craig's prose is declarative — he states things, rarely asks them. Worth noting as a register marker (when /voice personal output has questions, double-check whether they're contextual or AI rhetoric).
+
+** Out of corpus (commits don't test these — Phase 2 needed)
+
+- *Pattern 13 in long-form prose.* Commit bodies are short. Email and PR bodies may show different em-dash rates.
+- *Pattern 14 (boldface).* Org-mode bold uses =*word*=, not detectable by simple grep. Markdown bold rare in commits.
+- *Pattern 16 (title case in headings).* Commits don't carry headings.
+- *Pattern 19 (collaborative artifacts).* Not present in commit bodies.
+- *Pattern 35 (sentence split on conjunctions).* Average sentence is 18.81 words, median 14, with 28% of sentences 21+ words — long-sentence rate is moderate. Need to inspect actual sentences to know if they're conjunction-stitched. Defer.
+- *Pattern 36 (felt-experience cut).* Commit bodies wouldn't carry felt-experience prose. Email + journal corpus needed.
+- *Pattern 37 (sentence fragments).* 9.7% of sentences are 1-5 words. Some are legitimate ("All eight pass."), some may be fragments. Can't tell from word-count alone. Defer to a pass that does syntactic detection.
+- *Pattern 39 (public-artifact scope).* The corpus IS the public artifacts — circular. Defer.
+- *Pattern 40 (praise vs correction asymmetry).* Not detectable in commit bodies. Email or PR-review corpus needed.
+
+** Curiosities
+
+- *=I'm=* (9 occurrences) and *=I'll=* (2 occurrences) are surprisingly rare relative to standalone =I= (495 occurrences). Either Craig prefers "I am" / "I will" in commit prose, or the regex missed contexts. Worth a quick sample check in Phase 2.
+
+* Suggested deltas
+
+Six concrete edits to =voice/SKILL.md=, all of which can land independently:
+
+1. *#13 (Em-Dash).* Drop the "LLMs use em dashes more than humans" framing in the personal-mode section. Restate the zero-tolerance rule as self-discipline ("Craig's published voice drops em-dashes by choice"), not habit-reflection. Cite: corpus rate 3.49/1000, AI-comparable.
+
+2. *#33 (Semicolons).* Same shape. Restate as self-discipline. Cite: corpus rate 3.16/1000.
+
+3. *#7 (AI Vocabulary).* Remove "comprehensive" from the watch-list, OR add a note that "comprehensive" alone is acceptable; flag only when it co-occurs with =delve= / =leverage= / =robust= / =seamless= / =moreover= etc. Cite: 42 occurrences, all other watch-words at 0 or 1.
+
+4. *NEW pattern (prose + personal): "Single-sentence paragraph cadence is a feature."* 41.1% of corpus paragraphs are exactly one sentence. A one-sentence paragraph is a finished thought, not a fragment. The voice pass should not merge short paragraphs into multi-sentence ones.
+
+5. *NEW pattern (prose + personal): "Parentheses for asides are part of the voice."* 23 opening parens per 1000 words. Heavy parenthetical use is distinctive. Don't strip parenthetical asides in a "clean prose" pass.
+
+6. *Register marker (advisory, not a rewrite rule): "Declarative is the default."* 0.33 question marks per 1000. Voice personal output that contains rhetorical questions should be checked — they're often AI rhetoric, not Craig's register.
+
+* What Phase 2 would add
+
+- Email corpus (gmail + cmail, sent-only, long-form) — different register, especially long-form prose flow.
+- PR bodies and review comments — longer prose, deliberate register, includes the praise/correction asymmetry test ground.
+- Slack messages — casual register, contraction rate, sentence-fragment rate.
+- Syntactic detection — distinguish fragments from terse complete sentences for pattern #37.
+- Long-form documents (résumé, proposals if any) — single register but high prose density.
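
The profile reports per-1000-word rates, a single-sentence paragraph share, and a proposed "watch in clusters, not solo" rule for "comprehensive", but the commit does not include the scripts that produced them. The following is a minimal Python sketch of how such measurements could be computed. Every name in it (`rate_per_1000`, `single_sentence_share`, `flag_comprehensive`, `AI_TELLS`) is hypothetical, the sentence splitter is a naive punctuation-based assumption rather than the syntactic detection Phase 2 calls for, and treating a paragraph as the co-occurrence window is also an assumption.

```python
import re

# Hypothetical watch-list: the Pattern 7 words named in suggested delta 3.
AI_TELLS = ("delve", "leverage", "robust", "seamless", "moreover")


def rate_per_1000(count: int, total_words: int) -> float:
    """Occurrences per 1000 words, rounded to two decimals."""
    return round(count / total_words * 1000, 2)


def single_sentence_share(text: str) -> float:
    """Fraction of blank-line-separated paragraphs holding exactly one sentence.

    Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    This is a naive splitter and will miscount abbreviations.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    if not paragraphs:
        return 0.0
    single = 0
    for paragraph in paragraphs:
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]
        if len(sentences) == 1:
            single += 1
    return single / len(paragraphs)


def flag_comprehensive(paragraph: str) -> bool:
    """Flag 'comprehensive' only when another watch-word shares the paragraph."""
    words = set(re.findall(r"[a-z']+", paragraph.lower()))
    return "comprehensive" in words and any(w in words for w in AI_TELLS)
```

With the corpus totals stated above, `rate_per_1000(42, 128608)` gives 0.33 and `rate_per_1000(459, 128608)` gives 3.57, matching the reported rates for "comprehensive" and for contractions. `flag_comprehensive("comprehensive test coverage")` returns `False`, while a paragraph that also contains "robust" returns `True`.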