| author | Craig Jennings <c@cjennings.net> | 2026-05-29 14:25:21 -0500 |
|---|---|---|
| committer | Craig Jennings <c@cjennings.net> | 2026-05-29 14:25:21 -0500 |
| commit | 0870a61a28b89305ba0f0be887eb6c563c9ba3e6 (patch) | |
| tree | 9d68dad1056974a13ea0aa00bc9c445fa1165a2b | |
| parent | 7a861eda6dc785ea9767886e13cab1166b3f5d22 (diff) | |
docs(voice): land Phase 1 voice profile derived from git-commit corpus
Phase 1 of the writing voice profile TODO (filed 7a861ed). The work
covers corpus assembly, statistics, and a cross-check against the 41
SKILL.md patterns. Email, PR, Slack, and long-form sources deferred to
Phase 2.
Corpus: 5355 commits, 1895 with non-trivial bodies, 128608 words across
33 repos. Strong findings:
- Pattern 17 (no emojis), Pattern 7 (AI vocabulary), Pattern 22
(filler), Pattern 32 (first-person), Pattern 34 (contractions), and
Pattern 38 (terse cut) are all confirmed by direct corpus
measurement.
- Pattern 13 (em-dash zero-tolerance) and Pattern 33 (semicolons to
period) contradict the corpus. Craig USES em-dashes at 3.49 per 1000
words and semicolons at 3.16 per 1000 words, rates comparable to
AI-generated prose. The rules are self-discipline, not
habit-reflection. SKILL.md should say so honestly.
- Pattern 7 watch-word "comprehensive" appears 42 times in the corpus
while every other watch-word clocks zero or one. "comprehensive" is
genuine Craig vocabulary. The rule should pull it from the watch-list
or flag only when it co-occurs with other AI tells.
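A minimal sketch of the per-1000-word rate and the "watch in clusters, not solo" idea from the findings above. The watch-list here is only the subset of words named in this commit, and the function names are hypothetical; this is not the script that produced the corpus numbers.

```python
import re

# Partial, illustrative watch-list: only words named in this commit.
# The full SKILL.md list is not reproduced here.
WATCH_WORDS = ["delve", "leverage", "robust", "seamless", "moreover", "comprehensive"]


def per_thousand(count: int, total_words: int) -> float:
    """Occurrences per 1000 words; 0.0 for an empty corpus."""
    return 1000.0 * count / total_words if total_words else 0.0


def watch_word_hits(text: str) -> dict:
    """Case-insensitive whole-word counts for each watch-word."""
    lowered = text.lower()
    return {
        word: len(re.findall(r"\b" + re.escape(word) + r"\b", lowered))
        for word in WATCH_WORDS
    }


def flag_comprehensive(text: str) -> bool:
    """Flag "comprehensive" only when another watch-word co-occurs."""
    hits = watch_word_hits(text)
    others = sum(n for word, n in hits.items() if word != "comprehensive")
    return hits["comprehensive"] > 0 and others > 0
```

With the figures quoted in the message, `per_thousand(42, 128608)` rounds to 0.33. A body such as "Add comprehensive test coverage" would pass unflagged under this rule, while "A robust and comprehensive solution" would be flagged.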
New patterns the corpus suggests adding: single-sentence-paragraph
cadence (41.1% of paragraphs are exactly one sentence), parenthetical
density (23 opening parens per 1000), declarative-default register
(0.33 question marks per 1000).
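The three structural measures above could be computed roughly as follows. This is a sketch under stated assumptions: paragraphs are taken to be separated by blank lines, and sentences are split naively after `.`, `!`, or `?`. The commit does not say how the original statistics defined either, so the helper names and heuristics are hypothetical.

```python
import re


def split_paragraphs(text: str) -> list:
    """Split on blank lines (assumed paragraph delimiter)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def count_sentences(paragraph: str) -> int:
    """Naive sentence count: split after ., !, or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return len([p for p in parts if p])


def single_sentence_share(text: str) -> float:
    """Fraction of paragraphs that contain exactly one sentence."""
    paragraphs = split_paragraphs(text)
    if not paragraphs:
        return 0.0
    singles = sum(1 for p in paragraphs if count_sentences(p) == 1)
    return singles / len(paragraphs)


def char_rate_per_thousand(text: str, char: str) -> float:
    """Occurrences of `char` per 1000 whitespace-delimited words."""
    words = len(text.split())
    return 1000.0 * text.count(char) / words if words else 0.0
```

Calling `char_rate_per_thousand(body, "(")` gives the opening-paren density and `char_rate_per_thousand(body, "?")` the question-mark rate. The naive splitter will miscount abbreviations and version numbers, so its output is an approximation only.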
Six concrete SKILL.md edits proposed in the doc, none applied. The
deltas await Craig's call.
Phase 2 sources are documented in the doc body.
| -rw-r--r-- | voice/references/voice-profile.org | 89 |
1 files changed, 89 insertions, 0 deletions
diff --git a/voice/references/voice-profile.org b/voice/references/voice-profile.org
new file mode 100644
index 0000000..ec074b9
--- /dev/null
+++ b/voice/references/voice-profile.org
@@ -0,0 +1,89 @@
+#+TITLE: Voice profile Phase 1 — corpus-grounded delta proposal
+#+DATE: 2026-05-29
+#+SOURCE: rulesets session 2026-05-29
+
+* Corpus
+
+Git commit bodies authored by Craig Jennings across all repos under =~/code/= and =~/projects/=. After cleanup (subject lines, trailers, URL-only lines, AI-attribution lines, blank-run collapse):
+
+- 5355 raw commits, 1895 with non-trivial bodies
+- 128608 words, 912400 characters
+- 33 repos contributing; top sources: archsetup (703), rulesets (621), work (565), archangel (455), home (395)
+
+PRs deferred. Email + Slack deferred. This is one register (deliberate technical prose) — useful but narrow.
+
+* Findings against the 41 SKILL.md patterns
+
+** Strongly confirmed by the corpus
+
+*Pattern 17 (no emojis).* Zero emojis in corpus. Confirmed.
+
+*Pattern 7 (AI vocabulary).* "delve" 0. "embark" 0. "navigate the" 0. "in the realm of" 0. "seamless" 0. "moreover" 0. "furthermore" 0. "in conclusion" 0. "additionally" 1. "robust" 1. "leverage" 1. Rule confirmed for 11 of 12 watch-words. (One exception below.)
+
+*Pattern 22 (filler).* "moreover" / "furthermore" / "additionally" / "in conclusion": all zero or one occurrence. Filler-phrase avoidance confirmed.
+
+*Pattern 32 (first-person rewrite).* Standalone "I" at 3.85 per 1000 words. Craig writes first-person heavily — this is real, not aspirational.
+
+*Pattern 34 (contractions).* 459 contractions total (3.57 per 1000). Top hits: =doesn't= (92), =don't= (59), =isn't= (46), =it's= (43), =can't= (40), =that's= (34). Rule confirmed.
+
+*Pattern 38 (terse cut).* 41.1% of paragraphs are single-sentence. Craig writes terse — paragraph breaks land after one complete thought even when short. Confirmed indirectly via paragraph structure.
+
+** Aspirational (corpus contradicts, but the rule is intentional self-discipline)
+
+*Pattern 13 (em-dash zero-tolerance, personal mode).* Corpus rate: 3.49 em-dashes per 1000 words. Comparable to AI-generated prose. Craig USES em-dashes regularly in commit bodies — the rule overrides his habit, it doesn't reflect it. Suggested rewording: drop the "LLMs use em dashes more than humans" framing; keep the zero-tolerance directive but rationale becomes "Craig's published voice — commit messages going forward, PR bodies, emails — drops em-dashes by choice because it reads cleaner and avoids a common AI tell, regardless of his pre-rule habit." Honest about the source.
+
+*Pattern 33 (semicolons → period/comma).* Corpus rate: 3.16 semicolons per 1000. Craig uses semicolons regularly. Same shape as #13: rule is self-discipline, not habit-reflection. Suggested rewording: acknowledge the rule overrides habit rather than implying it codifies one.
+
+These two rules are still valuable — em-dashes and semicolons both read cleaner when absent from short imperative-leaning prose. But the SKILL.md should say "this is a rule I've decided to follow," not "this is how I already write."
+
+** Worth challenging
+
+*Pattern 7 watch-word "comprehensive".* 42 occurrences in corpus (~0.33 per 1000). All other AI-tell watch-words clock near zero. "comprehensive" appears to be genuine vocabulary for Craig in technical contexts ("comprehensive test coverage", "comprehensive audit"). Suggested change: pull "comprehensive" out of the watch-list, or carve out a "watch in clusters, not solo" note — flag only when "comprehensive" co-occurs with other AI-tell words.
+
+** Worth adding (corpus surfaces traits the rules don't capture)
+
+*Single-sentence paragraph cadence.* 41.1% of paragraphs are exactly one sentence. This is distinctive — most prose-style guides advise multi-sentence paragraphs. Suggested addition (prose + personal): a positive pattern noting "a one-sentence paragraph is a finished thought, not a fragment. Break paragraphs after one complete thought when the next thought shifts angle, even if both are short." Anti-rule against "merge short paragraphs into multi-sentence ones."
+
+*Parenthetical density.* 23.07 opening parens per 1000 words. Heavy parenthetical use — asides, clarifications, scope-narrowing in parens. Currently no rule addresses this either way. Could add a positive pattern: "parentheses for asides are part of the voice. Don't strip them in a 'clean prose' pass."
+
+*Question-mark rarity.* 0.33 per 1000. Craig's prose is declarative — he states things, rarely asks them. Worth noting as a register marker (when /voice personal output has questions, double-check whether they're contextual or AI rhetoric).
+
+** Out of corpus (commits don't test these — Phase 2 needed)
+
+- *Pattern 13 in long-form prose.* Commit bodies are short. Email and PR bodies may show different em-dash rates.
+- *Pattern 14 (boldface).* Org-mode bold uses =*word*=, not detectable by simple grep. Markdown bold rare in commits.
+- *Pattern 16 (title case in headings).* Commits don't carry headings.
+- *Pattern 19 (collaborative artifacts).* Not present in commit bodies.
+- *Pattern 35 (sentence split on conjunctions).* Average sentence is 18.81 words, median 14, with 28% of sentences 21+ words — long-sentence rate is moderate. Need to inspect actual sentences to know if they're conjunction-stitched. Defer.
+- *Pattern 36 (felt-experience cut).* Commit bodies wouldn't carry felt-experience prose. Email + journal corpus needed.
+- *Pattern 37 (sentence fragments).* 9.7% of sentences are 1-5 words. Some are legitimate ("All eight pass."), some may be fragments. Can't tell from word-count alone. Defer to a pass that does syntactic detection.
+- *Pattern 39 (public-artifact scope).* The corpus IS the public artifacts — circular. Defer.
+- *Pattern 40 (praise vs correction asymmetry).* Not detectable in commit bodies. Email or PR-review corpus needed.
+
+** Curiosities
+
+- *=I'm=* (9 occurrences) and *=I'll=* (2 occurrences) are surprisingly rare relative to standalone =I= (495 occurrences). Either Craig prefers "I am" / "I will" in commit prose, or the regex missed contexts. Worth a quick sample check in Phase 2.
+
+* Suggested deltas
+
+Six concrete edits to =voice/SKILL.md=, all of which can land independently:
+
+1. *#13 (Em-Dash).* Drop the "LLMs use em dashes more than humans" framing in the personal-mode section. Restate the zero-tolerance rule as self-discipline ("Craig's published voice drops em-dashes by choice"), not habit-reflection. Cite: corpus rate 3.49/1000, AI-comparable.
+
+2. *#33 (Semicolons).* Same shape. Restate as self-discipline. Cite: corpus rate 3.16/1000.
+
+3. *#7 (AI Vocabulary).* Remove "comprehensive" from the watch-list, OR add a note that "comprehensive" alone is acceptable; flag only when it co-occurs with =delve= / =leverage= / =robust= / =seamless= / =moreover= etc. Cite: 42 occurrences, all other watch-words at 0 or 1.
+
+4. *NEW pattern (prose + personal): "Single-sentence paragraph cadence is a feature."* 41.1% of corpus paragraphs are exactly one sentence. A one-sentence paragraph is a finished thought, not a fragment. The voice pass should not merge short paragraphs into multi-sentence ones.
+
+5. *NEW pattern (prose + personal): "Parentheses for asides are part of the voice."* 23 opening parens per 1000 words. Heavy parenthetical use is distinctive. Don't strip parenthetical asides in a "clean prose" pass.
+
+6. *Register marker (advisory, not a rewrite rule): "Declarative is the default."* 0.33 question marks per 1000. Voice personal output that contains rhetorical questions should be checked — they're often AI rhetoric, not Craig's register.
+
+* What Phase 2 would add
+
+- Email corpus (gmail + cmail, sent-only, long-form) — different register, especially long-form prose flow.
+- PR bodies and review comments — longer prose, deliberate register, includes the praise/correction asymmetry test ground.
+- Slack messages — casual register, contraction rate, sentence-fragment rate.
+- Syntactic detection — distinguish fragments from terse complete sentences for pattern #37.
+- Long-form documents (résumé, proposals if any) — single register but high prose density.
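The "Curiosities" entry in the diff wonders whether the regex missed contexts when counting `I'm` and `I'll`. One plausible cause, offered here only as a hypothesis, is typographic apostrophes (U+2019) that an ASCII-only pattern would not match. A minimal sketch of a sample check that accepts both apostrophe forms follows; the pattern names are hypothetical and the original counting regex is not shown in this commit.

```python
import re

# Accept both the ASCII apostrophe and the typographic one (U+2019).
APOS = "'\u2019"

# Standalone "I": a whole word not immediately followed by an apostrophe.
STANDALONE_I = re.compile(rf"\bI\b(?![{APOS}])")
I_AM_CONTRACTION = re.compile(rf"\bI[{APOS}]m\b")
I_WILL_CONTRACTION = re.compile(rf"\bI[{APOS}]ll\b")


def first_person_counts(text: str) -> dict:
    """Count standalone I and the two contractions, either apostrophe style."""
    return {
        "standalone_I": len(STANDALONE_I.findall(text)),
        "I'm": len(I_AM_CONTRACTION.findall(text)),
        "I'll": len(I_WILL_CONTRACTION.findall(text)),
    }
```

Comparing these counts with an ASCII-only pattern such as `r"\bI'll\b"` on a sample of commit bodies would show whether apostrophe style accounts for any of the gap, or whether the low contraction counts are real.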
