docs(todo): file [#C] :spec: TODO to build Craig's writing voice profile from real corpora

Files a TODO under Rulesets Open Work to mine Craig's actual writing (sent email across all three accounts, commit messages, PR bodies, org files he authored, slack threads, long-form artifacts) into a grounded voice profile.

The voice/SKILL.md patterns today are observation-derived. Some are spot-on. Others are intuition. A corpus pass would tell us which patterns are genuinely Craig's voice, which were guesses, and which Craig-specific positive traits the current ruleset misses entirely.

Output: voice/references/voice-profile.org with findings cited to evidence samples, plus a reconciliation pass against voice/SKILL.md to confirm, strengthen, weaken, add, or remove patterns based on what the corpus shows. Approach phased into corpus assembly, analysis (subagent-friendly), draft profile, reconcile-with-user.

The body includes a privacy note: raw corpus stays out of commits if the project's remote ever stops being private.

There's no urgency. The work is useful but optional, hence [#C].
author: Craig Jennings <c@cjennings.net> 2026-05-29 12:03:03 -0500
committer: Craig Jennings <c@cjennings.net> 2026-05-29 12:03:03 -0500
commit: 7a861eda6dc785ea9767886e13cab1166b3f5d22 (patch)
tree: 57471a9820636dd9460e01de83b8286aff381d76
parent: 2f9f8eb52405c42b64a9af14a7f3c789ea25f4ce (diff)
download: rulesets-7a861eda6dc785ea9767886e13cab1166b3f5d22.tar.gz rulesets-7a861eda6dc785ea9767886e13cab1166b3f5d22.zip
1 files changed, 52 insertions, 0 deletions
diff --git a/todo.org b/todo.org
index d9442a8..926fd15 100644
--- a/todo.org
+++ b/todo.org
@@ -1221,6 +1221,58 @@ If GV registration is still pending when this task runs, block here and surface
 
 =page-signal= is the fast path (a hook, a script, a make recipe can call it without an MCP round-trip). The MCP server is the smart path. When Claude wants to send and then *react to the reply*, the CLI can't do that — only the MCP server can. The two complement each other; this task adds the second half.
 
+** TODO [#C] Build Craig's writing voice profile from real corpora :spec:
+:PROPERTIES:
+:CREATED: [2026-05-29 Fri]
+:LAST_REVIEWED: 2026-05-29
+:END:
+
+Build a grounded profile of Craig's actual writing voice by mining the corpora he's produced over time. The =voice/SKILL.md= patterns today are observation-derived (em-dash zero-tolerance, semicolon → period, contractions kept, sentence-fragment rewrite, felt-experience cut, etc.). Some are spot-on; others are intuition. A real corpus pass would tell us which patterns are genuinely Craig's voice and which were guesses, plus surface idioms, sentence structures, and vocabulary the current ruleset misses.
+
+*** Sources to mine
+
+- *Email* — sent folders across all three accounts (=gmail=, =dmail/DeepSat=, =cmail/Proton=). Filter to Craig-authored (not forwards or replies-just-quoting). Separate work voice (=dmail=) from personal voice (=gmail=, =cmail=) since they're likely distinct registers.
+- *Commit messages* — =git log --author= across his repos. Captures terse-imperative voice.
+- *PR descriptions and review comments* — same corpora. More deliberate prose than commits.
+- *Org files he authored* — =notes.org=, todo bodies he typed, design docs in =docs/design/=, journal entries. Heavier on first-person voice than emails.
+- *Slack/messages* — DeepSat work slack, family group, friends. Casual register.
+- *Long-form artifacts* — résumé, proposals, white papers, blog posts (if any).
+
+Skip session-context files, which are Claude-co-written and would muddy the signal.
+
+*** Output
+
+- =voice/references/voice-profile.org= (or =.md=) — the canonical reference doc:
+  - Vocabulary tendencies (preferred verbs, avoided cliché classes, technical-vs-plain word choice).
+  - Sentence structures (typical length, conjunction patterns, parenthetical use).
+  - Punctuation patterns (em-dash actual frequency, semicolon vs period split, contraction rate).
+  - Register markers (signs of formal vs casual mode, work vs personal).
+  - Idioms and recurring phrasings.
+  - "Anti-patterns" — phrasings Craig consistently avoids that show up in AI-generated prose.
+- Updated =voice/SKILL.md= patterns grounded in evidence rather than intuition. Patterns that the corpus confirms get strengthened; patterns the corpus contradicts get rewritten or removed.
+
+Each finding should cite at least two evidence samples from the corpora so the basis for a rule is reviewable.
+
+*** Approach
+
+Phase 1 (corpus assembly) — pull the relevant slices: sent-mail dumps, =git log --author --no-merges --pretty=format:'%B'=, =gh pr list --author= bodies, org-file extracts. Strip headers, replies-quoted blocks, signatures. Land in =voice/corpus/= (gitignored if the project's =.ai/= is gitignored, tracked if private repo with private remote).
+
+Phase 2 (analysis) — pass over the corpus with focused queries: distribution of em-dashes per 1000 words, semicolon count, contraction frequency by register, sentence-length histogram, top-N adjectives/adverbs, etc. Subagent dispatch fits here.
+
+Phase 3 (draft profile) — write =voice-profile.org= with findings + evidence. Surface contradictions with the current ruleset.
+
+Phase 4 (reconcile with voice/SKILL.md) — present the deltas to Craig. Each delta is one of: confirm existing rule with evidence, strengthen rule, weaken rule, add new pattern, remove unsupported pattern. Apply approved deltas.
+
+*** Privacy
+
+Email and Slack content is private. The corpus must NOT enter any commit unless rulesets stays on the private cjennings.net remote (which it does today). If a future move to a public remote is on the table, the corpus and any direct quotes have to go before that happens. The profile doc itself can stay (it's analysis, not raw content), but cite by pattern not by verbatim quote.
+
+*** Why this matters
+
+The voice skill earns its place when Craig sees the rewrite and recognizes it as his own voice rather than a "clean" AI voice that approximates him. Today the skill catches common AI tells (em-dashes, semicolons, the felt-experience tic), which is useful. Corpus-grounding would make it catch the absence of *Craig-specific positive traits* — the phrasings he actually reaches for — not just the AI traits he doesn't.
+
+Likely improves =/voice personal= output quality on PR bodies, commit messages, and email drafts. Compound interest over the long run.
+
 ** TODO [#C] Enumerate implementation tasks in =spec-review.org= Phase 6 :feature:solo:
 :PROPERTIES:
 :CREATED: [2026-05-28 Thu]
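
The Phase 2 queries named in the TODO above (em-dashes per 1000 words, semicolon count, contraction frequency, sentence-length histogram) could be sketched roughly as follows. This is a minimal illustration, not part of the commit: the function name `voice_metrics`, the regexes, and the bucket size are all hypothetical, and the contraction pattern is a crude heuristic that also counts possessives such as "Craig's".

```python
import re
from collections import Counter

EM_DASH = "\u2014"

# Heuristic (assumption): a word, an apostrophe (straight or curly), then a
# common contraction suffix. This also matches possessive 's.
CONTRACTION = re.compile(r"\b\w+['\u2019](?:s|t|re|ve|ll|d|m)\b", re.IGNORECASE)

# Naive sentence boundary: terminal punctuation followed by whitespace or end of text.
SENTENCE_END = re.compile(r"[.!?]+(?:\s+|$)")

# A word is a run of letters/digits, optionally with an internal apostrophe.
WORD = re.compile(r"[A-Za-z0-9]+(?:['\u2019][A-Za-z]+)*")


def voice_metrics(text: str, bucket: int = 5) -> dict:
    """Compute rough per-1000-word punctuation rates and a sentence-length histogram."""
    n_words = len(WORD.findall(text))

    def per_1k(count: int) -> float:
        # Normalize a raw count to occurrences per 1000 words.
        return 1000.0 * count / n_words if n_words else 0.0

    # Word count of every non-empty sentence.
    sentences = [s for s in SENTENCE_END.split(text) if s.strip()]
    lengths = [len(WORD.findall(s)) for s in sentences]

    # Histogram keyed by the lower edge of each bucket (0, 5, 10, ...).
    hist = Counter((n // bucket) * bucket for n in lengths if n)

    return {
        "words": n_words,
        "em_dashes_per_1k": per_1k(text.count(EM_DASH)),
        "semicolons_per_1k": per_1k(text.count(";")),
        "contractions_per_1k": per_1k(len(CONTRACTION.findall(text))),
        "sentence_length_hist": dict(sorted(hist.items())),
    }


if __name__ == "__main__":
    sample = "I don't like em-dashes \u2014 at all. It's fine; really. Short one."
    for key, value in voice_metrics(sample).items():
        print(f"{key}: {value}")
```

Running the same function separately over each register (for example, work email versus personal email) would give the per-register comparison the TODO asks for. Rates are normalized per 1000 words so corpora of different sizes stay comparable.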