From b0d722d1a985326fb38e4e7fea237b9c4a2adcfd Mon Sep 17 00:00:00 2001
From: Craig Jennings
Date: Thu, 30 Apr 2026 07:55:28 -0500
Subject: docs: record four ADRs for gloss design decisions
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The four decisions called out in the brainstorm now have their own
files under docs/decisions/, each with Context / Decision /
Consequences / Alternatives Considered.

- 0001 — storage path default: respects org-directory if set, falls
  back to user-emacs-directory.
- 0002 — auto-fetch on local miss: silent fall-through, network
  failures surface via the regular error rollup. No y/n prompt for v1.
- 0003 — drill direction: every entry exports as twosided. One card
  per entry, both directions over time, no per-entry override.
- 0004 — HTML strip strategy: libxml-parse-html-region. Plain text
  only, no italic/bold preservation. Online fetch disabled
  package-wide for the session if libxml is missing.

The "Open Questions" section in the design doc is now "Decisions
Recorded" with links into the ADRs.
---
 docs/decisions/0004-html-strip-strategy.org | 54 +++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)
 create mode 100644 docs/decisions/0004-html-strip-strategy.org

(limited to 'docs/decisions/0004-html-strip-strategy.org')

diff --git a/docs/decisions/0004-html-strip-strategy.org b/docs/decisions/0004-html-strip-strategy.org
new file mode 100644
index 0000000..4ec7293
--- /dev/null
+++ b/docs/decisions/0004-html-strip-strategy.org
@@ -0,0 +1,54 @@
+#+TITLE: ADR-4: HTML strip strategy
+#+DATE: 2026-04-30
+#+STATUS: Accepted
+
+* Context
+
+Wiktionary's REST API returns definition text with HTML markup —
+== wrappers, == anchors, transclusion markers, occasional
+inline == and ==. The package needs plain text in the saved
+glossary entry. The strip strategy must be robust on real responses
+(not toy inputs) and shouldn't add a heavyweight dependency.
+
+* Decision
+
+Strip via =libxml-parse-html-region=. Take the parsed tree, recurse
+through it collecting text nodes, drop everything else. No
+preservation of inline formatting (italic, bold, links).
+
+If the running Emacs wasn't built with libxml2, online fetching is
+disabled package-wide for the session with a one-shot user-error.
+Manual =gloss-add= still works without libxml.
+
+* Consequences
+
+*Positive.*
+
+- Robust on edge cases — nested tags, malformed HTML, unusual
+  attributes. The libxml parser handles all of these.
+- libxml2 is standard on Linux and macOS; ships with most Emacs
+  builds. The "missing libxml" path is real but rare.
+- ~30 lines of strip code. Maintainable.
+
+*Negative.*
+
+- Loses italic/bold/link formatting from definitions. The saved
+  entry is plain text only. v1 trades fidelity for simplicity.
+- A user on a barebones Emacs build (no libxml2) loses online
+  fetching entirely. The error message tells them why and what to
+  do, but it's still a hit.
+
+* Alternatives Considered
+
+*Regex strip* — pattern-replace =<[^>]+>= and known HTML entities.
+Rejected: misses entities the regex didn't anticipate, breaks on
+attributes containing =>=, fights when tags are malformed. Looks
+simpler but rots fast.
+
+*Preserve markdown-style inline formatting* — italic → =/.../=, bold
+→ =*...*=. Rejected for v1: scope creep on a personal package.
+Defensible v1.1 if requested.
+
+*=shr-render-region=.* Rejected: shr is a renderer, not a stripper.
+It produces text-with-faces meant for display, not text for
+storage. Wrong shape for the use case.
-- 
cgit v1.2.3
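
The strip described under "Decision" (parse with
=libxml-parse-html-region=, recurse through the tree, keep only text
nodes) could look roughly like the sketch below. It is not the
package's actual code. The names =my-gloss--collect-text= and
=my-gloss-strip-html= are hypothetical, as is the error wording.
Skipping =comment=, =script= and =style= nodes and collapsing runs of
whitespace are assumptions that the ADR does not state. The sketch
signals a =user-error= on each call and does not implement the
session-wide disabling of online fetching.

#+begin_src emacs-lisp
;;; -*- lexical-binding: t; -*-
(require 'subr-x)                       ; for `string-trim'

(defun my-gloss--collect-text (node)
  "Return the concatenated text found under NODE.
NODE is a libxml parse tree: either a string (a text node) or a
list of the form (TAG ATTRIBUTES . CHILDREN).  Hypothetical helper."
  (cond
   ;; A text node: keep it as is.
   ((stringp node) node)
   ;; Assumption: comments, scripts and styles carry no definition text.
   ((and (consp node) (memq (car node) '(comment script style))) "")
   ;; An element: drop the tag and attributes, recurse into the children.
   ((consp node) (mapconcat #'my-gloss--collect-text (cddr node) ""))
   (t "")))

(defun my-gloss-strip-html (html)
  "Return the string HTML as plain text with all markup removed.
Signal a `user-error' when libxml2 support is unavailable."
  (unless (if (fboundp 'libxml-available-p)
              (libxml-available-p)      ; Emacs 27.1 and later
            (fboundp 'libxml-parse-html-region))
    (user-error "This Emacs lacks libxml2 support; cannot strip HTML"))
  (with-temp-buffer
    (insert html)
    (let ((tree (libxml-parse-html-region (point-min) (point-max))))
      ;; Assumption: collapse whitespace runs so the saved entry is
      ;; a single tidy line.
      (string-trim
       (replace-regexp-in-string
        "[ \t\n\r]+" " " (my-gloss--collect-text tree))))))

;; Usage, with a made-up input fragment:
;; (my-gloss-strip-html
;;  "<span>A <i>brief</i> <a href=\"/wiki/note\">note</a>.</span>")
#+end_src

Because the parser has already decoded entities and recovered from
malformed markup by the time the tree is walked, the recursion only
needs to tell strings from elements. That is the robustness argument
the ADR makes against the regex alternative.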