| author | Craig Jennings <c@cjennings.net> | 2026-04-30 07:55:28 -0500 |
|---|---|---|
| committer | Craig Jennings <c@cjennings.net> | 2026-04-30 07:55:28 -0500 |
| commit | b0d722d1a985326fb38e4e7fea237b9c4a2adcfd (patch) | |
| tree | 47793265082155fe8ddacfc09d5990d7760de15a /docs/decisions/0004-html-strip-strategy.org | |
| parent | 9e90517a98785c450cd13cd940bd1787a4771529 (diff) | |
docs: record four ADRs for gloss design decisions
The four decisions called out in the brainstorm now have their own
files under docs/decisions/, each with Context / Decision /
Consequences / Alternatives Considered.
- 0001 — storage path default: respects org-directory if set, falls
back to user-emacs-directory.
- 0002 — auto-fetch on local miss: silent fall-through, network
failures surface via the regular error rollup. No y/n prompt for
v1.
- 0003 — drill direction: every entry exports as twosided. One card
per entry, both directions over time, no per-entry override.
- 0004 — HTML strip strategy: libxml-parse-html-region. Plain text
only, no italic/bold preservation. Online fetch disabled package-wide
for the session if libxml is missing.
The "Open Questions" section in the design doc is now "Decisions
Recorded" with links into the ADRs.
Diffstat (limited to 'docs/decisions/0004-html-strip-strategy.org')
| -rw-r--r-- | docs/decisions/0004-html-strip-strategy.org | 54 |
1 files changed, 54 insertions, 0 deletions
diff --git a/docs/decisions/0004-html-strip-strategy.org b/docs/decisions/0004-html-strip-strategy.org
new file mode 100644
index 0000000..4ec7293
--- /dev/null
+++ b/docs/decisions/0004-html-strip-strategy.org
@@ -0,0 +1,54 @@
+#+TITLE: ADR-4: HTML strip strategy
+#+DATE: 2026-04-30
+#+STATUS: Accepted
+
+* Context
+
+Wiktionary's REST API returns definition text with HTML markup —
+=<span>= wrappers, =<a>= anchors, transclusion markers, occasional
+inline =<i>= and =<b>=. The package needs plain text in the saved
+glossary entry. The strip strategy must be robust on real responses
+(not toy inputs) and shouldn't add a heavyweight dependency.
+
+* Decision
+
+Strip via =libxml-parse-html-region=. Take the parsed tree, recurse
+through it collecting text nodes, drop everything else. No
+preservation of inline formatting (italic, bold, links).
+
+If the running Emacs wasn't built with libxml2, online fetching is
+disabled package-wide for the session with a one-shot user-error.
+Manual =gloss-add= still works without libxml.
+
+* Consequences
+
+*Positive.*
+
+- Robust on edge cases — nested tags, malformed HTML, unusual
+  attributes. The libxml parser handles all of these.
+- libxml2 is standard on Linux and macOS; ships with most Emacs
+  builds. The "missing libxml" path is real but rare.
+- ~30 lines of strip code. Maintainable.
+
+*Negative.*
+
+- Loses italic/bold/link formatting from definitions. The saved
+  entry is plain text only. v1 trades fidelity for simplicity.
+- A user on a barebones Emacs build (no libxml2) loses online
+  fetching entirely. The error message tells them why and what to
+  do, but it's still a hit.
+
+* Alternatives Considered
+
+*Regex strip* — pattern-replace =<[^>]+>= and known HTML entities.
+Rejected: misses entities the regex didn't anticipate, breaks on
+attributes containing =>=, fights when tags are malformed. Looks
+simpler but rots fast.
+
+*Preserve markdown-style inline formatting* — italic → =/.../=, bold
+→ =*...*=. Rejected for v1: scope creep on a personal package.
+Defensible v1.1 if requested.
+
+*=shr-render-region=.* Rejected: shr is a renderer, not a stripper.
+It produces text-with-faces meant for display, not text for
+storage. Wrong shape for the use case.
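
The Decision section of the ADR describes parsing with `libxml-parse-html-region` and recursing through the tree to collect text nodes. A minimal Emacs Lisp sketch of that approach might look like the following. The names `my-gloss--collect-text` and `my-gloss-strip-html` are hypothetical and not taken from the gloss source. The whitespace collapsing, the skipped `comment`/`script`/`style` nodes, and the per-call `user-error` are assumptions as well: the ADR describes a one-shot, session-wide disable of online fetching rather than a check on every call. The sketch assumes Emacs 27.1 or later for `libxml-available-p`.

```elisp
;;; Hypothetical sketch of the ADR-4 strip strategy; not the gloss source.
(require 'subr-x)  ; for `string-trim' on older Emacs versions

(defun my-gloss--collect-text (node)
  "Return the concatenated text content of parsed HTML NODE.
NODE is either a string (a text node) or a list of the form
\(TAG ATTRIBUTES . CHILDREN), as produced by `libxml-parse-html-region'."
  (cond
   ;; Text nodes are plain strings: keep them as-is.
   ((stringp node) node)
   ;; Assumption: comments, scripts and styles carry no definition text.
   ((memq (car-safe node) '(comment script style)) "")
   ;; Element node: skip the tag and attribute alist, recurse into children.
   ((consp node)
    (mapconcat #'my-gloss--collect-text (cddr node) ""))
   ;; Anything else (e.g. nil from an empty parse) contributes nothing.
   (t "")))

(defun my-gloss-strip-html (html)
  "Return the string HTML with all markup removed, as plain text."
  ;; Assumption: signal on every call; the ADR instead disables online
  ;; fetching once per session when libxml2 is missing.
  (unless (and (fboundp 'libxml-available-p) (libxml-available-p))
    (user-error "This Emacs was built without libxml2; cannot strip HTML"))
  (with-temp-buffer
    (insert html)
    (let ((dom (libxml-parse-html-region (point-min) (point-max))))
      ;; Collapse whitespace runs so line breaks in the markup do not
      ;; leak into the stored entry.
      (string-trim
       (replace-regexp-in-string "[ \t\n\r]+" " "
                                 (my-gloss--collect-text dom))))))

;; Example call on a made-up fragment (not a real API response):
;; (my-gloss-strip-html
;;  "<span>A <i>short</i> note with a <a href=\"/wiki/link\">link</a>.</span>")
;; should return "A short note with a link."
```

Because the parser decodes entities and tolerates unbalanced tags before the tree is walked, the recursion only has to distinguish strings from element lists. This is the robustness the ADR cites against the regex alternative.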
