author: Craig Jennings <c@cjennings.net> 2026-04-30 07:55:28 -0500
committer: Craig Jennings <c@cjennings.net> 2026-04-30 07:55:28 -0500
commit: b0d722d1a985326fb38e4e7fea237b9c4a2adcfd
tree: 47793265082155fe8ddacfc09d5990d7760de15a /docs/decisions/0004-html-strip-strategy.org
parent: 9e90517a98785c450cd13cd940bd1787a4771529
docs: record four ADRs for gloss design decisions

The four decisions called out in the brainstorm now have their own files under docs/decisions/, each with Context / Decision / Consequences / Alternatives Considered.

- 0001 (storage path default): respects org-directory if set, falls back to user-emacs-directory.
- 0002 (auto-fetch on local miss): silent fall-through, network failures surface via the regular error rollup. No y/n prompt for v1.
- 0003 (drill direction): every entry exports as twosided. One card per entry, both directions over time, no per-entry override.
- 0004 (HTML strip strategy): libxml-parse-html-region. Plain text only, no italic/bold preservation. Online fetch disabled package-wide for the session if libxml is missing.

The "Open Questions" section in the design doc is now "Decisions Recorded" with links into the ADRs.

Diffstat (limited to 'docs/decisions/0004-html-strip-strategy.org')
-rw-r--r-- docs/decisions/0004-html-strip-strategy.org | 54
1 files changed, 54 insertions, 0 deletions
diff --git a/docs/decisions/0004-html-strip-strategy.org b/docs/decisions/0004-html-strip-strategy.org
new file mode 100644
index 0000000..4ec7293
--- /dev/null
+++ b/docs/decisions/0004-html-strip-strategy.org
@@ -0,0 +1,54 @@
+#+TITLE: ADR-4: HTML strip strategy
+#+DATE: 2026-04-30
+#+STATUS: Accepted
+
+* Context
+
+Wiktionary's REST API returns definition text with HTML markup —
+=<span>= wrappers, =<a>= anchors, transclusion markers, occasional
+inline =<i>= and =<b>=. The package needs plain text in the saved
+glossary entry. The strip strategy must be robust on real responses
+(not toy inputs) and shouldn't add a heavyweight dependency.
+
+* Decision
+
+Strip via =libxml-parse-html-region=. Take the parsed tree, recurse
+through it collecting text nodes, drop everything else. No
+preservation of inline formatting (italic, bold, links).
+
+If the running Emacs wasn't built with libxml2, online fetching is
+disabled package-wide for the session with a one-shot user-error.
+Manual =gloss-add= still works without libxml.
+
+* Consequences
+
+*Positive.*
+
+- Robust on edge cases — nested tags, malformed HTML, unusual
+ attributes. The libxml parser handles all of these.
+- libxml2 is standard on Linux and macOS; ships with most Emacs
+ builds. The "missing libxml" path is real but rare.
+- ~30 lines of strip code. Maintainable.
+
+*Negative.*
+
+- Loses italic/bold/link formatting from definitions. The saved
+ entry is plain text only. v1 trades fidelity for simplicity.
+- A user on a barebones Emacs build (no libxml2) loses online
+ fetching entirely. The error message tells them why and what to
+ do, but it's still a hit.
+
+* Alternatives Considered
+
+*Regex strip* — pattern-replace =<[^>]+>= and known HTML entities.
+Rejected: misses entities the regex didn't anticipate, breaks on
+attributes containing =>=, and misbehaves on malformed tags. Looks
+simpler but rots fast.
+
+*Preserve Org-style inline formatting.* Italic → =/.../=, bold
+→ =*...*=. Rejected for v1: scope creep on a personal package.
+Defensible v1.1 if requested.
+
+*=shr-render-region=.* Rejected: shr is a renderer, not a stripper.
+It produces text-with-faces meant for display, not text for
+storage. Wrong shape for the use case.
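
The Decision section of the ADR above describes a tree walk that keeps text nodes and drops everything else, plus a one-shot error when libxml2 is missing. A minimal sketch of that idea follows. It is not taken from the gloss source: every =my/gloss-= name is hypothetical, the session flag is an assumed way to implement "disabled package-wide for the session", and the whitespace collapsing at the end is an extra step the ADR does not mention. =libxml-available-p= requires Emacs 27.1 or later.

```elisp
(require 'subr-x)                       ; `string-trim' on Emacs < 29

;; Hypothetical session flag; the ADR only says fetching is disabled
;; for the session, not how that state is stored.
(defvar my/gloss-online-disabled nil
  "Non-nil once online fetching has been switched off for this session.")

(defun my/gloss-online-available-p ()
  "Return non-nil if HTML can be parsed with libxml2.
The first time libxml2 is found missing, set the session flag and
signal a `user-error'.  Later calls return nil silently."
  (cond
   (my/gloss-online-disabled nil)
   ((and (fboundp 'libxml-available-p) (libxml-available-p)) t)
   (t (setq my/gloss-online-disabled t)
      (user-error "Emacs was built without libxml2; online fetch is disabled"))))

(defun my/gloss-collect-text (node)
  "Return the concatenated text nodes found under NODE.
NODE is a parse tree as returned by `libxml-parse-html-region':
either a string (a text node) or a list (TAG ATTRIBUTES . CHILDREN)."
  (cond
   ((stringp node) node)
   ;; Comments are represented as (comment nil \"text\"); drop them.
   ((and (consp node) (eq (car node) 'comment)) "")
   ;; Any other element: ignore tag and attributes, recurse into children.
   ((consp node) (mapconcat #'my/gloss-collect-text (cddr node) ""))
   (t "")))

(defun my/gloss-strip-html (html)
  "Return the string HTML as plain text, or nil if libxml2 is unavailable."
  (when (my/gloss-online-available-p)
    (with-temp-buffer
      (insert html)
      (let ((tree (libxml-parse-html-region (point-min) (point-max))))
        ;; Assumption: collapse whitespace runs so stored entries stay tidy.
        (string-trim
         (replace-regexp-in-string "[ \t\n\r]+" " "
                                   (my/gloss-collect-text tree)))))))
```

The tree walk can be checked without libxml2 by passing it a hand-built tree: =(my/gloss-collect-text '(span nil "A " (i nil "thin") " layer"))= returns ="A thin layer"=. Entity decoding and recovery from malformed tags are left to the parser, which is the reason the ADR gives for preferring this approach over a regex.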