 <h3 class="section">Parsing HTML and XML</h3>  <p>Emacs can be compiled with built-in libxml2 support. </p> <dl> <dt id="libxml-available-p">Function: <strong>libxml-available-p</strong>
</dt> <dd><p>This function returns non-<code>nil</code> if built-in libxml2 support is available in this Emacs session. </p></dd>
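<dd><p>A minimal sketch of guarding code on this check. The helper name <code>my-libxml-usable-p</code> is hypothetical, and the <code>fboundp</code> test is an assumption made so the sketch also behaves on Emacs versions that do not define <code>libxml-available-p</code>: </p> <div class="example"> <pre class="example">;; Hypothetical helper, not part of Emacs.
(defun my-libxml-usable-p ()
  "Return non-nil if this Emacs can call the libxml parsing functions."
  (and (fboundp 'libxml-available-p)
       (libxml-available-p)))

(if (my-libxml-usable-p)
    (message "libxml2 support is available")
  (message "libxml2 support is missing"))
</pre>
</div> </dd>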
</dl> <p>When libxml2 support is available, the following functions can be used to parse HTML or XML text into Lisp object trees. </p> <dl> <dt id="libxml-parse-html-region">Function: <strong>libxml-parse-html-region</strong> <em>start end &amp;optional base-url discard-comments</em>
</dt> <dd>
<p>This function parses the text between <var>start</var> and <var>end</var> as HTML, and returns a list representing the HTML <em>parse tree</em>. It attempts to handle real-world HTML by robustly coping with syntax mistakes. </p> <p>The optional argument <var>base-url</var>, if non-<code>nil</code>, should be a string specifying the base URL for relative URLs occurring in links. </p> <p>If the optional argument <var>discard-comments</var> is non-<code>nil</code>, any top-level comment is discarded. (This argument is obsolete and will be removed in future Emacs versions. To remove comments, use the <code>xml-remove-comments</code> utility function on the data before you call the parsing function.) </p> <p>In the parse tree, each HTML node is represented by a list in which the first element is a symbol representing the node name, the second element is an alist of node attributes, and the remaining elements are the subnodes. </p> <p>The following example demonstrates this. Given this (malformed) HTML document: </p> <div class="example"> <pre class="example">&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body width=101&gt;&lt;div class=thing&gt;Foo&lt;div&gt;Yes
</pre>
</div> <p>A call to <code>libxml-parse-html-region</code> returns this <acronym>DOM</acronym> (document object model): </p> <div class="example"> <pre class="example">(html nil
 (head nil)
 (body ((width . "101"))
  (div ((class . "thing"))
   "Foo"
   (div nil
    "Yes"))))
</pre>
</div> </dd>
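<dd><p>The node layout described above can be explored with ordinary list functions. The following is a sketch only: <code>my-parse-html-string</code> and its <var>strip-comments</var> argument are hypothetical names, and the helper returns <code>nil</code> when libxml2 support is unavailable. It parses a string in a temporary buffer, optionally calling <code>xml-remove-comments</code> first, and then picks apart the result: </p> <div class="example"> <pre class="example">(require 'xml) ; for `xml-remove-comments'

;; Hypothetical helper, not part of Emacs.
(defun my-parse-html-string (html &amp;optional strip-comments)
  "Parse the string HTML and return its parse tree.
If STRIP-COMMENTS is non-nil, delete comments before parsing.
Return nil when libxml2 support is unavailable."
  (when (and (fboundp 'libxml-available-p) (libxml-available-p))
    (with-temp-buffer
      (insert html)
      (when strip-comments
        (xml-remove-comments (point-min) (point-max)))
      (libxml-parse-html-region (point-min) (point-max)))))

(let* ((tree (my-parse-html-string
              "&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body width=101&gt;&lt;div class=thing&gt;Foo&lt;div&gt;Yes"))
       ;; `assq' ignores string children, so it finds the first
       ;; subnode whose node name is `body'.
       (body (assq 'body (cddr tree))))
  (list (car tree)                        ; node name
        (cdr (assq 'width (cadr body)))   ; attribute value
        (length (cddr body))))            ; number of subnodes
</pre>
</div> <p>If the parser returns the tree shown above, the final form should evaluate to <code>(html "101" 1)</code>. </p> </dd>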
</dl>  <dl> <dt id="shr-insert-document">Function: <strong>shr-insert-document</strong> <em>dom</em>
</dt> <dd><p>This function renders the parsed HTML in <var>dom</var> into the current buffer. The argument <var>dom</var> should be a list as generated by <code>libxml-parse-html-region</code>. This function is used, for example, by EWW; see <a href="https://www.gnu.org/software/emacs/manual/html_node/eww/index.html#Top">EWW</a> in <cite>The Emacs Web Wowser Manual</cite>. </p></dd>
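<dd><p>As a rough sketch of how parsing and rendering fit together, the hypothetical helper <code>my-render-html-string</code> below parses a string in one temporary buffer and renders the resulting tree into another. It assumes libxml2 support is available, and the exact rendered text depends on the <code>shr</code> settings in effect: </p> <div class="example"> <pre class="example">(require 'shr)

;; Hypothetical helper, not part of Emacs.
(defun my-render-html-string (html)
  "Render the string HTML with shr and return the resulting text."
  (let ((dom (with-temp-buffer
               (insert html)
               (libxml-parse-html-region (point-min) (point-max)))))
    (with-temp-buffer
      ;; Rendering happens in the current buffer, here a scratch one.
      (shr-insert-document dom)
      (buffer-string))))

(my-render-html-string
 "&lt;html&gt;&lt;body&gt;&lt;h1&gt;Title&lt;/h1&gt;&lt;p&gt;Hello, world.&lt;/p&gt;&lt;/body&gt;&lt;/html&gt;")
</pre>
</div> </dd>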
</dl>  <dl> <dt id="libxml-parse-xml-region">Function: <strong>libxml-parse-xml-region</strong> <em>start end &amp;optional base-url discard-comments</em>
</dt> <dd><p>This function is the same as <code>libxml-parse-html-region</code>, except that it parses the text as XML rather than HTML (so it is stricter about syntax). </p></dd>
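<dd><p>A short sketch of the XML variant, using the same node representation. The helper name <code>my-parse-xml-string</code> and the sample document are illustrative assumptions, and the helper returns <code>nil</code> when libxml2 support is unavailable: </p> <div class="example"> <pre class="example">;; Hypothetical helper, not part of Emacs.
(defun my-parse-xml-string (xml)
  "Parse the string XML and return its parse tree.
Return nil when libxml2 support is unavailable."
  (when (and (fboundp 'libxml-available-p) (libxml-available-p))
    (with-temp-buffer
      (insert xml)
      (libxml-parse-xml-region (point-min) (point-max)))))

(let ((tree (my-parse-xml-string
             "&lt;config version=\"1\"&gt;&lt;item&gt;one&lt;/item&gt;&lt;/config&gt;")))
  (list (car tree)                          ; root node name
        (cdr (assq 'version (cadr tree)))   ; attribute value
        (cddr tree)))                       ; subnodes
</pre>
</div> </dd>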
</dl> <table class="menu" border="0" cellspacing="0"> <tr>
<td align="left" valign="top">• <a href="document-object-model" accesskey="1">Document Object Model</a>
</td>
<td> </td>
<td align="left" valign="top">Access, manipulate and search the <acronym>DOM</acronym>. </td>
</tr> </table><div class="_attribution">
  <p class="_attribution-p">
    Copyright &copy; 1990-1996, 1998-2022 Free Software Foundation, Inc. <br>Licensed under the GNU GPL license.<br>
    <a href="https://www.gnu.org/software/emacs/manual/html_node/elisp/Parsing-HTML_002fXML.html" class="_attribution-link">https://www.gnu.org/software/emacs/manual/html_node/elisp/Parsing-HTML_002fXML.html</a>
  </p>
</div>