summaryrefslogtreecommitdiff
path: root/devdocs/elisp/text-representations.html
diff options
context:
space:
mode:
authorCraig Jennings <c@cjennings.net>2024-04-07 13:41:34 -0500
committerCraig Jennings <c@cjennings.net>2024-04-07 13:41:34 -0500
commit754bbf7a25a8dda49b5d08ef0d0443bbf5af0e36 (patch)
treef1190704f78f04a2b0b4c977d20fe96a828377f1 /devdocs/elisp/text-representations.html
new repository
Diffstat (limited to 'devdocs/elisp/text-representations.html')
-rw-r--r--devdocs/elisp/text-representations.html27
1 files changed, 27 insertions, 0 deletions
diff --git a/devdocs/elisp/text-representations.html b/devdocs/elisp/text-representations.html
new file mode 100644
index 00000000..ef4cc620
--- /dev/null
+++ b/devdocs/elisp/text-representations.html
@@ -0,0 +1,27 @@
+ <h3 class="section">Text Representations</h3> <p>Emacs buffers and strings support a large repertoire of characters from many different scripts, allowing users to type and display text in almost any known written language. </p> <p>To support this multitude of characters and scripts, Emacs closely follows the <em>Unicode Standard</em>. The Unicode Standard assigns a unique number, called a <em>codepoint</em>, to each and every character. The range of codepoints defined by Unicode, or the Unicode <em>codespace</em>, is <code>0..#x10FFFF</code> (in hexadecimal notation), inclusive. Emacs extends this range with codepoints in the range <code>#x110000..#x3FFFFF</code>, which it uses for representing characters that are not unified with Unicode and <em>raw 8-bit bytes</em> that cannot be interpreted as characters. Thus, a character codepoint in Emacs is a 22-bit integer. </p> <p>To conserve memory, Emacs does not hold fixed-length 22-bit numbers that are codepoints of text characters within buffers and strings. Rather, Emacs uses a variable-length internal representation of characters, that stores each character as a sequence of 1 to 5 8-bit bytes, depending on the magnitude of its codepoint<a id="DOCF20" href="#FOOT20"><sup>20</sup></a>. For example, any <acronym>ASCII</acronym> character takes up only 1 byte, a Latin-1 character takes up 2 bytes, etc. We call this representation of text <em>multibyte</em>. </p> <p>Outside Emacs, characters can be represented in many different encodings, such as ISO-8859-1, GB-2312, Big-5, etc. Emacs converts between these external encodings and its internal representation, as appropriate, when it reads text into a buffer or a string, or when it writes text to a disk file or passes it to some other process. </p> <p>Occasionally, Emacs needs to hold and manipulate encoded text or binary non-text data in its buffers or strings. For example, when Emacs visits a file, it first reads the file’s text verbatim into a buffer, and only then converts it to the internal representation. Before the conversion, the buffer holds encoded text. </p> <p>Encoded text is not really text, as far as Emacs is concerned, but rather a sequence of raw 8-bit bytes. We call buffers and strings that hold encoded text <em>unibyte</em> buffers and strings, because Emacs treats them as a sequence of individual bytes. Usually, Emacs displays unibyte buffers and strings as octal codes such as <code>\237</code>. We recommend that you never use unibyte buffers and strings except for manipulating encoded text or binary non-text data. </p> <p>In a buffer, the buffer-local value of the variable <code>enable-multibyte-characters</code> specifies the representation used. The representation for a string is determined and recorded in the string when the string is constructed. </p> <dl> <dt id="enable-multibyte-characters">Variable: <strong>enable-multibyte-characters</strong>
+</dt> <dd>
+<p>This variable specifies the current buffer’s text representation. If it is non-<code>nil</code>, the buffer contains multibyte text; otherwise, it contains unibyte encoded text or binary non-text data. </p> <p>You cannot set this variable directly; instead, use the function <code>set-buffer-multibyte</code> to change a buffer’s representation. </p>
+</dd>
+</dl> <dl> <dt id="position-bytes">Function: <strong>position-bytes</strong> <em>position</em>
+</dt> <dd><p>Buffer positions are measured in character units. This function returns the byte-position corresponding to buffer position <var>position</var> in the current buffer. This is 1 at the start of the buffer, and counts upward in bytes. If <var>position</var> is out of range, the value is <code>nil</code>. </p></dd>
+</dl> <dl> <dt id="byte-to-position">Function: <strong>byte-to-position</strong> <em>byte-position</em>
+</dt> <dd><p>Return the buffer position, in character units, corresponding to given <var>byte-position</var> in the current buffer. If <var>byte-position</var> is out of range, the value is <code>nil</code>. In a multibyte buffer, an arbitrary value of <var>byte-position</var> can be not at character boundary, but inside a multibyte sequence representing a single character; in this case, this function returns the buffer position of the character whose multibyte sequence includes <var>byte-position</var>. In other words, the value does not change for all byte positions that belong to the same character. </p></dd>
+</dl> <p>The following two functions are useful when a Lisp program needs to map buffer positions to byte offsets in a file visited by the buffer. </p> <dl> <dt id="bufferpos-to-filepos">Function: <strong>bufferpos-to-filepos</strong> <em>position &amp;optional quality coding-system</em>
+</dt> <dd>
+<p>This function is similar to <code>position-bytes</code>, but instead of byte position in the current buffer it returns the offset from the beginning of the current buffer’s file of the byte that corresponds to the given character <var>position</var> in the buffer. The conversion requires to know how the text is encoded in the buffer’s file; this is what the <var>coding-system</var> argument is for, defaulting to the value of <code>buffer-file-coding-system</code>. The optional argument <var>quality</var> specifies how accurate the result should be; it should be one of the following: </p> <dl compact> <dt><code>exact</code></dt> <dd><p>The result must be accurate. The function may need to encode and decode a large part of the buffer, which is expensive and can be slow. </p></dd> <dt><code>approximate</code></dt> <dd><p>The value can be an approximation. The function may avoid expensive processing and return an inexact result. </p></dd> <dt><code>nil</code></dt> <dd><p>If the exact result needs expensive processing, the function will return <code>nil</code> rather than an approximation. This is the default if the argument is omitted. </p></dd> </dl> </dd>
+</dl> <dl> <dt id="filepos-to-bufferpos">Function: <strong>filepos-to-bufferpos</strong> <em>byte &amp;optional quality coding-system</em>
+</dt> <dd><p>This function returns the buffer position corresponding to a file position specified by <var>byte</var>, a zero-base byte offset from the file’s beginning. The function performs the conversion opposite to what <code>bufferpos-to-filepos</code> does. Optional arguments <var>quality</var> and <var>coding-system</var> have the same meaning and values as for <code>bufferpos-to-filepos</code>. </p></dd>
+</dl> <dl> <dt id="multibyte-string-p">Function: <strong>multibyte-string-p</strong> <em>string</em>
+</dt> <dd><p>Return <code>t</code> if <var>string</var> is a multibyte string, <code>nil</code> otherwise. This function also returns <code>nil</code> if <var>string</var> is some object other than a string. </p></dd>
+</dl> <dl> <dt id="string-bytes">Function: <strong>string-bytes</strong> <em>string</em>
+</dt> <dd>
+ <p>This function returns the number of bytes in <var>string</var>. If <var>string</var> is a multibyte string, this can be greater than <code>(length <var>string</var>)</code>. </p>
+</dd>
+</dl> <dl> <dt id="unibyte-string">Function: <strong>unibyte-string</strong> <em>&amp;rest bytes</em>
+</dt> <dd><p>This function concatenates all its argument <var>bytes</var> and makes the result a unibyte string. </p></dd>
+</dl><div class="_attribution">
+ <p class="_attribution-p">
+ Copyright &copy; 1990-1996, 1998-2022 Free Software Foundation, Inc. <br>Licensed under the GNU GPL license.<br>
+ <a href="https://www.gnu.org/software/emacs/manual/html_node/elisp/Text-Representations.html" class="_attribution-link">https://www.gnu.org/software/emacs/manual/html_node/elisp/Text-Representations.html</a>
+ </p>
+</div>