 <h4 class="subsection">Basic Concepts of Coding Systems</h4>  <p><em>Character code conversion</em> means converting between the internal representation of characters used inside Emacs and some other encoding. Emacs supports many different encodings, in the sense that it can convert text to and from them. For example, it can convert text to or from encodings such as Latin 1, Latin 2, Latin 3, Latin 4, Latin 5, and several variants of ISO 2022. In some cases, Emacs supports several alternative encodings for the same characters; for example, there are three coding systems for the Cyrillic (Russian) alphabet: ISO, Alternativnyj, and KOI8. </p>   <p>Every coding system specifies a particular set of character code conversions, but the coding system <code>undecided</code> is special: it leaves the choice unspecified, so that the conversion is chosen heuristically for each file or string, based on that file’s or string’s data, when the data is decoded or encoded. The coding system <code>prefer-utf-8</code> is like <code>undecided</code>, but it prefers to choose <code>utf-8</code> when possible. </p> <p>In general, a coding system doesn’t guarantee roundtrip identity: decoding a byte sequence using a coding system, then encoding the resulting text in the same coding system, can produce a different byte sequence. However, some coding systems do guarantee that the resulting byte sequence will be the same as the one you originally decoded. Here are a few examples: </p> <blockquote> <p>iso-8859-1, utf-8, big5, shift_jis, euc-jp </p>
</blockquote> <p>Encoding buffer text and then decoding the result can also fail to reproduce the original text. For instance, if you encode a character with a coding system that does not support that character, the result is unpredictable, and thus decoding it using the same coding system may produce different text. Currently, Emacs can’t report errors that result from encoding unsupported characters. </p>    <p><em>End of line conversion</em> handles three different conventions used on various systems for representing end of line in files. The Unix convention, used on GNU and Unix systems, is to use the linefeed character (also called newline). The DOS convention, used on MS-Windows and MS-DOS systems, is to use a carriage return followed by a linefeed at the end of a line. The Mac convention is to use just a carriage return. (This was the convention used in Classic Mac OS.) </p>   <p><em>Base coding systems</em> such as <code>latin-1</code> leave the end-of-line conversion unspecified, to be chosen based on the data. <em>Variant coding systems</em> such as <code>latin-1-unix</code>, <code>latin-1-dos</code> and <code>latin-1-mac</code> specify the end-of-line conversion explicitly as well. Most base coding systems have three corresponding variants whose names are formed by adding ‘<samp>-unix</samp>’, ‘<samp>-dos</samp>’ and ‘<samp>-mac</samp>’. </p>  <p>The coding system <code>raw-text</code> is special in that it prevents character code conversion, and causes the buffer visited with this coding system to be a unibyte buffer. For historical reasons, you can save both unibyte and multibyte text with this coding system. When you use <code>raw-text</code> to encode multibyte text, it does perform one character code conversion: it converts eight-bit characters to their single-byte external representation. 
<code>raw-text</code> does not specify the end-of-line conversion, allowing that to be determined as usual by the data, and has the usual three variants which specify the end-of-line conversion. </p>   <p><code>no-conversion</code> (and its alias <code>binary</code>) is equivalent to <code>raw-text-unix</code>: it specifies no conversion of either character codes or end-of-line. </p>   <p>The coding system <code>utf-8-emacs</code> specifies that the data is represented in the internal Emacs encoding (see <a href="text-representations">Text Representations</a>). This is like <code>raw-text</code> in that no code conversion happens, but different in that the result is multibyte data. The name <code>emacs-internal</code> is an alias for <code>utf-8-emacs-unix</code> (so it forces no conversion of end-of-line, unlike <code>utf-8-emacs</code>, which can decode all three kinds of end-of-line conventions). </p> <dl> <dt id="coding-system-get">Function: <strong>coding-system-get</strong> <em>coding-system property</em>
</dt> <dd>
<p>This function returns the specified property of the coding system <var>coding-system</var>. Most coding system properties exist for internal purposes, but one that you might find useful is <code>:mime-charset</code>. That property’s value is the name used in MIME for the character coding which this coding system can read and write. Examples: </p> <div class="example"> <pre class="example">(coding-system-get 'iso-latin-1 :mime-charset)
     ⇒ iso-8859-1
(coding-system-get 'iso-2022-cn :mime-charset)
     ⇒ iso-2022-cn
(coding-system-get 'cyrillic-koi8 :mime-charset)
     ⇒ koi8-r
</pre>
</div> <p>The value of the <code>:mime-charset</code> property is also defined as an alias for the coding system. </p>
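<p>As a hedged illustration of that aliasing, the sketch below looks up a MIME name and then checks that the same symbol is accepted as a coding system. It relies on the standard functions <code>coding-system-p</code> and <code>coding-system-base</code>, which this section does not otherwise describe, and the results shown assume the stock <code>iso-latin-1</code> definition that ships with Emacs. </p> <div class="example"> <pre class="example">;; The MIME name reported for iso-latin-1.
(coding-system-get 'iso-latin-1 :mime-charset)
     ⇒ iso-8859-1
;; That name is itself accepted as a coding system symbol ...
(coding-system-p 'iso-8859-1)
     ⇒ t
;; ... and it resolves to the same base coding system.
(coding-system-base 'iso-8859-1)
     ⇒ iso-latin-1
</pre>
</div>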
</dd>
</dl>  <dl> <dt id="coding-system-aliases">Function: <strong>coding-system-aliases</strong> <em>coding-system</em>
</dt> <dd><p>This function returns the list of aliases of <var>coding-system</var>. </p></dd>
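<dd><p>A minimal sketch of how this function might be combined with <code>coding-system-get</code>: because the <code>:mime-charset</code> value is defined as an alias, it should appear in the returned list. The exact contents and order of that list depend on the aliases defined in your Emacs, so the example only tests for membership rather than showing the full list. </p> <div class="example"> <pre class="example">;; Look up the MIME name, then check that it is among the aliases.
;; `memq' returns a non-nil tail of the list when the symbol is present;
;; wrapping it in `and ... t' reduces that to a plain boolean.
(let ((mime (coding-system-get 'cyrillic-koi8 :mime-charset)))
  (and (memq mime (coding-system-aliases 'cyrillic-koi8)) t))
     ⇒ t
</pre>
</div></dd>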
</dl><div class="_attribution">
  <p class="_attribution-p">
    Copyright &copy; 1990-1996, 1998-2022 Free Software Foundation, Inc. <br>Licensed under the GNU GPL license.<br>
    <a href="https://www.gnu.org/software/emacs/manual/html_node/elisp/Coding-System-Basics.html" class="_attribution-link">https://www.gnu.org/software/emacs/manual/html_node/elisp/Coding-System-Basics.html</a>
  </p>
</div>