
 <h4 class="subsubsection">Non-ASCII Characters in Strings</h4>
 <p>There are two text representations for non-<acronym>ASCII</acronym> characters in Emacs strings: multibyte and unibyte (see <a href="text-representations">Text Representations</a>). Roughly speaking, unibyte strings store raw bytes, while multibyte strings store human-readable text. Each character in a unibyte string is a byte, i.e., its value is between 0 and 255. By contrast, each character in a multibyte string may have a value between 0 and 4194303 (see <a href="character-type">Character Type</a>). In both cases, characters above 127 are non-<acronym>ASCII</acronym>. </p>
 <p>You can include a non-<acronym>ASCII</acronym> character in a string constant by writing it literally. If the string constant is read from a multibyte source, such as a multibyte buffer or string, or a file that would be visited as multibyte, then Emacs reads each non-<acronym>ASCII</acronym> character as a multibyte character and automatically makes the string a multibyte string. If the string constant is read from a unibyte source, then Emacs reads the non-<acronym>ASCII</acronym> character as unibyte, and makes the string unibyte. </p>
 <p>Instead of writing a character literally into a multibyte string, you can write it as its character code using an escape sequence. See <a href="general-escape-syntax">General Escape Syntax</a>, for details about escape sequences. </p>
 <p>If you use any Unicode-style escape sequence ‘<samp>\uNNNN</samp>’ or ‘<samp>\U00NNNNNN</samp>’ in a string constant (even for an <acronym>ASCII</acronym> character), Emacs automatically assumes that the string is multibyte. </p>
 <p>You can also use hexadecimal escape sequences (‘<samp>\x<var>n</var></samp>’) and octal escape sequences (‘<samp>\<var>n</var></samp>’) in string constants. 
<strong>But beware:</strong> If a string constant contains hexadecimal or octal escape sequences, and these escape sequences all specify unibyte characters (i.e., less than 256), and there are no other literal non-<acronym>ASCII</acronym> characters or Unicode-style escape sequences in the string, then Emacs automatically assumes that it is a unibyte string. That is to say, it assumes that all non-<acronym>ASCII</acronym> characters occurring in the string are 8-bit raw bytes. </p> <p>In hexadecimal and octal escape sequences, the escaped character code may contain a variable number of digits, so the first subsequent character which is not a valid hexadecimal or octal digit terminates the escape sequence. If the next character in a string could be interpreted as a hexadecimal or octal digit, write ‘<samp>\ </samp>’ (backslash and space) to terminate the escape sequence. For example, ‘<samp>\xe0\ </samp>’ represents one character, ‘<samp>a</samp>’ with grave accent. ‘<samp>\ </samp>’ in a string constant is just like backslash-newline; it does not contribute any character to the string, but it does terminate any preceding hex escape. </p><div class="_attribution">
  <p class="_attribution-p">
    Copyright &copy; 1990-1996, 1998-2022 Free Software Foundation, Inc. <br>Licensed under the GNU GPL license.<br>
    <a href="https://www.gnu.org/software/emacs/manual/html_node/elisp/Non_002dASCII-in-Strings.html" class="_attribution-link">https://www.gnu.org/software/emacs/manual/html_node/elisp/Non_002dASCII-in-Strings.html</a>
  </p>
</div>
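<p>The escape-sequence rules described in this section can be checked with a few expressions. This is a minimal sketch rather than text from the manual: it assumes the forms are evaluated in a reasonably recent Emacs (for example with <kbd>M-:</kbd> or in the <samp>*scratch*</samp> buffer), and the particular strings are chosen only for illustration. <code>multibyte-string-p</code> and <code>length</code> are standard functions. </p>
<pre class="example">;; A Unicode-style escape for a non-ASCII character yields a
;; multibyte string.
(multibyte-string-p "\u00e0")
     &rArr; t

;; A hex escape below 256, with no other non-ASCII content, yields a
;; unibyte string whose single element is the raw byte #xe0.
(multibyte-string-p "\xe0")
     &rArr; nil

;; A hex escape of 256 or more cannot be a byte, so the string is
;; multibyte.
(multibyte-string-p "\x3b1")
     &rArr; t

;; Backslash-space ends the hex escape, so "b" and "c" are ordinary
;; characters that follow the byte #xe0.
(length "\xe0\ bc")
     &rArr; 3

;; Without the terminator, "b" and "c" are read as further hex digits
;; and the whole escape names the single character #xe0bc.
(length "\xe0bc")
     &rArr; 1
</pre>
<p>The last two forms differ only in the ‘<samp>\ </samp>’ terminator, which is why it is worth writing whenever a hex escape is followed by a character that could be read as a hexadecimal digit. </p>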