    <h1 id="firstHeading" class="firstHeading">Null-terminated multibyte strings</h1>            <p>A null-terminated multibyte string (NTMBS), or "multibyte string", is a sequence of nonzero bytes followed by a byte with value zero (the terminating null character).</p>
<p>Each character stored in the string may occupy more than one byte. The encoding used to represent characters in a multibyte character string is locale-specific: it may be UTF-8, GB18030, EUC-JP, Shift-JIS, etc. For example, the char array <code>{'\xe4','\xbd','\xa0','\xe5','\xa5','\xbd','\0'}</code> is an NTMBS holding the string <code>"你好"</code> in UTF-8 multibyte encoding: the first three bytes encode the character 你, the next three bytes encode the character 好. The same string encoded in GB18030 is the char array <code>{'\xc4', '\xe3', '\xba', '\xc3', '\0'}</code>, where each of the two characters is encoded as a two-byte sequence.</p>
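<p>The two arrays above can be written out directly in C. The following is a minimal sketch rather than part of the library: the names <code>hello_utf8</code>, <code>hello_gb18030</code> and <code>ntmbs_byte_length</code> are hypothetical and exist only for illustration. It shows that the same two characters occupy six bytes in UTF-8 and four bytes in GB18030, and that the byte count is available without any knowledge of the encoding.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "你好" as an NTMBS in UTF-8: two characters, three bytes each. */
const char hello_utf8[] = {'\xe4', '\xbd', '\xa0', '\xe5', '\xa5', '\xbd', '\0'};

/* The same text as an NTMBS in GB18030: two characters, two bytes each. */
const char hello_gb18030[] = {'\xc4', '\xe3', '\xba', '\xc3', '\0'};

/* Hypothetical helper: number of bytes before the terminating null
   character. This is a byte count, not a character count. */
size_t ntmbs_byte_length(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}
```

<p>Under these assumptions, <code>ntmbs_byte_length(hello_utf8)</code> is 6 and <code>ntmbs_byte_length(hello_gb18030)</code> is 4, although both arrays hold the same two characters.</p>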
<p>In some multibyte encodings, any given multibyte character sequence may represent different characters depending on the previous byte sequences, known as "shift sequences". Such encodings are known as state-dependent: knowledge of the current shift state is required to interpret each character. An NTMBS is only valid if it begins and ends in the initial shift state: if a shift sequence was used, the corresponding unshift sequence has to be present before the terminating null character. Examples of such encodings are BOCU-1 and <a rel="nofollow" class="external text" href="http://www.unicode.org/reports/tr6">SCSU</a>.</p>
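<p>The validity rule above can be approximated with the restartable functions from <code>&lt;wchar.h&gt;</code>. The sketch below is an illustration under stated assumptions, not a definitive validator: <code>ntmbs_ends_in_initial_state</code> is a hypothetical name, the string is interpreted in whatever locale is currently in effect, and the check relies on <code>mbrtowc</code> updating the <code>mbstate_t</code> object as it consumes bytes. It decodes every byte before the terminating null character and then asks <code>mbsinit</code> whether the conversion state is back in the initial state.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: returns true if every byte of s before the
   terminating null decodes in the current locale and the conversion
   state is the initial state once those bytes are consumed. */
bool ntmbs_ends_in_initial_state(const char *s)
{
    assert(s != NULL);

    mbstate_t st;
    memset(&st, 0, sizeof st);      /* a zeroed mbstate_t is the initial state */

    size_t remaining = strlen(s);   /* bytes before the terminating null */
    while (remaining > 0) {
        size_t n = mbrtowc(NULL, s, remaining, &st);
        if (n == (size_t)-1)
            return false;           /* invalid byte sequence */
        if (n == (size_t)-2)
            break;                  /* all bytes consumed, no character completed */
        if (n == 0)
            break;                  /* defensive: null character decoded */
        s += n;
        remaining -= n;
    }

    /* Non-initial here means a pending partial character or an
       active shift state at the end of the string. */
    return mbsinit(&st) != 0;
}
```

<p>In the default <code>"C"</code> locale, a plain ASCII string such as <code>"abc"</code> passes this check. Behaviour for state-dependent encodings depends on which locales the implementation provides.</p>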
<p>A multibyte character string is layout-compatible with a <a href="byte" title="c/string/byte">null-terminated byte string</a> (NTBS); that is, it can be stored, copied, and examined using the same facilities, except for calculating the number of characters. If the correct locale is in effect, I/O functions also handle multibyte strings. Multibyte strings can be converted to and from wide strings using the locale-dependent conversion functions listed below.</p>
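<p>Because the character count is the one thing byte-string facilities cannot provide, it has to be computed with a locale-aware function. The sketch below is illustrative only: <code>use_utf8_locale</code> and <code>ntmbs_char_count</code> are hypothetical helpers, and the locale names tried are assumptions that vary by platform. The count is obtained by stepping through the string with <code>mbrlen</code>.</p>

```c
#include <assert.h>
#include <locale.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: try a few commonly available UTF-8 locale names.
   The names are assumptions; none of them is guaranteed to exist. */
bool use_utf8_locale(void)
{
    static const char *const names[] = {"C.UTF-8", "en_US.UTF-8", "en_US.utf8"};
    for (size_t i = 0; i < sizeof names / sizeof names[0]; ++i)
        if (setlocale(LC_ALL, names[i]) != NULL)
            return true;
    return false;
}

/* Hypothetical helper: number of multibyte characters in s, interpreted
   in the current locale. Returns -1 if s contains an invalid or
   truncated sequence. */
long ntmbs_char_count(const char *s)
{
    assert(s != NULL);

    mbstate_t st;
    memset(&st, 0, sizeof st);      /* start in the initial conversion state */

    size_t remaining = strlen(s);   /* the byte count is always available */
    long count = 0;
    while (remaining > 0) {
        size_t n = mbrlen(s, remaining, &st);
        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;              /* invalid or incomplete character */
        if (n == 0)
            break;                  /* defensive: null character reached */
        s += n;
        remaining -= n;
        ++count;
    }
    return count;
}
```

<p>If a UTF-8 locale can be selected, <code>ntmbs_char_count("\xe4\xbd\xa0\xe5\xa5\xbd")</code> yields 2 while <code>strlen</code> reports 6 for the same string. In the default <code>"C"</code> locale only the result for ASCII text is portable.</p>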
<h3 id="Multibyte.2Fwide_character_conversions"> Multibyte/wide character conversions</h3> <table class="t-dsc-begin"> <tr class="t-dsc-header"> <th colspan="2"> Defined in header <code>&lt;stdlib.h&gt;</code>  </th>
</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/mblen" title="c/string/multibyte/mblen"> <span class="t-lines"><span>mblen</span></span></a></div> </td> <td> returns the number of bytes in the next multibyte character <br> <span class="t-mark">(function)</span>  </td>
</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/mbtowc" title="c/string/multibyte/mbtowc"> <span class="t-lines"><span>mbtowc</span></span></a></div> </td> <td> converts the next multibyte character to wide character <br> <span class="t-mark">(function)</span>  </td>
</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/wctomb" title="c/string/multibyte/wctomb"> <span class="t-lines"><span>wctomb</span><span>wctomb_s</span></span></a></div>
<div><span class="t-lines"><span><span class="t-mark-rev t-since-c11">(C11)</span></span></span></div> </td> <td> converts a wide character to its multibyte representation <br> <span class="t-mark">(function)</span>  </td>
</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/mbstowcs" title="c/string/multibyte/mbstowcs"> <span class="t-lines"><span>mbstowcs</span><span>mbstowcs_s</span></span></a></div>
<div><span class="t-lines"><span><span class="t-mark-rev t-since-c11">(C11)</span></span></span></div> </td> <td> converts a narrow multibyte character string to wide string <br> <span class="t-mark">(function)</span>  </td>
</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/wcstombs" title="c/string/multibyte/wcstombs"> <span class="t-lines"><span>wcstombs</span><span>wcstombs_s</span></span></a></div>
<div><span class="t-lines"><span><span class="t-mark-rev t-since-c11">(C11)</span></span></span></div> </td> <td> converts a wide string to narrow multibyte character string <br> <span class="t-mark">(function)</span>  </td>
</tr> <tr class="t-dsc-header"> <th colspan="2"> Defined in header <code>&lt;wchar.h&gt;</code>  </th>
</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/mbsinit" title="c/string/multibyte/mbsinit"> <span class="t-lines"><span>mbsinit</span></span></a></div>
<div><span class="t-lines"><span><span class="t-mark-rev t-since-c95">(C95)</span></span></span></div> </td> <td> checks if the mbstate_t object represents initial shift state <br> <span class="t-mark">(function)</span>  </td>
</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/btowc" title="c/string/multibyte/btowc"> <span class="t-lines"><span>btowc</span></span></a></div>
<div><span class="t-lines"><span><span class="t-mark-rev t-since-c95">(C95)</span></span></span></div> </td> <td> widens a single-byte narrow character to wide character, if possible <br> <span class="t-mark">(function)</span>  </td>
</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/wctob" title="c/string/multibyte/wctob"> <span class="t-lines"><span>wctob</span></span></a></div>
<div><span class="t-lines"><span><span class="t-mark-rev t-since-c95">(C95)</span></span></span></div> </td> <td> narrows a wide character to a single-byte narrow character, if possible <br> <span class="t-mark">(function)</span>  </td>
</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/mbrlen" title="c/string/multibyte/mbrlen"> <span class="t-lines"><span>mbrlen</span></span></a></div>
<div><span class="t-lines"><span><span class="t-mark-rev t-since-c95">(C95)</span></span></span></div> </td> <td> returns the number of bytes in the next multibyte character, given state <br> <span class="t-mark">(function)</span>  </td>
</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/mbrtowc" title="c/string/multibyte/mbrtowc"> <span class="t-lines"><span>mbrtowc</span></span></a></div>
<div><span class="t-lines"><span><span class="t-mark-rev t-since-c95">(C95)</span></span></span></div> </td> <td> converts the next multibyte character to wide character, given state <br> <span class="t-mark">(function)</span>  </td>
</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/wcrtomb" title="c/string/multibyte/wcrtomb"> <span class="t-lines"><span>wcrtomb</span><span>wcrtomb_s</span></span></a></div>
<div><span class="t-lines"><span><span class="t-mark-rev t-since-c95">(C95)</span></span><span><span class="t-mark-rev t-since-c11">(C11)</span></span></span></div> </td> <td> converts a wide character to its multibyte representation, given state <br> <span class="t-mark">(function)</span>  </td>
</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/mbsrtowcs" title="c/string/multibyte/mbsrtowcs"> <span class="t-lines"><span>mbsrtowcs</span><span>mbsrtowcs_s</span></span></a></div>
<div><span class="t-lines"><span><span class="t-mark-rev t-since-c95">(C95)</span></span><span><span class="t-mark-rev t-since-c11">(C11)</span></span></span></div> </td> <td> converts a narrow multibyte character string to wide string, given state <br> <span class="t-mark">(function)</span>  </td>
</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/wcsrtombs" title="c/string/multibyte/wcsrtombs"> <span class="t-lines"><span>wcsrtombs</span><span>wcsrtombs_s</span></span></a></div>
<div><span class="t-lines"><span><span class="t-mark-rev t-since-c95">(C95)</span></span><span><span class="t-mark-rev t-since-c11">(C11)</span></span></span></div> </td> <td> converts a wide string to narrow multibyte character string, given state <br> <span class="t-mark">(function)</span>  </td>
</tr> <tr class="t-dsc-header"> <th colspan="2"> Defined in header <code>&lt;uchar.h&gt;</code>  </th>
</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/mbrtoc8" title="c/string/multibyte/mbrtoc8"> <span class="t-lines"><span>mbrtoc8</span></span></a></div>
<div><span class="t-lines"><span><span class="t-mark-rev t-since-c23">(C23)</span></span></span></div> </td> <td> converts a narrow multibyte character to UTF-8 encoding <br> <span class="t-mark">(function)</span>  </td>
</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/c8rtomb" title="c/string/multibyte/c8rtomb"> <span class="t-lines"><span>c8rtomb</span></span></a></div>
<div><span class="t-lines"><span><span class="t-mark-rev t-since-c23">(C23)</span></span></span></div> </td> <td> converts UTF-8 string to narrow multibyte encoding <br> <span class="t-mark">(function)</span>  </td>
</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/mbrtoc16" title="c/string/multibyte/mbrtoc16"> <span class="t-lines"><span>mbrtoc16</span></span></a></div>
<div><span class="t-lines"><span><span class="t-mark-rev t-since-c11">(C11)</span></span></span></div> </td> <td> generates the next 16-bit wide character from a narrow multibyte string <br> <span class="t-mark">(function)</span>  </td>
</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/c16rtomb" title="c/string/multibyte/c16rtomb"> <span class="t-lines"><span>c16rtomb</span></span></a></div>
<div><span class="t-lines"><span><span class="t-mark-rev t-since-c11">(C11)</span></span></span></div> </td> <td> converts a 16-bit wide character to narrow multibyte string <br> <span class="t-mark">(function)</span>  </td>
</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/mbrtoc32" title="c/string/multibyte/mbrtoc32"> <span class="t-lines"><span>mbrtoc32</span></span></a></div>
<div><span class="t-lines"><span><span class="t-mark-rev t-since-c11">(C11)</span></span></span></div> </td> <td> generates the next 32-bit wide character from a narrow multibyte string <br> <span class="t-mark">(function)</span>  </td>
</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/c32rtomb" title="c/string/multibyte/c32rtomb"> <span class="t-lines"><span>c32rtomb</span></span></a></div>
<div><span class="t-lines"><span><span class="t-mark-rev t-since-c11">(C11)</span></span></span></div> </td> <td> converts a 32-bit wide character to narrow multibyte string <br> <span class="t-mark">(function)</span>  </td>
</tr> </table> <h3 id="Types"> Types</h3> <table class="t-dsc-begin"> <tr class="t-dsc-header"> <th colspan="2"> Defined in header <code>&lt;wchar.h&gt;</code>  </th>
</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/mbstate_t" title="c/string/multibyte/mbstate t"> <span class="t-lines"><span>mbstate_t</span></span></a></div>
<div><span class="t-lines"><span><span class="t-mark-rev t-since-c95">(C95)</span></span></span></div> </td> <td> conversion state information necessary to iterate multibyte character strings <br> <span class="t-mark">(type)</span>  </td>
</tr> <tr class="t-dsc-header"> <th colspan="2"> Defined in header <code>&lt;uchar.h&gt;</code>  </th>
</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/char8_t" title="c/string/multibyte/char8 t"> <span class="t-lines"><span>char8_t</span></span></a></div>
<div><span class="t-lines"><span><span class="t-mark-rev t-since-c23">(C23)</span></span></span></div> </td> <td> UTF-8 character type, an alias for <code>unsigned char</code> <br> <span class="t-mark">(typedef)</span>  </td>
</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/char16_t" title="c/string/multibyte/char16 t"> <span class="t-lines"><span>char16_t</span></span></a></div>
<div><span class="t-lines"><span><span class="t-mark-rev t-since-c11">(C11)</span></span></span></div> </td> <td> 16-bit wide character type <br> <span class="t-mark">(typedef)</span>  </td>
</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/char32_t" title="c/string/multibyte/char32 t"> <span class="t-lines"><span>char32_t</span></span></a></div>
<div><span class="t-lines"><span><span class="t-mark-rev t-since-c11">(C11)</span></span></span></div> </td> <td> 32-bit wide character type <br> <span class="t-mark">(typedef)</span>  </td>
</tr> </table> <h3 id="Macros"> Macros</h3> <table class="t-dsc-begin"> <tr class="t-dsc-header"> <th colspan="2"> Defined in header <code>&lt;limits.h&gt;</code>  </th>
</tr> <tr class="t-dsc"> <td> <div><span class="t-lines"><span>MB_LEN_MAX</span></span></div> </td> <td> maximum number of bytes in a multibyte character, for any supported locale <br> <span class="t-mark">(macro constant)</span>  </td>
</tr> <tr class="t-dsc-header"> <th colspan="2"> Defined in header <code>&lt;stdlib.h&gt;</code>  </th>
</tr> <tr class="t-dsc"> <td> <div><span class="t-lines"><span>MB_CUR_MAX</span></span></div> </td> <td> maximum number of bytes in a multibyte character, in the current locale<br><span class="t-mark">(macro variable)</span>  </td>
</tr> </table> <h3 id="References"> References</h3>  <ul>
<li> C11 standard (ISO/IEC 9899:2011): </li>
<ul>
<li> 7.10 Sizes of integer types &lt;limits.h&gt; (p: 222) </li>
<li> 7.22 General utilities &lt;stdlib.h&gt; (p: 340-360) </li>
<li> 7.28 Unicode utilities &lt;uchar.h&gt; (p: 398-401) </li>
<li> 7.29 Extended multibyte and wide character utilities &lt;wchar.h&gt; (p: 402-446) </li>
<li> 7.31.12 General utilities &lt;stdlib.h&gt; (p: 456) </li>
<li> 7.31.16 Extended multibyte and wide character utilities &lt;wchar.h&gt; (p: 456) </li>
<li> K.3.6 General utilities &lt;stdlib.h&gt; (p: 604-614) </li>
<li> K.3.9 Extended multibyte and wide character utilities &lt;wchar.h&gt; (p: 627-651) </li>
</ul>
<li> C99 standard (ISO/IEC 9899:1999): </li>
<ul>
<li> 7.10 Sizes of integer types &lt;limits.h&gt; (p: 203) </li>
<li> 7.20 General utilities &lt;stdlib.h&gt; (p: 306-324) </li>
<li> 7.24 Extended multibyte and wide character utilities &lt;wchar.h&gt; (p: 348-392) </li>
<li> 7.26.10 General utilities &lt;stdlib.h&gt; (p: 402) </li>
<li> 7.26.12 Extended multibyte and wide character utilities &lt;wchar.h&gt; (p: 402) </li>
</ul>
<li> C89/C90 standard (ISO/IEC 9899:1990): </li>
<ul>
<li> 4.1.4 Limits &lt;float.h&gt; and &lt;limits.h&gt; </li>
<li> 4.10 GENERAL UTILITIES &lt;stdlib.h&gt; </li>
<li> 4.13.7 General utilities &lt;stdlib.h&gt; </li>
</ul>
</ul>                      <h3 id="See_also"> See also</h3> <table class="t-dsc-begin"> <tr class="t-dsc"> <td colspan="2"> <span><a href="https://en.cppreference.com/w/cpp/string/multibyte" title="cpp/string/multibyte">C++ documentation</a></span> for <code>Null-terminated multibyte strings</code> </td>
</tr> </table>            <div class="_attribution">
  <p class="_attribution-p">
    &copy; cppreference.com<br>Licensed under the Creative Commons Attribution-ShareAlike Unported License v3.0.<br>
    <a href="https://en.cppreference.com/w/c/string/multibyte" class="_attribution-link">https://en.cppreference.com/w/c/string/multibyte</a>
  </p>
</div>