| author | Craig Jennings <c@cjennings.net> | 2024-04-07 13:41:34 -0500 |
|---|---|---|
| committer | Craig Jennings <c@cjennings.net> | 2024-04-07 13:41:34 -0500 |
| commit | 754bbf7a25a8dda49b5d08ef0d0443bbf5af0e36 (patch) | |
| tree | f1190704f78f04a2b0b4c977d20fe96a828377f1 /devdocs/c/string%2Fmultibyte.html | |
new repository
Diffstat (limited to 'devdocs/c/string%2Fmultibyte.html')
| -rw-r--r-- | devdocs/c/string%2Fmultibyte.html | 90 |
1 files changed, 90 insertions, 0 deletions
diff --git a/devdocs/c/string%2Fmultibyte.html b/devdocs/c/string%2Fmultibyte.html new file mode 100644 index 00000000..554581e6 --- /dev/null +++ b/devdocs/c/string%2Fmultibyte.html @@ -0,0 +1,90 @@ + <h1 id="firstHeading" class="firstHeading">Null-terminated multibyte strings</h1> <p>A null-terminated multibyte string (NTMBS), or "multibyte string", is a sequence of nonzero bytes followed by a byte with value zero (the terminating null character).</p> +<p>Each character stored in the string may occupy more than one byte. The encoding used to represent characters in a multibyte character string is locale-specific: it may be UTF-8, GB18030, EUC-JP, Shift-JIS, etc. For example, the char array <code>{'\xe4','\xbd','\xa0','\xe5','\xa5','\xbd','\0'}</code> is an NTMBS holding the string <code>"你好"</code> in UTF-8 multibyte encoding: the first three bytes encode the character 你, the next three bytes encode the character 好. The same string encoded in GB18030 is the char array <code>{'\xc4', '\xe3', '\xba', '\xc3', '\0'}</code>, where each of the two characters is encoded as a two-byte sequence.</p> +<p>In some multibyte encodings, any given multibyte character sequence may represent different characters depending on the previous byte sequences, known as "shift sequences". Such encodings are known as state-dependent: knowledge of the current shift state is required to interpret each character. An NTMBS is only valid if it begins and ends in the initial shift state: if a shift sequence was used, the corresponding unshift sequence has to be present before the terminating null character. Examples of such encodings are BOCU-1 and <a rel="nofollow" class="external text" href="http://www.unicode.org/reports/tr6">SCSU</a>.</p> +<p>A multibyte character string is layout-compatible with a <a href="byte" title="c/string/byte">null-terminated byte string</a> (NTBS), that is, it can be stored, copied, and examined using the same facilities, except for calculating the number of characters.
If the correct locale is in effect, I/O functions also handle multibyte strings. Multibyte strings can be converted to and from wide strings using the following locale-dependent conversion functions:</p> +<h3 id="Multibyte.2Fwide_character_conversions"> Multibyte/wide character conversions</h3> <table class="t-dsc-begin"> <tr class="t-dsc-header"> <th colspan="2"> Defined in header <code><stdlib.h></code> </th> +</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/mblen" title="c/string/multibyte/mblen"> <span class="t-lines"><span>mblen</span></span></a></div> </td> <td> returns the number of bytes in the next multibyte character <br> <span class="t-mark">(function)</span> </td> +</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/mbtowc" title="c/string/multibyte/mbtowc"> <span class="t-lines"><span>mbtowc</span></span></a></div> </td> <td> converts the next multibyte character to wide character <br> <span class="t-mark">(function)</span> </td> +</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/wctomb" title="c/string/multibyte/wctomb"> <span class="t-lines"><span>wctomb</span><span>wctomb_s</span></span></a></div> +<div><span class="t-lines"><span><span class="t-mark-rev t-since-c11">(C11)</span></span></span></div> </td> <td> converts a wide character to its multibyte representation <br> <span class="t-mark">(function)</span> </td> +</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/mbstowcs" title="c/string/multibyte/mbstowcs"> <span class="t-lines"><span>mbstowcs</span><span>mbstowcs_s</span></span></a></div> +<div><span class="t-lines"><span><span class="t-mark-rev t-since-c11">(C11)</span></span></span></div> </td> <td> converts a narrow multibyte character string to wide string <br> <span class="t-mark">(function)</span> </td> +</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/wcstombs" title="c/string/multibyte/wcstombs"> <span class="t-lines"><span>wcstombs</span><span>wcstombs_s</span></span></a></div> +<div><span 
class="t-lines"><span><span class="t-mark-rev t-since-c11">(C11)</span></span></span></div> </td> <td> converts a wide string to narrow multibyte character string <br> <span class="t-mark">(function)</span> </td> +</tr> <tr class="t-dsc-header"> <th colspan="2"> Defined in header <code><wchar.h></code> </th> +</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/mbsinit" title="c/string/multibyte/mbsinit"> <span class="t-lines"><span>mbsinit</span></span></a></div> +<div><span class="t-lines"><span><span class="t-mark-rev t-since-c95">(C95)</span></span></span></div> </td> <td> checks if the mbstate_t object represents initial shift state <br> <span class="t-mark">(function)</span> </td> +</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/btowc" title="c/string/multibyte/btowc"> <span class="t-lines"><span>btowc</span></span></a></div> +<div><span class="t-lines"><span><span class="t-mark-rev t-since-c95">(C95)</span></span></span></div> </td> <td> widens a single-byte narrow character to wide character, if possible <br> <span class="t-mark">(function)</span> </td> +</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/wctob" title="c/string/multibyte/wctob"> <span class="t-lines"><span>wctob</span></span></a></div> +<div><span class="t-lines"><span><span class="t-mark-rev t-since-c95">(C95)</span></span></span></div> </td> <td> narrows a wide character to a single-byte narrow character, if possible <br> <span class="t-mark">(function)</span> </td> +</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/mbrlen" title="c/string/multibyte/mbrlen"> <span class="t-lines"><span>mbrlen</span></span></a></div> +<div><span class="t-lines"><span><span class="t-mark-rev t-since-c95">(C95)</span></span></span></div> </td> <td> returns the number of bytes in the next multibyte character, given state <br> <span class="t-mark">(function)</span> </td> +</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/mbrtowc" title="c/string/multibyte/mbrtowc"> <span 
class="t-lines"><span>mbrtowc</span></span></a></div> +<div><span class="t-lines"><span><span class="t-mark-rev t-since-c95">(C95)</span></span></span></div> </td> <td> converts the next multibyte character to wide character, given state <br> <span class="t-mark">(function)</span> </td> +</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/wcrtomb" title="c/string/multibyte/wcrtomb"> <span class="t-lines"><span>wcrtomb</span><span>wcrtomb_s</span></span></a></div> +<div><span class="t-lines"><span><span class="t-mark-rev t-since-c95">(C95)</span></span><span><span class="t-mark-rev t-since-c11">(C11)</span></span></span></div> </td> <td> converts a wide character to its multibyte representation, given state <br> <span class="t-mark">(function)</span> </td> +</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/mbsrtowcs" title="c/string/multibyte/mbsrtowcs"> <span class="t-lines"><span>mbsrtowcs</span><span>mbsrtowcs_s</span></span></a></div> +<div><span class="t-lines"><span><span class="t-mark-rev t-since-c95">(C95)</span></span><span><span class="t-mark-rev t-since-c11">(C11)</span></span></span></div> </td> <td> converts a narrow multibyte character string to wide string, given state <br> <span class="t-mark">(function)</span> </td> +</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/wcsrtombs" title="c/string/multibyte/wcsrtombs"> <span class="t-lines"><span>wcsrtombs</span><span>wcsrtombs_s</span></span></a></div> +<div><span class="t-lines"><span><span class="t-mark-rev t-since-c95">(C95)</span></span><span><span class="t-mark-rev t-since-c11">(C11)</span></span></span></div> </td> <td> converts a wide string to narrow multibyte character string, given state <br> <span class="t-mark">(function)</span> </td> +</tr> <tr class="t-dsc-header"> <th colspan="2"> Defined in header <code><uchar.h></code> </th> +</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/mbrtoc8" title="c/string/multibyte/mbrtoc8"> <span 
class="t-lines"><span>mbrtoc8</span></span></a></div> +<div><span class="t-lines"><span><span class="t-mark-rev t-since-c23">(C23)</span></span></span></div> </td> <td> converts a narrow multibyte character to UTF-8 encoding <br> <span class="t-mark">(function)</span> </td> +</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/c8rtomb" title="c/string/multibyte/c8rtomb"> <span class="t-lines"><span>c8rtomb</span></span></a></div> +<div><span class="t-lines"><span><span class="t-mark-rev t-since-c23">(C23)</span></span></span></div> </td> <td> converts UTF-8 string to narrow multibyte encoding <br> <span class="t-mark">(function)</span> </td> +</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/mbrtoc16" title="c/string/multibyte/mbrtoc16"> <span class="t-lines"><span>mbrtoc16</span></span></a></div> +<div><span class="t-lines"><span><span class="t-mark-rev t-since-c11">(C11)</span></span></span></div> </td> <td> generates the next 16-bit wide character from a narrow multibyte string <br> <span class="t-mark">(function)</span> </td> +</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/c16rtomb" title="c/string/multibyte/c16rtomb"> <span class="t-lines"><span>c16rtomb</span></span></a></div> +<div><span class="t-lines"><span><span class="t-mark-rev t-since-c11">(C11)</span></span></span></div> </td> <td> converts a 16-bit wide character to narrow multibyte string <br> <span class="t-mark">(function)</span> </td> +</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/mbrtoc32" title="c/string/multibyte/mbrtoc32"> <span class="t-lines"><span>mbrtoc32</span></span></a></div> +<div><span class="t-lines"><span><span class="t-mark-rev t-since-c11">(C11)</span></span></span></div> </td> <td> generates the next 32-bit wide character from a narrow multibyte string <br> <span class="t-mark">(function)</span> </td> +</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/c32rtomb" title="c/string/multibyte/c32rtomb"> <span 
class="t-lines"><span>c32rtomb</span></span></a></div> +<div><span class="t-lines"><span><span class="t-mark-rev t-since-c11">(C11)</span></span></span></div> </td> <td> converts a 32-bit wide character to narrow multibyte string <br> <span class="t-mark">(function)</span> </td> +</tr> </table> <h3 id="Types"> Types</h3> <table class="t-dsc-begin"> <tr class="t-dsc-header"> <th colspan="2"> Defined in header <code><wchar.h></code> </th> +</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/mbstate_t" title="c/string/multibyte/mbstate t"> <span class="t-lines"><span>mbstate_t</span></span></a></div> +<div><span class="t-lines"><span><span class="t-mark-rev t-since-c95">(C95)</span></span></span></div> </td> <td> conversion state information necessary to iterate multibyte character strings <br> <span class="t-mark">(class)</span> </td> +</tr> <tr class="t-dsc-header"> <th colspan="2"> Defined in header <code><uchar.h></code> </th> +</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/char8_t" title="c/string/multibyte/char8 t"> <span class="t-lines"><span>char8_t</span></span></a></div> +<div><span class="t-lines"><span><span class="t-mark-rev t-since-c23">(C23)</span></span></span></div> </td> <td> UTF-8 character type, an alias for <code>unsigned char</code> <br> <span class="t-mark">(typedef)</span> </td> +</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/char16_t" title="c/string/multibyte/char16 t"> <span class="t-lines"><span>char16_t</span></span></a></div> +<div><span class="t-lines"><span><span class="t-mark-rev t-since-c11">(C11)</span></span></span></div> </td> <td> 16-bit wide character type <br> <span class="t-mark">(typedef)</span> </td> +</tr> <tr class="t-dsc"> <td> <div><a href="multibyte/char32_t" title="c/string/multibyte/char32 t"> <span class="t-lines"><span>char32_t</span></span></a></div> +<div><span class="t-lines"><span><span class="t-mark-rev t-since-c11">(C11)</span></span></span></div> </td> <td> 32-bit wide character type <br> <span 
class="t-mark">(typedef)</span> </td> +</tr> </table> <h3 id="Macros"> Macros</h3> <table class="t-dsc-begin"> <tr class="t-dsc-header"> <th colspan="2"> Defined in header <code><limits.h></code> </th> +</tr> <tr class="t-dsc"> <td> <div><span class="t-lines"><span>MB_LEN_MAX</span></span></div> </td> <td> maximum number of bytes in a multibyte character, for any supported locale <br> <span class="t-mark">(macro constant)</span> </td> +</tr> <tr class="t-dsc-header"> <th colspan="2"> Defined in header <code><stdlib.h></code> </th> +</tr> <tr class="t-dsc"> <td> <div><span class="t-lines"><span>MB_CUR_MAX</span></span></div> </td> <td> maximum number of bytes in a multibyte character, in the current locale<br><span class="t-mark">(macro variable)</span> </td> +</tr> </table> <h3 id="References"> References</h3> <ul> +<li> C11 standard (ISO/IEC 9899:2011): </li> +<ul> +<li> 7.10 Sizes of integer types <limits.h> (p: 222) </li> +<li> 7.22 General utilities <stdlib.h> (p: 340-360) </li> +<li> 7.28 Unicode utilities <uchar.h> (p: 398-401) </li> +<li> 7.29 Extended multibyte and wide character utilities <wchar.h> (p: 402-446) </li> +<li> 7.31.12 General utilities <stdlib.h> (p: 456) </li> +<li> 7.31.16 Extended multibyte and wide character utilities <wchar.h> (p: 456) </li> +<li> K.3.6 General utilities <stdlib.h> (p: 604-614) </li> +<li> K.3.9 Extended multibyte and wide character utilities <wchar.h> (p: 627-651) </li> +</ul> +<li> C99 standard (ISO/IEC 9899:1999): </li> +<ul> +<li> 7.10 Sizes of integer types <limits.h> (p: 203) </li> +<li> 7.20 General utilities <stdlib.h> (p: 306-324) </li> +<li> 7.24 Extended multibyte and wide character utilities <wchar.h> (p: 348-392) </li> +<li> 7.26.10 General utilities <stdlib.h> (p: 402) </li> +<li> 7.26.12 Extended multibyte and wide character utilities <wchar.h> (p: 402) </li> +</ul> +<li> C89/C90 standard (ISO/IEC 9899:1990): </li> +<ul> +<li> 4.1.4 Limits <float.h> and <limits.h> </li> +<li> 4.10 GENERAL UTILITIES 
<stdlib.h> </li> +<li> 4.13.7 General utilities <stdlib.h> </li> +</ul> +</ul> <h3 id="See_also"> See also</h3> <table class="t-dsc-begin"> <tr class="t-dsc"> <td colspan="2"> <span><a href="https://en.cppreference.com/w/cpp/string/multibyte" title="cpp/string/multibyte">C++ documentation</a></span> for <code>Null-terminated multibyte strings</code> </td> +</tr> </table> <div class="_attribution"> + <p class="_attribution-p"> + © cppreference.com<br>Licensed under the Creative Commons Attribution-ShareAlike Unported License v3.0.<br> + <a href="https://en.cppreference.com/w/c/string/multibyte" class="_attribution-link">https://en.cppreference.com/w/c/string/multibyte</a> + </p> +</div> |
