From 754bbf7a25a8dda49b5d08ef0d0443bbf5af0e36 Mon Sep 17 00:00:00 2001
From: Craig Jennings
Date: Sun, 7 Apr 2024 13:41:34 -0500
Subject: new repository

---
 devdocs/c/string%2Fmultibyte.html | 90 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 90 insertions(+)
 create mode 100644 devdocs/c/string%2Fmultibyte.html

diff --git a/devdocs/c/string%2Fmultibyte.html b/devdocs/c/string%2Fmultibyte.html
new file mode 100644
index 00000000..554581e6
--- /dev/null
+++ b/devdocs/c/string%2Fmultibyte.html
@@ -0,0 +1,90 @@

Null-terminated multibyte strings

A null-terminated multibyte string (NTMBS), or "multibyte string", is a sequence of nonzero bytes followed by a byte with value zero (the terminating null character).

Each character stored in the string may occupy more than one byte. The encoding used to represent characters in a multibyte character string is locale-specific: it may be UTF-8, GB18030, EUC-JP, Shift-JIS, etc. For example, the char array {'\xe4','\xbd','\xa0','\xe5','\xa5','\xbd','\0'} is an NTMBS holding the string "你好" in UTF-8 multibyte encoding: the first three bytes encode the character 你, the next three bytes encode the character 好. The same string encoded in GB18030 is the char array {'\xc4', '\xe3', '\xba', '\xc3', '\0'}, where each of the two characters is encoded as a two-byte sequence.
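The difference between bytes and characters in the UTF-8 example above can be sketched in a few lines of C. The names utf8_char_count and ni_hao_utf8 are hypothetical, and the counter assumes valid UTF-8 input; it inspects the bytes directly and does not consult the current locale.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts UTF-8 encoded characters by skipping
   continuation bytes (bit pattern 10xxxxxx).
   Assumption: s holds valid UTF-8; no locale is consulted. */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; ++s)
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    return count;
}

/* The NTMBS from the text: the two characters of "你好" in UTF-8. */
const char ni_hao_utf8[] = {'\xe4', '\xbd', '\xa0', '\xe5', '\xa5', '\xbd', '\0'};
```

For this array, strlen reports the six bytes before the terminating null character, while the helper counts two characters.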

In some multibyte encodings, any given multibyte character sequence may represent different characters depending on the previous byte sequences, known as "shift sequences". Such encodings are known as state-dependent: knowledge of the current shift state is required to interpret each character. An NTMBS is only valid if it begins and ends in the initial shift state: if a shift sequence was used, the corresponding unshift sequence has to be present before the terminating null character. Examples of such encodings are BOCU-1 and SCSU.
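The validity rule above (a string must finish in the initial shift state) can be checked with mbrtowc and mbsinit. The following is a minimal sketch: ends_in_initial_state is a hypothetical name, and the result depends on whichever locale is currently in effect.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if s decodes cleanly in the current locale
   and finishes in the initial shift state, 0 otherwise. */
int ends_in_initial_state(const char *s)
{
    mbstate_t st;
    size_t left = strlen(s);

    memset(&st, 0, sizeof st);   /* a zeroed mbstate_t is the initial state */

    while (left > 0) {
        /* A null first argument decodes without storing the wide character. */
        size_t n = mbrtowc(NULL, s, left, &st);
        if (n == (size_t)-1 || n == (size_t)-2)
            return 0;            /* invalid or truncated sequence */
        if (n == 0)
            break;               /* decoded a null character; stop */
        s += n;
        left -= n;
    }
    return mbsinit(&st) != 0;
}
```

In a stateless encoding every well-formed string passes this check; it only becomes interesting for state-dependent encodings, where a missing unshift sequence leaves the state object outside the initial state.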

A multibyte character string is layout-compatible with a null-terminated byte string (NTBS); that is, it can be stored, copied, and examined using the same facilities, except for calculating the number of characters. If the correct locale is in effect, I/O functions also handle multibyte strings. Multibyte strings can be converted to and from wide strings using the following locale-dependent conversion functions:
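As an illustration of the conversion to wide strings, here is a minimal sketch built on mbsrtowcs. The helper name mb_to_wide is hypothetical, the conversion uses whatever locale is currently in effect, and the caller is assumed to free the result.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: converts the multibyte string s to a newly allocated
   wide string using the current locale. Returns NULL on an invalid sequence
   or allocation failure. The caller frees the result. */
wchar_t *mb_to_wide(const char *s)
{
    mbstate_t st;
    const char *p = s;
    size_t n;
    wchar_t *w;

    /* First pass: with a null destination, mbsrtowcs only reports the length. */
    memset(&st, 0, sizeof st);
    n = mbsrtowcs(NULL, &p, 0, &st);
    if (n == (size_t)-1)
        return NULL;

    w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return NULL;

    /* Second pass: convert, including the terminating null wide character. */
    p = s;
    memset(&st, 0, sizeof st);
    if (mbsrtowcs(w, &p, n + 1, &st) == (size_t)-1) {
        free(w);
        return NULL;
    }
    return w;
}
```

The length returned by the first pass is the number of characters, which is the one quantity the byte-string facilities cannot provide for a multibyte string.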

Multibyte/wide character conversions

Defined in header <stdlib.h>
mblen
returns the number of bytes in the next multibyte character
(function)
mbtowc
converts the next multibyte character to wide character
(function)
wctomb, wctomb_s (C11)
converts a wide character to its multibyte representation
(function)
mbstowcs, mbstowcs_s (C11)
converts a narrow multibyte character string to wide string
(function)
wcstombs, wcstombs_s (C11)
converts a wide string to narrow multibyte character string
(function)
Defined in header <wchar.h>
mbsinit (C95)
checks if the mbstate_t object represents initial shift state
(function)
btowc (C95)
widens a single-byte narrow character to wide character, if possible
(function)
wctob (C95)
narrows a wide character to a single-byte narrow character, if possible
(function)
mbrlen (C95)
returns the number of bytes in the next multibyte character, given state
(function)
mbrtowc (C95)
converts the next multibyte character to wide character, given state
(function)
wcrtomb (C95), wcrtomb_s (C11)
converts a wide character to its multibyte representation, given state
(function)
mbsrtowcs (C95), mbsrtowcs_s (C11)
converts a narrow multibyte character string to wide string, given state
(function)
wcsrtombs (C95), wcsrtombs_s (C11)
converts a wide string to narrow multibyte character string, given state
(function)
Defined in header <uchar.h>
mbrtoc8 (C23)
converts a narrow multibyte character to UTF-8 encoding
(function)
c8rtomb (C23)
converts UTF-8 string to narrow multibyte encoding
(function)
mbrtoc16 (C11)
generates the next 16-bit wide character from a narrow multibyte string
(function)
c16rtomb (C11)
converts a 16-bit wide character to narrow multibyte string
(function)
mbrtoc32 (C11)
generates the next 32-bit wide character from a narrow multibyte string
(function)
c32rtomb (C11)
converts a 32-bit wide character to narrow multibyte string
(function)
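To show how the <stdlib.h> conversions pair up, the sketch below encodes a wide character with wctomb and decodes it again with mbtowc. The name roundtrips is hypothetical. Both functions keep hidden conversion state, so the sketch resets it first; the outcome depends on the current locale.

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: returns 1 if wc survives a wctomb -> mbtowc round trip
   in the current locale, 0 otherwise. */
int roundtrips(wchar_t wc)
{
    char buf[MB_LEN_MAX];        /* large enough for any supported locale */
    wchar_t back = 0;
    int n, m;

    /* A null string pointer resets the internal conversion state. */
    (void)wctomb(NULL, 0);
    (void)mbtowc(NULL, NULL, 0);

    n = wctomb(buf, wc);         /* number of bytes written, or -1 */
    if (n < 0)
        return 0;

    m = mbtowc(&back, buf, (size_t)n);
    if (m < 0)
        return 0;

    return back == wc;
}
```

Because of that hidden state, these functions are not safe to interleave across strings or threads; the <wchar.h> variants that take an explicit mbstate_t avoid the problem.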

Types

Defined in header <wchar.h>
mbstate_t (C95)
conversion state information necessary to iterate multibyte character strings
(type)
Defined in header <uchar.h>
char8_t (C23)
UTF-8 character type, an alias for unsigned char
(typedef)
char16_t (C11)
16-bit wide character type
(typedef)
char32_t (C11)
32-bit wide character type
(typedef)
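The sketch below shows mbstate_t and char32_t working together: it walks a multibyte string with mbrtoc32 and stores each resulting 32-bit character. The name decode_c32 is hypothetical, and the example assumes an implementation that ships <uchar.h>. The values produced depend on the current locale and on how the implementation encodes char32_t.

```c
#include <string.h>
#include <uchar.h>

/* Hypothetical helper: decodes up to cap characters of s into out using the
   current locale. Returns the number stored, or (size_t)-1 on an invalid or
   truncated sequence. */
size_t decode_c32(const char *s, char32_t *out, size_t cap)
{
    mbstate_t st;
    size_t left = strlen(s);
    size_t count = 0;

    memset(&st, 0, sizeof st);   /* start in the initial conversion state */

    while (left > 0 && count < cap) {
        char32_t c;
        size_t n = mbrtoc32(&c, s, left, &st);

        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;   /* invalid or incomplete input */
        if (n == (size_t)-3) {   /* output produced without consuming bytes */
            out[count++] = c;
            continue;
        }
        if (n == 0)
            break;               /* decoded the null character */

        out[count++] = c;
        s += n;
        left -= n;
    }
    return count;
}
```

The state object is what lets the loop resume exactly where the previous call stopped, which matters for state-dependent encodings.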

Macros

Defined in header <limits.h>
MB_LEN_MAX
maximum number of bytes in a multibyte character, for any supported locale
(macro constant)
Defined in header <stdlib.h>
MB_CUR_MAX
maximum number of bytes in a multibyte character, in the current locale
(macro variable)
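A common use of the two macros is buffer sizing: MB_LEN_MAX is a compile-time constant suitable for array bounds, while MB_CUR_MAX is evaluated at run time for the current locale. The following is a minimal sketch with a hypothetical helper named encode_one.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: encodes wc into out, which must provide MB_LEN_MAX
   bytes. Returns the number of bytes written, or 0 if wc cannot be
   represented in the current locale. */
size_t encode_one(wchar_t wc, char out[MB_LEN_MAX])
{
    mbstate_t st;
    size_t n;

    memset(&st, 0, sizeof st);
    n = wcrtomb(out, wc, &st);   /* writes at most MB_CUR_MAX bytes */
    return n == (size_t)-1 ? 0 : n;
}
```

Since MB_CUR_MAX never exceeds MB_LEN_MAX, a buffer declared with MB_LEN_MAX elements is large enough whichever locale is selected later.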

See also

C++ documentation for Null-terminated multibyte strings

© cppreference.com
Licensed under the Creative Commons Attribution-ShareAlike Unported License v3.0.
https://en.cppreference.com/w/c/string/multibyte
