author Craig Jennings <c@cjennings.net> 2024-04-07 13:41:34 -0500
committer Craig Jennings <c@cjennings.net> 2024-04-07 13:41:34 -0500
commit 754bbf7a25a8dda49b5d08ef0d0443bbf5af0e36 (patch)
tree f1190704f78f04a2b0b4c977d20fe96a828377f1 /devdocs/c/language%2Ftranslation_phases.html
new repository
Diffstat (limited to 'devdocs/c/language%2Ftranslation_phases.html')
-rw-r--r-- devdocs/c/language%2Ftranslation_phases.html 105
1 files changed, 105 insertions, 0 deletions
diff --git a/devdocs/c/language%2Ftranslation_phases.html b/devdocs/c/language%2Ftranslation_phases.html
new file mode 100644
index 00000000..2f629e71
--- /dev/null
+++ b/devdocs/c/language%2Ftranslation_phases.html
@@ -0,0 +1,105 @@
+ <h1 id="firstHeading" class="firstHeading">Phases of translation</h1> <p>The C source file is processed by the compiler <i>as if</i> the following phases take place, in this exact order. An actual implementation may combine these actions or process them differently as long as the behavior is the same.</p>
+<h3 id="Phase_1"> Phase 1</h3> <div class="t-li1">
+<span class="t-li">1)</span> The individual bytes of the source code file (which is generally a text file in some multibyte encoding such as UTF-8) are mapped, in an implementation-defined manner, to the characters of the <i>source character set</i>. In particular, OS-dependent end-of-line indicators are replaced by newline characters. The <i>source character set</i> is a multibyte character set that includes the <i>basic source character set</i> as a single-byte subset. The <i>basic source character set</i> consists of the following 96 characters:</div> <div class="t-li2">
+<span class="t-li">a)</span> 5 whitespace characters (space, horizontal tab, vertical tab, form feed, new-line)</div> <div class="t-li2">
+<span class="t-li">b)</span> 10 digit characters from <code>'0'</code> to <code>'9'</code>
+</div> <div class="t-li2">
+<span class="t-li">c)</span> 52 letters from <code>'a'</code> to <code>'z'</code> and from <code>'A'</code> to <code>'Z'</code>
+</div> <div class="t-li2">
+<span class="t-li">d)</span> 29 punctuation characters: <code>_ { } [ ] # ( ) &lt; &gt; % : ; . ? * + - / ^ &amp; | ~ ! = , \ " '</code>
+</div> <div class="t-li1">
+<span class="t-li">2)</span> <span class="t-rev-inl t-until-c23"><span><a href="operator_alternative" title="c/language/operator alternative">Trigraph sequences</a> are replaced by corresponding single-character representations.</span><span><span class="t-mark-rev t-until-c23">(until C23)</span></span></span>
+</div> <h3 id="Phase_2"> Phase 2</h3> <div class="t-li1">
+<span class="t-li">1)</span> Whenever a backslash appears at the end of a line (immediately followed by the newline character), both the backslash and the newline are deleted, combining two physical source lines into one logical source line. This is a single-pass operation: a line ending in two backslashes followed by an empty line does not combine three lines into one. <div class="t-example"> <div class="c source-c"><pre data-language="c">#include &lt;stdio.h&gt;
+
+#define PUTS p\
+u\
+t\
+s
+/* Line splicing is in phase 2 while macros
+ * are tokenized in phase 3 and expanded in phase 4,
+ * so the above is equivalent to #define PUTS puts
+ */
+
+int main(void)
+{
+ /* Use line splicing to call puts */ PUT\
+S\
+("Output ends here\\
+0Not printed" /* After line splicing, the remaining backslash
+ * escapes the 0, ending the string early.
+ */
+);
+}</pre></div> </div>
+</div> <div class="t-li1">
+<span class="t-li">2)</span> If a non-empty source file does not end with a newline character after this step (whether it had no newline originally, or it ended with a backslash), the behavior is undefined.</div> <h3 id="Phase_3"> Phase 3</h3> <div class="t-li1">
+<span class="t-li">1)</span> The source file is decomposed into <a href="../comment" title="c/comment">comments</a>, sequences of whitespace characters (space, horizontal tab, new-line, vertical tab, and form-feed), and <i>preprocessing tokens</i>, which are the following:</div> <div class="t-li2">
+<span class="t-li">a)</span> header names: <code>&lt;stdio.h&gt;</code> or <code>"myfile.h"</code>
+</div> <div class="t-li2">
+<span class="t-li">b)</span> <a href="identifier" title="c/language/identifier">identifiers</a>
+</div> <div class="t-li2">
+<span class="t-li">c)</span> preprocessing numbers, which cover <a href="integer_constant" title="c/language/integer constant">integer constants</a> and <a href="floating_constant" title="c/language/floating constant">floating constants</a>, but also cover some invalid tokens such as <code>1..E+3.foo</code> or <code>0JBK</code>
+</div> <div class="t-li2">
+<span class="t-li">d)</span> <a href="character_constant" title="c/language/character constant">character constants</a> and <a href="string_literal" title="c/language/string literal">string literals</a>
+</div> <div class="t-li2">
+<span class="t-li">e)</span> operators and punctuators, such as <code>+</code>, <code>&lt;&lt;=</code>, <code>&lt;%</code>, or <code>##</code>.</div> <div class="t-li2">
+<span class="t-li">f)</span> individual non-whitespace characters that do not fit in any other category</div> <div class="t-li1">
+<span class="t-li">2)</span> Each comment is replaced by one space character.</div> <div class="t-li1">
+<span class="t-li">3)</span> Newlines are kept, and it's implementation-defined whether non-newline whitespace sequences may be collapsed into single space characters.</div> <p> If the input has been parsed into preprocessing tokens up to a given character, the next preprocessing token is generally taken to be the longest sequence of characters that could constitute a preprocessing token, even if that would cause subsequent analysis to fail. This is commonly known as <i>maximal munch</i>.</p>
+<div class="c source-c"><pre data-language="c">int foo = 1;
+// int bar = 0xE+foo; // error: 0xE+foo is one preprocessing number, not a valid constant
+int bar = 0xE/*Comment expands to a space*/+foo; // OK: 0xE + foo
+int baz = 0xE + foo; // OK: 0xE + foo
+int pub = bar+++baz; // OK: bar++ + baz
+int ham = bar++-++baz; // OK: bar++ - ++baz
+// int qux = bar+++++baz; // error: bar++ ++ +baz, not bar++ + ++baz
+int qux = bar+++/*Saving comment*/++baz; // OK: bar++ + ++baz</pre></div> <p>The sole exception to the maximal munch rule is:</p>
+<ul><li> Header name preprocessing tokens are only formed within a <a href="../preprocessor/include" title="c/preprocessor/include"><code> #include</code></a> <span class="t-rev-inl t-since-c23"><span>or <a href="../preprocessor/embed" title="c/preprocessor/embed"><code> #embed</code></a></span><span><span class="t-mark-rev t-since-c23">(since C23)</span></span></span> directive, <span class="t-rev-inl t-since-c23"><span>in <a href="../preprocessor/include" title="c/preprocessor/include"><code>__has_include</code></a> and <a href="../preprocessor/embed" title="c/preprocessor/embed"><code>__has_embed</code></a> expressions</span><span><span class="t-mark-rev t-since-c23">(since C23)</span></span></span> and in implementation-defined locations within a <a href="../preprocessor/impl" title="c/preprocessor/impl"><code> #pragma</code></a> directive. </li></ul> <div class="c source-c"><pre data-language="c">#define MACRO_1 1
+#define MACRO_2 2
+#define MACRO_3 3
+#define MACRO_EXPR (MACRO_1 &lt;MACRO_2&gt; MACRO_3) // OK: &lt;MACRO_2&gt; is not a header-name</pre></div> <h3 id="Phase_4"> Phase 4</h3> <div class="t-li1">
+<span class="t-li">1)</span> The <a href="../preprocessor" title="c/preprocessor">preprocessor</a> is executed.</div> <div class="t-li1">
+<span class="t-li">2)</span> Each file introduced with the <a href="../preprocessor/include" title="c/preprocessor/include">#include</a> directive goes through phases 1 through 4, recursively.</div> <div class="t-li1">
+<span class="t-li">3)</span> At the end of this phase, all preprocessor directives are removed from the source.</div> <h3 id="Phase_5"> Phase 5</h3> <div class="t-li1">
+<span class="t-li">1)</span> All characters and <a href="escape" title="c/language/escape">escape sequences</a> in <a href="character_constant" title="c/language/character constant">character constants</a> and <a href="string_literal" title="c/language/string literal">string literals</a> are converted from the <i>source character set</i> to the <i>execution character set</i> (which may be a multibyte character set such as UTF-8, as long as all 96 characters from the <i>basic source character set</i> listed in phase 1 have single-byte representations). If the character specified by an escape sequence isn't a member of the execution character set, the result is implementation-defined, but is guaranteed not to be a null (wide) character.</div> <p>Note: the conversion performed at this stage can be controlled by command-line options in some implementations: GCC and Clang use <code>-finput-charset</code> to specify the encoding of the source character set, and <code>-fexec-charset</code> and <code>-fwide-exec-charset</code> to specify the encodings of the execution character set in the string literals and character constants <span class="t-rev-inl t-since-c11"><span>that don't have an encoding prefix</span><span><span class="t-mark-rev t-since-c11">(since C11)</span></span></span>.</p>
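+<p>The effect of phase 5 can be observed at translation time. The following is a minimal sketch, not part of the original page: the array names and the two helper functions are hypothetical, and <code>static_assert</code> is assumed to be available (a macro from <code>&lt;assert.h&gt;</code> in C11 and C17, a keyword in C23). It checks that each escape sequence becomes exactly one character of the literal.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Between the quotes the source spelling has 7 characters: a \ t b \ \ n.
 * Phase 5 turns \t into one tab and \\ into one backslash, leaving 5. */
static const char converted[] = "a\tb\\n";
static_assert(sizeof converted == 6, "5 characters plus the terminating null");

/* Numeric escapes name a value directly, so octal 101 and hex 41 agree
 * regardless of the execution character set. */
static_assert('\101' == '\x41', "both escapes denote the value 65");

/* \0 is an ordinary escape sequence: it embeds a null character, so
 * strlen stops early even though the array holds the whole literal. */
static const char embedded[] = "ends\0here";
static_assert(sizeof embedded == 10, "9 characters plus the terminating null");

/* Hypothetical helpers for illustration (names are not from the page). */
size_t converted_length(void) { return strlen(converted); }
size_t embedded_length(void)  { return strlen(embedded); }
```

+<p>Whether <code>'\x41'</code> also compares equal to <code>'A'</code> depends on the execution character set, so the sketch deliberately does not assert it.</p>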
+<h3 id="Phase_6"> Phase 6</h3> <p>Adjacent <a href="string_literal" title="c/language/string literal">string literals</a> are concatenated.</p>
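+<p>Because concatenation happens after the escape sequences have been converted in phase 5, splitting a literal can change how an escape is read. The sketch below is illustrative only: the identifiers are hypothetical, and <code>static_assert</code> is assumed to be available (a macro from <code>&lt;assert.h&gt;</code> in C11 and C17, a keyword in C23).</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Phase 6 joins the three adjacent literals into a single string literal. */
static const char greeting[] = "Hello, " "phases " "of translation";
static_assert(sizeof greeting == sizeof "Hello, phases of translation",
              "same size as the single-literal spelling");

/* The escape \x1 is converted in phase 5, before the literals are joined,
 * so the result is the two characters '\x01' and '2', not the single
 * character '\x12'. */
static const char split_escape[] = "\x1" "2";
static_assert(sizeof split_escape == 3, "two characters plus the terminating null");

/* Hypothetical helpers for illustration (names are not from the page). */
int greeting_is_joined(void)
{
    return strcmp(greeting, "Hello, phases of translation") == 0;
}

size_t split_escape_length(void)
{
    return strlen(split_escape);
}
```

+<p>Splitting a literal this way is a common technique for ending a hexadecimal escape sequence that would otherwise absorb the following hexadecimal digits.</p>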
+<h3 id="Phase_7"> Phase 7</h3> <p>Compilation takes place: the tokens are syntactically and semantically analyzed and translated as a translation unit.</p>
+<h3 id="Phase_8"> Phase 8</h3> <p>Linking takes place: translation units and library components needed to satisfy external references are collected into a program image that contains the information needed for execution in its execution environment (the OS).</p>
+<h3 id="References"> References</h3> <ul>
+<li> C23 standard (ISO/IEC 9899:2024): </li>
+<ul>
+<li> 5.1.1.2 Translation phases (p: TBD) </li>
+<li> 5.2.1 Character sets (p: TBD) </li>
+<li> 6.4 Lexical elements (p: TBD) </li>
+</ul>
+<li> C17 standard (ISO/IEC 9899:2018): </li>
+<ul>
+<li> 5.1.1.2 Translation phases (p: 9-10) </li>
+<li> 5.2.1 Character sets (p: 17) </li>
+<li> 6.4 Lexical elements (p: 41-54) </li>
+</ul>
+<li> C11 standard (ISO/IEC 9899:2011): </li>
+<ul>
+<li> 5.1.1.2 Translation phases (p: 10-11) </li>
+<li> 5.2.1 Character sets (p: 22-24) </li>
+<li> 6.4 Lexical elements (p: 57-75) </li>
+</ul>
+<li> C99 standard (ISO/IEC 9899:1999): </li>
+<ul>
+<li> 5.1.1.2 Translation phases (p: 9-10) </li>
+<li> 5.2.1 Character sets (p: 17-19) </li>
+<li> 6.4 Lexical elements (p: 49-66) </li>
+</ul>
+<li> C89/C90 standard (ISO/IEC 9899:1990): </li>
+<ul>
+<li> 2.1.1.2 Translation phases </li>
+<li> 2.2.1 Character sets </li>
+<li> 3.1 Lexical elements </li>
+</ul>
+</ul> <h3 id="See_also"> See also</h3> <table class="t-dsc-begin"> <tr class="t-dsc"> <td colspan="2"> <span><a href="https://en.cppreference.com/w/cpp/language/translation_phases" title="cpp/language/translation phases">C++ documentation</a></span> for <span class=""><span>Phases of translation</span></span> </td>
+</tr> </table> <div class="_attribution">
+ <p class="_attribution-p">
+ &copy; cppreference.com<br>Licensed under the Creative Commons Attribution-ShareAlike Unported License v3.0.<br>
+ <a href="https://en.cppreference.com/w/c/language/translation_phases" class="_attribution-link">https://en.cppreference.com/w/c/language/translation_phases</a>
+ </p>
+</div>