# tokenize — Tokenizer for Python source

**Source code:** [Lib/tokenize.py](https://github.com/python/cpython/tree/3.12/Lib/tokenize.py)

The `tokenize` module provides a lexical scanner for Python source code, implemented in Python. The scanner in this module returns comments as tokens as well, making it useful for implementing "pretty-printers", including colorizers for on-screen displays.

To simplify token stream handling, all operator and delimiter tokens and `Ellipsis` are returned using the generic `OP` token type. The exact type can be determined by checking the `exact_type` property on the named tuple returned from `tokenize.tokenize()`.

> **Warning:** The functions in this module are only designed to parse syntactically valid Python code (code that does not raise when parsed using `ast.parse()`). The behavior of the functions in this module is **undefined** when providing invalid Python code and it can change at any point.

## Tokenizing Input

The primary entry point is a generator:

`tokenize.tokenize(readline)`

The `tokenize()` generator requires one argument, *readline*, which must be a callable object that provides the same interface as the `io.IOBase.readline()` method of file objects. Each call to the function should return one line of input as bytes.

The generator produces 5-tuples with these members: the token type; the token string; a 2-tuple `(srow, scol)` of ints specifying the row and column where the token begins in the source; a 2-tuple `(erow, ecol)` of ints specifying the row and column where the token ends in the source; and the line on which the token was found. The line passed (the last tuple item) is the *physical* line. The 5-tuple is returned as a named tuple with the field names `type string start end line`.

The returned named tuple has an additional property named `exact_type` that contains the exact operator type for `OP` tokens. For all other token types, `exact_type` equals the named tuple `type` field.

*Changed in version 3.1:* Added support for named tuples.

*Changed in version 3.3:* Added support for `exact_type`.

`tokenize()` determines the source encoding of the file by looking for a UTF-8 BOM or encoding cookie, according to PEP 263.
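As a quick sketch of the interface described above (not part of the upstream documentation; the `source` literal is a made-up snippet for illustration), the following tokenizes an in-memory bytes object and prints both the generic and the exact type of each token:

```python
from io import BytesIO
from tokenize import tokenize
from token import tok_name

# Made-up example source, encoded to bytes as tokenize() expects.
source = b"x = (1 + 2) * 3\n"

for tok in tokenize(BytesIO(source).readline):
    # For operators and delimiters tok.type is the generic OP,
    # while tok.exact_type distinguishes EQUAL, LPAR, PLUS, ...
    print(tok_name[tok.type], tok_name[tok.exact_type], repr(tok.string))
```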
`tokenize.generate_tokens(readline)`

Tokenize a source reading unicode strings instead of bytes.

Like `tokenize()`, the *readline* argument is a callable returning a single line of input. However, `generate_tokens()` expects *readline* to return a str object rather than bytes.

The result is an iterator yielding named tuples, exactly like `tokenize()`. It does not yield an `ENCODING` token.

All constants from the `token` module are also exported from `tokenize`.

Another function is provided to reverse the tokenization process. This is useful for creating tools that tokenize a script, modify the token stream, and write back the modified script.

`tokenize.untokenize(iterable)`

Converts tokens back into Python source code. The *iterable* must return sequences with at least two elements, the token type and the token string. Any additional sequence elements are ignored.

The reconstructed script is returned as a single string. The result is guaranteed to tokenize back to match the input so that the conversion is lossless and round-trips are assured. The guarantee applies only to the token type and token string, as the spacing between tokens (column positions) may change.

It returns bytes, encoded using the `ENCODING` token, which is the first token sequence output by `tokenize()`. If there is no encoding token in the input, it returns a str instead.
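A minimal round-trip sketch of the guarantee above (an illustration, not an excerpt from the library documentation; the one-line `source` is made up):

```python
from io import BytesIO
from tokenize import tokenize, untokenize

source = b"spam = {'eggs': 42}\n"   # hypothetical input

tokens = list(tokenize(BytesIO(source).readline))
rebuilt = untokenize(tokens)        # bytes, because an ENCODING token is present

# Token types and strings survive the round trip, even though exact
# spacing between tokens is not guaranteed to be preserved.
assert [t[:2] for t in tokenize(BytesIO(rebuilt).readline)] == [t[:2] for t in tokens]
```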
`tokenize()` needs to detect the encoding of source files it tokenizes. The function it uses to do this is available:

`tokenize.detect_encoding(readline)`

The `detect_encoding()` function is used to detect the encoding that should be used to decode a Python source file. It requires one argument, *readline*, in the same way as the `tokenize()` generator.

It will call *readline* a maximum of twice, and return the encoding used (as a string) and a list of any lines (not decoded from bytes) it has read in.

It detects the encoding from the presence of a UTF-8 BOM or an encoding cookie as specified in PEP 263. If both a BOM and a cookie are present, but disagree, a `SyntaxError` will be raised. Note that if the BOM is found, `'utf-8-sig'` will be returned as an encoding.

If no encoding is specified, then the default of `'utf-8'` will be returned.
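For example, a small sketch of calling it on a source file; it assumes the `hello.py` file from the Examples section below exists on disk:

```python
import tokenize

with open('hello.py', 'rb') as f:
    encoding, lines = tokenize.detect_encoding(f.readline)

print(encoding)   # 'utf-8' here, since the file has no BOM or coding cookie
print(lines)      # the raw bytes lines (at most two) read while detecting
```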
Use `tokenize.open()` to open Python source files: it uses `detect_encoding()` to detect the file encoding.

`tokenize.open(filename)`

Open a file in read-only mode using the encoding detected by `detect_encoding()`.

*New in version 3.2.*

`exception tokenize.TokenError`

Raised when either a docstring or expression that may be split over several lines is not completed anywhere in the file, for example:

```python
"""Beginning of
docstring
```

or:

```python
[1,
 2,
 3
```
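A short sketch of catching this exception, reusing the unterminated list from the example above (behavior on invalid input is undefined in general, so treat this only as an illustration):

```python
from io import BytesIO
from tokenize import tokenize, TokenError

bad_source = b"[1,\n 2,\n 3\n"   # multi-line expression that is never closed

try:
    list(tokenize(BytesIO(bad_source).readline))
except TokenError as exc:
    print("incomplete input:", exc)
```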
## Command-Line Usage

*New in version 3.3.*

The `tokenize` module can be executed as a script from the command line. It is as simple as:

```sh
python -m tokenize [-e] [filename.py]
```

The following options are accepted:

- `-h, --help`: show this help message and exit
- `-e, --exact`: display token names using the exact type

If `filename.py` is specified, its contents are tokenized to stdout. Otherwise, tokenization is performed on stdin.

## Examples

Example of a script rewriter that transforms float literals into Decimal objects:

```python
from tokenize import tokenize, untokenize, NUMBER, STRING, NAME, OP
from io import BytesIO

def decistmt(s):
    """Substitute Decimals for floats in a string of statements.

    >>> from decimal import Decimal
    >>> s = 'print(+21.3e-5*-.1234/81.7)'
    >>> decistmt(s)
    "print (+Decimal ('21.3e-5')*-Decimal ('.1234')/Decimal ('81.7'))"

    The format of the exponent is inherited from the platform C library.
    Known cases are "e-007" (Windows) and "e-07" (not Windows). Since
    we're only showing 12 digits, and the 13th isn't close to 5, the
    rest of the output should be platform-independent.

    >>> exec(s) #doctest: +ELLIPSIS
    -3.21716034272e-0...7

    Output from calculations with Decimal should be identical across all
    platforms.

    >>> exec(decistmt(s))
    -3.217160342717258261933904529E-7
    """
    result = []
    g = tokenize(BytesIO(s.encode('utf-8')).readline)  # tokenize the string
    for toknum, tokval, _, _, _ in g:
        if toknum == NUMBER and '.' in tokval:  # replace NUMBER tokens
            result.extend([
                (NAME, 'Decimal'),
                (OP, '('),
                (STRING, repr(tokval)),
                (OP, ')')
            ])
        else:
            result.append((toknum, tokval))
    return untokenize(result).decode('utf-8')
```

Example of tokenizing from the command line. The script:

```python
def say_hello():
    print("Hello, World!")

say_hello()
```

will be tokenized to the following output, where the first column is the range of the line/column coordinates where the token is found, the second column is the name of the token, and the final column is the value of the token (if any):

```shell
$ python -m tokenize hello.py
0,0-0,0:            ENCODING       'utf-8'
1,0-1,3:            NAME           'def'
1,4-1,13:           NAME           'say_hello'
1,13-1,14:          OP             '('
1,14-1,15:          OP             ')'
1,15-1,16:          OP             ':'
1,16-1,17:          NEWLINE        '\n'
2,0-2,4:            INDENT         '    '
2,4-2,9:            NAME           'print'
2,9-2,10:           OP             '('
2,10-2,25:          STRING         '"Hello, World!"'
2,25-2,26:          OP             ')'
2,26-2,27:          NEWLINE        '\n'
3,0-3,1:            NL             '\n'
4,0-4,0:            DEDENT         ''
4,0-4,9:            NAME           'say_hello'
4,9-4,10:           OP             '('
4,10-4,11:          OP             ')'
4,11-4,12:          NEWLINE        '\n'
5,0-5,0:            ENDMARKER      ''
```

The exact token type names can be displayed using the `-e` option:

```shell
$ python -m tokenize -e hello.py
0,0-0,0:            ENCODING       'utf-8'
1,0-1,3:            NAME           'def'
1,4-1,13:           NAME           'say_hello'
1,13-1,14:          LPAR           '('
1,14-1,15:          RPAR           ')'
1,15-1,16:          COLON          ':'
1,16-1,17:          NEWLINE        '\n'
2,0-2,4:            INDENT         '    '
2,4-2,9:            NAME           'print'
2,9-2,10:           LPAR           '('
2,10-2,25:          STRING         '"Hello, World!"'
2,25-2,26:          RPAR           ')'
2,26-2,27:          NEWLINE        '\n'
3,0-3,1:            NL             '\n'
4,0-4,0:            DEDENT         ''
4,0-4,9:            NAME           'say_hello'
4,9-4,10:           LPAR           '('
4,10-4,11:          RPAR           ')'
4,11-4,12:          NEWLINE        '\n'
5,0-5,0:            ENDMARKER      ''
```

Example of tokenizing a file programmatically, reading unicode strings instead of bytes with `generate_tokens()`:

```python
import tokenize

with tokenize.open('hello.py') as f:
    tokens = tokenize.generate_tokens(f.readline)
    for token in tokens:
        print(token)
```

Or reading bytes directly with `tokenize()`:

```python
import tokenize

with open('hello.py', 'rb') as f:
    tokens = tokenize.tokenize(f.readline)
    for token in tokens:
        print(token)
```

© 2001–2023 Python Software Foundation. Licensed under the PSF License.
https://docs.python.org/3.12/library/tokenize.html
