author Craig Jennings <c@cjennings.net> 2024-04-07 13:41:34 -0500
committer Craig Jennings <c@cjennings.net> 2024-04-07 13:41:34 -0500
commit 754bbf7a25a8dda49b5d08ef0d0443bbf5af0e36 (patch)
tree f1190704f78f04a2b0b4c977d20fe96a828377f1 /devdocs/python~3.12/library%2Ftokenize.html
new repository
Diffstat (limited to 'devdocs/python~3.12/library%2Ftokenize.html')
-rw-r--r-- devdocs/python~3.12/library%2Ftokenize.html | 132
1 file changed, 132 insertions, 0 deletions
diff --git a/devdocs/python~3.12/library%2Ftokenize.html b/devdocs/python~3.12/library%2Ftokenize.html
new file mode 100644
index 00000000..49d8fb10
--- /dev/null
+++ b/devdocs/python~3.12/library%2Ftokenize.html
@@ -0,0 +1,132 @@
+ <span id="tokenize-tokenizer-for-python-source"></span><h1>tokenize — Tokenizer for Python source</h1> <p><strong>Source code:</strong> <a class="reference external" href="https://github.com/python/cpython/tree/3.12/Lib/tokenize.py">Lib/tokenize.py</a></p> <p>The <a class="reference internal" href="#module-tokenize" title="tokenize: Lexical scanner for Python source code."><code>tokenize</code></a> module provides a lexical scanner for Python source code, implemented in Python. The scanner in this module returns comments as tokens as well, making it useful for implementing “pretty-printers”, including colorizers for on-screen displays.</p> <p>To simplify token stream handling, all <a class="reference internal" href="../reference/lexical_analysis#operators"><span class="std std-ref">operator</span></a> and <a class="reference internal" href="../reference/lexical_analysis#delimiters"><span class="std std-ref">delimiter</span></a> tokens and <a class="reference internal" href="constants#Ellipsis" title="Ellipsis"><code>Ellipsis</code></a> are returned using the generic <a class="reference internal" href="token#token.OP" title="token.OP"><code>OP</code></a> token type. The exact type can be determined by checking the <code>exact_type</code> property on the <a class="reference internal" href="../glossary#term-named-tuple"><span class="xref std std-term">named tuple</span></a> returned from <a class="reference internal" href="#tokenize.tokenize" title="tokenize.tokenize"><code>tokenize.tokenize()</code></a>.</p> <div class="admonition warning"> <p class="admonition-title">Warning</p> <p>Note that the functions in this module are only designed to parse syntactically valid Python code (code that does not raise when parsed using <a class="reference internal" href="ast#ast.parse" title="ast.parse"><code>ast.parse()</code></a>). The behavior of the functions in this module is <strong>undefined</strong> when providing invalid Python code and it can change at any point.</p> </div> <section id="tokenizing-input"> <h2>Tokenizing Input</h2> <p>The primary entry point is a <a class="reference internal" href="../glossary#term-generator"><span class="xref std std-term">generator</span></a>:</p> <dl class="py function"> <dt class="sig sig-object py" id="tokenize.tokenize">
+<code>tokenize.tokenize(readline)</code> </dt> <dd>
+<p>The <a class="reference internal" href="#tokenize.tokenize" title="tokenize.tokenize"><code>tokenize()</code></a> generator requires one argument, <em>readline</em>, which must be a callable object which provides the same interface as the <a class="reference internal" href="io#io.IOBase.readline" title="io.IOBase.readline"><code>io.IOBase.readline()</code></a> method of file objects. Each call to the function should return one line of input as bytes.</p> <p>The generator produces 5-tuples with these members: the token type; the token string; a 2-tuple <code>(srow, scol)</code> of ints specifying the row and column where the token begins in the source; a 2-tuple <code>(erow, ecol)</code> of ints specifying the row and column where the token ends in the source; and the line on which the token was found. The line passed (the last tuple item) is the <em>physical</em> line. The 5 tuple is returned as a <a class="reference internal" href="../glossary#term-named-tuple"><span class="xref std std-term">named tuple</span></a> with the field names: <code>type string start end line</code>.</p> <p>The returned <a class="reference internal" href="../glossary#term-named-tuple"><span class="xref std std-term">named tuple</span></a> has an additional property named <code>exact_type</code> that contains the exact operator type for <a class="reference internal" href="token#token.OP" title="token.OP"><code>OP</code></a> tokens. For all other token types <code>exact_type</code> equals the named tuple <code>type</code> field.</p> <div class="versionchanged"> <p><span class="versionmodified changed">Changed in version 3.1: </span>Added support for named tuples.</p> </div> <div class="versionchanged"> <p><span class="versionmodified changed">Changed in version 3.3: </span>Added support for <code>exact_type</code>.</p> </div> <p><a class="reference internal" href="#tokenize.tokenize" title="tokenize.tokenize"><code>tokenize()</code></a> determines the source encoding of the file by looking for a UTF-8 BOM or encoding cookie, according to <span class="target" id="index-0"></span><a class="pep reference external" href="https://peps.python.org/pep-0263/"><strong>PEP 263</strong></a>.</p> </dd>
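+<p>As an illustrative sketch (assuming nothing beyond the standard library), the following tokenizes an in-memory snippet and prints each token's generic and exact type name via <code>token.tok_name</code>:</p> <pre data-language="python">import io
+import token
+import tokenize
+
+source = b"x = 3.14  # a float\n"
+for tok in tokenize.tokenize(io.BytesIO(source).readline):
+    # tok is a named tuple: type, string, start, end, line
+    print(token.tok_name[tok.type], token.tok_name[tok.exact_type], repr(tok.string))
+</pre>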
+</dl> <dl class="py function"> <dt class="sig sig-object py" id="tokenize.generate_tokens">
+<code>tokenize.generate_tokens(readline)</code> </dt> <dd>
+<p>Tokenize a source reading unicode strings instead of bytes.</p> <p>Like <a class="reference internal" href="#tokenize.tokenize" title="tokenize.tokenize"><code>tokenize()</code></a>, the <em>readline</em> argument is a callable returning a single line of input. However, <a class="reference internal" href="#tokenize.generate_tokens" title="tokenize.generate_tokens"><code>generate_tokens()</code></a> expects <em>readline</em> to return a str object rather than bytes.</p> <p>The result is an iterator yielding named tuples, exactly like <a class="reference internal" href="#tokenize.tokenize" title="tokenize.tokenize"><code>tokenize()</code></a>. It does not yield an <a class="reference internal" href="token#token.ENCODING" title="token.ENCODING"><code>ENCODING</code></a> token.</p> </dd>
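+<p>For example, a short sketch that feeds <code>generate_tokens()</code> a <em>readline</em> built from a str; note that no <code>ENCODING</code> token is produced:</p> <pre data-language="python">import io
+import tokenize
+
+source = "spam = ['eggs']\n"
+for tok in tokenize.generate_tokens(io.StringIO(source).readline):
+    print(tok)
+</pre>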
+</dl> <p>All constants from the <a class="reference internal" href="token#module-token" title="token: Constants representing terminal nodes of the parse tree."><code>token</code></a> module are also exported from <a class="reference internal" href="#module-tokenize" title="tokenize: Lexical scanner for Python source code."><code>tokenize</code></a>.</p> <p>Another function is provided to reverse the tokenization process. This is useful for creating tools that tokenize a script, modify the token stream, and write back the modified script.</p> <dl class="py function"> <dt class="sig sig-object py" id="tokenize.untokenize">
+<code>tokenize.untokenize(iterable)</code> </dt> <dd>
+<p>Converts tokens back into Python source code. The <em>iterable</em> must return sequences with at least two elements, the token type and the token string. Any additional sequence elements are ignored.</p> <p>The reconstructed script is returned as a single string. The result is guaranteed to tokenize back to match the input so that the conversion is lossless and round-trips are assured. The guarantee applies only to the token type and token string as the spacing between tokens (column positions) may change.</p> <p>It returns bytes, encoded using the <a class="reference internal" href="token#token.ENCODING" title="token.ENCODING"><code>ENCODING</code></a> token, which is the first token sequence output by <a class="reference internal" href="#tokenize.tokenize" title="tokenize.tokenize"><code>tokenize()</code></a>. If there is no encoding token in the input, it returns a str instead.</p> </dd>
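+<p>A minimal round-trip sketch; because the token stream begins with an <code>ENCODING</code> token, <code>untokenize()</code> returns bytes here:</p> <pre data-language="python">import io
+import tokenize
+
+source = b"a = 1 + 2\n"
+tokens = list(tokenize.tokenize(io.BytesIO(source).readline))
+rebuilt = tokenize.untokenize(tokens)
+# With full 5-tuples the original spacing is reproduced for this input.
+print(rebuilt == source)
+</pre>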
+</dl> <p><a class="reference internal" href="#tokenize.tokenize" title="tokenize.tokenize"><code>tokenize()</code></a> needs to detect the encoding of source files it tokenizes. The function it uses to do this is available:</p> <dl class="py function"> <dt class="sig sig-object py" id="tokenize.detect_encoding">
+<code>tokenize.detect_encoding(readline)</code> </dt> <dd>
+<p>The <a class="reference internal" href="#tokenize.detect_encoding" title="tokenize.detect_encoding"><code>detect_encoding()</code></a> function is used to detect the encoding that should be used to decode a Python source file. It requires one argument, readline, in the same way as the <a class="reference internal" href="#tokenize.tokenize" title="tokenize.tokenize"><code>tokenize()</code></a> generator.</p> <p>It will call readline a maximum of twice, and return the encoding used (as a string) and a list of any lines (not decoded from bytes) it has read in.</p> <p>It detects the encoding from the presence of a UTF-8 BOM or an encoding cookie as specified in <span class="target" id="index-1"></span><a class="pep reference external" href="https://peps.python.org/pep-0263/"><strong>PEP 263</strong></a>. If both a BOM and a cookie are present, but disagree, a <a class="reference internal" href="exceptions#SyntaxError" title="SyntaxError"><code>SyntaxError</code></a> will be raised. Note that if the BOM is found, <code>'utf-8-sig'</code> will be returned as an encoding.</p> <p>If no encoding is specified, then the default of <code>'utf-8'</code> will be returned.</p> <p>Use <a class="reference internal" href="#tokenize.open" title="tokenize.open"><code>open()</code></a> to open Python source files: it uses <a class="reference internal" href="#tokenize.detect_encoding" title="tokenize.detect_encoding"><code>detect_encoding()</code></a> to detect the file encoding.</p> </dd>
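+<p>By way of illustration, detecting the encoding declared by a PEP 263 coding cookie (the cookie name is normalized):</p> <pre data-language="python">import io
+import tokenize
+
+source = b"# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n"
+encoding, lines = tokenize.detect_encoding(io.BytesIO(source).readline)
+print(encoding)  # 'iso-8859-1'
+print(lines)     # the raw, undecoded lines consumed during detection
+</pre>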
+</dl> <dl class="py function"> <dt class="sig sig-object py" id="tokenize.open">
+<code>tokenize.open(filename)</code> </dt> <dd>
+<p>Open a file in read only mode using the encoding detected by <a class="reference internal" href="#tokenize.detect_encoding" title="tokenize.detect_encoding"><code>detect_encoding()</code></a>.</p> <div class="versionadded"> <p><span class="versionmodified added">New in version 3.2.</span></p> </div> </dd>
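+<p>A short sketch, assuming a <code>hello.py</code> like the one used in the Examples section below; the returned file object is already wrapped with the detected encoding:</p> <pre data-language="python">import tokenize
+
+with tokenize.open('hello.py') as f:
+    print(f.encoding)  # e.g. 'utf-8'
+    print(f.read())
+</pre>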
+</dl> <dl class="py exception"> <dt class="sig sig-object py" id="tokenize.TokenError">
+<code>exception tokenize.TokenError</code> </dt> <dd>
+<p>Raised when either a docstring or expression that may be split over several lines is not completed anywhere in the file, for example:</p> <pre data-language="python">"""Beginning of
+docstring
+</pre> <p>or:</p> <pre data-language="python">[1,
+ 2,
+ 3
+</pre> </dd>
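+<p>To illustrate, the unclosed bracket above can be reproduced and caught; the error is raised lazily while iterating, so the generator is drained with <code>list()</code>:</p> <pre data-language="python">import io
+import tokenize
+
+broken = "[1,\n 2,\n 3\n"  # the bracketed expression is never closed
+try:
+    list(tokenize.generate_tokens(io.StringIO(broken).readline))
+except tokenize.TokenError as exc:
+    print(exc)  # the message reports EOF in a multi-line statement
+</pre>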
+</dl> </section> <section id="command-line-usage"> <span id="tokenize-cli"></span><h2>Command-Line Usage</h2> <div class="versionadded"> <p><span class="versionmodified added">New in version 3.3.</span></p> </div> <p>The <a class="reference internal" href="#module-tokenize" title="tokenize: Lexical scanner for Python source code."><code>tokenize</code></a> module can be executed as a script from the command line. It is as simple as:</p> <pre data-language="sh">python -m tokenize [-e] [filename.py]
+</pre> <p>The following options are accepted:</p> <dl class="std option"> <dt class="sig sig-object std" id="cmdoption-tokenize-h">
+<code>-h, --help</code> </dt> <dd>
+<p>show this help message and exit</p> </dd>
+</dl> <dl class="std option"> <dt class="sig sig-object std" id="cmdoption-tokenize-e">
+<code>-e, --exact</code> </dt> <dd>
+<p>display token names using the exact type</p> </dd>
+</dl> <p>If <code>filename.py</code> is specified its contents are tokenized to stdout. Otherwise, tokenization is performed on stdin.</p> </section> <section id="examples"> <h2>Examples</h2> <p>Example of a script rewriter that transforms float literals into Decimal objects:</p> <pre data-language="python">from tokenize import tokenize, untokenize, NUMBER, STRING, NAME, OP
+from io import BytesIO
+
+def decistmt(s):
+ """Substitute Decimals for floats in a string of statements.
+
+ &gt;&gt;&gt; from decimal import Decimal
+ &gt;&gt;&gt; s = 'print(+21.3e-5*-.1234/81.7)'
+ &gt;&gt;&gt; decistmt(s)
+ "print (+Decimal ('21.3e-5')*-Decimal ('.1234')/Decimal ('81.7'))"
+
+ The format of the exponent is inherited from the platform C library.
+ Known cases are "e-007" (Windows) and "e-07" (not Windows). Since
+ we're only showing 12 digits, and the 13th isn't close to 5, the
+ rest of the output should be platform-independent.
+
+ &gt;&gt;&gt; exec(s) #doctest: +ELLIPSIS
+ -3.21716034272e-0...7
+
+ Output from calculations with Decimal should be identical across all
+ platforms.
+
+ &gt;&gt;&gt; exec(decistmt(s))
+ -3.217160342717258261933904529E-7
+ """
+ result = []
+ g = tokenize(BytesIO(s.encode('utf-8')).readline) # tokenize the string
+ for toknum, tokval, _, _, _ in g:
+ if toknum == NUMBER and '.' in tokval: # replace NUMBER tokens
+ result.extend([
+ (NAME, 'Decimal'),
+ (OP, '('),
+ (STRING, repr(tokval)),
+ (OP, ')')
+ ])
+ else:
+ result.append((toknum, tokval))
+ return untokenize(result).decode('utf-8')
+</pre> <p>Example of tokenizing from the command line. The script:</p> <pre data-language="python">def say_hello():
+ print("Hello, World!")
+
+say_hello()
+</pre> <p>will be tokenized to the following output where the first column is the range of the line/column coordinates where the token is found, the second column is the name of the token, and the final column is the value of the token (if any)</p> <pre data-language="shell">$ python -m tokenize hello.py
+0,0-0,0: ENCODING 'utf-8'
+1,0-1,3: NAME 'def'
+1,4-1,13: NAME 'say_hello'
+1,13-1,14: OP '('
+1,14-1,15: OP ')'
+1,15-1,16: OP ':'
+1,16-1,17: NEWLINE '\n'
+2,0-2,4: INDENT ' '
+2,4-2,9: NAME 'print'
+2,9-2,10: OP '('
+2,10-2,25: STRING '"Hello, World!"'
+2,25-2,26: OP ')'
+2,26-2,27: NEWLINE '\n'
+3,0-3,1: NL '\n'
+4,0-4,0: DEDENT ''
+4,0-4,9: NAME 'say_hello'
+4,9-4,10: OP '('
+4,10-4,11: OP ')'
+4,11-4,12: NEWLINE '\n'
+5,0-5,0: ENDMARKER ''
+</pre> <p>The exact token type names can be displayed using the <a class="reference internal" href="#cmdoption-tokenize-e"><code>-e</code></a> option:</p> <pre data-language="shell">$ python -m tokenize -e hello.py
+0,0-0,0: ENCODING 'utf-8'
+1,0-1,3: NAME 'def'
+1,4-1,13: NAME 'say_hello'
+1,13-1,14: LPAR '('
+1,14-1,15: RPAR ')'
+1,15-1,16: COLON ':'
+1,16-1,17: NEWLINE '\n'
+2,0-2,4: INDENT ' '
+2,4-2,9: NAME 'print'
+2,9-2,10: LPAR '('
+2,10-2,25: STRING '"Hello, World!"'
+2,25-2,26: RPAR ')'
+2,26-2,27: NEWLINE '\n'
+3,0-3,1: NL '\n'
+4,0-4,0: DEDENT ''
+4,0-4,9: NAME 'say_hello'
+4,9-4,10: LPAR '('
+4,10-4,11: RPAR ')'
+4,11-4,12: NEWLINE '\n'
+5,0-5,0: ENDMARKER ''
+</pre> <p>Example of tokenizing a file programmatically, reading unicode strings instead of bytes with <a class="reference internal" href="#tokenize.generate_tokens" title="tokenize.generate_tokens"><code>generate_tokens()</code></a>:</p> <pre data-language="python">import tokenize
+
+with tokenize.open('hello.py') as f:
+ tokens = tokenize.generate_tokens(f.readline)
+ for token in tokens:
+ print(token)
+</pre> <p>Or reading bytes directly with <a class="reference internal" href="#tokenize.tokenize" title="tokenize.tokenize"><code>tokenize()</code></a>:</p> <pre data-language="python">import tokenize
+
+with open('hello.py', 'rb') as f:
+ tokens = tokenize.tokenize(f.readline)
+ for token in tokens:
+ print(token)
+</pre> </section> <div class="_attribution">
+ <p class="_attribution-p">
+ &copy; 2001&ndash;2023 Python Software Foundation<br>Licensed under the PSF License.<br>
+ <a href="https://docs.python.org/3.12/library/tokenize.html" class="_attribution-link">https://docs.python.org/3.12/library/tokenize.html</a>
+ </p>
+</div>