
Tokenize generate_tokens regression in CPython 3.12 #111224

Closed
Erotemic opened this issue Oct 23, 2023 · 11 comments
Labels
topic-parser type-bug An unexpected behavior, bug, or error

Comments


Erotemic commented Oct 23, 2023

Bug report

Bug description:

I've noticed a regression when adding 3.12 support to xdoctest.

The following MWE has different behavior on 3.11 and 3.12.

import tokenize
lines = ['3, 4]', 'print(len(x))']
iterable = (line for line in lines if line)


def _readline():
    return next(iterable)

for t in tokenize.generate_tokens(_readline):
    print(t)

On 3.11 and earlier versions, this results in a tokenize.TokenError being raised:

TokenInfo(type=2 (NUMBER), string='3', start=(1, 0), end=(1, 1), line='3, 4]')
TokenInfo(type=54 (OP), string=',', start=(1, 1), end=(1, 2), line='3, 4]')
TokenInfo(type=2 (NUMBER), string='4', start=(1, 3), end=(1, 4), line='3, 4]')
TokenInfo(type=54 (OP), string=']', start=(1, 4), end=(1, 5), line='3, 4]')
TokenInfo(type=1 (NAME), string='print', start=(2, 0), end=(2, 5), line='print(len(x))')
TokenInfo(type=54 (OP), string='(', start=(2, 5), end=(2, 6), line='print(len(x))')
TokenInfo(type=1 (NAME), string='len', start=(2, 6), end=(2, 9), line='print(len(x))')
TokenInfo(type=54 (OP), string='(', start=(2, 9), end=(2, 10), line='print(len(x))')
TokenInfo(type=1 (NAME), string='x', start=(2, 10), end=(2, 11), line='print(len(x))')
TokenInfo(type=54 (OP), string=')', start=(2, 11), end=(2, 12), line='print(len(x))')
TokenInfo(type=54 (OP), string=')', start=(2, 12), end=(2, 13), line='print(len(x))')
Traceback (most recent call last):
  File "/home/joncrall/code/xdoctest/dev/tokenize_mwe.py", line 10, in <module>
    for t in tokenize.generate_tokens(_readline):
  File "/home/joncrall/.pyenv/versions/3.11.2/lib/python3.11/tokenize.py", line 525, in _tokenize
    raise TokenError("EOF in multi-line statement", (lnum, 0))
tokenize.TokenError: ('EOF in multi-line statement', (3, 0))

However, on 3.12 this no longer raises an error. Instead I get:

TokenInfo(type=2 (NUMBER), string='3', start=(1, 0), end=(1, 1), line='3, 4]')
TokenInfo(type=55 (OP), string=',', start=(1, 1), end=(1, 2), line='3, 4]')
TokenInfo(type=2 (NUMBER), string='4', start=(1, 3), end=(1, 4), line='3, 4]')
TokenInfo(type=55 (OP), string=']', start=(1, 4), end=(1, 5), line='3, 4]')
TokenInfo(type=4 (NEWLINE), string='', start=(1, 5), end=(1, 6), line='3, 4]')
TokenInfo(type=1 (NAME), string='print', start=(2, 0), end=(2, 5), line='print(len(x))')
TokenInfo(type=55 (OP), string='(', start=(2, 5), end=(2, 6), line='print(len(x))')
TokenInfo(type=1 (NAME), string='len', start=(2, 6), end=(2, 9), line='print(len(x))')
TokenInfo(type=55 (OP), string='(', start=(2, 9), end=(2, 10), line='print(len(x))')
TokenInfo(type=1 (NAME), string='x', start=(2, 10), end=(2, 11), line='print(len(x))')
TokenInfo(type=55 (OP), string=')', start=(2, 11), end=(2, 12), line='print(len(x))')
TokenInfo(type=55 (OP), string=')', start=(2, 12), end=(2, 13), line='print(len(x))')
TokenInfo(type=4 (NEWLINE), string='', start=(2, 13), end=(2, 14), line='print(len(x))')
TokenInfo(type=0 (ENDMARKER), string='', start=(3, 0), end=(3, 0), line='')

This is a problem for xdoctest because it uses tokenize to determine if a statement is "balanced" (i.e. if it is part of a line continuation or not). This is the magic I use to autodetect PS1 vs PS2 lines and prevent users from needing to manually specify if a line is a continuation or not.
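
A minimal sketch of that detection strategy, relying on the pre-3.12 behavior (the helper name is_balanced_statement and its structure are illustrative, not xdoctest's actual code):

import io
import tokenize

def is_balanced_statement(lines):
    # Treat the lines as balanced if tokenizing them completes without
    # raising. This relies on pre-3.12 tokenize raising TokenError at
    # EOF inside a multi-line statement, so it is undefined on 3.12+.
    text = '\n'.join(lines)
    try:
        list(tokenize.generate_tokens(io.StringIO(text).readline))
    except (tokenize.TokenError, IndentationError):
        return False
    return True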

Looking through the release and migration notes, I don't see anything that would indicate that this new behavior is introduced, so I suspect it is a bug. I'm sorry I didn't catch this before the 3.12 release. I've been busy.

If this is not a bug and an intended change, then it should be documented (please link to the relevant section if I missed it). If there is a way to work around this so xdoctest works on 3.12.0 that would be helpful. (It's probably time some of the parsing code got a rewrite anyway).

CPython versions tested on:

3.11, 3.12

Operating systems tested on:

Linux

@Erotemic Erotemic added the type-bug An unexpected behavior, bug, or error label Oct 23, 2023
@sobolevn
Member

cc @pablogsal @lysnikolaou

@terryjreedy
Member

lines = list(lines) does nothing when lines is already a list, as it was in the original example. That is confusing in a claimed MWE.

For 3.12, the tokenize module was rewritten to use the real C-coded tokenizer instead of Python code intended to mimic it. As for "Looking through the release and migration notes": did you search the changelog for 'token'? There were numerous bugfix issues that would not be mentioned in What's New.

@Erotemic
Author

Erotemic commented Oct 23, 2023

Sorry about the extra line, I'll remove it.

From the 3.12 changelog I've collected the patches that reference tokenize. I don't see any obvious notes indicating this should have changed, but if the implementation changed from Python to C, it could be that the C variant never raised an error in the above case.

gh-105564: Don’t include artificial newlines in the line attribute of tokens in the APIs of the tokenize module. Patch by Pablo Galindo

gh-105549: Tokenize separately NUMBER and NAME tokens that are not ambiguous. Patch by Pablo Galindo.

gh-105390: Correctly raise tokenize.TokenError exceptions instead of SyntaxError for tokenize errors such as incomplete input. Patch by Pablo Galindo

gh-105259: Don’t include newline character for trailing NEWLINE tokens emitted in the tokenize module. Patch by Pablo Galindo

gh-105324: Fix the main function of the tokenize module when reading from sys.stdin. Patch by Pablo Galindo

gh-105017: Show CRLF lines in the tokenize string attribute in both NL and NEWLINE tokens. Patch by Marta Gómez.

gh-105017: Do not include an additional final NL token when parsing files having CRLF lines. Patch by Marta Gómez.

gh-104976: Ensure that trailing DEDENT tokenize.TokenInfo objects emitted by the tokenize module are reported as in Python 3.11. Patch by Pablo Galindo

gh-104972: Ensure that the line attribute in tokenize.TokenInfo objects in the tokenize module are always correct. Patch by Pablo Galindo

gh-104825: Tokens emitted by the tokenize module do not include an implicit \n character in the line attribute anymore. Patch by Pablo Galindo

gh-102856: Implement PEP 701 changes in the tokenize module. Patch by Marta Gómez Macías and Pablo Galindo Salgado

gh-102856: Implement the required C tokenizer changes for PEP 701. Patch by Pablo Galindo Salgado, Lysandros Nikolaou, Batuhan Taskaya, Marta Gómez Macías and sunmy2019.

gh-99891: Fix a bug in the tokenizer that could cause infinite recursion when showing syntax warnings that happen in the first line of the source. Patch by Pablo Galindo

gh-99581: Fixed a bug that was causing a buffer overflow if the tokenizer copies a line missing the newline character from a file that is as long as the available tokenizer buffer. Patch by Pablo Galindo

gh-97997: Add running column offset to the tokenizer state to avoid calculating AST column information with pointer arithmetic.

gh-97973: Modify the tokenizer to return all necessary information the parser needs to set location information in the AST nodes, so that the parser does not have to calculate those doing pointer arithmetic.

gh-94360: Fixed a tokenizer crash when reading encoded files with syntax errors from stdin with non-UTF-8 encoded text. Patch by Pablo Galindo

EDIT: Thanks for the pointer about the change in implementation from Python -> C. I was able to patch xdoctest by including a vendored copy of tokenize.py from Python 3.11. This will ensure the library works with 3.12.0. However, it would be good to determine whether this new behavior is intended or whether the C code should also raise an error in this case.

@pablogsal
Member

Thanks for opening this issue. I will take a look, but unfortunately there isn't much we can promise here. Note this warning in the tokenize docs: https://docs.python.org/3/library/tokenize.html


Warning: Note that the functions in this module are only designed to parse syntactically valid Python code (code that does not raise when parsed using ast.parse()). The behavior of the functions in this module is undefined when providing invalid Python code and it can change at any point.

Per this warning, the behavior of the functions is not defined when you pass code that is not syntactically valid, which includes this case.

Guaranteeing anything for invalid code is very hard for us because different users have different expectations. Indeed, some users have asked us not to raise in this case because they want to still receive the tokens even if they are unbalanced. Doing what you are suggesting would break another set of users who don't expect us to raise.

@pablogsal
Member

This will ensure the library works with 3.12.0.

Note that this means you won't be able to parse PEP 701-based code with that tokenize implementation.
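
For instance (an illustrative example, not from the thread), PEP 701 allows reusing the same quote character inside an f-string, which the 3.11 tokenizer cannot handle:

import io
import tokenize

# Valid in 3.12+ thanks to PEP 701; a vendored 3.11 tokenize.py
# mis-tokenizes the reused quotes instead of producing f-string tokens.
src = 'f"{"inner"}"\n'
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tok)  # 3.12 emits FSTRING_START ... FSTRING_END tokens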

@pablogsal
Member

This is a problem for xdoctest because it uses tokenize to determine if a statement is "balanced" (i.e. if it is part of a line continuation or not). This is the magic I use to autodetect PS1 vs PS2 lines and prevent users from needing to manually specify if a line is a continuation or not.

This may not be the best tool for your use case, because it is still somewhat brittle. You can manually check whether your statement is balanced by walking the tokens and tracking the bracket balance yourself (+1 on each opening bracket, -1 on each closing one; see the sketch after this comment). You can also consider the codeop module to check whether the statement is balanced:

>>> import codeop
>>> codeop.compile_command("[3, 4")
>>> codeop.compile_command("[3, 4]")
<code object <module> at 0x104de6590, file "<input>", line 1>

It raises if the statement is unbalanced in the other direction:

>>> codeop.compile_command("3, 4]")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/pgalindo3/.pyenv/versions/3.12.0/lib/python3.12/codeop.py", line 107, in compile_command
    return _maybe_compile(_compile, source, filename, symbol)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pgalindo3/.pyenv/versions/3.12.0/lib/python3.12/codeop.py", line 73, in _maybe_compile
    return compiler(source, filename, symbol)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pgalindo3/.pyenv/versions/3.12.0/lib/python3.12/codeop.py", line 86, in _compile
    return compile(source, filename, symbol, PyCF_DONT_IMPLY_DEDENT | PyCF_ALLOW_INCOMPLETE_INPUT)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<input>", line 1
    3, 4]
        ^
SyntaxError: unmatched ']'
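
A minimal sketch of the manual bracket-counting approach mentioned above (bracket_balance is a hypothetical helper, not a vetted implementation; it only tracks (), [] and {}, and ignores quotes):

import io
import tokenize

def bracket_balance(source):
    # Positive result: unclosed openers remain; negative: a closer
    # appeared before its opener; zero: brackets are balanced.
    depth = 0
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    try:
        for tok in tokens:
            if tok.type == tokenize.OP:
                if tok.string in '([{':
                    depth += 1
                elif tok.string in ')]}':
                    depth -= 1
    except tokenize.TokenError:
        pass  # EOF mid-statement; the unclosed openers are already counted
    return depth

print(bracket_balance('[3, 4'))   # 1  -> open bracket left unclosed
print(bracket_balance('3, 4]'))   # -1 -> extra closing bracket
print(bracket_balance('[3, 4]'))  # 0  -> balanced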

@pablogsal
Member

Yeah, confirmed that raising in this case breaks some existing code that was adapted after 3.12 was released, so I propose closing this as won't-fix. @lysnikolaou, what do you think?

@Erotemic
Author

Erotemic commented Oct 24, 2023

Note that this means you won't be able to parse PEP 701-based code with that tokenize implementation.

Right... that means I do need to find a new solution. That's my bad for relying on undefined behavior.

This needs to work not only for braces/brackets, but also for quotes. Also, the current solution will mark 'try: raise Exception' as balanced (and the current code does seem to rely on this behavior). I wrote this so long ago that I'm forgetting all the cases; maybe the exception case only fails a contrived test and isn't actually necessary.

What I ultimately need to do is: given a sequence of text lines that is probably valid Python code, label each line as either starting its own statement (a PS1 line) or continuing a previous statement (a PS2 line). Python has come a long way since I first implemented this in 3.7 and had to deal with issue16806, so maybe the current tooling (and better line number management) will help clean up the xdoctest parsing code.
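
One way to sketch that labeling with codeop, as suggested above (label_lines and the accumulation strategy are assumptions, not xdoctest's actual algorithm):

import codeop

def label_lines(lines):
    # A line starts a new statement (PS1) when the lines accumulated so
    # far already compile as a complete command; otherwise it continues
    # the previous statement (PS2).
    labels = []
    buf = []
    for line in lines:
        complete = True
        if buf:
            try:
                # codeop.compile_command returns None for incomplete input.
                complete = codeop.compile_command('\n'.join(buf)) is not None
            except SyntaxError:
                complete = True  # an invalid statement cannot be continued
        if complete:
            buf = [line]
            labels.append('PS1')
        else:
            buf.append(line)
            labels.append('PS2')
    return labels

print(label_lines(['x = [1,', '     2]', 'print(x)']))  # ['PS1', 'PS2', 'PS1']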

@pablogsal
Copy link
Member

Technically speaking, there is a way to identify PS1 lines with only the line itself as input, but there is no such thing for PS2, because that depends on whatever was written on the PS1 line. You can label lines as "potentially" PS2, but that will be very brittle. You can reliably identify PS2 lines only if you have something that can look at the previous lines.

@lysnikolaou
Contributor

Yeah, confirmed that raising in this case breaks some existing code that was adapted after 3.12 was released, so I propose closing this as won't-fix.

Agreed. Especially since we did the work to make sure that cases like this do not raise, for the sake of PyCQA tools etc.

Thanks for the report @Erotemic and sorry we can't do more.

@tfmorris

FWIW Python 3.12 broke web.py / OpenLibrary as well. webpy/webpy#784
