Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Superfluous call to __init__ on str(x)/bytes(x) when __new__ returns an instance of str/bytes' subclass #104231

Open
EZForever opened this issue May 6, 2023 · 18 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error

Comments

@EZForever
Copy link

EZForever commented May 6, 2023

Bug report

According to the documentation on object.__new__:

If __new__() is invoked during object construction and it returns an instance of cls, then the new instance’s __init__() method will be invoked like __init__(self[, ...]), where self is the new instance and the remaining arguments are the same as were passed to the object constructor.

If __new__() does not return an instance of cls, then the new instance’s __init__() method will not be invoked.

__new__() is intended mainly to allow subclasses of immutable types (like int, str, or tuple) to customize instance creation.

And, indeed, packages exist that utilizes this feature to provide str-compatible custom types. However, when trying to get a "string representation" of a str-compatible instance like such with str(x), the following happens:

The problem is the return value of x.__str__ can be already initialized, or even have a __init__ signature incompatible with str. This is not an issue for str itself since str.__init__ does nothing and have a wildcard signature (according to tests I've done), but it is trivial to have a custom (and incompatible) __init__ and break things.

Proof of concept that shows __init__ was called the second time by str():

class mycls:
    def __init__(self, text):
        print('mycls.__init__', type(self), id(self), repr(text))
        self.text = text

    def __str__(self):
        print('mycls.__str__', type(self), id(self))
        return self.text

class mystr(str):
    def __new__(cls, obj):
        print('mystr.__new__', cls, repr(obj))
        return super().__new__(cls, obj) # Python 2: return str.__new__(cls, obj)

    def __init__(self, obj):
        print('mystr.__init__', type(self), id(self), repr(obj))
        super().__init__() # Python2: super(str, self).__init__()

out = str(mycls(mystr('hello')))
print('out', type(out), id(out), repr(out))

Sample output on Python 3.9:

mystr.__new__ <class '__main__.mystr'> 'hello'
mystr.__init__ <class '__main__.mystr'> 2019206422080 'hello'
mycls.__init__ <class '__main__.mycls'> 2019211172304 'hello'
mycls.__str__ <class '__main__.mycls'> 2019211172304
mystr.__init__ <class '__main__.mystr'> 2019206422080 <__main__.mycls object at 0x000001D6225D59D0>
out <class '__main__.mystr'> 2019206422080 'hello'

A real-world example that breaks tomlkit:

import tomlkit # pip install tomlkit==0.11.8

class mycls:
    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text

value = tomlkit.value('"hello"')
print(type(value))

instance = mycls(value)
print(instance)
print(str(instance))

Sample output:

<class 'tomlkit.items.String'>
hello
Traceback (most recent call last):
  File "C:\bug\poc2.py", line 16, in <module>
    print(str(instance))
TypeError: __init__() missing 3 required positional arguments: '_', 'original', and 'trivia'

This behavior is introduced by commit 8ace1ab 22 years ago, released in Python 2.3 and kept to the day.

A possible solution is to check for the exact class (instead of with subclasses) in type.__call__, however I'm not sure if this behavior is compliant with the documentation. Change str.__new__ to only allow str (and not its subclasses) to be returned by __str__ could also workaround this issue, but may break even more stuffs.

Your environment

  • CPython versions tested on: 2.7.18, 3.9.12, 3.11.3
  • Operating system and architecture: Windows 10.0.19045 x64, Debian Linux 11 x64, Fedora IoT 37 x64

Linked PRs

@EZForever EZForever added the type-bug An unexpected behavior, bug, or error label May 6, 2023
@arhadthedev arhadthedev added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label May 6, 2023
@EZForever EZForever changed the title Superfluous call to __init__ on instancing a class when __new__ returns a subclass instance Superfluous call to __init__ on instantiating a class when __new__ returns a subclass instance May 6, 2023
@terryjreedy
Copy link
Member

terryjreedy commented May 6, 2023

@gvanrossum This issue is about long-standing __new__ behavior you committed.

@gvanrossum
Copy link
Member

All the examples are using __str__. Is that essential to reproduce the behavior? If it isn't, an example without str() would be handy, just to reason about the issue without distraction. @EZForever

@EZForever
Copy link
Author

EZForever commented May 6, 2023

Reproducible with bytes()/__bytes__, as expected:

class mycls:
    def __init__(self, text):
        print('mycls.__init__', type(self), id(self), repr(text))
        self.text = text

    def __bytes__(self):
        print('mycls.__bytes__', type(self), id(self))
        return self.text

class mybytes(bytes):
    def __new__(cls, obj):
        print('mybytes.__new__', cls, repr(obj))
        return super().__new__(cls, obj)

    def __init__(self, obj):
        print('mybytes.__init__', type(self), id(self), repr(obj))
        super().__init__()

out = bytes(mycls(mybytes(b'hello')))
print('out', type(out), id(out), repr(out))

In theory the type.__call__ check will pass (and cause problems) as long as the object returned by foo.__new__ is an instance of a subclass of foo. I'm trying to construct an example without using builtin types but without success; still wrapping my brain around how __new__ works here.

Here's an example using no builtin types, demonstrating the second __init__ call and things break:

class foo:
    def __new__(cls, obj):
        print('foo.__new__', cls, repr(obj))
        if isinstance(obj, foo):
            return obj # Mimicking `__str__` returning an instance of `str`'s subclass
        else:
            return super().__new__(cls)

    def __init__(self, obj):
        print('foo.__init__', repr(self), repr(obj))

class bar1(foo):
    def __new__(cls, obj):
        print('bar1.__new__', cls, repr(obj))
        return super().__new__(cls, obj)

    def __init__(self, obj):
        print('bar1.__init__', repr(self), repr(obj))

class bar2(foo):
    def __new__(cls, obj1, obj2):
        print('bar2.__new__', cls, repr(obj1), repr(obj2))
        return super().__new__(cls, obj1)

    def __init__(self, obj1, obj2):
        print('bar2.__init__', repr(self), repr(obj1), repr(obj2))

out1 = foo(bar1(123))
print('out1', repr(out1))

print()

out2 = foo(bar2(123, 456))
print('out2', repr(out2))

Sample output on Python 3.11:

bar1.__new__ <class '__main__.bar1'> 123
foo.__new__ <class '__main__.bar1'> 123
bar1.__init__ <__main__.bar1 object at 0x000002348BE7A510> 123
foo.__new__ <class '__main__.foo'> <__main__.bar1 object at 0x000002348BE7A510>
bar1.__init__ <__main__.bar1 object at 0x000002348BE7A510> <__main__.bar1 object at 0x000002348BE7A510>
out1 <__main__.bar1 object at 0x000002348BE7A510>

bar2.__new__ <class '__main__.bar2'> 123 456
foo.__new__ <class '__main__.bar2'> 123
bar2.__init__ <__main__.bar2 object at 0x000002348BE78150> 123 456
foo.__new__ <class '__main__.foo'> <__main__.bar2 object at 0x000002348BE78150>
Traceback (most recent call last):
  File "C:\bug\poc.alt.py", line 35, in <module>
    out2 = foo(bar2(123, 456))
           ^^^^^^^^^^^^^^^^^^^
TypeError: bar2.__init__() missing 1 required positional argument: 'obj2'

@sunmy2019
Copy link
Member

sunmy2019 commented May 6, 2023

The behavior here is caused by the inconsistency of the:

  1. unicode_new_impl -> PyObject_Str returning an instance of subclass str,
  2. tp_call expects an instance of the exact type str, otherwise, it thinks it's an uninitialized instance.

str is a type, but str(...) sometimes is an initialized instance through __str__ not having type str.

This is indeed a bug. Change 1. breaks things, and change 2. means adding some ad-hoc rules for str and bytes.


Personal thoughts:

I am in favor of changing 1., by adding one more conversion in unicode_new_impl. By doing so, str(...) truly creates a str.

But on the other hand, changing 2. affects users' code in a minimal way. People may consider str(x) as an alias of x.__str__(), not the str(x.__str__()). Just like the analogy of dir(x) is x.__dir__(), len(x) is x.__len__().

@gvanrossum
Copy link
Member

Then please tone down the issue title to limit it to str and bytes. The general behavior is as intended -- when constructing an instance of C, and C.__new__ returns an instance of a subclass of C, that instance's __init__ is called. Even if you think that's a counter-productive rule, it has been documented this way for a long time and it just isn't worth risking to break user code that relies on it (and there always is such code).

Alas, I am no expert on the Unicode or bytes implementations. Let's see if @serhiy-storchaka is interested.

@gvanrossum
Copy link
Member

gvanrossum commented May 6, 2023

@sunmy2019 I don’t think we can change (1). The expectation that str(x) returns an exact str instance should not be violated.


Also we should stop editing comments to add new information/opinions.

@EZForever EZForever changed the title Superfluous call to __init__ on instantiating a class when __new__ returns a subclass instance Superfluous call to __init__ on str(x)/bytes(x) when __new__ returns an instance of str/bytes' subclass May 6, 2023
@EZForever
Copy link
Author

Acknowledged. Changed the title.

@sunmy2019
Copy link
Member

The expectation that str(x) returns an exact str instance should not be violated.

Changing (1), I mean:
Add one more conversion, so that str(x) has the exact type of str instead of its sub-class.

Changing (2), I mean:
Add ad-hoc rules, so that the __str__ or __bytes__ case will not call __init__ on sub-class any more.

@gvanrossum
Copy link
Member

Maybe you can submit pull requests for each option so we can compare.

@serhiy-storchaka
Copy link
Member

Similar issues were discussed for numeric types. Added @mdickinson as an expert.

It was decided that __index__() and __int__() should always return an instance of exact int, not int subclass, and the similar for __float__() and __complex__(). Now PyNumber_Index() emits a warning if __index__() returns an instance of exact int subclass, in future it will be error.

Example:

class mycls:
    def __init__(self, value):
        print('mycls.__init__', type(self), id(self), repr(value))
        self.value = value
    def __index__(self):
        print('mycls.__index__', type(self), id(self))
        return self.value

class myint(int):
    def __new__(cls, obj):
        print('myint.__new__', cls, repr(obj))
        return super().__new__(cls, obj)
    def __init__(self, obj):
        print('myint.__init__', type(self), id(self), repr(obj))
        super().__init__()

out = int(mycls(myint(42)))
print('out', type(out), id(out), repr(out))

Output:

myint.__new__ <class '__main__.myint'> 42
myint.__init__ <class '__main__.myint'> 140635079373472 42
mycls.__init__ <class '__main__.mycls'> 140635079336240 42
mycls.__index__ <class '__main__.mycls'> 140635079336240
<stdin>:1: DeprecationWarning: __index__ returned non-int (type myint).  The ability to return an instance of a strict subclass of int is deprecated, and may be removed in a future version of Python.
out <class 'int'> 94419303744280 42

I am not sure that it was the best solution (an alternative is PyNumber_Index() silently converting an int subclass into int, it would break less code), but what's done is done.

Now, for consistency, we need to deprecate __str__() returning str subclass and __bytes__() returning bytes subclass. Or change the decision for numeric types if we add an implicit conversion for str and bytes.

@sunmy2019
Copy link
Member

sunmy2019 commented May 6, 2023

Maybe you can submit pull requests for each option so we can compare.

I opened a draft PR for changing (1) #104247. Turns out that changing (2) is not feasible without breaking some APIs. Currently, there is no easy way to find out if the str is from __str__ or from a direct creation.

I am not sure that it was the best solution (an alternative is PyNumber_Index() silently converting an int subclass into int, it would break less code), but what's done is done.

This also makes sense. I think the agreement here would be str created from __str__ must be an exact str object.

@mdickinson
Copy link
Member

FWIW, if consistency is deemed important here I wouldn't have much objection to __index__ and __int__ being weakened to allow returning instances of subclasses of int, so long as int itself (and PyNumber_Long, operator.index, and PyNumber_Index) keep their guarantee of returning something of exact type int.

@gvanrossum
Copy link
Member

FWIW, if consistency is deemed important here I wouldn't have much objection to __index__ and __int__ being weakened to allow returning instances of subclasses of int, so long as int itself (and PyNumber_Long, operator.index, and PyNumber_Index) keep their guarantee of returning something of exact type int.

So that would mean dropping the existing deprecation again. I have a gut reaction against such reversals, but we should think a bit harder about what's the best thing to do here. Should we just add the deprecation to str() and bytes()?

Also, I should probably research this but I'm lazy -- is the current deprecation issued from int() or from the wrapper for __int__/__index__? We should be sure to match whatever it does for __str__ and __bytes__. We should also double check that the ultimate fix (forcing str() and perhaps __str__() to always return an exact str instance) makes the OP happy.

@EZForever
Copy link
Author

EZForever commented May 9, 2023

is the current deprecation issued from int() or from the wrapper for __int__/__index__?

The deprecation warnings are issued from PyNumber_Long, PyNumber_Index, PyNumber_Float and try_complex_special_method, all called by __new__ of corresponding types on checking the return value of __int__, __index__, etc. So to keep consistency we should add the warnings to str() and bytes(), respectively.

We should also double check that the ultimate fix (forcing str() and perhaps __str__() to always return an exact str instance) makes the OP happy.

I don't have any preference on the given solutions, sice all of them are compliant to the documentation and will break existing code in some way or another. If I have to choose, I'm in favor of the idea presented by the current PR - that is, make str() and bytes() more constructor-like by implicit conversion or issuing deprecation warnings. Although this solution is also not perfect IMO, since it breaks the assumption that x.__str__() is equivalent to str(x), etc., but no guarantee was given on that in the first place.

However I do wish to suggest that we clarify the documentation on the usage of subclasses in Python data models. IMHO the current wording on actual types used by the dunder methods is ambiguous to say the least. To quote the documentation on __int__, its return value should be "of appropriate type", however the subclasses of int, etc. may also be seen as appropriate under some circumstances, which is actually deprecated behavior not mentioned in the documentation. I also believe that by improving the wording we can reduce bugs like this in the future, and may even help the development of other Python implementations.

@gvanrossum
Copy link
Member

Okay, then I am slightly leaning towards changing str() and bytes() to implement a similar deprecation as int() etc. I think this should be done on the same PR (it's fine to add to the existing PR).

A separate PR with doc changes for subclass usage in the data model would also be welcome.

Let's aim for Python 3.13.

@sunmy2019
Copy link
Member

I am slightly leaning towards changing str() and bytes() to implement a similar deprecation as int()

I will do that. And I think we should also clarify in the warning that "if not doing so, the behavior could be wrong".

I will drop the implicit conversion.

Let's aim for Python 3.13.

Great, this will give us plenty of time.

@sunmy2019
Copy link
Member

Created #108814

serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Nov 30, 2023
…d bytes() (pythonGH-112551)

(cherry picked from commit 2223899)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Nov 30, 2023
…d bytes() (pythonGH-112551)

(cherry picked from commit 2223899)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
@terryjreedy
Copy link
Member

@serhiy-storchaka Your backports are ready to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

7 participants