Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-32492: 1.6x speed up in namedtuple attribute access using C fast-path #10495

Merged
merged 14 commits into from
Dec 30, 2018

Conversation

pablogsal
Copy link
Member

@pablogsal pablogsal commented Nov 13, 2018

Timing benchmarks

Attribute Access

import perf

runner = perf.Runner()
runner.timeit("a.x",
              stmt="a.x",
              setup="import collections;A=collections.namedtuple('A','x')")
./python -m perf compare_to old.json new.json -v
Mean +- std dev: [old] 280 ns +- 3 ns -> [new] 111 ns +- 1 ns: 2.52x faster

Apparently, there is a regression in the current master. This is the comparison against 3.7:

Mean +- std dev: [old] 176 ns +- 2 ns -> [new] 110 ns +- 2 ns: 1.61x faster (-38%)
Significant (t=177.69)

Creation

(Just to check that creation is not slower)

import perf

runner = perf.Runner()
runner.timeit("collections.namedtuple('A','x')",
              stmt="collections.namedtuple('A','x')",
              setup="import collections")
Mean +- std dev: [old_creation] 209 us +- 3 us -> [new_creation] 207 us +- 4 us: 1.01x faster
import perf

runner = perf.Runner()
runner.timeit("A(2324)",
              stmt="A(2324)",
              setup="import collections;A=collections.namedtuple('A','x')")
Mean +- std dev: [old_creation_obj] 1.41 us +- 0.03 us -> [new_creation_obj] 1.41 us +- 0.02 us: 1.00x faster (-0%)

Cache efficiency

Baseline

❯ perf stat -r 200 -B -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations ./python -c "
import collections
A = collections.namedtuple('A','x');a = A(42)
for _ in range(100):
    some_var = a.x
"""


 Performance counter stats for './python -c
import collections
A = collections.namedtuple('A','x');a = A(42)
for _ in range(100):
    some_var = a.x
' (200 runs):

         1,469,290      cache-references:u                                            ( +-  0.26% )
            20,240      cache-misses:u            #    1.378 % of all cache refs      ( +-  8.58% )
       146,812,273      cycles:u                                                      ( +-  0.24% )
       201,131,089      instructions:u            #    1.37  insn per cycle           ( +-  0.01% )
        40,257,360      branches:u                                                    ( +-  0.01% )
             1,175      faults:u                                                      ( +-  0.01% )
                 0      migrations:u

          0.050526 +- 0.000281 seconds time elapsed  ( +-  0.56% )

Patched

❯ perf stat -r 200 -B -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations ./python -c "
import collections
A = collections.namedtuple('A','x');a = A(42)
for _ in range(100):
    some_var = a.x
"""                                

 Performance counter stats for './python -c
import collections
A = collections.namedtuple('A','x');a = A(42)
for _ in range(100):
    some_var = a.x
' (200 runs):

         1,471,736      cache-references:u                                            ( +-  0.11% )
             7,196      cache-misses:u            #    0.489 % of all cache refs      ( +-  6.94% )
       145,004,120      cycles:u                                                      ( +-  0.07% )
       201,182,075      instructions:u            #    1.39  insn per cycle           ( +-  0.01% )
        40,219,107      branches:u                                                    ( +-  0.01% )
             1,174      faults:u                                                      ( +-  0.01% )
                 0      migrations:u

          0.048499 +- 0.000222 seconds time elapsed  ( +-  0.44% )

https://bugs.python.org/issue32492

Py_INCREF(self);
return self;
}
result = PyTuple_GetItem(obj, ((_tuplegetterobject*)self)->index);
Copy link
Member Author

@pablogsal pablogsal Nov 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add a PyTuple_Check here, but as this class is private to the collections module,
we can assume that the contract is that it has to be used in a tuple-like class. Notice that this stills throws SystemError: Objects/tupleobject.c:152: bad argument to internal function if is used incorrectly.

Being said that, I am happy to add the check if people think otherwise.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this can be triggered from Python, we should avoid SystemError and crashes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a PyTuple_Check that throws a TypeError in 21be735.

PyObject *prop_del;
PyObject *prop_doc;
int getter_doc;
} propertyobject;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be in the limited API. And why it is added here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because the new descriptor needs to inherit from PyProperty_Type and the object type (propertyobject) was not exposed in the headerfile, making impossible to inherit from it. I placed the object definition (propertyobject) together with the Type definition (PyProperty_Type).

Where is the best place to place the definition?

cache[index] = itemgetter_object, doc
class_namespace[name] = property(itemgetter_object, doc=doc)
tuplegetter_object = _tuplegetter(index, doc=doc)
cache[index] = tuplegetter_object, doc
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__doc__ is a writable attribute, and descriptors in different namedtuples can have different docs. This is why itemgetter objects was cached instead property objects.

A = namedtuple('A', 'x y')
B = namedtuple('B', 'x y')
A.x.__doc__ = 'foo'
B.x.__doc__ = 'bar'
assert A.x.__doc__ = 'foo'
assert B.x.__doc__ = 'bar'

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am caching only the docstring in 21be735

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a test based on the above example.

Py_INCREF(self);
return self;
}
result = PyTuple_GetItem(obj, ((_tuplegetterobject*)self)->index);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this can be triggered from Python, we should avoid SystemError and crashes.

@@ -0,0 +1,2 @@
Speed up :class:`namedtuple` attribute access by 2.5x using a C fast-path
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make comparison not with the current master, but with 3.7. There is a regression in master.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mean +- std dev: [old] 176 ns +- 2 ns -> [new] 110 ns +- 2 ns: 1.61x faster (-38%)
Significant (t=177.69)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Therefore the number 1.6x should be used in the documentation.

@pablogsal pablogsal changed the title bpo-32492: 2.5x speed up in namedtuple attribute access using C fast-path bpo-32492: 1.6x speed up in namedtuple attribute access using C fast-path Nov 13, 2018
@vstinner
Copy link
Member

vstinner commented Nov 13, 2018

Sorry, I didn't check your implementation, but did you consider to reuse existing structseq type to implement namedtuple? https://bugs.python.org/issue28638#msg298499 Last time I ran a microbenchmark, structseq was 1.9x faster than namedtuple to get an attribute by name.

In the meanwhile, I removed property_descr_get() micro-optimization because it wasn't correct and caused 3 different crashed, bpo-30156, commit e972c13. So I get that structseq is now even faster than namedtuple to get an attribute :-)

@pablogsal
Copy link
Member Author

pablogsal commented Nov 13, 2018

Sorry, I didn't check your implementation, but did you consider to reuse existing structseq type to implement namedtuple? https://bugs.python.org/issue28638#msg298499 Last time I ran a microbenchmark, structseq was 1.9x faster than namedtuple to get an attribute by name.

Hummm I did not consider this, but that will involve more significant and fundamental changes than this Pull Request. Also, apparently there is this issue that Josh Rosenberg ran into when implementing the idea. I am happy to give it a go if people agree that is a good idea :) But I think this we can start with this Pull Request as is simpler and it gives some immediate speedup.

@serhiy-storchaka
Copy link
Member

You do not need a subclass of property. You need just a descriptor.

Look also at __slots__ implementation.

@pablogsal
Copy link
Member Author

@serhiy-storchaka Thanks! I will take a look into that. Independently, if we don't move the property object to the header file, is not possible to subclass property in C. What do you think we should do with that?

@@ -0,0 +1,2 @@
Speed up :class:`namedtuple` attribute access by 2.5x using a C fast-path
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Therefore the number 1.6x should be used in the documentation.

@@ -454,12 +459,13 @@ def __getnewargs__(self):
cache = _nt_itemgetters
for index, name in enumerate(field_names):
try:
itemgetter_object, doc = cache[index]
doc = cache[index]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please measure the effect of caching docstrings. Adding this cache for docstrings sped up namedtuple type creation by 10% in former implementation, but removing the cache for itemgetters should reduce the benefit. If it is too small, it may be not worth to use the cache at all.

Copy link
Member Author

@pablogsal pablogsal Nov 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This are the results for:

from collections import namedtuple; names = ['field%d' % i for i in range(1000)]" -- "namedtuple('A', names)"
❯ ./python -m perf compare_to  with_caching.json without_caching.json
Mean +- std dev: [with_caching] 7.88 ms +- 0.12 ms -> [without_caching] 8.24 ms +- 0.04 ms: 1.05x slower (+5%)

Is 5% slower without the cache.

PyObject_HEAD
Py_ssize_t index;
PyObject* doc;
} _tuplegetterobject;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to reuse PyMemberDescrObject or PyGetSetDescrObject here, without creating a new type?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hummm.... I don't see an obvious way to do that. We still need a custom descriptor protocol to access the namedtuple object and a mutable doc field.

Modules/_collectionsmodule.c Outdated Show resolved Hide resolved
Modules/_collectionsmodule.c Outdated Show resolved Hide resolved
Modules/_collectionsmodule.c Outdated Show resolved Hide resolved
Modules/_collectionsmodule.c Outdated Show resolved Hide resolved
return self;
}
if (!PyTuple_Check(obj)){
PyErr_SetString(PyExc_TypeError, "_tuplegetter must be used with tuples");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to avoid using private class names in error messages.

Common error message looks like:

>>> int.numerator.__get__([])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: descriptor 'numerator' for 'int' objects doesn't apply to 'list' object

It needs additional information: the name of the namedtuple type and the name of the filed. If it is hard to add this information, the error message can use more general words.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed it to:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: descriptor for index '1' for tuple subclasses doesn't apply to 'A' object

cache[index] = itemgetter_object, doc
class_namespace[name] = property(itemgetter_object, doc=doc)
tuplegetter_object = _tuplegetter(index, doc=doc)
cache[index] = tuplegetter_object, doc
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a test based on the above example.

cache[index] = doc

tuplegetter_object = _tuplegetter(index, doc=doc)
class_namespace[name] = tuplegetter_object
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can tuplegetter object be cached?

try:
    tuplegetter_object = cache[index]
except KeyError:
    tuplegetter_object = cache[index] = _tuplegetter(index, doc=f'Alias for field number {index}')

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly, no because the docstrings are mutable. Check Serhiy comment:

#10495 (comment)

@rhettinger rhettinger self-assigned this Nov 26, 2018
@serhiy-storchaka
Copy link
Member

Try to make constructor arguments positional-only and repeat benchmarks for creating a namedtuple type. I think this can save several percents of creation time.

@pablogsal
Copy link
Member Author

pablogsal commented Nov 27, 2018

@serhiy-storchaka Here are the results (commit e5bca1d):

import perf

runner = perf.Runner()
runner.timeit("collections.namedtuple('A','x')", 
stmt="collections.namedtuple('A','x')", 
setup="import collections")
❯ ./python -m perf compare_to  ../cpython_baseline/old_creation.json new_creation.json
Mean +- std dev: [old_creation] 107 us +- 3 us -> [new_creation] 103 us +- 2 us: 1.03x faster (-4%)

@rhettinger
Copy link
Contributor

This patch looks great. Thanks for the effort to get this done :-)

Before this gets committed, please make a couple of improvements.

1. The _tuplegetter() API needs to more fully emulate property():

>>> set(dir(property)) - set(dir(_tuplegetter))
{'__delete__', 'fdel', 'deleter', '__isabstractmethod__', 'setter', '__set__', 'getter', 'fget', 'fset'}

Part of the reason is that we want tuplegetter() to be a drop in substitute, supporting whatever interactions users have had with it before now (this is an old API). Another reason is that tuplegetter() needs to be recognized as a data descriptor so that its docstrings show-up in the output of help().

Formerly, running >>> help(namedtuple('Point', ['x', 'y'])(10, 20)) would produce:

 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  x
 |      Alias for field number 0
 |  
 |  y
 |      Alias for field number 1

Now we get:

 |  Methods defined here:
 |
 |  x = <_collections._tuplegetter object>
 |  y = <_collections._tuplegetter object>

2. The code in tuplegetterdescr_get can be made tighter by using PyTuple_GET_SIZE() and PyTuple_GET_ITEM() instead of PyTuple_GetItem(). That saves the function call overhead and a redundant duplicate PyTuple_Check (the second check is 100% branch predictable which is good, but still incurs two chained memory accesses).

In running timings, we should not only benchmark 1.6x to 2.5 improvement, but also compare against regular attribute access to an instance of a class that defines __slots__. Ideally tuplegetter() should be almost as fast as member objects since both do almost exactly the same work (indexing into a tuple should be only slightly slower than into slots).

@serhiy-storchaka
Copy link
Member

To be recognized as a data descriptor tuplegetter needs to implement __set__.

I do not think that _tuplegetter should fully emulate property. It is enough if implement the common API of properties and data descriptors.

>>> sorted(set(dir(int.numerator)) - set(dir(_tuplegetter)))
['__delete__', '__name__', '__objclass__', '__qualname__', '__set__']
>>> class A: __slots__ = 'x'
... 
>>> sorted(set(dir(A.x)) - set(dir(_tuplegetter)))
['__delete__', '__name__', '__objclass__', '__qualname__', '__set__']

@pablogsal
Copy link
Member Author

pablogsal commented Dec 28, 2018

@rhettinger @serhiy-storchaka I am not sure what is the best path to follow. To make this PR simpler, what do you think about just reverting commit 1e14509 so tuplegetter inherits from property setting all the attributes. This "emulates" property but I cannot see any obvious downside and it makes the implementation much cleaner and (IMHO) maintainable. Manually implementing all the property methods here seems to me like raising a lot the maintenance burden, without mentioning future divergence with property.

@serhiy-storchaka Although tuplegetter would be fine just implementing the common API of data descriptors, people may be using the old properties in the namedtuple directly, accessing some particular fields that only properties have, so not implementing them may be a regression, right?

@serhiy-storchaka
Copy link
Member

I do not see sense in full emulating a property, and in any case your past versions did not do this.

Attributes setter and deleter are used only for defining the setter and the deleter in the class definition. _tuplegetter will not be used in such way.

__isabstractmethod__ does not make sense since _tuplegetter is not abstract.

fget, fset and fdel were not provided. In any case the user should not depend on such implementation detail. For getting a getter for the specific attribute they should use operator.attrgetter or trivial lambda.

We are at the pre-alpha stage. If some code will be broken by this change, we have enough time to fix it.

@pablogsal
Copy link
Member Author

pablogsal commented Dec 28, 2018

@serhiy-storchaka So you propose to implement:

['__delete__', '__name__', '__objclass__', '__qualname__', '__set__']

Is that correct?

@serhiy-storchaka
Copy link
Member

Try to implement just __set__ and __delete__. If this is not enough for pydoc, implement more.

@pablogsal
Copy link
Member Author

After db3ffcd:

>>> from collections import namedtuple
>>> help(namedtuple('Point', ['x', 'y'])(10, 20))

|  Data descriptors defined here:
|
|  x
|      Alias for field number 0
|
|  y
|      Alias for field number 1
|

>>> set(dir(property)) - set(dir(_tuplegetter))
{'fget', 'deleter', '__isabstractmethod__', 'getter', 'setter', 'fdel', 'fset'}

@pablogsal
Copy link
Member Author

pablogsal commented Dec 29, 2018

Benchmark agains a class definning __slots__ and tuples:

import perf

runner = perf.Runner()

runner.timeit("namedtuple",
        stmt="a.x",
        setup="""\
import collections
a = collections.namedtuple('A', ['x'])(3)
""")

runner.timeit("slots",
        stmt="b.x",
        setup="""\
class B:
    __slots__ = ("x",)

    def __init__(self, x):
        self.x = x
b = B(3)
""")

runner.timeit("tuple",
        stmt="b[0]",
        setup="""\
b = (3,)
""")

Results (no PGO):

./python ../experiment.py
.....................
namedtuple: Mean +- std dev: 34.7 ns +- 0.6 ns
.....................
slots: Mean +- std dev: 38.3 ns +- 1.8 ns
.....................
tuple: Mean +- std dev: 34.6 ns +- 0.2 ns

It turns that the latest _tuplegetter is 8% faster than __slots__ and basically the same as the tuple.

I ran some experiments regarding the inlining of PyTuple_GetItem and even without PGO is unoticeable under -O3 optimization. The x86 for the function call diff is:

        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-4], edi
        ...
        cmpq❘---%rdx, %rsi
        cmpq❘---%rax, 32(%rsi)

The two cmpq are smashed by the branch predictor and the stack allocation (the push and the two movs and subsequent) are almost negligible as the stack of tuplegetterdescr_get is reused. On the other hand, under O2 and less this changes and you can notice a small jitter in the benchmarks, so I think is a good idea to inline the call to PyTuple_GetItem as @rhettinger recommended.

if (value == NULL) {
return PyObject_DelItem(obj, index);
}
return PyObject_SetItem(obj, index, value);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just raise an AttributeError similar to errors for other read-only attributes:

>>> 1 .numerator = 2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: attribute 'numerator' of 'int' objects is not writable
>>> sys.version_info.major = 1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: readonly attribute
>>> sched.Event(0, 0, None, (), {}).time = 1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: can't set attribute

}

result = PyTuple_GET_ITEM(obj, index);
Py_XINCREF(result);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Py_INCREF() should be used here, not Py_XINCREF().

Copy link
Member Author

@pablogsal pablogsal Dec 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, now that we inline PyTuple_GetItem, the index cannot be out of bounds after the check and therefore it will always return something. Good catch!

@rhettinger
Copy link
Contributor

FWIW, my timings show a significant improvement (more than 2x) and that named tuple attribute access is now on-par with access to member objects created by _slots_.

Nice work.

@rhettinger rhettinger merged commit 3f5fc70 into python:master Dec 30, 2018
@bedevere-bot
Copy link

@rhettinger: Please replace # with GH- in the commit message next time. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants