GH-102613: Improve performance of pathlib.Path.rglob() #104244

@barneygale barneygale commented May 6, 2023

Stop de-duplicating results in `_RecursiveWildcardSelector`. A new `_DoubleRecursiveWildcardSelector` class is introduced which performs de-duplication, but it is used only for patterns with multiple non-adjacent `**` segments, such as `path.glob('**/foo/**')`. Because a set is no longer used in most cases, `PurePath.__hash__()` is not called, so paths do not need to be parsed and case-normalised.
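To illustrate why de-duplication matters only when a pattern contains multiple non-adjacent `**` segments, here is a minimal sketch using a simplified in-memory model. It is not the code in `Lib/pathlib.py`: the `TREE` data and the `walk` and `select` helpers are hypothetical names invented for this example, and `**` is modelled as "this directory and every directory below it".

```python
# Hypothetical in-memory directory tree: keys are names, values are children.
TREE = {"a": {"foo": {"b": {"foo": {"c": {}}}}}}


def walk(node, prefix=()):
    """Yield (path, node) for this directory and every descendant, like '**'."""
    yield prefix, node
    for name, child in node.items():
        yield from walk(child, prefix + (name,))


def select(node, prefix, parts):
    """Naive selector: expand pattern parts left to right, no de-duplication."""
    if not parts:
        yield "/".join(prefix)
        return
    head, rest = parts[0], parts[1:]
    if head == "**":
        for sub_prefix, sub_node in walk(node, prefix):
            yield from select(sub_node, sub_prefix, rest)
    elif head in node:
        yield from select(node[head], prefix + (head,), rest)


# One '**' segment: every match is reached by exactly one route.
single = list(select(TREE, (), ("**", "foo")))

# Two non-adjacent '**' segments: 'a/foo/b/foo' can be reached with the
# first '**' stopping at 'a' or at 'a/foo/b', so it is yielded twice.
double = list(select(TREE, (), ("**", "foo", "**")))

print(len(single), len(set(single)))
print(len(double), len(set(double)))
```

In this model the single-`**` pattern yields no repeats, so a set (and therefore hashing each path) is unnecessary there, while the `**/foo/**` pattern yields some paths more than once and still needs a de-duplicating pass.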

Also merge adjacent `**` segments in patterns, so that a pattern such as `**/**/*` is treated like `**/*`.
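Merging adjacent `**` segments can be sketched as a small pre-processing step over the pattern parts. The helper name `merge_adjacent_recursive` is hypothetical and the actual change in `Lib/pathlib.py` may be structured differently; this only shows the idea.

```python
def merge_adjacent_recursive(parts):
    """Collapse each run of consecutive '**' segments into a single '**'."""
    merged = []
    for part in parts:
        if part == "**" and merged and merged[-1] == "**":
            continue  # skip a '**' that directly follows another '**'
        merged.append(part)
    return merged


print(merge_adjacent_recursive(["**", "**", "*"]))
print(merge_adjacent_recursive(["**", "*", "**"]))
```

Non-adjacent `**` segments, as in `**/*/**`, are left alone, which is why such patterns still need the de-duplicating selector.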

Timings:

$ ./python -m timeit -s 'from pathlib import Path; p = Path()' 'list(p.glob("**/*"))'
1 loop, best of 5: 197 msec per loop   # before
2 loops, best of 5: 146 msec per loop  # after
--> 35% faster
$ ./python -m timeit -s 'from pathlib import Path; p = Path()' 'list(p.glob("**/**/*"))'
1 loop, best of 5: 1.77 sec per loop   # before
2 loops, best of 5: 146 msec per loop  # after
--> 12x faster
$ ./python -m timeit -s 'from pathlib import Path; p = Path()' 'list(p.glob("**/*/**"))'
1 loop, best of 5: 738 msec per loop   # before
1 loop, best of 5: 731 msec per loop   # after
--> about the same
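
For anyone wanting to repeat a comparison like the one above on a known directory tree, here is a rough sketch using the standard `timeit` module. The `build_tree` helper and the tree shape (depth 3, width 3) are assumptions made for this example; the numbers it prints depend on the machine and Python build and will not match the figures quoted above.

```python
import tempfile
import timeit
from pathlib import Path


def build_tree(root, depth=3, width=3):
    """Create a small nested directory tree (hypothetical shape, timing only)."""
    if depth == 0:
        return
    for i in range(width):
        child = root / f"d{i}"
        child.mkdir()
        build_tree(child, depth - 1, width)


results = {}
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    build_tree(root)
    for pattern in ("**/*", "**/**/*", "**/*/**"):
        # Best of 5 single runs, mirroring the "best of 5" style of timeit's CLI.
        timings = timeit.repeat(lambda: list(root.glob(pattern)), number=1, repeat=5)
        results[pattern] = min(timings)
        print(f"{pattern!r}: best of 5: {results[pattern] * 1000:.2f} msec")
```

Running the same script under an interpreter built before and after the change gives a before/after comparison on identical input.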

@barneygale barneygale merged commit c0ece3d into python:main May 7, 2023
jbower-fb pushed a commit to jbower-fb/cpython-jbowerfb that referenced this pull request May 8, 2023
…nGH-104244)
Labels: performance (Performance or resource usage), topic-pathlib