barneygale commented Apr 6, 2024

Move pathlib globbing implementation into a new private class: glob._Globber. This class implements fast string-based globbing. It's called by pathlib.Path.glob(), which then converts strings back to path objects.

In the private pathlib ABCs, add a pathlib._abc.Globber subclass that works with PathBase objects rather than strings, and calls user-defined path methods like PathBase.stat() rather than os.stat().
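
To make the split described above concrete, here is a minimal sketch of the idea, not the code in this PR. The class names `StringGlobber` and `ObjectGlobber`, the hook names `list_names` and `join`, and the helper `glob_paths` are all hypothetical, and only a single wildcard segment is handled. The base class matches with plain strings and `os` functions; the subclass overrides the hooks so the same algorithm runs on path objects.

```python
import fnmatch
import os
from pathlib import Path


class StringGlobber:
    """Hypothetical sketch: match one wildcard segment using plain strings."""

    # Hooks that a subclass may override to support other path types.
    @staticmethod
    def list_names(path):
        with os.scandir(path) as entries:
            return sorted(entry.name for entry in entries)

    @staticmethod
    def join(path, name):
        return os.path.join(path, name)

    def select(self, path, pattern):
        # Yield every child of `path` whose name matches `pattern`.
        for name in self.list_names(path):
            if fnmatch.fnmatchcase(name, pattern):
                yield self.join(path, name)


class ObjectGlobber(StringGlobber):
    """Same algorithm, but calls methods on path objects instead of os functions."""

    @staticmethod
    def list_names(path):
        return sorted(child.name for child in path.iterdir())

    @staticmethod
    def join(path, name):
        return path / name


def glob_paths(root, pattern):
    # Match on strings first; build Path objects only for the results.
    return (Path(s) for s in StringGlobber().select(str(root), pattern))
```

In this sketch the string-based path avoids creating a path object for every directory entry that is inspected; objects are built only for the matches that are returned.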

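This sets the stage for two more improvements:

- GH-115060: Query non-wildcard segments with lstat()
- GH-116380: Unify pathlib and glob implementations of globbing.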

Timings (for each command, the first result is before this change and the second is after):

$ ./python -m timeit -s "from pathlib import Path; p = Path.cwd()" "list(p.glob('Lib/*'))"
1000 loops, best of 5: 392 usec per loop
1000 loops, best of 5: 365 usec per loop
# --> 1.07x faster

$ ./python -m timeit -s "from pathlib import Path; p = Path.cwd()" "list(p.glob('Lib/*.py'))"
1000 loops, best of 5: 393 usec per loop
1000 loops, best of 5: 371 usec per loop
# --> 1.06x faster

$ ./python -m timeit -s "from pathlib import Path; p = Path.cwd()" "list(p.glob('Lib/**'))"
50 loops, best of 5: 9.46 msec per loop
50 loops, best of 5: 9.06 msec per loop
# --> 1.04x faster

$ ./python -m timeit -s "from pathlib import Path; p = Path.cwd()" "list(p.glob('Lib/**/'))"
50 loops, best of 5: 4.98 msec per loop
50 loops, best of 5: 5.15 msec per loop
# --> 1.03x slower (!)

$ ./python -m timeit -s "from pathlib import Path; p = Path.cwd()" "list(p.glob('Lib/**/*'))"
20 loops, best of 5: 14 msec per loop
20 loops, best of 5: 12.9 msec per loop
# --> 1.09x faster

$ ./python -m timeit -s "from pathlib import Path; p = Path.cwd()" "list(p.glob('Lib/**/*.py'))"
20 loops, best of 5: 12.2 msec per loop
20 loops, best of 5: 11.4 msec per loop
# --> 1.07x faster
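
The ratios quoted in the comments above can be rechecked with a few lines of arithmetic. This sketch assumes that, for each command, the first timing is the baseline and the second is with this change; the `describe` helper is hypothetical. The numbers are copied from the results above, and units cancel in the ratio.

```python
# (before, after) timings copied from the results above.
timings = {
    "Lib/*": (392, 365),        # usec
    "Lib/*.py": (393, 371),     # usec
    "Lib/**": (9.46, 9.06),     # msec
    "Lib/**/": (4.98, 5.15),    # msec
    "Lib/**/*": (14, 12.9),     # msec
    "Lib/**/*.py": (12.2, 11.4),  # msec
}


def describe(before, after):
    # Express the change as a ratio of at least 1, labelled faster or slower.
    if after <= before:
        return f"{before / after:.2f}x faster"
    return f"{after / before:.2f}x slower"


for pattern, (before, after) in timings.items():
    print(f"{pattern:<12} {describe(before, after)}")
```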

Move pathlib globbing implementation to a new module and class:
`pathlib._glob.Globber`. This class implements fast string-based globbing.
It's called by `pathlib.Path.glob()`, which then converts strings back to
path objects.

In the private pathlib ABCs, add a `pathlib._abc.Globber` subclass that
works with `PathBase` objects rather than strings, and calls user-defined
path methods like `PathBase.stat()` rather than `os.stat()`.

This sets the stage for two more improvements:

- pythonGH-115060: Query non-wildcard segments with `lstat()`
- pythonGH-116380: Move `pathlib._glob` to `glob` (unify implementations).
@barneygale

This is the first PR in a series that will hopefully unify the globbing implementations in the pathlib and glob modules, and speed both up in the process.

barneygale commented Apr 7, 2024

Hey @serhiy-storchaka, does this PR look alright to you? Not requesting a detailed review, more of a sanity check, given you've looked after the glob module for the last few years.

This PR doesn't affect glob.[i]glob(), but it does move pathlib's globbing implementation into glob.py.

Thank you.

@barneygale

I'll merge this now as it's important for #115060, which I'm hoping to get done in time for 3.13 beta 1.

But I'll leave glob.glob() and glob.iglob() unchanged in 3.13; any PRs I make will target 3.14.

barneygale merged commit 6258844 into python:main on Apr 10, 2024
barneygale added a commit to barneygale/cpython that referenced this pull request Apr 10, 2024
Move `pathlib.Path.walk()` implementation into `glob._Globber`. The new
`glob._Globber.walk()` classmethod works with strings internally, which is
a little faster than generating `Path` objects and keeping them normalized.
The `pathlib.Path.walk()` method converts the strings back to path objects.

In the private pathlib ABCs, our existing subclass of `_Globber` ensures
that `PathBase` instances are used throughout.

Follow-up to python#117589.
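
The string-based walk described in this commit message can be pictured roughly as follows. This is a simplified sketch, not the `glob._Globber.walk()` code: the function names `walk_strings` and `walk_paths` are hypothetical, only top-down traversal is shown, symlinks to directories are treated as files, and unreadable directories are silently skipped.

```python
import os
from pathlib import Path


def walk_strings(top):
    """Top-down walk yielding (dirpath, dirnames, filenames) as plain strings."""
    stack = [top]
    while stack:
        dirpath = stack.pop()
        dirnames, filenames = [], []
        try:
            with os.scandir(dirpath) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        dirnames.append(entry.name)
                    else:
                        filenames.append(entry.name)
        except OSError:
            continue  # Unreadable directory: skipped in this sketch.
        yield dirpath, dirnames, filenames
        # Push children in reverse so they are visited in listing order.
        stack.extend(os.path.join(dirpath, name) for name in reversed(dirnames))


def walk_paths(top):
    """Convert to Path objects only at the boundary, once per directory."""
    for dirpath, dirnames, filenames in walk_strings(str(top)):
        yield Path(dirpath), dirnames, filenames
```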
barneygale added a commit that referenced this pull request Apr 11, 2024
…17726)

diegorusso pushed a commit to diegorusso/cpython that referenced this pull request Apr 17, 2024
…gs (python#117589)

Move pathlib globbing implementation into a new private class: `glob._Globber`. This class implements fast string-based globbing. It's called by `pathlib.Path.glob()`, which then converts strings back to path objects.

In the private pathlib ABCs, add a `pathlib._abc.Globber` subclass that works with `PathBase` objects rather than strings, and calls user-defined path methods like `PathBase.stat()` rather than `os.stat()`.

This sets the stage for two more improvements:

- pythonGH-115060: Query non-wildcard segments with `lstat()`
- pythonGH-116380: Unify `pathlib` and `glob` implementations of globbing.

No change to the implementations of `glob.glob()` and `glob.iglob()`.
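
One possible reading of the GH-115060 item above: a segment with no wildcard characters can name at most one child, so it can be checked with a single `lstat()` call instead of listing and filtering the parent directory. The sketch below illustrates that reading only; `select_segment` and `WILDCARDS` are hypothetical names, and this is not the code from that issue.

```python
import fnmatch
import os

# Characters that make a pattern segment a wildcard in fnmatch-style matching.
WILDCARDS = frozenset("*?[")


def select_segment(parent, segment):
    """Yield the children of `parent` that match one pattern segment."""
    if not WILDCARDS.intersection(segment):
        # Literal segment: one lstat() call, no directory listing needed.
        candidate = os.path.join(parent, segment)
        try:
            os.lstat(candidate)
        except OSError:
            return
        yield candidate
    else:
        # Wildcard segment: list the directory and filter the names.
        with os.scandir(parent) as entries:
            names = sorted(entry.name for entry in entries)
        for name in names:
            if fnmatch.fnmatchcase(name, segment):
                yield os.path.join(parent, name)
```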
diegorusso pushed a commit to diegorusso/cpython that referenced this pull request Apr 17, 2024
…gs (python#117726)

cjwatson added a commit to cjwatson/pypandoc that referenced this pull request Dec 8, 2024
As of python/cpython#117589 (at least),
`Path.glob` returns an `Iterator` rather than a `Generator` (which
inherits from `Iterator`).  `convert_file` doesn't need to care about
this distinction; it can reasonably accept both.

This previously caused a test failure along these lines:

  ______________________________________________________ TestPypandoc.test_basic_conversion_from_file_pattern_pathlib_glob _______________________________________________________

  self = <tests.TestPypandoc testMethod=test_basic_conversion_from_file_pattern_pathlib_glob>

      def test_basic_conversion_from_file_pattern_pathlib_glob(self):
          received_from_str_filename_input = pypandoc.convert_file("./*.md", 'html').lower()
  >       received_from_path_filename_input = pypandoc.convert_file(Path(".").glob("*.md"), 'html').lower()

  tests.py:654:
  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

  source_file = <map object at 0x7f83952c9420>, to = 'html', format = None, extra_args = (), encoding = 'utf-8', outputfile = None, filters = None, verify_format = True
  sandbox = False, cworkdir = '/home/cjwatson/src/python/pypandoc', sort_files = True
  [...]
          if not _identify_path(discovered_source_files):
  >           raise RuntimeError("source_file is not a valid path")
  E           RuntimeError: source_file is not a valid path

  pypandoc/__init__.py:201: RuntimeError
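
The `Iterator`/`Generator` distinction behind this failure can be reproduced without pathlib. The traceback above shows `source_file` arriving as a `map` object, which is an `Iterator` but not a `Generator`. In the sketch below, `accepts_lazy_sequence` is a hypothetical helper and not pypandoc's actual check; it only illustrates why testing against the broader `Iterator` ABC accepts both kinds of object.

```python
from collections.abc import Generator, Iterator


def squares():
    # A generator function: calling it returns a generator object.
    for n in range(3):
        yield n * n


gen = squares()
mapped = map(str, range(3))

# Generators are iterators, but not every iterator is a generator.
print(isinstance(gen, Generator), isinstance(gen, Iterator))
print(isinstance(mapped, Generator), isinstance(mapped, Iterator))


def accepts_lazy_sequence(value):
    # Checking against Iterator covers generators and map objects alike.
    return isinstance(value, Iterator)
```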
cjwatson added a commit to cjwatson/typeshed that referenced this pull request Dec 8, 2024
Since python/cpython#117589 (at least),
`Path.glob` and `Path.rglob` return an `Iterator` rather than a
`Generator`.
JessicaTegner added a commit to JessicaTegner/pypandoc that referenced this pull request Jan 8, 2025
Co-authored-by: Jessica Tegner <jessica@jessicategner.com>
srittau pushed a commit to python/typeshed that referenced this pull request Feb 28, 2025
mmingyu pushed a commit to mmingyu/typeshed that referenced this pull request May 16, 2025
Labels: performance (Performance or resource usage), topic-pathlib