Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Use cat_ranges in fsspec source #1162

Merged
merged 3 commits into from
Mar 20, 2024
Merged

Conversation

nsmith-
Copy link
Collaborator

@nsmith- nsmith- commented Mar 5, 2024

This allows filesystem implementations to use possibly more optimal request strategies, such as fsspec-xrootd's use of vector_read.

It's not clear to me if all fsspec implementations support this, and if it will generally perform better or not. At least for xrootd it is an improvement.

This allows filesystem implementations to use possibly more optimal
request strategies, such as fsspec-xrootd's use of vector_read.
@nsmith- nsmith- changed the title Use cat_ranges in fsspec source feat: Use cat_ranges in fsspec source Mar 5, 2024
@jpivarski
Copy link
Member

@lobis found that not all fsspec implementations support cat_ranges—I remember now that this did come up. If we have to check before we use it, is there a way to do that, @martindurant?

@nsmith-
Copy link
Collaborator Author

nsmith- commented Mar 5, 2024

Well I suppose hasattr(fs, "_cat_ranges") is a quick option

@nsmith- nsmith- requested a review from lobis March 5, 2024 23:09
@nsmith-
Copy link
Collaborator Author

nsmith- commented Mar 5, 2024

The vector read is actually broken in fsspec-xrootd, I fixed it in CoffeaTeam/fsspec-xrootd#56, so that is (should be) necessary for the network tests to pass

@lobis
Copy link
Collaborator

lobis commented Mar 6, 2024

@lobis found that not all fsspec implementations support cat_ranges—I remember now that this did come up. If we have to check before we use it, is there a way to do that, @martindurant?

Well I suppose hasattr(fs, "_cat_ranges") is a quick option

If I remember correctly there was a reason why I chose not to use _cat_ranges, not because it was not implemented in all sources.

I think it's because fetching all chunks in a single call is worse than being able to fetch each individually, even if both fetch concurrently. When you use the individual _cat_range you can process the chunk while some other is being fetched, while if you use a single call to _cat_ranges you need to wait until all are fetched.

@agoose77
Copy link
Collaborator

agoose77 commented Mar 6, 2024

@lobis @nsmith- I've not spent much time thinking about this, but is there a benefit to batching range requests into units of _cat_ranges?

@martindurant
Copy link

Agreed that _cat_ranges has no benefit over a bunch of _cat_file calls if you are running them concurrently in async land yourself. There is the potential to merge adjacent/nearly calls to cut down the total number of requests, but this is only implemented in referenceFS (in cat(), actually, to decide what ranges to get). cat_ranges (no leading underscore) is very much better, though, than many cat_files, since each of these calls would block.

We could:

  • implement a fallback cat_ranges in AbstractFileSystem to maintain compatibility
  • merge ranges in _cat_ranges when possible

you can process the chunk while some other is being fetched,

Not strictly true: you can process a chunk while another is waiting (latency period), but it can't actively read at the same time. All downloads involve some amount of GIL-holding decode. This is why IO pipelining like this, if it were to be implemented async, should probably not be in python. fsspec actually runs async in a thread, so you might get thrashing if there are many CPU-heavy tasks too.

@nsmith-
Copy link
Collaborator Author

nsmith- commented Mar 6, 2024

Agreed that _cat_ranges has no benefit over a bunch of _cat_file calls if you are running them concurrently in async land yourself.

Also agreed, modulo the deficiencies of the xrootd protocol, which requires setting up a remote "handle" per fetch, which adds overhead and server load. I think if we cache these "handles" the same way as other protocols cache connections, we'll be fine in this regard, but it needs to be measured. That is the subject of CoffeaTeam/fsspec-xrootd#54

you can process the chunk while some other is being fetched,

More specifically, you can run decompress in another thread while more data is being fetched, and that releases GIL. This is a benefit I agree is useful, but I am not sure if it outweights the costs in the case of xrootd.

@jpivarski
Copy link
Member

I think this is done and I just forgot to merge it. I'm going to update and enable auto-merge (which will probably fail because all of the Windows tests are failing now, across all PRs).

@jpivarski jpivarski enabled auto-merge (squash) March 20, 2024 16:52
@nsmith-
Copy link
Collaborator Author

nsmith- commented Mar 20, 2024

Ok, though I plan to follow this up with batching range requests into units of _cat_ranges as @agoose77 proposed, as well as some range coalescing logic

@martindurant
Copy link

https://github.com/fsspec/filesystem_spec/blob/master/fsspec/utils.py#L533 may help with that. You can see how it's used in ReferenceFileSystem

@jpivarski jpivarski disabled auto-merge March 20, 2024 17:29
@jpivarski
Copy link
Member

Do you want to do that in this PR?

(I have a fix for the Windows failures coming in #1178.)

@nsmith-
Copy link
Collaborator Author

nsmith- commented Mar 20, 2024

I think its fine to do that in a separate PR, unless @lobis objects

@jpivarski
Copy link
Member

Okay, I'll merge this for now. Thanks!

@jpivarski jpivarski merged commit e1cc99c into scikit-hep:main Mar 20, 2024
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants