feat: Use cat_ranges in fsspec source #1162

nsmith- · 2024-03-05T23:00:26Z

This allows filesystem implementations to use possibly more optimal request strategies, such as fsspec-xrootd's use of vector_read.

It's not clear to me if all fsspec implementations support this, and if it will generally perform better or not. At least for xrootd it is an improvement.

jpivarski · 2024-03-05T23:04:08Z

@lobis found that not all fsspec implementations support cat_ranges—I remember now that this did come up. If we have to check before we use it, is there a way to do that, @martindurant?

nsmith- · 2024-03-05T23:08:58Z

Well I suppose hasattr(fs, "_cat_ranges") is a quick option
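A minimal sketch of that check, for illustration. The helper name `supports_cat_ranges` and the two stand-in classes are hypothetical (they are not real fsspec filesystems), and the check only tells you the attribute exists, not that the implementation is faster:

```python
def supports_cat_ranges(fs):
    """Return True if `fs` exposes a callable `_cat_ranges` attribute."""
    return callable(getattr(fs, "_cat_ranges", None))


# Stand-in classes for illustration only (hypothetical, not fsspec classes).
class WithRanges:
    async def _cat_ranges(self, paths, starts, ends):
        return [b"" for _ in paths]


class WithoutRanges:
    pass
```

Using `callable(getattr(...))` instead of a bare `hasattr` also rules out the case where the attribute exists but is set to `None`.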

nsmith- · 2024-03-05T23:21:23Z

The vector read is actually broken in fsspec-xrootd. I fixed it in CoffeaTeam/fsspec-xrootd#56, so that fix is (or should be) necessary for the network tests to pass.

lobis · 2024-03-06T08:46:59Z

> @lobis found that not all fsspec implementations support cat_ranges—I remember now that this did come up. If we have to check before we use it, is there a way to do that, @martindurant?

> Well I suppose hasattr(fs, "_cat_ranges") is a quick option

If I remember correctly, I chose not to use _cat_ranges for a different reason, not because it was missing from some implementations.

I think it's because fetching all chunks in a single call is worse than being able to fetch each individually, even if both fetch concurrently. When you use the individual _cat_range you can process the chunk while some other is being fetched, while if you use a single call to _cat_ranges you need to wait until all are fetched.
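The difference can be sketched with plain asyncio, assuming a simulated `fetch` in place of a real ranged read (all names here are hypothetical and no fsspec call is involved). `per_chunk` hands back each chunk as soon as it arrives, while `single_call` emulates a batched call that returns only when every chunk is in:

```python
import asyncio


async def fetch(index, delay):
    """Stand-in for one ranged read; `delay` simulates network latency."""
    await asyncio.sleep(delay)
    return index, b"x" * 4


async def per_chunk(delays):
    """Issue one task per chunk and handle each as soon as it arrives."""
    tasks = [asyncio.create_task(fetch(i, d)) for i, d in enumerate(delays)]
    order = []
    for finished in asyncio.as_completed(tasks):
        index, _data = await finished
        order.append(index)  # processing could start here, before the rest arrive
    return order


async def single_call(delays):
    """Emulate one batched call: nothing is available until all chunks are."""
    results = await asyncio.gather(*(fetch(i, d) for i, d in enumerate(delays)))
    return [index for index, _ in results]
```

With delays such as `[0.3, 0.01, 0.15]`, `per_chunk` sees the chunks in arrival order, whereas `single_call` sees them in request order only after the slowest one completes.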

agoose77 · 2024-03-06T09:41:00Z

@lobis @nsmith- I've not spent much time thinking about this, but is there a benefit to batching range requests into units of _cat_ranges?

martindurant · 2024-03-06T14:08:30Z

Agreed that _cat_ranges has no benefit over a bunch of _cat_file calls if you are running them concurrently in async land yourself. There is the potential to merge adjacent or nearly adjacent ranges to cut down the total number of requests, but this is only implemented in ReferenceFileSystem (in cat(), actually, to decide what ranges to get). cat_ranges (no leading underscore) is much better, though, than many cat_file calls, since each of those calls would block.

We could:

- implement a fallback cat_ranges in AbstractFileSystem to maintain compatibility
- merge ranges in _cat_ranges when possible
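As a rough illustration of the first option, a fallback could simply loop over blocking `cat_file` calls. This is a sketch under assumptions, not fsspec's code: `cat_ranges_fallback` and the in-memory `DictFS` stand-in are hypothetical names, and only the `cat_file(path, start=..., end=...)` call shape is taken from fsspec.

```python
def cat_ranges_fallback(fs, paths, starts, ends):
    """Fetch each byte range with a separate blocking cat_file call."""
    if not (len(paths) == len(starts) == len(ends)):
        raise ValueError("paths, starts and ends must have the same length")
    return [fs.cat_file(p, start=s, end=e) for p, s, e in zip(paths, starts, ends)]


class DictFS:
    """Hypothetical in-memory stand-in that exposes only cat_file."""

    def __init__(self, files):
        self.files = files

    def cat_file(self, path, start=None, end=None):
        return self.files[path][start:end]
```

Such a fallback keeps the interface uniform but issues the requests one after another, so it preserves compatibility rather than performance.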

> you can process the chunk while some other is being fetched,

Not strictly true: you can process a chunk while another is waiting (latency period), but it can't actively read at the same time. All downloads involve some amount of GIL-holding decode. This is why IO pipelining like this, if it were to be implemented async, should probably not be in Python. fsspec actually runs async in a thread, so you might get thrashing if there are many CPU-heavy tasks too.

nsmith- · 2024-03-06T16:08:38Z

> Agreed that _cat_ranges has no benefit over a bunch of _cat_file calls if you are running them concurrently in async land yourself.

Also agreed, modulo the deficiencies of the xrootd protocol, which requires setting up a remote "handle" per fetch, which adds overhead and server load. I think if we cache these "handles" the same way as other protocols cache connections, we'll be fine in this regard, but it needs to be measured. That is the subject of CoffeaTeam/fsspec-xrootd#54

> you can process the chunk while some other is being fetched,

More specifically, you can run decompress in another thread while more data is being fetched, and that releases the GIL. This is a benefit I agree is useful, but I am not sure if it outweighs the costs in the case of xrootd.
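A minimal sketch of that pattern, assuming a simulated `fetch` and zlib-compressed chunks (both are stand-ins chosen for illustration, not the actual source code or ROOT compression). Each chunk is handed to a worker thread for decompression as soon as it arrives, while the event loop keeps waiting on the remaining fetches:

```python
import asyncio
import zlib
from concurrent.futures import ThreadPoolExecutor


async def fetch(payload, delay):
    """Stand-in for a ranged read that returns one compressed chunk."""
    await asyncio.sleep(delay)
    return payload


async def fetch_and_decompress(payloads, delays):
    """Decompress each chunk in a worker thread as soon as it is fetched."""
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=2) as pool:

        async def one(payload, delay):
            compressed = await fetch(payload, delay)
            # Runs off the event loop, so other fetches keep progressing meanwhile.
            return await loop.run_in_executor(pool, zlib.decompress, compressed)

        # gather returns results in request order, not arrival order.
        return await asyncio.gather(*(one(p, d) for p, d in zip(payloads, delays)))
```

Whether this overlap pays off depends on how much of the decompression actually runs without the GIL and on the per-request overhead of the protocol, which is the open question for xrootd.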

jpivarski · 2024-03-20T16:52:14Z

I think this is done and I just forgot to merge it. I'm going to update and enable auto-merge (which will probably fail because all of the Windows tests are failing now, across all PRs).

nsmith- · 2024-03-20T17:11:21Z

Ok, though I plan to follow this up with batching range requests into units of _cat_ranges as @agoose77 proposed, as well as some range coalescing logic

martindurant · 2024-03-20T17:21:26Z

https://github.com/fsspec/filesystem_spec/blob/master/fsspec/utils.py#L533 may help with that. You can see how it's used in ReferenceFileSystem
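For orientation, range coalescing can be sketched in a few lines. This is a generic illustration with a hypothetical `coalesce_ranges` helper and a hypothetical `max_gap` parameter; it is not the linked fsspec utility and not necessarily what the follow-up PR implements:

```python
def coalesce_ranges(ranges, max_gap=0):
    """Merge (start, stop) byte ranges whose gap is at most `max_gap` bytes."""
    merged = []
    for start, stop in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it instead of adding a request.
            merged[-1][1] = max(merged[-1][1], stop)
        else:
            merged.append([start, stop])
    return [tuple(r) for r in merged]
```

A larger `max_gap` means fewer requests at the cost of downloading some bytes that are never used, so the right value depends on the latency and bandwidth of the backend.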

jpivarski · 2024-03-20T17:30:30Z

Do you want to do that in this PR?

(I have a fix for the Windows failures coming in #1178.)

nsmith- · 2024-03-20T20:43:48Z

I think it's fine to do that in a separate PR, unless @lobis objects

jpivarski · 2024-03-20T22:13:15Z

Okay, I'll merge this for now. Thanks!

Use cat_ranges in fsspec source

1e07092

This allows filesystem implementations to use possibly more optimal request strategies, such as fsspec-xrootd's use of vector_read.

nsmith- changed the title ~~Use cat_ranges in fsspec source~~ feat: Use cat_ranges in fsspec source Mar 5, 2024

nsmith- requested a review from lobis March 5, 2024 23:09

nsmith- mentioned this pull request Mar 6, 2024

Performance degradation reading files via root:// #1157

Closed

Merge branch 'main' into cat_ranges

8393aaa

jpivarski enabled auto-merge (squash) March 20, 2024 16:52

jpivarski disabled auto-merge March 20, 2024 17:29

Merge branch 'main' into cat_ranges

de81c4f

jpivarski merged commit e1cc99c into scikit-hep:main Mar 20, 2024
21 checks passed

nsmith- mentioned this pull request May 10, 2024

feat: Implement read coalescing algorithm #1198

Merged
