-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Use cat_ranges in fsspec source #1162
Conversation
This allows filesystem implementations to use possibly more optimal request strategies, such as fsspec-xrootd's use of vector_read.
@lobis found that not all fsspec implementations support |
Well I suppose |
The vector read is actually broken in fsspec-xrootd, I fixed it in CoffeaTeam/fsspec-xrootd#56, so that is (should be) necessary for the network tests to pass |
If I remember correctly there was a reason why I chose not to use I think it's because fetching all chunks in a single call is worse than being able to fetch each individually, even if both fetch concurrently. When you use the individual |
Agreed that We could:
Not strictly true: you can process a chunk while another is waiting (latency period), but it can't actively read at the same time. All downloads involve some amount of GIL-holding decode. This is why IO pipelining like this, if it were to be implemented async, should probably not be in python. fsspec actually runs async in a thread, so you might get thrashing if there are many CPU-heavy tasks too. |
Also agreed, modulo the deficiencies of the xrootd protocol, which requires setting up a remote "handle" per fetch, which adds overhead and server load. I think if we cache these "handles" the same way as other protocols cache connections, we'll be fine in this regard, but it needs to be measured. That is the subject of CoffeaTeam/fsspec-xrootd#54
More specifically, you can run |
I think this is done and I just forgot to merge it. I'm going to update and enable auto-merge (which will probably fail because all of the Windows tests are failing now, across all PRs). |
Ok, though I plan to follow this up with batching range requests into units of |
https://github.com/fsspec/filesystem_spec/blob/master/fsspec/utils.py#L533 may help with that. You can see how it's used in ReferenceFileSystem |
Do you want to do that in this PR? (I have a fix for the Windows failures coming in #1178.) |
I think its fine to do that in a separate PR, unless @lobis objects |
Okay, I'll merge this for now. Thanks! |
This allows filesystem implementations to use possibly more optimal request strategies, such as fsspec-xrootd's use of vector_read.
It's not clear to me if all fsspec implementations support this, and if it will generally perform better or not. At least for xrootd it is an improvement.