Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance impact of opening a new XRootD File in every cat operation (TBasket in Uproot) #55

Closed
jpivarski opened this issue Mar 5, 2024 · 6 comments · Fixed by #54

Comments

@jpivarski
Copy link

Reported by @chrisburr in scikit-hep/uproot5#1157 (comment)_:

The rest of the difference is coming from the fsspec source opening the file twice:

[2024-03-05 23:00:45.735325 +0100][Debug  ][ExDbgMsg          ] [ccsrm.ihep.ac.cn:1094] MsgHandler created: 0x15e5ffa0 (message: kXR_open (file: /dpm/ihep.ac.cn/home/lhcb/LHCb/Collision18/LEPTONIC.MDST/00092248/0000/00092248_00002347_1.leptonic.mdst, mode: 00, flags: kXR_open_read kXR_async kXR_retstat ) ).
[2024-03-05 23:00:50.783995 +0100][Debug  ][ExDbgMsg          ] [ccsrm.ihep.ac.cn:1094] MsgHandler created: 0x15f25490 (message: kXR_stat (path: /dpm/ihep.ac.cn/home/lhcb/LHCb/Collision18/LEPTONIC.MDST/00092248/0000/00092248_00002347_1.leptonic.mdst, flags: none) ).
FSSpecSource.chunk(start=0, stop=403)
[2024-03-05 23:00:51.071420 +0100][Debug  ][ExDbgMsg          ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x1781f720 (message: kXR_read (handle: 0x00000000, offset: 0, size: 403) ).
FSSpecSource.chunk(start=3872048242, stop=3872048555)
[2024-03-05 23:00:51.376154 +0100][Debug  ][ExDbgMsg          ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x1781f510 (message: kXR_read (handle: 0x00000000, offset: 3872048242, size: 313) ).
FSSpecSource.chunk(start=3867679172, stop=3867679853)
[2024-03-05 23:00:51.682223 +0100][Debug  ][ExDbgMsg          ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x15f25740 (message: kXR_read (handle: 0x00000000, offset: 3867679172, size: 681) ).
FSSpecSource.chunks(ranges=[(3867678939, 3867679107)])
[2024-03-05 23:00:51.995422 +0100][Debug  ][ExDbgMsg          ] [ccsrm.ihep.ac.cn:1094] MsgHandler created: 0x23f04350 (message: kXR_open (file: /dpm/ihep.ac.cn/home/lhcb/LHCb/Collision18/LEPTONIC.MDST/00092248/0000/00092248_00002347_1.leptonic.mdst, mode: 074, flags: kXR_open_read kXR_async kXR_retstat ) ).
[2024-03-05 23:00:52.588776 +0100][Debug  ][ExDbgMsg          ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x15f25490 (message: kXR_read (handle: 0x01000000, offset: 3867678939, size: 168) ).
[2024-03-05 23:00:52.890174 +0100][Debug  ][ExDbgMsg          ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x17826850 (message: kXR_close (handle: 0x01000000) ).
Took 7.651638916009688 seconds
[2024-03-05 23:00:53.264742 +0100][Debug  ][ExDbgMsg          ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x27b2c1c0 (message: kXR_close (handle: 0x00000000) ).

The second one is caused by FSSpecSource.chunks calling _cat_file:

https://github.com/scikit-hep/uproot5/blob/e47934f32bd16439a2ca9e92428d2a9a4610a144/src/uproot/source/fsspec.py#L164

which opens a new file behind the scenes for every call:

_myFile = client.File()
try:
status, _n = await _async_wrap(
_myFile.open,
self.unstrip_protocol(path),
OpenFlags.READ,
self.timeout,
)

In my experience, these File objects are heavy; slow to open. fsspec's cat interface is stateless, so it seems that you have to create a new one of these for every call, but that means every TBasket in Uproot.

Is there an alternative that we can use, some multi_cat or a context that holds the File object so that we don't need so many? Is there a way to use XRootD in a lightweight, stateless way (like HTTP connections)?

@nsmith-
Copy link
Member

nsmith- commented Mar 5, 2024

So, I'm going to try to write up my performance study in scikit-hep/uproot5#1157 to collect all in one spot, but yes fsspec has cat_ranges which allows to bypass this particular multi-open issue. I also looked into caching open file handles in #54 but I think this is really unreliable because the server disallows opening a file for writing if any reader is connected, and this cache leaks reader handles. To do it right we need to use (async or regular) context managers for all file access.

@nsmith-
Copy link
Member

nsmith- commented Mar 5, 2024

Is there a way to use XRootD in a lightweight, stateless way (like HTTP connections)?

I spent a bit of time this morning reading the protocol doc https://xrootd.slac.stanford.edu/doc/dev56/XRdv520.pdf and I don't see any obvious way to interact with an xrootd server in a stateless way: one needs to acquire an opaque char[4] fHandle in kXR_open to then pass as an argument to kXR_read.

@chrisburr
Copy link
Contributor

I think this is really unreliable because the server disallows opening a file for writing if any reader is connected

Does anyone actually write ROOT files directly over XRootD? I understand why fsspec-xrootd would want to support it but for uproot a keyword argument on the filesystem class that enables the cache and prevents writing might be enough, at least as a short term fix.

@jpivarski
Copy link
Author

It is now technically possible, but I don't think it's a good idea. Writing a ROOT file involves a lot of seeking back and forth (they can't be written directly from beginning to end, unless the sizes of everything that is to be written is known in advance), and that would mean a lot of interaction over the network.

Since it was a requested feature, we can't break it, but we don't need to ensure that it is the optimal path. If reopening the file is necessary for writing but not for reading, that would be fine.

@nsmith-
Copy link
Member

nsmith- commented Mar 6, 2024

Writing files in general over xrootd is a very desired feature. For example, I am writing several GB of parquet files to FNAL EOS storage in my skim example . It works quite well. I would hope we extend uproot writing to support fsspec sinks, using the simplecache local cache feature to only write (commit) the whole file at the end of the writing process.

All that said, I am happy to re-start work on #54

@jpivarski
Copy link
Author

That's just the thing: the Parquet format is defined in such a way that all metadata that needs to know the sizes of things gets written after (at larger seek values) than the data it represents. With causal knowledge of only the past, it can be written from the beginning to the end of the file, in order. That can't be done with the ROOT format, especially if the file is to be valid between writing individual objects and if sizes of everything that will be written isn't known in advance.

@nsmith- nsmith- linked a pull request Jun 26, 2024 that will close this issue
@nsmith- nsmith- closed this as completed Jun 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants