Integrating MDS Streaming with HF Dataset Streaming #633
Comments
Hey, this would be great! What did you have in mind regarding the implementation -- what should be done on Streaming's side? |
It would be nice to stream datasets from HF using Streaming, e.g. supporting hf:// paths |
At HF we want to make the Hub more open and support more data formats and libraries. We recently added support for WebDataset for example, and there are hundreds of datasets in WebDataset format on the HF Hub already. Users can already upload data files in MDS format that they have locally using e.g. Anyway what I think is the most interesting is if Streaming could stream datasets in MDS formats from HF (e.g. using |
Just following up on this; @karan6181 @lhoestq -- my understanding is that the HF Hub exposes dataset repositories via an From the Mosaic Streaming perspective -- can I just upload MDS shards to a Hub repo, and use the corresponding Basically @karan6181 -- trying to figure out what "S3-compatible object store" really means under the hood vs. what the HF Hub natively supports. |
Yes that's correct ! |
@siddk It appears that the HF hub functions primarily as a cloud storage solution, accessible via the |
Hey @karan6181 -- I'm a bit swamped with upcoming paper deadlines right now, but would love to see this supported. I can try carving out time to work on things in a few weeks, but wouldn't mind your expert take on this. I think the broader HF community would really appreciate it as well! |
Included in v0.8.0 release |
Wow, amazing! Are there docs already on how to use it? Also, let me know if you plan to share this on social media; I'll be happy to re-share with the community! |
Hey @lhoestq, @orionw added support for storing MDS datasets on Hugging Face. The relevant section in the docs is here. Will ask internally about posting on socials! @orionw provided this simple script which shows off the new functionality:
|
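For readers looking for a starting point, here is a minimal sketch of what streaming MDS shards from a Hub dataset repo might look like. It is not the script mentioned above. It assumes `mosaicml-streaming` v0.8.0 or later and `huggingface_hub` are installed, that the repo already contains MDS shards plus their `index.json`, and that `hf://datasets/...` is the remote path form discussed in this thread. The repo name and the helpers `hf_remote` and `stream_from_hub` are hypothetical, introduced only for illustration.

```python
import tempfile


def hf_remote(repo_id: str, subdir: str = "") -> str:
    """Build an hf:// URI for a Hub dataset repo (hypothetical helper)."""
    repo_id = repo_id.strip("/")
    if not repo_id:
        raise ValueError("repo_id must be non-empty")
    uri = f"hf://datasets/{repo_id}"
    subdir = subdir.strip("/")
    return f"{uri}/{subdir}" if subdir else uri


def stream_from_hub(repo_id: str, subdir: str = "", batch_size: int = 1):
    """Open a StreamingDataset over MDS shards stored in a Hub repo.

    Assumption: the installed Streaming release accepts hf:// remotes
    (per this thread, v0.8.0 and later).
    """
    # Imported lazily so hf_remote() can be used without the library installed.
    from streaming import StreamingDataset

    # Shards are downloaded on demand into this local cache directory.
    local_cache = tempfile.mkdtemp(prefix="mds_cache_")
    return StreamingDataset(
        remote=hf_remote(repo_id, subdir),
        local=local_cache,
        shuffle=False,
        batch_size=batch_size,
    )


if __name__ == "__main__":
    # "my-org/my-mds-dataset" is a placeholder, not a real repo.
    print(hf_remote("my-org/my-mds-dataset", "train"))
```

Calling `stream_from_hub("my-org/my-mds-dataset", "train")` would then return a dataset that can be passed to a PyTorch `DataLoader`. Private repos may additionally need Hub credentials (for example a token in the environment); check the Streaming docs for the exact requirements.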
@lhoestq we tweeted here: https://x.com/DbrxMosaicAI/status/1818407826852921833 |
🚀 Feature Request
Hey folks - I've loved using `streaming` for some of my research in multimodal pretraining and robotics. One thing I'd love to support is first-class integration with HF Datasets (e.g., similar functionality to their WebDataset Streaming Integration). I've created an issue on HF Datasets here, and @lhoestq seems receptive to the idea. At a low level, I'm not sure about the best way to implement this support. Would love pointers or a chance to talk this through!
Motivation
Mosaic Streaming from MDS is fantastic for large-scale, reproducible pretraining! For some of my larger datasets, supporting the ability to stream MDS shards stored on HF Datasets while training would be fantastic.
Thanks!