Integrating MDS Streaming with HF Dataset Streaming #633

Closed
siddk opened this issue Mar 19, 2024 · 12 comments
Labels
enhancement New feature or request

Comments

@siddk

siddk commented Mar 19, 2024

🚀 Feature Request

Hey folks - I've loved using streaming for some of my research in multimodal pretraining and robotics. One thing I'd love to support is first-class integration with HF Datasets (e.g., similar functionality to their WebDataset Streaming Integration).

I've created an issue on HF Datasets here, and @lhoestq seems receptive to the idea. At a low level, I'm not sure about the best way to implement this support. Would love pointers, or a chance to talk this through!

Motivation

Mosaic Streaming from MDS is fantastic for large-scale, reproducible pretraining! For some of my larger datasets, supporting the ability to stream MDS shards stored on HF Datasets while training would be fantastic.

Thanks!

siddk added the enhancement label Mar 19, 2024

@snarayan21
Collaborator

Hey, this would be great! What did you have in mind regarding the implementation -- what should be done on Streaming's side?

@lhoestq

lhoestq commented Mar 21, 2024

It would be nice to stream datasets from HF using Streaming, e.g. by supporting hf:// paths.

@karan6181
Collaborator

@lhoestq Would it be possible for the user to upload the MDS shard files to hf:// paths? Or are you asking us to support the HF remote path with whatever underlying files it can contain, such as Parquet, JSONL, etc.?

@lhoestq

lhoestq commented Apr 4, 2024

At HF we want to make the Hub more open and support more data formats and libraries. We recently added support for WebDataset for example, and there are hundreds of datasets in WebDataset format on the HF Hub already.

Users can already upload data files in MDS format that they have locally using e.g. huggingface_hub. Maybe one day with the MDSWriter directly? That would be cool!

Anyway, what I think is most interesting is if Streaming could stream datasets in MDS format from HF (e.g. using hf:// paths). That would be useful to many researchers IMO.

@siddk
Author

siddk commented Apr 17, 2024

Just following up on this; @karan6181 @lhoestq -- my understanding is that the HF Hub exposes dataset repositories via an fsspec API: https://huggingface.co/docs/huggingface_hub/main/en/guides/hf_file_system

From the Mosaic Streaming perspective -- can I just upload MDS shards to a Hub repo, and use the corresponding hf:// path as a drop-in replacement for an s3:// path?

Basically @karan6181 -- trying to figure out what "S3-compatible object store" really means under the hood vs. what the HF Hub natively supports.
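
To make the comparison with s3:// concrete, here is a minimal sketch of what an hf:// dataset path encodes. Where an s3:// URL splits into a bucket and a key, an hf:// URL splits into a repo type, a repo id, and a path inside the repo. The names `HubPath` and `parse_hf_path` are hypothetical, and the sketch assumes the `hf://datasets/<org>/<repo>/<path>` layout with no revision suffix; it does not cover every form the Hub accepts.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class HubPath(NamedTuple):
    """Hypothetical container for the pieces of an hf:// dataset path."""
    repo_type: str
    repo_id: str
    path_in_repo: str


def parse_hf_path(remote: str) -> HubPath:
    """Split hf://datasets/<org>/<repo>/<path> into its components.

    Assumption: the repo id has the form <org>/<repo> and the URL
    carries no revision suffix.
    """
    parsed = urlparse(remote)
    if parsed.scheme != "hf":
        raise ValueError(f"Expected an hf:// path, got: {remote}")
    # urlparse places the first segment ("datasets") in netloc.
    if parsed.netloc != "datasets":
        raise ValueError(f"Only dataset repos are handled here: {remote}")
    segments = [s for s in parsed.path.split("/") if s]
    if len(segments) < 2:
        raise ValueError(f"Missing <org>/<repo> in: {remote}")
    repo_id = "/".join(segments[:2])
    path_in_repo = "/".join(segments[2:])
    return HubPath("dataset", repo_id, path_in_repo)
```

For example, `parse_hf_path("hf://datasets/orionweller/wikipedia_mds/index.json")` yields the repo id `orionweller/wikipedia_mds` and the in-repo path `index.json`.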

@lhoestq

lhoestq commented Apr 17, 2024

> Just following up on this; @karan6181 @lhoestq -- my understanding is that the HF Hub exposes dataset repositories via an fsspec API: https://huggingface.co/docs/huggingface_hub/main/en/guides/hf_file_system
>
> From the Mosaic Streaming perspective -- can I just upload MDS shards to a Hub repo, and use the corresponding hf:// path as a drop-in replacement for an s3:// path?

Yes, that's correct!

@karan6181
Collaborator

> Just following up on this; @karan6181 @lhoestq -- my understanding is that the HF Hub exposes dataset repositories via an fsspec API: https://huggingface.co/docs/huggingface_hub/main/en/guides/hf_file_system
>
> From the Mosaic Streaming perspective -- can I just upload MDS shards to a Hub repo, and use the corresponding hf:// path as a drop-in replacement for an s3:// path?
>
> Basically @karan6181 -- trying to figure out what "S3-compatible object store" really means under the hood vs. what the HF Hub natively supports.

@siddk It appears that the HF hub functions primarily as a cloud storage solution, accessible via the hf:// prefix. Integrating HF hub support into the streaming dataset should be straightforward. Do you have the capacity to implement HF hub backend support in the streaming dataset? You can model your work on the structure outlined in the PRs at #311 and #256. Please let us know if you have any questions—we're here to assist you.
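
As a rough illustration of what such a backend could look like, the sketch below maps a URL scheme to a download function and adds an `hf` entry. It is not the code from #311 or #256, and `download_from_hf`, `DOWNLOADERS`, and `pick_downloader` are hypothetical names. The only external call is `hf_hub_download` from `huggingface_hub`, imported lazily so the module loads without that package installed. It assumes paths of the form `hf://datasets/<org>/<repo>/<file>`.

```python
from typing import Callable, Dict
from urllib.parse import urlparse


def download_from_hf(remote: str, local_dir: str) -> str:
    """Hypothetical downloader: fetch one file from a Hub dataset repo.

    Assumes remote looks like hf://datasets/<org>/<repo>/<file>.
    Returns the local path of the downloaded file.
    """
    parts = remote.split("/", 5)
    if len(parts) != 6 or parts[:3] != ["hf:", "", "datasets"] or not parts[5]:
        raise ValueError(f"Unsupported hf:// path: {remote}")
    _, _, _, org, repo, filename = parts
    # Lazy import: huggingface_hub is only needed when a download happens.
    from huggingface_hub import hf_hub_download
    return hf_hub_download(
        repo_id=f"{org}/{repo}",
        filename=filename,
        repo_type="dataset",
        local_dir=local_dir,
    )


# Hypothetical registry: one entry per supported URL scheme.
DOWNLOADERS: Dict[str, Callable[[str, str], str]] = {
    "hf": download_from_hf,
}


def pick_downloader(remote: str) -> Callable[[str, str], str]:
    """Return the download function registered for the remote's scheme."""
    scheme = urlparse(remote).scheme
    if scheme not in DOWNLOADERS:
        raise ValueError(f"No downloader registered for scheme: {scheme!r}")
    return DOWNLOADERS[scheme]
```

Keeping the scheme lookup separate from the download logic means a new storage backend only needs one new function and one registry entry.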

@siddk
Author

siddk commented Apr 23, 2024

Hey @karan6181 -- I'm a bit swamped with upcoming paper deadlines right now, but would love to see this supported. I can try carving out time to work on things in a few weeks, but wouldn't mind your expert take on this. I think the broader HF community would really appreciate it as well!

@mvpatel2000
Contributor

Included in v0.8.0 release

@lhoestq

lhoestq commented Jul 30, 2024

Wow, amazing! Are there some docs already on how to use it?

Also, let me know if you plan to share this on social media; I'll be happy to re-share with the community!

@snarayan21
Collaborator

Hey @lhoestq, @orionw added support for streaming MDS datasets stored on Hugging Face. The relevant section in the docs is here. Will ask internally about posting on socials!

@orionw provided this simple script which shows off the new functionality:

from streaming import StreamingDataset

# Create a streaming dataset backed by MDS shards in a Hugging Face dataset repo
dataset = StreamingDataset(
    remote="hf://datasets/orionweller/wikipedia_mds/",
    shuffle=False,
    split=None,
    batch_size=1,
)

# Inspect the first sample, then stop
for sample in dataset:
    text = sample['text']
    sample_id = sample['id']  # renamed to avoid shadowing the built-in id()
    print(f"Text: {text}")
    print(f"ID: {sample_id}")
    break

@snarayan21
Collaborator

@lhoestq we tweeted here: https://x.com/DbrxMosaicAI/status/1818407826852921833
thanks!
