
feat: dump sent old.json after each successful upload #1

Closed · observingClouds opened this issue Jun 6, 2022 · 6 comments

@observingClouds

Hi @Jorropo,

Thank you so much for developing this package. I hope this is a good place to raise some suggestions/issues that I have.

One issue I ran into is that my filesystem (Lustre) does not support reflinks, so the creation of cars is quite slow. Because this process takes some time to finish for a dataset of ~1TB, especially when using the default estuary driver with its direct upload, I ran into trouble with the job scheduler on my HPC cluster. The scheduler just stops the job after a certain time before it has successfully finished.

Is there any way to restart the process without creating all the cars again (and probably failing again due to the time limit)?

Thank you!

@Jorropo (Owner) commented Jun 6, 2022

First, it is very likely the upload would not be faster even with reflinks.

The chunking of the files is pipelined with the upload, which means the performance that matters is that of whichever stage is slower, chunking or uploading (except for the first 32GiB and the last 32GiB).

Estuary is not fast (I get ~15MiB/s on France -> US uploads). Unless your disks are slower than that, the upload won't be faster, because the chunker still has to wait for the data to be uploaded.

Reflinking is important if you have a fast remote server or use -driver car.
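To picture the bottleneck: the two stages run concurrently with a small bounded queue between them, so backpressure makes the slower stage set the overall pace. A minimal Go sketch of that shape, purely illustrative and not linux2ipfs's actual pipeline code (the sleeps stand in for disk and network speed):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// A bounded channel couples the two stages: once the uploader
	// falls behind, the chunker blocks, so overall throughput is
	// set by whichever stage is slower.
	chunks := make(chan []byte, 1)

	go func() {
		defer close(chunks)
		for i := 0; i < 5; i++ {
			time.Sleep(100 * time.Millisecond) // stand-in for disk read + chunking
			chunks <- make([]byte, 1<<20)
		}
	}()

	for c := range chunks {
		time.Sleep(300 * time.Millisecond) // stand-in for the slower network upload
		fmt.Printf("uploaded %d bytes\n", len(c))
	}
}
```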

Secondly, about your issue:

The scheduler just stops the job after a certain time before it has successfully finished.

I have two questions:

  • What does "stops the job" mean? kill -9?
    (Because if it does that, imagine it being killed in the middle of a backup: you just lost the old.json content, so it's not much use anyway.)
  • And sorry if this is a dumb question, but can't you just make it not do that?

It would be possible to dump old.json after every successful upload if we attach a modtime per file instead of a global one for all of the content. A "snapshot" would then be taken after every 32GiB (at least, that is how big the car target is for estuary).

Having proper, more complex state recovery would be really hard with the current architecture; I would work on multithreaded traversal before working on that (if I ever work on it).
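To make the idea concrete, a hedged sketch of what a per-file record and an atomic old.json dump could look like. The field names and the dumpSnapshot helper are hypothetical, not linux2ipfs's actual schema:

```go
package main

import (
	"encoding/json"
	"os"
	"time"
)

// Hypothetical per-file record: carrying a modtime per file (instead
// of one global timestamp) makes each entry self-contained, so the
// whole map can be dumped after every successful car upload.
type sentEntry struct {
	Cid     string    `json:"cid"`
	ModTime time.Time `json:"modtime"`
}

// dumpSnapshot writes to a temp file and renames it over old.json,
// so a kill in the middle of the write cannot corrupt the last good
// snapshot.
func dumpSnapshot(sent map[string]sentEntry) error {
	f, err := os.CreateTemp(".", "old-*.json")
	if err != nil {
		return err
	}
	if err := json.NewEncoder(f).Encode(sent); err != nil {
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	return os.Rename(f.Name(), "old.json")
}

func main() {
	sent := map[string]sentEntry{
		"data/file1.bin": {Cid: "bafyexamplecid", ModTime: time.Now()},
	}
	if err := dumpSnapshot(sent); err != nil {
		panic(err)
	}
}
```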

@observingClouds (Author)

Thank you very much for your quick response. I experience similar upload limitations from Germany.

To your replies:

What does "stops the job" mean? kill -9?

Not necessarily; the job scheduler (SLURM) can send an arbitrary signal (e.g. SIGTERM) ahead of time before doing kill -9.
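As an aside, that grace signal is exactly what a tool could use as a flush trigger. A minimal Go sketch of trapping SIGTERM, under the assumption that the tool wants to dump its state before the hard kill (linux2ipfs is not confirmed to do this):

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	// SLURM can deliver e.g. SIGTERM some time before the final
	// kill -9 (which itself cannot be caught). Trap it and flush
	// state so the next run can resume.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGTERM)

	go func() {
		<-sig
		fmt.Println("SIGTERM received: flushing old.json before the hard kill")
		// flushState() // hypothetical flush hook
		os.Exit(0)
	}()

	select {} // stand-in for the real chunk-and-upload loop
}
```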

Can't you just make it not do that?

The compute resources are shared among users and, in an attempt to keep usage fair, jobs are only allowed to run for a certain amount of time (up to 8 hours in my case).

It would be possible to dump old.json after every successful upload if we attach a modtime per file instead of a global one for all of the content. A "snapshot" would then be taken after every 32GiB (at least, that is how big the car target is for estuary).

This seems like a good solution 👍
Currently, the creation of one 32GiB car takes about 10 minutes, so converting a 1TB file takes ~5 h (roughly 30 cars × 10 minutes). Having intermediate "snapshots" would help reduce the risk of having to rewrite the cars after a failure.

@Jorropo changed the title from "Restart of car creation" to "feat: dump sent old.json after each successful upload" on Jun 7, 2022
@Jorropo self-assigned this on Jun 7, 2022
@Jorropo (Owner) commented Jun 7, 2022

I'll probably work on this in the next few days.

Note to self: we cannot just dump old.json as-is, because it would save files we haven't uploaded yet.
A fix is to double-buffer the old mapping, so that what old.json records stays in step with what is actually inside the uploaded .car.
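A rough sketch of that double-buffering idea, with made-up names (the real implementation may differ):

```go
package main

// Double-buffered "sent" mapping: entries accumulate in pending while
// the current car is being built, and are only promoted to committed
// once that car has been uploaded successfully. Only committed is
// ever written to old.json, so a crash can never persist files that
// were not actually uploaded.
type tracker struct {
	committed map[string]string // path -> cid, safe to persist
	pending   map[string]string // in the current, not-yet-uploaded car
}

func (t *tracker) add(path, cid string) {
	t.pending[path] = cid
}

// commit runs after a successful upload: promote pending entries,
// then dump committed (see the snapshot sketch above).
func (t *tracker) commit() {
	for p, c := range t.pending {
		t.committed[p] = c
	}
	t.pending = map[string]string{}
}

func main() {
	t := &tracker{
		committed: map[string]string{},
		pending:   map[string]string{},
	}
	t.add("data/file1.bin", "bafyexamplecid")
	t.commit() // car uploaded: file1 is now safe to record in old.json
}
```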

@Jorropo (Owner) commented Jun 14, 2022

@observingClouds I have implemented this.

Can you please retry with the current master (0297e30)?

Performance might take a slight hit (it is not programmed very efficiently), but it should be mostly fine since the whole thing is massively bottlenecked by the network anyway.

I might move that to yet another background job.

@observingClouds (Author)

Thank you so much @Jorropo! It does seem to work 🚀 I was just tricked by the fact that the numbering of the output cars started again at 1 (out.1.car) after a restart and began overwriting the cars from the first run. This is of course only an issue when using the car driver to create local cars, not with the estuary driver.

@Jorropo (Owner) commented Jun 15, 2022

It does seem to work

🎉

I was just tricked by the fact that the numbering of the output cars started again at 1 after a restart and began overwriting the cars from the first run.

Silently overwriting previous files is an issue; I'll fix it in #4.

FYI, you can specify a pattern when using the car driver. So you could do this (%d gets replaced by the current output car number):

  • linux2ipfs -driver car-out.run.1.%d.car files
  • linux2ipfs -driver car-out.run.2.%d.car files

But I'll just fix it so it logs something and skips to the next file.
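That fix, skipping to the next free index instead of overwriting, could look roughly like this. A sketch only, not the actual patch in #4 (openNextCar is a hypothetical helper):

```go
package main

import (
	"fmt"
	"os"
)

// openNextCar tries indices until it finds a free one, creating the
// file with O_EXCL so an existing out.N.car is never truncated.
func openNextCar(pattern string) (*os.File, error) {
	for i := 1; ; i++ {
		name := fmt.Sprintf(pattern, i)
		f, err := os.OpenFile(name, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o644)
		if os.IsExist(err) {
			fmt.Printf("%s already exists, skipping to the next index\n", name)
			continue
		}
		return f, err
	}
}

func main() {
	f, err := openNextCar("out.%d.car")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	fmt.Println("writing", f.Name())
}
```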
