
feat: dump sent old.json after each successful upload #1

Closed · observingClouds opened this issue Jun 6, 2022 · 6 comments

@observingClouds

Hi @Jorropo,

Thank you so much for developing this package. I hope this is a good place to raise some suggestions/issues that I have.

One issue I ran into is that my filesystem (Lustre) does not support reflinks, so the creation of cars is quite slow. Because this process takes some time to finish for a dataset of ~1TB, especially when using the default estuary driver with its direct upload, I ran into trouble with the job scheduler on my HPC cluster. The scheduler just stops the job after a certain time before it has successfully finished.

Is there any way to restart the process without creating all the cars again (and probably failing again due to the time limit)?

Thank you!

@Jorropo (Owner) commented Jun 6, 2022

First, it is very likely the upload would not be faster even with reflinks.

The chunking of the files is pipelined with the upload, which means the performance that matters is that of whichever stage is slower, chunking or uploading (except for the first 32GiB and the last 32GiB).

Estuary is not fast (I get ~15MiB/s on France -> US uploads). Unless your disks are slower than that, the upload won't be faster, because the chunker still has to wait for the data to be uploaded.

Reflinking is important if you have a fast remote server or use -driver car.
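To picture the bottleneck: the two stages run concurrently with a small bounded queue between them, so backpressure makes the slower stage set the overall pace. A minimal Go sketch of that shape, purely illustrative and not linux2ipfs's actual pipeline code (the sleeps stand in for disk and network speed):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// A bounded channel couples the two stages: once the uploader
	// falls behind, the chunker blocks, so overall throughput is
	// set by whichever stage is slower.
	chunks := make(chan []byte, 1)

	go func() {
		defer close(chunks)
		for i := 0; i < 5; i++ {
			time.Sleep(100 * time.Millisecond) // stand-in for disk read + chunking
			chunks <- make([]byte, 1<<20)
		}
	}()

	for c := range chunks {
		time.Sleep(300 * time.Millisecond) // stand-in for the slower network upload
		fmt.Printf("uploaded %d bytes\n", len(c))
	}
}
```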

Secondly, about your issue:

The scheduler just stops the job after a certain time before it has successfully finished.

I have two questions:

  • What does "stops the job" mean? kill -9?
    (Because if it does that, imagine it being killed in the middle of a backup: you just lost the old.json content, so it's not much use anyway.)
  • And sorry if this is a dumb question, but can't you just make it not do that?

It would be possible to dump old.json after every successful upload if we attach a modtime per file instead of a global one for all of the content. A "snapshot" would then be taken after every 32GiB (at least, that is how big the car target is for estuary).

Having proper, more complex state recovery would be really hard with the current architecture; I would work on multithreaded traversal before working on that (if I ever work on it).
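To make the idea concrete, a hedged sketch of what a per-file record and an atomic old.json dump could look like. The field names and the dumpSnapshot helper are hypothetical, not linux2ipfs's actual schema:

```go
package main

import (
	"encoding/json"
	"os"
	"time"
)

// Hypothetical per-file record: carrying a modtime per file (instead
// of one global timestamp) makes each entry self-contained, so the
// whole map can be dumped after every successful car upload.
type sentEntry struct {
	Cid     string    `json:"cid"`
	ModTime time.Time `json:"modtime"`
}

// dumpSnapshot writes to a temp file and renames it over old.json,
// so a kill in the middle of the write cannot corrupt the last good
// snapshot.
func dumpSnapshot(sent map[string]sentEntry) error {
	f, err := os.CreateTemp(".", "old-*.json")
	if err != nil {
		return err
	}
	if err := json.NewEncoder(f).Encode(sent); err != nil {
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	return os.Rename(f.Name(), "old.json")
}

func main() {
	sent := map[string]sentEntry{
		"data/file1.bin": {Cid: "bafyexamplecid", ModTime: time.Now()},
	}
	if err := dumpSnapshot(sent); err != nil {
		panic(err)
	}
}
```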

@observingClouds (Author)

Thank you very much for your quick response. I experience similar upload limitations from Germany.

To your replies:

What does "stops the job" mean? kill -9?

Not necessarily; the job scheduler (SLURM) can send an arbitrary signal (e.g. SIGTERM) ahead of time before doing kill -9.
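As an aside, that grace signal is exactly what a tool could use as a flush trigger. A minimal Go sketch of trapping SIGTERM, under the assumption that the tool wants to dump its state before the hard kill (linux2ipfs is not confirmed to do this):

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	// SLURM can deliver e.g. SIGTERM some time before the final
	// kill -9 (which itself cannot be caught). Trap it and flush
	// state so the next run can resume.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGTERM)

	go func() {
		<-sig
		fmt.Println("SIGTERM received: flushing old.json before the hard kill")
		// flushState() // hypothetical flush hook
		os.Exit(0)
	}()

	select {} // stand-in for the real chunk-and-upload loop
}
```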

Can't you just make it not do that?

The compute resources are shared among users and, in an attempt to keep usage fair, jobs are only allowed to run for a certain amount of time (up to 8 hours in my case).

It would be possible to dump old.json after every successful upload if we attach a modtime per file instead of a global one for all of the content. A "snapshot" would then be taken after every 32GiB (at least, that is how big the car target is for estuary).

This seems like a good solution 👍
Currently, the creation of one 32GiB car takes about 10 minutes, so converting a 1TB file takes ~5 h (roughly 30 cars × 10 minutes). Having intermediate "snapshots" would help reduce the risk of having to rewrite the cars after a failure.

@Jorropo changed the title from "Restart of car creation" to "feat: dump sent old.json after each successful upload" on Jun 7, 2022
@Jorropo self-assigned this on Jun 7, 2022
@Jorropo (Owner) commented Jun 7, 2022

I'll probably work on this in the next few days.

Note to self: we cannot just dump old.json as-is, because it would save files we haven't uploaded yet.
A fix is to double-buffer the old mapping, so that what old.json records stays in step with what is actually inside the uploaded .car.
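A rough sketch of that double-buffering idea, with made-up names (the real implementation may differ):

```go
package main

// Double-buffered "sent" mapping: entries accumulate in pending while
// the current car is being built, and are only promoted to committed
// once that car has been uploaded successfully. Only committed is
// ever written to old.json, so a crash can never persist files that
// were not actually uploaded.
type tracker struct {
	committed map[string]string // path -> cid, safe to persist
	pending   map[string]string // in the current, not-yet-uploaded car
}

func (t *tracker) add(path, cid string) {
	t.pending[path] = cid
}

// commit runs after a successful upload: promote pending entries,
// then dump committed (see the snapshot sketch above).
func (t *tracker) commit() {
	for p, c := range t.pending {
		t.committed[p] = c
	}
	t.pending = map[string]string{}
}

func main() {
	t := &tracker{
		committed: map[string]string{},
		pending:   map[string]string{},
	}
	t.add("data/file1.bin", "bafyexamplecid")
	t.commit() // car uploaded: file1 is now safe to record in old.json
}
```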

@Jorropo (Owner) commented Jun 14, 2022

@observingClouds I have implemented this.

Can you please retry with the current master (0297e30)?

Performance might take a slight hit (it is not programmed very efficiently), but it should be mostly fine since the whole thing is massively bottlenecked by the network anyway.

I might move that to yet another background job.

@observingClouds (Author)

Thank you so much @Jorropo! It does seem to work 🚀 I was just tricked by the fact that the numbering of the output cars started again at 1 (out.1.car) after a restart and began overwriting the cars from the first run. This is of course only an issue when using the car driver to create local cars, not with the estuary driver.

@Jorropo (Owner) commented Jun 15, 2022

It does seem to work

🎉

I was just tricked by the fact that the numbering of the output cars started again at 1 after a restart and began overwriting the cars from the first run.

Silently overwriting previous files is an issue; I'll fix it in #4.

FYI, you can specify a pattern when using the car driver. So you could do this (%d gets replaced by the current output car number):

  • linux2ipfs -driver car-out.run.1.%d.car files
  • linux2ipfs -driver car-out.run.2.%d.car files

But I'll just fix it so it logs something and skips to the next file.
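That fix, skipping to the next free index instead of overwriting, could look roughly like this. A sketch only, not the actual patch in #4 (openNextCar is a hypothetical helper):

```go
package main

import (
	"fmt"
	"os"
)

// openNextCar tries indices until it finds a free one, creating the
// file with O_EXCL so an existing out.N.car is never truncated.
func openNextCar(pattern string) (*os.File, error) {
	for i := 1; ; i++ {
		name := fmt.Sprintf(pattern, i)
		f, err := os.OpenFile(name, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o644)
		if os.IsExist(err) {
			fmt.Printf("%s already exists, skipping to the next index\n", name)
			continue
		}
		return f, err
	}
}

func main() {
	f, err := openNextCar("out.%d.car")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	fmt.Println("writing", f.Name())
}
```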
