feat: dump sent old.json after each successful upload #1
Comments
First, it is very likely the upload would not be faster even with reflinks. The chunking of the files is pipelined with the upload, which means the performance that matters is whichever is slower between chunking and uploading (except for the first 32GiB and the last 32GiB). Estuary is not fast (I get ~15MiB/s on France -> US uploads). Unless your disk is slower than that, the upload won't be faster, because the chunker still has to wait for the data to be uploaded. Reflinking is important if you have a fast remote server or use …
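A minimal sketch of that pipelining model, assuming a chunker goroutine feeding an uploader over a channel (illustrative only, not the tool's actual code):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// A tiny buffer keeps the two stages in lockstep: overall throughput
	// is bounded by whichever stage is slower.
	chunks := make(chan []byte, 1)

	go func() { // chunker stage
		defer close(chunks)
		for i := 0; i < 4; i++ {
			time.Sleep(100 * time.Millisecond) // pretend disk read + chunking
			chunks <- make([]byte, 1<<20)
		}
	}()

	for c := range chunks { // upload stage (the bottleneck here)
		time.Sleep(300 * time.Millisecond) // pretend a slow Estuary upload
		fmt.Printf("uploaded %d bytes\n", len(c))
	}
}
```

With a 100ms chunker and a 300ms uploader, the run takes roughly 4×300ms; making the chunker faster (e.g. via reflinks) would change nothing here.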
Secondly, about your issue, I have two questions:
It would be possible to dump old.json after each successful upload. Having proper, more complex state recovery would be really hard with the current architecture; I would work on multithreaded traversal before working on that (if I ever work on it).
Thank you very much for your quick response. I experience similar upload limitations from Germany. To your replies:
Not necessarily; the job scheduler (SLURM) does allow sending any signal (e.g. SIGTERM) ahead of time before killing the job (see the sketch after this comment).
The compute resources are shared among users, and in an attempt to keep usage fair, jobs are only allowed to run for a certain amount of time (up to 8 hours in my case).
This seems like a good solution 👍
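On SLURM's side, an option along the lines of `#SBATCH --signal=B:TERM@120` requests such an advance signal a couple of minutes before the time limit. A minimal sketch of catching it in Go, assuming a state dump like the one sketched further down (this is not how the tool is actually wired):

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	term := make(chan os.Signal, 1)
	signal.Notify(term, syscall.SIGTERM)

	go func() {
		<-term
		// Hypothetical: flush progress here so a restart can resume
		// instead of re-chunking and re-uploading everything.
		fmt.Println("SIGTERM received, dumping state before SLURM kills the job")
		os.Exit(0)
	}()

	select {} // placeholder for the normal chunk + upload loop
}
```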
I'll probably work on this in the next few days. Note to self: we cannot just dump old.json, because it would save files we haven't uploaded yet.
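A minimal sketch of a dump that avoids that pitfall: only confirmed uploads are written, and the file is replaced atomically so a crash mid-write never leaves a truncated old.json behind (the Entry type and dumpState name are assumptions, not the project's actual code):

```go
package main

import (
	"encoding/json"
	"os"
)

// Entry is a hypothetical record of one file whose upload was confirmed.
type Entry struct {
	Path string `json:"path"`
	CID  string `json:"cid"`
}

// dumpState writes only confirmed uploads to a temp file, then renames it
// over old.json; the rename is atomic on POSIX filesystems.
func dumpState(path string, confirmed []Entry) error {
	data, err := json.Marshal(confirmed)
	if err != nil {
		return err
	}
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o644); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}

func main() {
	// Usage sketch: call after each successful upload completes.
	_ = dumpState("old.json", []Entry{{Path: "a.nc", CID: "bafy..."}})
}
```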
@observingClouds I have implemented this. Can you please retry with the current master (0297e30)? Performance might take a slight hit (since it's not very efficiently programmed); I might move that to yet another background job.
Thank you so much @Jorropo! It does seem to work 🚀 I was just tricked by the fact that the numbering of the output cars started again with 1 (…).
🎉
Silently overwriting previous files is an issue; I'll fix it: #4. FYI, you can specify a pattern when using the car driver.
But I'll just fix it so it logs something and skips to the next file.
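A minimal sketch of that log-and-skip behavior, using an exclusive create so an existing car is never silently overwritten (illustrative names, not the actual car driver):

```go
package main

import (
	"fmt"
	"log"
	"os"
)

// createCar refuses to overwrite: O_EXCL makes the open fail if the file
// already exists, so we can log it and move on to the next name.
func createCar(name string) (*os.File, bool) {
	f, err := os.OpenFile(name, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o644)
	if os.IsExist(err) {
		log.Printf("%s already exists, skipping to the next file", name)
		return nil, false
	}
	if err != nil {
		log.Fatal(err)
	}
	return f, true
}

func main() {
	for i := 1; i <= 3; i++ {
		if f, ok := createCar(fmt.Sprintf("out-%d.car", i)); ok {
			f.Close()
		}
	}
}
```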
Hi @Jorropo,
Thank you so much for developing this package. I hope this is a good place to raise some suggestions/issues that I have.
One issue I run into is that my filesystem (Lustre) does not support reflinks, so the creation of cars is quite slow. Because this process takes some time to finish for a dataset of ~1TB, especially when using the default estuary driver with its direct upload, I ran into issues with the job scheduler on my HPC cluster: the scheduler just stops the job after a certain time, before it has successfully finished.
Is there any way to restart the process without creating all the cars again (and probably failing again due to the time limits)?
Thank you!