[WIP] Merge idist into master #1045
Conversation
* add utils for distributed
* autopep8 fix
* [WIP] Added comp models and tests
* [WIP] Added create_from_backend and create_from_context for _DistModel
* [WIP] Added spawn to _DistModel
* [WIP] Refactored comp models and added spawn for xla
* autopep8 fix
* Improved tests
* autopep8 fix
* Fixes flake8
* autopep8 fix
* Removed is_distributed, renamed _DistModel -> _NativeDistModel
* autopep8 fix
* Added docs and tests for xla spawn
* Fixes conftest bug
* autopep8 fix
* Updates
* autopep8 fix
* Fixes available_backends bug
* autopep8 fix
* Fixed tests

Co-authored-by: Desroziers <sylvain.desroziers@ifpen.fr>
Co-authored-by: AutoPEP8 <>
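The commit list above introduces the computation models and a spawn helper. Below is a minimal, hedged sketch of how the resulting `idist.spawn` API is meant to be used; the argument names follow the `ignite.distributed` documentation and may differ slightly from this PR's intermediate state.

```python
# Sketch only: spawn 4 "gloo" processes that each run `training` with their
# own local rank. `nproc_per_node` and the argument order are assumptions.
import ignite.distributed as idist


def training(local_rank, config):
    # every spawned process lands here with its own local rank
    print(f"rank {idist.get_rank()}/{idist.get_world_size()} on {idist.device()}: {config}")


if __name__ == "__main__":
    idist.spawn("gloo", training, args=({"lr": 0.01},), nproc_per_node=4)
```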
* [WIP] Updates docs
* Adapted metrics with idist, fixed tests - added local rank estimation with hostname heuristics
* autopep8 fix
* Adapted metrics code and tests to use idist
* autopep8 fix
* Updated docs, docstrings
* Updated xla tests and fixed a bug with tensor dtype
* autopep8 fix
* Fixed all_gather using all_reduce op
* autopep8 fix
* Improved tests of create_supervised_trainer on TPU

Co-authored-by: AutoPEP8 <>
* [WIP] Make accumulation tests on TPU(s)
* Fixed tests with accumulation metric by decreasing tolerance
* Fixed all_gather bug
* autopep8 fix
* Added metric tests for xla
* autopep8 fix
* Fixed bug in test of precision - updated other regression tests
* Fixed failing tests on TPU - increased error tolerance

Co-authored-by: AutoPEP8 <>
* add TPU checkpointing to CPU
* autopep8 fix
* update docstring to include TPU notice
* add skip for non-TPU tests
* autopep8 fix
* refactor to use idist API
* autopep8 fix
* add complex save with TPU
* autopep8 fix
* fix tests
* fix typo in docstring

Co-authored-by: vfdev <vfdev.5@gmail.com>
Co-authored-by: AutoPEP8 <>
Co-authored-by: Sylvain Desroziers <sylvain.desroziers@gmail.com>
Co-authored-by: vfdev <vfdev.5@gmail.com>
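The "add TPU checkpointing to CPU" commits above are about making checkpoints written from a TPU loadable elsewhere. The snippet below only illustrates that general idea (moving the state dict to CPU before serializing); it is not the PR's actual implementation, and `save_cpu_checkpoint` is a hypothetical helper name.

```python
import torch


def save_cpu_checkpoint(model: torch.nn.Module, path: str) -> None:
    # Move every tensor to CPU so the file can be re-loaded on a machine
    # without an XLA/TPU device.
    cpu_state = {name: param.cpu() for name, param in model.state_dict().items()}
    torch.save(cpu_state, path)
```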
* Added barrier op in idist
* Fixed test and updated one_rank_only to use idist
* Moved one_rank_only to idist, adapted tests
* autopep8 fix
* Removed redundant imports
* Another test fix of setup_logger

Co-authored-by: AutoPEP8 <>
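For the barrier and one_rank_only helpers mentioned above, a minimal usage sketch (assuming the decorator keeps its `rank`/`with_barrier` keywords; names may have changed during review):

```python
import ignite.distributed as idist


@idist.one_rank_only(rank=0, with_barrier=True)
def save_report(text: str) -> None:
    # runs on rank 0 only; with_barrier=True makes the other ranks wait
    print(text)


def sync_point() -> None:
    idist.barrier()  # explicit synchronization across all processes
```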
* idist.device() returns torch.device('cuda') in a non-distributed configuration when a CUDA device is available
* Improved setup_common_training_handlers - no need to handle train_sampler when idist.model_name() is not "serial" but train_sampler was not set up as distributed because there is only 1 process; warn only if train_sampler has set_epoch
Something makes things get stuck on TPUs in
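As a quick illustration of the device behaviour described in the commit list above (a sketch, not the PR's test code): in a non-distributed configuration with CUDA available, `idist.device()` is expected to resolve to `torch.device('cuda')`.

```python
import torch
import ignite.distributed as idist

device = idist.device()  # cuda if available and non-distributed, otherwise cpu / xla
model = torch.nn.Linear(16, 4).to(device)
```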
* Improve tests on XLA
* Fixes xla test when spawn without 'fork'
* Added test of dtype for XLA
* Added support for str input for all_gather
* More tests for better coverage
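A hedged sketch of the string support for all_gather mentioned above, assuming the helper is exposed as `ignite.distributed.all_gather` and returns a per-rank list for strings:

```python
import ignite.distributed as idist

# Tensors and numbers come back gathered into a tensor; strings are expected
# to come back as a list with one entry per process.
messages = idist.all_gather(f"payload from rank {idist.get_rank()}")
print(messages)
```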
* issue_1055
* autopep8 fix
* decorate and refactor getattr
* remove decoration - need further discussions
* Added missing decorator for plx
* Added note about dist-friendly interface
* Updated Checkpoint to dist config + TPU
* autopep8 fix
* [WIP] Checkpoint in dist config
* autopep8 fix
* [WIP] Checkpoint in dist config
* autopep8 fix
* [WIP] Checkpoint on XLA
* autopep8 fix
* Fix checkpoint tests on XLA
* Put back Loggers as dist-unfriendly + tests for contrib savers
* Updated tests for XLA - removed neptune xla tests
* autopep8 fix
* minor fix for coverage
* [WIP] New XLA tests for trains logger
* Fixed distrib tests for trains

Co-authored-by: Desroziers <sylvain.desroziers@ifpen.fr>
Co-authored-by: AutoPEP8 <>
Co-authored-by: vfdev-5 <vfdev.5@gmail.com>
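The Checkpoint-related commits above make checkpointing usable in distributed and TPU configurations. The snippet below is a plain usage sketch, assuming Checkpoint/DiskSaver decide internally where and by which rank the file is written; it does not show the internal changes themselves.

```python
from ignite.engine import Engine, Events
from ignite.handlers import Checkpoint, DiskSaver


def train_step(engine, batch):
    return 0.0  # placeholder training step


trainer = Engine(train_step)
handler = Checkpoint(
    {"trainer": trainer},
    DiskSaver("/tmp/checkpoints", create_dir=True),
    n_saved=2,
)
trainer.add_event_handler(Events.EPOCH_COMPLETED, handler)
```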
Looks good.
* remove useless barriers
* Fix failing tests
* Added missing barrier in test for XLA

Co-authored-by: Desroziers <sylvain.desroziers@ifpen.fr>
Co-authored-by: vfdev-5 <vfdev.5@gmail.com>
Ok!
I will request changes for now, but I'm happy to change my review to an approve to expedite RC. If we have users who can test out these features, that would be very nice, too. 😄
```diff
-            return getattr(mlflow, attr)(*args, **kwargs)
-
-        return wrapper
+        return getattr(mlflow, attr)
```
Same comment as above - would be nice to replace this if we can.
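For context, the pattern under discussion is attribute delegation to the wrapped tracking module. Roughly (an illustrative sketch, not the logger's actual code; `MLflowProxy` is a made-up name):

```python
import mlflow


class MLflowProxy:
    """Forwards any unknown attribute access to the mlflow module, so
    proxy.log_params(...) ends up calling mlflow.log_params(...)."""

    def __getattr__(self, attr):
        return getattr(mlflow, attr)
```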
```diff
-            return getattr(neptune, attr)(*args, **kwargs)
-
-        return wrapper
+        return getattr(neptune, attr)
```
And here.
```diff
-            return getattr(self.experiment, attr)(*args, **kwargs)
-
-        return wrapper
+        return getattr(self.experiment, attr)
```
And here.
My requested changes are fairly superficial, and they mostly concern the experiment loggers rather than the "core" code, so I'm happy to 'ok' this if needed; this is a very impressive PR and kudos to you and @sdesrozis for working so hard. 👏
👍🏻
Todo:

* idist.device() output uniform torch.device (Improved device() method #1062)
* tests with parallel_api on cifar10

Description:
Introduces distributed module to handle GPU, CPU, XLA distributed config.
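A minimal sketch of what the new module exposes, assuming the idist helper names used throughout this PR (backend, device, rank/world size); the merged API may differ in detail:

```python
import ignite.distributed as idist

print("backend:", idist.backend())    # None when not running distributed
print("device:", idist.device())      # cpu / cuda / xla depending on the configuration
print("rank:", idist.get_rank(), "world size:", idist.get_world_size())
```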
Check list: