Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Merge idist into master #1045

Merged
merged 54 commits into from
May 31, 2020
Merged

[WIP] Merge idist into master #1045

merged 54 commits into from
May 31, 2020

Conversation

vfdev-5
Copy link
Collaborator

@vfdev-5 vfdev-5 commented May 16, 2020

Todo:

Description:
Introduces distributed module to handle GPU, CPU, XLA distributed config.

Check list:

  • New tests are added (if a new feature is added)
  • New doc strings: description and/or example code are in RST format
  • Documentation is updated (if required)

vfdev-5 and others added 24 commits May 11, 2020 17:58
* add utils for distributed

* autopep8 fix

* [WIP] Added comp models and tests

* [WIP] Added create_from_backend and create_from_context for _DistModel

* [WIP] Added spawn to _DistModel

* [WIP] Refactored comp models and added spawn for xla

* autopep8 fix

* Improved tests

* autopep8 fix

* Fixes flake8

* autopep8 fix

* - Removed is_distributed
- Renamed _DistModel -> _NativeDistModel

* autopep8 fix

* Added docs and tests for xla spawn

* Fixes conftest bug

* autopep8 fix

* Updates

* autopep8 fix

* Fixes available_backends bug

* autopep8 fix

* Fixed tests

Co-authored-by: Desroziers <sylvain.desroziers@ifpen.fr>
Co-authored-by: AutoPEP8 <>
* [WIP] Updates docs

* Adapted metrics with idist, fixed tests
- added local rank estimation with hostname heuristics

* autopep8 fix

* Adapted metrics code and tests to use idist

* autopep8 fix

* Updated docs, docstrings

* Updated xla tests and fix a bug with tensor dtype

* autopep8 fix

* Fixed all gather using all reduce op

* autopep8 fix

* Improved tests of create_supervised_trainer on TPU

Co-authored-by: AutoPEP8 <>
* [WIP] make accumulation tests on TPU(s)

* Fixed tests with accumulation metric
- by decreasing tolerence

* Fixed all_gather bug

* autopep8 fix

* Added metric tests for xla

* autopep8 fix

* Fixed bug in test of precision
- updated other regression tests

* Fixed failing tests on TPU
- increased err tolerence

Co-authored-by: AutoPEP8 <>
- average output in RunningAverage
* add TPU checkpointing to CPU.

* autopep8 fix

* update docstring to include TPU notice.

* add skip for non-TPU tests.

* autopep8 fix

* refactor to use idist API.

* autopep8 fix

* add complex save with TPU.

* autopep8 fix

* fix tests.

* fix typo in docstring.

Co-authored-by: vfdev <vfdev.5@gmail.com>

Co-authored-by: AutoPEP8 <>
Co-authored-by: Sylvain Desroziers <sylvain.desroziers@gmail.com>
Co-authored-by: vfdev <vfdev.5@gmail.com>
* Added barrier op in idist

* Fixed test and updated one_rank_only to use idist

* Moved one_rank_only to idist, adapted tests

* autopep8 fix

* Removed redundant imports

* Another test fix of setup_logger

Co-authored-by: AutoPEP8 <>
@vfdev-5 vfdev-5 marked this pull request as ready for review May 18, 2020 23:55
@vfdev-5 vfdev-5 marked this pull request as draft May 18, 2020 23:57
* - idist.device() returns "torch.device('cuda')" if non-dist conf and cuda device is available

* Improved setup_common_training_handlers
- no need to handle train_sampler if idist.model_name() is not serial, but train_sampler is not setup as distributed due to 1 proc.
- Warn only if train_sampler has set_epoch
@vfdev-5
Copy link
Collaborator Author

vfdev-5 commented May 25, 2020

Something makes things stuck on TPUs in parallel_api PR. Need to investigate where is the problem, before merging this PR.
=> Seems like problem is with using Checkpoint, cc @erip

vfdev-5 and others added 8 commits May 28, 2020 22:20
* Improve tests on XLA

* Fixes xla test when spawn without 'fork'

* Added test of dtype for XLA
* Added support for str input for all gather

* More tests for better coverage
* issue_1055

* autopep8 fix

* decorate and refactor getattr

* remove decoration - need further discussions

* Added missing decorator for plx

* Added note about dist-friendly interface

* Updated Checkpoint to dist config + TPU

* autopep8 fix

* [WIP] Checkpoint in dist config

* autopep8 fix

* [WIP] Checkpoint in dist config

* autopep8 fix

* [WIP] Checkpoint on XLA

* autopep8 fix

* Fix checkpoint tests on XLA

* Put back Loggers as dist-unfriendly
+ tests for contrib savers

* Updated tests for XLA
- removed neptune xla tests

* autopep8 fix

* minor fix for coverage

* [WIP] New XLA tests for trains logger

* Fixed distrib tests for trains

Co-authored-by: Desroziers <sylvain.desroziers@ifpen.fr>
Co-authored-by: AutoPEP8 <>
Co-authored-by: vfdev-5 <vfdev.5@gmail.com>
Copy link
Contributor

@sdesrozis sdesrozis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

vfdev-5 and others added 5 commits May 31, 2020 00:34
* remove useless barriers

* Fix failing tests

* Added missing barrier in test for XLA

Co-authored-by: Desroziers <sylvain.desroziers@ifpen.fr>
Co-authored-by: vfdev-5 <vfdev.5@gmail.com>
Copy link
Contributor

@sdesrozis sdesrozis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok!

Copy link
Contributor

@erip erip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will request changes for now, but I'm happy to change my review to an approve to expedite RC. If we have users who can test out these features, that would be very nice, too. 😄

ignite/contrib/engines/common.py Outdated Show resolved Hide resolved
ignite/contrib/engines/common.py Outdated Show resolved Hide resolved
ignite/contrib/engines/common.py Outdated Show resolved Hide resolved
return getattr(mlflow, attr)(*args, **kwargs)

return wrapper
return getattr(mlflow, attr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above - would be nice to replace this if we can.

return getattr(neptune, attr)(*args, **kwargs)

return wrapper
return getattr(neptune, attr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here.

return getattr(self.experiment, attr)(*args, **kwargs)

return wrapper
return getattr(self.experiment, attr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here.

@erip
Copy link
Contributor

erip commented May 31, 2020

My requested changes are fairly superficial and it looks like it's mostly for experiment loggers rather than in the "core" code so I'm happy to 'ok' this if needed; this is a very impressive PR and kudos to you and @sdesrozis for working so hard. 👏

@vfdev-5 vfdev-5 requested a review from erip May 31, 2020 21:49
Copy link
Contributor

@sdesrozis sdesrozis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻

@vfdev-5 vfdev-5 merged commit dff5996 into master May 31, 2020
@vfdev-5 vfdev-5 deleted the idist branch June 8, 2020 23:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants