[WIP] Merge idist into master #1045
Conversation
* add utils for distributed
* autopep8 fix
* [WIP] Added comp models and tests
* [WIP] Added create_from_backend and create_from_context for _DistModel
* [WIP] Added spawn to _DistModel
* [WIP] Refactored comp models and added spawn for xla
* autopep8 fix
* Improved tests
* autopep8 fix
* Fixes flake8
* autopep8 fix
* Removed is_distributed, renamed _DistModel -> _NativeDistModel
* autopep8 fix
* Added docs and tests for xla spawn
* Fixes conftest bug
* autopep8 fix
* Updates
* autopep8 fix
* Fixes available_backends bug
* autopep8 fix
* Fixed tests

Co-authored-by: Desroziers <sylvain.desroziers@ifpen.fr>
Co-authored-by: AutoPEP8 <>
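The commit list above introduces the computation models and a spawn helper. Below is a minimal, hedged sketch of how the resulting `idist.spawn` API is meant to be used; the argument names follow the `ignite.distributed` documentation and may differ slightly from this PR's intermediate state.

```python
# Sketch only: spawn 4 "gloo" processes that each run `training` with their
# own local rank. `nproc_per_node` and the argument order are assumptions.
import ignite.distributed as idist


def training(local_rank, config):
    # every spawned process lands here with its own local rank
    print(f"rank {idist.get_rank()}/{idist.get_world_size()} on {idist.device()}: {config}")


if __name__ == "__main__":
    idist.spawn("gloo", training, args=({"lr": 0.01},), nproc_per_node=4)
```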
* [WIP] Updates docs
* Adapted metrics with idist, fixed tests - added local rank estimation with hostname heuristics
* autopep8 fix
* Adapted metrics code and tests to use idist
* autopep8 fix
* Updated docs, docstrings
* Updated xla tests and fixed a bug with tensor dtype
* autopep8 fix
* Fixed all_gather using all_reduce op
* autopep8 fix
* Improved tests of create_supervised_trainer on TPU

Co-authored-by: AutoPEP8 <>
* [WIP] Make accumulation tests on TPU(s)
* Fixed tests with accumulation metric by decreasing tolerance
* Fixed all_gather bug
* autopep8 fix
* Added metric tests for xla
* autopep8 fix
* Fixed bug in test of precision - updated other regression tests
* Fixed failing tests on TPU - increased error tolerance

Co-authored-by: AutoPEP8 <>
* add TPU checkpointing to CPU
* autopep8 fix
* update docstring to include TPU notice
* add skip for non-TPU tests
* autopep8 fix
* refactor to use idist API
* autopep8 fix
* add complex save with TPU
* autopep8 fix
* fix tests
* fix typo in docstring

Co-authored-by: vfdev <vfdev.5@gmail.com>
Co-authored-by: AutoPEP8 <>
Co-authored-by: Sylvain Desroziers <sylvain.desroziers@gmail.com>
Co-authored-by: vfdev <vfdev.5@gmail.com>
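The "add TPU checkpointing to CPU" commits above are about making checkpoints written from a TPU loadable elsewhere. The snippet below only illustrates that general idea (moving the state dict to CPU before serializing); it is not the PR's actual implementation, and `save_cpu_checkpoint` is a hypothetical helper name.

```python
import torch


def save_cpu_checkpoint(model: torch.nn.Module, path: str) -> None:
    # Move every tensor to CPU so the file can be re-loaded on a machine
    # without an XLA/TPU device.
    cpu_state = {name: param.cpu() for name, param in model.state_dict().items()}
    torch.save(cpu_state, path)
```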
* Added barrier op in idist
* Fixed test and updated one_rank_only to use idist
* Moved one_rank_only to idist, adapted tests
* autopep8 fix
* Removed redundant imports
* Another test fix of setup_logger

Co-authored-by: AutoPEP8 <>
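For the barrier and one_rank_only helpers mentioned above, a minimal usage sketch (assuming the decorator keeps its `rank`/`with_barrier` keywords; names may have changed during review):

```python
import ignite.distributed as idist


@idist.one_rank_only(rank=0, with_barrier=True)
def save_report(text: str) -> None:
    # runs on rank 0 only; with_barrier=True makes the other ranks wait
    print(text)


def sync_point() -> None:
    idist.barrier()  # explicit synchronization across all processes
```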
* idist.device() returns torch.device('cuda') in a non-distributed configuration when a CUDA device is available
* Improved setup_common_training_handlers - no need to handle train_sampler when idist.model_name() is not "serial" but train_sampler was not set up as distributed because there is only 1 process; warn only if train_sampler has set_epoch
Something makes things get stuck on TPUs in
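As a quick illustration of the device behaviour described in the commit list above (a sketch, not the PR's test code): in a non-distributed configuration with CUDA available, `idist.device()` is expected to resolve to `torch.device('cuda')`.

```python
import torch
import ignite.distributed as idist

device = idist.device()  # cuda if available and non-distributed, otherwise cpu / xla
model = torch.nn.Linear(16, 4).to(device)
```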
* Improve tests on XLA
* Fixes xla test when spawn without 'fork'
* Added test of dtype for XLA
* Added support for str input for all_gather
* More tests for better coverage
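A hedged sketch of the string support for all_gather mentioned above, assuming the helper is exposed as `ignite.distributed.all_gather` and returns a per-rank list for strings:

```python
import ignite.distributed as idist

# Tensors and numbers come back gathered into a tensor; strings are expected
# to come back as a list with one entry per process.
messages = idist.all_gather(f"payload from rank {idist.get_rank()}")
print(messages)
```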
* issue_1055
* autopep8 fix
* decorate and refactor getattr
* remove decoration - need further discussions
* Added missing decorator for plx
* Added note about dist-friendly interface
* Updated Checkpoint to dist config + TPU
* autopep8 fix
* [WIP] Checkpoint in dist config
* autopep8 fix
* [WIP] Checkpoint in dist config
* autopep8 fix
* [WIP] Checkpoint on XLA
* autopep8 fix
* Fix checkpoint tests on XLA
* Put back Loggers as dist-unfriendly + tests for contrib savers
* Updated tests for XLA - removed neptune xla tests
* autopep8 fix
* minor fix for coverage
* [WIP] New XLA tests for trains logger
* Fixed distrib tests for trains

Co-authored-by: Desroziers <sylvain.desroziers@ifpen.fr>
Co-authored-by: AutoPEP8 <>
Co-authored-by: vfdev-5 <vfdev.5@gmail.com>
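The Checkpoint-related commits above make checkpointing usable in distributed and TPU configurations. The snippet below is a plain usage sketch, assuming Checkpoint/DiskSaver decide internally where and by which rank the file is written; it does not show the internal changes themselves.

```python
from ignite.engine import Engine, Events
from ignite.handlers import Checkpoint, DiskSaver


def train_step(engine, batch):
    return 0.0  # placeholder training step


trainer = Engine(train_step)
handler = Checkpoint(
    {"trainer": trainer},
    DiskSaver("/tmp/checkpoints", create_dir=True),
    n_saved=2,
)
trainer.add_event_handler(Events.EPOCH_COMPLETED, handler)
```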
Looks good.
* remove useless barriers
* Fix failing tests
* Added missing barrier in test for XLA

Co-authored-by: Desroziers <sylvain.desroziers@ifpen.fr>
Co-authored-by: vfdev-5 <vfdev.5@gmail.com>
Ok!
I will request changes for now, but I'm happy to change my review to an approve to expedite RC. If we have users who can test out these features, that would be very nice, too. 😄
```diff
-            return getattr(mlflow, attr)(*args, **kwargs)
-
-        return wrapper
+        return getattr(mlflow, attr)
```
Same comment as above - would be nice to replace this if we can.
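For context, the pattern under discussion is attribute delegation to the wrapped tracking module. Roughly (an illustrative sketch, not the logger's actual code; `MLflowProxy` is a made-up name):

```python
import mlflow


class MLflowProxy:
    """Forwards any unknown attribute access to the mlflow module, so
    proxy.log_params(...) ends up calling mlflow.log_params(...)."""

    def __getattr__(self, attr):
        return getattr(mlflow, attr)
```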
```diff
-            return getattr(neptune, attr)(*args, **kwargs)
-
-        return wrapper
+        return getattr(neptune, attr)
```
And here.
```diff
-            return getattr(self.experiment, attr)(*args, **kwargs)
-
-        return wrapper
+        return getattr(self.experiment, attr)
```
And here.
My requested changes are fairly superficial, and they mostly concern the experiment loggers rather than the "core" code, so I'm happy to 'ok' this if needed; this is a very impressive PR and kudos to you and @sdesrozis for working so hard. 👏
👍🏻
Todo:

* idist.device() output uniform torch.device (Improved device() method #1062)
* tests with parallel_api on cifar10

Description:
Introduces distributed module to handle GPU, CPU, XLA distributed config.
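A minimal sketch of what the new module exposes, assuming the idist helper names used throughout this PR (backend, device, rank/world size); the merged API may differ in detail:

```python
import ignite.distributed as idist

print("backend:", idist.backend())    # None when not running distributed
print("device:", idist.device())      # cpu / cuda / xla depending on the configuration
print("rank:", idist.get_rank(), "world size:", idist.get_world_size())
```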
Check list: