[BugFix] fix incorrect name when fetch data in sparse optim #3808

Merged
merged 2 commits into from
Mar 6, 2022

Rhett-Ying (Collaborator) commented Mar 4, 2022

Description

Fix the incorrect name used when fetching data in the sparse optimizer.

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change,
    or have been fixed to be compatible with this change
  • Related issue is referred in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

Changes

Rhett-Ying requested a review from zheng-da March 4, 2022 10:38

dgl-bot (Collaborator) commented Mar 4, 2022

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

@@ -74,7 +74,7 @@ def step(self):
# will send grad to each corresponding trainer
if self._world_size > 1:
# get idx split from kvstore
idx_split = kvstore.get_partid(name, idics)
Collaborator

You should remove name = emb._tensor.name at line 49.

Collaborator

I'm a bit confused. Are emb._tensor.name and emb.data_name different?

Collaborator Author

Yes, they're different. A group_id is appended to the name the user specifies, so _tensor.name returns the original name without the group_id, while _tensor._name is the globally unique name with the group_id appended. It's confusing indeed.
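To make the naming distinction above concrete, here is a minimal sketch in plain Python. It is not DGL's implementation: NamedTensor, ToyKVStore, the underscore used to join the name and group id, and the modulo partitioning are all hypothetical, chosen only to show why a lookup with the user-facing name can miss an entry stored under the globally unique name.

```python
class NamedTensor:
    """Hypothetical stand-in for a distributed tensor wrapper (not DGL's class)."""

    def __init__(self, user_name, group_id):
        self._user_name = user_name
        # Assumption: the unique name is the user name joined with the group id.
        self._name = f"{user_name}_{group_id}"

    @property
    def name(self):
        # Public accessor returns the original name, without the group id.
        return self._user_name


class ToyKVStore:
    """Hypothetical key-value store keyed by globally unique names."""

    def __init__(self):
        self._registry = {}

    def register(self, tensor, num_parts):
        # Entries are stored under the globally unique name.
        self._registry[tensor._name] = num_parts

    def get_partid(self, name, indices):
        # Raises KeyError when `name` is not a registered unique name.
        num_parts = self._registry[name]
        return [i % num_parts for i in indices]


tensor = NamedTensor("emb", group_id=1)
store = ToyKVStore()
store.register(tensor, num_parts=2)

# Lookup with the globally unique name succeeds.
part_ids = store.get_partid(tensor._name, [0, 1, 2, 3])

# Lookup with the user-facing name fails, because the store never saw it.
try:
    store.get_partid(tensor.name, [0, 1, 2, 3])
    lookup_failed = False
except KeyError:
    lookup_failed = True
```

In this toy setup the two names differ only by the appended group id, which is enough to make the second lookup miss.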

Collaborator

Can you remove line 49?
Can you add a test for this modification?

Collaborator Author

Adding a test may not be possible, since this code path is only reached when world_size > 1 and distributed mode is enabled. Is it possible to add such a test in test_dist_optim.py?

Collaborator Author

Also, line 49 cannot be removed; it is used on lines 114, 118, 120, and 121.

zheng-da merged commit bb6cec2 into dmlc:master Mar 6, 2022
Rhett-Ying deleted the m5gnn_naming branch March 16, 2022 06:56
4 participants