[Bug] Fix multiple issues in distributed multi-GPU GraphSAGE example #3870
Conversation
@bioannidis @tonyjie Could you try this solution out? PyTorch 1.10 now recommends using `torch.distributed.algorithms.join.Join` to deal with uneven inputs across ranks.
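A minimal sketch of the `Join`-based approach (the model, batch shapes, and step counts here are illustrative, not the actual GraphSAGE example): each rank wraps its DDP model in `Join` so that ranks which run out of batches early keep answering the other ranks' collectives instead of deadlocking them.

```python
import os

import torch
import torch.distributed as dist
from torch.distributed.algorithms.join import Join
from torch.nn.parallel import DistributedDataParallel as DDP


def train(rank: int, world_size: int) -> int:
    """Toy DDP training loop with deliberately uneven per-rank workloads."""
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29507")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = DDP(torch.nn.Linear(4, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    # Uneven inputs: rank 0 runs 5 steps, every other rank runs 3.
    num_batches = 5 if rank == 0 else 3

    # Join shadows the collectives of exhausted ranks, so the rank with
    # more batches can finish its extra steps without hanging.
    with Join([model]):
        for _ in range(num_batches):
            loss = model(torch.randn(8, 4)).sum()
            loss.backward()
            opt.step()
            opt.zero_grad()

    dist.destroy_process_group()
    return num_batches
```

In a real multi-process run this would be launched with `torch.multiprocessing.spawn(train, args=(world_size,), nprocs=world_size)`; the point is that the `with Join([model]):` block must enclose the whole training loop.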
I suggest also mentioning this in the related tutorials/user guides.
Actually I'm fine with the GraphSAGE distributed training code; I'm trying to use `Join` in my own code. But in general, settings with fewer machines tend to be easier to run successfully, e.g. 2 partitions compared to 4 partitions. So I think it's still a synchronization problem.
That's bad, because that is exactly the situation `Join` is supposed to handle. Also, were you able to notice where your code is hanging (e.g. at which step or collective call)? If you could provide a reproducible example it would be even better, although I understand it may be hard to do so.
One setting that fails (stuck in deadlock) is as follows:
But when there are fewer partitions, the same code can run successfully.

Potential questions & bugs from my side:
I found that I'm not doing anything like `pad_data` in my code. Looking forward to some suggestions.

Thanks for your detailed question.
The code would just hang on a "random" step:
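One generic way to find out where each rank is stuck, using only the standard library (a sketch, not specific to DGL or this example; PyTorch's `TORCH_DISTRIBUTED_DEBUG=DETAIL` environment variable is another option on recent versions):

```python
import faulthandler
import sys

# Periodically dump every thread's stack to stderr. If a rank hangs, the
# last dump before the hang shows the exact line it is blocked on.
faulthandler.dump_traceback_later(60, repeat=True, file=sys.stderr)

# ... run the training loop here ...

# Cancel the watchdog once training finishes normally.
faulthandler.cancel_dump_traceback_later()
```

Running each rank with this enabled and comparing the stacks across ranks usually reveals which collective call one rank reached and another did not.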
I downloaded the latest nightly build. I will open another issue later. Thanks.
…3870)

* fix distributed multi-GPU example device
* try Join
* update version requirement in README
* use model.join
* fix docs

Co-authored-by: Jinjing Zhou &lt;VoVAllen@users.noreply.github.com&gt;
I tried the idea here and it doesn't work for me either. Here is the error I got:
Hi, I think my problem is somewhat different, and I have already (basically) solved it. I was trying to write my own distributed training code for a link prediction task, and I now use `Join`.
`pad_data` and try `torch.distributed.algorithms.join.Join` to deal with uneven training set sizes.
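For the padding route, a minimal sketch of the idea (the helper name here is hypothetical and is not DGL's actual `pad_data`): each rank repeats some of its local seed IDs so that every rank takes the same number of training steps.

```python
import torch


def pad_to_global_max(local_ids: torch.Tensor, global_max: int) -> torch.Tensor:
    """Hypothetical helper: grow/trim local seeds so every rank has
    exactly `global_max` training examples."""
    n = local_ids.numel()
    if n >= global_max:
        return local_ids[:global_max]
    # Sample the shortfall (with replacement) from the existing ids.
    extra = local_ids[torch.randint(0, n, (global_max - n,))]
    return torch.cat([local_ids, extra])
```

In a real distributed run, `global_max` would come from an `all_reduce` with `ReduceOp.MAX` over each rank's local training-set size, so all ranks agree on the padded length before the loop starts.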