
Enable mixed precision training for Transformer models #211

Merged 1 commit into OpenNMT:master on Oct 3, 2018

Conversation

guillaumekln (Contributor)

Closes #57.

guillaumekln merged commit 87f6f3c into OpenNMT:master on Oct 3, 2018
guillaumekln deleted the mixed-precision branch on Oct 3, 2018
wanghm92 added a commit to wanghm92/OpenNMT-tf that referenced this pull request on Jan 5, 2019
@mehmedes

Hi @guillaumekln,
Would you mind giving us some input on the speed gains you observed with your mixed precision implementation vs. FP32:
tensorflow/tensor2tensor#1221

@guillaumekln (Contributor, Author) commented Jan 18, 2019

Hi,

I gathered some fresh values on a P3 instance (1× V100) using the tensorflow/tensorflow:nightly-gpu-py3 Docker image. The same configuration is used in each run to highlight the raw gain:

  • Model type: TransformerBase (without shared weights)
  • Batch size: 8192
                    vocab size   step/s   source tokens/s   target tokens/s
FP32                32,001       2.64     18.1k             20.4k
FP16                32,001       3.56     24.5k             27.6k
FP16                32,000       4.03     27.8k             31.4k
FP16 (with #309)    32,000       4.68     32.8k             37.1k
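
As a minimal sketch of the general technique being benchmarked (variables kept in FP32, compute in FP16, dynamic loss scaling), here is how mixed precision can be enabled with TensorFlow's standard Keras API; this is not the OpenNMT-tf implementation from this PR, which pre-dates that API, and the toy model and layer sizes below are purely illustrative.

```python
# Minimal sketch of mixed precision with the standard Keras API (TF >= 2.4),
# not the OpenNMT-tf implementation from this PR; the toy model is illustrative.
import tensorflow as tf

# Variables are stored in float32; most ops run in float16.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(32000),                             # vocab-sized projection
    tf.keras.layers.Activation("softmax", dtype="float32"),   # keep the softmax in FP32
])

# Dynamic loss scaling prevents small FP16 gradients from underflowing to zero.
optimizer = tf.keras.mixed_precision.LossScaleOptimizer(tf.keras.optimizers.Adam())
model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy")
```

On Volta-class GPUs such as the V100 used above, the FP16 matrix multiplications are what map onto Tensor Cores and produce the speedups reported in the table.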

@mehmedes

Thank you for the feedback. Looks like we share the same fate :_(

@guillaumekln (Contributor, Author)

@mehmedes Please note that it's important to make the vocabulary size a multiple of 8. In my initial experiment it was actually 32,000 + 1 (the <unk> token). Changing it to 31,999 + 1 makes a difference; see the table above.
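
To illustrate the alignment point, here is a small hypothetical helper (not part of OpenNMT-tf) that rounds a dimension up to a multiple of 8 so the corresponding FP16 matrix multiplications map cleanly onto Tensor Cores; trimming the vocabulary, as done above, achieves the same alignment.

```python
# Hypothetical helper, not from OpenNMT-tf: round a dimension (vocabulary size,
# batch size, hidden size, ...) up to a multiple of 8 for Tensor Core efficiency.
def round_up_to_multiple(size: int, multiple: int = 8) -> int:
    return ((size + multiple - 1) // multiple) * multiple

assert round_up_to_multiple(32001) == 32008  # 32,000 + 1 would be padded up
assert round_up_to_multiple(32000) == 32000  # already aligned, as in the table above
```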

@guillaumekln (Contributor, Author)

Similarly, the batch size should ideally be a multiple of 8. With #309, additional gains are observed (see the updated table above).

@guillaumekln (Contributor, Author)

@mehmedes Here are additional data for a big Transformer model with a batch size of 4096 and the latest updates:

       step/s   source tokens/s   target tokens/s
FP32   1.92     6.6k              7.4k
FP16   4.27     15.3k             17.3k

So to summarize, here are the current gains (with equal batch size):

  • base Transformer: x1.77 (4.68 vs. 2.64 steps/s)
  • big Transformer: x2.22 (4.27 vs. 1.92 steps/s)

These are in line with the expected FP16 gains, but generally lower than what one can achieve in PyTorch, for example.

leod pushed a commit to leod/OpenNMT-tf that referenced this pull request on Jun 3, 2023