Fix T5 incorrect weight decay in Trainer and official summarization example #18002
Conversation
The documentation is not available anymore as the PR was closed or merged.
Thanks for your PR, I've added a few comments.
```diff
@@ -526,7 +526,7 @@ def postprocess_text(preds, labels):
     # Optimizer
     # Split weights in two groups, one with weight decay and the other not.
-    no_decay = ["bias", "LayerNorm.weight"]
+    no_decay = ["bias", "layer_norm.weight"]
```
We should keep both here, since the user might use other models.
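For context, the example script builds its optimizer groups from exactly this kind of name-based split. Below is a minimal self-contained sketch of that pattern; the `Block` module is a toy stand-in invented for illustration, not code from the repository, and it keeps both name variants as the reviewer suggests:

```python
import torch
from torch import nn

class Block(nn.Module):
    """Toy stand-in for a Transformer block (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.dense = nn.Linear(4, 4)
        self.LayerNorm = nn.LayerNorm(4)

model = Block()

# Any parameter whose name contains one of these substrings gets no weight decay.
no_decay = ["bias", "LayerNorm.weight", "layer_norm.weight"]
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
        "weight_decay": 0.01,
    },
    {
        "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
]
optimizer = torch.optim.AdamW(optimizer_grouped_parameters)
```

With this toy module, only `dense.weight` lands in the decayed group; both biases and the layer-norm weight are excluded.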
src/transformers/trainer.py (outdated)

```diff
@@ -70,6 +70,7 @@
 from .modelcard import TrainingSummary
 from .modeling_utils import PreTrainedModel, load_sharded_checkpoint, unwrap_model
 from .models.auto.modeling_auto import MODEL_FOR_CAUSAL_LM_MAPPING_NAMES
+from .models.t5.modeling_t5 import T5LayerNorm
```
This import is not something we really want to add, as the Trainer shouldn't depend on individual model files. Instead, `T5LayerNorm` should subclass `torch.nn.LayerNorm` (it rewrites the init and the forward anyway).
Thanks for your comments! It seems that no one has written any of the layer norm variants as a subclass of `nn.LayerNorm` yet. I'm not sure if this is right:

```python
class T5LayerNorm(nn.LayerNorm):
    def __init__(self, hidden_size, eps):
        """
        Construct a layernorm module in the T5 style. No bias and no subtraction of mean.
        """
        super().__init__(hidden_size, eps, elementwise_affine=False)
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps
        self.reset_parameters()

    def forward(self, hidden_states):
        # T5 uses a layer_norm which only scales and doesn't shift, which is also known as Root Mean
        # Square Layer Normalization https://arxiv.org/abs/1910.07467, thus variance is calculated
        # w/o mean and there is no bias. Additionally we want to make sure that the accumulation for
        # half-precision inputs is done in fp32
        variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)

        # convert into half-precision if necessary
        if self.weight.dtype in [torch.float16, torch.bfloat16]:
            hidden_states = hidden_states.to(self.weight.dtype)

        return self.weight * hidden_states

    def reset_parameters(self):
        if self.weight is not None:
            nn.init.ones_(self.weight)
```
`LongT5LayerNorm` is copied from `T5LayerNorm` and would need the same change.
```diff
@@ -221,15 +221,17 @@ def _create_global_aggregates(
 # Copied from transformers.models.t5.modeling_t5.T5LayerNorm with T5->LongT5
-class LongT5LayerNorm(nn.Module):
+class LongT5LayerNorm(nn.LayerNorm):
```
This would create unused bias weights for every layernorm in T5 at the moment; I don't think we want this. @sgugger I'm actually not really in favor of adding this abstraction to the `T5LayerNorm` class.
Only if the init calls the superclass. If it calls the super of the superclass (`nn.Module`), there is nothing breaking. We just can't have the Trainer start depending on multiple modeling files because there are 100 flavors of LayerNorms.

An alternative would be to have a constant in `pt_utils` where each submodule that defines a specific LayerNorm adds its own; would that work better? So here, after defining `LongT5LayerNorm`, we would have:

```python
ALL_LAYERNORM_LAYERS.append(LongT5LayerNorm)
```
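With such a registry, the Trainer can collect decay-eligible parameters by module type instead of by name. A minimal sketch of that idea follows; it is simplified from the `get_parameter_names` helper in `transformers.trainer_pt_utils`, and the `RMSNorm` class here is a stand-in invented for illustration, not the actual `T5LayerNorm`:

```python
import torch
from torch import nn

# Registry that each modeling file appends its LayerNorm flavor to
# (mirrors ALL_LAYERNORM_LAYERS in transformers.pytorch_utils).
ALL_LAYERNORM_LAYERS = [nn.LayerNorm]

class RMSNorm(nn.Module):
    """Stand-in for a custom norm like T5LayerNorm (scale only, no bias)."""
    def __init__(self, hidden_size):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))

    def forward(self, x):
        variance = x.pow(2).mean(-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + 1e-6)

ALL_LAYERNORM_LAYERS.append(RMSNorm)

def get_parameter_names(model, forbidden_layer_types):
    """Return names of parameters that are NOT inside a forbidden layer type."""
    result = []
    for name, child in model.named_children():
        result += [
            f"{name}.{n}"
            for n in get_parameter_names(child, forbidden_layer_types)
            if not isinstance(child, tuple(forbidden_layer_types))
        ]
    # Parameters defined directly on this module are always included.
    result += list(model._parameters.keys())
    return result

model = nn.Sequential(nn.Linear(4, 4), RMSNorm(4))
decay_parameters = [
    n for n in get_parameter_names(model, ALL_LAYERNORM_LAYERS) if "bias" not in n
]
```

Because the check is `isinstance`-based, any registered norm class is excluded from weight decay regardless of how its attribute is named.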
@ADAning Could you amend your PR to use this solution maybe?
Does `pt_utils` mean `transformers.utils`?
> @ADAning Could you amend your PR to use this solution maybe?
No, `transformers.pytorch_utils`. Sorry, I misspelled it.
Actually, this is my first time using git... I don't know why other people's commits can be seen after I fetch and rebase from upstream; it really bothers me. So I reset my branch to the commit from before I created this PR, and then this PR was closed and my commits can't be seen there. Could you re-open this PR? @sgugger
I'm sorry for my mistake.
Just re-opened your PR! In general you only need to rebase if the PR has gone a long time (a month) or if there are some critical changes on the main branch that you need, so don't bother for small changes :-)
```diff
@@ -262,6 +262,8 @@ def forward(self, hidden_states):
         return self.weight * hidden_states
 
 
+ALL_LAYERNORM_LAYERS.append(T5LayerNorm)
```
This should be after the try/except block below, which may change `T5LayerNorm`.
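The try/except the reviewer refers to can rebind the class name (modeling_t5.py optionally swaps `T5LayerNorm` for a fused implementation when one is importable). A minimal self-contained illustration of why registering before that block goes wrong; the class names here are placeholders invented for the sketch:

```python
# Registry of norm classes, as in ALL_LAYERNORM_LAYERS.
ALL_LAYERNORM_LAYERS = []

class MyLayerNorm:
    """Original definition."""

ALL_LAYERNORM_LAYERS.append(MyLayerNorm)  # registered too early!

class FusedMyLayerNorm:
    """Stand-in for an optional fused implementation."""

try:
    # In the real file this is an import that may fail; here it always succeeds.
    MyLayerNorm = FusedMyLayerNorm
except ImportError:
    pass

# The registry still holds the original class, so isinstance checks against
# the registry would miss modules built from the rebound `MyLayerNorm` name.
stale = ALL_LAYERNORM_LAYERS[0] is not MyLayerNorm
```

Appending after the try/except guarantees the registry holds whichever class the name finally refers to.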
```diff
@@ -247,6 +248,8 @@ def forward(self, hidden_states):
         return self.weight * hidden_states
 
 
+ALL_LAYERNORM_LAYERS.append(LongT5LayerNorm)
```
This should be after the try/except block below, which can change `LongT5LayerNorm`.
Thanks a lot for iterating with us!
Thanks a lot for iterating on this and sorry to reply so late here - the final solution looks very nice!
Fix T5 incorrect weight decay in Trainer and official summarization example (huggingface#18002)

* Add ALL_LAYERNORM_LAYERS for LayerNorm
* Fix bug of appending layer norm
What does this PR do?

The official summarization examples use T5 as the pretrained model, but the name of the T5 layer norm parameter is `layer_norm`, not `LayerNorm`. In the official summarization example, `layer_norm` is therefore not caught by the `no_decay` list. A similar problem occurs in trainer.py, which may cause `Seq2SeqTrainer` to train the T5 layer norm weights with weight decay, because T5 uses `T5LayerNorm` rather than `nn.LayerNorm`.
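To make the mismatch concrete, here is a short self-contained demo; `TinyT5Block` is a hypothetical mini-module invented to mimic T5's attribute naming, not code from the library:

```python
from torch import nn

class TinyT5Block(nn.Module):
    """Hypothetical stand-in that mimics T5's `layer_norm` attribute name."""
    def __init__(self):
        super().__init__()
        self.layer_norm = nn.LayerNorm(8)  # T5 names this attribute `layer_norm`

names = [n for n, _ in TinyT5Block().named_parameters()]
# The old no_decay list checks for "LayerNorm.weight", which never matches
# T5-style "layer_norm.weight", so the norm weight would receive weight decay.
matches_old = [n for n in names if "LayerNorm.weight" in n]
matches_new = [n for n in names if "layer_norm.weight" in n]
```

The old substring finds nothing, while the corrected one matches the layer norm weight as intended.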
Before submitting, see the contributor guideline (Pull Request section), the documentation guidelines, and the tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.