Have seq2seq just use gather #27025

Merged: 9 commits from muellerzr-seq2seq into main on Nov 14, 2023
Conversation

@muellerzr (Contributor) commented Oct 23, 2023

What does this PR do?

When using Seq2Seq, we don't want gather_for_metrics to apply its magic; we just want plain .gather(). Otherwise samples get dropped: accelerate removes what it thinks are "duplicates" based on the batch size, which leads to a bug when generation returns more samples than the dataset contains.

This PR adds a gather_function attribute to the Trainer. It defaults to gather_for_metrics, but a particular Trainer subclass (such as Seq2SeqTrainer) can override it when needed.
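For illustration, here is a minimal sketch of that mechanism. It is my simplification, not the actual transformers source; TinyTrainer and TinySeq2SeqTrainer are made-up names.

import torch
from accelerate import Accelerator

class TinyTrainer:
    def __init__(self):
        self.accelerator = Accelerator()
        # Default hook: gather_for_metrics, which trims samples it believes
        # were duplicated to pad out the last distributed batch.
        self.gather_function = self.accelerator.gather_for_metrics

class TinySeq2SeqTrainer(TinyTrainer):
    def __init__(self):
        super().__init__()
        # Generation can legitimately return dataset_len * num_return_sequences
        # rows, so the dedup heuristic would drop real outputs; use plain gather.
        self.gather_function = self.accelerator.gather

trainer = TinySeq2SeqTrainer()
print(trainer.gather_function(torch.ones(4, 2)).shape)  # torch.Size([4, 2]) on one process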

Fixes #25231

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker @younesbelkada

@HuggingFaceDocBuilderDev commented Oct 23, 2023

The documentation is not available anymore as the PR was closed or merged.

@muellerzr (Contributor Author) commented Oct 23, 2023

@ArthurZucker any thoughts on what could fix this? I saw negligible time differences between main and my branch when running locally on CPU (~72s).

Looks like it's all passing now!

@younesbelkada (Contributor) left a comment

Thanks! Looks clean on my end.

@ArthurZucker (Collaborator) left a comment

Thanks

@amyeroberts (Collaborator) left a comment

Thanks for fixing this!

A few comments/questions for my own understanding of the PR before I can approve:

  • Could you clarify the issue in the PR description, i.e. what does gather_for_metrics do differently from gather (what is the "magic")?
  • Am I right in understanding this should only be applied to cases when evaluating generations from seq2seq models and the generation config specifies num_return_sequences > 1?
  • What happens and what should happen if I call evaluate with a generation config with num_return_sequences > 1 and then call a second time with num_return_sequences==1?

Comment on lines 178 to 183
for num_return_sequences in range(1, 4):
    gen_config.num_return_sequences = num_return_sequences
    metrics = trainer.evaluate(eval_dataset=prepared_dataset, generation_config=gen_config)
    assert (
        metrics["eval_samples"] == dataset_len * num_return_sequences
    ), f"Got {metrics['eval_samples']}, expected: {dataset_len * num_return_sequences}"
A Collaborator left a comment:

This works because the state of the trainer is set such that self.gather_function = self.accelerator.gather_for_metrics initially, and then switches to self.accelerator.gather when num_return_sequences > 1. However, I don't think this would work if you did for num_return_sequences in range(3, 0, -1), as the trainer would never have self.gather_function = self.accelerator.gather_for_metrics when num_return_sequences == 1.

@muellerzr (Contributor Author) replied:

Modified the test to use range(3, 0, -1). It still passed beforehand, but I simplified the logic to just use gather().
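For reference, a sketch of the descending variant (reusing gen_config, trainer, prepared_dataset, and dataset_len from the snippet above, so not standalone):

for num_return_sequences in range(3, 0, -1):
    gen_config.num_return_sequences = num_return_sequences
    metrics = trainer.evaluate(eval_dataset=prepared_dataset, generation_config=gen_config)
    # With plain gather there is no state-dependent dedup, so the count
    # scales with num_return_sequences in either iteration order.
    assert metrics["eval_samples"] == dataset_len * num_return_sequences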

@muellerzr (Contributor Author) commented:

@amyeroberts:

Am I right in understanding this should only be applied to cases when evaluating generations from seq2seq models and the generation config specifies num_return_sequences > 1?

Correct, otherwise we will drop samples. Technically, I think we can avoid this entirely by just using gather, and the test seems to show that this will indeed work fine. As a result, I'll simplify this to just use .gather().

What happens and what should happen if I call evaluate with a generation config with num_return_sequences > 1 and then call a second time with num_return_sequences==1?

Per your recommendation for the test, I tried this, and it worked as it should (because gather_for_metrics doesn't really do anything extra for a batch size of 1).
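As a hedged aside (t5-small is just an example checkpoint here, not part of this PR's tests), this is why the sample count changes: generate returns batch_size * num_return_sequences sequences.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tok(["translate English to German: hello"], return_tensors="pt")
# Beam search returning 3 sequences for a single input: 1 * 3 output rows.
out = model.generate(**inputs, num_beams=3, num_return_sequences=3, max_new_tokens=10)
print(out.shape[0])  # 3, even though the "dataset" here has only 1 example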

@amyeroberts (Collaborator) left a comment

@muellerzr Sorry, I still don't fully understand and need some clarification.

Correct, otherwise we will drop samples.

Why does using gather_for_metrics drop samples?

Technically, I think we can avoid this entirely by just using gather, and the test seems to show that this will indeed work fine.

Is this only true for Seq2SeqTrainer? If not, why not just use gather everywhere?

@muellerzr (Contributor Author) commented Nov 3, 2023

@amyeroberts:

Why does using gather_for_metrics drop samples?

It's some logic in Accelerate: gather_for_metrics will drop samples if it thinks they've been duplicated (for example, samples repeated to fill out the last batch so DDP stays efficient). However, this is an edge case where just using gather is better.
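A toy sketch of that behavior (a deliberate oversimplification of accelerate's logic, not its actual implementation):

import torch

def toy_gather_for_metrics(gathered, dataset_len):
    # Simplified heuristic: trim rows that were appended to pad the last batch.
    return gathered[:dataset_len]

# 10 real samples padded to 12 so two processes see equally sized batches:
padded = torch.arange(12)
print(toy_gather_for_metrics(padded, 10).shape)     # torch.Size([10]) -- correct

# With num_return_sequences = 3, generation legitimately yields 30 rows,
# and trimming back to dataset_len would silently drop 20 real outputs:
generated = torch.arange(30)
print(toy_gather_for_metrics(generated, 10).shape)  # torch.Size([10]) -- wrong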

Is this only true for Seq2SeqTrainer? If not, why not just use gather everywhere?

Yes, just Seq2Seq. Otherwise, gather_for_metrics should always be used.

muellerzr and others added 2 commits November 3, 2023 11:56
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
@amyeroberts (Collaborator) left a comment

Thanks for iterating!

As a general comment, this seems like something that should really be resolved on the accelerate side; however, the fix seems tidy enough here.

@muellerzr muellerzr merged commit 067c4a3 into main Nov 14, 2023
21 checks passed
@muellerzr muellerzr deleted the muellerzr-seq2seq branch November 14, 2023 19:54
Saibo-creator pushed a commit to epfl-dlab/transformers-GCD-PR that referenced this pull request Nov 15, 2023
* Have seq2seq just use gather

* Change

* Reset after

* Make slow

* Apply suggestions from code review

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Clean

* Simplify and just use gather

* Update tests/trainer/test_trainer_seq2seq.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* gather always for seq2seq

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
wgifford pushed a commit to namctin/transformers that referenced this pull request Nov 17, 2023 (same commits as above).

EduardoPach pushed a commit to EduardoPach/transformers that referenced this pull request Nov 19, 2023 (same commits as above).
@psychocosine
Hello @muellerzr, I think .gather will cause the last batch to contain duplicated samples, and these duplicates will eventually be included in the compute_metrics computation, which is unexpected.
[screenshot omitted]
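A hedged sketch of this concern with toy numbers (not taken from the screenshot):

import torch

# 5 real labels; one duplicate appended so the last distributed batch is full.
real_labels = torch.tensor([1, 0, 1, 1, 0])
gathered = torch.cat([real_labels, real_labels[-1:]])  # plain gather keeps all 6 rows

# A metric computed over the gathered rows double-counts the duplicate:
print(gathered.float().mean().item())     # 0.5, skewed by the extra 0
print(real_labels.float().mean().item())  # 0.6, the true value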

Successfully merging this pull request may close these issues.

Seq2SeqTrainer.evaluate and predict don't yield the right number of predictions when num_return_sequences > 1