Replies: 7 comments 14 replies
-
Hey! I'm working with low-resource languages as well and I'm facing similar problems. Are you fine-tuning Whisper on more than one language? When fine-tuning the model, are you indicating the language in the tokenizer and processor? Thanks!
-
Hi Andrés,

I train both on a single language and on multiple languages. For multiple languages, I pre-process each dataset in advance with the tokenizer set to the right language, then concatenate (and shuffle) them before running the fine-tuning. You can also do this on the fly during fine-tuning by switching the tokenizer language dynamically.

During fine-tuning I do not set any language on the processor or tokenizer; I only specify the "transcribe" task, and so rely on the model's language identification (LID) capability. At inference time, however, LID does not seem to work well for low-resource languages, probably because little data for those languages was used when the LID task was pre-trained. This does not seem to affect the transcription result much, but the translation task is not good on these languages either.

So my idea is to fine-tune for both LID and ASR using a multi-task objective. With loss1 the ASR loss and loss2 a loss computed for LID, the global loss would be something like:

loss = a * loss1 + (1 - a) * loss2

We can get the predicted language token by performing a forward pass with a single token, startoftranscript (sot). You can check the code here: Line 19 in 248b6cb and here: https://huggingface.co/spaces/openai/whisper/discussions/6#63f648c152799101f3d0178f

Best!
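To make the idea concrete, here is a minimal sketch of the weighted objective and the single-token LID pass, in plain Python with made-up logits. The weight `a`, the token ids, and the toy decoder are all assumptions for illustration, not values from any actual training code:

```python
import math

def cross_entropy(logits, target):
    # Numerically stable softmax cross-entropy for a single prediction.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def multitask_loss(asr_loss, lid_loss, a=0.8):
    # Weighted combination from the thread: loss = a*loss1 + (1-a)*loss2.
    return a * asr_loss + (1 - a) * lid_loss

def predict_language(decoder_step, sot_id, lang_token_ids):
    # One forward pass with only the start-of-transcript token; the
    # next-token logits restricted to language tokens give the LID guess.
    logits = decoder_step([sot_id])
    return max(lang_token_ids, key=lambda t: logits[t])

# Toy stand-in for a single decoder step over a tiny made-up vocabulary.
def toy_decoder_step(tokens):
    return {50258: 0.0, 50270: 1.2, 50280: 3.4}  # hypothetical token ids

loss1 = cross_entropy([2.0, 0.5, -1.0], target=0)  # ASR token loss
loss2 = cross_entropy([3.4, 1.2], target=0)        # LID loss
total = multitask_loss(loss1, loss2, a=0.8)
lang = predict_language(toy_decoder_step, sot_id=50258,
                        lang_token_ids=[50270, 50280])
```

In a real setup both losses would come from the model's own logits, with `loss2` taken at the position of the language token.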
-
Hi @sproocht I'm fine-tuning on Galician, Spanish, Portuguese, French, German and English. Here are some results from my tests with a subset of the Common Voice validation dataset:
It seems reasonable that it makes mistakes, especially between Galician and Spanish, since some phrases are identical in both languages. How did you fine-tune to get these results? Did you combine the losses from the two tasks, or did you simply include the language label and train on the ASR task? Are you able to infer language and transcription with the same code? Best!
-
Can you provide the source code of what you have done? I'm new to this field and currently researching it. I find the work you both are doing quite fascinating, and I'm also interested in fine-tuning on multitask Language Identification (LID) and Automatic Speech Recognition (ASR). I've also noticed that LID gets worse, and I'm having some problems with a custom trainer. I would appreciate any support from both of you.
-
I'm also working on a project fine-tuning Whisper for a low-resource language.
-
Hi @Stanwang1210, it is difficult to know the exact reason, but I would say it is related to the amount of data. For my multilingual training I use 12 hours of audio for training and 7 hours for evaluation, for each of the languages.
-
Can you share the code for training on multiple objectives?
-
Hi all,
I have been fine-tuning Whisper on low-resource languages and the issue I see is that language identification gets worse on these languages after fine-tuning while transcription improves.
Is there a way to fine-tune for both language identification and transcription at the same time, using the same data, since the language token is generated as well?
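What I mean by "the language token is generated as well": in a Whisper-style label sequence the language token sits right after the start-of-transcript token, so the ordinary next-token cross-entropy already covers it. A rough illustration (the token strings and the language set here are assumptions, not real tokenizer output):

```python
# Whisper-style label sequence: the language token follows
# <|startoftranscript|>, so the standard token-level cross-entropy that
# trains transcription also supervises language identification.
labels = ["<|startoftranscript|>", "<|gl|>", "<|transcribe|>",
          "<|notimestamps|>", "Ola", "mundo", "<|endoftext|>"]

LANG_TOKENS = {"<|gl|>", "<|es|>", "<|en|>"}  # hypothetical subset

# Positions whose loss term acts as an LID objective vs. an ASR objective.
lid_positions = [i for i, t in enumerate(labels) if t in LANG_TOKENS]
asr_positions = [i for i, t in enumerate(labels)
                 if t not in LANG_TOKENS and not t.startswith("<|")]
```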
Thanks in advance!