Replies: 7 comments 14 replies
-
Hey! I'm working with low-resource languages as well and I'm facing similar problems. Are you fine-tuning Whisper on more than one language? When fine-tuning the model, are you indicating the language in the tokenizer and processor? Thanks!
-
Hi Andrés,

I train both on a single language and on multiple languages. For multiple languages, I pre-process each dataset in advance with the tokenizer set to the right language, then concatenate (and shuffle) them before running the fine-tuning. You can also do this on the fly during fine-tuning by switching the tokenizer language dynamically.

During fine-tuning I do not set any language on the processor or tokenizer; I only specify the "transcribe" task, and so rely on the model's language identification (LID) capability. At inference time, however, LID does not seem to work well for low-resource languages, probably because little data for those languages was used when the LID task was pre-trained. This does not seem to affect the transcription result much, but the translation task is not good on these languages either.

So my idea is to fine-tune for both LID and ASR using a multi-task objective. With loss1 the ASR loss and loss2 a loss computed for LID, the global loss would be something like:

loss = a * loss1 + (1 - a) * loss2

We can get the predicted language token by performing a forward pass with a single token, startoftranscript (sot). You can check the code here: Line 19 in 248b6cb and here: https://huggingface.co/spaces/openai/whisper/discussions/6#63f648c152799101f3d0178f

Best!
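To make the idea concrete, here is a minimal sketch of the weighted objective and the single-token LID pass, in plain Python with made-up logits. The weight `a`, the token ids, and the toy decoder are all assumptions for illustration, not values from any actual training code:

```python
import math

def cross_entropy(logits, target):
    # Numerically stable softmax cross-entropy for a single prediction.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def multitask_loss(asr_loss, lid_loss, a=0.8):
    # Weighted combination from the thread: loss = a*loss1 + (1-a)*loss2.
    return a * asr_loss + (1 - a) * lid_loss

def predict_language(decoder_step, sot_id, lang_token_ids):
    # One forward pass with only the start-of-transcript token; the
    # next-token logits restricted to language tokens give the LID guess.
    logits = decoder_step([sot_id])
    return max(lang_token_ids, key=lambda t: logits[t])

# Toy stand-in for a single decoder step over a tiny made-up vocabulary.
def toy_decoder_step(tokens):
    return {50258: 0.0, 50270: 1.2, 50280: 3.4}  # hypothetical token ids

loss1 = cross_entropy([2.0, 0.5, -1.0], target=0)  # ASR token loss
loss2 = cross_entropy([3.4, 1.2], target=0)        # LID loss
total = multitask_loss(loss1, loss2, a=0.8)
lang = predict_language(toy_decoder_step, sot_id=50258,
                        lang_token_ids=[50270, 50280])
```

In a real setup both losses would come from the model's own logits, with `loss2` taken at the position of the language token.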
-
Hi @sproocht I'm fine-tuning on Galician, Spanish, Portuguese, French, German and English. Here are some results from my tests with a subset of the Common Voice validation dataset:
It seems reasonable that it makes mistakes, especially between Galician and Spanish, since some phrases are identical in both languages. How did you fine-tune to get these results? Did you combine the losses from the two tasks, or did you simply include the language label and train on the ASR task? Are you able to infer language and transcription with the same code? Best!
-
Can you provide the source code of what you have done? I'm new to this field and currently researching it. I find the work you both are doing quite fascinating, and I'm also interested in fine-tuning on multitask Language Identification (LID) and Automatic Speech Recognition (ASR). I've also noticed that LID gets worse, and I'm having some problems with a custom trainer. I would appreciate any support from both of you.
-
I'm also working on a project fine-tuning Whisper for a low-resource language.
-
Hi @Stanwang1210, it is difficult to know the exact reason, but I would say it is related to the amount of data. For my multilingual training I use 12 hours of audio for training and 7 hours for evaluation, for each of the languages.
-
Can you share the code for training on multiple objectives?
-
Hi all,
I have been fine-tuning Whisper on low-resource languages and the issue I see is that language identification gets worse on these languages after fine-tuning while transcription improves.
Is there a way to fine-tune for both language identification and transcription at the same time, using the same data, since the language token is generated as well?
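What I mean by "the language token is generated as well": in a Whisper-style label sequence the language token sits right after the start-of-transcript token, so the ordinary next-token cross-entropy already covers it. A rough illustration (the token strings and the language set here are assumptions, not real tokenizer output):

```python
# Whisper-style label sequence: the language token follows
# <|startoftranscript|>, so the standard token-level cross-entropy that
# trains transcription also supervises language identification.
labels = ["<|startoftranscript|>", "<|gl|>", "<|transcribe|>",
          "<|notimestamps|>", "Ola", "mundo", "<|endoftext|>"]

LANG_TOKENS = {"<|gl|>", "<|es|>", "<|en|>"}  # hypothetical subset

# Positions whose loss term acts as an LID objective vs. an ASR objective.
lid_positions = [i for i, t in enumerate(labels) if t in LANG_TOKENS]
asr_positions = [i for i, t in enumerate(labels)
                 if t not in LANG_TOKENS and not t.startswith("<|")]
```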
Thanks in advance!