1 min voice data can also be used to train a good TTS model! (few shot voice cloning)

wxsong20/GPT-SoVITS

 
 

GPT-SoVITS - Voice Conversion and Text-to-Speech WebUI

Demo Video and Features

Check out our demo video in Chinese: Bilibili Demo

Features:

  1. Zero-shot TTS: Input a 5-second vocal sample and experience instant text-to-speech conversion.

  2. Few-shot TTS: Fine-tune the model with just 1 minute of training data for improved voice similarity and realism.

  3. Cross-lingual Support: Run inference in a language different from that of the training dataset. English, Japanese, and Chinese are currently supported.

  4. WebUI Tools: Integrated tools include vocal/accompaniment separation, automatic training set segmentation, Chinese ASR, and text labeling, which help beginners create training datasets and GPT/SoVITS models.

Todo List

  1. High Priority:

    • Localization in Japanese and English.
    • User guide.
  2. Features:

    • Zero-shot voice conversion (5s) / few-shot voice conversion (1min).
    • TTS speaking speed control.
    • Enhanced TTS emotion control.
    • Experiment with changing SoVITS token inputs to a probability distribution over the vocabulary.
    • Improve English and Japanese text frontend.
    • Develop tiny and larger-sized TTS models.
    • Colab scripts.
    • Expand training dataset (2k -> 10k).

Requirements (How to Install)

Python and PyTorch Version

Tested with Python 3.9, PyTorch 2.0.1, and CUDA 11.

Pip Packages

pip install torch numpy scipy tensorboard librosa==0.9.2 numba==0.56.4 pytorch-lightning gradio==3.14.0 ffmpeg-python onnxruntime tqdm==4.59.0 cn2an pypinyin pyopenjtalk g2p_en

Additional Requirements

If you need Chinese ASR (supported by FunASR), install:

pip install modelscope torchaudio sentencepiece funasr
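
After installing, it can help to confirm that the packages are importable. The following is a minimal sketch, not part of the repository: the helper name `missing_modules` is hypothetical, and the module lists are the import names assumed to correspond to the pip packages above (for example, `pytorch-lightning` is imported as `pytorch_lightning` and `ffmpeg-python` as `ffmpeg`). It only checks that each module can be found and does not verify versions.

```python
import importlib.util

# Import names assumed to match the pip packages listed above.
CORE_MODULES = [
    "torch", "numpy", "scipy", "tensorboard", "librosa", "numba",
    "pytorch_lightning", "gradio", "ffmpeg", "onnxruntime", "tqdm",
    "cn2an", "pypinyin", "pyopenjtalk", "g2p_en",
]
# Only needed for Chinese ASR.
ASR_MODULES = ["modelscope", "torchaudio", "sentencepiece", "funasr"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    for label, names in (("core", CORE_MODULES), ("Chinese ASR", ASR_MODULES)):
        missing = missing_modules(names)
        status = "all found" if not missing else "missing " + ", ".join(missing)
        print(f"{label}: {status}")
```

Any name reported as missing can be installed with the corresponding `pip install` command above.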

FFmpeg

Ubuntu/Debian Users

sudo apt install ffmpeg

macOS Users

brew install ffmpeg

Windows Users

Download ffmpeg.exe and ffprobe.exe and place them in the GPT-SoVITS root directory.
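
To confirm that FFmpeg can be found, a small check along these lines may be useful. This is a sketch under assumptions: `find_tool` is a hypothetical helper, and it simply looks on `PATH` first and then in a given folder (the repository root by default), which mirrors the instructions above but is not necessarily how the project itself locates the binaries.

```python
import shutil
from pathlib import Path


def find_tool(name, repo_root="."):
    """Return the path to an executable, or None if it cannot be found.

    Looks on PATH first, then for `name` or `name.exe` inside repo_root.
    """
    found = shutil.which(name)
    if found:
        return found
    root = Path(repo_root)
    for candidate in (root / name, root / f"{name}.exe"):
        if candidate.is_file():
            return str(candidate)
    return None


if __name__ == "__main__":
    for tool in ("ffmpeg", "ffprobe"):
        print(f"{tool}: {find_tool(tool) or 'not found'}")
```

Run it from the GPT-SoVITS root so that the default `repo_root` points at the right folder.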

Pretrained Models

Download pretrained models from GPT-SoVITS Models and place them in GPT_SoVITS/pretrained_models.

For Chinese ASR, download models from Damo ASR Models and place them in tools/damo_asr/models.

For UVR5 (Vocals/Accompaniment Separation & Reverberation Removal), download models from UVR5 Weights and place them in tools/uvr5/uvr5_weights.
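
The three download locations above can be checked with a short script. This is only a sketch: `empty_or_missing` is a hypothetical helper, the paths are written with forward slashes relative to the repository root, and the check only tests that each folder exists and is non-empty. It does not know which model files are expected inside.

```python
from pathlib import Path

# Folders named in the instructions above, relative to the repository root.
MODEL_DIRS = {
    "GPT-SoVITS pretrained models": "GPT_SoVITS/pretrained_models",
    "Chinese ASR models": "tools/damo_asr/models",
    "UVR5 weights": "tools/uvr5/uvr5_weights",
}


def empty_or_missing(repo_root="."):
    """Return the labels of model folders that are absent or contain nothing."""
    root = Path(repo_root)
    problems = []
    for label, rel_path in MODEL_DIRS.items():
        folder = root / rel_path
        if not folder.is_dir() or not any(folder.iterdir()):
            problems.append(label)
    return problems


if __name__ == "__main__":
    problems = empty_or_missing(".")
    print("model folders look populated" if not problems
          else "empty or missing: " + ", ".join(problems))
```

The ASR and UVR5 folders are only needed if those tools are used, so a report for them may be harmless.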

Dataset Format

The TTS annotation .list file format:

vocal_path|speaker_name|language|text

Example:

D:\GPT-SoVITS\xxx/xxx.wav|xxx|en|I like playing Genshin.

Language dictionary:

  • 'zh': Chinese
  • 'ja': Japanese
  • 'en': English
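
The format above can be validated before training with a few lines of Python. This is a minimal sketch, not the project's own loader: `Entry` and `parse_list_line` are hypothetical names, and splitting on only the first three `|` characters (so that the text itself may contain `|`) is an assumption.

```python
from typing import NamedTuple

# Language codes from the dictionary above.
LANGUAGES = {"zh": "Chinese", "ja": "Japanese", "en": "English"}


class Entry(NamedTuple):
    vocal_path: str
    speaker_name: str
    language: str
    text: str


def parse_list_line(line):
    """Parse one 'vocal_path|speaker_name|language|text' annotation line."""
    # Split on the first three separators only, keeping any later '|' in the text.
    parts = line.rstrip("\r\n").split("|", 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields separated by '|', got {len(parts)}")
    entry = Entry(*(part.strip() for part in parts))
    if entry.language not in LANGUAGES:
        raise ValueError(f"unknown language code: {entry.language!r}")
    return entry


if __name__ == "__main__":
    # Raw string so the backslashes in the Windows path are kept literally.
    example = r"D:\GPT-SoVITS\xxx/xxx.wav|xxx|en|I like playing Genshin."
    print(parse_list_line(example))
```

Applying `parse_list_line` to every line of a .list file surfaces missing fields or unsupported language codes early.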

Credits

Special thanks to the following projects and contributors:
