<div align="center">
<h1> Variational Inference with adversarial learning for end-to-end Singing Voice Conversion based on VITS </h1>
[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/maxmax20160403/sovits5.0)
<img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/PlayVoice/so-vits-svc-5.0">
<img alt="GitHub forks" src="https://img.shields.io/github/forks/PlayVoice/so-vits-svc-5.0">
<img alt="GitHub issues" src="https://img.shields.io/github/issues/PlayVoice/so-vits-svc-5.0">
<img alt="GitHub" src="https://img.shields.io/github/license/PlayVoice/so-vits-svc-5.0">
</div>
### This project uses a clean, readable code structure, intended for research into deep learning techniques
### Because the goal is learning, the project does not chase the limits of quality; it is designed with student laptops in mind, uses low-spec parameters, and the final pretrained model is 202 MB (generator plus discriminator, in float32), far smaller than comparable projects
### If you are looking for a project that works out of the box, this one is not for you
- The target audience is deep learning beginners; basic Python and PyTorch skills are a prerequisite for using this project;
- The project aims to help beginners escape dry, purely theoretical study and master the fundamentals of deep learning through hands-on practice;
- Real-time voice conversion is not supported (supporting it would require replacing whisper);
- One-click packages for other purposes will not be developed.
### Code walkthrough course
- 1 - Overall framework: https://www.bilibili.com/video/BV1Tj411e7pQ
- 2 - Data preparation and preprocessing: https://www.bilibili.com/video/BV1uj411v7zW
- 3 - Prior and posterior encoders: https://www.bilibili.com/video/BV1Be411Q7r5
- 4 - Decoder: https://www.bilibili.com/video/BV19u4y1b73U
- 5 - Snake activation function: https://www.bilibili.com/video/BV1HN4y1D7AR
- 6 - Flow: https://www.bilibili.com/video/BV1ju411F7Fs
- 7 - Training and loss functions: https://www.bilibili.com/video/BV1qw411W73B
- 8 - Training, inference, and pitch correction: https://www.bilibili.com/video/BV1eb4y1u7ER
![vits-5.0-frame](https://github.com/PlayVoice/so-vits-svc-5.0/assets/16432329/3854b281-8f97-4016-875b-6eb663c92466)
- [No leakage] Multi-speaker support
- [Voice blending] Create your own unique speaker
- [With accompaniment] Conversion works even with light accompaniment
- [With Excel] Hand-tune the raw F0, fully manual
https://github.com/PlayVoice/so-vits-svc-5.0/assets/16432329/63858332-cc0d-40e1-a216-6fe8bf638f7c
Powered by [@ShadowVap](https://space.bilibili.com/491283091)
## Model features
| Feature | From | Status | Function |
| :--- | :--- | :--- | :--- |
| whisper | OpenAI | ✅ | strong noise robustness |
| bigvgan | NVIDIA | ✅ | anti-aliasing and snake activation; clearer formants, noticeably better audio quality |
| natural speech | Microsoft | ✅ | fewer pronunciation errors |
| neural source-filter | NII | ✅ | fixes voice breaks |
| speaker encoder | Google | ✅ | timbre encoding and clustering |
| GRL for speaker | Ubisoft | ✅ | adversarial timbre removal |
| SNAC | Samsung | ✅ | one-shot cloning for VITS |
| SCLN | Microsoft | ✅ | improved cloning |
| PPG perturbation | this project | ✅ | better noise robustness and timbre removal |
| HuBERT perturbation | this project | ✅ | better noise robustness and timbre removal |
| VAE perturbation | this project | ✅ | better audio quality |
| Mix encoder | this project | ✅ | more stable conversion |
| USP inference | this project | ✅ | more stable conversion |
**USP: even unvoiced and silent frames carry pitch at inference time, and this pitch smoothly bridges the voiced segments**
![vits_svc_usp](https://github.com/PlayVoice/so-vits-svc-5.0/assets/16432329/ba733b48-8a89-4612-83e0-a0745587d150)
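The idea can be sketched in a few lines of NumPy: fill every unvoiced or silent frame (f0 == 0) by interpolating between the neighboring voiced frames, so the pitch curve bridges the voiced segments smoothly. This is an illustration of the concept, not the project's actual implementation:

```python
import numpy as np

def smooth_unvoiced_pitch(f0: np.ndarray) -> np.ndarray:
    """Fill unvoiced/silent frames (f0 == 0) by linear interpolation
    between neighboring voiced frames, so every frame carries a pitch
    that smoothly bridges the voiced segments."""
    f0 = f0.copy()
    voiced = f0 > 0
    if not voiced.any():
        return f0  # nothing to anchor the interpolation on
    idx = np.arange(len(f0))
    # np.interp extends the edge values outward, so leading/trailing
    # silence takes the pitch of the nearest voiced frame.
    f0[~voiced] = np.interp(idx[~voiced], idx[voiced], f0[voiced])
    return f0

# Example: two voiced segments separated by unvoiced frames.
print(smooth_unvoiced_pitch(np.array([0., 220., 230., 0., 0., 180., 0.])))
```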
## Why a mix encoder
![mix_frame](https://github.com/PlayVoice/whisper-vits-svc/assets/16432329/3ffa1be0-1a21-4752-96b5-6220f98f2313)
## Setup
1. Install [PyTorch](https://pytorch.org/get-started/locally/)
2. Install the project dependencies
```
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple -r requirements.txt
```
**Note: do not install whisper separately, or it will conflict with the whisper bundled in this code**
3. Download the [timbre encoder](https://drive.google.com/drive/folders/15oeBYf6Qn1edONkVLXe82MzdIi3O_9m3) and put `best_model.pth.tar` into `speaker_pretrain/` (**do not unzip it**)
4. Download the [whisper-large-v2 model](https://openaipublic.azureedge.net/main/whisper/models/81f7c96c852ee8fc832187b0132e569d6c3065a3252ed18e56effd0b6a73e524/large-v2.pt) and put `large-v2.pt` into `whisper_pretrain/`
5. Download the [hubert_soft model](https://github.com/bshall/hubert/releases/tag/v0.1) and put `hubert-soft-0d54a1f4.pt` into `hubert_pretrain/`
6. Download the pitch extractor [crepe full](https://github.com/maxrmorrison/torchcrepe/tree/master/torchcrepe/assets) and put `full.pth` into `crepe/assets`
**Note: full.pth is 84.9 MB; please verify the file size**
7. Download [sovits5.0.pretrain.pth](https://github.com/PlayVoice/so-vits-svc-5.0/releases/tag/5.0/), put it into `vits_pretrain/`, and run an inference test
> python svc_inference.py --config configs/base.yaml --model ./vits_pretrain/sovits5.0.pretrain.pth --spk ./configs/singers/singer0001.npy --wave test.wav
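Before moving on, it can help to confirm every pretrained file landed in the right place. A small sketch using only the paths and the 84.9 MB size note from the steps above:

```python
import os

# Expected locations from setup steps 3-7 above.
assets = {
    "speaker_pretrain/best_model.pth.tar": None,
    "whisper_pretrain/large-v2.pt": None,
    "hubert_pretrain/hubert-soft-0d54a1f4.pt": None,
    "crepe/assets/full.pth": 84.9,  # expected size in MB (see the note above)
    "vits_pretrain/sovits5.0.pretrain.pth": None,
}
for path, expected_mb in assets.items():
    if not os.path.isfile(path):
        print(f"MISSING: {path}")
        continue
    size_mb = os.path.getsize(path) / 1024 / 1024
    if expected_mb and abs(size_mb - expected_mb) > 1:
        print(f"SUSPICIOUS SIZE: {path} is {size_mb:.1f} MB, expected ~{expected_mb} MB")
```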
## Dataset preparation
1. Vocal separation; skip this step if the dataset has no BGM (we recommend extracting the vocals with the 3_HP-Vocal-UVR or htdemucs_ft model in [UVR](https://github.com/Anjok07/ultimatevocalremovergui))
2. Slice the audio with [slicer](https://github.com/flutydeer/audio-slicer); whisper requires clips under 30 seconds (we suggest discarding clips shorter than 2 seconds, which usually contain no phonemes and may hurt training)
3. Manually screen the audio from steps 1 and 2; trim or discard clips with obvious noise; skip this step if the dataset has no BGM
4. Balance loudness with Adobe Audition
5. Place the dataset under the dataset_raw directory with the following structure
```shell
dataset_raw
├───speaker0
│ ├───000001.wav
│ ├───...
│ └───000xxx.wav
└───speaker1
├───000001.wav
├───...
└───000xxx.wav
```
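Because whisper needs clips under 30 seconds and clips under 2 seconds are best discarded, a quick duration audit of dataset_raw can save a wasted run. A minimal sketch, assuming the soundfile package is installed and the layout shown above:

```python
import glob
import soundfile as sf

# Flag clips outside the 2-30 second window recommended above.
for path in sorted(glob.glob("dataset_raw/*/*.wav")):
    duration = sf.info(path).duration  # seconds
    if duration >= 30:
        print(f"TOO LONG ({duration:.1f}s, whisper limit): {path}")
    elif duration < 2:
        print(f"TOO SHORT ({duration:.1f}s, consider discarding): {path}")
```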
## Data preprocessing
```shell
python svc_preprocessing.py -t 2
```
-t: number of threads; it must be a positive integer no greater than the number of CPU cores; 2 is generally enough
After preprocessing completes, the folder structure looks like this:
```shell
data_svc/
├── waves-16k
│   ├── speaker0
│   │   ├── 000001.wav
│   │   └── 000xxx.wav
│   └── speaker1
│       ├── 000001.wav
│       └── 000xxx.wav
├── waves-32k
│   ├── speaker0
│   │   ├── 000001.wav
│   │   └── 000xxx.wav
│   └── speaker1
│       ├── 000001.wav
│       └── 000xxx.wav
├── pitch
│   ├── speaker0
│   │   ├── 000001.pit.npy
│   │   └── 000xxx.pit.npy
│   └── speaker1
│       ├── 000001.pit.npy
│       └── 000xxx.pit.npy
├── hubert
│   ├── speaker0
│   │   ├── 000001.vec.npy
│   │   └── 000xxx.vec.npy
│   └── speaker1
│       ├── 000001.vec.npy
│       └── 000xxx.vec.npy
├── whisper
│   ├── speaker0
│   │   ├── 000001.ppg.npy
│   │   └── 000xxx.ppg.npy
│   └── speaker1
│       ├── 000001.ppg.npy
│       └── 000xxx.ppg.npy
├── speaker
│   ├── speaker0
│   │   ├── 000001.spk.npy
│   │   └── 000xxx.spk.npy
│   └── speaker1
│       ├── 000001.spk.npy
│       └── 000xxx.spk.npy
└── singer
    ├── speaker0.spk.npy
    └── speaker1.spk.npy
```
If you have some programming experience, we recommend running the processing steps one by one; it also helps you learn how the pipeline works internally (a wrapper sketch follows this list).
- 1. Resampling
Generate audio at a 16000 Hz sampling rate, stored under ./data_svc/waves-16k
> python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-16k -s 16000
Generate audio at a 32000 Hz sampling rate, stored under ./data_svc/waves-32k
> python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-32k -s 32000
- 2. Extract pitch from the 16k audio
> python prepare/preprocess_crepe.py -w data_svc/waves-16k/ -p data_svc/pitch
- 3. Extract whisper content features (PPG) from the 16k audio
> python prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p data_svc/whisper
- 4. Extract hubert content features from the 16k audio
> python prepare/preprocess_hubert.py -w data_svc/waves-16k/ -v data_svc/hubert
- 5. Extract timbre features from the 16k audio
> python prepare/preprocess_speaker.py data_svc/waves-16k/ data_svc/speaker
- 6. Average the timbre features, for use in inference; the average can also serve as a unified speaker timbre in the training index (when the dataset's timbre varies little)
> python prepare/preprocess_speaker_ave.py data_svc/speaker/ data_svc/singer
- 7. Extract linear spectrograms from the 32k audio
> python prepare/preprocess_spec.py -w data_svc/waves-32k/ -s data_svc/specs
- 8. Generate the training index from the 32k audio
> python prepare/preprocess_train.py
- 9. Debug the training files
> python prepare/preprocess_zzz.py
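If you prefer one command, the nine steps can be chained with a small wrapper. The commands below are exactly those listed above; only the subprocess orchestration is added, and any failing step aborts the run:

```python
import subprocess

steps = [
    "python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-16k -s 16000",
    "python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-32k -s 32000",
    "python prepare/preprocess_crepe.py -w data_svc/waves-16k/ -p data_svc/pitch",
    "python prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p data_svc/whisper",
    "python prepare/preprocess_hubert.py -w data_svc/waves-16k/ -v data_svc/hubert",
    "python prepare/preprocess_speaker.py data_svc/waves-16k/ data_svc/speaker",
    "python prepare/preprocess_speaker_ave.py data_svc/speaker/ data_svc/singer",
    "python prepare/preprocess_spec.py -w data_svc/waves-32k/ -s data_svc/specs",
    "python prepare/preprocess_train.py",
    "python prepare/preprocess_zzz.py",
]
for cmd in steps:
    print(f">>> {cmd}")
    subprocess.run(cmd, shell=True, check=True)  # stop at the first failing step
```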
## Training
0. Parameter tuning
If fine-tuning from the pretrained model, download [sovits5.0.pretrain.pth](https://github.com/PlayVoice/so-vits-svc-5.0/releases/tag/5.0) and place it under `vits_pretrain/` in the project root (the config path below expects this)<br>
then set the parameter `pretrain: "./vits_pretrain/sovits5.0.pretrain.pth"` in `configs/base.yaml` and lower the learning rate appropriately (5e-5 is a good starting point)<br>
**learning_rate, batch_size, and accum_step are three tightly coupled parameters that need careful tuning**<br>
**batch_size × accum_step should usually equal 16 or 32; on a low-VRAM GPU, try batch_size = 4 with accum_step = 4 (see the sketch below)**
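These three interact through gradient accumulation: the optimizer steps once per accum_step mini-batches, so the learning rate should be tuned against batch_size × accum_step rather than batch_size alone. A toy, self-contained sketch of the pattern (illustrative only; this is not the project's trainer, and the model and data here are placeholders):

```python
import torch

batch_size, accum_step = 4, 4
effective_batch = batch_size * accum_step  # 16 -- the value the LR is tuned for

# Toy model/data just to make the accumulation pattern concrete.
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
data = [torch.randn(batch_size, 8) for _ in range(8)]

for step, x in enumerate(data):
    loss = model(x).pow(2).mean() / accum_step  # scale so gradients average
    loss.backward()                             # gradients accumulate across iterations
    if (step + 1) % accum_step == 0:
        optimizer.step()        # one optimizer update per effective batch of 16
        optimizer.zero_grad()
```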
1. Start training
```
python svc_trainer.py -c configs/base.yaml -n sovits5.0
```
2. Resume training
```
python svc_trainer.py -c configs/base.yaml -n sovits5.0 -p chkpt/sovits5.0/sovits5.0_***.pt
```
3. Visualize the training logs
```
tensorboard --logdir logs/
```
![sovits5 0_base](https://github.com/PlayVoice/so-vits-svc-5.0/assets/16432329/1628e775-5888-4eac-b173-a28dca978faa)
![sovits_spec](https://github.com/PlayVoice/so-vits-svc-5.0/assets/16432329/c4223cf3-b4a0-4325-bec0-6d46d195a1fc)
## Inference
1. Export the inference model: the text encoder, Flow network, and Decoder network; the discriminator and posterior encoder are only used during training
```
python svc_export.py --config configs/base.yaml --checkpoint_path chkpt/sovits5.0/***.pt
```
2. Inference
- If you do not need to tune f0 by hand and only want the final result, just run:
```
python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./data_svc/singer/<speaker_name>.npy --wave test.wav --shift 0
```
- To tune f0 by hand, follow this flow:
- Extract content features with whisper, generating test.ppg.npy
```
python whisper/inference.py -w test.wav -p test.ppg.npy
```
- Extract content features with hubert, generating test.vec.npy
```
python hubert/inference.py -w test.wav -v test.vec.npy
```
- Extract the F0 parameters in csv text format; open the csv in Excel and manually fix wrong F0 values, cross-checking against Audition or SonicVisualiser
```
python pitch/inference.py -w test.wav -p test.csv
```
- Final inference (the full manual flow is also scripted in the sketch at the end of this section)
```
python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./data_svc/singer/<speaker_name>.npy --wave test.wav --ppg test.ppg.npy --vec test.vec.npy --pit test.csv --shift 0
```
3. Notes
When --ppg is specified, repeated inference on the same audio avoids re-extracting the content features; if it is omitted, extraction happens automatically
When --vec is specified, repeated inference on the same audio avoids re-extracting the hubert features; if it is omitted, extraction happens automatically
When --pit is specified, the hand-tuned F0 parameters are loaded; if it is omitted, F0 is extracted automatically
The output file is written to svc_out.wav in the current directory
| args | --config | --model | --spk | --wave | --ppg | --vec | --pit | --shift |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| name | config file | model file | timbre file | audio file | ppg features | hubert features | pitch | pitch shift |
4. Denoising post-processing
```
python svc_inference_post.py --ref test.wav --svc svc_out.wav --out svc_out_post.wav
```
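For repeated experiments, the manual-f0 flow can be scripted end to end. The commands are the ones shown above; the pause for the Excel edit and the `<speaker_name>` placeholder are the only additions (adapt the speaker path to your data):

```python
import subprocess

def run(cmd: str) -> None:
    print(f">>> {cmd}")
    subprocess.run(cmd, shell=True, check=True)

run("python whisper/inference.py -w test.wav -p test.ppg.npy")
run("python hubert/inference.py -w test.wav -v test.vec.npy")
run("python pitch/inference.py -w test.wav -p test.csv")
input("Edit test.csv in Excel (fix wrong F0 values), then press Enter...")
run("python svc_inference.py --config configs/base.yaml --model sovits5.0.pth"
    " --spk ./data_svc/singer/<speaker_name>.npy"  # placeholder: your speaker file
    " --wave test.wav --ppg test.ppg.npy --vec test.vec.npy"
    " --pit test.csv --shift 0")
```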
## Two training modes
- Per-utterance mode: the timbre file in the training index is each audio clip's own timbre
- Unified mode: the timbre file in the training index is the speaker's averaged timbre
**Open question: in which situations is which mode better?**
## Model merging
```
python svc_merge.py --model1 model_1.pt --model2 model_2.pt --rate rate_of_model1(0~1)
```
Merging models from different epochs averages their performance and reduces overfitting.
Example: python svc_merge.py --model1 chkpt\sovits5.0\sovits5.0_1045.pt --model2 chkpt\sovits5.0\sovits5.0_1050.pt --rate 0.4
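Conceptually the merge is a linear interpolation of the two checkpoints' weights: out = rate × model1 + (1 − rate) × model2. A minimal sketch under the assumption that the checkpoint is a flat tensor state_dict (svc_merge.py's real handling of nested keys and non-tensor entries may differ):

```python
import torch

def merge_checkpoints(path1: str, path2: str, rate: float, out: str) -> None:
    """Linearly interpolate matching tensors: out = rate*m1 + (1-rate)*m2."""
    m1 = torch.load(path1, map_location="cpu")
    m2 = torch.load(path2, map_location="cpu")
    merged = {k: rate * v + (1.0 - rate) * m2[k]
              for k, v in m1.items() if torch.is_tensor(v)}
    torch.save(merged, out)

merge_checkpoints("chkpt/sovits5.0/sovits5.0_1045.pt",
                  "chkpt/sovits5.0/sovits5.0_1050.pt",
                  rate=0.4, out="sovits5.0_merged.pt")
```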
## Voice blending
A purely coincidental name: average -> ave -> eva; Eve symbolizes conception and reproduction.
```
python svc_eva.py
```
```python
eva_conf = {
'./configs/singers/singer0022.npy': 0,
'./configs/singers/singer0030.npy': 0,
'./configs/singers/singer0047.npy': 0.5,
'./configs/singers/singer0051.npy': 0.5,
}
```
The generated timbre file is eva.spk.npy.
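Under the hood the blend amounts to a weighted sum of speaker embeddings. A minimal sketch of that computation (an assumption about what svc_eva.py does with the weights above; the actual script may differ in normalization or file handling):

```python
import numpy as np

eva_conf = {
    './configs/singers/singer0047.npy': 0.5,
    './configs/singers/singer0051.npy': 0.5,
}
# Weighted sum of the selected speaker embeddings (weights sum to 1 here).
blended = sum(w * np.load(path) for path, w in eva_conf.items())
np.save("eva.spk.npy", blended)
```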
## Datasets
| Name | URL |
| :--- | :--- |
|KiSing |http://shijt.site/index.php/2021/05/16/kising-the-first-open-source-mandarin-singing-voice-synthesis-corpus/|
|PopCS |https://github.com/MoonInTheRiver/DiffSinger/blob/master/resources/apply_form.md|
|opencpop |https://wenet.org.cn/opencpop/download/|
|Multi-Singer |https://github.com/Multi-Singer/Multi-Singer.github.io|
|M4Singer |https://github.com/M4Singer/M4Singer/blob/master/apply_form.md|
|CSD |https://zenodo.org/record/4785016#.YxqrTbaOMU4|
|KSS |https://www.kaggle.com/datasets/bryanpark/korean-single-speaker-speech-dataset|
|JVS MuSic |https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_music|
|PJS |https://sites.google.com/site/shinnosuketakamichi/research-topics/pjs_corpus|
|JUST Song |https://sites.google.com/site/shinnosuketakamichi/publication/jsut-song|
|MUSDB18 |https://sigsep.github.io/datasets/musdb.html#musdb18-compressed-stems|
|DSD100 |https://sigsep.github.io/datasets/dsd100.html|
|Aishell-3 |http://www.aishelltech.com/aishell_3|
|VCTK |https://datashare.ed.ac.uk/handle/10283/2651|
|Korean Songs |http://urisori.co.kr/urisori-en/doku.php/|
## Code sources and references
https://github.com/facebookresearch/speech-resynthesis [paper](https://arxiv.org/abs/2104.00355)
https://github.com/jaywalnut310/vits [paper](https://arxiv.org/abs/2106.06103)
https://github.com/openai/whisper/ [paper](https://arxiv.org/abs/2212.04356)
https://github.com/NVIDIA/BigVGAN [paper](https://arxiv.org/abs/2206.04658)
https://github.com/mindslab-ai/univnet [paper](https://arxiv.org/abs/2106.07889)
https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/01-nsf
https://github.com/huawei-noah/Speech-Backbones/tree/main/Grad-TTS
https://github.com/brentspell/hifi-gan-bwe
https://github.com/mozilla/TTS
https://github.com/bshall/soft-vc
https://github.com/maxrmorrison/torchcrepe
https://github.com/MoonInTheRiver/DiffSinger
https://github.com/OlaWod/FreeVC [paper](https://arxiv.org/abs/2210.15418)
https://github.com/yl4579/HiFTNet [paper](https://arxiv.org/abs/2309.09493)
[One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization](https://arxiv.org/abs/1904.05742)
[SNAC : Speaker-normalized Affine Coupling Layer in Flow-based Architecture for Zero-Shot Multi-Speaker Text-to-Speech](https://github.com/hcy71o/SNAC)
[Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers](https://arxiv.org/abs/2211.00585)
[AdaSpeech: Adaptive Text to Speech for Custom Voice](https://arxiv.org/pdf/2103.00993.pdf)
[AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation](https://arxiv.org/pdf/2206.00208.pdf)
[Cross-Speaker Prosody Transfer on Any Text for Expressive Speech Synthesis](https://github.com/ubisoft/ubisoft-laforge-daft-exprt)
[Learn to Sing by Listening: Building Controllable Virtual Singer by Unsupervised Learning from Voice Recordings](https://arxiv.org/abs/2305.05401)
[Adversarial Speaker Disentanglement Using Unannotated External Data for Self-supervised Representation Based Voice Conversion](https://arxiv.org/pdf/2305.09167.pdf)
[Multilingual Speech Synthesis and Cross-Language Voice Cloning: GRL](https://arxiv.org/abs/1907.04448)
[RoFormer: Enhanced Transformer with rotary position embedding](https://arxiv.org/abs/2104.09864)
## Data-perturbation methods for preventing timbre leakage
https://github.com/auspicious3000/contentvec/blob/main/contentvec/data/audio/audio_utils_1.py
https://github.com/revsic/torch-nansy/blob/main/utils/augment/praat.py
https://github.com/revsic/torch-nansy/blob/main/utils/augment/peq.py
https://github.com/biggytruck/SpeechSplit2/blob/main/utils.py
https://github.com/OlaWod/FreeVC/blob/main/preprocess_sr.py
## Contributors
<a href="https://github.com/PlayVoice/so-vits-svc/graphs/contributors">
<img src="https://contrib.rocks/image?repo=PlayVoice/so-vits-svc" />
</a>
## Special thanks
https://github.com/Francis-Komizu/Sovits
## Original work timeline
2022.04.12 https://mp.weixin.qq.com/s/autNBYCsG4_SvWt2-Ll_zA
2022.04.22 https://github.com/PlayVoice/VI-SVS
2022.07.26 https://mp.weixin.qq.com/s/qC4TJy-4EVdbpvK2cQb1TA
2022.09.08 https://github.com/PlayVoice/VI-SVC
## This project was copied by: svc-develop-team/so-vits-svc
![coarse_f0_1](https://github.com/PlayVoice/so-vits-svc-5.0/assets/16432329/e2f5e5d3-d169-42c1-953f-4e1648b6da37)
![coarse_f0_2](https://github.com/PlayVoice/so-vits-svc-5.0/assets/16432329/f3539c83-7c8a-425e-bf20-2c402132f0f4)
![coarse_f0_3](https://github.com/PlayVoice/so-vits-svc-5.0/assets/16432329/f3cee94a-0eeb-4189-b9bb-7043d06e62ef)
## Rcell's actual response to the copying
![Rcell](https://github.com/PlayVoice/so-vits-svc-5.0/assets/16432329/8ebb236d-e233-4cea-9359-8e44029b5af5)