Spaces: Runtime error
Commit c395ff4 · Parent: 7613654
update

Changed files:
- README.md +7 -436
- bert_vits2/Model/Azuma/G_17400.pth +3 -0
- bert_vits2/Model/Azuma/config.json +95 -0
- bert_vits2/bert/pytorch_model.bin +3 -0
- config.py +2 -0
README.md CHANGED
@@ -1,436 +1,7 @@

Removed:

<img src="https://img.shields.io/badge/python-3.10-green">
<a href="https://hub.docker.com/r/artrajz/vits-simple-api">
<img src="https://img.shields.io/docker/pulls/artrajz/vits-simple-api"></a>
</p>
<a href="https://github.com/Artrajz/vits-simple-api/blob/main/README.md">English</a>|<a href="https://github.com/Artrajz/vits-simple-api/blob/main/README_zh.md">中文文档</a>
<br/>
</div>

# Feature

- [x] VITS text-to-speech and voice conversion
- [x] HuBert-soft VITS
- [x] [vits_chinese](https://github.com/PlayVoice/vits_chinese)
- [x] [Bert-VITS2](https://github.com/Stardust-minus/Bert-VITS2)
- [x] W2V2 VITS / [emotional-vits](https://github.com/innnky/emotional-vits) dimensional emotion model
- [x] Support for loading multiple models
- [x] Automatic language recognition and processing: the language detection scope is set according to each model's cleaner, and a custom language range can be specified
- [x] Customizable default parameters
- [x] Long text batch processing
- [x] GPU-accelerated inference
- [x] SSML (Speech Synthesis Markup Language) work in progress...

## demo

[](https://huggingface.co/spaces/Artrajz/vits-simple-api)

Please note that different IDs may support different languages. [speakers](https://artrajz-vits-simple-api.hf.space/voice/speakers)

- `https://artrajz-vits-simple-api.hf.space/voice/vits?text=你好,こんにちは&id=164`
- `https://artrajz-vits-simple-api.hf.space/voice/vits?text=Difficult the first time, easy the second.&id=4`
- excited: `https://artrajz-vits-simple-api.hf.space/voice/w2v2-vits?text=こんにちは&id=3&emotion=111`
- whispered: `https://artrajz-vits-simple-api.hf.space/w2v2-vits?text=こんにちは&id=3&emotion=2077`

https://user-images.githubusercontent.com/73542220/237995061-c1f25b4e-dd86-438a-9363-4bb1fe65b425.mov
# Deploy

## Docker (recommended for Linux)

### Docker image pull script

```
bash -c "$(wget -O- https://raw.githubusercontent.com/Artrajz/vits-simple-api/main/vits-simple-api-installer-latest.sh)"
```

- The platforms currently supported by the Docker images are `linux/amd64` and `linux/arm64` (the arm64 image is CPU-only).
- After a successful pull, a VITS model must be imported before use. Please follow the steps below to import the model.
### Download VITS model

Put the model into `/usr/local/vits-simple-api/Model`

<details><summary>Folder structure</summary><pre><code>
│  hubert-soft-0d54a1f4.pt
│  model.onnx
│  model.yaml
│
├─g
│   config.json
│   G_953000.pth
│
├─louise
│   360_epochs.pth
│   config.json
│
├─Nene_Nanami_Rong_Tang
│   1374_epochs.pth
│   config.json
│
├─Zero_no_tsukaima
│   1158_epochs.pth
│   config.json
│
└─npy
    25ecb3f6-f968-11ed-b094-e0d4e84af078.npy
    all_emotions.npy
</code></pre></details>
### Modify model path

Modify in `/usr/local/vits-simple-api/config.py`

<details><summary>config.py</summary><pre><code>
# Fill in the model path here
MODEL_LIST = [
    # VITS
    [ABS_PATH + "/Model/Nene_Nanami_Rong_Tang/1374_epochs.pth", ABS_PATH + "/Model/Nene_Nanami_Rong_Tang/config.json"],
    [ABS_PATH + "/Model/Zero_no_tsukaima/1158_epochs.pth", ABS_PATH + "/Model/Zero_no_tsukaima/config.json"],
    [ABS_PATH + "/Model/g/G_953000.pth", ABS_PATH + "/Model/g/config.json"],
    # HuBert-VITS (need to configure HUBERT_SOFT_MODEL)
    [ABS_PATH + "/Model/louise/360_epochs.pth", ABS_PATH + "/Model/louise/config.json"],
    # W2V2-VITS (need to configure DIMENSIONAL_EMOTION_NPY)
    [ABS_PATH + "/Model/w2v2-vits/1026_epochs.pth", ABS_PATH + "/Model/w2v2-vits/config.json"],
]
# hubert-vits: hubert soft model
HUBERT_SOFT_MODEL = ABS_PATH + "/Model/hubert-soft-0d54a1f4.pt"
# w2v2-vits: dimensional emotion npy file
# load a single npy: ABS_PATH + "/all_emotions.npy"
# load multiple npy: [ABS_PATH + "/emotions1.npy", ABS_PATH + "/emotions2.npy"]
# load multiple npy from a folder: ABS_PATH + "/Model/npy"
DIMENSIONAL_EMOTION_NPY = ABS_PATH + "/Model/npy"
# w2v2-vits: both `model.onnx` and `model.yaml` must be in the same path.
DIMENSIONAL_EMOTION_MODEL = ABS_PATH + "/Model/model.yaml"
</code></pre></details>
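Each `MODEL_LIST` entry pairs a weights file with its config. As a quick sanity check before startup, the pairs can be verified to exist on disk — a minimal illustrative sketch, not part of the project (the example paths are hypothetical):

```python
import os

# Hypothetical entries in the MODEL_LIST format: [weights_path, config_path]
ABS_PATH = "/usr/local/vits-simple-api"
MODEL_LIST = [
    [ABS_PATH + "/Model/g/G_953000.pth", ABS_PATH + "/Model/g/config.json"],
]

def missing_files(model_list):
    """Return every path from the (weights, config) pairs that does not exist."""
    return [p for pair in model_list for p in pair if not os.path.exists(p)]

# On a machine without the models downloaded, the paths are reported here.
print(missing_files(MODEL_LIST))
```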
### Startup

`docker compose up -d`

Or execute the pull script again.

### Image update

Run the Docker image pull script again.

## Virtual environment deployment

### Clone

`git clone https://github.com/Artrajz/vits-simple-api.git`

### Download python dependencies

A Python virtual environment is recommended.

`pip install -r requirements.txt`

fastText may fail to install on Windows. You can install it with the following command, or download a prebuilt wheel [here](https://www.lfd.uci.edu/~gohlke/pythonlibs/#fasttext).

```
# python3.10 win_amd64
pip install https://github.com/Artrajz/archived/raw/main/fasttext/fasttext-0.9.2-cp310-cp310-win_amd64.whl
```

### Download VITS model

Put the model into `/path/to/vits-simple-api/Model`
<details><summary>Folder structure</summary><pre><code>
│  hubert-soft-0d54a1f4.pt
│  model.onnx
│  model.yaml
│
├─g
│   config.json
│   G_953000.pth
│
├─louise
│   360_epochs.pth
│   config.json
│
├─Nene_Nanami_Rong_Tang
│   1374_epochs.pth
│   config.json
│
├─Zero_no_tsukaima
│   1158_epochs.pth
│   config.json
│
└─npy
    25ecb3f6-f968-11ed-b094-e0d4e84af078.npy
    all_emotions.npy
</code></pre></details>
### Modify model path

Modify in `/path/to/vits-simple-api/config.py`

<details><summary>config.py</summary><pre><code>
# Fill in the model path here
MODEL_LIST = [
    # VITS
    [ABS_PATH + "/Model/Nene_Nanami_Rong_Tang/1374_epochs.pth", ABS_PATH + "/Model/Nene_Nanami_Rong_Tang/config.json"],
    [ABS_PATH + "/Model/Zero_no_tsukaima/1158_epochs.pth", ABS_PATH + "/Model/Zero_no_tsukaima/config.json"],
    [ABS_PATH + "/Model/g/G_953000.pth", ABS_PATH + "/Model/g/config.json"],
    # HuBert-VITS (need to configure HUBERT_SOFT_MODEL)
    [ABS_PATH + "/Model/louise/360_epochs.pth", ABS_PATH + "/Model/louise/config.json"],
    # W2V2-VITS (need to configure DIMENSIONAL_EMOTION_NPY)
    [ABS_PATH + "/Model/w2v2-vits/1026_epochs.pth", ABS_PATH + "/Model/w2v2-vits/config.json"],
]
# hubert-vits: hubert soft model
HUBERT_SOFT_MODEL = ABS_PATH + "/Model/hubert-soft-0d54a1f4.pt"
# w2v2-vits: dimensional emotion npy file
# load a single npy: ABS_PATH + "/all_emotions.npy"
# load multiple npy: [ABS_PATH + "/emotions1.npy", ABS_PATH + "/emotions2.npy"]
# load multiple npy from a folder: ABS_PATH + "/Model/npy"
DIMENSIONAL_EMOTION_NPY = ABS_PATH + "/Model/npy"
# w2v2-vits: both `model.onnx` and `model.yaml` must be in the same path.
DIMENSIONAL_EMOTION_MODEL = ABS_PATH + "/Model/model.yaml"
</code></pre></details>
### Startup

`python app.py`

# GPU accelerated

## Windows

### Install CUDA

Check the highest version of CUDA supported by your graphics card:

```
nvidia-smi
```

Taking CUDA 11.7 as an example, download it from the [official website](https://developer.nvidia.com/cuda-11-7-0-download-archive?target_os=Windows&target_arch=x86_64&target_version=10&target_type=exe_local).

### Install GPU version of PyTorch

1.13.1+cu117 is recommended; other versions may have memory instability issues.

```
pip install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
```

## Linux

The installation process is similar, but I don't have the environment to test it.

# Dependency Installation Issues

Since pypi.org does not host a `pyopenjtalk` wheel, it usually has to be built from source, which can be troublesome for some people. You can therefore also install the wheel I built:

```
pip install pyopenjtalk -i https://pypi.artrajz.cn/simple
```
# API

## GET

#### speakers list

- GET http://127.0.0.1:23456/voice/speakers

Returns the mapping table of role IDs to speaker names.

#### voice vits

- GET http://127.0.0.1:23456/voice/vits?text=text

Default values are used when other parameters are not specified.

- GET http://127.0.0.1:23456/voice/vits?text=[ZH]text[ZH][JA]text[JA]&lang=mix

When lang=mix, the text needs to be annotated with language tags.

- GET http://127.0.0.1:23456/voice/vits?text=text&id=142&format=wav&lang=zh&length=1.4

The text is "text", the role ID is 142, the audio format is wav, the text language is zh, the speech length is 1.4, and the other parameters are default.

#### check

- GET http://127.0.0.1:23456/voice/check?id=0&model=vits

## POST

- See `api_test.py`
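The GET queries above can be assembled programmatically. A minimal sketch using only the standard library — the endpoint and parameter names come from the examples above; the host is assumed to be a default local instance:

```python
from urllib.parse import urlencode

BASE = "http://127.0.0.1:23456"  # assumed default local instance

def vits_url(text, **params):
    """Build a /voice/vits GET URL from the documented query parameters."""
    query = {"text": text, **params}
    return f"{BASE}/voice/vits?{urlencode(query)}"

# Mirrors the third example: id=142, wav output, Chinese text, length 1.4.
print(vits_url("text", id=142, format="wav", lang="zh", length=1.4))
```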
## API KEY

Set `API_KEY_ENABLED = True` in `config.py` to enable API key authentication. The API key is set via `API_KEY = "api-key"`.
After enabling it, add the `api_key` parameter to GET requests, and add the `X-API-KEY` header to POST requests.
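Both authentication styles can be sketched as follows; only the parameter and header names come from the text above, the rest (host, request body) is illustrative:

```python
from urllib.parse import urlencode

API_KEY = "api-key"  # the value configured in config.py

# GET: the key travels as a query parameter.
get_url = "http://127.0.0.1:23456/voice/vits?" + urlencode({"text": "hello", "api_key": API_KEY})

# POST: the key travels in the X-API-KEY request header.
post_headers = {"X-API-KEY": API_KEY}

print(get_url)
print(post_headers)
```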
# Parameter

## VITS

| Name | Parameter | Required | Default | Type | Instruction |
| ---------------------- | --------- | -------- | ---------------- | ----- | ------------------------------------------------------------ |
| Synthesized text | text | true | | str | Text needed for voice synthesis. |
| Speaker ID | id | false | From `config.py` | int | The speaker ID. |
| Audio format | format | false | From `config.py` | str | Supports wav, ogg, silk, mp3, flac |
| Text language | lang | false | From `config.py` | str | The language of the text to be synthesized. Available options are auto, zh, ja, and mix. When lang=mix, the text should be wrapped in [ZH] or [JA]. The default mode is auto, which automatically detects the language of the text. |
| Audio length | length | false | From `config.py` | float | Adjusts the length of the synthesized speech, which is equivalent to adjusting its speed. The larger the value, the slower the speed. |
| Noise | noise | false | From `config.py` | float | Sample noise, controlling the randomness of the synthesis. |
| SDP noise | noisew | false | From `config.py` | float | Stochastic Duration Predictor noise, controlling the length of phoneme pronunciation. |
| Segmentation threshold | max | false | From `config.py` | int | Divides the text into segments at punctuation marks, combining them into one segment until the length exceeds max. If max<=0, the text is not segmented. |
| Streaming response | streaming | false | false | bool | Streams the synthesized speech for a faster initial response. |
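The `max` segmentation behaviour described above can be illustrated with a short sketch. This follows the documented rule (split at punctuation, merge until a segment exceeds `max`, no segmentation when `max<=0`) but is not the project's actual implementation:

```python
import re

def segment(text, max_len):
    """Split text after punctuation, then merge pieces until a segment exceeds max_len."""
    if max_len <= 0:          # documented: max<=0 disables segmentation
        return [text]
    pieces = [p for p in re.split(r"(?<=[。!?.!?,,])", text) if p]
    segments, current = [], ""
    for piece in pieces:
        current += piece
        if len(current) > max_len:
            segments.append(current)
            current = ""
    if current:
        segments.append(current)
    return segments

print(segment("你好。今天天气不错。我们出去走走吧。", 5))
```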
## VITS voice conversion

| Name | Parameter | Required | Default | Type | Instruction |
| -------------- | ----------- | -------- | ------- | ---- | --------------------------------------------------------- |
| Uploaded Audio | upload | true | | file | The audio file to be uploaded. It should be in wav or ogg format. |
| Source Role ID | original_id | true | | int | The role ID of the uploaded audio. |
| Target Role ID | target_id | true | | int | The ID of the target role to convert the audio to. |

## HuBert-VITS

| Name | Parameter | Required | Default | Type | Instruction |
| ----------------- | --------- | -------- | ------- | ----- | ------------------------------------------------------------ |
| Uploaded Audio | upload | true | | file | The audio file to be uploaded. It should be in wav or ogg format. |
| Target speaker ID | id | true | | int | The target speaker ID. |
| Audio format | format | true | | str | wav, ogg, silk |
| Audio length | length | true | | float | Adjusts the length of the synthesized speech, which is equivalent to adjusting its speed. The larger the value, the slower the speed. |
| Noise | noise | true | | float | Sample noise, controlling the randomness of the synthesis. |
| SDP noise | noisew | true | | float | Stochastic Duration Predictor noise, controlling the length of phoneme pronunciation. |

## W2V2-VITS

| Name | Parameter | Required | Default | Type | Instruction |
| ---------------------- | --------- | -------- | ---------------- | ----- | ------------------------------------------------------------ |
| Synthesized text | text | true | | str | Text needed for voice synthesis. |
| Speaker ID | id | false | From `config.py` | int | The speaker ID. |
| Audio format | format | false | From `config.py` | str | Supports wav, ogg, silk, mp3, flac |
| Text language | lang | false | From `config.py` | str | The language of the text to be synthesized. Available options are auto, zh, ja, and mix. When lang=mix, the text should be wrapped in [ZH] or [JA]. The default mode is auto, which automatically detects the language of the text. |
| Audio length | length | false | From `config.py` | float | Adjusts the length of the synthesized speech, which is equivalent to adjusting its speed. The larger the value, the slower the speed. |
| Noise | noise | false | From `config.py` | float | Sample noise, controlling the randomness of the synthesis. |
| SDP noise | noisew | false | From `config.py` | float | Stochastic Duration Predictor noise, controlling the length of phoneme pronunciation. |
| Segmentation threshold | max | false | From `config.py` | int | Divides the text into segments at punctuation marks, combining them into one segment until the length exceeds max. If max<=0, the text is not segmented. |
| Dimensional emotion | emotion | false | 0 | int | The range depends on the emotion reference file in npy format; for example, [innnky](https://huggingface.co/spaces/innnky/nene-emotion/tree/main)'s all_emotions.npy covers 0-5457. |

## Dimensional emotion

| Name | Parameter | Required | Default | Type | Instruction |
| -------------- | --------- | -------- | ------- | ---- | ------------------------------------------------------------ |
| Uploaded Audio | upload | true | | file | Returns the npy file that stores the dimensional emotion vector. |

## Bert-VITS2

| Name | Parameter | Required | Default | Type | Instruction |
| ---------------------- | --------- | -------- | ---------------- | ----- | ------------------------------------------------------------ |
| Synthesized text | text | true | | str | Text needed for voice synthesis. |
| Speaker ID | id | false | From `config.py` | int | The speaker ID. |
| Audio format | format | false | From `config.py` | str | Supports wav, ogg, silk, mp3, flac |
| Text language | lang | false | From `config.py` | str | auto is the default mode and automatically detects the language; it currently only detects the language of an entire text passage and cannot distinguish languages per sentence. The other available options are zh and ja. |
| Audio length | length | false | From `config.py` | float | Adjusts the length of the synthesized speech, which is equivalent to adjusting its speed. The larger the value, the slower the speed. |
| Noise | noise | false | From `config.py` | float | Sample noise, controlling the randomness of the synthesis. |
| SDP noise | noisew | false | From `config.py` | float | Stochastic Duration Predictor noise, controlling the length of phoneme pronunciation. |
| Segmentation threshold | max | false | From `config.py` | int | Divides the text into segments at punctuation marks, combining them into one segment until the length exceeds max. If max<=0, the text is not segmented. |
| SDP/DP mix ratio | sdp_ratio | false | From `config.py` | int | The theoretical proportion of SDP during synthesis; the higher the ratio, the larger the variance in the synthesized voice's intonation. |
## SSML (Speech Synthesis Markup Language)

Supported elements and attributes:

`speak` element

| Attribute | Instruction | Required |
| --------- | ------------------------------------------------------------ | -------- |
| id | Default value is retrieved from `config.py` | false |
| lang | Default value is retrieved from `config.py` | false |
| length | Default value is retrieved from `config.py` | false |
| noise | Default value is retrieved from `config.py` | false |
| noisew | Default value is retrieved from `config.py` | false |
| max | Splits text into segments at punctuation marks; when the accumulated segment length exceeds `max`, it is treated as one segment. `max<=0` means no segmentation. The default value is 0. | false |
| model | Default is `vits`. Options: `w2v2-vits`, `emotion-vits` | false |
| emotion | Only effective with `w2v2-vits` or `emotion-vits`. The range depends on the npy emotion reference file. | false |

`voice` element

Takes priority over `speak`.

| Attribute | Instruction | Required |
| --------- | ------------------------------------------------------------ | -------- |
| id | Default value is retrieved from `config.py` | false |
| lang | Default value is retrieved from `config.py` | false |
| length | Default value is retrieved from `config.py` | false |
| noise | Default value is retrieved from `config.py` | false |
| noisew | Default value is retrieved from `config.py` | false |
| max | Splits text into segments at punctuation marks; when the accumulated segment length exceeds `max`, it is treated as one segment. `max<=0` means no segmentation. The default value is 0. | false |
| model | Default is `vits`. Options: `w2v2-vits`, `emotion-vits` | false |
| emotion | Only effective with `w2v2-vits` or `emotion-vits` | false |

`break` element

| Attribute | Instruction | Required |
| --------- | ------------------------------------------------------------ | -------- |
| strength | x-weak, weak, medium (default), strong, x-strong | false |
| time | The absolute duration of a pause, in seconds (such as `2s`) or milliseconds (such as `500ms`). Valid values range from 0 to 5000 milliseconds. If you set a value greater than the supported maximum, the service uses `5000ms`. If the `time` attribute is set, the `strength` attribute is ignored. | false |

| Strength | Relative Duration |
| :------- | :---------------- |
| x-weak | 250 ms |
| weak | 500 ms |
| medium | 750 ms |
| strong | 1000 ms |
| x-strong | 1250 ms |
Example

```xml
<speak lang="zh" format="mp3" length="1.2">
    <voice id="92">这几天心里颇不宁静。</voice>
    <voice id="125">今晚在院子里坐着乘凉,忽然想起日日走过的荷塘,在这满月的光里,总该另有一番样子吧。</voice>
    <voice id="142">月亮渐渐地升高了,墙外马路上孩子们的欢笑,已经听不见了;</voice>
    <voice id="98">妻在屋里拍着闰儿,迷迷糊糊地哼着眠歌。</voice>
    <voice id="120">我悄悄地披了大衫,带上门出去。</voice><break time="2s"/>
    <voice id="121">沿着荷塘,是一条曲折的小煤屑路。</voice>
    <voice id="122">这是一条幽僻的路;白天也少人走,夜晚更加寂寞。</voice>
    <voice id="123">荷塘四面,长着许多树,蓊蓊郁郁的。</voice>
    <voice id="124">路的一旁,是些杨柳,和一些不知道名字的树。</voice>
    <voice id="125">没有月光的晚上,这路上阴森森的,有些怕人。</voice>
    <voice id="126">今晚却很好,虽然月光也还是淡淡的。</voice><break time="2s"/>
    <voice id="127">路上只我一个人,背着手踱着。</voice>
    <voice id="128">这一片天地好像是我的;我也像超出了平常的自己,到了另一个世界里。</voice>
    <voice id="129">我爱热闹,也爱冷静;<break strength="x-weak"/>爱群居,也爱独处。</voice>
    <voice id="130">像今晚上,一个人在这苍茫的月下,什么都可以想,什么都可以不想,便觉是个自由的人。</voice>
    <voice id="131">白天里一定要做的事,一定要说的话,现在都可不理。</voice>
    <voice id="132">这是独处的妙处,我且受用这无边的荷香月色好了。</voice>
</speak>
```

# Communication

For learning and communication; currently there is only a Chinese [QQ group](https://qm.qq.com/cgi-bin/qm/qr?k=-1GknIe4uXrkmbDKBGKa1aAUteq40qs_&jump_from=webapi&authKey=x5YYt6Dggs1ZqWxvZqvj3fV8VUnxRyXm5S5Kzntc78+Nv3iXOIawplGip9LWuNR/)

# Acknowledgements

- vits: https://github.com/jaywalnut310/vits
- MoeGoe: https://github.com/CjangCjengh/MoeGoe
- emotional-vits: https://github.com/innnky/emotional-vits
- vits-uma-genshin-honkai: https://huggingface.co/spaces/zomehwh/vits-uma-genshin-honkai
- vits_chinese: https://github.com/PlayVoice/vits_chinese
- Bert_VITS2: https://github.com/fishaudio/Bert-VITS2
Added:

license: mit
title: vits-simple-api
sdk: gradio
pinned: true
python_version: 3.10.11
emoji: 👀
app_file: app.py
bert_vits2/Model/Azuma/G_17400.pth ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:184324a03109748e68f7e6587433ef2889e1c57aab84ebbe7825bf3bf0fbfc63
size 629537628
bert_vits2/Model/Azuma/config.json ADDED
@@ -0,0 +1,95 @@
{
  "train": {
    "log_interval": 10,
    "eval_interval": 100,
    "seed": 52,
    "epochs": 10000,
    "learning_rate": 0.0003,
    "betas": [0.8, 0.99],
    "eps": 1e-09,
    "batch_size": 18,
    "fp16_run": false,
    "lr_decay": 0.999875,
    "segment_size": 16384,
    "init_lr_ratio": 1,
    "warmup_epochs": 0,
    "c_mel": 45,
    "c_kl": 1.0
  },
  "data": {
    "use_mel_posterior_encoder": false,
    "training_files": "filelists/train.list",
    "validation_files": "filelists/val.list",
    "max_wav_value": 32768.0,
    "sampling_rate": 44100,
    "filter_length": 2048,
    "hop_length": 512,
    "win_length": 2048,
    "n_mel_channels": 128,
    "mel_fmin": 0.0,
    "mel_fmax": null,
    "add_blank": true,
    "n_speakers": 1,
    "cleaned_text": true,
    "spk2id": {
      "Azuma": 0
    }
  },
  "model": {
    "use_spk_conditioned_encoder": true,
    "use_noise_scaled_mas": true,
    "use_mel_posterior_encoder": false,
    "use_duration_discriminator": true,
    "inter_channels": 192,
    "hidden_channels": 192,
    "filter_channels": 768,
    "n_heads": 2,
    "n_layers": 6,
    "kernel_size": 3,
    "p_dropout": 0.1,
    "resblock": "1",
    "resblock_kernel_sizes": [3, 7, 11],
    "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
    "upsample_rates": [8, 8, 2, 2, 2],
    "upsample_initial_channel": 512,
    "upsample_kernel_sizes": [16, 16, 8, 2, 2],
    "n_layers_q": 3,
    "use_spectral_norm": false,
    "gin_channels": 256
  }
}
bert_vits2/bert/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4ac62d49144d770c5ca9a5d1d3039c4995665a080febe63198189857c6bd11cd
size 1306484351
config.py CHANGED
@@ -67,6 +67,8 @@ MODEL_LIST = [
     # [ABS_PATH + "/Model/w2v2-vits/1026_epochs.pth", ABS_PATH + "/Model/w2v2-vits/config.json"],
     # Bert-VITS2
     # [ABS_PATH + "/Model/bert_vits2/G_9000.pth", ABS_PATH + "/Model/bert_vits2/config.json"],
+
+    [ABS_PATH + "/bert_vits2/Model/Azuma/G_17400.pth", ABS_PATH + "/bert_vits2/Model/Azuma/config.json"]
 ]

 # hubert-vits: hubert soft model