---
language:
- en
- zh
license: apache-2.0
pipeline_tag: text-to-speech
library_name: transformers
---

# Model Description

This is the Hugging Face model card for MegaTTS 3 👋

- Paper: [MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis](https://huggingface.co/papers/2502.18924)
- Project Page (Audio Samples): <https://sditdemo.github.io/sditdemo/>
- GitHub: <https://github.com/bytedance/MegaTTS3>
- [Demo Video](https://github.com/user-attachments/assets/0174c111-f392-4376-a34b-0b5b8164aacc)
- Hugging Face Space: (coming soon)

  ## Installation

```sh
# Clone the repository
git clone https://github.com/bytedance/MegaTTS3
cd MegaTTS3
```

**Model Download**

```sh
huggingface-cli download ByteDance/MegaTTS3 --local-dir ./checkpoints --local-dir-use-symlinks False
```
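
If you prefer to script the download, the same weights can be fetched with the `huggingface_hub` Python API (a minimal sketch; `snapshot_download` ships with `huggingface_hub`, which the CLI above already requires):

```sh
# Equivalent download via the Python API instead of the CLI.
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='ByteDance/MegaTTS3', local_dir='./checkpoints')"
```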

**Requirements (for Linux)**

```sh
# Create a Python 3.10 conda env (you could also use virtualenv)
conda create -n megatts3-env python=3.10
conda activate megatts3-env
pip install -r requirements.txt

# [Optional] Set GPU
export CUDA_VISIBLE_DEVICES=0
```
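
After installation, a quick sanity check that the environment can see the GPU (a sketch; assumes PyTorch is among the pinned requirements):

```sh
# Should print "True" when CUDA is available to the env.
python -c "import torch; print(torch.cuda.is_available())"
```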

**Requirements (for Windows)**

```sh
# [The Windows version is currently under testing]
# Comment out the dependency below in requirements.txt:
# # WeTextProcessing==1.0.4.1

conda env config vars set PYTHONPATH="C:\path\to\MegaTTS3;%PYTHONPATH%"  # For conda users

# [Optional] Set GPU
set CUDA_VISIBLE_DEVICES=0      # Windows (cmd)
$env:CUDA_VISIBLE_DEVICES=0     # PowerShell on Windows
```

**Requirements (for Docker)**

```sh
# [The Docker version is currently under testing]
# ! You should download the pretrained checkpoints before running the following command
docker build . -t megatts3:latest

docker run -it -p 7929:7929 megatts3:latest
# Visit http://0.0.0.0:7929/ for the Gradio UI.
```

> [!IMPORTANT]
> For security reasons, we do not upload the parameters of the WaveVAE encoder to the links above. You can only use the pre-extracted latents from [link1](https://drive.google.com/drive/folders/1QhcHWcy20JfqWjgqZX1YM3I6i9u4oNlr?usp=sharing) for inference. To synthesize speech for speaker A, you need both "A.wav" and "A.npy" in the same directory. If you have any questions or suggestions for our model, please email us.
>
> This project is primarily intended for academic purposes. For academic datasets requiring evaluation, you may upload them to the voice request queue in [link2](https://drive.google.com/drive/folders/1gCWL1y_2xu9nIFhUX_OW5MbcFuB7J5Cl?usp=sharing) (each clip within 24 seconds). After verifying that your uploaded voices are free from safety issues, we will upload their latent files to [link1](https://drive.google.com/drive/folders/1QhcHWcy20JfqWjgqZX1YM3I6i9u4oNlr?usp=sharing) as soon as possible.
>
> In the coming days, we will also prepare and release the latent representations for some common TTS benchmarks.
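
For example, a minimal sketch of the expected layout, previewing the inference CLI documented below (file names here are illustrative, not shipped assets):

```sh
# The pre-extracted WaveVAE latent (.npy) must sit next to the voice prompt
# (.wav) with the same basename, e.g. after downloading both from link1:
#   prompts/speaker_A.wav
#   prompts/speaker_A.npy
python tts/infer_cli.py --input_wav 'prompts/speaker_A.wav' --input_text 'Hello there.' --output_dir ./gen
```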
 
 
## Inference

**Command-Line Usage (Standard)**

```bash
# p_w (intelligibility weight), t_w (similarity weight). Typically, a noisier prompt requires higher p_w and t_w.
python tts/infer_cli.py --input_wav 'assets/Chinese_prompt.wav' --input_text "另一边的桌上,一位读书人嗤之以鼻道,'佛子三藏,神子燕小鱼是什么样的人物,李家的那个李子夜如何与他们相提并论?'" --output_dir ./gen

# Increasing t_w within a reasonable range will increase the generated speech's
# expressiveness and similarity (especially for some emotional cases).
python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text 'As his long promised tariff threat turned into reality this week, top human advisers began fielding a wave of calls from business leaders, particularly in the automotive sector, along with lawmakers who were sounding the alarm.' --output_dir ./gen --p_w 2.0 --t_w 3.0
```
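
For multiple sentences, the CLI can simply be invoked once per line of a text file (a sketch using only the flags shown above; `sentences.txt` is an illustrative input with one sentence per line):

```bash
# Synthesize each line of sentences.txt with the same voice prompt.
while IFS= read -r LINE; do
  python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' \
    --input_text "$LINE" --output_dir ./gen --p_w 2.0 --t_w 3.0
done < sentences.txt
```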

**Command-Line Usage (for TTS with Accents)**

```bash
# When p_w (intelligibility weight) ≈ 1.0, the generated audio closely retains the speaker's original accent. As p_w increases, it shifts toward standard pronunciation.
# t_w (similarity weight) is typically set 0–3 points higher than p_w for optimal results.
# Useful for accented TTS or for solving accent problems in cross-lingual TTS.
python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text '这是一条有口音的音频。' --output_dir ./gen --p_w 1.0 --t_w 3.0
```
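
To hear the effect, one option is to sweep p_w over the same prompt (a sketch using only the flags documented above; the values are illustrative):

```bash
# Render the same accented prompt at several intelligibility weights; the
# accent should fade toward standard pronunciation as p_w grows.
for PW in 1.0 1.5 2.0 2.5; do
  python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' \
    --input_text '这是一条有口音的音频。' \
    --output_dir "./gen_pw_${PW}" --p_w "${PW}" --t_w 3.0
done
```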

**Web UI Usage**

```bash
# We also support CPU inference, but it may take about 30 seconds (for 10 inference steps).
python tts/gradio_api.py
```
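
Assuming the Gradio server listens on port 7929 (the port used in the Docker notes above; the local default may differ), a quick reachability check:

```sh
# Print the HTTP status code of the web UI's index page.
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:7929/
```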

## Security

If you discover a potential security issue in this project, or think you may have discovered a security issue, we ask that you notify Bytedance Security via our [security center](https://security.bytedance.com/src) or by email ([email protected]).

Please do **not** create a public issue.

## License

This project is licensed under the [Apache-2.0 License](LICENSE).

## BibTeX Entry and Citation Info

This repo contains a forced-alignment version of `Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis`, and the WavVAE is mainly based on `WavTokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling`. Compared to the model described in the paper, the repository includes additional models. These not only enhance the stability and cloning capabilities of the algorithm, but can also be used independently to serve a wider range of scenarios.

```bibtex
@article{jiang2025sparse,
  title={Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis},
  journal={arXiv preprint arXiv:2502.18924},
  year={2025}
}

@article{ji2024wavtokenizer,
  title={WavTokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling},
  journal={arXiv preprint arXiv:2408.16532},
  year={2024}
}
```