---
language:
- en
- zh
license: apache-2.0
pipeline_tag: text-to-speech
library_name: transformers
---

# Model Description

This is the Hugging Face model card for MegaTTS 3 👋

- Paper: [MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis](https://huggingface.co/papers/2502.18924)
- Project Page (Audio Samples): <https://sditdemo.github.io/sditdemo/>
- GitHub: <https://github.com/bytedance/MegaTTS3>
- [Demo Video](https://github.com/user-attachments/assets/0174c111-f392-4376-a34b-0b5b8164aacc)
- Hugging Face Space: (coming soon)

  ## Installation

```sh
# Clone the repository
git clone https://github.com/bytedance/MegaTTS3
cd MegaTTS3
```

**Model Download**

```sh
huggingface-cli download ByteDance/MegaTTS3 --local-dir ./checkpoints --local-dir-use-symlinks False
```
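
If you prefer to script the download, the same weights can be fetched with the `huggingface_hub` Python API (a minimal sketch; `snapshot_download` ships with `huggingface_hub`, which the CLI above already requires):

```sh
# Equivalent download via the Python API instead of the CLI.
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='ByteDance/MegaTTS3', local_dir='./checkpoints')"
```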

**Requirements (for Linux)**

```sh
# Create a Python 3.10 conda env (you could also use virtualenv)
conda create -n megatts3-env python=3.10
conda activate megatts3-env
pip install -r requirements.txt

# [Optional] Set GPU
export CUDA_VISIBLE_DEVICES=0
```
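
After installation, a quick sanity check that the environment can see the GPU (a sketch; assumes PyTorch is among the pinned requirements):

```sh
# Should print "True" when CUDA is available to the env.
python -c "import torch; print(torch.cuda.is_available())"
```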

**Requirements (for Windows)**

```sh
# [The Windows version is currently under testing]
# Comment out the dependency below in requirements.txt:
# # WeTextProcessing==1.0.4.1

conda env config vars set PYTHONPATH="C:\path\to\MegaTTS3;%PYTHONPATH%"  # For conda users

# [Optional] Set GPU
set CUDA_VISIBLE_DEVICES=0      # Windows (cmd)
$env:CUDA_VISIBLE_DEVICES=0     # PowerShell on Windows
```

**Requirements (for Docker)**

```sh
# [The Docker version is currently under testing]
# ! You should download the pretrained checkpoints before running the following command
docker build . -t megatts3:latest

docker run -it -p 7929:7929 megatts3:latest
# Visit http://0.0.0.0:7929/ for the Gradio UI.
```

> [!IMPORTANT]
> For security reasons, we do not upload the parameters of the WaveVAE encoder to the links above. You can only use the pre-extracted latents from [link1](https://drive.google.com/drive/folders/1QhcHWcy20JfqWjgqZX1YM3I6i9u4oNlr?usp=sharing) for inference. To synthesize speech for speaker A, you need both "A.wav" and "A.npy" in the same directory. If you have any questions or suggestions for our model, please email us.
>
> This project is primarily intended for academic purposes. For academic datasets requiring evaluation, you may upload them to the voice request queue in [link2](https://drive.google.com/drive/folders/1gCWL1y_2xu9nIFhUX_OW5MbcFuB7J5Cl?usp=sharing) (each clip within 24 seconds). After verifying that your uploaded voices are free from safety issues, we will upload their latent files to [link1](https://drive.google.com/drive/folders/1QhcHWcy20JfqWjgqZX1YM3I6i9u4oNlr?usp=sharing) as soon as possible.
>
> In the coming days, we will also prepare and release the latent representations for some common TTS benchmarks.
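
For example, a minimal sketch of the expected layout, previewing the inference CLI documented below (file names here are illustrative, not shipped assets):

```sh
# The pre-extracted WaveVAE latent (.npy) must sit next to the voice prompt
# (.wav) with the same basename, e.g. after downloading both from link1:
#   prompts/speaker_A.wav
#   prompts/speaker_A.npy
python tts/infer_cli.py --input_wav 'prompts/speaker_A.wav' --input_text 'Hello there.' --output_dir ./gen
```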
 
 
## Inference

**Command-Line Usage (Standard)**

```bash
# p_w (intelligibility weight), t_w (similarity weight). Typically, a noisier prompt requires higher p_w and t_w.
python tts/infer_cli.py --input_wav 'assets/Chinese_prompt.wav' --input_text "另一边的桌上,一位读书人嗤之以鼻道,'佛子三藏,神子燕小鱼是什么样的人物,李家的那个李子夜如何与他们相提并论?'" --output_dir ./gen

# Increasing t_w within a reasonable range will increase the generated speech's
# expressiveness and similarity (especially for some emotional cases).
python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text 'As his long promised tariff threat turned into reality this week, top human advisers began fielding a wave of calls from business leaders, particularly in the automotive sector, along with lawmakers who were sounding the alarm.' --output_dir ./gen --p_w 2.0 --t_w 3.0
```
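
For multiple sentences, the CLI can simply be invoked once per line of a text file (a sketch using only the flags shown above; `sentences.txt` is an illustrative input with one sentence per line):

```bash
# Synthesize each line of sentences.txt with the same voice prompt.
while IFS= read -r LINE; do
  python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' \
    --input_text "$LINE" --output_dir ./gen --p_w 2.0 --t_w 3.0
done < sentences.txt
```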

**Command-Line Usage (for TTS with Accents)**

```bash
# When p_w (intelligibility weight) ≈ 1.0, the generated audio closely retains the speaker's original accent. As p_w increases, it shifts toward standard pronunciation.
# t_w (similarity weight) is typically set 0–3 points higher than p_w for optimal results.
# Useful for accented TTS or for solving accent problems in cross-lingual TTS.
python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text '这是一条有口音的音频。' --output_dir ./gen --p_w 1.0 --t_w 3.0
```
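
To hear the effect, one option is to sweep p_w over the same prompt (a sketch using only the flags documented above; the values are illustrative):

```bash
# Render the same accented prompt at several intelligibility weights; the
# accent should fade toward standard pronunciation as p_w grows.
for PW in 1.0 1.5 2.0 2.5; do
  python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' \
    --input_text '这是一条有口音的音频。' \
    --output_dir "./gen_pw_${PW}" --p_w "${PW}" --t_w 3.0
done
```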

**Web UI Usage**

```bash
# We also support CPU inference, but it may take about 30 seconds (for 10 inference steps).
python tts/gradio_api.py
```
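
Assuming the Gradio server listens on port 7929 (the port used in the Docker notes above; the local default may differ), a quick reachability check:

```sh
# Print the HTTP status code of the web UI's index page.
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:7929/
```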

## Security

If you discover a potential security issue in this project, or think you may have discovered a security issue, we ask that you notify Bytedance Security via our [security center](https://security.bytedance.com/src) or by email ([email protected]).

Please do **not** create a public issue.

## License

This project is licensed under the [Apache-2.0 License](LICENSE).

## BibTeX Entry and Citation Info

This repo contains a forced-alignment version of `Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis`, and the WavVAE is mainly based on `WavTokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling`. Compared to the model described in the paper, the repository includes additional models. These not only enhance the stability and cloning capabilities of the algorithm, but can also be used independently to serve a wider range of scenarios.

```bibtex
@article{jiang2025sparse,
  title={Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis},
  journal={arXiv preprint arXiv:2502.18924},
  year={2025}
}

@article{ji2024wavtokenizer,
  title={WavTokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling},
  journal={arXiv preprint arXiv:2408.16532},
  year={2024}
}
```