# Update README.md

Dia is a 1.6B parameter text to speech model created by Nari Labs.

Dia **directly generates highly realistic dialogue from a transcript**. You can condition the output on audio, enabling emotion and tone control. The model can also produce nonverbal communications like laughter, coughing, throat clearing, etc.

To accelerate research, we are providing access to pretrained model checkpoints and inference code.

We also provide a [demo page](https://yummy-fir-7a4.notion.site/dia) comparing our model to [ElevenLabs Studio](https://elevenlabs.io/studio) and [Sesame CSM-1B](https://github.com/SesameAILabs/csm).

- Join our [discord server](https://discord.gg/pgdB5YRe) for community support and access to new features.
- Play with a larger version of Dia: generate fun conversations, remix content, and share with friends. 🔮 Join the [waitlist](https://tally.so/r/meokbo) for early access.

## ⚡️ Quickstart

Running the commands below will open a Gradio UI that you can experiment with.

```bash
git clone https://github.com/nari-labs/dia.git
cd dia
python -m venv .venv
source .venv/bin/activate
pip install uv
uv run app.py
```
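
`uv run app.py` resolves the project's dependencies on first run and then launches the app; Gradio prints a local URL (http://127.0.0.1:7860 by default) that you can open in your browser.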

## ⚙️ Usage

### As a Python Library

```python
import soundfile as sf

from dia.model import Dia

# Load the pretrained checkpoint from Hugging Face.
model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# Example transcript: [S1]/[S2] mark speaker turns, and parenthesized cues
# like (laughs) request nonverbal sounds.
text = "[S1] Dia generates dialogue directly from a transcript. [S2] Wow, amazing! (laughs)"

output = model.generate(text)

sf.write("simple.mp3", output, 44100)
```
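
As noted in the introduction, the output can be conditioned on audio for emotion and tone control. Below is a minimal sketch of how that might look through the library; the `audio_prompt_path` keyword is an assumption, not a signature confirmed by this README, so check the library source for the actual parameter name:

```python
import soundfile as sf

from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# Hypothetical keyword: condition generation on a reference recording so the
# output picks up its voice, emotion, and tone.
output = model.generate(
    "[S1] Generate speech like this prompt.",
    audio_prompt_path="path/to/your/prompt.wav",
)

sf.write("prompted.mp3", output, 44100)
```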

A PyPI package and a working CLI tool will be available soon.

## 💻 Hardware and Inference Speed

Dia has only been tested on GPUs (PyTorch 2.0+, CUDA 12.6). CPU support will be added soon.
The initial run will take longer as the Descript Audio Codec also needs to be downloaded.

On enterprise GPUs, Dia can generate audio in real time. On older GPUs, inference will be slower.
For reference, on an A4000 GPU, Dia generates roughly 40 tokens/s (86 tokens equal 1 second of audio, so about 0.47 seconds of audio per second of compute).
`torch.compile` will increase speeds for supported GPUs.
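
A rough sketch of what that looks like, assuming the wrapper exposes its underlying `torch.nn.Module` as `model.model` (an assumption for illustration, not a documented attribute):

```python
import torch

from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# Assumption: the underlying PyTorch module is reachable as `model.model`;
# check the library source for the real attribute before relying on this.
model.model = torch.compile(model.model)

# The first call pays the compilation cost; subsequent calls run faster.
output = model.generate("[S1] Testing compiled inference.")
```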

The full version of Dia requires around 10GB of VRAM to run. We will be adding a quantized version in the future.
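
To check whether your GPU has enough memory before running the full model, a small snippet using PyTorch's standard CUDA API:

```python
import torch

# The full model needs ~10GB of VRAM; report what the first CUDA device has.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA device found; Dia currently requires a GPU.")
```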

If you don't have the hardware available, or if you want to play with bigger versions of our models, join the waitlist [here](https://tally.so/r/meokbo).

## 🪪 License

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.

## ⚠️ Disclaimer

This project offers a high-fidelity speech generation model intended solely for research and educational use. The following uses are **strictly forbidden**:

- **Identity Misuse**: Do not produce audio resembling real individuals without permission.
- **Deceptive Content**: Do not use this model to generate misleading content (e.g. fake news).
- **Illegal or Malicious Use**: Do not use this model for activities that are illegal or intended to cause harm.

By using this model, you agree to uphold relevant legal standards and ethical responsibilities. We **are not responsible** for any misuse and firmly oppose any unethical usage of this technology.

## 🔭 TODO / Future Work

- Docker support.
- Optimize inference speed.
- Add quantization for memory efficiency.

## 🤝 Contributing

We are a tiny team of 1 full-time and 1 part-time research engineer. Contributions are extra welcome!
Join our [Discord Server](https://discord.gg/pgdB5YRe) for discussions.

## 🤗 Acknowledgements

- We thank the [Google TPU Research Cloud program](https://sites.research.google/trc/about/) for providing computation resources.
- Our work was heavily inspired by [SoundStorm](https://arxiv.org/abs/2305.09636), [Parakeet](https://jordandarefsky.com/blog/2024/parakeet/), and [Descript Audio Codec](https://github.com/descriptinc/descript-audio-codec).
- "Nari" is a pure Korean word for lily.
- We thank Jason Y. for providing help with data filtering.
|