NariLabs committed ea1fb66 (verified) · parent: a73daec

Update README.md (README.md: +25 −45)
 
 
</a>
</center>

Dia is a 1.6B parameter text to speech model created by Nari Labs.

Dia **directly generates highly realistic dialogue from a transcript**. You can condition the output on audio, enabling emotion and tone control. The model can also produce nonverbal communication such as laughter, coughing, and throat clearing.

To accelerate research, we are providing access to pretrained model checkpoints.

We also provide a [demo page](https://yummy-fir-7a4.notion.site/dia) comparing our model to [ElevenLabs Studio](https://elevenlabs.io/studio) and [Sesame CSM-1B](https://github.com/SesameAILabs/csm).

- Join our [Discord server](https://discord.gg/pgdB5YRe) for community support and access to new features.
- Play with a larger version of Dia: generate fun conversations, remix content, and share with friends. 🔮 Join the [waitlist](https://tally.so/r/meokbo) for early access.

## ⚡️ Quickstart

The following commands open a Gradio UI that you can experiment with.

```bash
git clone https://github.com/nari-labs/dia.git
cd dia
python -m venv .venv
source .venv/bin/activate
pip install uv
uv run app.py
```

## ⚙️ Usage

### As a Python Library

```python
import soundfile as sf

# Model construction and the input text are elided in this diff view;
# only the final lines of the example are shown.
output = model.generate(text)
sf.write("simple.mp3", output, 44100)
```

A PyPI package and a working CLI tool will be available soon.

## 💻 Hardware and Inference Speed

Dia has been tested only on GPUs (PyTorch 2.0+, CUDA 12.6). CPU support will be added soon.
The initial run will take longer because the Descript Audio Codec also needs to be downloaded.
 
 
 

On enterprise GPUs, Dia can generate audio in real time. On older GPUs, inference will be slower.
For reference, on an A4000 GPU, Dia generates roughly 40 tokens/s (86 tokens equal 1 second of audio).
`torch.compile` will increase speeds for supported GPUs.
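The throughput figures above translate directly into a real-time factor. A quick sanity-check sketch using only the numbers quoted here (40 tokens/s on an A4000, 86 tokens per second of audio); actual speed depends on the GPU and settings:

```python
TOKENS_PER_SECOND_OF_AUDIO = 86  # 86 tokens equal 1 second of audio
GENERATION_SPEED = 40            # tokens/s, the A4000 figure quoted above

# Real-time factor: seconds of audio produced per second of wall-clock time.
rtf = GENERATION_SPEED / TOKENS_PER_SECOND_OF_AUDIO
print(f"real-time factor: {rtf:.2f}")  # 0.47, i.e. ~2.15 s of compute per 1 s of audio

# Wall-clock estimate for a 30-second clip at this rate.
seconds_to_generate = 30 * TOKENS_PER_SECOND_OF_AUDIO / GENERATION_SPEED
print(f"30 s clip: ~{seconds_to_generate:.1f} s")  # 64.5 s
```

A real-time factor below 1.0 means the A4000 is slower than real time, consistent with the note that only enterprise GPUs reach real-time generation.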
 
 
 
 
 
 

The full version of Dia requires around 10 GB of VRAM to run. We will be adding a quantized version in the future.
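To see why quantization is on the roadmap, a rough weight-only estimate from the 1.6B parameter count is useful. This sketch ignores activations, the KV cache, and the audio codec, which is why the measured footprint (~10 GB) is well above the fp16 weight size:

```python
PARAMS = 1.6e9  # Dia's parameter count, as stated above

def weight_memory_gb(bytes_per_param: float) -> float:
    """Weight-only memory footprint in GB (1 GB = 1e9 bytes)."""
    return PARAMS * bytes_per_param / 1e9

print(f"fp32: {weight_memory_gb(4):.1f} GB")  # 6.4 GB
print(f"fp16: {weight_memory_gb(2):.1f} GB")  # 3.2 GB
print(f"int8: {weight_memory_gb(1):.1f} GB")  # 1.6 GB
```

An int8 build would roughly halve weight memory relative to fp16, which is the usual motivation for a quantized release.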

If you don't have hardware available, or if you want to play with bigger versions of our models, join the waitlist [here](https://tally.so/r/meokbo).

## 🪪 License

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.

## ⚠️ Disclaimer

This project offers a high-fidelity speech generation model intended solely for research and educational use. The following uses are **strictly forbidden**:

By using this model, you agree to uphold relevant legal standards and ethical responsibilities. We **are not responsible** for any misuse and firmly oppose any unethical use of this technology.

## 🔭 TODO / Future Work

- Docker support.
- Optimize inference speed.
- Add quantization for memory efficiency.

## 🤝 Contributing

We are a tiny team of one full-time and one part-time research engineer. We warmly welcome any contributions!
Join our [Discord server](https://discord.gg/pgdB5YRe) for discussions.

## 🤗 Acknowledgements

- We thank the [Google TPU Research Cloud program](https://sites.research.google/trc/about/) for providing computation resources.
- Our work was heavily inspired by [SoundStorm](https://arxiv.org/abs/2305.09636), [Parakeet](https://jordandarefsky.com/blog/2024/parakeet/), and [Descript Audio Codec](https://github.com/descriptinc/descript-audio-codec).
- "Nari" is a pure Korean word for lily.
- We thank Jason Y. for helping with data filtering.