---
license: cc-by-4.0
datasets:
- facebook/multilingual_librispeech
- parler-tts/libritts_r_filtered
language:
- en
pipeline_tag: text-to-speech
---
<style>
table {
    border-collapse: collapse;
    width: 100%;
    margin-bottom: 20px;
}
th, td {
    border: 1px solid #ddd;
    padding: 8px;
    text-align: center;
}
.best {
    font-weight: bold;
    text-decoration: underline;
}
</style>

<div style="text-align: center; margin: 20px auto; padding: 20px; border: 3px solid #ddd; border-radius: 10px;">
  <h2 style="margin-bottom: 4px; margin-top: 0px;">OuteAI</h2>
  <a href="https://www.outeai.com/" target="_blank" style="margin-right: 10px;">🌎 OuteAI.com</a> 
  <a href="https://discord.gg/vyBM87kAmf" target="_blank" style="margin-right: 10px;">🤝 Join our Discord</a>
  <a href="https://x.com/OuteAI" target="_blank">𝕏 @OuteAI</a>
</div>

# OuteTTS-0.1-350M

## Model Description

OuteTTS-0.1-350M is a novel text-to-speech synthesis model that leverages pure language modeling without external adapters or complex architectures. Built upon the LLaMa architecture using our Oute3-350M-DEV base model, it demonstrates that high-quality speech synthesis is achievable through a straightforward approach using crafted prompts and audio tokens.

## Key Features

- Pure language modeling approach to TTS
- Voice cloning capabilities
- LLaMa architecture
- Compatible with llama.cpp and GGUF format

## Technical Details

The model utilizes a three-step approach to audio processing:
1. Audio tokenization using WavTokenizer (75 tokens per second of audio)
2. CTC forced alignment for precise word-to-audio token mapping
3. Structured prompt creation following the format:
```
[full transcription]
[word] [duration token] [audio tokens]
```
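
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the model's actual prompt builder: the `AlignedWord` type, the `build_prompt` helper, and the `<t_...>` and `<a_...>` marker spellings are all hypothetical, and the audio codes are a stand-in for real WavTokenizer output. Only the 75 tokens-per-second rate and the overall line layout come from the description above.

```python
from dataclasses import dataclass

TOKENS_PER_SECOND = 75  # WavTokenizer rate stated above


@dataclass
class AlignedWord:
    word: str
    start: float  # seconds, e.g. from CTC forced alignment
    end: float    # seconds, e.g. from CTC forced alignment


def build_prompt(transcript, words, audio_codes):
    """Lay out the transcript, then one line per word with duration and audio tokens."""
    lines = [transcript]
    for w in words:
        # Map the aligned time span onto indices in the audio-token sequence.
        first = round(w.start * TOKENS_PER_SECOND)
        last = round(w.end * TOKENS_PER_SECOND)
        codes = audio_codes[first:last]
        duration = f"<t_{w.end - w.start:.2f}>"      # hypothetical marker format
        tokens = "".join(f"<a_{c}>" for c in codes)  # hypothetical marker format
        lines.append(f"{w.word} {duration} {tokens}")
    return "\n".join(lines)


words = [AlignedWord("hello", 0.00, 0.40), AlignedWord("world", 0.40, 0.80)]
audio_codes = list(range(60))  # stand-in for 0.8 s of tokenizer output
prompt = build_prompt("hello world", words, audio_codes)
print(prompt.splitlines()[1][:40])
```

At 75 tokens per second, each 0.4-second word in this toy input maps to 30 audio tokens, which is why even short utterances produce long prompts.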

## Technical Blog
https://www.outeai.com/blog/OuteTTS-0.1-350M

## Limitations
As an experimental v0.1 release, the model has some known issues:

- Vocabulary constraints due to training data limitations
- String-only input support
- Given its compact 350M parameter size, the model may frequently alter, insert, or omit words, leading to variations in output quality.
- Variable temperature sensitivity depending on use case
- Performs best with shorter sentences, as accuracy may decrease with longer inputs
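
One possible way to work around the last point is to split longer text into short pieces before synthesis. The sketch below is only an illustration under assumptions: `split_into_chunks` is a hypothetical helper that is not part of `outetts`, and the 12-word cap is an arbitrary choice rather than a tested recommendation.

```python
import re


def split_into_chunks(text, max_words=12):
    """Split text at sentence ends, then cap each piece at max_words words."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks = []
    for sentence in sentences:
        words = sentence.split()
        # Break overly long sentences into consecutive word windows.
        for i in range(0, len(words), max_words):
            chunks.append(" ".join(words[i:i + max_words]))
    return chunks


chunks = split_into_chunks(
    "Scientists have discovered a new planet. It may be capable of supporting life!"
)
for chunk in chunks:
    print(chunk)
```

Each chunk could then be passed to a separate generation call. Splitting mid-sentence may hurt prosody, so a cap that rarely triggers is probably preferable.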

### Speech Samples

Listen to samples generated by OuteTTS-0.1-350M:

<div style="margin-top: 20px;">
<table style="width: 100%; border-collapse: collapse;">
  <thead>
    <tr>
        <th style="border: 1px solid #ddd; padding: 8px; text-align: left;">Input</th>
        <th style="border: 1px solid #ddd; padding: 8px; text-align: left;">Audio</th>
        <th style="border: 1px solid #ddd; padding: 8px; text-align: left;">Notes</th>
    </tr>
  </thead>
  <tbody>
    <tr>
        <td style="border: 1px solid #ddd; padding: 8px;">Hello, I can speak pretty well, but sometimes I make some mistakes.</td>
        <td style="border: 1px solid #ddd; padding: 8px;">
            <audio controls style="width: 100%;">
                <source src="https://huggingface.co/OuteAI/OuteTTS-0.1-350M/resolve/main/samples/2.wav" type="audio/wav">
                Your browser does not support the audio element.
            </audio>
            <audio controls style="width: 100%;">
                <source src="https://huggingface.co/OuteAI/OuteTTS-0.1-350M/resolve/main/samples/1.wav" type="audio/wav">
                Your browser does not support the audio element.
            </audio>
        </td>
        <td style="border: 1px solid #ddd; padding: 8px;">(temperature=0.1, repetition_penalty=1.1)</td>
    </tr>
    <tr>
        <td style="border: 1px solid #ddd; padding: 8px;">Once upon a time, there was a</td>
        <td style="border: 1px solid #ddd; padding: 8px;">
            <audio controls style="width: 100%;">
                <source src="https://huggingface.co/OuteAI/OuteTTS-0.1-350M/resolve/main/samples/3.wav" type="audio/wav">
                Your browser does not support the audio element.
            </audio>
        </td>
        <td style="border: 1px solid #ddd; padding: 8px;">(temperature=0.1, repetition_penalty=1.1)</td>
    </tr>
    <tr>
        <td style="border: 1px solid #ddd; padding: 8px;">Scientists have discovered a new planet that may be capable of supporting life!</td>
        <td style="border: 1px solid #ddd; padding: 8px;">
            <audio controls style="width: 100%;">
                <source src="https://huggingface.co/OuteAI/OuteTTS-0.1-350M/resolve/main/samples/6.wav" type="audio/wav">
                Your browser does not support the audio element.
            </audio>
        </td>
        <td style="border: 1px solid #ddd; padding: 8px;">Using the Q4_K_M quantized model. (temperature=0.7, repetition_penalty=1.1)</td>
    </tr>
    <tr>
        <td style="border: 1px solid #ddd; padding: 8px;">Scientists have discovered a new planet that may be capable of supporting life!</td>
        <td style="border: 1px solid #ddd; padding: 8px;">
            <audio controls style="width: 100%;">
                <source src="https://huggingface.co/OuteAI/OuteTTS-0.1-350M/resolve/main/samples/4.wav" type="audio/wav">
                Your browser does not support the audio element.
            </audio>
        </td>
        <td style="border: 1px solid #ddd; padding: 8px;">The model partially failed to follow the input text. (temperature=0.1, repetition_penalty=1.1) </td>
    </tr>
    <tr>
        <td style="border: 1px solid #ddd; padding: 8px;">Scientists have discovered a new planet that may be capable of supporting life!</td>
        <td style="border: 1px solid #ddd; padding: 8px;">
            <audio controls style="width: 100%;">
                <source src="https://huggingface.co/OuteAI/OuteTTS-0.1-350M/resolve/main/samples/5.wav" type="audio/wav">
                Your browser does not support the audio element.
            </audio>
        </td>
        <td style="border: 1px solid #ddd; padding: 8px;">In this case, raising the temperature from 0.1 to 0.7 produces more consistent output. (temperature=0.7, repetition_penalty=1.1)</td>
    </tr>
  </tbody>
  </table>
</div>

## Installation
https://github.com/edwko/OuteTTS

```bash
pip install outetts
```

## Usage

### Interface Usage
```python
from outetts.v0_1.interface import InterfaceHF, InterfaceGGUF

# Initialize the interface with the Hugging Face model
interface = InterfaceHF("OuteAI/OuteTTS-0.1-350M")

# Or initialize the interface with a GGUF model
# interface = InterfaceGGUF("path/to/model.gguf")

# Generate TTS output
# Without a speaker reference, the model generates speech with random speaker characteristics
output = interface.generate(
    text="Hello, am I working?",
    temperature=0.1,
    repetition_penalty=1.1,
    max_lenght=4096
)

# Play the generated audio
output.play()

# Save the generated audio to a file
output.save("output.wav")
```

### Voice Cloning
```python
from outetts.v0_1.interface import InterfaceHF

# Initialize the interface as in the Interface Usage example
interface = InterfaceHF("OuteAI/OuteTTS-0.1-350M")

# Create a custom speaker from an audio file
speaker = interface.create_speaker(
    "path/to/reference.wav",
    "reference text matching the audio"
)

# Generate TTS with the custom voice
output = interface.generate(
    text="This is a cloned voice speaking",
    speaker=speaker,
    temperature=0.1,
    repetition_penalty=1.1,
    max_lenght=4096
)
```

## Model Details
- **Model Type:** LLaMa-based language model
- **Size:** 350M parameters
- **Language Support:** English
- **License:** CC BY 4.0
- **Speech Datasets Used:** 
  - LibriTTS-R (CC BY 4.0)
  - Multilingual LibriSpeech (MLS) (CC BY 4.0)

## Future Improvements
- Scaling up parameters and training data
- Exploring alternative alignment methods for better character compatibility
- Potential expansion into speech-to-speech assistant models

## Credits

- WavTokenizer: https://github.com/jishengpeng/WavTokenizer
- CTC Forced Alignment: https://pytorch.org/audio/stable/tutorials/ctc_forced_alignment_api_tutorial.html

## Disclaimer
By using this model, you acknowledge that you understand and assume the risks associated with its use. 
You are solely responsible for ensuring compliance with all applicable laws and regulations. 
We disclaim any liability for problems arising from the use of this open-source model, including but not limited to direct, indirect, incidental, consequential, or punitive damages. 
We make no warranties, express or implied, regarding the model's performance, accuracy, or fitness for a particular purpose. Your use of this model is at your own risk, and you agree to hold harmless and indemnify us, our affiliates, and our contributors from any claims, damages, or expenses arising from your use of the model.