Spaces: Running on Zero
Sync from GitHub repo
This Space is synced from the GitHub repo: https://github.com/SWivid/F5-TTS. Please submit contributions to the Space there.
- src/f5_tts/infer/README.md +23 -1
- src/f5_tts/infer/infer_cli.py +1 -1
- src/f5_tts/infer/utils_infer.py +1 -1
- src/f5_tts/train/README.md +6 -0
src/f5_tts/infer/README.md
CHANGED
````diff
@@ -13,7 +13,7 @@ To avoid possible inference failures, make sure you have seen through the follow
 - Add some spaces (blank: " ") or punctuations (e.g. "," ".") to explicitly introduce some pauses.
 - If English punctuation marks the end of a sentence, make sure there is a space " " after it. Otherwise not regarded as when chunk.
 - Preprocess numbers to Chinese letters if you want to have them read in Chinese, otherwise in English.
-- If the generation output is blank (pure silence), check for
+- If the generation output is blank (pure silence), check for FFmpeg installation.
 - Try turn off `use_ema` if using an early-stage finetuned checkpoint (which goes just few updates).
@@ -129,6 +129,28 @@ ref_text = ""
 ```
 You should mark the voice with `[main]` `[town]` `[country]` whenever you want to change voice, refer to `src/f5_tts/infer/examples/multi/story.txt`.
 
+## API Usage
+
+```python
+from importlib.resources import files
+from f5_tts.api import F5TTS
+
+f5tts = F5TTS()
+wav, sr, spec = f5tts.infer(
+    ref_file=str(files("f5_tts").joinpath("infer/examples/basic/basic_ref_en.wav")),
+    ref_text="some call me nature, others call me mother nature.",
+    gen_text="""I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences.""",
+    file_wave=str(files("f5_tts").joinpath("../../tests/api_out.wav")),
+    file_spec=str(files("f5_tts").joinpath("../../tests/api_out.png")),
+    seed=None,
+)
+```
+Check [api.py](../api.py) for more details.
+
+## TensorRT-LLM Deployment
+
+See [detailed instructions](../runtime/triton_trtllm/README.md) for more information.
+
 ## Socket Real-time Service
 
 Real-time voice output with chunk stream:
````
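The blank-output troubleshooting item in the diff above points at FFmpeg. As a minimal, hypothetical sketch (not part of the repo), availability of an `ffmpeg` binary can be checked from Python before running inference:

```python
import shutil
import subprocess

def ffmpeg_available() -> bool:
    """True if an `ffmpeg` executable is on PATH and actually runs."""
    path = shutil.which("ffmpeg")
    if path is None:
        return False
    try:
        # `ffmpeg -version` exits 0 on a working installation.
        subprocess.run([path, "-version"], capture_output=True, check=True)
        return True
    except (OSError, subprocess.CalledProcessError):
        return False

print("ffmpeg found:", ffmpeg_available())
```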
src/f5_tts/infer/infer_cli.py
CHANGED
```diff
@@ -323,7 +323,7 @@ def main():
         ref_text_ = voices[voice]["ref_text"]
         gen_text_ = text.strip()
         print(f"Voice: {voice}")
-        audio_segment, final_sample_rate,
+        audio_segment, final_sample_rate, spectrogram = infer_process(
             ref_audio_,
             ref_text_,
             gen_text_,
```
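The fixed line completes a three-value unpacking from `infer_process` (the truncated left-hand side was a syntax error). A tiny sketch, using a hypothetical stub that only mirrors the return shape `(audio, sample_rate, spectrogram)`:

```python
def infer_process_stub(ref_audio, ref_text, gen_text):
    # Hypothetical stand-in for infer_process, returning dummy data with the
    # same shape: (audio samples, sample rate, spectrogram frames).
    audio = [0.0] * 4
    spectrogram = [[0.0] * 80]
    return audio, 24000, spectrogram

# The corrected assignment unpacks all three return values at once.
audio_segment, final_sample_rate, spectrogram = infer_process_stub("ref.wav", "ref text", "gen text")
```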
src/f5_tts/infer/utils_infer.py
CHANGED
```diff
@@ -384,7 +384,7 @@ def infer_process(
 ):
     # Split the input text into batches
     audio, sr = torchaudio.load(ref_audio)
-    max_chars = int(len(ref_text.encode("utf-8")) / (audio.shape[-1] / sr) * (22 - audio.shape[-1] / sr))
+    max_chars = int(len(ref_text.encode("utf-8")) / (audio.shape[-1] / sr) * (22 - audio.shape[-1] / sr) * speed)
     gen_text_batches = chunk_text(gen_text, max_chars=max_chars)
     for i, gen_text in enumerate(gen_text_batches):
         print(f"gen_text {i}", gen_text)
```
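The updated `max_chars` formula now also scales with `speed`: the bytes-per-second rate of the reference text is multiplied by the remaining time budget (a 22-second cap minus the reference clip length) and by the speed factor. A small sketch of the arithmetic, where `compute_max_chars` is a hypothetical helper (not repo code) and `audio_seconds` stands for `audio.shape[-1] / sr`:

```python
def compute_max_chars(ref_text: str, audio_seconds: float, speed: float = 1.0) -> int:
    # Bytes of reference text spoken per second of reference audio...
    bytes_per_second = len(ref_text.encode("utf-8")) / audio_seconds
    # ...times the remaining seconds under the 22 s cap, times the speed factor.
    return int(bytes_per_second * (22 - audio_seconds) * speed)

# A 5-second reference clip with 100 bytes of text at normal speed:
print(compute_max_chars("a" * 100, 5.0, speed=1.0))  # 20 bytes/s * 17 s = 340
```

Halving `speed` halves the batch size, so slower speech gets shorter text chunks per generation pass.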
src/f5_tts/train/README.md
CHANGED
````diff
@@ -1,5 +1,11 @@
 # Training
 
+Check your FFmpeg installation:
+```bash
+ffmpeg -version
+```
+If not found, install it first (or skip assuming you know of other backends available).
+
 ## Prepare Dataset
 
 Example data processing scripts, and you may tailor your own one along with a Dataset class in `src/f5_tts/model/dataset.py`.
````