This repository contains the model as described in [LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM](https://hf.co/papers/2503.04724).

For more information, check out the project page at https://mbzuai-oryx.github.io/LLMVoX/ and the code at https://github.com/mbzuai-oryx/LLMVoX.

# LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM

<div>
<a href="https://mbzuai-oryx.github.io/LLMVoX/"><img src="https://img.shields.io/badge/Project-Page-blue" alt="Project Page"></a>
<a href="https://arxiv.org/abs/2503.04724"><img src="https://img.shields.io/badge/arXiv-2503.04724-b31b1b.svg" alt="arXiv"></a>
<a href="https://github.com/mbzuai-oryx/LLMVoX/"><img src="https://img.shields.io/badge/GitHub-LLMVoX-black?logo=github" alt="GitHub Repository"></a>
<a href="https://github.com/mbzuai-oryx/LLMVoX/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT"></a>
</div>

**Authors:**
**[Sambal Shikhar](https://github.com/mbzuai-oryx/LLMVoX?tab=readme-ov-file)**, **[Mohammed Irfan K](https://scholar.google.com/citations?user=GJp0keYAAAAJ&hl=en)**, **[Sahal Shaji Mullappilly](https://scholar.google.com/citations?user=LJWxVpUAAAAJ&hl=en)**, **[Fahad Khan](https://sites.google.com/view/fahadkhans/home)**, **[Jean Lahoud](https://scholar.google.com/citations?user=LsivLPoAAAAJ&hl=en)**, **[Rao Muhammad Anwer](https://scholar.google.com/citations?hl=en&authuser=1&user=_KlvMVoAAAAJ)**, **[Salman Khan](https://salman-h-khan.github.io/)**, **[Hisham Cholakkal](https://scholar.google.com/citations?hl=en&user=bZ3YBRcAAAAJ)**

**Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), UAE**

<p align="center">
  <img src="assets/arch_diagram.svg" alt="LLMVoX Architecture" width="800px">
</p>

## Overview

LLMVoX is a lightweight (30M parameters), LLM-agnostic, autoregressive streaming text-to-speech (TTS) system that converts the text output of any Large Language Model into high-fidelity streaming speech with low latency.

Key features:
- 🚀 **Lightweight & Fast**: Only 30M parameters, with end-to-end latency as low as 300 ms
- 🔌 **LLM-Agnostic**: Works with any LLM or Vision-Language Model without fine-tuning
- 🌊 **Multi-Queue Streaming**: Enables continuous, low-latency speech generation (see the toy sketch after this list)
- 🌐 **Multilingual Support**: Adapts to new languages with only a change of training data
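
To give intuition for the multi-queue idea, here is a toy producer/consumer sketch: text generation and speech synthesis are decoupled through a queue, so audio work can begin before the LLM finishes its reply. This is an illustrative simplification, not LLMVoX's actual implementation; names like `llm_producer` and `tts_consumer` are invented for the example.

```python
# Toy sketch of queue-decoupled streaming (illustrative only, not LLMVoX's code):
# a producer thread stands in for a streaming LLM, a consumer stands in for TTS.
import queue
import threading
import time

text_q: queue.Queue = queue.Queue()

def llm_producer() -> None:
    """Emit text chunks as they are 'generated', without waiting for the consumer."""
    for word in "speech can start before the full reply is written".split():
        text_q.put(word)
        time.sleep(0.05)  # simulated per-token latency
    text_q.put(None)  # end-of-stream sentinel

def tts_consumer() -> None:
    """Drain the queue and 'synthesize' each chunk as soon as it arrives."""
    while (chunk := text_q.get()) is not None:
        print(f"synthesizing: {chunk!r}")

threading.Thread(target=llm_producer, daemon=True).start()
tts_consumer()
```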

## Quick Start

### Installation

```bash
# Requirements: CUDA 11.7+ and a GPU compatible with Flash Attention 2.0+

git clone https://github.com/mbzuai-oryx/LLMVoX.git
cd LLMVoX

conda create -n llmvox python=3.9
conda activate llmvox

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install flash-attn --no-build-isolation
pip install -r requirements.txt

# Download the checkpoints from Hugging Face:
# https://huggingface.co/MBZUAI/LLMVoX/tree/main
mkdir -p CHECKPOINTS
# Place wavtokenizer_large_speech_320_24k.ckpt and ckpt_english_tiny.pt in CHECKPOINTS/
```
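
After installing, a quick import check can confirm that the CUDA build of PyTorch and the flash-attn wheel are usable. This is a generic sanity check, not part of the official setup:

```python
# Generic post-install sanity check (not from the LLMVoX docs): verify that
# PyTorch sees the GPU and that flash-attn built and imports correctly.
import torch
import flash_attn  # raises ImportError if the wheel failed to build

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```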

### Voice Chat

```bash
# Basic usage
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct"

# With multiple GPUs
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" \
  --llm_device "cuda:0" --tts_device_1 1 --tts_device_2 2

# Balance latency and quality
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" \
  --initial_dump_size_1 10 --initial_dump_size_2 160 --max_dump_size 1280
```

### Text Chat & Visual Speech

```bash
# Text-to-speech
python streaming_server.py --chat_type text --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct"

# Visual speech (speech + image → speech)
python streaming_server.py --chat_type visual_speech --llm_checkpoint "Qwen/Qwen2.5-VL-7B-Instruct" \
  --eos_token "<|im_end|>"

# Multimodal (supports models like Phi-4)
python streaming_server.py --chat_type multimodal --llm_checkpoint "microsoft/Phi-4-multimodal-instruct" \
  --eos_token "<|end|>"
```

## API Reference

| Endpoint | Purpose | Required Parameters |
|----------|---------|---------------------|
| `/tts` | Text-to-speech | `text`: string to convert |
| `/voicechat` | Voice conversations | `audio_base64`, `source_language`, `target_language` |
| `/multimodalchat` | Voice + multiple images | `audio_base64`, `image_list` |
| `/vlmschat` | Voice + single image | `audio_base64`, `image_base64`, `source_language`, `target_language` |
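
As a usage illustration, a minimal `/tts` client might look like the sketch below. The host, port, JSON payload shape, and audio-bytes response are assumptions (the address depends on `--api_port`, and this README does not pin down the exact contract), so adjust to match your running server.

```python
# Minimal sketch of a /tts client. Assumed details (not fixed by this README):
# server address, JSON request body, and a raw audio byte stream in the response.
import requests

SERVER = "http://localhost:5000"  # assumes the server was started with --api_port 5000

resp = requests.post(f"{SERVER}/tts", json={"text": "Hello from LLMVoX!"}, stream=True)
resp.raise_for_status()

# Write the streamed audio to disk as chunks arrive.
with open("output.wav", "wb") as f:
    for chunk in resp.iter_content(chunk_size=4096):
        f.write(chunk)
```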

## Local UI Demo

<p align="center">
  <img src="assets/ui.png" alt="Demo UI" width="800px">
</p>

```bash
# Start the streaming server
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --api_port PORT

# Launch the UI, pointing it at the streaming server
python run_ui.py --ip STREAMING_SERVER_IP --port PORT
```

## Citation

```bibtex
@article{shikhar2025llmvox,
  title={LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM},
  author={Shikhar, Sambal and Kurpath, Mohammed Irfan and Mullappilly, Sahal Shaji and Lahoud, Jean and Khan, Fahad and Anwer, Rao Muhammad and Khan, Salman and Cholakkal, Hisham},
  journal={arXiv preprint arXiv:2503.04724},
  year={2025}
}
```

## Acknowledgments

- [Andrej Karpathy's NanoGPT](https://github.com/karpathy/nanoGPT)
- [WavTokenizer](https://github.com/jishengpeng/WavTokenizer)
- [Whisper](https://github.com/openai/whisper)
- [Neural G2P](https://github.com/lingjzhu/CharsiuG2P)

## License

This project is licensed under the MIT License; see the [LICENSE](LICENSE) file for details.