---
tags:
- gpt4-o
- tokenizer
- codec-representation
---
# WavTokenizer

SOTA Discrete Codec Models With Forty Tokens Per Second for Audio Language Modeling

[![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://github.com/jishengpeng/wavtokenizer)
[![demo](https://img.shields.io/badge/WavTokenizer-Demo-red)](https://wavtokenizer.github.io/)
[![model](https://img.shields.io/badge/%F0%9F%A4%97%20WavTokenizer-Models-blue)](https://github.com/jishengpeng/wavtokenizer)
19 |
+
|
20 |
+
|
21 |
+
|
22 |
+
### ππ with WavTokenizer, you can represent speech, music, and audio with only 40 tokens one second!
|
23 |
+
### ππ with WavTokenizer, You can get strong reconstruction results.
|
24 |
+
### ππ WavTokenizer owns rich semantic information and is build for audio language models such as GPT4-o.
|
25 |
+
|
26 |
+
# 🔥 News

- *2024.08*: We released WavTokenizer on arXiv.

![result](result.png)
## Installation

To use WavTokenizer, install the dependencies:

```bash
conda create -n wavtokenizer python=3.9
conda activate wavtokenizer
pip install -r requirements.txt
```
## Inference

### Part 1: Reconstruct audio from a raw wav

```python
import torch
import torchaudio

from encoder.utils import convert_audio
from decoder.pretrained import WavTokenizer

device = torch.device('cpu')

config_path = "./configs/xxx.yaml"
model_path = "./xxx.ckpt"
audio_path = "xxx"      # input wav (placeholder path)
audio_outpath = "xxx"   # output wav (placeholder path)

wavtokenizer = WavTokenizer.from_pretrained0802(config_path, model_path)
wavtokenizer = wavtokenizer.to(device)

wav, sr = torchaudio.load(audio_path)
wav = convert_audio(wav, sr, 24000, 1)  # resample to 24 kHz mono
bandwidth_id = torch.tensor([0])
wav = wav.to(device)
features, discrete_code = wavtokenizer.encode_infer(wav, bandwidth_id=bandwidth_id)
audio_out = wavtokenizer.decode(features, bandwidth_id=bandwidth_id)
torchaudio.save(audio_outpath, audio_out, sample_rate=24000, encoding='PCM_S', bits_per_sample=16)
```
### Part 2: Generate discrete codes

```python
import torch
import torchaudio

from encoder.utils import convert_audio
from decoder.pretrained import WavTokenizer

device = torch.device('cpu')

config_path = "./configs/xxx.yaml"
model_path = "./xxx.ckpt"
audio_path = "xxx"  # input wav (placeholder path)

wavtokenizer = WavTokenizer.from_pretrained0802(config_path, model_path)
wavtokenizer = wavtokenizer.to(device)

wav, sr = torchaudio.load(audio_path)
wav = convert_audio(wav, sr, 24000, 1)  # resample to 24 kHz mono
bandwidth_id = torch.tensor([0])
wav = wav.to(device)
_, discrete_code = wavtokenizer.encode_infer(wav, bandwidth_id=bandwidth_id)
print(discrete_code)
```
### Part 3: Reconstruct audio from discrete codes

```python
# audio_tokens: shape [n_q, 1, t] or [n_q, t]
features = wavtokenizer.codes_to_features(audio_tokens)
bandwidth_id = torch.tensor([0])
audio_out = wavtokenizer.decode(features, bandwidth_id=bandwidth_id)
```
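As a back-of-the-envelope check of what a 40 token/s representation implies, the arithmetic below follows from the 40 tokens/s rate and the 4096-entry codebook in the model names; the single-quantizer reading is our assumption, not a quote from the paper:

```python
import math

# Assumed setup: one quantizer at 40 tokens/s, 4096-entry codebook.
tokens_per_second = 40
codebook_size = 4096
bits_per_token = int(math.log2(codebook_size))  # 4096 entries -> 12 bits per token

duration_s = 10.0
num_tokens = int(duration_s * tokens_per_second)  # tokens needed for a 10 s clip
bitrate_bps = tokens_per_second * bits_per_token  # implied bitrate in bits/s

print(num_tokens, bitrate_bps)  # 400 480
```

So under these assumptions, a 10-second clip becomes just 400 discrete tokens at roughly 0.48 kbps.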
## Available models

🤗 links to the Hugging Face model hub.

| Model name | HuggingFace | Corpus | Token/s | Domain | Open-Source |
|:---|:---:|:---:|:---:|:---:|:---:|
| WavTokenizer-small-600-24k-4096 | [🤗](https://github.com/jishengpeng/wavtokenizer) | LibriTTS | 40 | Speech | √ |
| WavTokenizer-small-320-24k-4096 | [🤗](https://github.com/jishengpeng/wavtokenizer) | LibriTTS | 75 | Speech | √ |
| WavTokenizer-medium-600-24k-4096 | [🤗](https://github.com/jishengpeng/wavtokenizer) | 10000 Hours | 40 | Speech, Audio, Music | Coming Soon |
| WavTokenizer-medium-320-24k-4096 | [🤗](https://github.com/jishengpeng/wavtokenizer) | 10000 Hours | 75 | Speech, Audio, Music | Coming Soon |
| WavTokenizer-large-600-24k-4096 | [🤗](https://github.com/jishengpeng/wavtokenizer) | LibriTTS | 40 | Speech, Audio, Music | Coming Soon |
| WavTokenizer-large-320-24k-4096 | [🤗](https://github.com/jishengpeng/wavtokenizer) | 80000 Hours | 75 | Speech, Audio, Music | Coming Soon |
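The model names appear to follow the pattern `<hop>-<sample rate>k-<codebook size>`; under that (unconfirmed) reading, the token rates in the table fall directly out of the sample rate and hop size:

```python
# Hypothetical reading of names like "WavTokenizer-small-600-24k-4096":
# 600 = hop size in samples, 24k = 24000 Hz sample rate, 4096 = codebook size.
sample_rate = 24000
for hop in (600, 320):
    print(f"hop {hop}: {sample_rate // hop} tokens/s")
# hop 600: 40 tokens/s
# hop 320: 75 tokens/s
```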
## Training

### Step 1: Prepare the training dataset

```python
# Process the data into a form similar to ./data/demo.txt
```
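We have not inspected `./data/demo.txt`; assuming it is a plain filelist with one audio path per line (a common convention for such configs), a helper like the following could generate it. Both the helper name and the format are assumptions:

```python
# Hypothetical helper: collect .wav paths under a directory and write them
# one per line, the format we assume ./data/demo.txt uses.
from pathlib import Path

def write_filelist(audio_dir: str, out_path: str) -> int:
    """Write a sorted filelist of .wav paths; returns the number of files."""
    paths = sorted(str(p) for p in Path(audio_dir).rglob("*.wav"))
    Path(out_path).write_text("\n".join(paths) + "\n")
    return len(paths)

# Example: write_filelist("./data/wavs", "./data/train.txt")
```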
### Step 2: Modify the configuration file

```python
# ./configs/xxx.yaml
# Set the values of parameters such as batch_size, filelist_path, save_dir, and device
```
### Step 3: Start training

Refer to the [PyTorch Lightning documentation](https://lightning.ai/docs/pytorch/stable/) for details on customizing the training pipeline.

```bash
cd ./WavTokenizer
python train.py fit --config ./configs/xxx.yaml
```
## Citation

If this code contributes to your research, please cite our work, Language-Codec and WavTokenizer:

```
@misc{ji2024languagecodec,
      title={Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models},
      author={Shengpeng Ji and Minghui Fang and Ziyue Jiang and Rongjie Huang and Jialong Zuo and Shulei Wang and Zhou Zhao},
      year={2024},
      eprint={2402.12208},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}
```