# SpeechLM

<!--**Pre-trained models for speech related tasks**-->

 [**SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data**](https://arxiv.org/abs/2209.15329)

- June 2023: We have corrected the errors in the pre-training data for the SpeechLM-P Base models, and the results have been updated.

- April 2023: We discovered some errors in the data used for the pre-training experiments, which affect all results for the SpeechLM-P Base models. We are re-running the related experiments and will update the paper with the new results.

- (Done) Oct 2022: Released the code and models.
- Oct 2022: Released the preprint on [arXiv](https://arxiv.org/abs/2209.15329).

## Pre-Trained and Fine-tuned Models

|  Model            |               Pre-training Dataset                                                                            | Fine-tuning Dataset                                               | Model |
| :------:          | :----------------------------------------------:                                                              | :-----------------:                                               | :-----: |
| SpeechLM-P Base   | [960 hrs LibriSpeech](http://www.openslr.org/12) + [40M Text](http://www.openslr.org/11)                      |                      -                                            | [Azure Storage](https://valle.blob.core.windows.net/share/speechlm/update/speechlm_checkpoint_298_400000.pt?sv=2020-04-08&st=2023-06-19T10%3A35%3A37Z&se=2033-06-20T10%3A35%3A00Z&sr=b&sp=r&sig=xPzDV3Zm7l7Mp4dgMxAYMOcoZfVJjlbBglqD7uw2XW0%3D)  |
| SpeechLM-P Base   | [960 hrs LibriSpeech](http://www.openslr.org/12) + [40M Text](http://www.openslr.org/11)                      | [100 hrs LibriSpeech](http://www.openslr.org/12)                  | [Azure Storage](https://valle.blob.core.windows.net/share/speechlm/update/checkpoint_best_asr_ft.pt?sv=2020-04-08&st=2023-06-19T10%3A36%3A39Z&se=2033-06-20T10%3A36%3A00Z&sr=b&sp=r&sig=xbS2hGAlTr7K6JJdBN0nKrPtITZE62eT%2FoEK3MBsnZs%3D)  |
| SpeechLM-H Base   | [960 hrs LibriSpeech](http://www.openslr.org/12) + [40M Text](http://www.openslr.org/11)                      |                      -                                            | [Google drive](https://drive.google.com/file/d/1eblW8U8f9t-NTuCNRrNHwr-8BeLAUAmQ/view?usp=sharing)  |
| SpeechLM-H Base   | [960 hrs LibriSpeech](http://www.openslr.org/12) + [40M Text](http://www.openslr.org/11)                      | [100 hrs LibriSpeech](http://www.openslr.org/12)                  | [Google drive](https://drive.google.com/file/d/1vXyO5DolbiWiTYZ6pkkKQsu2wJetaPlv/view?usp=sharing)  |
| SpeechLM-P Base   | [960 hrs LibriSpeech](http://www.openslr.org/12) + [40M Text](http://www.openslr.org/11)                      | [En-De CoVoST-2](https://github.com/facebookresearch/covost)      | [Azure Storage](https://valle.blob.core.windows.net/share/speechlm/update/checkpoint_best_ende_ft.pt?sv=2020-04-08&st=2023-06-19T10%3A37%3A23Z&se=2033-06-20T10%3A37%3A00Z&sr=b&sp=r&sig=bNET3bF240rQg%2B%2F87WC%2FJ1cMojI0WEIoqwEfM7PyQUE%3D)  |
| SpeechLM-P Base   | [960 hrs LibriSpeech](http://www.openslr.org/12) + [40M Text](http://www.openslr.org/11)                      | [En-Ca CoVoST-2](https://github.com/facebookresearch/covost)      | [Azure Storage](https://valle.blob.core.windows.net/share/speechlm/update/checkpoint_best_enca_ft.pt?sv=2020-04-08&st=2023-06-19T10%3A37%3A46Z&se=2033-06-20T10%3A37%3A00Z&sr=b&sp=r&sig=9H1XMRiAU8tz%2B9Ri4sUGP0kZFiiQ5cSVqAqShZAhIzY%3D)  |
| SpeechLM-P Base   | [960 hrs LibriSpeech](http://www.openslr.org/12) + [40M Text](http://www.openslr.org/11)                      | [En-Ar CoVoST-2](https://github.com/facebookresearch/covost)      | [Azure Storage](https://valle.blob.core.windows.net/share/speechlm/update/checkpoint_best_enar_ft.pt?sv=2020-04-08&st=2023-06-19T10%3A38%3A05Z&se=2033-06-20T10%3A38%3A00Z&sr=b&sp=r&sig=mvlF1vmbW9mr66dP3wW9M%2BiU7ASluD4xqCbxblYPCOw%3D)  |
| SpeechLM-P Base   | [960 hrs LibriSpeech](http://www.openslr.org/12) + [40M Text](http://www.openslr.org/11)                      | [En-Tr CoVoST-2](https://github.com/facebookresearch/covost)      | [Azure Storage](https://valle.blob.core.windows.net/share/speechlm/update/checkpoint_best_entr_ft.pt?sv=2020-04-08&st=2023-06-19T10%3A38%3A29Z&se=2033-06-20T10%3A38%3A00Z&sr=b&sp=r&sig=Wda6nh9AVlcJAI6PamiEuHeeCwi4Yudva060qGORbSc%3D)  |
| SpeechLM-P Large  | [60k hrs LibriLight](https://github.com/facebookresearch/libri-light) + [40M Text](http://www.openslr.org/11) |                      -                                            | [Google drive](https://drive.google.com/file/d/1QjLIgTJKIylVIp5hUkfSjGPtz8Xo7Lky/view?usp=sharing)  |
| SpeechLM-P Large  | [60k hrs LibriLight](https://github.com/facebookresearch/libri-light) + [40M Text](http://www.openslr.org/11) | [960 hrs LibriSpeech](http://www.openslr.org/12)                  | [Google drive](https://drive.google.com/file/d/1YZQDVv096o8Opt0RBnkRiZXYPRDqKZnP/view?usp=sharing)  |
| SpeechLM-P Large  | [60k hrs LibriLight](https://github.com/facebookresearch/libri-light) + [40M Text](http://www.openslr.org/11) | [En-De CoVoST-2](https://github.com/facebookresearch/covost)      | [Google drive](https://drive.google.com/file/d/1qYygNWSc11TQbBI1OzC4ChlR-dNh8t9S/view?usp=sharing)  |
| SpeechLM-P Large  | [60k hrs LibriLight](https://github.com/facebookresearch/libri-light) + [40M Text](http://www.openslr.org/11) | [En-Ca CoVoST-2](https://github.com/facebookresearch/covost)      | [Google drive](https://drive.google.com/file/d/162U88mwso2aVfzzPkEM2nP_vwTpcb57T/view?usp=sharing)  |
| SpeechLM-P Large  | [60k hrs LibriLight](https://github.com/facebookresearch/libri-light) + [40M Text](http://www.openslr.org/11) | [En-Ar CoVoST-2](https://github.com/facebookresearch/covost)      | [Google drive](https://drive.google.com/file/d/1lbTSRXewEeb2t45URunD6EiJcbniyjWW/view?usp=sharing)  |
| SpeechLM-P Large  | [60k hrs LibriLight](https://github.com/facebookresearch/libri-light) + [40M Text](http://www.openslr.org/11) | [En-Tr CoVoST-2](https://github.com/facebookresearch/covost)      | [Google drive](https://drive.google.com/file/d/1Er4I_jHS175pQQph223yKtiiLQ378VvH/view?usp=sharing)  |


## Extract features using pre-trained models
For easier use of our pre-trained models, we merge all inference-related code into [`SpeechLM.py`](SpeechLM.py) and provide cleaned checkpoints [~~`SpeechLM-P Base`~~](https://valle.blob.core.windows.net/share/speechlm/speechlmp_base_checkpoint_clean.pt?sv=2020-04-08&st=2023-04-04T05%3A42%3A17Z&se=2033-04-05T05%3A42%3A00Z&sr=b&sp=r&sig=DN7VwaEWhrhRPiyuT84mJpohrMeJsEPq4o6qRr8BNsk%3D), [`SpeechLM-H Base`](https://valle.blob.core.windows.net/share/speechlm/speechlmh_base_checkpoint_clean.pt?sv=2020-04-08&st=2023-04-04T05%3A43%3A07Z&se=2033-04-05T05%3A43%3A00Z&sr=b&sp=r&sig=T9oaIvrb3z3Wo5GTZp8eP2x7B7yuQ%2B80Ff1KhuWrrKg%3D), and [`SpeechLM-P Large`](https://valle.blob.core.windows.net/share/speechlm/speechlmp_large_checkpoint_clean.pt?sv=2020-04-08&st=2023-04-04T05%3A43%3A33Z&se=2033-04-05T05%3A43%3A00Z&sr=b&sp=r&sig=qfWBNdiIGuDgkgUiHXaWnPiVbUHm3VSp%2FHTlWrCghRk%3D) with the non-required modules removed. You can then use the following script to extract speech features:
```python
import torch
import torch.nn.functional as F
from SpeechLM import SpeechLMConfig, SpeechLM

checkpoint = torch.load('path/to/the/cleaned/checkpoint.pt')
cfg = SpeechLMConfig(checkpoint['cfg']['model'])
model = SpeechLM(cfg)
model.load_state_dict(checkpoint['model'])
model.eval()

wav_input_16khz = torch.randn(1,10000)
normalize = checkpoint['cfg']['task']['normalize']  # False for base model, True for large model
if normalize:
    wav_input_16khz = F.layer_norm(wav_input_16khz[0], wav_input_16khz[0].shape).unsqueeze(0)

# extract the representation of the last layer
rep = model.extract_features(wav_input_16khz)[0]

# extract the representation of each layer
output_layer = model.cfg.encoder_layers + model.cfg.text_transformer.encoder.layers
rep, layer_results = model.extract_features(wav_input_16khz, output_layer=output_layer, ret_layer_results=True)[0]
layer_reps = [x.transpose(0, 1) for x in layer_results]
```
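
As a quick follow-up to the snippet above, the last-layer representation can be pooled into a single utterance-level embedding. This is only a sketch and assumes `rep` comes out with shape `(batch, frames, dim)`; verify the shape for your checkpoint before relying on it.

```python
# Continues the snippet above; assumes rep has shape (batch, frames, dim).
utt_embedding = rep.mean(dim=1)                      # average over time -> (batch, dim)
utt_embedding = F.normalize(utt_embedding, dim=-1)   # optional L2 normalization
print(utt_embedding.shape)
```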


## Setup
To fine-tune or pre-train more models, please follow the instructions below.

```bash
git submodule update --init SpeechLM/fairseq
cd SpeechLM/
pip install --editable fairseq/
pip install sacrebleu==1.5.1
```

## ASR on LibriSpeech
### Data preparation
Please follow the wav2vec 2.0 manifest preparation steps [here](https://github.com/pytorch/fairseq/tree/main/examples/wav2vec#prepare-training-data-manifest) to prepare `train.tsv` and `train.ltr`. Make sure the vocabulary [`dict.ltr.txt`](dataset/LibriSpeech/asr/dict.ltr.txt) is the same as the one used for the pre-trained model.

Put your prepared data into `$data_dir`; we provide examples in [`dataset/LibriSpeech/asr`](dataset/LibriSpeech/asr/).
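
For reference, the sketch below shows the expected file layout: the `.tsv` manifest lists the audio root on its first line and `relative_path<TAB>num_samples` afterwards, and the `.ltr` file holds space-separated letters with `|` as the word boundary. The use of `soundfile` and the `wav_paths`/`transcripts` lists are illustrative assumptions; the wav2vec 2.0 script linked above is the authoritative recipe.

```python
import os
import soundfile as sf  # assumed available for reading audio headers

def write_manifest(root, wav_paths, transcripts, split, out_dir):
    """Write a wav2vec-2.0-style manifest (<split>.tsv) and letter targets (<split>.ltr)."""
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, f"{split}.tsv"), "w") as tsv, \
         open(os.path.join(out_dir, f"{split}.ltr"), "w") as ltr:
        print(root, file=tsv)  # first line of the tsv: audio root directory
        for path, text in zip(wav_paths, transcripts):
            n_samples = sf.info(os.path.join(root, path)).frames
            print(f"{path}\t{n_samples}", file=tsv)  # relative path <tab> number of samples
            # letter targets: space-separated characters with '|' marking word boundaries
            print(" ".join("|".join(text.upper().split())) + " |", file=ltr)
```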

### Fine-tune a CTC model
- Fine-tune the base model
    ```bash
    # Usage: speechlm/scripts/tune_speechlm_asr/finetune_base_ctc.sh <model_path> <data_dir> <cpt_tag> [mount=$PWD] [world_size=8] [update_freq=1]
    model_path=path/to/your/pre-trained/model
    data_dir=dataset/LibriSpeech/asr
    bash speechlm/scripts/tune_speechlm_asr/finetune_base_ctc.sh $model_path $data_dir 'tag400k'
    ```
- Fine-tune the large model
    ```bash
    # Usage: speechlm/scripts/tune_speechlm_asr/finetune_large_ctc.sh <model_path> <data_dir> <cpt_tag> [mount=$PWD] [world_size=8] [update_freq=4]
    model_path=path/to/your/pre-trained/model
    data_dir=dataset/LibriSpeech/asr
    bash speechlm/scripts/tune_speechlm_asr/finetune_large_ctc.sh $model_path $data_dir 'tag400k'
    ```
### Decode
- Directly decode a CTC model.
    ```bash
    # Usage: speechlm/scripts/tune_speechlm_asr/inference_ctc.sh <model_path> <data_dir> [gen-set=dev_clean,dev_other,test_clean,test_other]
    model_path=path/to/your/fine-tuned/model
    data_dir=dataset/LibriSpeech/asr
    bash speechlm/scripts/tune_speechlm_asr/inference_ctc.sh $model_path $data_dir
    # for large models
    # bash speechlm/scripts/tune_speechlm_asr/inference_ctc_large.sh $model_path $data_dir
    ```
- Decode with a 4-gram language model using [flashlight](https://github.com/flashlight/flashlight/tree/main/bindings/python) and [kenlm](https://github.com/kpu/kenlm).
    > Please put [4-gram.arpa](https://www.openslr.org/resources/11/4-gram.arpa.gz) and the word-to-letter lexicon [librispeech_lexicon.lst](https://drive.google.com/file/d/1q7IbNGqtwXnctjvuvpviQ4ZmepFHQmTO/view?usp=sharing) into `$data_dir`.
    ```bash
    # Usage: speechlm/scripts/tune_speechlm_asr/inference_ctc_kenlm.sh <model_path> <data_dir> [gen-set=dev_clean,dev_other,test_clean,test_other]
    model_path=path/to/your/fine-tuned/model
    data_dir=dataset/LibriSpeech/asr
    bash speechlm/scripts/tune_speechlm_asr/inference_ctc_kenlm.sh $model_path $data_dir
    ```
- Decode large models with a fairseq LM using [flashlight](https://github.com/flashlight/flashlight/tree/main/bindings/python).
    > Please put [lm_librispeech_word_transformer.pt](https://dl.fbaipublicfiles.com/wav2letter/sota/2019/lm/lm_librispeech_word_transformer.pt) and its vocabulary [`dict.txt`](https://dl.fbaipublicfiles.com/wav2letter/sota/2019/lm/lm_librispeech_word_transformer.dict) into `$data_dir/fairseq_word_lm`, and the word-to-letter lexicon [librispeech_lexicon.lst](https://drive.google.com/file/d/1q7IbNGqtwXnctjvuvpviQ4ZmepFHQmTO/view?usp=sharing) into `$data_dir`. Capitalize the `dict.txt` to make it compatible with the word-to-letter lexicon.
    ```bash
    # Usage: speechlm/scripts/tune_speechlm_asr/inference_ctc_large_fsqlm.sh <model_path> <data_dir> [gen-set=dev_clean,dev_other,test_clean,test_other]
    model_path=path/to/your/fine-tuned/model
    data_dir=dataset/LibriSpeech/asr
    bash speechlm/scripts/tune_speechlm_asr/inference_ctc_large_fsqlm.sh $model_path $data_dir dev_other
    ```

## ST on CoVoST-2
### Data Preparation
1. Download [Common Voice audio clips](https://commonvoice.mozilla.org/en/datasets) (version 4) for English into `$cv_root/en`.
2. Get the data manifest. The following script converts the mp3 files to waveforms, creates tsv files containing speech/translation pairs, and creates the data config files.
    ```bash
    lang=de # ca,ar,tr
    cv_root=dataset/CommonVoice/v4
    bash speechlm/data_process/prepare_covost2_enxx.sh $lang $cv_root
    ```
    We provide examples in [`dataset/CommonVoice/v4/en/en-de`](dataset/CommonVoice/v4/en/en-de).
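
As a rough illustration of the waveform conversion step performed by the script above, an mp3 clip can be resampled to 16 kHz mono wav as sketched below. The `torchaudio` calls and file paths are illustrative assumptions, not the exact implementation used in `prepare_covost2_enxx.sh`.

```python
import torchaudio

def mp3_to_wav16k(mp3_path: str, wav_path: str) -> None:
    """Convert an mp3 clip to a 16 kHz mono wav file (illustration only)."""
    wav, sr = torchaudio.load(mp3_path)
    wav = wav.mean(dim=0, keepdim=True)  # downmix to mono
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    torchaudio.save(wav_path, wav, 16000)
```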

### Fine-tune an encoder-decoder model
- Fine-tune the Base model (fine-tuned models will be stored in `$mount/exp/finetune_covost`).

    ```bash
    model_path=path/to/your/pre-trained/model
    lang=de # ca,ar,tr
    data_dir=dataset/CommonVoice/v4/en/en-${lang}
    # Usage (Base model): speechlm/scripts/tune_speechlm_st/ft_base_covost_enxx.sh <model_path> <data_dir> <lang> <cpt-tag> [mount=$PWD] [world_size=8] [update_freq=2]
    bash speechlm/scripts/tune_speechlm_st/ft_base_covost_enxx.sh $model_path $data_dir $lang 'tag400k'
    ```
- Fine-tune the Large model (fine-tuned models will be stored in `$mount/exp/finetune_covost`).
    ```bash
    # Usage (Large model): speechlm/scripts/tune_speechlm_st/ft_large_covost_enxx.sh <model_path> <data_dir> <lang> <cpt-tag> [mount=$PWD] [world_size=8] [update_freq=4]
    bash speechlm/scripts/tune_speechlm_st/ft_large_covost_enxx.sh $model_path $data_dir $lang 'tag400k'
    ```

### Decode
- Decode the base model
    ```bash
    # Usage: speechlm/scripts/tune_speechlm_st/inference_base.sh <model_path> <data_dir> <lang> [gen-set=dev] [beam_size=5]
    model_path=path/to/your/fine-tuned/model
    lang=de # ca,ar,tr
    data_dir=dataset/CommonVoice/v4/en/en-${lang}
    bash speechlm/scripts/tune_speechlm_st/inference_base.sh $model_path $data_dir $lang dev
    ```
- Decode the large model
    ```bash
    # Usage: speechlm/scripts/tune_speechlm_st/inference_large.sh <model_path> <data_dir> <lang> [gen-set=dev] [beam_size=5]
    bash speechlm/scripts/tune_speechlm_st/inference_large.sh $model_path $data_dir $lang dev
    ```

## Universal Representation Evaluation on SUPERB

Please refer to [**SUPERB**](https://superbbenchmark.org/) for the downstream tasks.

## Pre-train
Please follow the instructions in [Tokenizers](README.md#Tokenizers) to prepare the pre-training data. We provide examples in [`dataset`](dataset).
- SpeechLM-P Base model

    Models will be stored in `$mount/pretrain`.
    ```bash
    data_dir=dataset/LibriSpeech/phone_unit   # should contain train_960.{tsv,phn}
    text_data_dir=dataset/LibriLM/phone_unit/bin-idx     # should contain train_text.phn-ltr.{phn,ltr}.{bin,idx}
    # Usage: speechlm/scripts/pretrain_speechlm/base_speechlmp.sh <data_dir> <text_data_dir> [mount=$PWD] [world_size=32] [update_freq=1]
    bash speechlm/scripts/pretrain_speechlm/base_speechlmp.sh $data_dir $text_data_dir
    ```
- SpeechLM-H Base model
    ```bash
    data_dir=dataset/LibriSpeech/hidden_unit  # should contain train_960.{tsv,km}
    text_data_dir=dataset/LibriLM/km-ltr/bin-idx     # should contain train_text.km-ltr.{km,ltr}.{bin,idx}
    # Usage: speechlm/scripts/pretrain_speechlm/base_speechlmh.sh <data_dir> <text_data_dir> [mount=$PWD] [world_size=32] [update_freq=1]
    bash speechlm/scripts/pretrain_speechlm/base_speechlmh.sh $data_dir $text_data_dir
    ```
- SpeechLM-P Large model
    ```bash
    data_dir=dataset/LibriSpeech/phone_unit   # should contain train_960.{tsv,phn}
    text_data_dir=dataset/LibriLM/phone_unit/bin-idx     # should contain train_text.phn-ltr.{phn,ltr}.{bin,idx}
    # Usage: speechlm/scripts/pretrain_speechlm/large_speechlmp.sh <data_dir> <text_data_dir> [mount=$PWD] [world_size=32] [update_freq=1]
    bash speechlm/scripts/pretrain_speechlm/large_speechlmp.sh $data_dir $text_data_dir
    ```


## Tokenizers
### Phoneme-unit Tokenizer for Speech
This tokenizer produces frame-aligned phonemes for unlabeled speech; it is in fact a hybrid HMM ASR model.

In the Base setting, we use the 100-hour labeled LibriSpeech data to train the HMM model with a Kaldi recipe, then decode the unpaired speech and obtain the aligned phonemes from the lattice.
We provide the processed phonemes of the 960 hours of speech here: [`train_960.tsv`](https://drive.google.com/file/d/1rxlikMglL2kEsF4NfqekZRoA02klY7CE/view?usp=sharing), [`train_960.phn`](), [`dev_clean.tsv`](https://drive.google.com/file/d/1NuVwe687jLBFkDLRy1EV2A2uXyV_kBo2/view?usp=sharing), [`dev_clean.phn`](https://drive.google.com/file/d/1cq_gbS-UgCALOoaE5QmhWrhkTdXuc_Uc/view?usp=sharing). Note that the label rate is 100 Hz (10 ms per frame).

> The phoneme inventory is 300+ word-position-dependent phones including silence phones.
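
As a sanity check on the downloaded alignments, the number of phone labels per utterance should roughly match the audio length at the 100 Hz label rate. Here is a minimal sketch, assuming the `.tsv` follows the wav2vec-style manifest format and the `.phn` file stores one space-separated label sequence per line:

```python
def check_label_rate(tsv_path, phn_path, label_rate=100, sample_rate=16000, tol=2):
    """Check that frame-level phone labels match the audio duration at 10 ms per frame."""
    with open(tsv_path) as f_tsv, open(phn_path) as f_phn:
        f_tsv.readline()  # skip the first line of the manifest (audio root directory)
        for tsv_line, phn_line in zip(f_tsv, f_phn):
            rel_path, n_samples = tsv_line.rstrip("\n").split("\t")
            expected = int(n_samples) * label_rate // sample_rate
            n_labels = len(phn_line.split())
            if abs(expected - n_labels) > tol:
                print(f"mismatch: {rel_path} expected ~{expected} labels, got {n_labels}")
```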

### Phoneme-unit Tokenizer for Text
This tokenizer phonemizes the unpaired text data into (phonemes, letters) paired data, following a `words -> phonemes -> upsampled phones` pipeline.

The following script will download the LibriSpeech LM corpus and produce the required data: `train_text.phn-ltr.phn.{idx,bin}` and `train_text.phn-ltr.ltr.{idx,bin}`.
> Before running it, make sure our provided [`dict.phn.txt`](dataset/LibriLM/phone_unit/bin-idx/dict.phn.txt) and [`dict.ltr.txt`](dataset/LibriLM/phone_unit/bin-idx/dict.ltr.txt) are in the output dir `dataset/LibriLM/phone_unit/bin-idx/`.

> The phoneme inventory is 300+ word-position-dependent phones including silence phones.

```bash
# data will be in dataset/LibriLM/phone_unit/
bash speechlm/data_process/prepare_phn2ltr_librilm.sh
```
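
To make the `words -> phonemes -> upsampled phones` pipeline concrete, here is a toy illustration of the upsampling step. The fixed repeat factor and the phone labels are placeholders; the real pipeline uses the 300+ word-position-dependent phone inventory and data-driven durations.

```python
def upsample_phones(phones, repeat=5):
    """Toy upsampling: repeat each phoneme a fixed number of times to mimic frame-level phones."""
    return [p for p in phones for _ in range(repeat)]

# Placeholder phones; real labels are word-position-dependent (e.g. begin/inside/end variants).
print(upsample_phones(["HH_B", "AH_I", "L_I", "OW_E"], repeat=3))
```
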
### Hidden-unit Tokenizer for Speech
Please follow the data preparation steps for HuBERT [here](https://github.com/facebookresearch/fairseq/tree/main/examples/hubert#data-preparation) to prepare 1) the wav manifest [`train.tsv`](dataset/LibriSpeech/hidden_unit/train_sample100.tsv), 2) the corresponding hidden units [`train.km`](dataset/LibriSpeech/hidden_unit/train_sample100.km), and 3) the unit vocabulary [`dict.km.txt`](dataset/LibriSpeech/hidden_unit/dict.km.txt).
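
For intuition, the hidden units come from clustering frame-level teacher features with k-means; each line of a `.km` file is then the space-separated unit sequence for one utterance. The sketch below uses scikit-learn on random placeholder features with an assumed cluster count; the HuBERT recipe linked above provides the actual scripts.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Placeholder features standing in for teacher-model outputs, shape (n_frames, dim).
feats = np.random.randn(10000, 768).astype(np.float32)

km = MiniBatchKMeans(n_clusters=500, batch_size=10000).fit(feats)  # 500 clusters is an assumption
units = km.predict(feats)  # one integer unit per feature frame
print(" ".join(map(str, units[:20])))  # a .km line is such a sequence for one utterance
```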

### Hidden-unit Tokenizer for Text
This tokenizer produces speech-style hidden units from unpaired text.
We train a [FastSpeech](https://arxiv.org/abs/2006.04558)-like model as the tokenizer on a small amount of ASR data ([100 hrs LibriSpeech](http://www.openslr.org/12)); instead of generating continuous spectrograms as in the original paper, it generates discrete units.

Train:
1. Convert the ASR transcripts to phoneme sequences with duration information.
2. Extract hidden-units from speech, using the [Hidden-unit Tokenizer for Speech](#hidden-unit-tokenizer-for-speech).
3. Train the [model](speechlm/models/fasttext2unit.py) on the paired data:
    ```bash
    data_dir=dataset/LibriSpeech/fast_phone2unit
    bash speechlm/scripts/tokenizer_fastT2U/train_s_5e-4.sh $data_dir
    ```
> The phoneme inventory is 41 mono phones including silence phones.

Inference:

4. Convert text data to phoneme sequences using the [`lexicon`](https://drive.google.com/file/d/1dh9NEx_cCF9_Aa0UcKyl9j00GXs6LmLQ/view?usp=sharing).
5. [Generate](speechlm/scripts/tokenizer_fastT2U/generate.sh) hidden units for a large text corpus:
    ```bash
    gen_set=dataset/LibriSpeech/fast_phone2unit/genset_examples
    bash speechlm/scripts/tokenizer_fastT2U/generate.sh $model_path $gen_set
    ```
We provide training/generation data examples in [`dataset/LibriSpeech/fast_phone2unit`](dataset/LibriSpeech/fast_phone2unit), and the model checkpoint [here](https://drive.google.com/file/d/1e-aYf8hPXuly8DEvNg5SISOlcUxsgED0/view?usp=sharing).

## License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree.
Portions of the source code are based on [FAIRSEQ](https://github.com/pytorch/fairseq).

[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)

## Reference

If you find our work useful in your research, please cite the following paper:

```bibtex
@article{zhang2022speechlm,
  title         = {SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data},
  author        = {Zhang, Ziqiang and Chen, Sanyuan and Zhou, Long and Wu, Yu and Ren, Shuo and Liu, Shujie and Yao, Zhuoyuan and Gong, Xun and Dai, Lirong and Li, Jinyu and Wei, Furu},
  eprint        = {2209.15329},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  year          = {2022}
}
```

### Contact Information

For help or issues using SpeechLM models, please submit a GitHub issue.

For other communications related to SpeechLM, please contact Long Zhou (`[email protected]`).