# SpeechUT
<!--**Pre-trained models for speech related tasks**-->
[**SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training**](https://arxiv.org/abs/2210.03730)
- (Done) Oct 2022: release the code and models
- Oct 2022: release preprint in [arXiv](https://arxiv.org/abs/2210.03730)
## Pre-Trained and Fine-tuned Models
| Model | Pre-training Dataset (unlabeled) | Fine-tuning Dataset (labeled) | Download |
| :------: | :----------------------------------------------: | :-----------------: | :-----: |
| SpeechUT Base (ASR) | [960 hrs LibriSpeech](http://www.openslr.org/12) + [40M Text](http://www.openslr.org/11) | - | [Azure Storage](https://valle.blob.core.windows.net/share/speechut/base_speechut4asr_32gpu_1accum/checkpoint_298_400000.pt?sv=2020-04-08&st=2023-03-08T01%3A39%3A48Z&se=2024-03-09T01%3A39%3A00Z&sr=b&sp=r&sig=l3gJS1D%2BJfLfNfS3z33WjmSMGrOBJ63CvqGGewC6WeU%3D)|
| SpeechUT Base (ASR) | [960 hrs LibriSpeech](http://www.openslr.org/12) + [40M Text](http://www.openslr.org/11) | [100 hrs LibriSpeech](http://www.openslr.org/12) | [Azure Storage](https://valle.blob.core.windows.net/share/speechut/speechut_base_asr100h_checkpoint_best.pt?sv=2020-04-08&st=2023-03-08T01%3A41%3A22Z&se=2024-03-09T01%3A41%3A00Z&sr=b&sp=r&sig=%2B9lpGrqtZXa%2F6n1uZT%2Biey54ky31bYKSJytgfnBbbN4%3D)|
| SpeechUT Large (ASR) | [60k hrs LibriSpeech](http://www.openslr.org/12) + [40M Text](http://www.openslr.org/11) | - | [Azure Storage](https://valle.blob.core.windows.net/share/speechut/large_speechut4asr_32gpu_4accum/checkpoint_22_400k.pt?sv=2020-04-08&st=2023-03-08T01%3A42%3A10Z&se=2024-03-09T01%3A42%3A00Z&sr=b&sp=r&sig=TZNcsHQAqapyj%2BAvpHtl749kZy9flTkWm8P5L4W26qs%3D)|
| SpeechUT Large (ASR) | [60k hrs LibriSpeech](http://www.openslr.org/12) + [40M Text](http://www.openslr.org/11) | [960 hrs LibriSpeech](http://www.openslr.org/12) | [Azure Storage](https://valle.blob.core.windows.net/share/speechut/speechut_large_asr960h_checkpoint_best.pt?sv=2020-04-08&st=2023-03-08T01%3A43%3A02Z&se=2024-03-09T01%3A43%3A00Z&sr=b&sp=r&sig=PmO%2BgSAMXRgMC7GfpS4c%2BrDPsfJGekqUzD5AJm7RrYU%3D)|
| SpeechUT Base (En-De) | [960 hrs LibriSpeech](http://www.openslr.org/12) + [408 hrs MuST-C v1](https://ict.fbk.eu/must-c/) + [4.6M Text](https://www.statmt.org/wmt16/) | - | [Azure Storage](https://valle.blob.core.windows.net/share/speechut/base_speechut4ende_32gpu_1accum/checkpoint_217_400000.pt?sv=2020-04-08&st=2023-03-08T01%3A43%3A47Z&se=2024-03-09T01%3A43%3A00Z&sr=b&sp=r&sig=XDEesMdGQ027j7YtpSql1kZtwgfv39gruOuWYlKlJ7w%3D)|
| SpeechUT Base (En-De) | [960 hrs LibriSpeech](http://www.openslr.org/12) + [408 hrs MuST-C v1](https://ict.fbk.eu/must-c/) + [4.6M Text](https://www.statmt.org/wmt16/) | [En-De MuST-C v1](https://ict.fbk.eu/must-c/) | [Azure Storage](https://valle.blob.core.windows.net/share/speechut/base_speechut4ende_32gpu_1accum/fineutne_ende_checkpoint_avg.pt?sv=2020-04-08&st=2023-03-08T01%3A44%3A15Z&se=2024-03-09T01%3A44%3A00Z&sr=b&sp=r&sig=8dcenahRg46EJdwiHUalVBJgKra6JoSN7tUxdLAwzOM%3D)|
| SpeechUT Base (En-Es) | [960 hrs LibriSpeech](http://www.openslr.org/12) + [504 hrs MuST-C v1](https://ict.fbk.eu/must-c/) + [15M Text](https://www.statmt.org/wmt13/) | - | [Azure Storage](https://valle.blob.core.windows.net/share/speechut/base_speechut4enes_32gpu_1accum/checkpoint_204_400000.pt?sv=2020-04-08&st=2023-03-08T01%3A48%3A16Z&se=2024-03-09T01%3A48%3A00Z&sr=b&sp=r&sig=hWoCM0y0SGZTD4CznC%2F5CejFczkqDYTOaISmlhCAYAU%3D)|
| SpeechUT Base (En-Es) | [960 hrs LibriSpeech](http://www.openslr.org/12) + [504 hrs MuST-C v1](https://ict.fbk.eu/must-c/) + [15M Text](https://www.statmt.org/wmt13/) | [En-Es MuST-C v1](https://ict.fbk.eu/must-c/) | [Azure Storage](https://valle.blob.core.windows.net/share/speechut/base_speechut4enes_32gpu_1accum/fineutne_enes_checkpoint_avg.pt?sv=2020-04-08&st=2023-03-08T01%3A48%3A41Z&se=2024-03-09T01%3A48%3A00Z&sr=b&sp=r&sig=KGfzgKfKkDVQI0JxxnS%2BsYdBQzhUqFLQAVYG0OSGBtk%3D)|
| SpeechUT Base (En-Fr) | [960 hrs LibriSpeech](http://www.openslr.org/12) + [492 hrs MuST-C v1](https://ict.fbk.eu/must-c/) + [40M Text](https://www.statmt.org/wmt14/) | - | [Azure Storage](https://valle.blob.core.windows.net/share/speechut/base_speechut4enfr_32gpu_1accum/checkpoint_297_600000.pt?sv=2020-04-08&st=2023-03-08T01%3A49%3A09Z&se=2024-03-09T01%3A49%3A00Z&sr=b&sp=r&sig=1eqpXMLCjWpfyd7AiOHGzfk%2B8ZYqWwVWdHk1GqXgoeg%3D)|
| SpeechUT Base (En-Fr) | [960 hrs LibriSpeech](http://www.openslr.org/12) + [492 hrs MuST-C v1](https://ict.fbk.eu/must-c/) + [40M Text](https://www.statmt.org/wmt14/) | [En-Fr MuST-C v1](https://ict.fbk.eu/must-c/) | [Azure Storage](https://valle.blob.core.windows.net/share/speechut/base_speechut4enfr_32gpu_1accum/fineutne_enfr_checkpoint.pt?sv=2020-04-08&st=2023-03-08T01%3A49%3A34Z&se=2024-03-09T01%3A49%3A00Z&sr=b&sp=r&sig=i3jMAqvL1Vp7DRjACAbrdoQKhlv2Cmi40%2F14SJ6%2BoiU%3D)|
## Language Model
See [here](https://github.com/microsoft/SpeechT5/tree/main/Speech2C#language-model-and-vocabulary).
## Setup
```bash
git submodule update --init SpeechUT/fairseq
cd SpeechUT/
pip install --editable fairseq/
pip install sacrebleu==1.5.1
```
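Optionally, a quick check that the editable install is importable:
```bash
# Confirm fairseq was installed from the local submodule
python -c "import fairseq; print(fairseq.__version__)"
```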
## ASR on LibriSpeech
### Data preparation
Please follow the wav2vec 2.0 manifest preparation steps [here](https://github.com/pytorch/fairseq/tree/main/examples/wav2vec#prepare-training-data-manifest) to prepare `train.tsv` and `train.ltr`. Make sure the vocabulary [`dict.ltr.txt`](dataset/LibriSpeech/dict.ltr.txt) is the same as that used for the pre-trained model. Put your prepared data into `$data_dir`.
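For reference, a minimal sketch of the manifest preparation, assuming fairseq's wav2vec example scripts and a local LibriSpeech download (all paths are placeholders):
```bash
# Build train.tsv (audio manifest) from a LibriSpeech split
python fairseq/examples/wav2vec/wav2vec_manifest.py /path/to/LibriSpeech/train-clean-100 \
    --dest $data_dir --ext flac --valid-percent 0
# Build train.ltr (letter transcriptions) from the manifest
python fairseq/examples/wav2vec/libri_labels.py $data_dir/train.tsv \
    --output-dir $data_dir --output-name train
```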
### Fine-tune a hybrid CTC-ED model
- Fine-tune the base model on 100h subset
```bash
# Usage: speechut/scripts/tune_speechut_asr/finetune_base_edctc.sh <model_path> <data_dir> <cpt_tag> [mount=$PWD] [world_size=8] [update_freq=2]
model_path=path/to/your/pre-trained/model
data_dir=dataset/LibriSpeech/asr
bash speechut/scripts/tune_speechut_asr/finetune_base_edctc.sh $model_path $data_dir 'tag400k'
```
- Fine-tune the large model on 960h subset
```bash
# Usage: speechut/scripts/tune_speechut_asr/finetune960h_large_edctc.sh <model_path> <data_dir> <cpt_tag> [mount=$PWD] [world_size=8] [update_freq=3]
model_path=path/to/your/pre-trained/model
data_dir=dataset/LibriSpeech/asr
bash speechut/scripts/tune_speechut_asr/finetune960h_large_edctc.sh $model_path $data_dir 'tag400k'
```
### Decode
- CTC-ED joint decoding
```bash
# Usage: speechut/scripts/tune_speechut_asr/inference_edctc.sh <model_path> <data_dir> [gen-set=dev_other] [beam_size=10] [ctc_weight=0.2] [--normalize]
model_path=path/to/your/fine-tuned/model
data_dir=dataset/LibriSpeech/asr
# for base model
bash speechut/scripts/tune_speechut_asr/inference_edctc.sh $model_path $data_dir test_clean 10 0.2
# for large model, you should set --normalize at the end
bash speechut/scripts/tune_speechut_asr/inference_edctc.sh $model_path $data_dir test_clean 10 0.2 --normalize
```
> We use the [espnet](https://github.com/espnet/espnet)-style joint decoding algorithm, which currently only supports batch_size=1. If you find it too slow, please check [`inference_nj.sh`](speechut/scripts/tune_speechut_asr/inference_nj.sh) for a multi-thread version.
- CTC-ED joint decoding with LM
```bash
# Usage: speechut/scripts/tune_speechut_asr/inference_edctclm.sh <model_path> <data_dir> [gen-set=dev_other] [beam_size=30] [ctc_weight=0.3] [lm_weight=0.7] [lm_path] [--normalize]
model_path=path/to/your/fine-tuned/model
data_dir=dataset/LibriSpeech/asr
lm_path=path/to/char_lm/model
# for base model
bash speechut/scripts/tune_speechut_asr/inference_edctclm.sh $model_path $data_dir test_clean 30 0.3 0.7 $lm_path
# for large model, you should set --normalize at the end
bash speechut/scripts/tune_speechut_asr/inference_edctclm.sh $model_path $data_dir test_clean 30 0.3 0.7 $lm_path --normalize
```
> We currently only support batch_size=1. If you find it too slow, please check [`inference_lm_nj.sh`](speechut/scripts/tune_speechut_asr/inference_lm_nj.sh) for a multi-thread version.
> The released language model uses a different vocabulary, [`dict.txt`](dataset/LibriSpeech/dict.txt); put it into `$data_dir` so the script can access it.
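For example, assuming you run from the `SpeechUT` directory:
```bash
# Place the LM vocabulary next to your ASR data so the decoding script can find it
cp dataset/LibriSpeech/dict.txt $data_dir/
```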
## ST on MuST-C
### Data preparation
ST models are fine-tuned with [fairseq speech-to-text](https://github.com/facebookresearch/fairseq/tree/main/examples/speech_to_text) task, so just follow the data preparation instructions [here](https://github.com/facebookresearch/fairseq/tree/main/examples/speech_to_text#data-preparation).
To fine-tune our released models, you should use the same sentencepiece models and dictionaries as ours:
- En-De: [sentencepiece_model](dataset/MuSTC/en_de/spm_unigram10000.model), [dict](dataset/MuSTC/en_de/dict.spm.txt)
- En-Es: [sentencepiece_model](dataset/MuSTC/en_es/spm_unigram10000.model), [dict](dataset/MuSTC/en_es/dict.spm.txt)
- En-Fr: [sentencepiece_model](dataset/MuSTC/en_fr/spm_unigram10000.model), [dict](dataset/MuSTC/en_fr/dict.spm.txt)
We provide examples in [`dataset`](dataset/MuSTC).
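For reference, a minimal preparation sketch, assuming fairseq's MuST-C example script and a local MuST-C v1 download (the path is a placeholder). Note that this script builds a new vocabulary; to fine-tune our released models, replace it with the sentencepiece model and dictionary linked above:
```bash
# Generate MuST-C ST manifests and features with fairseq's speech-to-text recipe
python fairseq/examples/speech_to_text/prep_mustc_data.py \
    --data-root /path/to/MUSTC_v1.0 --task st \
    --vocab-type unigram --vocab-size 10000
```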
### Fine-tune an encoder-decoder model
```bash
# Usage: speechut/scripts/tune_speechut_st/finetune_base_mustc_enxx.sh <model_path> <data_dir> <lang> <cpt-tag> [mount=$PWD] [world_size=8] [update_freq=4/6]
model_path=path/to/your/pre-trained/model
lang=de    # set the target language: de, es, or fr
data_dir=dataset/MuSTC/en-${lang}
bash speechut/scripts/tune_speechut_st/finetune_base_mustc_enxx.sh $model_path $data_dir ${lang} tag400k
```
Please check the script [`finetune_base_mustc_enxx.sh`](speechut/scripts/tune_speechut_st/finetune_base_mustc_enxx.sh) for detailed configuration.
### Decode
You may want to average several checkpoints with the best dev accuracy to stabilize performance:
```bash
python fairseq/scripts/average_checkpoints.py --inputs $model_dir/checkpoint.best_acc*.pt --output $model_dir/checkpoint.avgnbest.pt
```
Then decode the model with beam search:
```bash
# Usage: speechut/scripts/tune_speechut_st/inference_st.sh <model_path> <data_dir> <lang> [gen-set=dev] [beam_size=10] [lenpen=1.0]
model_path=path/to/your/fine-tuned/model
lang=de    # set the target language: de, es, or fr
data_dir=dataset/MuSTC/en-${lang}
bash speechut/scripts/tune_speechut_st/inference_st.sh $model_path $data_dir ${lang} tst-COMMON
```
## Pre-train for ASR
### Data preparation
The model is pre-trained with three tasks: speech-to-unit, unit-to-text, and mask-unit-lm.
1. For speech-to-unit task, please follow the steps of data preparation for HuBERT [here](https://github.com/facebookresearch/fairseq/tree/main/examples/hubert#data-preparation).
2. For unit-to-text task, follow the steps below:
   - Generate units from the unpaired text with the [T2U Generator](#T2U-Generator).
   - Pair the generated units with the text and convert them to binary files (see the binarization sketch after this list).
3. For mask-unit-lm task, combine the units generated in steps 1 and 2.
You should use [`dict.ltr.txt`](dataset/LibriSpeech/dict.ltr.txt) when preparing the text data; make sure the dictionary is the same as that used for fine-tuning.
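A minimal binarization sketch for step 2 using `fairseq-preprocess`; the paired file names (`train.km` for units, `train.ltr` for letters) and the unit dictionary are illustrative assumptions, not fixed by the scripts:
```bash
# Binarize the paired unit-text data for the unit-to-text task
fairseq-preprocess \
    --source-lang km --target-lang ltr \
    --trainpref $text_data_dir/train \
    --srcdict $text_data_dir/dict.km.txt \
    --tgtdict dataset/LibriSpeech/dict.ltr.txt \
    --destdir $text_data_dir/bin \
    --workers 8
```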
### Pre-train base model
```bash
# Usage: speechut/scripts/pretrain_speechut/base_speechut_for_asr.sh <data_dir> <text_data_dir> [mount=$PWD] [world_size=32] [update_freq=1]
data_dir=          # path to the speech-to-unit data (HuBERT-style manifests and unit labels)
text_data_dir=     # path to the binarized unit-text data
bash speechut/scripts/pretrain_speechut/base_speechut_for_asr.sh $data_dir $text_data_dir
```
## Pre-train for ST
### Data preparation
The model is pre-trained with three tasks: speech-to-unit, unit-to-text, and mask-unit-lm.
1. For speech-to-unit task, please follow the steps of data preparation for HuBERT [here](https://github.com/facebookresearch/fairseq/tree/main/examples/hubert#data-preparation).
2. For unit-to-text task, we use bilingual text, where the source side (i.e. English) is used to generate units and the target side serves as the output. Follow the steps below:
   - Normalize the source (English) text by removing punctuation and converting capital letters to lowercase (see the normalization sketch after this list).
   - Generate units from the normalized source (English) text with the [T2U Generator](#T2U-Generator).
   - Pair the generated units with the target text and convert them to binary files.
3. For mask-unit-lm task, combine the units generated in steps 1 and 2.
You should use the same sentencepiece models and dictionaries as those used for [fine-tuning](#ST-on-MuST-C).
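A minimal normalization sketch for the first sub-step of step 2, assuming one English sentence per line (file names are placeholders):
```bash
# Lowercase the source-side English text and strip punctuation
tr '[:upper:]' '[:lower:]' < train.en | tr -d '[:punct:]' > train.norm.en
```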
### Pre-train base model
```bash
# Usage: speechut/scripts/pretrain_speechut/base_speechut_for_st.sh <data_dir> <text_data_dir> <lang> [mount=$PWD] [world_size=32] [update_freq=1]
data_dir=          # path to the speech-to-unit data (HuBERT-style manifests and unit labels)
text_data_dir=     # path to the binarized unit-text data
lang=de            # target language: de, es, or fr
bash speechut/scripts/pretrain_speechut/base_speechut_for_st.sh $data_dir $text_data_dir ${lang}
```
## T2U Generator
The original paper trains an encoder-decoder model to generate reduced units from text, which is time-consuming due to the autoregressive generation.
We recently updated the T2U generator to a non-autoregressive model, which generates non-reduced units (these are easily post-processed into reduced units). Please follow the usage provided by [Hidden-unit Tokenizer for Text](https://github.com/microsoft/SpeechT5/tree/main/SpeechLM#hidden-unit-tokenizer-for-text) (it uses the same HuBERT units as this work).
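For reference, a minimal post-processing sketch that collapses consecutive duplicate units (non-reduced to reduced); file names are placeholders:
```bash
# "5 5 5 9 9 5" -> "5 9 5": keep a unit only when it differs from the previous one
awk '{out=""; prev=""; for(i=1;i<=NF;i++){if($i!=prev){out=out (out==""?"":" ") $i; prev=$i}} print out}' \
    units.txt > units.reduced.txt
```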
## License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree.
Portions of the source code are based on [FAIRSEQ](https://github.com/pytorch/fairseq).
[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)
## Reference
If you find our work useful in your research, please cite the following paper:
```bibtex
@article{zhang2022speechut,
  title={SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training},
  author={Zhang, Ziqiang and Zhou, Long and Ao, Junyi and Liu, Shujie and Dai, Lirong and Li, Jinyu and Wei, Furu},
  eprint={2210.03730},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  year={2022}
}
```
### Contact Information
For help or issues using SpeechUT models, please submit a GitHub issue.
For other communications related to SpeechUT, please contact Long Zhou (`[email protected]`). |