SpeechUT
- (Done) Oct 2022: release the code and models
- Oct 2022: release the preprint on arXiv
Pre-trained and Fine-tuned Models
Model | Pre-training Dataset (unlabeled) | Fine-tuning Dataset (labeled) | Download |
---|---|---|---|
SpeechUT Base (ASR) | 960 hrs LibriSpeech + 40M Text | - | Azure Storage |
SpeechUT Base (ASR) | 960 hrs LibriSpeech + 40M Text | 100 hrs LibriSpeech | Azure Storage |
SpeechUT Large (ASR) | 60k hrs LibriSpeech + 40M Text | - | Azure Storage |
SpeechUT Large (ASR) | 60k hrs LibriSpeech + 40M Text | 960 hrs LibriSpeech | Azure Storage |
SpeechUT Base (En-De) | 960 hrs LibriSpeech + 408 hrs MuST-C v1 + 4.6M Text | - | Azure Storage |
SpeechUT Base (En-De) | 960 hrs LibriSpeech + 408 hrs MuST-C v1 + 4.6M Text | En-De MuST-C v1 | Azure Storage |
SpeechUT Base (En-Es) | 960 hrs LibriSpeech + 504 hrs MuST-C v1 + 15M Text | - | Azure Storage |
SpeechUT Base (En-Es) | 960 hrs LibriSpeech + 504 hrs MuST-C v1 + 15M Text | En-Es MuST-C v1 | Azure Storage |
SpeechUT Base (En-Fr) | 960 hrs LibriSpeech + 492 hrs MuST-C v1 + 40M Text | - | Azure Storage |
SpeechUT Base (En-Fr) | 960 hrs LibriSpeech + 492 hrs MuST-C v1 + 40M Text | En-Fr MuST-C v1 | Azure Storage |
Language Model
See here.
Setup
git submodule update --init SpeechUT/fairseq
cd SpeechUT/
pip install --editable fairseq/
pip install sacrebleu==1.5.1
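To confirm the editable install worked, a quick optional sanity check (not part of the official setup):
# optional: verify that the editable fairseq install is importable
python -c "import fairseq; print(fairseq.__version__)"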
ASR on LibriSpeech
Data preparation
Please follow the steps of the wav2vec 2.0 manifest here to prepare train.tsv and train.ltr. Make sure the vocabulary dict.ltr.txt is the same as that used for the pre-trained model. Put your prepared data into $data_dir.
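For reference, a minimal sketch of building $data_dir with the standard fairseq wav2vec 2.0 tooling is shown below; the paths, split names, and output location are illustrative assumptions, not part of this release.
# create train.tsv from an audio directory (illustrative paths)
python fairseq/examples/wav2vec/wav2vec_manifest.py /path/to/LibriSpeech/train-clean-100 --dest $data_dir --ext flac --valid-percent 0
# create train.ltr (and train.wrd) letter transcripts from the manifest
python fairseq/examples/wav2vec/libri_labels.py $data_dir/train.tsv --output-dir $data_dir --output-name train
# $data_dir should end up containing train.tsv, train.ltr, the dev/test manifests, and dict.ltr.txt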
Fine-tune a hybrid CTC-ED model
Fine-tune the base model on 100h subset
# Usage: speechut/scripts/tune_speechut_asr/finetune_base_edctc.sh <model_path> <data_dir> <cpt_tag> [mount=$PWD] [world_size=8] [update_freq=2]
model_path=path/to/your/pre-trained/model
data_dir=dataset/LibriSpeech/asr
bash speechut/scripts/tune_speechut_asr/finetune_base_edctc.sh $model_path $data_dir 'tag400k'
Fine-tune the large model on 960h subset
# Usage: speechut/scripts/tune_speechut_asr/finetune960h_large_edctc.sh <model_path> <data_dir> <cpt_tag> [mount=$PWD] [world_size=8] [update_freq=3]
model_path=path/to/your/pre-trained/model
data_dir=dataset/LibriSpeech/asr
bash speechut/scripts/tune_speechut_asr/finetune960h_large_edctc.sh $model_path $data_dir 'tag400k'
Decode
CTC-ED joint decoding
# Usage: speechut/scripts/tune_speechut_asr/inference_edctc.sh <model_path> <data_dir> [gen-set=dev_other] [beam_size=10] [ctc_weight=0.2] [--normalize]
model_path=path/to/your/fine-tuned/model
data_dir=dataset/LibriSpeech/asr
# for the base model
bash speechut/scripts/tune_speechut_asr/inference_edctc.sh $model_path $data_dir test_clean 10 0.2
# for the large model, set --normalize at the end
bash speechut/scripts/tune_speechut_asr/inference_edctc.sh $model_path $data_dir test_clean 10 0.2 --normalize
We use the espnet-style joint decoding algorithm, which currently only supports batch_size=1. If you find it too slow, please check inference_nj.sh for a multi-threaded version.
CTC-ED joint decoding with LM
# Usage: speechut/scripts/tune_speechut_asr/inference_edctclm.sh <model_path> <data_dir> [gen-set=dev_other] [beam_size=30] [ctc_weight=0.3] [lm_weight=0.7] [lm_path] [--normalize]
model_path=path/to/your/fine-tuned/model
data_dir=dataset/LibriSpeech/asr
lm_path=path/to/char_lm/model
# for the base model
bash speechut/scripts/tune_speechut_asr/inference_edctclm.sh $model_path $data_dir test_clean 30 0.3 0.7 $lm_path
# for the large model, set --normalize at the end
bash speechut/scripts/tune_speechut_asr/inference_edctclm.sh $model_path $data_dir test_clean 30 0.3 0.7 $lm_path --normalize
We currently only support batch_size=1. If you find it too slow, please check inference_lm_nj.sh for a multi-threaded version. The released language model uses a different vocabulary, dict.txt; put it into $data_dir and the script will access it.
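For example (the source path of the released dict.txt is an assumption; only the destination matters):
# place the char LM's vocabulary where the inference script expects it
cp /path/to/char_lm/dict.txt $data_dir/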
ST on MuST-C
Data preparation
ST models are fine-tuned with the fairseq speech-to-text task, so just follow the data preparation instructions here. To fine-tune our released models, you should use the same sentencepiece models and dictionaries as ours:
- En-De: sentencepiece_model, dict
- En-Es: sentencepiece_model, dict
- En-Fr: sentencepiece_model, dict
We provide examples in dataset.
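As a sketch, the standard fairseq MuST-C preparation looks roughly like the following; note that to fine-tune the released checkpoints you should swap in the sentencepiece model and dictionary linked above instead of the newly generated ones (paths are illustrative assumptions):
# prepare MuST-C for the fairseq speech-to-text task (illustrative paths)
python fairseq/examples/speech_to_text/prep_mustc_data.py --data-root /path/to/MUSTC_ROOT --task st --vocab-type unigram --vocab-size 10000
# then replace the generated sentencepiece model / dict with the released ones and update the yaml config accordingly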
Fine-tune an encoder-decoder model
# Usage: speechut/scripts/tune_speechut_st/finetune_base_mustc_enxx.sh <model_path> <data_dir> <lang> <cpt-tag> [mount=$PWD] [world_size=8] [update_freq=4/6]
model_path=path/to/your/pre-trained/model
data_dir=dataset/MuSTC/en-${lang}
bash speechut/scripts/tune_speechut_st/finetune_base_mustc_enxx.sh $model_path $data_dir ${lang} tag400k
Please check the script finetune_base_mustc_enxx.sh for detailed configuration.
Decode
You may average several model checkpoints with the best dev accuracy to stabilize the performance:
python fairseq/scripts/average_checkpoints.py --inputs $model_dir/checkpoint.best_acc*.pt --output $model_dir/checkpoint.avgnbest.pt
Then decode the model with beam search,
# Usage: speechut/scripts/tune_speechut_st/inference_st.sh <model_path> <data_dir> <lang> [gen-set=dev] [beam_size=10] [lenpen=1.0]
model_path=path/to/your/fine-tuned/model
data_dir=dataset/MuSTC/en-${lang}
bash speechut/scripts/tune_speechut_st/inference_st.sh $model_path $data_dir ${lang} tst-COMMON
Pre-train for ASR
Data preparation
The model is pre-trained on speech-to-unit, unit-to-text, and mask-unit-lm tasks.
- For the speech-to-unit task, please follow the data preparation steps for HuBERT here.
- For the unit-to-text task, follow the steps below:
  - Generate units from unpaired text with the T2U Generator.
  - Pair the generated units and text data, and convert them to binary files (a binarization sketch is given below).
- For the mask-unit-lm task, combine the units generated from step 1 and step 2.
You should use dict.ltr.txt when preparing the text data; make sure the dictionary is the same as that used for fine-tuning.
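For the binarization step, a minimal sketch with fairseq-preprocess follows; the file prefixes, output directory, and the unit-side dictionary name are assumptions for illustration only.
# binarize paired unit (source) / letter (target) data with the shared dict.ltr.txt (illustrative names)
fairseq-preprocess --source-lang unit --target-lang ltr --trainpref $text_data_dir/train --validpref $text_data_dir/valid --srcdict $text_data_dir/dict.unit.txt --tgtdict $data_dir/dict.ltr.txt --destdir $text_data_dir/bin --workers 8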
Pre-train base model
# Usage: speechut/scripts/pretrain_speechut/base_speechut_for_asr.sh <data_dir> <text_data_dir> [mount=$PWD] [world_size=32] [update_freq=1]
data_dir=
text_data_dir=
bash speechut/scripts/pretrain_speechut/base_speechut_for_asr.sh $data_dir $text_data_dir
Pre-train for ST
Data preparation
The model is pre-trained on speech-to-unit, unit-to-text, and mask-unit-lm tasks.
- For the speech-to-unit task, please follow the data preparation steps for HuBERT here.
- For the unit-to-text task, we use bilingual text: the source side (i.e. English) is used to generate units and the target side serves as the output. Follow the steps below:
  - Normalize the source (English) text by removing punctuation and converting capital letters to lowercase (a normalization sketch is given after this list).
  - Generate units from the source (English) text with the T2U Generator.
  - Pair the generated units and text data, and convert them to binary files.
- For the mask-unit-lm task, combine the units generated from step 1 and step 2. You should use the same sentencepiece models and dictionaries as those used for fine-tuning.
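A minimal sketch of the source-text normalization step; the exact normalization used for the released models is not specified here, so this only illustrates punctuation removal and lowercasing (file names are assumptions).
# lowercase and strip punctuation from the English source text (illustrative file names)
tr '[:upper:]' '[:lower:]' < train.en | tr -d '[:punct:]' > train.en.norm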
Pre-train base model
# Usage: speechut/scripts/pretrain_speechut/base_speechut_for_st.sh <data_dir> <text_data_dir> <lang> [mount=$PWD] [world_size=32] [update_freq=1]
data_dir=
text_data_dir=
bash speechut/scripts/pretrain_speechut/base_speechut_for_st.sh $data_dir $text_data_dir ${lang}
T2U Generator
The original paper trains an encoder-decoder model to generate reduced units from text, which is time-consuming due to the autoregressive generation. We recently updated the T2U generator to a non-autoregressive model, which generates non-reduced units (these can easily be post-processed into reduced units). Please follow the usage provided by the Hidden-unit Tokenizer for Text (it uses the same HuBERT units as this work).
License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on FAIRSEQ.
Microsoft Open Source Code of Conduct
Reference
If you find our work useful in your research, please cite the following paper:
@article{zhang2022speechut,
  title={SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training},
  author={Zhang, Ziqiang and Zhou, Long and Ao, Junyi and Liu, Shujie and Dai, Lirong and Li, Jinyu and Wei, Furu},
  eprint={2210.03730},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  year={2022}
}
Contact Information
For help or issues using SpeechUT models, please submit a GitHub issue.
For other communications related to SpeechUT, please contact Long Zhou ([email protected]).