SpeechLM

SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data

June 2023: We have corrected the errors in the pre-training data for the SpeechLM-P Base models and updated the results.
April 2023: We discovered errors in the data used for the pre-training experiments, which affect all results for the SpeechLM-P Base models. We are re-running the related experiments and will update the paper with the new results.
(Done) Oct 2022: release the code and models
Oct 2022: release preprint on arXiv

Pre-Trained and Fine-tuned Models

Model	Pre-training Dataset	Fine-tuning Dataset	Download
SpeechLM-P Base	960 hrs LibriSpeech + 40M Text	-	Azure Storage
SpeechLM-P Base	960 hrs LibriSpeech + 40M Text	100 hrs LibriSpeech	Azure Storage
SpeechLM-H Base	960 hrs LibriSpeech + 40M Text	-	Google drive
SpeechLM-H Base	960 hrs LibriSpeech + 40M Text	100 hrs LibriSpeech	Google drive
SpeechLM-P Base	960 hrs LibriSpeech + 40M Text	En-De CoVoST-2	Azure Storage
SpeechLM-P Base	960 hrs LibriSpeech + 40M Text	En-Ca CoVoST-2	Azure Storage
SpeechLM-P Base	960 hrs LibriSpeech + 40M Text	En-Ar CoVoST-2	Azure Storage
SpeechLM-P Base	960 hrs LibriSpeech + 40M Text	En-Tr CoVoST-2	Azure Storage
SpeechLM-P Large	60k hrs LibriLight + 40M Text	-	Google drive
SpeechLM-P Large	60k hrs LibriLight + 40M Text	960 hrs LibriSpeech	Google drive
SpeechLM-P Large	60k hrs LibriLight + 40M Text	En-De CoVoST-2	Google drive
SpeechLM-P Large	60k hrs LibriLight + 40M Text	En-Ca CoVoST-2	Google drive
SpeechLM-P Large	60k hrs LibriLight + 40M Text	En-Ar CoVoST-2	Google drive
SpeechLM-P Large	60k hrs LibriLight + 40M Text	En-Tr CoVoST-2	Google drive

Extract features using pre-trained models

To make our pre-trained models easier to use, we merged all inference-related code into SpeechLM.py and made cleaned checkpoints (~~SpeechLM-P Base~~, SpeechLM-H Base, SpeechLM-P Large) by removing modules that are not required. You can now use the following script directly to extract your speech features:

import torch
import torch.nn.functional as F
from SpeechLM import SpeechLMConfig, SpeechLM

checkpoint = torch.load('path/to/the/cleaned/checkpoint.pt')
cfg = SpeechLMConfig(checkpoint['cfg']['model'])
model = SpeechLM(cfg)
model.load_state_dict(checkpoint['model'])
model.eval()

wav_input_16khz = torch.randn(1,10000)
normalize = checkpoint['cfg']['task']['normalize']  # False for base model, True for large model
if normalize:
    wav_input_16khz = F.layer_norm(wav_input_16khz[0], wav_input_16khz[0].shape).unsqueeze(0)

# extract the representation of last layer
rep = model.extract_features(wav_input_16khz)[0]

# extract the representation of each layer
output_layer = model.cfg.encoder_layers + model.cfg.text_transformer.encoder.layers
rep, layer_results = model.extract_features(wav_input_16khz, output_layer=output_layer, ret_layer_results=True)[0]
layer_reps = [x.transpose(0, 1) for x in layer_results]
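
To sanity-check the time dimension of `rep`, it can help to estimate how many frames the convolutional front end produces for a given input length. The sketch below assumes the wav2vec 2.0 / HuBERT default feature extractor (seven unpadded 1-D convolutions with kernels 10,3,3,3,3,2,2 and strides 5,2,2,2,2,2,2). That layer spec is an assumption and is not read from a SpeechLM checkpoint, so treat the model config stored in the checkpoint as the authority if the numbers disagree.

```python
# Assumed (kernel, stride) pairs: the wav2vec 2.0 / HuBERT default front end.
# This is NOT read from a SpeechLM checkpoint; adjust it to match your config.
ASSUMED_CONV_LAYERS = [(10, 5)] + [(3, 2)] * 4 + [(2, 2)] * 2


def num_output_frames(num_samples, conv_layers=ASSUMED_CONV_LAYERS):
    """Sequence length after a stack of unpadded 1-D convolutions."""
    length = num_samples
    for kernel, stride in conv_layers:
        length = (length - kernel) // stride + 1
    return length


# The dummy input above has 10000 samples (0.625 s at 16 kHz).
print(num_output_frames(10000))  # 31 under the assumed layer spec
print(num_output_frames(16000))  # 49, i.e. roughly 50 frames per second
```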

Setup

To fine-tune or pre-train more models, please follow the instructions below.

git submodule update --init SpeechLM/fairseq
cd SpeechLM/
pip install --editable fairseq/
pip install sacrebleu==1.5.1

ASR on LibriSpeech

Data preparation

Please follow the steps of wav2vec 2.0 manifest here to prepare train.tsv and train.ltr. You should make sure the vocabulary dict.ltr.txt is the same as that used for the pre-trained model.

Put your prepared data into $data_dir. We provide examples in dataset/LibriSpeech/asr.
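
For reference, the `.ltr` labels in the wav2vec 2.0 recipe are space-separated letters, with `|` closing every word. A minimal sketch of that conversion follows. The function name `transcript_to_ltr` is hypothetical, and the uppercasing step is an assumption that should be checked against your `dict.ltr.txt`.

```python
def transcript_to_ltr(transcript):
    """Turn a word-level transcript into a letter-level label line.

    Letters are separated by spaces and every word is followed by "|".
    Uppercasing is an assumption (LibriSpeech transcripts are uppercase).
    """
    words = transcript.strip().upper().split()
    return " ".join(" ".join(word) + " |" for word in words)


print(transcript_to_ltr("hello world"))  # H E L L O | W O R L D |
```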

Fine-tune a CTC model

Fine-tune the base model

# Usage: speechlm/scripts/tune_speechlm_asr/finetune_base_ctc.sh <model_path> <data_dir> <cpt_tag> [mount=$PWD] [world_size=8] [update_freq=1]
model_path=path/to/your/pre-trained/model
data_dir=dataset/LibriSpeech/asr
bash speechlm/scripts/tune_speechlm_asr/finetune_base_ctc.sh $model_path $data_dir 'tag400k'
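
The optional `world_size` and `update_freq` arguments together set the effective batch size, since fairseq accumulates gradients over `update_freq` steps on each GPU. If you train on fewer GPUs than the default, one common approach is to raise `update_freq` so that the product stays the same. The helper below is a hypothetical sketch of that arithmetic and is not part of the SpeechLM scripts.

```python
def matching_update_freq(default_world_size, default_update_freq, available_gpus):
    """Return the update_freq that keeps world_size * update_freq constant."""
    target = default_world_size * default_update_freq
    if available_gpus <= 0 or target % available_gpus != 0:
        raise ValueError(
            f"{available_gpus} GPUs cannot reproduce an effective factor of {target}"
        )
    return target // available_gpus


# Base CTC fine-tuning defaults to world_size=8, update_freq=1.
print(matching_update_freq(8, 1, 2))  # 4
```

With 2 GPUs, this suggests passing `$PWD 2 4` as the three optional arguments after the checkpoint tag.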

Fine-tune the large model

# Usage: speechlm/scripts/tune_speechlm_asr/finetune_large_ctc.sh <model_path> <data_dir> <cpt_tag> [mount=$PWD] [world_size=8] [update_freq=4]
model_path=path/to/your/pre-trained/model
data_dir=dataset/LibriSpeech/asr
bash speechlm/scripts/tune_speechlm_asr/finetune_large_ctc.sh $model_path $data_dir 'tag400k'

Decode

Directly decode a CTC model.

# Usage: speechlm/scripts/tune_speechlm_asr/inference_ctc.sh <model_path> <data_dir> [gen-set=dev_clean,dev_other,test_clean,test_other]
model_path=path/to/your/fine-tuned/model
data_dir=dataset/LibriSpeech/asr
bash speechlm/scripts/tune_speechlm_asr/inference_ctc.sh $model_path $data_dir
# for large models
# bash speechlm/scripts/tune_speechlm_asr/inference_ctc_large.sh $model_path $data_dir

Decode with 4-gram language model using flashlight and kenlm.

Please put 4-gram.arpa and the word-to-letter lexicon librispeech_lexicon.lst into $data_dir.

# Usage: speechlm/scripts/tune_speechlm_asr/inference_ctc_kenlm.sh <model_path> <data_dir> [gen-set=dev_clean,dev_other,test_clean,test_other]
model_path=path/to/your/fine-tuned/model
data_dir=dataset/LibriSpeech/asr
bash speechlm/scripts/tune_speechlm_asr/inference_ctc_kenlm.sh $model_path $data_dir

Decode large models with fairseq-lm using flashlight.

Please put lm_librispeech_word_transformer.pt and its vocabulary dict.txt into $data_dir/fairseq_word_lm, and the word-to-letter lexicon librispeech_lexicon.lst into $data_dir. Capitalize the dict.txt to make it compatible with the word-to-letter lexicon.
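
Capitalizing the vocabulary takes only a few lines of Python. The sketch below assumes the usual fairseq dictionary layout of one `<token> <count>` pair per line. The function names are illustrative, and the counts in the example are toy values. Tokens that differ only in case would collide after this step, so it is worth checking the output for duplicates.

```python
from pathlib import Path


def capitalize_dict_lines(lines):
    """Uppercase the token column of fairseq-style dictionary lines."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        token, _, rest = line.partition(" ")
        out.append(f"{token.upper()} {rest}" if rest else token.upper())
    return out


def capitalize_dict_file(src, dst):
    """Read a dictionary from src and write the capitalized copy to dst."""
    lines = Path(src).read_text(encoding="utf-8").splitlines()
    Path(dst).write_text("\n".join(capitalize_dict_lines(lines)) + "\n", encoding="utf-8")


print(capitalize_dict_lines(["the 100", "and 90"]))  # ['THE 100', 'AND 90']
```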

# Usage: speechlm/scripts/tune_speechlm_asr/inference_ctc_large_fsqlm.sh <model_path> <data_dir> [gen-set=dev_clean,dev_other,test_clean,test_other]
model_path=path/to/your/fine-tuned/model
data_dir=dataset/LibriSpeech/asr
bash speechlm/scripts/tune_speechlm_asr/inference_ctc_large_fsqlm.sh $model_path $data_dir dev_other

ST on CoVoST-2

Data Preparation

Download Common Voice audio clips (version 4) for English into $cv_root/en.
Get the data manifest. The following script converts the mp3 files to waveforms, creates a tsv file containing speech/translation pairs, and creates the data config files.
```
lang=de # ca,ar,tr
cv_root=dataset/CommonVoice/v4
bash speechlm/data_process/prepare_covost2_enxx.sh $lang $cv_root
```
We provided examples in dataset/CommonVoice/v4/en/en-de.

Fine-tune an encoder-decoder model

Fine-tune the Base model (fine-tuned models will be stored in $mount/exp/finetune_covost).

model_path=path/to/your/pre-trained/model
lang=de # ca,ar,tr
data_dir=dataset/CommonVoice/v4/en/en-${lang}
# Usage (Base model): speechlm/scripts/tune_speechlm_st/ft_base_covost_enxx.sh <model_path> <data_dir> <lang> <cpt-tag> [mount=$PWD] [world_size=8] [update_freq=2]
bash speechlm/scripts/tune_speechlm_st/ft_base_covost_enxx.sh $model_path $data_dir $lang 'tag400k'

Fine-tune the Large model (fine-tuned models will be stored in $mount/exp/finetune_covost).

# Usage (Large model): speechlm/scripts/tune_speechlm_st/ft_large_covost_enxx.sh <model_path> <data_dir> <lang> <cpt-tag> [mount=$PWD] [world_size=8] [update_freq=4]
bash speechlm/scripts/tune_speechlm_st/ft_large_covost_enxx.sh $model_path $data_dir $lang 'tag400k'

Decode

Decode the base model

# Usage: speechlm/scripts/tune_speechlm_st/inference_base.sh <model_path> <data_dir> <lang> [gen-set=dev] [beam_size=5]
model_path=path/to/your/fine-tuned/model
lang=de # ca,ar,tr
data_dir=dataset/CommonVoice/v4/en/en-${lang}
bash speechlm/scripts/tune_speechlm_st/inference_base.sh $model_path $data_dir $lang dev

Decode the large model

# Usage: speechlm/scripts/tune_speechlm_st/inference_large.sh <model_path> <data_dir> <lang> [gen-set=dev] [beam_size=5]
bash speechlm/scripts/tune_speechlm_st/inference_large.sh $model_path $data_dir $lang dev

Universal Representation Evaluation on SUPERB

Please refer to SUPERB for the downstream tasks.

Pre-train

Please follow the instructions in the Tokenizers section to prepare the pre-training data. We provide examples in dataset.

SpeechLM-P Base model

Models will be stored in $mount/pretrain.

data_dir=dataset/LibriSpeech/phone_unit   # should contain train_960.{tsv,phn}
text_data_dir=dataset/LibriLM/phone_unit/bin-idx     # should contain train_text.phn-ltr.{phn,ltr}.{bin,idx}
# Usage: speechlm/scripts/pretrain_speechlm/base_speechlmp.sh <data_dir> <text_data_dir> [mount=$PWD] [world_size=32] [update_freq=1]
bash speechlm/scripts/pretrain_speechlm/base_speechlmp.sh $data_dir $text_data_dir

SpeechLM-H Base model

data_dir=dataset/LibriSpeech/hidden_unit  # should contain train_960.{tsv,km}
text_data_dir=dataset/LibriLM/km-ltr/bin-idx     # should contain train_text.km-ltr.{km,ltr}.{bin,idx}
# Usage: speechlm/scripts/pretrain_speechlm/base_speechlmh.sh <data_dir> <text_data_dir> [mount=$PWD] [world_size=32] [update_freq=1]
bash speechlm/scripts/pretrain_speechlm/base_speechlmh.sh $data_dir $text_data_dir

SpeechLM-P Large model

data_dir=dataset/LibriSpeech/phone_unit   # should contain train_960.{tsv,phn}
text_data_dir=dataset/LibriLM/phone_unit/bin-idx     # should contain train_text.phn-ltr.{phn,ltr}.{bin,idx}
# Usage: speechlm/scripts/pretrain_speechlm/large_speechlmp.sh <data_dir> <text_data_dir> [mount=$PWD] [world_size=32] [update_freq=1]
bash speechlm/scripts/pretrain_speechlm/large_speechlmp.sh $data_dir $text_data_dir

Tokenizers

Phoneme-unit Tokenizer for Speech

This tokenizer, which is a hybrid HMM ASR model, is used to produce the frame-aligned phonemes for unlabeled speech.

In the Base setting, we use 100 hours of labeled LibriSpeech data to train the HMM model with a Kaldi recipe, then decode the unpaired speech and take the aligned phonemes from the lattice. We provide the processed phonemes of the 960-hour speech here: train_960.tsv, train_960.phn, dev_clean.tsv, dev_clean.phn. Note that the label rate is 100 Hz (one label per 10 ms).
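
A label rate of 100 means one phoneme label per 10 ms, so a 16 kHz utterance should have roughly one label per 160 samples. The sketch below is a hypothetical consistency check between a sample count from the `.tsv` file and a line from the `.phn` file. It assumes space-separated labels with one utterance per line, and the tolerance is an assumption, since the exact frame count depends on the Kaldi feature settings.

```python
def expected_num_labels(num_samples, sample_rate=16000, label_rate=100):
    """Approximate label count for an utterance at the given label rate."""
    return num_samples * label_rate // sample_rate


def labels_match(num_samples, phn_line, tolerance=3):
    """Check that a space-separated label line has about the expected length."""
    num_labels = len(phn_line.split())
    return abs(num_labels - expected_num_labels(num_samples)) <= tolerance


print(expected_num_labels(32000))  # 200 labels for a 2-second utterance
```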

The phoneme inventory is 300+ word-position-dependent phones including silence phones.

Phoneme-unit Tokenizer for Text

This tokenizer is used to phonemize the unpaired text data to (phonemes, letters) paired data, following a words -> phonemes -> upsampled phones pipeline.
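
The pipeline can be illustrated with a toy example. Everything in the sketch below is hypothetical: the two-word lexicon, the plain phone symbols (the real inventory is word-position-dependent), and the fixed repeat factor all stand in for the actual lexicon and upsampling strategy used by the data preparation script, which are not reproduced here.

```python
# Toy lexicon for illustration only; the real one covers the LibriSpeech LM vocabulary.
TOY_LEXICON = {
    "HELLO": ["HH", "AH", "L", "OW"],
    "WORLD": ["W", "ER", "L", "D"],
}


def words_to_phonemes(text, lexicon=TOY_LEXICON):
    """Look up every word in the lexicon and concatenate the phone sequences."""
    phones = []
    for word in text.upper().split():
        phones.extend(lexicon[word])  # raises KeyError for out-of-lexicon words
    return phones


def upsample(phones, repeat=2):
    """Repeat each phone a fixed number of times (an assumed, simplified scheme)."""
    return [p for p in phones for _ in range(repeat)]


phones = words_to_phonemes("hello world")
print(phones)
print(upsample(phones))
```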

The following script will download LibriSpeech LM corpus and produce the required data: train_text.phn-ltr.phn.{idx,bin} and train_text.phn-ltr.ltr.{idx,bin}.

Before running it, make sure you have our provided dict.phn.txt and dict.ltr.txt in the output dir dataset/LibriLM/phone_unit/bin-idx/.

The phoneme inventory is 300+ word-position-dependent phones including silence phones.

# data will be in dataset/LibriLM/phone_unit/
bash speechlm/data_process/prepare_phn2ltr_librilm.sh

Hidden-unit Tokenizer for Speech

Please follow the steps of data preparation for HuBERT here to prepare 1) wav recordings train.tsv and 2) corresponding hidden-units train.km, and 3) unit vocabulary dict.km.txt.
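
In the HuBERT recipe, the unit vocabulary `dict.km.txt` is a dummy dictionary that lists each cluster index with a placeholder count of 1. A sketch of generating it follows. The function names are illustrative, and `n_clusters` must match the k-means model you actually use.

```python
def km_dict_lines(n_clusters):
    """One '<unit> 1' line per k-means cluster, as in the HuBERT recipe."""
    return [f"{unit} 1" for unit in range(n_clusters)]


def write_km_dict(path, n_clusters):
    """Write the dummy unit dictionary to the given path."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(km_dict_lines(n_clusters)) + "\n")


print(km_dict_lines(3))  # ['0 1', '1 1', '2 1']
```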

Hidden-unit Tokenizer for Text

This tokenizer is used to produce speech-style hidden units from unpaired text. We train a FastSpeech-like model on a small amount of ASR data (100 hrs LibriSpeech) as the tokenizer. Instead of generating a continuous spectrum as in the original paper, the model here generates discrete units.

Train:

Convert ASR transcripts to phoneme sequences with duration information.
Extract hidden-units from speech, using the Hidden-unit Tokenizer for Speech.

Train the model on the paired data:

data_dir=dataset/LibriSpeech/fast_phone2unit
bash speechlm/scripts/tokenizer_fastT2U/train_s_5e-4.sh $data_dir

The phoneme inventory is 41 mono phones including silence phones.
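
For the first training step, one way to obtain phonemes with durations is to collapse a frame-level alignment into runs. The sketch below assumes you already have one phone label per frame (for example from a forced alignment). It is an illustration and not the script used to build the provided examples. Note that two adjacent instances of the same phone would be merged into one run.

```python
from itertools import groupby


def frames_to_durations(frame_labels):
    """Collapse consecutive identical frame labels into (phone, n_frames) pairs."""
    return [(phone, sum(1 for _ in run)) for phone, run in groupby(frame_labels)]


frames = ["SIL", "SIL", "HH", "AH", "AH", "AH", "L", "OW", "OW", "SIL"]
print(frames_to_durations(frames))
# [('SIL', 2), ('HH', 1), ('AH', 3), ('L', 1), ('OW', 2), ('SIL', 1)]
```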

Inference:

Convert the text data to phoneme sequences using the lexicon.

Generate hidden units for a large text corpus:

gen_set=dataset/LibriSpeech/fast_phone2unit/genset_examples
bash speechlm/scripts/tokenizer_fastT2U/generate.sh $model_path $gen_set

We provided train/generate data examples in dataset/LibriSpeech/fast_phone2unit, and the model checkpoint here.

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on FAIRSEQ.

Microsoft Open Source Code of Conduct

Reference

If you find our work useful in your research, please cite the following paper:

@article{zhang2022speechlm,
  title   = {SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data},
  author  = {Zhang, Ziqiang and Chen, Sanyuan and Zhou, Long and Wu, Yu and Ren, Shuo and Liu, Shujie and Yao, Zhuoyuan and Gong, Xun and Dai, Lirong and Li, Jinyu and Wei, Furu},
  eprint={2209.15329},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  year={2022}
}

Contact Information

For help or issues using SpeechLM models, please submit a GitHub issue.

For other communications related to SpeechLM, please contact Long Zhou ([email protected]).