---
library_name: transformers
license: other
license_name: meralion-public-license
license_link: https://huggingface.co/MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION/blob/main/MERaLiON-Public-Licence-v1.pdf
tags:
- speech
- best-rq
- meralion
language:
- en
---
# MERaLiON-SpeechEncoder-v1
The MERaLiON-SpeechEncoder is a speech foundation model designed to support a wide range of downstream speech applications, such as speech recognition, intent classification and speaker identification, among others. This version was trained on **200,000 hours of predominantly English data, including 10,000 hours of Singapore-based speech**, to cater to the speech processing needs of Singapore and beyond. Gradual support for other languages, starting with major Southeast Asian ones, is planned for subsequent releases.
- **Developed by:** I2R, A\*STAR
- **Model type:** Speech Encoder
- **Language(s):** Primarily English (Global & Singapore)
- **License:** [MERaLiON Public License](https://huggingface.co/MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION/blob/main/MERaLiON-Public-Licence-v1.pdf)
For details on background, pre-training, tuning experiments and evaluation, please refer to our [technical report](https://arxiv.org/abs/2412.11538).
## Acknowledgement
This research is supported by the National Research Foundation, Singapore and Infocomm Media Development Authority, Singapore under its National Large Language Models Funding Initiative.
## Model Description
MERaLiON-SpeechEncoder was pre-trained from scratch with a self-supervised learning approach using a **BERT-based speech pre-training with random-projection quantizer (BEST-RQ)** objective. Analogous to BERT's masked language modelling criterion for text, this entails predicting the correct discrete label from a codebook over the masked frames of an input speech signal. MERaLiON-SpeechEncoder-v1 contains approximately 630M parameters.
The model takes speech as input in the form of mel-spectrograms and returns compressed latent features, which can then be passed to a task-specific downstream model relevant to the user's application. Note that the model provided here is the base foundation model; users must fine-tune it with task-specific data to obtain a complete inference pipeline. We provide some examples below to get started.
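For intuition, the BEST-RQ labelling step described above can be sketched as follows. This is an illustrative toy example, not the released training code; the feature dimension, projection dimension and codebook size are placeholders.

```python
# Illustrative sketch of the BEST-RQ labelling step (not the released training code).
# A frozen random projection and codebook turn every speech frame into a discrete
# target; the encoder is trained to predict these targets at masked frame positions.
import torch
import torch.nn.functional as F

feat_dim, proj_dim, codebook_size = 80, 16, 8192   # placeholder dimensions

# randomly initialised and kept frozen throughout pre-training
projection = torch.randn(feat_dim, proj_dim)
codebook = F.normalize(torch.randn(codebook_size, proj_dim), dim=-1)

def bestrq_targets(features):
    # features: (batch, time, feat_dim) mel-spectrogram frames
    projected = F.normalize(features @ projection, dim=-1)     # (B, T, proj_dim)
    flat = projected.reshape(-1, proj_dim)                     # (B*T, proj_dim)
    distances = torch.cdist(flat, codebook)                    # (B*T, codebook_size)
    labels = distances.argmin(dim=-1)                          # nearest codebook entry
    return labels.reshape(features.shape[0], features.shape[1])

# during pre-training, spans of frames are masked and the encoder output at those
# positions is scored against these labels with a cross-entropy loss
labels = bestrq_targets(torch.randn(2, 100, feat_dim))
```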
## Capabilities
We have evaluated the MERaLiON-SpeechEncoder extensively on several speech recognition datasets, and fine-tuned the model on ten different tasks from the [SUPERB](https://superbbenchmark.org/) benchmark: `automatic speech recognition` (ASR), `automatic phoneme recognition` (PR), `keyword spotting` (KS), `query by example spoken term detection` (QbE), `intent classification` (IC), `slot filling` (SF), `speaker identification` (SID), `automatic speaker verification` (ASV), `speaker diarization` (SD), and `emotion recognition` (ER). Our evaluation demonstrates improvements on spontaneous and Singapore speech benchmarks for speech recognition, while remaining competitive with other state-of-the-art speech encoders such as WavLM and HuBERT across SUPERB tasks.
This version of the MERaLiON-SpeechEncoder is specifically tailored for English, both global and Singapore-specific, including Singlish. Although the encoder was trained on a portion of multilingual data, its multilingual capability has not been substantially evaluated.
We provide a code snippet below for retrieving latent features directly from the model, followed by an example of how to set up the model for ASR fine-tuning. Speech input should be sampled at 16 kHz.
### Get Features
```python
import torch
from datasets import load_dataset
from transformers import AutoModel, AutoFeatureExtractor
repo_id = 'MERaLiON/MERaLiON-SpeechEncoder-v1'
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# load model and feature extractor
model = AutoModel.from_pretrained(
repo_id,
trust_remote_code=True,
)
model = model.to(device)
feature_extractor = AutoFeatureExtractor.from_pretrained(
repo_id,
trust_remote_code=True
)
# prepare data
data = load_dataset("distil-whisper/librispeech_long", "clean",
split="validation")
def batch_collater(data):
    # collect the raw audio arrays from the dataset
    tensors = []
    for sample in data:
        tensors.append(sample['audio']['array'])
    return tensors
audio_array = batch_collater(data)
inputs = feature_extractor(audio_array, sampling_rate=16_000,
return_attention_mask=True,
return_tensors='pt', do_normalize=False)
input_values = inputs['input_values']
input_lengths = torch.sum(inputs['attention_mask'], dim=-1)
input_values, input_lengths = input_values.to(device), input_lengths.to(device)
# model inference to obtain features
model.eval()
with torch.no_grad():
    output = model(input_values=input_values,
                   input_lengths=input_lengths,
                   output_hidden_states=True)
```
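The returned object contains the compressed latent features. Assuming the output follows the usual Hugging Face encoder convention, with a `last_hidden_state` field and, since `output_hidden_states=True` was passed, a `hidden_states` tuple of per-layer features, they can be accessed as follows.

```python
# continuing from the snippet above; assumes the standard Hugging Face encoder
# output convention with 'last_hidden_state' and 'hidden_states'
final_features = output.last_hidden_state      # (batch, frames, hidden_dim)
per_layer_features = output.hidden_states      # tuple of (batch, frames, hidden_dim) tensors
print(final_features.shape, len(per_layer_features))
```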
### Downstream Use
```python
import torch
import json
from datasets import load_dataset
from transformers import AutoModelForCTC, AutoFeatureExtractor, Wav2Vec2CTCTokenizer
repo_id = 'MERaLiON/MERaLiON-SpeechEncoder-v1'
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# prepare data
def pre_processing(batch):
batch["text"] = batch["text"].lower()
return batch
def extract_all_chars(batch):
all_text = " ".join(batch["text"])
vocab = list(set(all_text))
return {"vocab": [vocab], "all_text": [all_text]}
librispeech100h_train = load_dataset("openslr/librispeech_asr", split="train.clean.100")
librispeech100h_test = load_dataset("openslr/librispeech_asr", split="validation.clean")
librispeech100h_train = librispeech100h_train.remove_columns(
['file', 'speaker_id', 'chapter_id', 'id'])
librispeech100h_test = librispeech100h_test.remove_columns(
['file', 'speaker_id', 'chapter_id', 'id'])
librispeech100h_train = librispeech100h_train.map(pre_processing)
librispeech100h_test = librispeech100h_test.map(pre_processing)
vocab_train = librispeech100h_train.map(extract_all_chars, batched=True,
batch_size=-1, keep_in_memory=True,
remove_columns=librispeech100h_train.column_names)
vocab_test = librispeech100h_test.map(extract_all_chars, batched=True,
batch_size=-1, keep_in_memory=True,
remove_columns=librispeech100h_test.column_names)
# build the character vocabulary from the train and validation transcripts
vocab_list = list(set(vocab_train["vocab"][0]) | set(vocab_test["vocab"][0]))
vocab_dict = {v: k for k, v in enumerate(sorted(vocab_list))}

# use '|' as the word delimiter in place of the space character
vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)

with open('ls_vocab.json', 'w') as vocab_file:
    json.dump(vocab_dict, vocab_file)
# load model, feature extractor and tokenizer
feature_extractor = AutoFeatureExtractor.from_pretrained(
    repo_id,
    trust_remote_code=True,
)
tokenizer = Wav2Vec2CTCTokenizer("./ls_vocab.json",
unk_token="[UNK]", pad_token="[PAD]",
word_delimiter_token="|")
model = AutoModelForCTC.from_pretrained(
repo_id,
trust_remote_code=True,
vocab_size=len(vocab_dict),
feat_proj_dropout=0.1,
activation_dropout=0.1,
hidden_dropout=0.1,
conformer_conv_dropout=0.1,
ctc_loss_reduction="mean",
pad_token_id=tokenizer.pad_token_id,
attention_dropout=0.1,
)
model = model.to(device)
```
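For training with the Hugging Face `Trainer`, audio inputs and character labels are usually padded separately in a CTC data collator. The sketch below continues from the snippet above and follows the standard wav2vec2-style recipe; it assumes each dataset sample has already been mapped to `input_values` (via the feature extractor) and `labels` (via the tokenizer), and that both objects expose the usual `pad` method from `transformers`.

```python
# Illustrative CTC data collator, continuing from the snippet above. It assumes each
# sample already carries 'input_values' (from the feature extractor) and 'labels'
# (from the tokenizer), and that both objects expose the standard 'pad' method.
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class DataCollatorCTCWithPadding:
    feature_extractor: Any
    tokenizer: Any

    def __call__(self, samples: List[Dict[str, Any]]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_values": s["input_values"]} for s in samples]
        label_features = [{"input_ids": s["labels"]} for s in samples]

        # pad audio features and text labels independently
        batch = self.feature_extractor.pad(input_features, padding=True, return_tensors="pt")
        labels_batch = self.tokenizer.pad(label_features, padding=True, return_tensors="pt")

        # mask padded label positions with -100 so they are ignored by the CTC loss
        batch["labels"] = labels_batch["input_ids"].masked_fill(
            labels_batch["attention_mask"].ne(1), -100)
        return batch

data_collator = DataCollatorCTCWithPadding(feature_extractor=feature_extractor, tokenizer=tokenizer)
```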
Please refer to this [blog](https://huggingface.co/blog/fine-tune-w2v2-bert) for a complete ASR fine-tuning recipe with the Hugging Face Trainer. Alternatively, the Hugging Face model can be loaded into other frameworks such as PyTorch or ESPnet for custom fine-tuning loops.
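A minimal custom PyTorch loop under the same assumptions is sketched below. Here `train_dataset` stands for the mapped training split with `input_values` and `labels` columns, and the forward call follows the standard Hugging Face `ForCTC` interface, which returns a loss when `labels` is supplied; the exact keyword arguments of the remote-code model may differ.

```python
# Minimal sketch of a custom fine-tuning loop (illustrative; 'train_dataset' is assumed
# to be the mapped training split, and the forward call follows the standard
# Hugging Face ForCTC interface, which returns a loss when 'labels' is passed).
from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True,
                          collate_fn=data_collator)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for batch in train_loader:
    batch = {k: v.to(device) for k, v in batch.items()}
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```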
## Technical Specifications
### Training Data
MERaLiON-SpeechEncoder has been trained on a diverse set of unsupervised speech data, primarily in English. Our collection is curated from various publicly available datasets and covers a wide range of conditions, encompassing factors such as domain, style, speaker, gender, and accent. The combined dataset comprises around 170,000 hours of English, including 10,000 hours of Singapore-based English that incorporates code-switching, plus an additional 30,000 hours of multilingual speech across 113 languages, for a total of 200,000 hours. Consult our technical report for the full breakdown.
### Training Procedure and Compute
MERaLiON-SpeechEncoder was trained in two phases: initially on a 60,000-hour subset of the data, followed by continued pre-training on the full 200,000-hour dataset using the earlier checkpoint as initialisation. The initial model was trained on the **ASPIRE 2A** Supercomputer Cluster provided by the **National Supercomputing Centre (NSCC)** for 325K steps on 12 Nvidia A100 40GB GPUs. The full pre-training run was carried out on the **LUMI** Supercomputer Cluster with 128 AMD MI250x GPUs for a further 382K steps, taking about 25 days of active GPU time.
## Citation
If you find our work useful, please cite our technical report:
```
@misc{huzaifah2024speechfoundationmodelsingapore,
title={MERaLiON-SpeechEncoder: Towards a Speech Foundation Model for Singapore and Beyond},
author={{MERaLiON Team}},
year={2024},
eprint={2412.11538},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2412.11538},
}
```