---
library_name: transformers
license: other
license_name: meralion-public-license
license_link: https://huggingface.co/MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION/blob/main/MERaLiON-Public-Licence-v1.pdf
tags:
- speech
- best-rq
- meralion
language:
- en
---

# MERaLiON-SpeechEncoder-v1

The MERaLiON-SpeechEncoder is a speech foundation model designed to support a wide range of downstream speech applications, such as speech recognition, intent classification and speaker identification, among others. This version was trained on **200,000 hours of predominantly English data, including 10,000 hours of Singapore-based speech**, to cater to the speech processing needs of Singapore and beyond. Gradual support for other languages, starting with the major Southeast Asian ones, is planned for subsequent releases.

- **Developed by:** I2R, A\*STAR
- **Model type:** Speech Encoder
- **Language(s):** Primarily English (Global & Singapore)
- **License:** [MERaLiON Public License](https://huggingface.co/MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION/blob/main/MERaLiON-Public-Licence-v1.pdf)

For details on background, pre-training, tuning experiments and evaluation, please refer to our [technical report](https://arxiv.org/abs/2412.11538).

## Acknowledgement

This research is supported by the National Research Foundation, Singapore and Infocomm Media Development Authority, Singapore under its National Large Language Models Funding Initiative.

## Model Description

*(Model architecture figure)*

MERaLiON-SpeechEncoder was pre-trained from scratch with a self-supervised learning approach using a **BERT-based speech pre-training with random-projection quantizer (BEST-RQ)** objective. Analogous to BERT's masked language modelling criterion for text, this entails predicting the correct discrete label from a codebook over the masked frames of an input speech signal. MERaLiON-SpeechEncoder-v1 contains approximately 630M parameters.

The model takes speech input in the form of mel-spectrograms and returns compressed latent features, which can then be passed to a task-specific downstream model relevant to the user's application. Note that the model provided here is the base foundation model itself; the user has to fine-tune it with task-specific data to obtain a complete inference pipeline. We provide some examples below to get one started.

## Capabilities

We have evaluated the MERaLiON-SpeechEncoder extensively on several speech recognition datasets, and fine-tuned the model on ten different tasks encompassing the [SUPERB](https://superbbenchmark.org/) benchmark: `automatic speech recognition` (ASR), `automatic phoneme recognition` (PR), `keyword spotting` (KS), `query by example spoken term detection` (QbE), `intent classification` (IC), `slot filling` (SF), `speaker identification` (SID), `automatic speaker verification` (ASV), `speaker diarization` (SD), and `emotion recognition` (ER). Our evaluation demonstrates improvements on spontaneous and Singapore speech benchmarks for speech recognition, while remaining competitive with other state-of-the-art speech encoders such as WavLM and HuBERT across SUPERB tasks.

This version of the MERaLiON-SpeechEncoder is specifically tailored for English, both global and Singapore-specific, including Singlish. Although the encoder was trained on a portion of multilingual data, this has not been substantially evaluated.

We provide a code snippet below for directly retrieving latent features from the model, followed by an example of how to set up the model for ASR fine-tuning. Speech input should be sampled at 16 kHz.
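If your audio is stored at a different sampling rate, one convenient way to resample on the fly is to cast the `audio` column when loading with the `datasets` library. This is only a sketch for illustration; the LibriSpeech data used in the examples below is already at 16 kHz.

```python
from datasets import load_dataset, Audio

# decode the audio column at 16 kHz, resampling on the fly if necessary
data = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
data = data.cast_column("audio", Audio(sampling_rate=16_000))
```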
### Get Features

```python
import torch
from datasets import load_dataset
from transformers import AutoModel, AutoFeatureExtractor

repo_id = 'MERaLiON/MERaLiON-SpeechEncoder-v1'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# load model and feature extractor
model = AutoModel.from_pretrained(
    repo_id,
    trust_remote_code=True,
)
model = model.to(device)

feature_extractor = AutoFeatureExtractor.from_pretrained(
    repo_id,
    trust_remote_code=True
)

# prepare data
data = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")

def batch_collater(data):
    tensors = []
    for idx, sample in enumerate(data):
        tensors.append(sample['audio']['array'])
    return tensors

audio_array = batch_collater(data)
inputs = feature_extractor(audio_array, sampling_rate=16_000, return_attention_mask=True, return_tensors='pt', do_normalize=False)
input_values = inputs['input_values']
input_lengths = torch.sum(inputs['attention_mask'], dim=-1)
input_values, input_lengths = input_values.to(device), input_lengths.to(device)

# model inference to obtain features
with torch.no_grad():
    model.eval()
    output = model(input_values=input_values, input_lengths=input_lengths, output_hidden_states=True)
```

### Downstream Use

```python
import torch
import json
from datasets import load_dataset
from transformers import AutoModelForCTC, AutoFeatureExtractor, Wav2Vec2CTCTokenizer

repo_id = 'MERaLiON/MERaLiON-SpeechEncoder-v1'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# prepare data
def pre_processing(batch):
    batch["text"] = batch["text"].lower()
    return batch

def extract_all_chars(batch):
    all_text = " ".join(batch["text"])
    vocab = list(set(all_text))
    return {"vocab": [vocab], "all_text": [all_text]}

librispeech100h_train = load_dataset("openslr/librispeech_asr", split="train.clean.100")
librispeech100h_test = load_dataset("openslr/librispeech_asr", split="validation.clean")

librispeech100h_train = librispeech100h_train.remove_columns(
    ['file', 'speaker_id', 'chapter_id', 'id'])
librispeech100h_test = librispeech100h_test.remove_columns(
    ['file', 'speaker_id', 'chapter_id', 'id'])

librispeech100h_train = librispeech100h_train.map(pre_processing)
librispeech100h_test = librispeech100h_test.map(pre_processing)

# build a character-level vocabulary for the CTC tokenizer
vocab_train = librispeech100h_train.map(extract_all_chars, batched=True, batch_size=-1,
                                        keep_in_memory=True,
                                        remove_columns=librispeech100h_train.column_names)
vocab_test = librispeech100h_test.map(extract_all_chars, batched=True, batch_size=-1,
                                      keep_in_memory=True,
                                      remove_columns=librispeech100h_test.column_names)

vocab_list = list(set(vocab_train["vocab"][0]) | set(vocab_test["vocab"][0]))
vocab_dict = {v: k for k, v in enumerate(sorted(vocab_list))}
vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)

with open('ls_vocab.json', 'w') as vocab_file:
    json.dump(vocab_dict, vocab_file)

# load model, feature extractor and tokenizer
feature_extractor = AutoFeatureExtractor.from_pretrained(
    repo_id,
    trust_remote_code=True,
)

tokenizer = Wav2Vec2CTCTokenizer("./ls_vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")

model = AutoModelForCTC.from_pretrained(
    repo_id,
    trust_remote_code=True,
    vocab_size=len(vocab_dict),
    feat_proj_dropout=0.1,
    activation_dropout=0.1,
    hidden_dropout=0.1,
    conformer_conv_dropout=0.1,
    ctc_loss_reduction="mean",
    pad_token_id=tokenizer.pad_token_id,
    attention_dropout=0.1,
)
model = model.to(device)
```

Please refer to this [blog](https://huggingface.co/blog/fine-tune-w2v2-bert) for a full ASR fine-tuning recipe with the Hugging Face Trainer. Alternatively, the Hugging Face model can be loaded into other frameworks such as PyTorch or ESPnet for custom fine-tuning loops.
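For readers who prefer a hand-rolled loop over the Trainer, the sketch below outlines one way to continue from the objects created above (`feature_extractor`, `tokenizer`, `model`, `librispeech100h_train`). It assumes the model follows the usual Hugging Face `ForCTC` convention of accepting `labels` and returning a `.loss`; the exact forward signature of this custom model may differ, so treat it as a starting point rather than a verified recipe.

```python
from torch.utils.data import DataLoader

# minimal sketch of a custom PyTorch fine-tuning loop (assumed interface, adapt as needed)
def collate(batch):
    audio = [sample["audio"]["array"] for sample in batch]
    inputs = feature_extractor(audio, sampling_rate=16_000, return_attention_mask=True,
                               return_tensors="pt", do_normalize=False)
    labels = tokenizer([sample["text"] for sample in batch],
                       padding=True, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the CTC loss
    inputs["labels"] = labels
    return inputs

loader = DataLoader(librispeech100h_train, batch_size=8, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for batch in loader:
    batch = {k: v.to(device) for k, v in batch.items()}
    outputs = model(**batch)  # assumes the CTC head returns `.loss` when labels are given
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```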
## Technical Specifications

### Training Data

MERaLiON-SpeechEncoder has been trained on a diverse set of unsupervised speech data, primarily in English. Our collection is curated from various publicly available datasets and covers a wide range of conditions, encompassing factors such as domain, style, speaker, gender, and accent. The combined dataset comprises around 170,000 hours of English, including 10,000 hours of Singapore-based English that incorporates code-switching, plus 30,000 additional hours of multilingual speech from 113 languages, totalling 200,000 hours. Consult our technical report for the full breakdown.

### Training Procedure and Compute

MERaLiON-SpeechEncoder was trained in two phases: initially on a 60,000-hour subset of the data, followed by continued pre-training on the full 200,000-hour dataset using the prior checkpoint as initialisation. The initial model was trained on the **ASPIRE 2A** Supercomputer Cluster provided by the **National Supercomputing Centre (NSCC)** for 325K steps on 12 Nvidia A100 40GB GPUs. The full pre-training run was carried out on the **LUMI** Supercomputer Cluster with 128 AMD MI250x GPUs for a further 382K steps, taking about 25 days of active GPU time.

## Citation

If you find our work useful, please cite our technical report:

```
@misc{huzaifah2024speechfoundationmodelsingapore,
      title={MERaLiON-SpeechEncoder: Towards a Speech Foundation Model for Singapore and Beyond},
      author={{MERaLiON Team}},
      year={2024},
      eprint={2412.11538},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.11538},
}
```