---
library_name: transformers
license: other
license_name: meralion-public-license
license_link: https://huggingface.co/MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION/blob/main/MERaLiON-Public-Licence-v1.pdf
tags:
- speech
- best-rq
- meralion
language:
- en
---

# MERaLiON-SpeechEncoder-v1

The MERaLiON-SpeechEncoder is a speech foundation model designed to support a wide range of downstream speech applications, such as speech recognition, intent classification, and speaker identification. This version was trained on **200,000 hours of predominantly English data, including 10,000 hours of Singapore-based speech**, to cater to the speech processing needs of Singapore and beyond. Gradual support for other languages, starting with the major Southeast Asian ones, is planned for subsequent releases.

- **Developed by:** I<sup>2</sup>R, A\*STAR
- **Model type:** Speech Encoder
- **Language(s):** Primarily English (Global & Singapore)
- **License:** [MERaLiON Public License](https://huggingface.co/MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION/blob/main/MERaLiON-Public-Licence-v1.pdf)

For details on background, pre-training, tuning experiments and evaluation, please refer to our [technical report](https://arxiv.org/abs/2412.11538).

## Acknowledgement
This research is supported by the National Research Foundation, Singapore and Infocomm Media Development Authority, Singapore under its National Large Language Models Funding Initiative.

## Model Description

<img src="bestrq_model_training.png" alt="model_architecture" width="400" style="display: block; margin-left: auto; margin-right: auto;"/>

MERaLiON-SpeechEncoder was pre-trained from scratch with a self-supervised learning approach using a **BERT-based speech pre-training with random-projection quantizer (BEST-RQ)** objective. Analogous to BERT's masked language modelling criterion for text, this entails predicting the correct discrete label from a codebook over the masked frames of an input speech signal. MERaLiON-SpeechEncoder-v1 contains approximately 630M parameters.
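
For intuition, the random-projection quantizer can be sketched in a few lines of PyTorch: a frozen random projection maps each target frame into a small embedding space, and the index of the nearest entry in a frozen random codebook becomes the classification label for that frame when it is masked. The dimensions below are hypothetical and chosen only for illustration (the actual sizes follow the technical report); the encoder is then trained with cross-entropy to predict these labels over the masked frames.

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes for illustration only.
feat_dim, proj_dim, codebook_size = 80, 16, 8192

# Both the projection matrix and the codebook are randomly initialised
# and kept frozen throughout pre-training.
projection = torch.randn(feat_dim, proj_dim)
codebook = F.normalize(torch.randn(codebook_size, proj_dim), dim=-1)

def bestrq_targets(frames):
    """frames: (batch, time, feat_dim) -> integer labels: (batch, time)."""
    projected = F.normalize(frames @ projection, dim=-1)
    # The nearest codebook entry for every frame gives its discrete label.
    distances = torch.cdist(projected.reshape(-1, proj_dim), codebook)
    return distances.argmin(dim=-1).reshape(frames.shape[:-1])

# Example: labels for a batch of 2 utterances, 100 frames each.
labels = bestrq_targets(torch.randn(2, 100, feat_dim))
print(labels.shape)  # torch.Size([2, 100])
```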

The model takes speech input in the form of mel-spectrograms and returns compressed latent features, which can then be passed to a task-specific downstream model relevant to the user's application. Note that the model provided here is the base foundation model itself: the user has to fine-tune it with task-specific data to obtain a complete inference pipeline. We provide some examples below to get started.

## Capabilities

We have evaluated the MERaLiON-SpeechEncoder extensively on several speech recognition datasets, and fine-tuned the model on ten different tasks encompassing the [SUPERB](https://superbbenchmark.org/) benchmark: `automatic speech recognition` (ASR), `automatic phoneme recognition` (PR), `keyword spotting` (KS), `query by example spoken term detection` (QbE), `intent classification` (IC), `slot filling` (SF), `speaker identification` (SID), `automatic speaker verification` (ASV), `speaker diarization` (SD), and `emotion recognition` (ER). Our evaluation demonstrates improvements on spontaneous and Singapore speech benchmarks for speech recognition, while remaining competitive with other state-of-the-art speech encoders such as WavLM and HuBERT across SUPERB tasks.

This version of the MERaLiON-SpeechEncoder is specifically tailored for English, both global and Singapore-specific, including Singlish. Although the encoder was trained on a portion of multilingual data, this has not been substantially evaluated.

We provide a code snippet below for directly retrieving latent features from the model, followed by an example of how to set up the model for ASR fine-tuning. Speech input should be sampled at 16 kHz.
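
If your audio is at a different rate, resample it before feature extraction. A minimal sketch with torchaudio (the waveform and its original rate below are dummy stand-ins for real audio):

```python
import torch
import torchaudio

# Dummy 1-second mono waveform at 44.1 kHz, standing in for real audio.
orig_sr = 44_100
waveform = torch.randn(1, orig_sr)

# Resample to the 16 kHz rate expected by the feature extractor.
waveform_16k = torchaudio.functional.resample(waveform, orig_freq=orig_sr, new_freq=16_000)
```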

### Get Features

```python
import torch
from datasets import load_dataset
from transformers import AutoModel, AutoFeatureExtractor

repo_id = 'MERaLiON/MERaLiON-SpeechEncoder-v1'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# load model and feature extractor
model = AutoModel.from_pretrained(
    repo_id,
    trust_remote_code=True,
)
model = model.to(device)

feature_extractor = AutoFeatureExtractor.from_pretrained(
    repo_id,
    trust_remote_code=True
)

# prepare data
data = load_dataset("distil-whisper/librispeech_long", "clean",
                    split="validation")

def batch_collater(data):
    # Collect the raw audio arrays from each dataset sample.
    return [sample['audio']['array'] for sample in data]

audio_array = batch_collater(data)
inputs = feature_extractor(audio_array, sampling_rate=16_000,
                           return_attention_mask=True,
                           return_tensors='pt', do_normalize=False)
input_values = inputs['input_values']
input_lengths = torch.sum(inputs['attention_mask'], dim=-1)

input_values, input_lengths = input_values.to(device), input_lengths.to(device)

# model inference to obtain features
model.eval()
with torch.no_grad():
    output = model(input_values=input_values,
                   input_lengths=input_lengths,
                   output_hidden_states=True)
```
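
With `output_hidden_states=True`, the returned object should also expose the intermediate layer representations alongside the final encoder output; different layers tend to suit different downstream tasks. A short sketch of inspecting and pooling them (the attribute names follow the standard `transformers` output convention, which we assume this remote-code model shares):

```python
# Final encoder output: (batch, frames, hidden_dim).
last_hidden = output.last_hidden_state
print(last_hidden.shape)

# Tuple of per-layer hidden states, each (batch, frames, hidden_dim);
# intermediate layers can work better for some downstream tasks.
print(len(output.hidden_states))

# Example: mean-pool over time for a single utterance-level embedding,
# e.g. as input to a lightweight classifier head.
utterance_embedding = last_hidden.mean(dim=1)  # (batch, hidden_dim)
```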

### Downstream Use

```python
import torch
import json
from datasets import load_dataset
from transformers import AutoModelForCTC, AutoFeatureExtractor, Wav2Vec2CTCTokenizer

repo_id = 'MERaLiON/MERaLiON-SpeechEncoder-v1'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# prepare data
def pre_processing(batch):
    batch["text"] = batch["text"].lower()
    return batch

def extract_all_chars(batch):
    all_text = " ".join(batch["text"])
    vocab = list(set(all_text))
    return {"vocab": [vocab], "all_text": [all_text]}

librispeech100h_train = load_dataset("openslr/librispeech_asr", split="train.clean.100")
librispeech100h_test = load_dataset("openslr/librispeech_asr", split="validation.clean")
librispeech100h_train = librispeech100h_train.remove_columns(
                                    ['file', 'speaker_id', 'chapter_id', 'id'])
librispeech100h_test = librispeech100h_test.remove_columns(
                                    ['file', 'speaker_id', 'chapter_id', 'id'])

librispeech100h_train = librispeech100h_train.map(pre_processing)
librispeech100h_test = librispeech100h_test.map(pre_processing)

vocab_train = librispeech100h_train.map(extract_all_chars, batched=True,
                                    batch_size=-1, keep_in_memory=True,
                                    remove_columns=librispeech100h_train.column_names)
vocab_test = librispeech100h_test.map(extract_all_chars, batched=True,
                                    batch_size=-1, keep_in_memory=True,
                                    remove_columns=librispeech100h_test.column_names)
vocab_list = list(set(vocab_train["vocab"][0]) | set(vocab_test["vocab"][0]))
vocab_dict = {v: k for k, v in enumerate(sorted(vocab_list))}

vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)

with open('ls_vocab.json', 'w') as vocab_file:
    json.dump(vocab_dict, vocab_file)

# load model, feature extractor and tokenizer
feature_extractor = AutoFeatureExtractor.from_pretrained(
    repo_id,
    trust_remote_code=True,
)

tokenizer = Wav2Vec2CTCTokenizer("./ls_vocab.json",
                                 unk_token="[UNK]", pad_token="[PAD]",
                                 word_delimiter_token="|")

model = AutoModelForCTC.from_pretrained(
    repo_id,
    trust_remote_code=True,
    vocab_size=len(vocab_dict),
    feat_proj_dropout=0.1,
    activation_dropout=0.1,
    hidden_dropout=0.1,
    conformer_conv_dropout=0.1,
    ctc_loss_reduction="mean",
    pad_token_id=tokenizer.pad_token_id,
    attention_dropout=0.1,
)
model = model.to(device)
```
Please refer to this [blog](https://huggingface.co/blog/fine-tune-w2v2-bert) for a complete ASR fine-tuning recipe with the Hugging Face Trainer. Alternatively, the model can be loaded into other frameworks such as PyTorch or ESPnet for custom fine-tuning loops.
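
To bridge into that recipe, the remaining pieces are usually a preprocessing map that pairs each audio clip with its tokenised transcript, and a padding collator for CTC training. A condensed sketch following the linked blog's pattern; the helper names below are illustrative rather than part of this repository, and the exact input keys for this remote-code model may differ slightly:

```python
def prepare_dataset(batch):
    # Convert raw audio to model inputs and the transcript to CTC label ids.
    audio = batch["audio"]
    batch["input_values"] = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
    batch["labels"] = tokenizer(batch["text"]).input_ids
    return batch

librispeech100h_train = librispeech100h_train.map(
    prepare_dataset, remove_columns=librispeech100h_train.column_names)

class DataCollatorCTCWithPadding:
    # Pad audio and labels separately; padded label positions are set to
    # -100 so the CTC loss ignores them.
    def __call__(self, features):
        input_features = [{"input_values": f["input_values"]} for f in features]
        label_features = [{"input_ids": f["labels"]} for f in features]
        batch = feature_extractor.pad(
            input_features, padding=True, return_tensors="pt")
        labels_batch = tokenizer.pad(
            label_features, padding=True, return_tensors="pt")
        batch["labels"] = labels_batch["input_ids"].masked_fill(
            labels_batch.attention_mask.ne(1), -100)
        return batch
```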

## Technical Specifications 

### Training Data

MERaLiON-SpeechEncoder has been trained on a diverse set of unsupervised speech data, primarily in English. Our collection is curated from various publicly available datasets and covers a wide range of conditions, encompassing factors such as domain, style, speaker, gender, and accent. The combined dataset comprises around 170,000 hours of English, including 10,000 hours of Singapore-based English that incorporates code-switching; plus 30,000 additional hours of multilingual speech from 113 languages, totalling 200,000 hours. Consult our technical report for the full breakdown.

### Training Procedure and Compute

MERaLiON-SpeechEncoder was trained in two phases: initially on a 60,000-hour subset of the data, before continued pre-training on the full 200,000-hour dataset using the earlier checkpoint as initialisation. The initial model was trained on the **ASPIRE 2A** Supercomputer Cluster provided by the **National Supercomputing Centre (NSCC)** for 325K steps on 12 Nvidia A100 40GB GPUs. The full pre-training run was carried out on the **LUMI** Supercomputer Cluster with 128 AMD MI250x GPUs for a further 382K steps, taking about 25 days of active GPU time.

## Citation

If you find our work useful, please cite our technical report:

```
@misc{huzaifah2024speechfoundationmodelsingapore,
      title={MERaLiON-SpeechEncoder: Towards a Speech Foundation Model for Singapore and Beyond}, 
      author={{MERaLiON Team}},
      year={2024},
      eprint={2412.11538},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.11538}, 
}
```