huzy0 committed on
Commit
4802c8b
·
verified ·
1 Parent(s): 400c007

Update README.md

Files changed (1)
  1. README.md +70 -13
README.md CHANGED
@@ -11,7 +11,7 @@ language:
11
 
12
  # MERaLiON-SpeechEncoder-v1
13
 
14
- The MERaLiON-SpeechEncoder is a speech foundation model that is designed to support a wide range of downstream speech applications, like speech recognition, intent classification and speaker identification, among others. This version is trained on 200,000 hours of predominantly English data including 10,000 hours of Singapore-based speech, to address the speech processing needs in Singapore and beyond.
15
 
16
  - **Developed by:** I<sup>2</sup>R, A\*STAR
17
  - **Funded by:** Singapore NRF
@@ -27,7 +27,6 @@ The model takes in speech as input in the form of mel-spectrograms and returns c
27
 
28
  For more details on background, pre-training, tuning experiments and evaluation, please refer to our [technical report]().
29
 
30
-
31
  ## Uses
32
 
33
  We have evaluated the MERaLiON-SpeechEncoder extensively on several speech recognition datasets, and fine-tuned the model on ten different tasks encompassing the SUPERB benchmark: `automatic speech recognition` (ASR), `automatic phoneme recognition` (PR), `keyword spotting` (KS), `query by example spoken term detection` (QbE), `intent classification` (IC), `slot filling` (SF), `speaker identification` (SID), `automatic speaker verification` (ASV), `speaker diarization` (SD), and `emotion recognition` (ER).
@@ -44,7 +43,7 @@ import torch
44
  from datasets import load_dataset
45
  from transformers import AutoModel, AutoFeatureExtractor
46
 
47
- repo_id = "MERaLiON/MERaLiON-SpeechEncoder-V1"
48
  device = 'cuda' if torch.cuda.is_available() else 'cpu'
49
 
50
  # load model and feature extractor
@@ -52,19 +51,20 @@ model = AutoModel.from_pretrained(
52
      repo_id,
53
      trust_remote_code=True,
54
  )
 
55
 
56
  feature_extractor = AutoFeatureExtractor.from_pretrained(
57
      repo_id,
58
-     trust_remote_code = True
59
  )
60
 
61
  # prepare data
62
- data = load_dataset("openslr/librispeech_asr", split="validation.clean")
63
 
64
  def batch_collater(data):
65
      tensors = []
66
      for idx, sample in enumerate(data):
67
-         tensors.append(sample["audio"]["array"])
68
      return tensors
69
 
70
  audio_array = batch_collater(data)
@@ -80,24 +80,81 @@ with torch.no_grad():
80
      output = model(input_values=input_values, input_lengths=input_lengths, output_hidden_states=True)
81
  ```
82
 
83
- ### Downstream Use [optional]
84
 
85
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
86
 
87
- [More Information Needed]
88
 
89
 
90
  ## Technical Specifications
91
 
92
  ### Training Data
93
 
94
- MERaLiON-SpeechEncoder has been trained on a diverse set of unsupervised speech data, predominantly in English. Our collection is derived and processed from various publicly available datasets. The combined dataset covers a wide range of conditions, encompassing factors such as domain, style, speaker, gender, and accent. Duration-wise it comprises around 170,000 hours of English, including 10,000 hours of Singapore-based English that incorporates code-switching, plus 30,000 additional hours of multilingual speech from 113 languages, totalling 200,000 hours.
95
-
96
- [More Information Needed]
97
 
98
  ### Training Procedure and Compute
99
 
100
- MERaLiON-SpeechEncoder was trained in two phases, initially on a 60,000-hour subset of data, before continued pre-training on the full 200,000-hour dataset using this prior checkpoint as initialisation. The initial model was trained on **ASPIRE 2A** Supercomputer Cluster, provided by the **National Supercomputing Centre (NSCC)** for 325K steps on 12 Nvidia A100 40GB GPUs. The full pre-training run was carried out on the **LUMI** Supercomputer Cluster with 128 AMD MI250x GPUs for a further 382K steps taking about 25 days of active GPU time.
101
 
102
 
103
  ## Citation [optional]
 
11
 
12
  # MERaLiON-SpeechEncoder-v1
13
 
14
+ The MERaLiON-SpeechEncoder is a speech foundation model designed to support a wide range of downstream speech applications, such as speech recognition, intent classification and speaker identification. This version was trained on 200,000 hours of predominantly English data, including 10,000 hours of Singapore-based speech, to cater to speech processing needs in Singapore and beyond. Gradual support for other languages, starting with major South-East Asian ones, is planned for subsequent releases.
15
 
16
  - **Developed by:** I<sup>2</sup>R, A\*STAR
17
  - **Funded by:** Singapore NRF
 
27
 
28
  For more details on background, pre-training, tuning experiments and evaluation, please refer to our [technical report]().
29
 
 
30
  ## Uses
31
 
32
  We have evaluated the MERaLiON-SpeechEncoder extensively on several speech recognition datasets, and fine-tuned the model on ten different tasks encompassing the SUPERB benchmark: `automatic speech recognition` (ASR), `automatic phoneme recognition` (PR), `keyword spotting` (KS), `query by example spoken term detection` (QbE), `intent classification` (IC), `slot filling` (SF), `speaker identification` (SID), `automatic speaker verification` (ASV), `speaker diarization` (SD), and `emotion recognition` (ER).
 
43
  from datasets import load_dataset
44
  from transformers import AutoModel, AutoFeatureExtractor
45
 
46
+ repo_id = 'MERaLiON/MERaLiON-SpeechEncoder-v1'
47
  device = 'cuda' if torch.cuda.is_available() else 'cpu'
48
 
49
  # load model and feature extractor
 
51
      repo_id,
52
      trust_remote_code=True,
53
  )
54
+ model = model.to(device)
55
 
56
  feature_extractor = AutoFeatureExtractor.from_pretrained(
57
      repo_id,
58
+     trust_remote_code=True
59
  )
60
 
61
  # prepare data
62
+ data = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
63
 
64
  def batch_collater(data):
65
      tensors = []
66
      for idx, sample in enumerate(data):
67
+         tensors.append(sample['audio']['array'])
68
      return tensors
69
 
70
  audio_array = batch_collater(data)
 
80
      output = model(input_values=input_values, input_lengths=input_lengths, output_hidden_states=True)
81
  ```
82
 
83
+ ### Downstream Use
84
 
85
+ ```python
86
+ import torch
87
+ import json
88
+ from datasets import load_dataset
89
+ from transformers import AutoModelForCTC, AutoFeatureExtractor, Wav2Vec2CTCTokenizer
90
 
91
+ repo_id = 'MERaLiON/MERaLiON-SpeechEncoder-v1'
92
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
93
+
94
+ # prepare data
95
+ def pre_processing(batch):
96
+     batch["text"] = batch["text"].lower()
97
+     return batch
98
+
99
+ def extract_all_chars(batch):
100
+     all_text = " ".join(batch["text"])
101
+     vocab = list(set(all_text))
102
+     return {"vocab": [vocab], "all_text": [all_text]}
103
+
104
+ librispeech100h_train = load_dataset("openslr/librispeech_asr", split="train.clean.100")
105
+ librispeech100h_test = load_dataset("openslr/librispeech_asr", split="validation.clean")
106
+ librispeech100h_train = librispeech100h_train.remove_columns(['file', 'speaker_id', 'chapter_id', 'id'])
107
+ librispeech100h_test = librispeech100h_test.remove_columns(['file', 'speaker_id', 'chapter_id', 'id'])
108
+
109
+ librispeech100h_train = librispeech100h_train.map(pre_processing)
110
+ librispeech100h_test = librispeech100h_test.map(pre_processing)
111
+
112
+ vocab_train = librispeech100h_train.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True, remove_columns=librispeech100h_train.column_names)
113
+ vocab_test = librispeech100h_test.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True, remove_columns=librispeech100h_test.column_names)
114
+ vocab_list = list(set(vocab_train["vocab"][0]) | set(vocab_test["vocab"][0]))
115
+ vocab_dict = {v: k for k, v in enumerate(sorted(vocab_list))}
116
+
117
+ vocab_dict["|"] = vocab_dict[" "]
118
+ del vocab_dict[" "]
119
+ vocab_dict["[UNK]"] = len(vocab_dict)
120
+ vocab_dict["[PAD]"] = len(vocab_dict)
121
 
122
+ with open('ls_vocab.json', 'w') as vocab_file:
123
+     json.dump(vocab_dict, vocab_file)
124
+
125
+ # load model, feature extractor and tokenizer
126
+ feature_extractor = AutoFeatureExtractor.from_pretrained(
127
+     repo_id,
128
+     trust_remote_code=True,
129
+ )
130
+
131
+ tokenizer = Wav2Vec2CTCTokenizer("./ls_vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
132
+
133
+ model = AutoModelForCTC.from_pretrained(
134
+     repo_id,
135
+     trust_remote_code=True,
136
+     vocab_size=len(vocab_dict),
137
+     feat_proj_dropout=0.1,
138
+     activation_dropout=0.1,
139
+     hidden_dropout=0.1,
140
+     conformer_conv_dropout=0.1,
141
+     ctc_loss_reduction="mean",
142
+     pad_token_id=tokenizer.pad_token_id,
143
+     attention_dropout=0.1,
144
+ )
145
+ model = model.to(device)
146
+ ```
147
+ Consult this [blog](https://huggingface.co/blog/fine-tune-w2v2-bert) for a complete training recipe using the Hugging Face Trainer. Alternatively, the Hugging Face model can be loaded into other frameworks, such as PyTorch or ESPnet, for custom fine-tuning loops.
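
The Downstream Use snippet builds a character vocabulary with `|` as the word delimiter and `[PAD]` as the padding token, which Hugging Face CTC models also use as the CTC blank. As a minimal, framework-free sketch of what that vocabulary implies at inference time, the following mirrors the vocabulary construction on a toy corpus and applies greedy CTC decoding: collapse repeated frame predictions, drop blanks, then map `|` back to spaces. The helper names `build_vocab` and `greedy_ctc_decode`, the toy transcripts and the frame ids are all illustrative assumptions and not part of the repository.

```python
from typing import Dict, List


def build_vocab(transcripts: List[str]) -> Dict[str, int]:
    """Build a character vocabulary laid out like the README snippet (hypothetical helper)."""
    chars = sorted(set(" ".join(t.lower() for t in transcripts)))
    vocab = {ch: idx for idx, ch in enumerate(chars)}
    # Replace the space character with a visible word delimiter, keeping its id.
    if " " in vocab:
        vocab["|"] = vocab.pop(" ")
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab


def greedy_ctc_decode(frame_ids: List[int], vocab: Dict[str, int]) -> str:
    """Greedy CTC decoding over per-frame argmax ids (hypothetical helper).

    Assumes [PAD] doubles as the CTC blank, as in the tokenizer setup above.
    """
    id_to_token = {idx: tok for tok, idx in vocab.items()}
    blank_id = vocab["[PAD]"]
    pieces = []
    previous = None
    for idx in frame_ids:
        # Emit a token only when it differs from the previous frame and is not blank.
        if idx != previous and idx != blank_id:
            pieces.append(id_to_token.get(idx, "[UNK]"))
        previous = idx
    return "".join(pieces).replace("|", " ").strip()


# Toy corpus standing in for the LibriSpeech transcripts.
vocab = build_vocab(["hello world", "good day"])

# Made-up per-frame predictions: repeats and blanks (id 12) are collapsed away.
frames = [5, 5, 12, 3, 6, 6, 12, 6, 7, 0, 0, 9, 7, 8, 6, 2]
print(greedy_ctc_decode(frames, vocab))  # prints: hello world
```

In a real fine-tuning run the frame ids would come from the argmax over the CTC model's logits, and the tokenizer's own decoding would normally replace this helper; the sketch only shows why a blank between the two `l` frames is needed to recover the double letter.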
148
 
149
  ## Technical Specifications
150
 
151
  ### Training Data
152
 
153
+ MERaLiON-SpeechEncoder has been trained on a diverse set of unsupervised speech data, primarily in English. Our collection is derived and processed from various publicly available datasets. The combined dataset covers a wide range of conditions, encompassing factors such as domain, style, speaker, gender, and accent. By duration, it comprises around 170,000 hours of English, including 10,000 hours of Singapore-based English that incorporates code-switching, plus 30,000 additional hours of multilingual speech from 113 languages, totalling 200,000 hours. Consult our technical report for the full breakdown.
 
 
154
 
155
  ### Training Procedure and Compute
156
 
157
+ MERaLiON-SpeechEncoder was trained in two phases, initially on a 60,000-hour subset of data, before continued pre-training on the full 200,000-hour dataset using this prior checkpoint as initialisation. The initial model was trained on the **ASPIRE 2A** Supercomputer Cluster provided by the **National Supercomputing Centre (NSCC)** for 325K steps on 12 Nvidia A100 40GB GPUs. The full pre-training run was carried out on the **LUMI** Supercomputer Cluster with 128 AMD MI250x GPUs for a further 382K steps, taking about 25 days of active GPU time.
158
 
159
 
160
  ## Citation [optional]