mzboito committed
Commit: 962e673
Parent(s): 0c11514

Update README.md

Files changed (1):
  README.md (+8 -10)
README.md CHANGED

@@ -139,7 +139,7 @@ language:
 
 mHuBERT-147 are compact and competitive multilingual HuBERT models trained on 90K hours of open-license data in 147 languages.
 Different from *traditional* HuBERTs, mHuBERT-147 models are trained using faiss IVF discrete speech units.
-Training employs a two-level language, data source up-sampling during training. See more information in our paper.
+Training employs a two-level language, data source up-sampling during training. See more information in [our paper](https://arxiv.org/pdf/2406.06371).
 
 **This repository contains:**
 * Fairseq checkpoint (original);
@@ -147,24 +147,22 @@ Training employs a two-level language, data source up-sampling during training.
 * Faiss index for continuous pre-training (OPQ16_64,IVF1000_HNSW32,PQ16x4fsr).
 
 **Related Models:**
-* Second Iteration repository: https://huggingface.co/utter-project/mHuBERT-147-base-2nd-iter
-* First Iteration repository: https://huggingface.co/utter-project/mHuBERT-147-base-1st-iter
-* CommonVoice Prototype (12 languages): https://huggingface.co/utter-project/hutter-12-3rd-base
+* [Second Iteration repository](https://huggingface.co/utter-project/mHuBERT-147-base-2nd-iter)
+* [First Iteration repository](https://huggingface.co/utter-project/mHuBERT-147-base-1st-iter)
+* [CommonVoice Prototype (12 languages)](https://huggingface.co/utter-project/hutter-12-3rd-base)
 
 # Training
 
-**Manifest list:** https://huggingface.co/utter-project/mHuBERT-147-base-3rd-iter/tree/main/manifest
-
-Please note that since training, there were CommonVoice removal requests. This means that some of the listed files are no longer available.
-
-**Fairseq fork:** https://github.com/utter-project/fairseq
-
-**Scripts for pre-processing/faiss clustering:** https://github.com/utter-project/mHuBERT-147-scripts
+* **[Manifest list available here.](https://huggingface.co/utter-project/mHuBERT-147-base-3rd-iter/tree/main/manifest)** Please note that since training, there were CommonVoice removal requests. This means that some of the listed files are no longer available.
+
+* **[Fairseq fork](https://github.com/utter-project/fairseq)** contains the scripts for training with multilingual batching with two-level up-sampling.
+
+* **[Scripts for pre-processing/faiss clustering available here.](https://github.com/utter-project/mHuBERT-147-scripts)**
 
 # ML-SUPERB Scores
 
 mHubert-147 reaches second and first position in the 10min and 1h leaderboards respectively. We achieve new SOTA scores for three LID tasks.
-See more information in our paper.
+See more information in [our paper](https://arxiv.org/pdf/2406.06371).
 
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/62262e19d36494a6f743a28d/chXjExnWc3rhhtdsyiU-W.png)