language:
- en
Model Card for MiniLMv2
Small and fast pre-trained models for language understanding and generation
Model Details
Model Description
MiniLMv2 provides the pre-trained models for the paper "MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers". We generalize the deep self-attention distillation of MiniLMv1 by using self-attention relation distillation for task-agnostic compression of pre-trained Transformers. The proposed method removes the restriction on the number of the student’s attention heads. Our monolingual and multilingual small models, distilled from base-size and large-size teacher models, achieve competitive performance.
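The relation distillation idea described above can be illustrated with a small NumPy sketch. It assumes, following the paper's description, that relations are softmax-normalised scaled dot-products of queries with queries (and likewise keys with keys, values with values), computed after splitting the hidden vectors into a shared number of relation heads, and that the student is trained to match the teacher's relations under KL divergence. The function names, the random inputs, and the choice of 16 relation heads are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def relation_heads(x, num_relation_heads):
    """Return relations of shape (num_relation_heads, seq_len, seq_len).

    x holds one vector per token, shape (seq_len, hidden). The hidden
    dimension is split into relation heads, so teacher and student may have
    different hidden sizes yet produce relations of identical shape.
    """
    seq_len, hidden = x.shape
    assert hidden % num_relation_heads == 0
    d_r = hidden // num_relation_heads
    heads = x.reshape(seq_len, num_relation_heads, d_r).transpose(1, 0, 2)
    scores = heads @ heads.transpose(0, 2, 1) / np.sqrt(d_r)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum(axis=-1, keepdims=True)


def relation_kl(teacher_x, student_x, num_relation_heads, eps=1e-12):
    """KL(teacher || student), averaged over relation heads and positions."""
    t = relation_heads(teacher_x, num_relation_heads)
    s = relation_heads(student_x, num_relation_heads)
    kl = (t * (np.log(t + eps) - np.log(s + eps))).sum(axis=-1)
    return float(kl.mean())


def relation_distillation_loss(teacher_qkv, student_qkv, num_relation_heads):
    """Sum the Q-Q, K-K and V-V relation losses (assumed equal weights)."""
    return sum(
        relation_kl(t, s, num_relation_heads)
        for t, s in zip(teacher_qkv, student_qkv)
    )


# Hypothetical inputs: 5 tokens, teacher hidden size 1024, student 384.
rng = np.random.default_rng(0)
teacher_qkv = [rng.normal(size=(5, 1024)) for _ in range(3)]
student_qkv = [rng.normal(size=(5, 384)) for _ in range(3)]
loss = relation_distillation_loss(teacher_qkv, student_qkv, num_relation_heads=16)
```

Because both models are re-split into the same number of relation heads, the student's own attention-head count does not need to match the teacher's.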
- Developed by: More information needed
- Shared by [Optional]: More information needed
- Model type: Language model
- Language(s) (NLP): en
- License: Microsoft Open Source Code of Conduct
- Related Models: More information needed
- Parent Model: xlm-roberta
- Resources for more information:
Uses
Direct Use
More information is needed.
Downstream Use [Optional]
More information is needed.
Out-of-Scope Use
The model should not be used to intentionally create hostile or alienating environments for people.
Bias, Risks, and Limitations
Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.
Training Details
Training Data
Following Lewis et al. (2019b), we adopt SQuAD 1.1 as training data and use MLQA English development data for early stopping.
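The early-stopping setup mentioned above can be sketched as follows. This is a minimal illustration, assuming the dev score (for example F1 on the MLQA English development data) is measured at regular intervals and that training stops after a fixed number of evaluations without improvement. The `patience` value and the scores are hypothetical and are not taken from the paper.

```python
def early_stopping_step(dev_scores, patience=3):
    """Return (index, score) of the best evaluation seen before stopping.

    Iteration stops once `patience` consecutive evaluations have failed to
    improve on the best dev score so far.
    """
    best_index, best_score = -1, float("-inf")
    for i, score in enumerate(dev_scores):
        if score > best_score:
            best_index, best_score = i, score
        elif i - best_index >= patience:
            break  # no improvement for `patience` evaluations
    return best_index, best_score


# Hypothetical dev F1 values, one per evaluation.
scores = [60.1, 62.4, 63.0, 62.8, 62.9, 62.7, 64.0]
best = early_stopping_step(scores, patience=3)
```

With these values the loop stops at the sixth evaluation, so the checkpoint from the third evaluation is kept and the later score of 64.0 is never reached.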
Training Procedure
Preprocessing
More information needed
Speeds, Sizes, Times
We compress XLMR-Large into 12-layer and 6-layer models with a hidden size of 384 and report the zero-shot performance on the XNLI and MLQA test sets.
[English] Pre-trained Models
Model | Teacher Model | Speedup | #Param | MNLI-m (Acc) | SQuAD 2.0 (F1) |
---|---|---|---|---|---|
L6xH768 MiniLMv2 | RoBERTa-Large | 2.0x | 81M | 87.0 | 81.6 |
L12xH384 MiniLMv2 | RoBERTa-Large | 2.7x | 41M | 86.9 | 82.3 |
L6xH384 MiniLMv2 | RoBERTa-Large | 5.3x | 30M | 84.4 | 76.4 |
L6xH768 MiniLMv2 | BERT-Large Uncased | 2.0x | 66M | 85.0 | 77.7 |
L6xH384 MiniLMv2 | BERT-Large Uncased | 5.3x | 22M | 83.0 | 74.3 |
L6xH768 MiniLMv2 | BERT-Base Uncased | 2.0x | 66M | 84.2 | 76.3 |
L6xH384 MiniLMv2 | BERT-Base Uncased | 5.3x | 22M | 82.8 | 72.9 |
Evaluation
Testing Data, Factors & Metrics
Testing Data
Fine-tuning on NLU tasks
MiniLM has the same Transformer architecture as BERT. For NLU tasks, the PyTorch version of our models can be loaded using the BERT code in huggingface/transformers. The config file needs to be replaced with MiniLM's.
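The loading step described above might look like the following sketch. The hyperparameter values are assumptions for a 12-layer, 384-hidden uncased model and should be replaced by the values in the config file shipped with the checkpoint. `checkpoint_dir` is a hypothetical local path. `transformers` is imported lazily, so the configuration can be inspected without the library installed.

```python
# Assumed hyperparameters for a 12-layer, 384-hidden uncased MiniLM.
# Replace them with the values from the released MiniLM config file.
MINILM_L12_H384 = {
    "vocab_size": 30522,
    "hidden_size": 384,
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "intermediate_size": 1536,
}


def load_minilm_as_bert(checkpoint_dir, config_dict=MINILM_L12_H384):
    """Load MiniLM weights through the BERT classes in transformers.

    checkpoint_dir is a hypothetical directory holding the PyTorch weights.
    """
    from transformers import BertConfig, BertModel  # imported lazily

    config = BertConfig(**config_dict)
    return BertModel.from_pretrained(checkpoint_dir, config=config)
```

Passing an explicit `config` overrides whatever configuration would otherwise be read from the checkpoint directory, which is the replacement the text refers to.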
We present the dev results on SQuAD 2.0 and several GLUE benchmark tasks.
Model | #Param | SQuAD 2.0 | MNLI-m | SST-2 | QNLI | CoLA | RTE | MRPC | QQP |
---|---|---|---|---|---|---|---|---|---|
BERT-Base | 109M | 76.8 | 84.5 | 93.2 | 91.7 | 58.9 | 68.6 | 87.3 | 91.3 |
MiniLM-L12xH384 | 33M | 81.7 | 85.7 | 93.0 | 91.5 | 58.5 | 73.3 | 89.5 | 91.3 |
MiniLM-L6xH384 | 22M | 75.6 | 83.3 | 91.5 | 90.5 | 47.5 | 68.8 | 88.9 | 90.6 |
Factors
More information needed
Metrics
We evaluate the multilingual MiniLM on the cross-lingual natural language inference benchmark (XNLI) and the cross-lingual question answering benchmark (MLQA).
Cross-Lingual Natural Language Inference - XNLI
We evaluate our model on cross-lingual transfer from English to other languages. Following Conneau et al. (2019), we select the best single model on the joint dev set of all the languages.
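Selecting the best single model on the joint dev set can be sketched as below. The sketch assumes every language contributes the same number of dev examples, so that accuracy on the joint set equals the unweighted mean of the per-language accuracies. The checkpoint names and scores are hypothetical.

```python
def select_on_joint_dev(checkpoints):
    """Pick the checkpoint with the highest mean dev accuracy over languages.

    checkpoints maps a checkpoint name to a {language: dev accuracy} dict.
    """
    def joint_score(scores):
        return sum(scores.values()) / len(scores)

    best_name = max(checkpoints, key=lambda name: joint_score(checkpoints[name]))
    return best_name, joint_score(checkpoints[best_name])


# Hypothetical dev accuracies for two checkpoints and three languages.
dev_results = {
    "ckpt_a": {"en": 80.0, "fr": 74.0, "sw": 62.0},
    "ckpt_b": {"en": 82.0, "fr": 73.0, "sw": 58.0},
}
best_name, best_score = select_on_joint_dev(dev_results)
```

Here `ckpt_b` is stronger on English, but `ckpt_a` is selected because its average over all languages is higher. A single model is chosen for every language rather than one model per language.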
Model | #Layers | #Hidden | #Transformer Parameters | Average | en | fr | es | de | el | bg | ru | tr | ar | vi | th | zh | hi | sw | ur |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
mBERT | 12 | 768 | 85M | 66.3 | 82.1 | 73.8 | 74.3 | 71.1 | 66.4 | 68.9 | 69.0 | 61.6 | 64.9 | 69.5 | 55.8 | 69.3 | 60.0 | 50.4 | 58.0 |
XLM-100 | 16 | 1280 | 315M | 70.7 | 83.2 | 76.7 | 77.7 | 74.0 | 72.7 | 74.1 | 72.7 | 68.7 | 68.6 | 72.9 | 68.9 | 72.5 | 65.6 | 58.2 | 62.4 |
XLM-R Base | 12 | 768 | 85M | 74.5 | 84.6 | 78.4 | 78.9 | 76.8 | 75.9 | 77.3 | 75.4 | 73.2 | 71.5 | 75.4 | 72.5 | 74.9 | 71.1 | 65.2 | 66.5 |
mMiniLM-L12xH384 | 12 | 384 | 21M | 71.1 | 81.5 | 74.8 | 75.7 | 72.9 | 73.0 | 74.5 | 71.3 | 69.7 | 68.8 | 72.1 | 67.8 | 70.0 | 66.2 | 63.3 | 64.2 |
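The "#Transformer Parameters" column can be reproduced approximately from the layer count and hidden size. The sketch below assumes a standard BERT-style encoder layer with a feed-forward size of four times the hidden size, and it excludes embeddings. The function name is illustrative, and the breakdown is an assumption about how the column was computed.

```python
def transformer_params(num_layers, hidden, ffn_mult=4):
    """Count encoder-layer parameters, excluding embeddings.

    Assumes BERT-style layers: Q/K/V/output projections with biases, a
    two-layer feed-forward block, and two LayerNorms per layer.
    """
    attention = 4 * (hidden * hidden + hidden)
    ffn_size = ffn_mult * hidden
    ffn = hidden * ffn_size + ffn_size + ffn_size * hidden + hidden
    layer_norms = 2 * 2 * hidden  # weight and bias for each LayerNorm
    return num_layers * (attention + ffn + layer_norms)


for name, layers, hidden in [
    ("mBERT", 12, 768),
    ("XLM-100", 16, 1280),
    ("mMiniLM-L12xH384", 12, 384),
]:
    print(f"{name}: {transformer_params(layers, hidden) / 1e6:.0f}M")
```

Under these assumptions the counts round to 85M, 315M, and 21M, matching the table. The total model size is larger because the multilingual vocabulary embeddings are not included in this column.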
Results
We present the question generation results following the same data split as in (Du et al., 2017).
Model | #Param | BLEU-4 | METEOR | ROUGE-L |
---|---|---|---|---|
MiniLM-L12xH384 | 33M | 21.07 | 24.09 | 49.14 |
MiniLM-L6xH384 | 22M | 20.31 | 23.43 | 48.21 |
We also report the results following the data split as in (Zhao et al., 2018), which uses the reversed dev-test setup.
Model | #Param | BLEU-4 | METEOR | ROUGE-L |
---|---|---|---|---|
MiniLM-L12xH384 | 33M | 23.27 | 25.15 | 50.60 |
MiniLM-L6xH384 | 22M | 22.01 | 24.24 | 49.51 |
Model Examination
More information needed
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: More information needed
- Hours used: More information needed
- Cloud Provider: More information needed
- Compute Region: More information needed
- Carbon Emitted: More information needed
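Once the fields above are known, a rough estimate in the spirit of the calculator can be computed as energy consumed multiplied by the carbon intensity of the local grid. The sketch below is a simplification under that assumption. The hardware power draw, training hours, and grid intensity are hypothetical placeholders, not measurements for this model.

```python
def estimate_emissions_kg(power_watts, hours, kg_co2_per_kwh, num_devices=1):
    """Estimate emissions in kg CO2eq as energy (kWh) times grid intensity."""
    energy_kwh = power_watts / 1000.0 * hours * num_devices
    return energy_kwh * kg_co2_per_kwh


# Hypothetical values: one 250 W accelerator running for 24 hours on a grid
# emitting 0.5 kg CO2eq per kWh.
emissions = estimate_emissions_kg(power_watts=250, hours=24, kg_co2_per_kwh=0.5)
```

For these placeholder values the estimate is 3.0 kg CO2eq. Real figures depend on the actual hardware, the cloud provider, and the compute region.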
Technical Specifications [optional]
Model Architecture and Objective
More information needed
Compute Infrastructure
More information needed
Hardware
More information needed
Software
More information needed
Citation
BibTeX:
If you find MiniLM useful in your research, please cite the following paper:
@misc{wang2020minilm,
title={MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers},
author={Wenhui Wang and Furu Wei and Li Dong and Hangbo Bao and Nan Yang and Ming Zhou},
year={2020},
eprint={2002.10957},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Glossary [optional]
More information needed
More Information [optional]
More information needed
Model Card Authors [optional]
Wenhui Wang, Furu Wei
Model Card Contact
For other communications related to MiniLM, please contact Wenhui Wang ([email protected]) or Furu Wei ([email protected]).
How to Get Started with the Model
Use the code below to get started with the model.
>>> from transformers import AutoTokenizer, AutoModel
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/Multilingual-MiniLM-L12-H384")
>>> model = AutoModel.from_pretrained("microsoft/Multilingual-MiniLM-L12-H384")
>>> inputs = tokenizer("Hello world!", return_tensors="pt")
>>> outputs = model(**inputs)