metadata

language:
  - en

Model Card for MiniLMv2

Small and fast pre-trained models for language understanding and generation

Model Details

Model Description

MiniLMv2 provides the pre-trained models for the paper "MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers". We generalize the deep self-attention distillation of MiniLMv1 by using self-attention relation distillation for task-agnostic compression of pre-trained Transformers. The proposed method removes the restriction on the number of the student's attention heads. Our monolingual and multilingual small models, distilled from base-size and large-size teacher models, achieve competitive performance.
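
To make the idea more concrete, here is a minimal pure-Python sketch of a self-attention relation loss. It is an illustration under stated assumptions, not the released training code: it assumes that the query (or key, or value) vectors of one layer are re-split into the same number of relation heads for teacher and student, that a relation is the softmax of scaled dot-products between token vectors, and that teacher and student relations are compared with KL divergence. The function names and the toy inputs are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def relation(vectors, num_relation_heads):
    """Return per-head token-to-token relations for one kind of vector.

    `vectors` holds one vector per token (all queries, all keys, or all
    values of a layer, concatenated over attention heads). Each vector is
    re-split into `num_relation_heads` chunks, and within each chunk the
    relation is softmax(scaled dot-product of every token with every token).
    """
    dim = len(vectors[0])
    assert dim % num_relation_heads == 0, "dimension must divide evenly"
    size = dim // num_relation_heads
    scale = math.sqrt(size)
    heads = []
    for h in range(num_relation_heads):
        chunk = [v[h * size:(h + 1) * size] for v in vectors]
        heads.append([
            softmax([sum(a * b for a, b in zip(x, y)) / scale for y in chunk])
            for x in chunk
        ])
    return heads  # shape: [num_relation_heads][num_tokens][num_tokens]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def relation_loss(teacher_vectors, student_vectors, num_relation_heads):
    """Average KL divergence between teacher and student relations."""
    teacher = relation(teacher_vectors, num_relation_heads)
    student = relation(student_vectors, num_relation_heads)
    num_tokens = len(teacher_vectors)
    total = sum(
        kl_divergence(t_row, s_row)
        for t_head, s_head in zip(teacher, student)
        for t_row, s_row in zip(t_head, s_head)
    )
    return total / (num_relation_heads * num_tokens)


# Toy example: 3 tokens, teacher width 8, student width 4, 2 relation heads.
teacher_queries = [
    [0.1, 0.3, -0.2, 0.5, 0.0, 0.4, -0.1, 0.2],
    [0.6, -0.4, 0.2, 0.1, 0.3, 0.0, 0.5, -0.3],
    [-0.2, 0.1, 0.4, 0.0, 0.2, -0.5, 0.1, 0.3],
]
student_queries = [
    [0.2, -0.1, 0.3, 0.0],
    [0.5, 0.4, -0.2, 0.1],
    [-0.3, 0.2, 0.1, 0.6],
]
loss = relation_loss(teacher_queries, student_queries, num_relation_heads=2)
```

Because each relation has shape [relation heads, tokens, tokens], it depends neither on the hidden size nor on the number of attention heads of either model, which is why the student is free to use a different number of heads than the teacher.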

Developed by: More information needed
Shared by [Optional]: More information needed
Model type: Language model
Language(s) (NLP): en
License: Microsoft Open Source Code of Conduct
Related Models: More information needed
- Parent Model: xlm-roberta
Resources for more information:
- GitHub Repo
- Associated Paper

Uses

Direct Use

More information is needed.

Downstream Use [Optional]

More information is needed

Out-of-Scope Use

The model should not be used to intentionally create hostile or alienating environments for people.

Bias, Risks, and Limitations

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.

Training Details

Training Data

Following Lewis et al. (2019b), we adopt SQuAD 1.1 as training data and use MLQA English development data for early stopping.

Training Procedure

Preprocessing

More information needed

Speeds, Sizes, Times

We compress XLMR-Large into 12-layer and 6-layer models with a hidden size of 384 and report the zero-shot performance on the XNLI and MLQA test sets.

[English] Pre-trained Models

Model	Teacher Model	Speedup	#Param	MNLI-m (Acc)	SQuAD 2.0 (F1)
L6xH768 MiniLMv2	RoBERTa-Large	2.0x	81M	87.0	81.6
L12xH384 MiniLMv2	RoBERTa-Large	2.7x	41M	86.9	82.3
L6xH384 MiniLMv2	RoBERTa-Large	5.3x	30M	84.4	76.4
L6xH768 MiniLMv2	BERT-Large Uncased	2.0x	66M	85.0	77.7
L6xH384 MiniLMv2	BERT-Large Uncased	5.3x	22M	83.0	74.3
L6xH768 MiniLMv2	BERT-Base Uncased	2.0x	66M	84.2	76.3
L6xH384 MiniLMv2	BERT-Base Uncased	5.3x	22M	82.8	72.9
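
The #Param column can be approximately reproduced from the layer count and hidden size. The sketch below is a back-of-the-envelope estimate, not the authors' own accounting. It assumes a standard Transformer encoder layer with a feed-forward size of 4x the hidden size, counts only word embeddings plus Transformer layers, and assumes the usual vocabulary sizes of BERT uncased (30,522) and RoBERTa (50,265). The helper names are hypothetical.

```python
def transformer_layer_params(hidden):
    """Parameters of one standard Transformer encoder layer (assumed layout)."""
    attention = 4 * (hidden * hidden + hidden)   # Q, K, V, output projections with biases
    ffn_size = 4 * hidden                        # assumption: feed-forward size is 4x hidden
    feed_forward = (hidden * ffn_size + ffn_size) + (ffn_size * hidden + hidden)
    layer_norms = 2 * (2 * hidden)               # two LayerNorms, each with scale and bias
    return attention + feed_forward + layer_norms


def estimate_params(layers, hidden, vocab_size):
    """Word embeddings plus Transformer layers; other embeddings are ignored."""
    return vocab_size * hidden + layers * transformer_layer_params(hidden)


BERT_UNCASED_VOCAB = 30522   # assumption: standard BERT uncased vocabulary size
ROBERTA_VOCAB = 50265        # assumption: standard RoBERTa vocabulary size

students = [
    ("L6xH768, RoBERTa-Large teacher", 6, 768, ROBERTA_VOCAB),
    ("L12xH384, RoBERTa-Large teacher", 12, 384, ROBERTA_VOCAB),
    ("L6xH384, RoBERTa-Large teacher", 6, 384, ROBERTA_VOCAB),
    ("L6xH768, BERT uncased teacher", 6, 768, BERT_UNCASED_VOCAB),
    ("L6xH384, BERT uncased teacher", 6, 384, BERT_UNCASED_VOCAB),
]

for name, layers, hidden, vocab in students:
    millions = estimate_params(layers, hidden, vocab) / 1e6
    print(f"{name}: ~{millions:.0f}M parameters")
```

Under these assumptions the estimates round to the 81M, 41M, 30M, 66M, and 22M reported in the table. The gap between students of the same shape comes from the teacher's vocabulary size, since the word embeddings dominate at small hidden sizes.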

Evaluation

Testing Data, Factors & Metrics

Testing Data

Fine-tuning on NLU tasks

MiniLM has the same Transformer architecture as BERT. For NLU tasks, the PyTorch versions of our models can be loaded using the BERT code in huggingface/transformers. The config file needs to be replaced with MiniLM's.
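
As an illustration of what replacing the config involves, the sketch below derives a MiniLM-style config from a BERT-Base-style one using only the Python standard library. The field names follow the BERT config format in huggingface/transformers. The concrete values (12 attention heads, a feed-forward size of 4x the hidden size) are assumptions for illustration, so the config file released with each checkpoint should take precedence. `minilm_style_config` is a hypothetical helper.

```python
import json

# BERT-Base-style architecture fields (standard BERT-Base sizes).
bert_base_config = {
    "hidden_size": 768,
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "type_vocab_size": 2,
    "vocab_size": 30522,
}


def minilm_style_config(base, num_layers, hidden_size, num_heads=12):
    """Copy `base` and override the fields that change with model size."""
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    config = dict(base)  # shallow copy so the base config stays untouched
    config.update(
        num_hidden_layers=num_layers,
        hidden_size=hidden_size,
        num_attention_heads=num_heads,      # assumption: 12 heads
        intermediate_size=4 * hidden_size,  # assumption: 4x hidden size
    )
    return config


# L12xH384 and L6xH384 are the model sizes named in this card.
l12_h384 = minilm_style_config(bert_base_config, num_layers=12, hidden_size=384)
l6_h384 = minilm_style_config(bert_base_config, num_layers=6, hidden_size=384)
print(json.dumps(l12_h384, indent=2))
```

Only the size-related fields differ from the BERT-Base-style config, which is consistent with the statement that the architecture itself is unchanged.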

We present the dev results on SQuAD 2.0 and several GLUE benchmark tasks.

Model	#Param	SQuAD 2.0	MNLI-m	SST-2	QNLI	CoLA	RTE	MRPC	QQP
BERT-Base	109M	76.8	84.5	93.2	91.7	58.9	68.6	87.3	91.3
MiniLM-L12xH384	33M	81.7	85.7	93.0	91.5	58.5	73.3	89.5	91.3
MiniLM-L6xH384	22M	75.6	83.3	91.5	90.5	47.5	68.8	88.9	90.6

Factors

More information needed

Metrics

We evaluate the multilingual MiniLM on the cross-lingual natural language inference benchmark (XNLI) and the cross-lingual question answering benchmark (MLQA).

Cross-Lingual Natural Language Inference - XNLI

We evaluate our model on cross-lingual transfer from English to other languages. Following Conneau et al. (2019), we select the best single model on the joint dev set of all the languages.

Model	#Layers	#Hidden	#Transformer Parameters	Average	en	fr	es	de	el	bg	ru	tr	ar	vi	th	zh	hi	sw	ur
mBERT	12	768	85M	66.3	82.1	73.8	74.3	71.1	66.4	68.9	69.0	61.6	64.9	69.5	55.8	69.3	60.0	50.4	58.0
XLM-100	16	1280	315M	70.7	83.2	76.7	77.7	74.0	72.7	74.1	72.7	68.7	68.6	72.9	68.9	72.5	65.6	58.2	62.4
XLM-R Base	12	768	85M	74.5	84.6	78.4	78.9	76.8	75.9	77.3	75.4	73.2	71.5	75.4	72.5	74.9	71.1	65.2	66.5
mMiniLM-L12xH384	12	384	21M	71.1	81.5	74.8	75.7	72.9	73.0	74.5	71.3	69.7	68.8	72.1	67.8	70.0	66.2	63.3	64.2
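
The Average column appears to be the unweighted mean of the 15 per-language accuracies. The short check below recomputes it for two rows of the table above. This is a reading of the numbers rather than a statement from the paper, and `xnli_average` is a hypothetical helper.

```python
LANGUAGES = ["en", "fr", "es", "de", "el", "bg", "ru", "tr",
             "ar", "vi", "th", "zh", "hi", "sw", "ur"]

# Per-language XNLI accuracies copied from the table above.
scores = {
    "mBERT": [82.1, 73.8, 74.3, 71.1, 66.4, 68.9, 69.0, 61.6,
              64.9, 69.5, 55.8, 69.3, 60.0, 50.4, 58.0],
    "mMiniLM-L12xH384": [81.5, 74.8, 75.7, 72.9, 73.0, 74.5, 71.3, 69.7,
                         68.8, 72.1, 67.8, 70.0, 66.2, 63.3, 64.2],
}


def xnli_average(accuracies):
    """Unweighted mean over languages, rounded to one decimal place."""
    assert len(accuracies) == len(LANGUAGES)
    return round(sum(accuracies) / len(accuracies), 1)


for model, accuracies in scores.items():
    print(model, xnli_average(accuracies))
```

Both recomputed values match the Average column (66.3 for mBERT and 71.1 for mMiniLM-L12xH384).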

Results

We present question generation results following the same data split as Du et al. (2017).

Model	#Param	BLEU-4	METEOR	ROUGE-L
MiniLM-L12xH384	33M	21.07	24.09	49.14
MiniLM-L6xH384	22M	20.31	23.43	48.21

We also report results following the data split of Zhao et al. (2018), which uses the reversed dev-test setup.

Model	#Param	BLEU-4	METEOR	ROUGE-L
MiniLM-L12xH384	33M	23.27	25.15	50.60
MiniLM-L6xH384	22M	22.01	24.24	49.51

Model Examination

More information needed

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Hardware Type: More information needed
Hours used: More information needed
Cloud Provider: More information needed
Compute Region: More information needed
Carbon Emitted: More information needed

Technical Specifications [optional]

Model Architecture and Objective

More information needed

Compute Infrastructure

More information needed

Hardware

More information needed

Software

More information needed

Citation

BibTeX:

If you find MiniLM useful in your research, please cite the following paper:

@misc{wang2020minilm,
    title={MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers},
    author={Wenhui Wang and Furu Wei and Li Dong and Hangbo Bao and Nan Yang and Ming Zhou},
    year={2020},
    eprint={2002.10957},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Glossary [optional]

More information needed

More Information [optional]

More information needed

Model Card Authors [optional]

Wenhui Wang, Furu Wei

Model Card Contact

For other communications related to MiniLM, please contact Wenhui Wang ([email protected]), Furu Wei ([email protected]).

How to Get Started with the Model

Use the code below to get started with the model.

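>>> from transformers import AutoTokenizer, AutoModel

>>> # Load the tokenizer and the multilingual MiniLM checkpoint from the Hugging Face Hub.
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/Multilingual-MiniLM-L12-H384")
>>> model = AutoModel.from_pretrained("microsoft/Multilingual-MiniLM-L12-H384")

>>> # Tokenize a sentence into PyTorch tensors and run a forward pass.
>>> inputs = tokenizer("Hello world!", return_tensors="pt")
>>> outputs = model(**inputs)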