---
language: en
thumbnail: https://huggingface.co/front/thumbnails/microsoft.png
tags:
- text-classification
license: mit
---

# AutoDisProxyT-RTE for Distilling Massive Neural Networks

AutoDisProxyT is a distilled, task-agnostic transformer model that leverages task transfer to learn a small universal model applicable to arbitrary tasks and languages, as outlined in the paper [Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models](https://arxiv.org/abs/2201.12507).

This AutoDisProxyT checkpoint has 7 layers, a hidden size of 160, and 10 attention heads, corresponding to 6.88 million parameters and 0.27G FLOPs.
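
As a rough sanity check on those numbers, a configuration in this regime can be instantiated with the Transformers library and its parameters counted. This is only a sketch: the vocabulary size and FFN width below are assumptions (the searched FFN dimension is not stated in this card), so the count comes out slightly above the reported 6.88M.

```python
# Approximate parameter count for a BERT-style model of this size.
# vocab_size and intermediate_size are assumptions, not values from this card.
from transformers import BertConfig, BertModel

config = BertConfig(
    vocab_size=30522,        # assumption: standard BERT WordPiece vocabulary
    hidden_size=160,         # from this card
    num_hidden_layers=7,     # from this card
    num_attention_heads=10,  # from this card
    intermediate_size=640,   # assumption: 4x hidden size; the searched width may differ
)
model = BertModel(config)
print(f"~{model.num_parameters() / 1e6:.2f}M parameters")
```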

The following table shows results on the GLUE dev set.

| Model | #Params (M) | #FLOPs (G) | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT | 109 | 11.2 | 84.5 | 91.7 | 91.3 | 68.6 | 93.2 | 87.3 | 53.5 | 82.2 |
| BERT-Small | 66 | 5.66 | 81.8 | 89.8 | 90.6 | 67.9 | 91.2 | 84.9 | 53.5 | 80.0 |
| Truncated BERT | 66 | 5.66 | 81.2 | 87.9 | 90.4 | 65.5 | 90.8 | 82.7 | 41.4 | 77.1 |
| DistilBERT | 66 | 5.66 | 82.2 | 89.2 | 88.5 | 59.9 | 91.3 | 87.5 | 51.3 | 78.6 |
| TinyBERT | 66 | 5.66 | 83.5 | 90.5 | 90.6 | 72.2 | 91.6 | 88.4 | 42.8 | 79.9 |
| MiniLM | 66 | 5.66 | 84.0 | 91.0 | 91.0 | 71.5 | 92.0 | 88.4 | 49.2 | 81.0 |
| AutoTinyBERT-KD-S1 | 30.0 | 1.69 | 82.3 | 89.7 | 89.9 | 71.1 | 91.4 | 88.5 | 47.3 | 80.0 |
| DynaBERT | 37.7 | 1.81 | 82.3 | 88.5 | 90.4 | 63.2 | 92.0 | 81.4 | 43.7 | 76.4 |
| NAS-BERT-10 | 10.0 | 2.30 | 76.4 | 86.3 | 88.5 | 66.6 | 88.6 | 79.1 | 34.0 | 74.2 |
| AutoTinyBERT-KD-S4 | 66 | 5.66 | 76.0 | 85.5 | 86.9 | 64.9 | 86.8 | 81.4 | 20.4 | 71.7 |
| NAS-BERT-5 | 66 | 5.66 | 74.4 | 84.9 | 85.8 | 66.6 | 87.3 | 79.6 | 19.8 | 71.2 |
| AutoDisProxyT | 6.88 | 0.27 | 79.0 | 86.4 | 89.1 | 64.3 | 85.9 | 78.5 | 24.8 | 72.6 |

This checkpoint was tested with PyTorch 1.6.0.
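
Below is a minimal inference sketch for an RTE-style sentence pair. It assumes the checkpoint loads through the standard Auto classes and ships a sequence-classification head; the model id is a placeholder for wherever this checkpoint is hosted, so adapt it as needed.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "<this-repo-id>"  # placeholder: replace with the actual repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# RTE is a sentence-pair (premise/hypothesis) classification task.
premise = "A distilled 7-layer model can be fine-tuned on RTE."
hypothesis = "The model can be adapted to RTE."
inputs = tokenizer(premise, hypothesis, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))
```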

If you use this checkpoint in your work, please cite:

@article{xu2022autodistil,
  title={AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models},
  author={Xu, Dongkuan and Mukherjee, Subhabrata and Liu, Xiaodong and Dey, Debadeepta and Wang, Wenhui and Zhang, Xiang and Awadallah, Ahmed Hassan and Gao, Jianfeng},
  journal={arXiv preprint arXiv:2201.12507},
  year={2022}
}