ncsu-dk-lab/AutoDisProxyT-RTE

Jinawei committed on Apr 7, 2023

Commit 345ee39 · 1 Parent(s): 28939a1

Create Readme.md

Files changed (1)

Readme.md +44 -0

	@@ -0,0 +1,44 @@

+---
+language: en
+thumbnail: https://huggingface.co/front/thumbnails/microsoft.png
+tags:
+- text-classification
+license: mit
+---
+# AutoDisProxyT-RTE for Distilling Massive Neural Networks
+AutoDisProxyT is a distilled, task-agnostic transformer model. It leverages task transfer to learn a small universal model that can be applied to arbitrary tasks and languages, as outlined in the paper [Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models](https://proceedings.neurips.cc/paper_files/paper/2022/file/b7c12689a89e98a61bcaa65285a41b7c-Paper-Conference.pdf).
+This AutoDisProxyT checkpoint has **7** layers, a hidden size of **160**, and **10** attention heads, which corresponds to **6.88 million** parameters and **0.27G** FLOPs.
+The following table shows results on the GLUE dev set.
+| Models         | #Params (M) | #FLOPs (G) | MNLI | QNLI | QQP  | RTE  | SST-2  | MRPC | CoLA | Avg   |
+|----------------|--------|---------|------|------|------|------|------|------|--------|-------|
+| BERT        | 109    | 11.2       | 84.5 | 91.7 | 91.3 | 68.6 | 93.2 | 87.3 | 53.5   | 82.2 |
+| BERT<sub>SMALL</sub>  | 66    | 5.66       | 81.8 | 89.8 | 90.6 | 67.9 | 91.2 | 84.9 | 53.5   | 80.0 |
+| TruncatedBERT  | 66    | 5.66       | 81.2 | 87.9 | 90.4 | 65.5 | 90.8 | 82.7 | 41.4   | 77.1 |
+| DistilBERT  | 66     | 5.66       | 82.2 | 89.2 | 88.5 | 59.9 | 91.3 | 87.5 | 51.3   | 78.6 |
+| TinyBERT    | 66     | 5.66       | 83.5 | 90.5 | 90.6 | 72.2 | 91.6 | 88.4 | 42.8   | 79.9 |
+| MiniLM      | 66     | 5.66       | 84.0   | 91.0   | 91.0   | 71.5 | 92.0   | 88.4 | 49.2  | 81.0  |
+| AutoTinyBERT-KD-S1 | 30.0 |  1.69  | 82.3 | 89.7 | 89.9 | 71.1 | 91.4 | 88.5  | 47.3 | 80.0  |
+| DynaBERT | 37.7 | 1.81 | 82.3 | 88.5  | 90.4   | 63.2 | 92.0   | 81.4 | 43.7   | 76.4  |
+| NAS-BERT<sub>10</sub>| 10.0 | 2.30 | 76.4 | 86.3 | 88.5   | 66.6 | 88.6 | 79.1 | 34.0 | 74.2  |
+| AutoTinyBERT-KD-S4 | 66 | 5.66 | 76.0 | 85.5  | 86.9 | 64.9 | 86.8 | 81.4 | 20.4 | 71.7  |
+| NAS-BERT<sub>5</sub> | 66 | 5.66 | 74.4 | 84.9  | 85.8  | 66.6 | 87.3 | 79.6 | 19.8   | 71.2  |
+| **AutoDisProxyT**   | 6.88  | 0.27  | 79.0 | 86.4 | 89.1  | 64.3 | 85.9 |  78.5  |  24.8  | 72.6 |
+Tested with `torch 1.6.0`.
+If you use this checkpoint in your work, please cite:
+```bibtex
+@article{xu2022autodistil,
+  title={AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models},
+  author={Xu, Dongkuan and Mukherjee, Subhabrata and Liu, Xiaodong and Dey, Debadeepta and Wang, Wenhui and Zhang, Xiang and Awadallah, Ahmed Hassan and Gao, Jianfeng},
+  journal={arXiv preprint arXiv:2201.12507},
+  year={2022}
+}
+```
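
The size and accuracy trade-off in the GLUE table above can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the committed README: the numbers are copied from the table, while the `MODELS` dictionary and the `compare` helper are illustrative names introduced here.

```python
# Figures copied from the GLUE dev-set table in the README above.
# Each entry: (parameters in millions, FLOPs in billions, average GLUE score).
# MODELS and compare() are hypothetical names used only for this sketch.
MODELS = {
    "BERT": (109.0, 11.2, 82.2),
    "MiniLM": (66.0, 5.66, 81.0),
    "AutoDisProxyT": (6.88, 0.27, 72.6),
}


def compare(small: str, reference: str) -> dict:
    """Return size reduction factors and the fraction of average score retained."""
    s_params, s_flops, s_avg = MODELS[small]
    r_params, r_flops, r_avg = MODELS[reference]
    return {
        "param_reduction": r_params / s_params,  # how many times fewer parameters
        "flops_reduction": r_flops / s_flops,    # how many times fewer FLOPs
        "avg_retained": s_avg / r_avg,           # share of the reference average score
    }


if __name__ == "__main__":
    for reference in ("BERT", "MiniLM"):
        result = compare("AutoDisProxyT", reference)
        print(
            f"vs {reference}: "
            f"{result['param_reduction']:.1f}x fewer parameters, "
            f"{result['flops_reduction']:.1f}x fewer FLOPs, "
            f"{result['avg_retained']:.1%} of the average score"
        )
```

With the table's figures, AutoDisProxyT has roughly 15.8x fewer parameters and 41.5x fewer FLOPs than BERT while keeping about 88% of its average GLUE score.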