CNMBert-MoE

[Github](https://github.com/IgarashiAkatuki/zh-CN-Multi-Mask-Bert)

# zh-CN-Multi-Mask-Bert (CNMBert)

![image](https://github.com/user-attachments/assets/a888fde7-6766-43f1-a753-810399418bda)

---

A model for translating Hanyu Pinyin abbreviations back into Chinese characters.

CNMBert is trained from [Chinese-BERT-wwm](https://github.com/ymcui/Chinese-BERT-wwm) by modifying its pre-training task to fit the pinyin-abbreviation translation task. It reaches state of the art on this task, outperforming fine-tuned GPT models as well as GPT-4o.

---

## What is a pinyin abbreviation?

Abbreviations such as

> "bhys" -> "不好意思" ("sorry")
>
> "ys" -> "原神" ("Genshin Impact")

that replace Chinese characters with the first letters of their Hanyu Pinyin are what we will loosely call pinyin abbreviations.

If you are curious about pinyin abbreviations, this Zhihu answer (in Chinese) is worth a read:

[大家为什么会讨厌缩写? - 远方青木的回答 - 知乎](https://www.zhihu.com/question/269016377/answer/2654824753)
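
As a quick illustration of the mapping, an abbreviation can be derived mechanically by taking the first letter of each character's pinyin; CNMBert solves the much harder inverse problem. A minimal sketch using the third-party `pypinyin` package (not part of this repository):

```python
from pypinyin import lazy_pinyin  # third-party package: pip install pypinyin

def pinyin_abbreviation(text: str) -> str:
    """Take the first letter of each character's pinyin, e.g. 不好意思 -> bhys."""
    return "".join(syllable[0] for syllable in lazy_pinyin(text))

print(pinyin_abbreviation("不好意思"))  # bhys
print(pinyin_abbreviation("原神"))      # ys
```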

### CNMBert

| Model | Weights | Memory Usage (FP16) | QPS | MRR | Acc |
| --------------- | ----------------------------------------------------------- | ------------------- | ----- | ----- | ----- |
| CNMBert-Default | [Huggingface](https://huggingface.co/Midsummra/CNMBert) | 0.4GB | 12.56 | 58.88 | 49.13 |
| CNMBert-MoE | [Huggingface](https://huggingface.co/Midsummra/CNMBert-MoE) | 0.8GB | 3.20 | 60.56 | 51.09 |

* All models were trained on the same 1.5 million entries of wiki and Zhihu corpus.
* QPS is queries per second (performance is currently poor because predict has not yet been rewritten in C...).
* MRR is the mean reciprocal rank (see the sketch below).
* Acc is the accuracy.
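
For readers unfamiliar with the metric, MRR averages the reciprocal rank of the correct word over all test queries (the table values appear to be scaled by 100). A minimal illustrative computation, not the repository's evaluation code:

```python
# Illustrative only; each candidate list is assumed to be ranked best-first.
def mean_reciprocal_rank(ranked_candidates, gold_words):
    total = 0.0
    for candidates, gold in zip(ranked_candidates, gold_words):
        rank = next((i + 1 for i, cand in enumerate(candidates) if cand == gold), None)
        if rank is not None:
            total += 1.0 / rank  # queries whose gold word never appears contribute 0
    return total / len(gold_words)

# Gold word ranked 1st for the first query, 2nd for the second: MRR = (1 + 0.5) / 2.
print(mean_reciprocal_rank([["块钱", "块前"], ["块前", "块钱"]], ["块钱", "块钱"]))  # 0.75
```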
### Usage

```python
from transformers import AutoTokenizer, BertConfig

# CustomBertModel.py and MoELayer.py are provided by the CNMBert project (see the Github link above)
from CustomBertModel import fixed_predict
from MoELayer import BertWwmMoE
```

Load the model:

```python
# Use CNMBert with MoE.
# To use CNMBert without MoE, replace every "Midsummra/CNMBert-MoE" with
# "Midsummra/CNMBert" and load it with BertForMaskedLM instead of BertWwmMoE.
tokenizer = AutoTokenizer.from_pretrained("Midsummra/CNMBert-MoE")
config = BertConfig.from_pretrained('Midsummra/CNMBert-MoE')
model = BertWwmMoE.from_pretrained('Midsummra/CNMBert-MoE', config=config).to('cuda')

# model = BertForMaskedLM.from_pretrained('Midsummra/CNMBert').to('cuda')
```

Predict words:

```python
print(fixed_predict("我有两千kq", "kq", model, tokenizer)[:5])
print(fixed_predict("快去给魔理沙看b吧", "b", model, tokenizer)[:5])
```

`fixed_predict` returns candidate words with their scores, best first:

> ['块钱', 1.2056937473156175], ['块前', 0.05837443749364857], ['开千', 0.0483869208528063], ['可千', 0.03996622172280445], ['口气', 0.037183335575008414]

> ['病', 1.6893256306648254], ['吧', 0.1642467901110649], ['呗', 0.026976384222507477], ['包', 0.021441461518406868], ['报', 0.01396679226309061]
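
For example, to keep only the single best expansion (assuming, as the output above suggests, that candidates come back as [word, score] pairs sorted by score in descending order):

```python
# Hypothetical usage: pick the top-ranked expansion of the abbreviation "kq".
candidates = fixed_predict("我有两千kq", "kq", model, tokenizer)
best_word, best_score = candidates[0]
print(best_word)  # '块钱'
```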

### How to fine-tune the model

Please refer to TrainExample.ipynb. As for the dataset format, the only requirement is that the first column of the csv contains the training corpus.

### Q&A

Q: This thing is way too slow!!!

A: A rewrite of predict in C is already planned...

Q: Its accuracy is pretty poor.

A: Because the model was pre-trained on a very small corpus (about 2 million entries), weak generalization is to be expected. You can fine-tune it on a larger or more domain-specific dataset; the procedure is largely the same as for [Chinese-BERT-wwm](https://github.com/ymcui/Chinese-BERT-wwm), except that the data collator is replaced with `DataCollatorForMultiMask` from `CustomBertModel.py`, as sketched below.
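
A minimal sketch of what such a fine-tuning loop might look like, assuming `DataCollatorForMultiMask` follows the usual `transformers` data-collator interface and that `train.csv` keeps the corpus in its first column; the constructor arguments are hypothetical, so check `CustomBertModel.py` and TrainExample.ipynb for the actual recipe:

```python
import pandas as pd
from transformers import AutoTokenizer, BertForMaskedLM, Trainer, TrainingArguments

from CustomBertModel import DataCollatorForMultiMask  # used instead of the stock data collator

tokenizer = AutoTokenizer.from_pretrained("Midsummra/CNMBert")
model = BertForMaskedLM.from_pretrained("Midsummra/CNMBert")

# First column of the csv is the training corpus.
texts = pd.read_csv("train.csv").iloc[:, 0].astype(str).tolist()
encodings = tokenizer(texts, truncation=True, max_length=128)
train_dataset = [{"input_ids": ids} for ids in encodings["input_ids"]]

# Hypothetical constructor arguments; see CustomBertModel.py for the real signature.
data_collator = DataCollatorForMultiMask(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cnmbert-finetuned", num_train_epochs=1),
    train_dataset=train_dataset,
    data_collator=data_collator,
)
trainer.train()
```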

### Citation

If you are interested in the implementation details of CNMBert, please refer to:

```
@misc{feng2024cnmbertmodelhanyupinyin,
      title={CNMBert: A Model For Hanyu Pinyin Abbreviation to Character Conversion Task},
      author={Zishuo Feng and Feng Cao},
      year={2024},
      eprint={2411.11770},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.11770},
}
```