---
library_name: transformers
---

# zh-CN-Multi-Mask-Bert (CNMBert)

![image](https://github.com/user-attachments/assets/a888fde7-6766-43f1-a753-810399418bda)

---

### CNMBert

| Model | Weights | Memory Usage (FP16) | Model Size | QPS | MRR | Acc |
| --------------- | ----------------------------------------------------------- | ------------------- | ---------- | ----- | ----- | ----- |
| CNMBert-Default | [Huggingface](https://huggingface.co/Midsummra/CNMBert)     | 0.4GB               | 131M       | 12.56 | 59.70 | 49.74 |
| CNMBert-MoE     | [Huggingface](https://huggingface.co/Midsummra/CNMBert-MoE) | 0.8GB               | 329M       | 3.20  | 61.53 | 51.86 |

* All models were trained on the same corpus of roughly 2 million sentences from Wikipedia and Zhihu.
* QPS is queries per second (throughput is currently poor because `predict` has not been rewritten in C...).
* MRR is the mean reciprocal rank (illustrated in the sketch below).
* Acc is accuracy.
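
For reference, here is a small, self-contained sketch (not from the repository) of how MRR and accuracy can be computed over ranked candidate lists; it assumes Acc counts a hit only when the gold word is ranked first.

```python
# Illustrative only: MRR and accuracy over ranked candidate lists.
# Each inner list holds candidate words sorted by descending score.
def mrr_and_acc(ranked: list[list[str]], gold: list[str]) -> tuple[float, float]:
    reciprocal_ranks, hits = [], 0
    for candidates, answer in zip(ranked, gold):
        if answer in candidates:
            rank = candidates.index(answer) + 1   # 1-based rank of the gold word
            reciprocal_ranks.append(1.0 / rank)
            hits += int(rank == 1)                # Acc only counts rank-1 hits
        else:
            reciprocal_ranks.append(0.0)          # gold word not recalled at all
    return sum(reciprocal_ranks) / len(gold), hits / len(gold)

print(mrr_and_acc([["块钱", "块前"], ["病", "吧"]], ["块钱", "吧"]))  # (0.75, 0.5)
```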

```python
from transformers import AutoTokenizer, BertConfig

from CustomBertModel import predict
from MoELayer import BertWwmMoE
```
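
The snippet that actually loads the model falls outside the hunks shown here; the diff context only reveals the fragment `model = BertWwmMoE.from_pretrained('Midsummra/CNMBert-MoE', config=config).to('c...`. A minimal sketch consistent with that fragment, assuming the tokenizer and config are pulled from the same Hugging Face repo:

```python
import torch
from transformers import AutoTokenizer, BertConfig

from MoELayer import BertWwmMoE

# Assumption: tokenizer and config are hosted in the same repo as the weights.
tokenizer = AutoTokenizer.from_pretrained('Midsummra/CNMBert-MoE')
config = BertConfig.from_pretrained('Midsummra/CNMBert-MoE')

# The visible fragment ends with .to('c..., presumably .to('cuda').
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = BertWwmMoE.from_pretrained('Midsummra/CNMBert-MoE', config=config).to(device)
model.eval()
```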

Predict words:

```python
print(predict("我有两千kq", "kq", model, tokenizer)[:5])
print(predict("快去给魔理沙看b吧", "b", model, tokenizer)[:5])
```

> ['块钱', 1.2056937473156175], ['块前', 0.05837443749364857], ['开千', 0.0483869208528063], ['可千', 0.03996622172280445], ['口气', 0.037183335575008414]

> ['病', 1.6893256306648254], ['吧', 0.1642467901110649], ['呗', 0.026976384222507477], ['包', 0.021441461518406868], ['报', 0.01396679226309061]

---

```python
# The default predict function uses beam search
def predict(sentence: str,
            predict_word: str,
            model,
            tokenizer,
            top_k=8,
            beam_size=16,       # beam width
            threshold=0.005,    # pruning threshold
            fast_mode=True,     # whether to use fast mode
            strict_mode=True):  # whether to validate the predicted words

# Brute-force search with backtracking, no pruning
def backtrack_predict(sentence: str,
                      predict_word: str,
                      model,
                      tokenizer,
                      top_k=10,
                      fast_mode=True,
                      strict_mode=True):
```

> Because of BERT's autoencoding nature, the order in which the MASK positions are filled changes the prediction. With `fast_mode` enabled, the input is predicted in both forward and backward order, which improves accuracy a little (about 2%) but adds a noticeable performance cost.

> `strict_mode` checks the predicted candidates and keeps only those that are real Chinese words.
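
A usage example of these switches (the parameter values below are arbitrary; only the keyword names come from the signature above):

```python
# Same call as before, but trading accuracy for speed:
# fast_mode=False skips the second (reverse-order) pass,
# strict_mode=False keeps candidates even if they are not known words.
candidates = predict("我有两千kq", "kq", model, tokenizer,
                     top_k=8, beam_size=32, threshold=0.005,
                     fast_mode=False, strict_mode=False)
print(candidates[:5])
```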

### How to fine-tune the model

See [TrainExample.ipynb](https://github.com/IgarashiAkatuki/CNMBert/blob/main/TrainExample.ipynb). Regarding the dataset format, it is enough that the first column of the CSV holds the training corpus; a minimal loading sketch follows below.
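
A minimal sketch of reading such a dataset file (the file name and column handling here are assumptions, not taken from TrainExample.ipynb):

```python
import csv

# Assumption: corpus.csv is your own file; only the first column is used for training.
with open("corpus.csv", newline="", encoding="utf-8") as f:
    corpus = [row[0] for row in csv.reader(f) if row]

print(len(corpus), corpus[:3])
```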

### Q&A

Q: The accuracy feels a bit low.

A: Try setting `fast_mode` and `strict_mode` to `False`. The model was pretrained on a very small corpus (about 2 million sentences), so limited generalization is to be expected. You can fine-tune it on a larger dataset or on a more specialized domain; the procedure is almost the same as for [Chinese-BERT-wwm](https://github.com/ymcui/Chinese-BERT-wwm), except that the data collator is replaced with `DataCollatorForMultiMask` from `CustomBertModel.py` (see the sketch below).
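
A rough sketch of that swap in a standard `transformers` training loop. Everything except the `DataCollatorForMultiMask` import is an assumption about how the fine-tuning is wired up; the authoritative version is TrainExample.ipynb.

```python
from transformers import Trainer, TrainingArguments

from CustomBertModel import DataCollatorForMultiMask

# Assumptions: `model`, `tokenizer`, and a tokenized `train_dataset` already exist.
collator = DataCollatorForMultiMask(tokenizer)  # constructor arguments are a guess

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cnmbert-finetuned", num_train_epochs=3),
    train_dataset=train_dataset,
    data_collator=collator,  # the only change vs. a Chinese-BERT-wwm style run
)
trainer.train()
```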

### Citation

If you are interested in the implementation details of CNMBert, see:

```
@misc{feng2024cnmbertmodelhanyupinyin,
      title={CNMBert: A Model For Hanyu Pinyin Abbreviation to Character Conversion Task},
      year={2024},
      eprint={2411.11770},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.11770},
}
```