Upload 8 files

Files added:
- README.eval_brief.md (+280)
- config.json (+30)
- csc.config (+47)
- pytorch_model.bin (+3)
- special_tokens_map.json (+7)
- tokenizer.json
- tokenizer_config.json (+13)
- vocab.txt

README.eval_brief.md
ADDED

# relm_v1

## Overview (relm_v1)
- macro-correct: Chinese Spelling Correction (CSC) evaluation and text correction; this repository provides the model weights.
- Project home: [https://github.com/yongzhuo/macro-correct](https://github.com/yongzhuo/macro-correct)
- This checkpoint is relm_v1, built on the ReLM architecture. Its key idea is to recast the generation task as a sentence-pair MLM task: sent1 + [mask]*n  -->>  sent1 + sent1_correct (see the short sketch after this list).
- How to use: 1. call it directly with transformers; 2. call it through the [macro-correct](https://github.com/yongzhuo/macro-correct) project; see ***4. Usage*** for details.

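To make the rephrasing formulation concrete, here is a minimal sketch of how an input is assembled for this model; the same construction appears in the transformers example in section 4.2, and the sentence is one of the test sentences used later in this card.
```
# Sketch of the ReLM rephrasing input: the source sentence, a [SEP], then one
# [MASK] per source character; the MLM fills the masks with the corrected text.
source = "少先队员因该为老人让坐"
relm_input = source + "[SEP]" + "[MASK]" * len(source)
print(relm_input)
# -> 少先队员因该为老人让坐[SEP] followed by 11 [MASK] tokens
```
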
## Contents
* [1. Evaluation](#1-evaluation)
* [2. Key Metrics](#2-key-metrics)
* [3. Conclusions](#3-conclusions)
* [4. Usage](#4-usage)
* [5. Papers](#5-papers)
* [6. References](#6-references)
* [7. Citation](#7-citation)


## 1. Evaluation
### 1.1 Evaluation data sources
The evaluation data is available at [Macropodus/csc_eval_public](https://huggingface.co/datasets/Macropodus/csc_eval_public). All training data comes from the public web or open-source datasets, about 10 million training samples in total, with a fairly large confusion dictionary.
```
1. gen_de3.json (5545): "的/地/得" correction; generated from high-quality sources such as People's Daily, Xuexi Qiangguo and chinese-poetry;
2. lemon_v2.tet.json (1053): dataset proposed in the ReLM paper; a multi-domain spelling-correction benchmark covering 7 domains: game (GAM), encyclopedia (ENC), contract (COT), medical care (MEC), car (CAR), novel (NOV), and news (NEW);
3. acc_rmrb.tet.json (4636): from NER-199801 (high-quality People's Daily corpus);
4. acc_xxqg.tet.json (5000): high-quality corpus from the Xuexi Qiangguo website;
5. gen_passage.tet.json (10000): source text is well-formed sentences generated by Qwen; errors injected with a confusion dictionary aggregated from almost all open-source data;
6. textproof.tet.json (1447): NLP competition data, TextProofreadingCompetition;
7. gen_xxqg.tet.json (5000): source text is high-quality corpus from the Xuexi Qiangguo website; errors injected with the same aggregated confusion dictionary;
8. faspell.dev.json (1000): video subtitles obtained via OCR; from iQIYI's FASPell paper;
9. lomo_tet.json (5000): mainly phonetically similar spelling errors; from Tencent; the human-annotated CSCD-NS dataset;
10. mcsc_tet.5000.json (5000): medical spelling correction; from real historical logs of the Tencent Yidian app; note that the paper states this dataset only targets corrections of medical entities, not of common characters;
11. ecspell.dev.json (1500): from the ECSpell paper, covering three domains (law/med/gov);
12. sighan2013.dev.json (1000): from the SIGHAN-2013 shared task;
13. sighan2014.dev.json (1062): from the SIGHAN-2014 shared task;
14. sighan2015.dev.json (1100): from the SIGHAN-2015 shared task;
```
### 1.2 Evaluation data preprocessing
```
All evaluation data is normalized with full-width to half-width conversion, traditional-to-simplified conversion, punctuation normalization and similar operations (see the sketch below).
```

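A minimal sketch of this kind of normalization, assuming the `opencc` package for traditional-to-simplified conversion; the exact rules used to build the benchmark may differ.
```
# Illustrative normalization: full-width -> half-width, traditional -> simplified
# (via opencc, an assumed dependency), plus a small punctuation map.
from opencc import OpenCC

cc = OpenCC("t2s")  # traditional Chinese -> simplified Chinese
PUNCT_MAP = {"“": '"', "”": '"', "‘": "'", "’": "'"}  # example subset only

def full_to_half(text):
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                 # ideographic (full-width) space
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:     # full-width ASCII block
            code -= 0xFEE0
        out.append(chr(code))
    return "".join(out)

def normalize(text):
    text = full_to_half(text)
    text = cc.convert(text)
    return "".join(PUNCT_MAP.get(ch, ch) for ch in text)

print(normalize("機器學習,很棒!"))  # -> 机器学习,很棒!
```
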
### 1.3 Additional notes
```
1. Metrics containing "common" are the lenient metrics, the same as those used by the open-source pycorrector project (see the sketch below);
2. Metrics containing "strict" are the strict metrics, the same as those of the open-source project [wangwang110/CSC](https://github.com/wangwang110/CSC);
3. The macbert4mdcspell_v1 model is trained with the MDCSpell architecture plus the BERT MLM loss, but only the BERT MLM head is used at inference time;
4. The acc_rmrb/acc_xxqg datasets contain no errors and are used to evaluate the false-correction rate (over-correction);
5. qwen25_1-5b_pycorrector is the shibing624/chinese-text-correction-1.5b model; its training data includes the dev and test sets of lemon_v2/mcsc_tet/ecspell, whereas the BERT-style models are not trained on any dev or test set;
```

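As a rough illustration of what such a lenient, sentence-level metric looks like, here is a minimal sketch; the exact pycorrector-style bookkeeping may differ, and all names are illustrative.
```
# Sketch of a lenient sentence-level correction metric (definitions vary by project).
def sentence_correction_prf(sources, targets, predictions):
    tp = fp = fn = 0
    for src, tgt, pred in zip(sources, targets, predictions):
        has_error = src != tgt
        changed = pred != src
        if has_error and pred == tgt:
            tp += 1  # erroneous sentence fully corrected
        if changed and pred != tgt:
            fp += 1  # made an edit that does not match the gold sentence
        if has_error and pred != tgt:
            fn += 1  # erroneous sentence left wrong (missed or mis-corrected)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```
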
## 2. Key Metrics
### 2.1 F1 (common_cor_f1)
| model/common_cor_f1 | avg | gen_de3 | lemon_v2 | gen_passage | text_proof | gen_xxqg | faspell | lomo_tet | mcsc_tet | ecspell | sighan2013 | sighan2014 | sighan2015 |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| macbert4csc_pycorrector | 45.8 | 42.44 | 42.89 | 31.49 | 46.31 | 26.06 | 32.7 | 44.83 | 27.93 | 55.51 | 70.89 | 61.72 | 66.81 |
| bert4csc_v1 | 62.28 | 93.73 | 61.99 | 44.79 | 68.0 | 35.03 | 48.28 | 61.8 | 64.41 | 79.11 | 77.66 | 51.01 | 61.54 |
| macbert4csc_v1 | 68.55 | 96.67 | 65.63 | 48.4 | 75.65 | 38.43 | 51.76 | 70.11 | 80.63 | 85.55 | 81.38 | 57.63 | 70.7 |
| macbert4csc_v2 | 68.6 | 96.74 | 66.02 | 48.26 | 75.78 | 38.84 | 51.91 | 70.17 | 80.71 | 85.61 | 80.97 | 58.22 | 69.95 |
| macbert4mdcspell_v1 | 71.1 | 96.42 | 70.06 | 52.55 | 79.61 | 43.37 | 53.85 | 70.9 | 82.38 | 87.46 | 84.2 | 61.08 | 71.32 |
| relm_v1 | 54.12 | 89.86 | 51.79 | 38.4 | 63.74 | 30.6 | 31.95 | 49.82 | 64.7 | 73.57 | 66.4 | 39.87 | 48.8 |
| qwen25_1-5b_pycorrector | 45.11 | 27.29 | 89.48 | 14.61 | 83.9 | 13.84 | 18.2 | 36.71 | 96.29 | 88.2 | 36.41 | 15.64 | 20.73 |

### 2.2 Accuracy (common_cor_acc)
| model/common_cor_acc | avg | gen_de3 | lemon_v2 | gen_passage | text_proof | gen_xxqg | faspell | lomo_tet | mcsc_tet | ecspell | sighan2013 | sighan2014 | sighan2015 |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| macbert4csc_pycorrector | 48.26 | 26.96 | 28.68 | 34.16 | 55.29 | 28.38 | 22.2 | 60.96 | 57.16 | 67.73 | 55.9 | 68.93 | 72.73 |
| bert4csc_v1 | 60.76 | 88.21 | 45.96 | 43.13 | 68.97 | 35.0 | 34.0 | 65.86 | 73.26 | 81.8 | 64.5 | 61.11 | 67.27 |
| macbert4csc_v1 | 65.34 | 93.56 | 49.76 | 44.98 | 74.64 | 36.1 | 37.0 | 73.0 | 83.6 | 86.87 | 69.2 | 62.62 | 72.73 |
| macbert4csc_v2 | 65.22 | 93.69 | 50.14 | 44.92 | 74.64 | 36.26 | 37.0 | 72.72 | 83.66 | 86.93 | 68.5 | 62.43 | 71.73 |
| macbert4mdcspell_v1 | 67.15 | 93.09 | 54.8 | 47.71 | 78.09 | 39.52 | 38.8 | 71.92 | 84.78 | 88.27 | 73.2 | 63.28 | 72.36 |
| relm_v1 | 51.9 | 81.71 | 36.18 | 37.04 | 63.99 | 29.34 | 22.9 | 51.98 | 74.1 | 76.0 | 50.3 | 45.76 | 53.45 |
| qwen25_1-5b_pycorrector | 46.09 | 15.82 | 81.29 | 22.96 | 82.17 | 19.04 | 12.8 | 50.2 | 96.4 | 89.13 | 22.8 | 27.87 | 32.55 |

### 2.3 Accuracy on error-free data (acc_true, thr=0.75)
| model/acc | avg | acc_rmrb | acc_xxqg |
|:---|:---|:---|:---|
| macbert4csc_pycorrector | 99.24 | 99.22 | 99.26 |
| bert4csc_v1 | 98.71 | 98.36 | 99.06 |
| macbert4csc_v1 | 97.72 | 96.72 | 98.72 |
| macbert4csc_v2 | 97.89 | 96.98 | 98.8 |
| macbert4mdcspell_v1 | 97.75 | 96.51 | 98.98 |
| relm_v1 | 93.47 | 90.21 | 96.74 |
| qwen25_1-5b_pycorrector | 82.0 | 77.14 | 86.86 |

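Because acc_rmrb and acc_xxqg contain no errors at all, this accuracy can be read as one minus the false-correction rate. The sketch below shows that reading only; how the thr=0.75 threshold gates individual corrections in the official evaluation is not specified here, so it is left out.
```
# Sketch: accuracy on an error-free set = share of sentences left unchanged,
# i.e. 1 - false-correction rate (the thr=0.75 gating is not modeled here).
def accuracy_on_clean_set(sources, predictions):
    unchanged = sum(1 for src, pred in zip(sources, predictions) if pred == src)
    return unchanged / max(len(sources), 1)
```
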
## 3. Conclusions
```
1. Models such as macbert4csc_v1/macbert4csc_v2/macbert4mdcspell_v1 are trained on data from many domains, so they are fairly well balanced; they are also suitable as a first-stage pretrained model for further fine-tuning on domain-specific data;
2. Comparing macbert4csc_pycorrector/bert4csc_v1/macbert4csc_v2/macbert4mdcspell_v1 in table 2.3, more training data raises accuracy but also slightly raises the false-correction rate;
3. MFT (Mask-Correct) still helps, although the gain is small when training data is already plentiful; it may also be an important cause of the higher false-correction rate;
4. The training data also contains classical Chinese, so the trained models support correction of classical Chinese as well;
5. The trained models have high detection and correction rates for high-frequency errors such as "的/地/得";
```

## 4. Usage
### 4.1 With macro-correct
```
# !/usr/bin/python
# -*- coding: utf-8 -*-
# @time    : 2021/2/29 21:41
# @author  : Mo
# @function: test CSC with relm_v1 (text correction)


import traceback
import time
import os

from macro_correct.pytorch_user_models.csc.relm.predict import RelmPredict


path_model_dir = "../../macro_correct/output/text_correction/relm_v1"
# path_model_dir = "Macropodus/relm_v1"

model = RelmPredict(path_model_dir, path_model_dir)
texts = ["真麻烦你了。希望你们好好的跳无",
         "少先队员因该为老人让坐",
         "机七学习是人工智能领遇最能体现智能的一个分知",
         "一只小鱼船浮在平净的河面上"
         ]
texts = [{"source": t} for t in texts]
res = model.predict(texts)
for r in res:
    print(r)
"""
{'source': '真麻烦你了。希望你们好好的跳无', 'target': '真麻烦你了。希望你们好好跳舞', 'errors': [['的', '跳', 12], ['跳', '舞', 13]]}
{'source': '少先队员因该为老人让坐', 'target': '少先队员应该为老人让座', 'errors': [['因', '应', 4], ['坐', '座', 10]]}
{'source': '机七学习是人工智能领遇最能体现智能的一个分知', 'target': '机器学习是人工智能领域最能体现智能的一个分', 'errors': [['七', '器', 1], ['遇', '域', 10]]}
{'source': '一只小鱼船浮在平净的河面上', 'target': '一只小鱼船浮在平静的河面上', 'errors': [['净', '静', 8]]}
"""
```

### 4.2 With transformers
```
# !/usr/bin/python
# -*- coding: utf-8 -*-
# @time    : 2021/2/29 21:41
# @author  : Mo
# @function: load the BERT-style model directly with transformers and test it


import traceback
import time
import sys
import os
os.environ["USE_TORCH"] = "1"
from transformers import BertConfig, BertTokenizer, BertForMaskedLM
import torch


pretrained_model_name_or_path = "Macropodus/relm_v1"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
max_len = 128

print("load model, please wait a few minutes!")
tokenizer = BertTokenizer.from_pretrained(pretrained_model_name_or_path)
bert_config = BertConfig.from_pretrained(pretrained_model_name_or_path)
model = BertForMaskedLM.from_pretrained(pretrained_model_name_or_path)
model.to(device)
print("load model success!")

texts = [
    "机七学习是人工智能领遇最能体现智能的一个分知",
    "我是练习时长两念半的鸽仁练习生蔡徐坤",
    "我是练习时长两年半的鸽人练习生蔡徐坤",
    "真麻烦你了。希望你们好好的跳无",
    "他法语说的很好,的语也不错",
    "遇到一位很棒的奴生跟我疗天",
    "我们为这个目标努力不解",
]
len_mid = min(max_len, max([len(t)+2 for t in texts])) * 2

with torch.no_grad():
    # build the ReLM rephrasing input: source + [SEP] + one [MASK] per source character
    texts_relm = [list(t) + [tokenizer.sep_token] + [tokenizer.mask_token for _ in t] for t in texts]
    texts_relm = ["".join(t) for t in texts_relm]
    outputs = model(**tokenizer(texts_relm, padding=True, max_length=len_mid,
                                return_tensors="pt").to(device))

def get_errors(source, target):
    """Minimal helper to collect (wrong char, corrected char, index) triples."""
    len_min = min(len(source), len(target))
    errors = []
    for idx in range(len_min):
        if source[idx] != target[idx]:
            errors.append([source[idx], target[idx], idx])
    return errors

result = []
for probs, source in zip(outputs.logits, texts):
    ids = torch.argmax(probs, dim=-1)
    tokens_space = tokenizer.decode(ids, skip_special_tokens=False)
    text_new = tokens_space.split(" ")
    # print(text_new)
    target = text_new[len(source)+2:len(source)*2+2]
    errors = get_errors(list(source), target)
    target = "".join(target)
    print(source, " => ", target, errors)
    result.append([target, errors])
print(result)
"""
机七学习是人工智能领遇最能体现智能的一个分知 => 机器学习是人工智能领域最能体现智能的一个分知 [['七', '器', 1], ['遇', '域', 10]]
我是练习时长两念半的鸽仁练习生蔡徐坤 => 我是练习时长两年半的鸽人练习生蔡徐坤 [['念', '年', 7], ['仁', '人', 11]]
我是练习时长两年半的鸽人练习生蔡徐坤 => 我是练习时长两年半的个人练习生蔡徐坤 [['鸽', '个', 10]]
真麻烦你了。希望你们好好的跳无 => 真麻烦你了。希望你们好好地跳舞 [['的', '地', 12], ['无', '舞', 14]]
他法语说的很好,的语也不错 => 他法语说得很好,德语也不错 [['的', '得', 4], ['的', '德', 8]]
遇到一位很棒的奴生跟我疗天 => 遇到一位很棒的女生跟我聊天 [['奴', '女', 7], ['疗', '聊', 11]]
我们为这个目标努力不解 => 我们为这个目标努力不懈 [['解', '懈', 10]]
"""
```

## 5. Papers
- 2024-Refining: [Refining Corpora from a Model Calibration Perspective for Chinese](https://arxiv.org/abs/2407.15498)
- 2024-ReLM: [Chinese Spelling Correction as Rephrasing Language Model](https://arxiv.org/abs/2308.08796)
- 2024-DISC: [DISC: Plug-and-Play Decoding Intervention with Similarity of Characters for Chinese Spelling Check](https://arxiv.org/abs/2412.12863)

- 2023-Bi-DCSpell: [A Bi-directional Detector-Corrector Interactive Framework for Chinese Spelling Check]()
- 2023-BERT-MFT: [Rethinking Masked Language Modeling for Chinese Spelling Correction](https://arxiv.org/abs/2305.17721)
- 2023-PTCSpell: [PTCSpell: Pre-trained Corrector Based on Character Shape and Pinyin for Chinese Spelling Correction](https://arxiv.org/abs/2212.04068)
- 2023-DR-CSC: [A Frustratingly Easy Plug-and-Play Detection-and-Reasoning Module for Chinese](https://aclanthology.org/2023.findings-emnlp.771)
- 2023-DROM: [Disentangled Phonetic Representation for Chinese Spelling Correction](https://arxiv.org/abs/2305.14783)
- 2023-EGCM: [An Error-Guided Correction Model for Chinese Spelling Error Correction](https://arxiv.org/abs/2301.06323)
- 2023-IGPI: [Investigating Glyph-Phonetic Information for Chinese Spell Checking: What Works and What's Next?](https://arxiv.org/abs/2212.04068)
- 2023-CL: [Contextual Similarity is More Valuable than Character Similarity: An Empirical Study for Chinese Spell Checking]()

- 2022-CRASpell: [CRASpell: A Contextual Typo Robust Approach to Improve Chinese Spelling Correction](https://aclanthology.org/2022.findings-acl.237)
- 2022-MDCSpell: [MDCSpell: A Multi-task Detector-Corrector Framework for Chinese Spelling Correction](https://aclanthology.org/2022.findings-acl.98)
- 2022-SCOPE: [Improving Chinese Spelling Check by Character Pronunciation Prediction: The Effects of Adaptivity and Granularity](https://arxiv.org/abs/2210.10996)
- 2022-ECOPO: [The Past Mistake is the Future Wisdom: Error-driven Contrastive Probability Optimization for Chinese Spell Checking](https://arxiv.org/abs/2203.00991)

- 2021-MLMPhonetics: [Correcting Chinese Spelling Errors with Phonetic Pre-training](https://aclanthology.org/2021.findings-acl.198)
- 2021-ChineseBERT: [ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information](https://aclanthology.org/2021.acl-long.161/)
- 2021-BERTCrsGad: [Global Attention Decoder for Chinese Spelling Error Correction](https://aclanthology.org/2021.findings-acl.122)
- 2021-ThinkTwice: [Think Twice: A Post-Processing Approach for the Chinese Spelling Error Correction](https://www.mdpi.com/2076-3417/11/13/5832)
- 2021-PHMOSpell: [PHMOSpell: Phonological and Morphological Knowledge Guided Chinese Spelling Check](https://aclanthology.org/2021.acl-long.464)
- 2021-SpellBERT: [SpellBERT: A Lightweight Pretrained Model for Chinese Spelling Check](https://aclanthology.org/2021.emnlp-main.287)
- 2021-TwoWays: [Exploration and Exploitation: Two Ways to Improve Chinese Spelling Correction Models](https://aclanthology.org/2021.acl-short.56)
- 2021-ReaLiSe: [Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking](https://arxiv.org/abs/2105.12306)
- 2021-DCSpell: [DCSpell: A Detector-Corrector Framework for Chinese Spelling Error Correction](https://dl.acm.org/doi/10.1145/3404835.3463050)
- 2021-PLOME: [PLOME: Pre-training with Misspelled Knowledge for Chinese Spelling Correction](https://aclanthology.org/2021.acl-long.233)
- 2021-DCN: [Dynamic Connected Networks for Chinese Spelling Check](https://aclanthology.org/2021.findings-acl.216/)

- 2020-SoftMaskBERT: [Spelling Error Correction with Soft-Masked BERT](https://arxiv.org/abs/2005.07421)
- 2020-SpellGCN: [SpellGCN: Incorporating Phonological and Visual Similarities into Language Models for Chinese Spelling Check](https://arxiv.org/abs/2004.14166)
- 2020-ChunkCSC: [Chunk-based Chinese Spelling Check with Global Optimization](https://aclanthology.org/2020.findings-emnlp.184)
- 2020-MacBERT: [Revisiting Pre-Trained Models for Chinese Natural Language Processing](https://arxiv.org/abs/2004.13922)

- 2019-FASPell: [FASPell: A Fast, Adaptable, Simple, Powerful Chinese Spell Checker Based On DAE-Decoder Paradigm](https://aclanthology.org/D19-5522)
- 2018-Hybrid: [A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Checking](https://aclanthology.org/D18-1273)

- 2015-Sighan15: [Introduction to SIGHAN 2015 Bake-off for Chinese Spelling Check](https://aclanthology.org/W15-3106/)
- 2014-Sighan14: [Overview of SIGHAN 2014 Bake-off for Chinese Spelling Check](https://aclanthology.org/W14-6820/)
- 2013-Sighan13: [Chinese Spelling Check Evaluation at SIGHAN Bake-off 2013](https://aclanthology.org/W13-4406/)

## 6. References
- [nghuyong/Chinese-text-correction-papers](https://github.com/nghuyong/Chinese-text-correction-papers)
- [destwang/CTCResources](https://github.com/destwang/CTCResources)
- [wangwang110/CSC](https://github.com/wangwang110/CSC)
- [chinese-poetry/chinese-poetry](https://github.com/chinese-poetry/chinese-poetry)
- [chinese-poetry/huajianji](https://github.com/chinese-poetry/huajianji)
- [garychowcmu/daizhigev20](https://github.com/garychowcmu/daizhigev20)
- [yangjianxin1/Firefly](https://github.com/yangjianxin1/Firefly)
- [Macropodus/xuexiqiangguo_428w](https://huggingface.co/datasets/Macropodus/xuexiqiangguo_428w)
- [Macropodus/csc_clean_wang271k](https://huggingface.co/datasets/Macropodus/csc_clean_wang271k)
- [Macropodus/csc_eval_public](https://huggingface.co/datasets/Macropodus/csc_eval_public)
- [shibing624/pycorrector](https://github.com/shibing624/pycorrector)
- [iioSnail/MDCSpell_pytorch](https://github.com/iioSnail/MDCSpell_pytorch)
- [gingasan/lemon](https://github.com/gingasan/lemon)
- [Claude-Liu/ReLM](https://github.com/Claude-Liu/ReLM)


## 7. Citation
To cite this work, please refer to the GitHub project, for example with BibTeX:
```
@software{macro-correct,
  url = {https://github.com/yongzhuo/macro-correct},
  author = {Yongzhuo Mo},
  title = {macro-correct},
  year = {2025}
}
```

config.json
ADDED
{
  "_name_or_path": "chinese-bert-base",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "position_embedding_type": "absolute",
  "transformers_version": "4.30.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 21128
}

csc.config
ADDED
{
  "pretrained_model_name_or_path": "",
  "path_relm": "relm-m0.3.bin",
  "path_train": "csc_public.train.json",
  "path_dev": "csc_public.dev.json",
  "path_tet": "csc_public.tet.json",
  "model_save_path": "../output/relm_v1",
  "task_name": "relm_csc",
  "do_lower_case": true,
  "do_train": true,
  "do_eval": true,
  "do_test": true,
  "gradient_accumulation_steps": 4,
  "warmup_proportion": 0.1,
  "num_warmup_steps": null,
  "max_train_steps": null,
  "num_train_epochs": 3,
  "train_batch_size": 8,
  "eval_batch_size": 8,
  "learning_rate": 3e-05,
  "max_seq_length": 256,
  "max_grad_norm": 1.0,
  "weight_decay": 0.0005,
  "save_steps": 1000,
  "anchor": null,
  "seed": 42,
  "lr_scheduler_type": "cosine",
  "loss_type": "focal_loss",
  "mask_mode": "noerror",
  "loss_det_rate": 0.3,
  "prompt_length": 0,
  "mask_rate": 0.3,
  "threshold": 0.5,
  "flag_dynamic_encode": false,
  "flag_loss_period": false,
  "flag_cpo_loss": false,
  "flag_fast_tokenizer": true,
  "flag_pin_memory": true,
  "flag_train": false,
  "flag_fp16": false,
  "flag_cuda": true,
  "flag_skip": true,
  "flag_mft": true,
  "num_workers": 0,
  "CUDA_VISIBLE_DEVICES": "0",
  "USE_TORCH": "1"
}

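csc.config is a plain-JSON file holding the training/inference hyperparameters of the ReLM run. A minimal sketch of reading a few of its fields (the loader is illustrative, not the project's own code):
```
import json

# Illustrative reader for csc.config; field names come from the file above.
with open("csc.config", "r", encoding="utf-8") as f:
    cfg = json.load(f)

print(cfg["max_seq_length"])  # 256
print(cfg["mask_rate"])       # 0.3, presumably the MFT masking rate (flag_mft is true)
print(cfg["loss_type"])       # "focal_loss"
```
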
pytorch_model.bin
ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:ee638ee68d9c61280fd196ea1cbf94999a4506ab68062ab8811320467693ce53
size 409230197

special_tokens_map.json
ADDED
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}

tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
{
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_lower_case": true,
  "mask_token": "[MASK]",
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}

vocab.txt
ADDED
The diff for this file is too large to render.