---
tags:
- text2text-generation
- Chinese
- seq2seq
- grammar
language: zh
license: apache-2.0
---
# Pseudo-Native-BART-CGEC
This is a Chinese Grammatical Error Correction (CGEC) model based on [Chinese BART-large](https://huggingface.co/fnlp/bart-large-chinese).
It is trained on HSK and Lang8 learner CGEC data (about 1.3M) and on human-annotated training data for the exam domain.
More details are available in our [GitHub repository](https://github.com/HillZhang1999/NaSGEC) and the [paper](https://arxiv.org/pdf/2305.16023.pdf).
## Usage
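Install the library with `pip install transformers` (PyTorch is also required, since the example requests `return_tensors="pt"`), then run: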
```
from transformers import BertTokenizer, BartForConditionalGeneration

tokenizer = BertTokenizer.from_pretrained("HillZhang/real_learner_bart_CGEC_exam")
model = BartForConditionalGeneration.from_pretrained("HillZhang/real_learner_bart_CGEC_exam")

# Example inputs, each containing an error for the model to correct
sentences = ["北京是中国的都。", "他说:”我最爱的运动是打蓝球“", "我每天大约喝5次水左右。", "今天,我非常开开心。"]
encoded_input = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)

# BertTokenizer adds token_type_ids, which BART's generate() does not accept
if "token_type_ids" in encoded_input:
    del encoded_input["token_type_ids"]

output = model.generate(**encoded_input)
print(tokenizer.batch_decode(output, skip_special_tokens=True))
```
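Strings decoded by `BertTokenizer` typically come back with a space between every token, so the raw output of the snippet above usually needs light post-processing before it can be compared with the input. The following is a minimal sketch of that step, not part of the released code: the helper names `strip_token_spaces` and `list_edits` are hypothetical, and the `decoded` string is written by hand for illustration rather than produced by the model.

```python
import difflib


def strip_token_spaces(decoded: str) -> str:
    """Remove the spaces that tokenizer decoding places between tokens."""
    return decoded.replace(" ", "")


def list_edits(source: str, corrected: str):
    """Return (operation, source_span, corrected_span) for each changed region."""
    matcher = difflib.SequenceMatcher(a=source, b=corrected, autojunk=False)
    return [
        (tag, source[i1:i2], corrected[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hand-written stand-in for one decoded string (an assumption, not real model output)
decoded = "北 京 是 中 国 的 首 都 。"
corrected = strip_token_spaces(decoded)
print(corrected)
print(list_edits("北京是中国的都。", corrected))
```

Removing every space is only safe for text that contains no meaningful whitespace; inputs that mix in English words would need a more careful rule.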
## Citation
```
@inproceedings{zhang-etal-2023-nasgec,
    title = "{Na}{SGEC}: a Multi-Domain Chinese Grammatical Error Correction Dataset from Native Speaker Texts",
    author = "Zhang, Yue and
      Zhang, Bo and
      Jiang, Haochen and
      Li, Zhenghua and
      Li, Chen and
      Huang, Fei and
      Zhang, Min",
    booktitle = "Findings of ACL",
    year = "2023"
}
```