---
license: mit
language:
- en
tags:
- spellchecking
- NLP
- T5
- pytorch
- natural language generation
---

# Card for T5-large-spell model

### Summary
The model corrects spelling errors and typos by normalizing all words in the text to standard English.
The proofreader was trained on the basis of the [T5-large](https://huggingface.co/t5-large) model.
An extensive dataset with “artificial” errors served as the training corpus: it was assembled from English-language Wikipedia and news blogs, after which typos and spelling errors were automatically introduced into it using the [SAGE](https://github.com/ai-forever/sage) library.
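
To give a flavor of how a clean corpus can be corrupted, here is a minimal, self-contained sketch of character-level typo injection. It is a toy illustration only; the real augmenters live in the SAGE library linked above, and none of the names below come from its API.

```python
import random

# Toy synthetic error generation (NOT the SAGE API): randomly drop,
# transpose, or replace letters to turn clean text into noisy text.
def corrupt(text: str, error_rate: float = 0.05, seed: int = 0) -> str:
    rng = random.Random(seed)
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if c.isalpha() and rng.random() < error_rate:
            op = rng.choice(["drop", "swap", "replace"])
            if op == "swap" and i + 1 < len(chars):
                out.extend([chars[i + 1], c])  # transpose two neighbors
                i += 1
            elif op == "replace":
                out.append(rng.choice("abcdefghijklmnopqrstuvwxyz"))
            # "drop": append nothing, the character disappears
        else:
            out.append(c)
        i += 1
    return "".join(out)

print(corrupt("The festival was excellent in many ways."))
```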

### Articles and speeches
- [Talk about the SAGE library](https://youtu.be/yFfkV0Qjuu0), DataFest 2023
- [Paper about synthetic error generation methods](https://www.dialog-21.ru/media/5914/martynovnplusetal056.pdf), Dialogue 2023
- [Paper about SAGE and our best solution](https://arxiv.org/abs/2308.09435), Review EACL 2024

### Examples
| Input | Output |
| --- | --- |
| Th festeivаl was excelzecnt in many ways, and in particular it beinganinternational festjival sss a chаllenging, bet brilli an t ea. | The festival was excellent in many ways, and in particular it beinganinternational festival is a challenging, but brilliant one to see. |
| That 's why I believe in the solution which is the closest to human nature and can help us to avoid boredome. I am sure that eventually we will take off our clothes and in the future we will be undressed and free. There wo n't be any problem with being up - do - date . | That's why I believe in the solution which is the closest to human nature and can help us to avoid boredom. I am sure that eventually we will take off our clothes and in the future we will be undressed and free. There won't be any problem with being up - do - date. |
| If you bought something goregous, you well be very happy. | If you bought something gorgeous, you will be very happy. |

## Metrics
### Quality
Below are automatic metrics assessing the quality of the spell checkers.
We compare our solution both with open automatic spell checkers and with the ChatGPT family of models on two available datasets:
- **BEA60K**: English spelling errors collected from several domains;
- **JFLEG**: 1601 English sentences containing about 2,000 spelling errors.

**BEA60K**
| Model | Precision | Recall | F1 |
| --- | --- | --- | --- |
| T5-large-spell | 66.5 | 83.1 | 73.9 |
| ChatGPT gpt-3.5-turbo-0301 | 66.9 | 84.1 | 74.5 |
| ChatGPT gpt-4-0314 | 68.6 | 85.2 | 76.0 |
| ChatGPT text-davinci-003 | 67.8 | 83.9 | 75.0 |
| [Bert](https://github.com/neuspell/neuspell) | 65.8 | 79.6 | 72.0 |
| [SC-LSTM](https://github.com/neuspell/neuspell) | 62.2 | 80.3 | 72.0 |

**JFLEG**
| Model | Precision | Recall | F1 |
| --- | --- | --- | --- |
| T5-large-spell | 83.4 | 84.3 | 83.8 |
| ChatGPT gpt-3.5-turbo-0301 | 77.8 | 88.6 | 82.9 |
| ChatGPT gpt-4-0314 | 77.9 | 88.3 | 82.8 |
| ChatGPT text-davinci-003 | 76.8 | 88.5 | 82.2 |
| [Bert](https://github.com/neuspell/neuspell) | 78.5 | 85.4 | 81.8 |
| [SC-LSTM](https://github.com/neuspell/neuspell) | 80.6 | 86.1 | 83.2 |
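
For intuition, the sketch below shows one simple way to compute word-level precision, recall, and F1 for a spell checker. It is an illustration under a simplifying assumption (equal-length tokenizations); the published numbers above come from the evaluation tooling in the SAGE repository, which this does not reproduce.

```python
def spellcheck_prf(sources, hypotheses, references):
    """Toy word-level P/R/F1: compares tokens position by position and
    assumes each triple splits into the same number of words."""
    tp = fp = fn = 0
    for src, hyp, ref in zip(sources, hypotheses, references):
        for s, h, r in zip(src.split(), hyp.split(), ref.split()):
            if s != r:          # this word needed a correction
                if h == r:
                    tp += 1     # corrected it properly
                else:
                    fn += 1     # missed it or corrected it wrongly
            elif h != s:
                fp += 1         # "corrected" a word that was already fine
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f = spellcheck_prf(
    ["you well be happy"], ["you will be happy"], ["you will be happy"]
)
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")  # P=1.00 R=1.00 F1=1.00
```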

## How to use
```python
from transformers import T5ForConditionalGeneration, AutoTokenizer

# The checkpoint published with this card; a local path works as well.
path_to_model = "ai-forever/T5-large-spell"

model = T5ForConditionalGeneration.from_pretrained(path_to_model)
tokenizer = AutoTokenizer.from_pretrained(path_to_model)
prefix = "grammar: "  # task prefix the model was trained with

sentence = "If you bought something goregous, you well be very happy."
sentence = prefix + sentence

encodings = tokenizer(sentence, return_tensors="pt")
generated_tokens = model.generate(**encodings)
answer = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(answer)

# ["If you bought something gorgeous, you will be very happy."]
```
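
The snippet above corrects one sentence with greedy decoding. As a variation, the sketch below batches several inputs and uses beam search; the generation settings are illustrative choices, not values prescribed by this card.

```python
from transformers import T5ForConditionalGeneration, AutoTokenizer

model = T5ForConditionalGeneration.from_pretrained("ai-forever/T5-large-spell")
tokenizer = AutoTokenizer.from_pretrained("ai-forever/T5-large-spell")

sentences = [
    "If you bought something goregous, you well be very happy.",
    "That 's why I believe in the solution which can help us to avoid boredome.",
]
# Prepend the task prefix to every input and pad to a common length.
batch = tokenizer(
    ["grammar: " + s for s in sentences], padding=True, return_tensors="pt"
)
generated = model.generate(**batch, max_length=128, num_beams=5)
for corrected in tokenizer.batch_decode(generated, skip_special_tokens=True):
    print(corrected)
```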

## Resources
- [SAGE library code with augmentation methods, access to datasets and open models](https://github.com/ai-forever/sage), GitHub
- [ruM2M100-1.2B](https://huggingface.co/ai-forever/RuM2M100-1.2B), HuggingFace
- [ruM2M100-418M](https://huggingface.co/ai-forever/RuM2M100-420M), HuggingFace
- [FredT5-large-spell](https://huggingface.co/ai-forever/FRED-T5-large-spell), HuggingFace
- [T5-large-spell](https://huggingface.co/ai-forever/T5-large-spell), HuggingFace

## License
The [T5-large](https://huggingface.co/t5-large) model on which our solution is based, together with its source code, is distributed under the Apache 2.0 license.
Our solution is distributed under the MIT license.

## Contacts
For questions about using and applying the model, please contact the product manager: Pavel Lebedev ([email protected]).