ai-forever committed on
Commit 547b697
1 Parent(s): 3f74453

Create README.md

Files changed (1):
  1. README.md +92 -0
README.md ADDED
@@ -0,0 +1,92 @@
+ ---
+ license: mit
+ language:
+ - en
+ tags:
+ - spellchecking
+ - NLP
+ - T5
+ - pytorch
+ - natural language generation
+ ---
+
+ # Model card for T5-large-spell
+
+ ### Summary
+ The model corrects spelling errors and typos by bringing all the words in the text into standard English.
+ The proofreader was trained on the basis of the [T5-large](https://huggingface.co/t5-large) model.
+ An extensive dataset with “artificial” errors was used as the training corpus: the corpus was assembled from English-language Wikipedia and news blogs, and typos and spelling errors were then introduced into it automatically with the [SAGE](https://github.com/orgs/ai-forever/sage) library.
+
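+ As an illustration of this kind of synthetic corruption, here is a minimal, self-contained sketch; it is a generic example, not the SAGE API, and the operations and probability are assumptions chosen for illustration:
+ ```python
+ import random
+
+ # Hypothetical "artificial" error injection: randomly drop, transpose,
+ # or repeat characters to turn clean text into a noisy training source.
+ def corrupt_word(word: str, p: float = 0.15) -> str:
+     if len(word) < 3 or random.random() > p:
+         return word
+     i = random.randrange(len(word) - 1)
+     op = random.choice(["drop", "swap", "repeat"])
+     if op == "drop":    # delete one character
+         return word[:i] + word[i + 1:]
+     if op == "swap":    # transpose two adjacent characters
+         return word[:i] + word[i + 1] + word[i] + word[i + 2:]
+     return word[:i] + word[i] + word[i:]   # duplicate one character
+
+ clean = "The festival was excellent in many ways."
+ noisy = " ".join(corrupt_word(w) for w in clean.split())
+ print(noisy)  # e.g. "The fesitval was excellent in mnay ways."
+ ```
+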
+ ### Articles and speeches
+ - [Speech about the SAGE library](https://youtu.be/yFfkV0Qjuu0), DataFest 2023
+ - [Article about synthetic error generation methods](https://www.dialog-21.ru/media/5914/martynovnplusetal056.pdf), Dialogue 2023
+ - [Article about SAGE and our best solution](https://arxiv.org/abs/2308.09435), under review at EACL 2024
+
+ ### Examples
+ | Input | Output |
+ | --- | --- |
+ | Th festeivаl was excelzecnt in many ways, and in particular it beinganinternational festjival sss a chаllenging, bet brilli an t ea. | The festival was excellent in many ways, and in particular it beinganinternational festival is a challenging, but brilliant one to see. |
+ | That 's why I believe in the solution which is the closest to human nature and can help us to avoid boredome. I am sure that eventually we will take off our clothes and in the future we will be undressed and free. There wo n't be any problem with being up - do - date . | That's why I believe in the solution which is the closest to human nature and can help us to avoid boredom. I am sure that eventually we will take off our clothes and in the future we will be undressed and free. There won't be any problem with being up - do - date. |
+ | If you bought something goregous, you well be very happy. | If you bought something gorgeous, you will be very happy. |
+
+ ## Metrics
+ ### Quality
+ Below are automatic metrics that assess the quality of the spell checkers.
+ We compare our solution both with open automatic spell checkers and with the ChatGPT family of models on two available datasets:
+ - **BEA60K**: English spelling errors collected from several domains;
+ - **JFLEG**: 1,601 English sentences containing about 2,000 spelling errors.
+
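+ The tables below report precision, recall, and F1. As a reference for how such correction metrics can be computed, here is a token-level sketch; the exact evaluation protocol behind the numbers below is not specified in this card, so this definition is only an assumption:
+ ```python
+ # Token-level precision/recall/F1 for spelling correction (one common
+ # definition; the card's actual evaluation protocol may differ).
+ def correction_prf(source, hypothesis, reference):
+     tp = fp = fn = 0
+     for src, hyp, ref in zip(source, hypothesis, reference):
+         if src != ref:        # token that needed a correction
+             if hyp == ref:
+                 tp += 1       # corrected properly
+             else:
+                 fn += 1       # missed or mis-corrected
+         elif hyp != src:
+             fp += 1           # correct token changed needlessly
+     precision = tp / (tp + fp) if tp + fp else 0.0
+     recall = tp / (tp + fn) if tp + fn else 0.0
+     f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
+     return precision, recall, f1
+
+ p, r, f = correction_prf(
+     "you well be very happy".split(),
+     "you will be very happy".split(),
+     "you will be very happy".split(),
+ )
+ print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")  # P=1.00 R=1.00 F1=1.00
+ ```
+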
+ **BEA60K**
+ | Model | Precision | Recall | F1 |
+ | --- | --- | --- | --- |
+ | T5-large-spell | 66.5 | 83.1 | 73.9 |
+ | ChatGPT gpt-3.5-turbo-0301 | 66.9 | 84.1 | 74.5 |
+ | ChatGPT gpt-4-0314 | 68.6 | 85.2 | 76.0 |
+ | ChatGPT text-davinci-003 | 67.8 | 83.9 | 75.0 |
+ | [BERT](https://github.com/neuspell/neuspell) | 65.8 | 79.6 | 72.0 |
+ | [SC-LSTM](https://github.com/neuspell/neuspell) | 62.2 | 80.3 | 72.0 |
+
+ **JFLEG**
+ | Model | Precision | Recall | F1 |
+ | --- | --- | --- | --- |
+ | T5-large-spell | 83.4 | 84.3 | 83.8 |
+ | ChatGPT gpt-3.5-turbo-0301 | 77.8 | 88.6 | 82.9 |
+ | ChatGPT gpt-4-0314 | 77.9 | 88.3 | 82.8 |
+ | ChatGPT text-davinci-003 | 76.8 | 88.5 | 82.2 |
+ | [BERT](https://github.com/neuspell/neuspell) | 78.5 | 85.4 | 81.8 |
+ | [SC-LSTM](https://github.com/neuspell/neuspell) | 80.6 | 86.1 | 83.2 |
+
+ ## How to use
+ ```python
+ from transformers import T5ForConditionalGeneration, AutoTokenizer
+
+ # Model identifier on the Hugging Face Hub (or a local checkpoint path)
+ path_to_model = "ai-forever/T5-large-spell"
+
+ model = T5ForConditionalGeneration.from_pretrained(path_to_model)
+ tokenizer = AutoTokenizer.from_pretrained(path_to_model)
+ prefix = "grammar: "  # task prefix the model expects
+
+ sentence = "If you bought something goregous, you well be very happy."
+ sentence = prefix + sentence
+
+ encodings = tokenizer(sentence, return_tensors="pt")
+ generated_tokens = model.generate(**encodings)
+ answer = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
+ print(answer)
+
+ # ["If you bought something gorgeous, you will be very happy."]
+ ```
+
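+ Continuing from the snippet above, several sentences can be corrected at once by padding the batch and passing explicit generation settings; a minimal sketch (the `max_new_tokens` and `num_beams` values here are illustrative assumptions, not taken from this card):
+ ```python
+ # Batched correction; generation settings are illustrative, not prescribed.
+ sentences = [prefix + s for s in [
+     "If you bought something goregous, you well be very happy.",
+     "That 's why I believe in the solution which can help us to avoid boredome.",
+ ]]
+ batch = tokenizer(sentences, return_tensors="pt", padding=True)
+ outputs = model.generate(**batch, max_new_tokens=64, num_beams=4)
+ print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
+ ```
+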
+ ## Resources
+ - [SAGE library code with augmentation methods, access to datasets and open models](https://github.com/orgs/ai-forever/sage), GitHub
+ - [ruM2M100-1.2B](https://huggingface.co/ai-forever/RuM2M100-1.2B), HuggingFace
+ - [ruM2M100-418M](https://huggingface.co/ai-forever/RuM2M100-418M), HuggingFace
+ - [FredT5-large-spell](https://huggingface.co/ai-forever/FRED-T5-large-spell), HuggingFace
+ - [T5-large-spell](https://huggingface.co/ai-forever/T5-large-spell), HuggingFace
+
+ ## License
+ The [T5-large](https://huggingface.co/t5-large) model, on which our solution is based, and its source code are distributed under the Apache-2.0 license.
+ Our solution is distributed under the MIT license.
+
+ ## Contacts
+ For questions about the operation and application of the model, please contact the product manager, Pavel Lebedev ([email protected]).