RichardErkhov committed on
Commit
267dd7e
·
verified ·
1 Parent(s): 3b9cd1a

uploaded readme

  1. README.md +209 -0
README.md ADDED
@@ -0,0 +1,209 @@
Quantization made by Richard Erkhov.

[Github](https://github.com/RichardErkhov)

[Discord](https://discord.gg/pvy7H8DZMG)

[Request more models](https://github.com/RichardErkhov/quant_request)


mrebel-large - bnb 4bits
- Model creator: https://huggingface.co/Babelscape/
- Original model: https://huggingface.co/Babelscape/mrebel-large/


Original model description:
---
language:
- ar
- ca
- de
- el
- en
- es
- fr
- hi
- it
- ja
- ko
- nl
- pl
- pt
- ru
- sv
- vi
- zh
widget:
- text: >-
    Els Red Hot Chili Peppers es van formar a Los Angeles per Kiedis, Flea, el
    guitarrista Hillel Slovak i el bateria Jack Irons.
  example_title: Catalan
inference:
  parameters:
    decoder_start_token_id: 250058
    src_lang: ca_XX
    tgt_lang: <triplet>
tags:
- seq2seq
- relation-extraction
license: cc-by-nc-sa-4.0
pipeline_tag: translation
datasets:
- Babelscape/SREDFM
---
# RED<sup>FM</sup>: a Filtered and Multilingual Relation Extraction Dataset

This is a multilingual version of [REBEL](https://huggingface.co/Babelscape/rebel-large). It can be used as a standalone multilingual Relation Extraction system, or as a pretrained system to be fine-tuned on multilingual Relation Extraction datasets.

mREBEL is introduced in the ACL 2023 paper [RED<sup>FM</sup>: a Filtered and Multilingual Relation Extraction Dataset](https://arxiv.org/abs/2306.09802). We present a new multilingual Relation Extraction dataset and train a multilingual version of REBEL, which reframed Relation Extraction as a seq2seq task. If you use the code or model, please reference this work in your paper:

    @inproceedings{huguet-cabot-et-al-2023-redfm-dataset,
        title = "RED$^{\rm FM}$: a Filtered and Multilingual Relation Extraction Dataset",
        author = "Huguet Cabot, Pere-Llu{\'\i}s and Tedeschi, Simone and Ngonga Ngomo, Axel-Cyrille and
          Navigli, Roberto",
        booktitle = "Proc. of the 61st Annual Meeting of the Association for Computational Linguistics: ACL 2023",
        month = jul,
        year = "2023",
        address = "Toronto, Canada",
        publisher = "Association for Computational Linguistics",
        url = "https://arxiv.org/abs/2306.09802",
    }

The original repository for the paper can be found [here](https://github.com/Babelscape/rebel#REDFM).

Be aware that the inference widget at the right does not output special tokens, which are necessary to distinguish the subject, object and relation types. For a demo of mREBEL and its pre-training dataset, check the [Spaces demo](https://huggingface.co/spaces/Babelscape/mrebel-demo).
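
To see why the special tokens matter, here is a minimal sketch on a hand-written string shaped like the linearized output that the parsing function in the usage sections consumes (`<triplet>` subject `<subject type>` object `<object type>` relation). The string, the type labels `org` and `loc`, and the helper name `strip_special_tokens` are illustrative assumptions, not actual model output or part of the original code.

```python
import re

# Hand-written example in the linearized shape the parser expects.
# This is NOT real model output; the type labels are illustrative.
linearized = "<triplet> Red Hot Chili Peppers <org> Los Angeles <loc> location of formation"


def strip_special_tokens(text: str) -> str:
    """Drop every <...> marker and collapse whitespace (hypothetical helper)."""
    return " ".join(re.sub(r"<[^>]+>", " ", text).split())


# Without the markers, subject, object and relation run together.
flattened = strip_special_tokens(linearized)
print(flattened)

# With the markers, the span boundaries and the entity types are recoverable.
fields = [part.strip() for part in re.split(r"<[^>]+>", linearized) if part.strip()]
markers = re.findall(r"<([^>]+)>", linearized)
print(fields)
print(markers)
```

Once the markers are gone, nothing in the flattened string says where the subject ends and the object begins, which is why the examples in this card decode with the tokenizer directly and keep special tokens.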

## Pipeline usage

```python
from transformers import pipeline

triplet_extractor = pipeline('translation_xx_to_yy', model='Babelscape/mrebel-large', tokenizer='Babelscape/mrebel-large')

# We need to use the tokenizer manually since we need special tokens.
# Change en_XX to the language code of the source text.
outputs = triplet_extractor(
    "The Red Hot Chili Peppers were formed in Los Angeles by Kiedis, Flea, guitarist Hillel Slovak and drummer Jack Irons.",
    decoder_start_token_id=250058,
    src_lang="en_XX",
    tgt_lang="<triplet>",
    return_tensors=True,
    return_text=False,
)
extracted_text = triplet_extractor.tokenizer.batch_decode([outputs[0]["translation_token_ids"]])
print(extracted_text[0])

# Function to parse the generated text and extract the triplets
def extract_triplets_typed(text):
    triplets = []
    relation = ''
    text = text.strip()
    # State: 't' = reading subject, 's' = reading object, 'o' = reading relation
    current = 'x'
    subject, relation, object_, object_type, subject_type = '','','','',''

    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").replace("tp_XX", "").replace("__en__", "").split():
        if token == "<triplet>" or token == "<relation>":
            # A new triplet starts: flush the pending one, then reset the subject
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'head_type': subject_type, 'type': relation.strip(),'tail': object_.strip(), 'tail_type': object_type})
                relation = ''
            subject = ''
        elif token.startswith("<") and token.endswith(">"):
            # Any other <...> token is an entity type marker
            if current == 't' or current == 'o':
                # Type of the subject: what follows is the object
                current = 's'
                if relation != '':
                    triplets.append({'head': subject.strip(), 'head_type': subject_type, 'type': relation.strip(),'tail': object_.strip(), 'tail_type': object_type})
                object_ = ''
                subject_type = token[1:-1]
            else:
                # Type of the object: what follows is the relation
                current = 'o'
                object_type = token[1:-1]
                relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    # Flush the last triplet if it is complete
    if subject != '' and relation != '' and object_ != '' and object_type != '' and subject_type != '':
        triplets.append({'head': subject.strip(), 'head_type': subject_type, 'type': relation.strip(),'tail': object_.strip(), 'tail_type': object_type})
    return triplets

extracted_triplets = extract_triplets_typed(extracted_text[0])
print(extracted_triplets)
```
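
Once triplets are extracted, it can be convenient to index them by subject. The following sketch groups triplet dicts of the shape returned by `extract_triplets_typed` by their head entity. The helper name `group_by_head` and the sample triplets are illustrative assumptions, not part of the original model card and not actual model output.

```python
from collections import defaultdict

# Illustrative triplets in the dict shape returned by extract_triplets_typed.
# These values are made up for the example, not produced by the model.
sample_triplets = [
    {'head': 'Red Hot Chili Peppers', 'head_type': 'org', 'type': 'location of formation',
     'tail': 'Los Angeles', 'tail_type': 'loc'},
    {'head': 'Red Hot Chili Peppers', 'head_type': 'org', 'type': 'has part',
     'tail': 'Flea', 'tail_type': 'per'},
]


def group_by_head(triplets):
    """Map (head, head_type) to a list of (relation, tail, tail_type) tuples."""
    grouped = defaultdict(list)
    for triplet in triplets:
        key = (triplet['head'], triplet['head_type'])
        grouped[key].append((triplet['type'], triplet['tail'], triplet['tail_type']))
    return dict(grouped)


grouped = group_by_head(sample_triplets)
for (head, head_type), edges in grouped.items():
    print(f"{head} ({head_type})")
    for relation, tail, tail_type in edges:
        print(f"  {relation} -> {tail} ({tail_type})")
```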

## Model and Tokenizer using transformers

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

def extract_triplets_typed(text):
    triplets = []
    relation = ''
    text = text.strip()
    # State: 't' = reading subject, 's' = reading object, 'o' = reading relation
    current = 'x'
    subject, relation, object_, object_type, subject_type = '','','','',''

    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").replace("tp_XX", "").replace("__en__", "").split():
        if token == "<triplet>" or token == "<relation>":
            # A new triplet starts: flush the pending one, then reset the subject
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'head_type': subject_type, 'type': relation.strip(),'tail': object_.strip(), 'tail_type': object_type})
                relation = ''
            subject = ''
        elif token.startswith("<") and token.endswith(">"):
            # Any other <...> token is an entity type marker
            if current == 't' or current == 'o':
                # Type of the subject: what follows is the object
                current = 's'
                if relation != '':
                    triplets.append({'head': subject.strip(), 'head_type': subject_type, 'type': relation.strip(),'tail': object_.strip(), 'tail_type': object_type})
                object_ = ''
                subject_type = token[1:-1]
            else:
                # Type of the object: what follows is the relation
                current = 'o'
                object_type = token[1:-1]
                relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    # Flush the last triplet if it is complete
    if subject != '' and relation != '' and object_ != '' and object_type != '' and subject_type != '':
        triplets.append({'head': subject.strip(), 'head_type': subject_type, 'type': relation.strip(),'tail': object_.strip(), 'tail_type': object_type})
    return triplets

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Babelscape/mrebel-large", src_lang="en_XX", tgt_lang="tp_XX")
# Here we set English ("en_XX") as the source language. To change it, swap the first token of the
# input for your desired language, or pass another supported language code as src_lang.
# For Catalan ("ca_XX") or Greek ("el_EL"), which are not included in mBART pretraining, you need a workaround:
# tokenizer._src_lang = "ca_XX"
# tokenizer.cur_lang_code_id = tokenizer.convert_tokens_to_ids("ca_XX")
# tokenizer.set_src_lang_special_tokens("ca_XX")
model = AutoModelForSeq2SeqLM.from_pretrained("Babelscape/mrebel-large")
gen_kwargs = {
    "max_length": 256,
    "length_penalty": 0,
    "num_beams": 3,
    "num_return_sequences": 3,
    "forced_bos_token_id": None,
}

# Text to extract triplets from
text = 'The Red Hot Chili Peppers were formed in Los Angeles by Kiedis, Flea, guitarist Hillel Slovak and drummer Jack Irons.'

# Tokenize text
model_inputs = tokenizer(text, max_length=256, padding=True, truncation=True, return_tensors='pt')

# Generate
generated_tokens = model.generate(
    model_inputs["input_ids"].to(model.device),
    attention_mask=model_inputs["attention_mask"].to(model.device),
    decoder_start_token_id=tokenizer.convert_tokens_to_ids("tp_XX"),
    **gen_kwargs,
)

# Decode, keeping the special tokens that delimit each triplet
decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=False)

# Extract triplets
for idx, sentence in enumerate(decoded_preds):
    print(f'Prediction triplets sentence {idx}')
    print(extract_triplets_typed(sentence))
```
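
Because `num_return_sequences` is 3, the loop above prints one triplet list per returned sequence, and those lists may overlap. A minimal sketch of merging them into a single de-duplicated list follows, assuming the dict shape produced by `extract_triplets_typed`. The helper name `merge_beam_triplets` and the sample data are hypothetical, not part of the original code and not actual model output.

```python
def merge_beam_triplets(per_beam_triplets):
    """Merge triplet lists from several beams, keeping first-seen order and dropping exact duplicates."""
    seen = set()
    merged = []
    for triplets in per_beam_triplets:
        for triplet in triplets:
            key = (triplet['head'], triplet['head_type'], triplet['type'],
                   triplet['tail'], triplet['tail_type'])
            if key not in seen:
                seen.add(key)
                merged.append(triplet)
    return merged


# Made-up per-beam results, standing in for
# [extract_triplets_typed(sentence) for sentence in decoded_preds].
formed_in = {'head': 'Red Hot Chili Peppers', 'head_type': 'org', 'type': 'location of formation',
             'tail': 'Los Angeles', 'tail_type': 'loc'}
has_part = {'head': 'Red Hot Chili Peppers', 'head_type': 'org', 'type': 'has part',
            'tail': 'Flea', 'tail_type': 'per'}
per_beam = [[formed_in], [formed_in, has_part], [has_part]]

merged = merge_beam_triplets(per_beam)
print(merged)
```

Keying on all five fields treats two triplets as the same only when entities, types and relation all match exactly; a looser key (for example ignoring the types) is a design choice that depends on the downstream use.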

## License

This model is licensed under the CC BY-NC-SA 4.0 license. The text of the license can be found [here](https://creativecommons.org/licenses/by-nc-sa/4.0/).
