print(results)
```
202 |
### Benchmarking
|
203 |
Below is a table that highlights the performance of UTC models on the [CrossNER](https://huggingface.co/datasets/DFKI-SLT/cross_ner) dataset. The values represent the Micro F1 scores, with the estimation done at the word level.
|
204 |
|
|
|
199 |
print(results)
|
200 |
```
|
201 |

### How to run with utca

First of all, you need to install the package:

```bash
pip install utca -U
```

After that, you need to create a predictor that will run the UTC model:

```python
from utca.core import (
    AddData,
    RenameAttribute,
    Flush
)
from utca.implementation.predictors import (
    TokenSearcherPredictor, TokenSearcherPredictorConfig
)
from utca.implementation.tasks import (
    TokenSearcherNER,
    TokenSearcherNERPostprocessor,
)

predictor = TokenSearcherPredictor(
    TokenSearcherPredictorConfig(
        device="cuda:0",  # or "cpu" if no GPU is available
        model="knowledgator/UTC-DeBERTa-small-v2"
    )
)
```

For the NER task, you should create the following pipeline:

```python
ner_task = TokenSearcherNER(
    predictor=predictor,
    postprocess=[TokenSearcherNERPostprocessor(
        threshold=0.5
    )]
)

pipeline = (
    AddData({"labels": ["scientist", "university", "city"]})
    | ner_task
    | Flush(keys=["labels"])
    | RenameAttribute("output", "entities")
)
```
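The `threshold=0.5` passed to the postprocessor drops low-confidence predictions. As a self-contained sketch of that filtering step (the entity dicts and `score` field here are illustrative, not utca's actual data structures):

```python
def filter_by_threshold(entities, threshold=0.5):
    """Keep only predictions whose confidence score reaches the threshold."""
    return [e for e in entities if e["score"] >= threshold]

# Hypothetical raw model output: candidate spans with confidence scores.
raw = [
    {"span": "Paul Hammond", "entity": "scientist", "score": 0.97},
    {"span": "Johns Hopkins University", "entity": "university", "score": 0.88},
    {"span": "Nature", "entity": "city", "score": 0.12},
]

print(filter_by_threshold(raw))  # The low-confidence "Nature" candidate is dropped.
```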

After that, you can pass your text to the pipeline and run it:

```python
res = pipeline.run({
    "text": """Dr. Paul Hammond, a renowned neurologist at Johns Hopkins University, has recently published a paper in the prestigious journal "Nature Neuroscience".
His research focuses on a rare genetic mutation, found in less than 0.01% of the population, that appears to prevent the development of Alzheimer's disease. Collaborating with researchers at the University of California, San Francisco, the team is now working to understand the mechanism by which this mutation confers its protective effect.
Funded by the National Institutes of Health, their research could potentially open new avenues for Alzheimer's treatment."""
})
```
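The `|` operator above chains components that each read and extend a shared data dict: `AddData` injects keys, the task writes its result under `"output"`, `Flush` removes keys, and `RenameAttribute` renames one. A minimal self-contained sketch of that composition pattern (not utca's actual classes):

```python
class Step:
    """Dict-in/dict-out pipeline component chained with `|` (illustrative only)."""

    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        # Compose two steps into one: run self, then other.
        return Step(lambda data: other.fn(self.fn(data)))

    def run(self, data):
        return self.fn(dict(data))

add_data = Step(lambda d: {**d, "labels": ["scientist", "university", "city"]})
fake_ner = Step(lambda d: {**d, "output": [{"span": "Paul Hammond"}]})  # stand-in for the NER task
flush_labels = Step(lambda d: {k: v for k, v in d.items() if k != "labels"})
rename = Step(lambda d: {**{k: v for k, v in d.items() if k != "output"}, "entities": d["output"]})

pipeline = add_data | fake_ner | flush_labels | rename
print(pipeline.run({"text": "..."}))  # {'text': '...', 'entities': [{'span': 'Paul Hammond'}]}
```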

To use `utca` for relation extraction, construct the following pipeline:

```python
from utca.implementation.tasks import (
    TokenSearcherNER,
    TokenSearcherNERPostprocessor,
    TokenSearcherRelationExtraction,
    TokenSearcherRelationExtractionPostprocessor,
)

pipe = (
    TokenSearcherNER(  # TokenSearcherNER produces classified entities under the "output" key.
        predictor=predictor,
        postprocess=TokenSearcherNERPostprocessor(
            threshold=0.5  # Entity threshold
        )
    )
    | RenameAttribute("output", "entities")  # Rename the entities from TokenSearcherNER so they can be used as inputs to TokenSearcherRelationExtraction
    | TokenSearcherRelationExtraction(  # TokenSearcherRelationExtraction performs the relation extraction.
        predictor=predictor,
        postprocess=TokenSearcherRelationExtractionPostprocessor(
            threshold=0.5  # Relation threshold
        )
    )
)
```

To run the pipeline, you need to specify parameters for entities and relations:

```python
r = pipe.run({
    "text": text,  # Text to process
    "labels": [  # Labels used by TokenSearcherNER for entity extraction
        "scientist",
        "university",
        "city",
        "research",
        "journal",
    ],
    "relations": [{  # Relation parameters
        "relation": "published at",  # Relation label. Required parameter.
        "pairs_filter": [("scientist", "journal")],  # Optional parameter. It restricts relation members by their entity labels:
        # here, "scientist" is the entity label of the source and "journal" is the entity label of the target.
        # If provided, only the specified pairs will be returned.
    }, {
        "relation": "worked at",
        "pairs_filter": [("scientist", "university"), ("scientist", "other")],
        "distance_threshold": 100,  # Optional parameter. Maximum distance between the two spans in the text
        # (from the end of the span that appears first to the start of the next one).
    }]
})

print(r["output"])
```
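To make the two optional parameters concrete, here is a self-contained sketch of the filtering semantics described above (hypothetical entity dicts with `entity`, `start`, and `end` fields; this mirrors the documented behaviour, not utca's internal implementation):

```python
def allowed_pair(source, target, pairs_filter=None, distance_threshold=None):
    """Decide whether (source, target) is a valid candidate relation."""
    if pairs_filter is not None and (source["entity"], target["entity"]) not in pairs_filter:
        return False
    if distance_threshold is not None:
        # Gap between the span that ends first and the start of the other span.
        gap = max(source["start"], target["start"]) - min(source["end"], target["end"])
        if gap > distance_threshold:
            return False
    return True

scientist = {"entity": "scientist", "start": 4, "end": 16}
university = {"entity": "university", "start": 44, "end": 68}
journal = {"entity": "journal", "start": 300, "end": 320}

print(allowed_pair(scientist, university, [("scientist", "university")], 100))  # True
print(allowed_pair(scientist, journal, [("scientist", "university")]))          # False: pair not allowed
print(allowed_pair(scientist, journal, [("scientist", "journal")], 100))        # False: spans too far apart
```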

### Benchmarking
Below is a table that highlights the performance of UTC models on the [CrossNER](https://huggingface.co/datasets/DFKI-SLT/cross_ner) dataset. The values represent the Micro F1 scores, with the estimation done at the word level.