---
library_name: transformers
tags: [disaster management, twitter]
---

# Disaster-Twitter-XLM-RoBERTa-AL

This is a multilingual [Twitter-XLM-RoBERTa-base model](https://huggingface.co/cardiffnlp/twitter-xlm-roberta-base) fine-tuned to identify disaster-related tweets. It was trained in two steps. First, we fine-tuned the model on 179,391 labelled tweets from [CrisisLex](https://crisislex.org/) in English, Spanish, German, French and Italian. We then fine-tuned it further on data from the 2021 Ahr Valley flood in Germany and the 2023 Chile forest fires, using a greedy coreset active learning approach.

- Paper: [Active Learning for Identifying Disaster-Related Tweets: A Comparison with Keyword Filtering and Generic Fine-Tuning](https://link.springer.com/chapter/10.1007/978-3-031-66428-1_8)

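The greedy coreset idea behind the second training step can be illustrated with a generic k-center greedy selection. This is only a sketch and not the authors' implementation: the function name `greedy_coreset`, the Euclidean distance, and the toy 2D points standing in for tweet embeddings are assumptions made for illustration. The paper describes the actual setup.

```python
import numpy as np


def greedy_coreset(embeddings: np.ndarray, labelled_idx: list[int], budget: int) -> list[int]:
    """Select up to `budget` unlabelled points via k-center greedy (illustrative sketch).

    Each step picks the point that is farthest from everything already
    labelled or selected, so the selection spreads over the embedding space.
    """
    n = embeddings.shape[0]
    if labelled_idx:
        # distance from every point to its nearest labelled point
        centers = embeddings[labelled_idx]
        dists = np.linalg.norm(embeddings[:, None, :] - centers[None, :, :], axis=2)
        min_dist = dists.min(axis=1)
        min_dist[labelled_idx] = -np.inf  # never re-select labelled points
    else:
        min_dist = np.full(n, np.inf)

    selected: list[int] = []
    for _ in range(min(budget, n - len(labelled_idx))):
        idx = int(np.argmax(min_dist))
        selected.append(idx)
        # the new point becomes a center: shrink distances accordingly
        new_dist = np.linalg.norm(embeddings - embeddings[idx], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
        min_dist[idx] = -np.inf  # never re-select the same point
    return selected


# Toy data (assumption): 2D points in place of real tweet embeddings
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [10.0, 0.0]])
print(greedy_coreset(points, labelled_idx=[0], budget=2))  # [3, 2]
```

In an active learning loop, the selected indices would be sent for annotation and added to the labelled pool before the next fine-tuning round.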
## Labels
The model assigns one of the following two labels to a short text:
- `LABEL_0`: NOT disaster-related
- `LABEL_1`: Disaster-related

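Because the label ids are generic, downstream code may want to map them to descriptive names. The following is a minimal sketch under that assumption: `LABEL_NAMES` and `to_readable` are hypothetical names that are not part of the model repository, and the example prediction is hand-written rather than real model output.

```python
# Hypothetical helper: names and wording are illustrative only.
LABEL_NAMES = {
    "LABEL_0": "not disaster-related",
    "LABEL_1": "disaster-related",
}


def to_readable(predictions: list[dict]) -> list[dict]:
    """Replace generic label ids in pipeline-style output with descriptive names."""
    return [
        {"label": LABEL_NAMES.get(p["label"], p["label"]), "score": p["score"]}
        for p in predictions
    ]


# Hand-written prediction in the list-of-dicts format of a text-classification pipeline
print(to_readable([{"label": "LABEL_1", "score": 0.98}]))
```

Unknown labels are passed through unchanged, so the helper does not fail if the label set differs.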
## Example Pipeline
```python
from transformers import pipeline

MODEL_NAME = 'hannybal/disaster-twitter-xlm-roberta-al'
# load the fine-tuned classifier together with the tokenizer of the base model
classifier = pipeline('text-classification', model=MODEL_NAME, tokenizer='cardiffnlp/twitter-xlm-roberta-base')
classifier('I can see fire and smoke from the nearby fire!')
```

Output:
```
[{'label': 'LABEL_1', 'score': 0.9967854022979736}]
```

## Full Classification Example

```python
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer, AutoConfig
import numpy as np
from scipy.special import softmax


def preprocess(text: str) -> str:
    """Pre-process texts by replacing usernames and links with placeholders."""
    new_text: list[str] = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)


MODEL_NAME = 'hannybal/disaster-twitter-xlm-roberta-al'

# tokenizer from the base model; config and weights from the fine-tuned model
tokenizer = AutoTokenizer.from_pretrained('cardiffnlp/twitter-xlm-roberta-base')
config = AutoConfig.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

# example classification
text = "Das ist alles, was von meinem Keller noch übrig ist... #flood #ahr @ Bad Neuenahr-Ahrweiler https://t.co/C68fBaKZWR"
text = preprocess(text)
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
# logits of the first (and only) text, converted to probabilities
scores = output[0][0].detach().numpy()
scores = softmax(scores)

# print labels and their respective scores, highest score first
ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    l = config.id2label[int(ranking[i])]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")
```

Output:
```
1) LABEL_1 0.9999
2) LABEL_0 0.0001
```

## Reference
```
@inproceedings{Hanny.2024a,
  title = {Active {{Learning}} for~{{Identifying Disaster-Related Tweets}}: {{A Comparison}} with~{{Keyword Filtering}} and~{{Generic Fine-Tuning}}},
  shorttitle = {Active {{Learning}} for~{{Identifying Disaster-Related Tweets}}},
  booktitle = {Intelligent {{Systems}} and {{Applications}}},
  author = {Hanny, David and Schmidt, Sebastian and Resch, Bernd},
  editor = {Arai, Kohei},
  year = {2024},
  pages = {126--142},
  publisher = {Springer Nature Switzerland},
  address = {Cham},
  doi = {10.1007/978-3-031-66428-1_8},
  isbn = {978-3-031-66428-1},
  langid = {english}
}
```

## Acknowledgements
This work has received funding from the European Commission - European Union under HORIZON EUROPE (HORIZON Research and Innovation Actions) as part of the [TEMA project](https://tema-project.eu/) (grant agreement 101093003; HORIZON-CL4-2022-DATA-01-01). This work has also received funding from the Austrian Federal Ministry for Climate Action, Environment, Energy, Mobility, Innovation and Technology (BMK) project GeoSHARING (Grant Number 878652).