File size: 3,311 Bytes
508b2b9 cac624f 508b2b9 cac624f ec7fcb5 cac624f 508b2b9 cac624f 508b2b9 cac624f 508b2b9 cac624f 508b2b9 cac624f 508b2b9 cac624f 508b2b9 cac624f 508b2b9 cac624f 508b2b9 cac624f 508b2b9 cac624f 508b2b9 cac624f 508b2b9 cac624f 508b2b9 ec7fcb5 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 |
# tweet-topic-latest-multi
This is a RoBERTa-base model trained on 168.86M tweets until the end of September 2022 and finetuned for multi-label topic classification on a corpus of 11,267 [tweets](https://huggingface.co/datasets/cardiffnlp/tweet_topic_multi).
The original RoBERTa-base model can be found [here](https://huggingface.co/cardiffnlp/twitter-roberta-base-sep2022). This model is suitable for English.
- Reference Paper: [TweetTopic](https://arxiv.org/abs/2209.09824) (COLING 2022).
<b>Labels</b>:
| <span style="font-weight:normal">0: arts_&_culture</span> | <span style="font-weight:normal">5: fashion_&_style</span> | <span style="font-weight:normal">10: learning_&_educational</span> | <span style="font-weight:normal">15: science_&_technology</span> |
|-----------------------------|---------------------|----------------------------|--------------------------|
| 1: business_&_entrepreneurs | 6: film_tv_&_video | 11: music | 16: sports |
| 2: celebrity_&_pop_culture | 7: fitness_&_health | 12: news_&_social_concern | 17: travel_&_adventure |
| 3: diaries_&_daily_life | 8: food_&_dining | 13: other_hobbies | 18: youth_&_student_life |
| 4: family | 9: gaming | 14: relationships | |
## Full classification example
```python
from transformers import AutoModelForSequenceClassification, TFAutoModelForSequenceClassification
from transformers import AutoTokenizer
import numpy as np
from scipy.special import expit
MODEL = f"cardiffnlp/tweet-topic-latest-multi"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
# PT
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
class_mapping = model.config.id2label
text = "It is great to see athletes promoting awareness for climate change."
tokens = tokenizer(text, return_tensors='pt')
output = model(**tokens)
scores = output[0][0].detach().numpy()
scores = expit(scores)
predictions = (scores >= 0.5) * 1
# TF
#tf_model = TFAutoModelForSequenceClassification.from_pretrained(MODEL)
#class_mapping = tf_model.config.id2label
#text = "It is great to see athletes promoting awareness for climate change."
#tokens = tokenizer(text, return_tensors='tf')
#output = tf_model(**tokens)
#scores = output[0][0]
#scores = expit(scores)
#predictions = (scores >= 0.5) * 1
# Map to classes
for i in range(len(predictions)):
if predictions[i]:
print(class_mapping[i])
```
Output:
```
fitness_&_health
news_&_social_concern
sports
```
### BibTeX entry and citation info
Please cite the [reference paper](https://aclanthology.org/2022.coling-1.299/) if you use this model.
```bibtex
@inproceedings{antypas-etal-2022-twitter,
title = "{T}witter Topic Classification",
author = "Antypas, Dimosthenis and
Ushio, Asahi and
Camacho-Collados, Jose and
Silva, Vitor and
Neves, Leonardo and
Barbieri, Francesco",
booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
month = oct,
year = "2022",
address = "Gyeongju, Republic of Korea",
publisher = "International Committee on Computational Linguistics",
url = "https://aclanthology.org/2022.coling-1.299",
pages = "3386--3400"
}
```
|