---
language: multilingual
widget:
- text: "🤗"
- text: "T'estimo! ❤️"
- text: "I love you!"
- text: "I hate you 🤮"
- text: "Mahal kita!"
- text: "사랑해!"
- text: "난 너가 싫어"
- text: "😍😍😍"
---


# twitter-XLM-roBERTa-base for Sentiment Analysis

This is an XLM-roBERTa-base model trained on ~198M tweets and fine-tuned for sentiment analysis. The sentiment fine-tuning was done on 8 languages (Ar, En, Fr, De, Hi, It, Sp, Pt), but the model can be used for other languages as well (see the paper for details).

- Paper: [XLM-T: A Multilingual Language Model Toolkit for Twitter](https://arxiv.org/abs/2104.12250). 
- Git Repo: [XLM-T official repository](https://github.com/cardiffnlp/xlm-t).

## Example Pipeline
```python
from transformers import pipeline
model_path = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
sentiment_task = pipeline("sentiment-analysis", model=model_path, tokenizer=model_path)
sentiment_task("T'estimo!")
```
```
[{'label': 'Positive', 'score': 0.6600581407546997}]
```

## Full classification example

```python
from transformers import AutoModelForSequenceClassification
# from transformers import TFAutoModelForSequenceClassification  # uncomment for the TF variant below
from transformers import AutoTokenizer, AutoConfig
import numpy as np
from scipy.special import softmax

# Preprocess text (username and link placeholders)
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

MODEL = "cardiffnlp/twitter-xlm-roberta-base-sentiment"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
config = AutoConfig.from_pretrained(MODEL)

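# PT
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
# Optional: keep a local copy. Save the tokenizer as well, otherwise the local
# directory named after MODEL has no tokenizer files on the next load.
model.save_pretrained(MODEL)
tokenizer.save_pretrained(MODEL)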

text = "Good night 😊"
text = preprocess(text)
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
scores = output[0][0].detach().numpy()
scores = softmax(scores)

# # TF
# model = TFAutoModelForSequenceClassification.from_pretrained(MODEL)
# model.save_pretrained(MODEL)

# text = "Good night 😊"
# encoded_input = tokenizer(text, return_tensors='tf')
# output = model(encoded_input)
# scores = output[0][0].numpy()
# scores = softmax(scores)

# Print labels and scores
# Sort class indices by score, highest first
ranking = np.argsort(scores)[::-1]
for i, idx in enumerate(ranking):
    label = config.id2label[int(idx)]
    score = scores[idx]
    print(f"{i+1}) {label} {np.round(float(score), 4)}")

```

Output: 

```
1) Positive 0.7673
2) Neutral 0.2015
3) Negative 0.0313
```
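
The two steps that surround the model call in the example above, placeholder preprocessing and turning raw logits into a ranked list, can be tried without downloading the model. The following is a minimal sketch using only the standard library. The sample tweet and the `logits` values are made up for illustration, the `id2label` order (Negative, Neutral, Positive) is an assumption that should be checked against `config.id2label`, and `rank_labels` is a hypothetical helper rather than part of `transformers`.

```python
import math

def preprocess(text):
    """Replace user mentions and links with the placeholders used in the example above."""
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rank_labels(logits, id2label):
    """Hypothetical helper: return (label, probability) pairs, most likely first."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in order]

# Assumed label order; check config.id2label for the real mapping.
id2label = {0: "Negative", 1: "Neutral", 2: "Positive"}

# Made-up tweet and logits, for illustration only.
tweet = "@alice check this out https://example.com Good night"
print(preprocess(tweet))  # @user check this out http Good night

logits = [-1.0, 0.5, 2.0]
for rank, (label, prob) in enumerate(rank_labels(logits, id2label), start=1):
    print(f"{rank}) {label} {round(prob, 4)}")
```

Replacing mentions and links with fixed placeholders keeps the input close to what the model presumably saw during training, which is why the full example applies `preprocess` before tokenizing.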