---
license: mit
datasets:
  - jhu-clsp/bernice-pretrain-data
language: 
  - en
  - es
  - pt
  - ja
  - ar
  - in
  - ko
  - tr
  - fr
  - tl
  - ru
  - und
  - it
  - th
  - de
  - hi
  - pl
  - nl
  - fa
  - et
  - ht
  - ur
  - sv
  - ca
  - el
  - fi
  - cs
  - iw
  - da
  - vi
  - zh
  - ta
  - ro
  - "no"
  - uk
  - cy
  - ne
  - hu
  - eu
  - sl
  - lv
  - lt
  - bn
  - sr
  - bg
  - mr
  - ml
  - is
  - te
  - gu
  - kn
  - ps
  - ckb
  - si
  - hy
  - or
  - pa
  - am
  - sd
  - my
  - ka
  - km
  - dv
  - lo
  - ug
  - bo
---

# Bernice

Bernice is a multilingual pre-trained encoder trained exclusively on Twitter data.
The model was released with the EMNLP 2022 paper 
[*Bernice: A Multilingual Pre-trained Encoder for Twitter*](https://aclanthology.org/2022.emnlp-main.415/) by 
Alexandra DeLucia, Shijie Wu, Aaron Mueller, Carlos Aguirre, Mark Dredze, and Philip Resnik.

Please reach out to Alexandra DeLucia (aadelucia at jhu.edu) or open an issue if you have questions.

# Model description
The language of Twitter differs significantly from that of other domains commonly included in large language model training. 
While tweets are typically multilingual and contain informal language, including emoji and hashtags, most pre-trained 
language models for Twitter are either monolingual, adapted from other domains rather than trained exclusively on Twitter, 
or are trained on a limited amount of in-domain Twitter data. We introduce Bernice, the first multilingual RoBERTa language 
model trained from scratch on 2.5 billion tweets with a custom tweet-focused tokenizer. We evaluate on a variety of monolingual 
and multilingual Twitter benchmarks, finding that our model consistently exceeds or matches the performance of a variety of models 
adapted to social media data as well as strong multilingual baselines, despite being trained on less data overall. We posit that it is 
more efficient compute- and data-wise to train completely on in-domain data with a specialized domain-specific tokenizer.
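
For a quick, illustrative look at the tweet-focused tokenizer, the snippet below segments a short informal string containing a hashtag and an emoji. The input string is made up, and the exact subwords depend on the released vocabulary.

```python
from transformers import AutoTokenizer

# Illustrative only: inspect how the tweet-focused tokenizer segments
# informal text with a hashtag and an emoji. The example string is made up,
# and the exact subwords depend on the released vocabulary.
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/bernice")
print(tokenizer.tokenize("Feeling 🔥 about the #SuperMarioBrosMovie trailer!"))
```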

## Training data
2.5 billion tweets with 56 billion subwords in 66 languages (as identified in Twitter metadata). 
The tweets were collected from the 1% public Twitter stream between January 2016 and December 2021.
See [Bernice pretrain dataset](https://huggingface.co/datasets/jhu-clsp/bernice-pretrain-data) for details.
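
The snippet below is a minimal sketch of how one might inspect that dataset with the `datasets` library; it assumes a streamable `train` split and makes no assumption about the field names.

```python
from datasets import load_dataset

# Minimal sketch; assumes the dataset exposes a streamable "train" split.
# No field names are assumed: we just inspect the keys of the first record.
ds = load_dataset("jhu-clsp/bernice-pretrain-data", split="train", streaming=True)
first_record = next(iter(ds))
print(list(first_record.keys()))
```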

## Training procedure
RoBERTa pre-training (i.e., masked language modeling) with a BERT-base architecture.
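
As a rough, illustrative sketch (not the exact Bernice pretraining setup), a RoBERTa masked-language-modeling model with BERT-base dimensions can be instantiated around the released tokenizer as follows; the sequence length and other hyperparameters here are assumptions.

```python
from transformers import AutoTokenizer, RobertaConfig, RobertaForMaskedLM

# Rough illustration only; not the exact Bernice pretraining configuration.
# BERT-base dimensions paired with the released Bernice tokenizer; the
# sequence length and other hyperparameters are assumptions.
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/bernice", model_max_length=128)
config = RobertaConfig(
    vocab_size=len(tokenizer),
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=tokenizer.model_max_length + 2,  # RoBERTa position offset
)
model = RobertaForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")
```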

## Evaluation results
We evaluated Bernice on three Twitter benchmarks: [TweetEval](https://aclanthology.org/2020.findings-emnlp.148/), the [Unified Multilingual Sentiment Analysis Benchmark (UMSAB)](https://aclanthology.org/2022.lrec-1.27/), and [Multilingual Hate Speech](https://link.springer.com/chapter/10.1007/978-3-030-67670-4_26). Summary results are shown below; see the paper appendix for details.

|        | **Bernice** | **BERTweet** | **XLM-R** | **XLM-T** | **TwHIN-BERT-MLM** | **TwHIN-BERT** |
|---------|-------------|--------------|-----------|-----------|--------------------|----------------|
| TweetEval | 64.80       | **67.90**    | 57.60     | 64.40     | 64.80              | 63.10          |
| UMSAB   | **70.34**   | -            | 67.71     | 66.74     | 68.10              | 67.53          |
| Hate Speech | **76.20**   | -            | 74.54     | 73.31     | 73.41              | 74.32          |


# How to use
You can use this model to obtain tweet representations. To use it with the Hugging Face PyTorch interface, first replace user handles and URLs with the `@USER` and `HTTPURL` placeholders, as shown below:

```python
from transformers import AutoTokenizer, AutoModel
import torch
import re

# Load model and tokenizer
model = AutoModel.from_pretrained("jhu-clsp/bernice")
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/bernice", model_max_length=128)

# Data
raw_tweets = [
  "So, Nintendo and Illimination's upcoming animated #SuperMarioBrosMovie is reportedly titled 'The Super Mario Bros. Movie'. Alrighty. :)",
  "AMLO se vio muy indignado porque propusieron al presidente de Ucrania para el premio nobel de la paz. ¿Qué no hay otros que luchen por la paz? ¿Acaso se quería proponer él?"
]

# Pre-process tweets for the tokenizer: replace user handles and URLs with placeholders
URL_RE = re.compile(r"https?:\/\/[\w\.\/\?\=\d&#%_:/-]+")
HANDLE_RE = re.compile(r"@\w+")
tweets = []
for t in raw_tweets:
  t = HANDLE_RE.sub("@USER", t)
  t = URL_RE.sub("HTTPURL", t)
  tweets.append(t)

# Tokenize and encode
inputs = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
  outputs = model(**inputs)
embeddings = outputs.last_hidden_state  # (batch_size, seq_len, hidden_size)
```
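
If a single vector per tweet is needed, one common convention (not prescribed by the paper) is to mean-pool the token embeddings while ignoring padding positions:

```python
# Mean-pool token embeddings into a single vector per tweet, ignoring padding.
# This pooling choice is a common convention, not something the paper prescribes.
mask = inputs["attention_mask"].unsqueeze(-1).float()             # (batch, seq_len, 1)
tweet_vectors = (embeddings * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, hidden_size)
print(tweet_vectors.shape)  # torch.Size([2, 768]) for a BERT-base-sized model
```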


# Limitations and bias

**Presence of Hate Speech:** As with all social media data, spam and hate speech are present. 
We cleaned our data by filtering on tweet length, but some spam likely remains. 
Hate speech is difficult to detect, especially across languages and cultures, so we leave its removal for future work.

**Low-resource Language Evaluation:** Even with language sampling during training,
Bernice is not exposed to the same variety of examples in low-resource languages as in high-resource languages like English and Spanish. 
It is unclear whether enough Twitter data exists in languages such as Tibetan and Telugu to ever match the performance on high-resource languages. 
Only models that generalize more efficiently can pave the way for better performance across the wide variety of languages in this low-resource category.

See the paper for a more detailed discussion.


## BibTeX entry and citation info
```
@inproceedings{delucia-etal-2022-bernice,
    title = "Bernice: A Multilingual Pre-trained Encoder for {T}witter",
    author = "DeLucia, Alexandra  and
      Wu, Shijie  and
      Mueller, Aaron  and
      Aguirre, Carlos  and
      Resnik, Philip  and
      Dredze, Mark",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-main.415",
    pages = "6191--6205",
    abstract = "The language of Twitter differs significantly from that of other domains commonly included in large language model training. While tweets are typically multilingual and contain informal language, including emoji and hashtags, most pre-trained language models for Twitter are either monolingual, adapted from other domains rather than trained exclusively on Twitter, or are trained on a limited amount of in-domain Twitter data. We introduce Bernice, the first multilingual RoBERTa language model trained from scratch on 2.5 billion tweets with a custom tweet-focused tokenizer. We evaluate on a variety of monolingual and multilingual Twitter benchmarks, finding that our model consistently exceeds or matches the performance of a variety of models adapted to social media data as well as strong multilingual baselines, despite being trained on less data overall. We posit that it is more efficient compute- and data-wise to train completely on in-domain data with a specialized domain-specific tokenizer.",
}
```