## Training data
2.5 billion tweets with 56 billion subwords in 66 languages (as identified in Twitter metadata).
The tweets were collected from the 1% public Twitter stream between January 2016 and December 2021.
See the [Bernice pretrain dataset](https://huggingface.co/datasets/jhu-clsp/bernice-pretrain-data) for details.
|

## Training procedure
RoBERTa pre-training (i.e., masked language modeling) with a BERT-base architecture.
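For orientation, the masked language modeling objective can be sketched in PyTorch; the 15% masking rate and the mask token id below are illustrative assumptions, not confirmed training settings:

```python
import torch

# Toy sketch of the MLM objective: mask ~15% of token positions and train
# the encoder to recover the originals. The 15% rate and mask id 4 are
# illustrative assumptions, not Bernice's actual configuration.
torch.manual_seed(0)
input_ids = torch.randint(5, 1000, (1, 12))  # fake subword ids
labels = input_ids.clone()

mask = torch.rand(input_ids.shape) < 0.15  # positions to mask
labels[~mask] = -100                       # loss is computed only on masked tokens
input_ids[mask] = 4                        # replace masked positions with [MASK]

# `input_ids` feeds the encoder; cross-entropy against `labels` trains the
# model to predict the original tokens at the masked positions.
```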

## Evaluation results
We evaluated Bernice on three Twitter benchmarks: [TweetEval](https://aclanthology.org/2020.findings-emnlp.148/), [Unified Multilingual Sentiment Analysis Benchmark (UMSAB)](https://aclanthology.org/2022.lrec-1.27/), and [Multilingual Hate Speech](https://link.springer.com/chapter/10.1007/978-3-030-67670-4_26). Summary results are shown below; see the paper appendix for details.

|
|             | **Bernice** | **BERTweet** | **XLM-R** | **XLM-T** | **TwHIN-BERT-MLM** | **TwHIN-BERT** |
|-------------|-------------|--------------|-----------|-----------|--------------------|----------------|
| TweetEval   | 64.80       | **67.90**    | 57.60     | 64.40     | 64.80              | 63.10          |
| UMSAB       | **70.34**   | -            | 67.71     | 66.74     | 68.10              | 67.53          |
| Hate Speech | **76.20**   | -            | 74.54     | 73.31     | 73.41              | 74.32          |

# How to use
You can use this model for tweet representation. To use it with the Hugging Face PyTorch interface:
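A minimal sketch of such usage, assuming the model is published under the `jhu-clsp/bernice` repository id (the same namespace as the pretraining dataset above) and using mean-pooled last hidden states as the tweet representation:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Repository id assumed from the dataset namespace above; adjust if needed.
model_name = "jhu-clsp/bernice"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

tweets = ["I love this!", "Qué día tan bonito"]
inputs = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool token embeddings, ignoring padding, to get one vector per tweet.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
```

Mean pooling over non-padding tokens is one common choice for sentence-level representations; the first-token (`<s>`) embedding is a reasonable alternative.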

# Limitations and bias

**Presence of Hate Speech:** As with all social media data, spam and hate speech are present.
We cleaned our data by filtering on tweet length, but the possibility of spam remains.
Hate speech is difficult to detect, especially across languages and cultures, so we leave its removal for future work.

**Low-resource Language Evaluation:** Even with language sampling during training,
Bernice is not exposed to the same variety of examples in low-resource languages as in high-resource languages like English and Spanish.
It is unclear whether enough Twitter data exists in low-resource languages, such as Tibetan and Telugu, to ever match the performance on high-resource languages.
Only models that generalize more efficiently can pave the way for better performance across the wide variety of languages in this low-resource category.

See the paper for a more detailed discussion.

## BibTeX entry and citation info
```