Update README.md
---
license: mit
---

# Bernice

Bernice is a multilingual pre-trained encoder exclusively for Twitter data.
The model was released with the EMNLP 2022 paper *Bernice: A Multilingual Pre-trained Encoder for Twitter* by Alexandra DeLucia, Shijie Wu, Aaron Mueller, Carlos Aguirre, Mark Dredze, and Philip Resnik.

This model card will contain more information *soon*. Please reach out to Alexandra DeLucia (aadelucia at jhu.edu) or open an issue if there are questions.

# Model description
TBD

## Training data
TBD

## Training procedure
TBD

## Evaluation results
TBD

# How to use
You can use this model to compute tweet representations. To use it with the Hugging Face Transformers PyTorch interface:

```python
import re

import torch
from transformers import AutoTokenizer, AutoModel

# Load model and tokenizer.
# "bernice" must resolve to the checkpoint: a local directory or the model's full Hub ID.
model = AutoModel.from_pretrained("bernice")
tokenizer = AutoTokenizer.from_pretrained("bernice", model_max_length=128)
model.eval()

# Data
raw_tweets = [
    "So, Nintendo and Illimination's upcoming animated #SuperMarioBrosMovie is reportedly titled 'The Super Mario Bros. Movie'. Alrighty. :)",
    "AMLO se vio muy indignado porque propusieron al presidente de Ucrania para el premio nobel de la paz. ¿Qué no hay otros que luchen por la paz? ¿Acaso se quería proponer él?"
]

# Pre-process tweets for the tokenizer: mask user handles and URLs
URL_RE = re.compile(r"https?:\/\/[\w\.\/\?\=\d&#%_:/-]+")
HANDLE_RE = re.compile(r"@\w+")
tweets = []
for t in raw_tweets:
    t = HANDLE_RE.sub("@USER", t)
    t = URL_RE.sub("HTTPURL", t)
    tweets.append(t)

# Tokenize the batch, then run the encoder without tracking gradients
inputs = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Token-level embeddings, shape (batch, sequence_length, hidden_size)
embeddings = outputs.last_hidden_state
```
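The pre-processing step in the snippet above can also be factored into a small helper, which makes it easy to check on sample strings before tokenizing. The following is a minimal sketch using only the standard library. The function name `normalize_tweet` and the sample string are hypothetical; the regular expressions and the `@USER` / `HTTPURL` placeholders are copied from the snippet above.

```python
import re

# Patterns copied from the usage snippet; the helper name is hypothetical.
URL_RE = re.compile(r"https?:\/\/[\w\.\/\?\=\d&#%_:/-]+")
HANDLE_RE = re.compile(r"@\w+")


def normalize_tweet(text: str) -> str:
    """Replace user handles with @USER and URLs with HTTPURL."""
    text = HANDLE_RE.sub("@USER", text)
    text = URL_RE.sub("HTTPURL", text)
    return text


if __name__ == "__main__":
    # Illustrative input only, not taken from the training data.
    sample = "Thanks @jack! Details at https://example.com/a?b=1 :)"
    print(normalize_tweet(sample))
```

Note that `@\w+` matches any `@` followed by word characters, so the domain part of an e-mail address would also be rewritten to `@USER`.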

# Limitations and bias
TBD

## BibTeX entry and citation info
TBD