---
license: mit
---

# Bernice

Bernice is a multilingual pre-trained encoder trained exclusively on Twitter data. 
The model was released with the EMNLP 2022 paper *Bernice: A Multilingual Pre-trained Encoder for Twitter* by Alexandra DeLucia, Shijie Wu, Aaron Mueller, Carlos Aguirre, Mark Dredze, and Philip Resnik.

This model card will contain more information *soon*. Please reach out to Alexandra DeLucia (aadelucia at jhu.edu) or open an issue if you have questions.

# Model description
TBD

## Training data
TBD

## Training procedure
TBD

## Evaluation results
TBD

# How to use
You can use this model to obtain tweet representations. To use it with the Hugging Face PyTorch interface:

```python
from transformers import AutoTokenizer, AutoModel
import torch
import re

# Load model and tokenizer
model = AutoModel.from_pretrained("bernice")
tokenizer = AutoTokenizer.from_pretrained("bernice", model_max_length=128)

# Data
raw_tweets = [
  "So, Nintendo and Illimination's upcoming animated #SuperMarioBrosMovie is reportedly titled 'The Super Mario Bros. Movie'. Alrighty. :)",
  "AMLO se vio muy indignado porque propusieron al presidente de Ucrania para el premio nobel de la paz. ¿Qué no hay otros que luchen por la paz? ¿Acaso se quería proponer él?"
]

# Pre-process tweets for the tokenizer: mask user handles and URLs
URL_RE = re.compile(r"https?:\/\/[\w\.\/\?\=\d&#%_:/-]+")
HANDLE_RE = re.compile(r"@\w+")
tweets = []
for t in raw_tweets:
  t = HANDLE_RE.sub("@USER", t)
  t = URL_RE.sub("HTTPURL", t)
  tweets.append(t)

# Tokenize, then run the encoder
inputs = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
  outputs = model(**inputs)
embeddings = outputs.last_hidden_state  # (batch, seq_len, hidden_dim)
```
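The encoder returns one vector per token. To get a single fixed-size vector per tweet, one common approach (an assumption here, not a method specified by the paper) is attention-masked mean pooling over the last hidden states, so padding tokens do not dilute the average. A minimal sketch with a dummy tensor in place of real model output:

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, ignoring padding positions.

    last_hidden_state: (batch, seq_len, hidden)
    attention_mask:    (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).float()     # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)  # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)        # (batch, 1), avoid div by 0
    return summed / counts

# Dummy example: batch of 2 sequences, 4 tokens each, hidden size 3
hidden = torch.ones(2, 4, 3)
mask = torch.tensor([[1, 1, 0, 0], [1, 1, 1, 1]])
pooled = mean_pool(hidden, mask)
print(pooled.shape)  # torch.Size([2, 3])
```

With real model output, you would pass `outputs.last_hidden_state` and `inputs["attention_mask"]` instead of the dummy tensors.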


# Limitations and bias
TBD

## BibTeX entry and citation info
TBD