---
license: mit
---
# Bernice
Bernice is a multilingual encoder pre-trained exclusively on Twitter data. The model was released with the EMNLP 2022 paper *Bernice: A Multilingual Pre-trained Encoder for Twitter* by Alexandra DeLucia, Shijie Wu, Aaron Mueller, Carlos Aguirre, Mark Dredze, and Philip Resnik.
This model card will contain more information soon. Please reach out to Alexandra DeLucia (aadelucia at jhu.edu) or open an issue if you have questions.
## Model description
TBD
## Training data
TBD
## Training procedure
TBD
## Evaluation results
TBD
## How to use
You can use this model to obtain tweet representations. To use it with the Hugging Face PyTorch interface:

```python
import re

import torch
from transformers import AutoTokenizer, AutoModel

# Load model and tokenizer
model = AutoModel.from_pretrained("bernice")
tokenizer = AutoTokenizer.from_pretrained("bernice", model_max_length=128)

# Data
raw_tweets = [
    "So, Nintendo and Illimination's upcoming animated #SuperMarioBrosMovie is reportedly titled 'The Super Mario Bros. Movie'. Alrighty. :)",
    "AMLO se vio muy indignado porque propusieron al presidente de Ucrania para el premio nobel de la paz. ¿Qué no hay otros que luchen por la paz? ¿Acaso se quería proponer él?"
]

# Pre-process tweets for the tokenizer: mask user handles and URLs
URL_RE = re.compile(r"https?:\/\/[\w\.\/\?\=\d&#%_:/-]+")
HANDLE_RE = re.compile(r"@\w+")
tweets = []
for t in raw_tweets:
    t = HANDLE_RE.sub("@USER", t)
    t = URL_RE.sub("HTTPURL", t)
    tweets.append(t)

# Tokenize, then run the encoder without tracking gradients
inputs = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Per-token representations; shape (batch, sequence length, hidden size)
embeddings = outputs.last_hidden_state
```
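The handle/URL masking step above can be factored into a small helper so the same normalization is applied consistently at training and inference time. A minimal sketch, using the same regular expressions as the example; the `normalize_tweet` name is ours, not part of the Bernice release:

```python
import re

# Same patterns as in the usage example above
URL_RE = re.compile(r"https?:\/\/[\w\.\/\?\=\d&#%_:/-]+")
HANDLE_RE = re.compile(r"@\w+")

def normalize_tweet(text: str) -> str:
    """Replace user handles with @USER and URLs with HTTPURL
    before tokenization (hypothetical helper, for illustration)."""
    text = HANDLE_RE.sub("@USER", text)
    text = URL_RE.sub("HTTPURL", text)
    return text

print(normalize_tweet("Thanks @jack, see https://t.co/abc123!"))
# Thanks @USER, see HTTPURL!
```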
## Limitations and bias
TBD
## BibTeX entry and citation info
TBD