---
license: mit
---
# Bernice
Bernice is a multilingual encoder pre-trained exclusively on Twitter data. The model was released with the EMNLP 2022 paper *Bernice: A Multilingual Pre-trained Encoder for Twitter* by Alexandra DeLucia, Shijie Wu, Aaron Mueller, Carlos Aguirre, Mark Dredze, and Philip Resnik.
This model card will contain more information soon. Please reach out to Alexandra DeLucia (aadelucia at jhu.edu) or open an issue if you have questions.
## Model description
TBD
## Training data
TBD
## Training procedure
TBD
## Evaluation results
TBD
## How to use
You can use this model to obtain tweet representations. To use it with the Hugging Face PyTorch interface:

```python
import re

import torch
from transformers import AutoTokenizer, AutoModel

# Load model and tokenizer
model = AutoModel.from_pretrained("bernice")
tokenizer = AutoTokenizer.from_pretrained("bernice", model_max_length=128)

# Data
raw_tweets = [
    "So, Nintendo and Illimination's upcoming animated #SuperMarioBrosMovie is reportedly titled 'The Super Mario Bros. Movie'. Alrighty. :)",
    "AMLO se vio muy indignado porque propusieron al presidente de Ucrania para el premio nobel de la paz. ¿Qué no hay otros que luchen por la paz? ¿Acaso se quería proponer él?"
]

# Pre-process tweets for the tokenizer: mask user handles and URLs
URL_RE = re.compile(r"https?:\/\/[\w\.\/\?\=\d&#%_:/-]+")
HANDLE_RE = re.compile(r"@\w+")
tweets = []
for t in raw_tweets:
    t = HANDLE_RE.sub("@USER", t)
    t = URL_RE.sub("HTTPURL", t)
    tweets.append(t)

# Tokenize, then run the encoder without tracking gradients
inputs = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Per-token representations; shape (batch, sequence length, hidden size)
embeddings = outputs.last_hidden_state
```
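The handle/URL masking step above can be factored into a small helper so the same normalization is applied consistently at training and inference time. A minimal sketch, using the same regular expressions as the example; the `normalize_tweet` name is ours, not part of the Bernice release:

```python
import re

# Same patterns as in the usage example above
URL_RE = re.compile(r"https?:\/\/[\w\.\/\?\=\d&#%_:/-]+")
HANDLE_RE = re.compile(r"@\w+")

def normalize_tweet(text: str) -> str:
    """Replace user handles with @USER and URLs with HTTPURL
    before tokenization (hypothetical helper, for illustration)."""
    text = HANDLE_RE.sub("@USER", text)
    text = URL_RE.sub("HTTPURL", text)
    return text

print(normalize_tweet("Thanks @jack, see https://t.co/abc123!"))
# Thanks @USER, see HTTPURL!
```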
## Limitations and bias
TBD
## BibTeX entry and citation info
TBD