
Twitter-scratch-roBERTa-base

This is a RoBERTa-base model trained from scratch on ~58M tweets, as described and evaluated in the TweetEval benchmark (Findings of EMNLP 2020). To evaluate this and other language models on Twitter-specific data, please refer to the official TweetEval repository.
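As a quick sanity check that the checkpoint loads, the model and tokenizer can be pulled directly from the Hub (a minimal sketch; only the model ID comes from this card, the rest is standard transformers usage):

from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL = "cardiffnlp/twitter-scratch-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL)      # RoBERTa byte-level BPE tokenizer
model = AutoModelForMaskedLM.from_pretrained(MODEL)   # masked-LM head matches the pretraining objective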

Preprocess Text

Replace usernames and links with the placeholders "@user" and "http":

def preprocess(text):
    # Normalize user mentions and URLs to the placeholders used during pretraining
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

Example Masked Language Model

from transformers import pipeline, AutoTokenizer
import numpy as np

MODEL = "cardiffnlp/twitter-scratch-roberta-base"
fill_mask = pipeline("fill-mask", model=MODEL, tokenizer=MODEL)
tokenizer = AutoTokenizer.from_pretrained(MODEL)

def print_candidates(candidates):
    # Show the top-5 predictions for the masked token, with rounded scores
    for i in range(5):
        token = tokenizer.decode(candidates[i]['token'])
        score = np.round(candidates[i]['score'], 4)
        print(f"{i+1}) {token} {score}")

texts = [
 "I am so <mask> 😊",
 "I am so <mask> 😒" 
]
for text in texts:
    t = preprocess(text)
    print(f"{'-'*30}\n{t}")
    candidates = fill_mask(t)
    print_candidates(candidates)

Output:

------------------------------
I am so <mask> 😊
1)  happy 0.530
2)  grateful 0.083
3)  excited 0.078
4)  thankful 0.053
5)  blessed 0.041
------------------------------
I am so <mask> 😒
1)  sad 0.439
2)  sorry 0.088
3)  tired 0.045
4)  hurt 0.026
5)  upset 0.026
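
Beyond masked-token prediction, the same checkpoint can be used as a feature extractor for tweets (a hedged sketch, not part of the original examples; mean-pooling over the last hidden state is one common choice, not the only one):

import torch
from transformers import AutoTokenizer, AutoModel

MODEL = "cardiffnlp/twitter-scratch-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

text = preprocess("I am so happy 😊")   # reuse the preprocessing above
encoded = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    output = model(**encoded)
# Mean-pool token embeddings into a single tweet vector
embedding = output.last_hidden_state.mean(dim=1)
print(embedding.shape)  # torch.Size([1, 768]) for a roberta-base encoder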

BibTeX entry and citation info

Please cite the reference paper if you use this model.

@inproceedings{barbieri-etal-2020-tweeteval,
    title = "{T}weet{E}val: Unified Benchmark and Comparative Evaluation for Tweet Classification",
    author = "Barbieri, Francesco  and
      Camacho-Collados, Jose  and
      Espinosa Anke, Luis  and
      Neves, Leonardo",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.findings-emnlp.148",
    doi = "10.18653/v1/2020.findings-emnlp.148",
    pages = "1644--1650"
}