Greek RoBERTa Uncased (v1)
Pretrained model on the Greek language using a masked language modeling (MLM) objective with Hugging Face's Transformers library. This model is case-insensitive and strips Greek diacritics (uncased, no accents).
Training data
This model was pretrained on almost 18M unique Greek tweets, collected between 2008 and 2021 from almost 450K distinct users.
Preprocessing
The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 50256. For the tokenizer, we split strings containing any numbers (ex. EU2019 ==> EU 2019). The tweet normalization logic is described in the example below.
import unicodedata
from transformers import pipeline

def normalize_tweet(tweet, do_lower=True, do_strip_accents=True, do_split_word_numbers=False, user_fill='', url_fill=''):
    # your tweet pre-processing logic goes here
    # example...
    # remove extra spaces, escape HTML, replace non-standard punctuation
    # replace any @user with blank
    # replace any link with blank
    # explode hashtags to strings (ex. #EU2019 ==> EU 2019)
    # remove all emojis
    # if do_split_word_numbers:
    #     split strings containing any numbers (ex. EU2019 ==> EU 2019)
    # standardize punctuation
    # remove unicode symbols
    if do_lower:
        tweet = tweet.lower()
    if do_strip_accents:
        tweet = strip_accents(tweet)
    return tweet.strip()

def strip_accents(s):
    # drop combining marks (accents/diacritics) after NFD normalization
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

nlp = pipeline('fill-mask', model='cvcio/roberta-el-uncased-twitter-v1')

print(
    nlp(
        normalize_tweet(
            '<mask>: Μεγάλη υποχώρηση του ιικού φορτίου σε Αττική και Θεσσαλονίκη'
        )
    )
)
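The number-splitting rule mentioned above (ex. EU2019 ==> EU 2019) can be approximated with a simple regular expression. The helper below is a hypothetical sketch of that rule, not the exact code used to build the training corpus.

import re

def split_word_numbers(text):
    # hypothetical approximation: insert a space at every boundary
    # between a digit run and non-digit characters, so 'EU2019' -> 'EU 2019'
    text = re.sub(r'(?<=\D)(?=\d)|(?<=\d)(?=\D)', ' ', text)
    # collapse any double spaces introduced by the substitution
    return re.sub(r'\s+', ' ', text).strip()

print(split_word_numbers('EU2019'))  # EU 2019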
Pretraining
The model was pretrained on a T4 GPU for 1.2M steps with a batch size of 96 and a sequence length of 96. The optimizer used is Adam with a learning rate of 1e-5, gradient accumulation over 8 steps, learning rate warmup for 50,000 steps, and linear decay of the learning rate afterwards.
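For reference, a comparable MLM pretraining run could be set up with the Transformers Trainer roughly as follows. This is a minimal sketch using the hyperparameters listed above; the tokenizer path and dataset are placeholders, not part of this repository, and the Trainer's default AdamW optimizer stands in for the Adam setup described above.

from transformers import (
    RobertaConfig, RobertaForMaskedLM, RobertaTokenizerFast,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

# placeholder paths/objects; the actual corpus and tokenizer are not distributed
tokenizer = RobertaTokenizerFast.from_pretrained('path/to/greek-twitter-bpe-tokenizer')
train_dataset = ...  # tokenized, normalized tweets truncated to a sequence length of 96

model = RobertaForMaskedLM(RobertaConfig(vocab_size=50256))

args = TrainingArguments(
    output_dir='roberta-el-uncased-twitter-v1',
    max_steps=1_200_000,               # 1.2M steps
    per_device_train_batch_size=96,    # batch size of 96
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    warmup_steps=50_000,
    lr_scheduler_type='linear',        # linear decay after warmup
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()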
Authors
Dimitris Papaevagelou - @andefined
About Us
Civic Information Office is a non-profit organization based in Athens, Greece, focused on creating technology and research products for the public interest.