Levanti Diacritizer

This model adds diacritics to raw text in Palestinian colloquial Arabic. It is trained on a diacritized subset of the Levanti dataset (to be released later). The model is fine-tuned from the TavBERT-ar character-level encoder LM, with a multi-label token classification head. TavBERT-ar is first pre-trained on the Tashkeela dataset of diacritized classical Arabic text (after removing final diacritics from the text) and then trained for an additional 8 epochs on the diacritized subset of the Levanti dataset. Each token (letter) of the input is classified into five positive categories: Shadda, Fatha, Kasra, Damma and Sukun. A multi-label model is used because a Shadda can accompany another diacritical mark on the same letter.
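
To make the multi-label idea concrete, here is a minimal sketch of how one letter's logits could be decoded. It assumes the label order used in the usage example on this card and a 0.5 threshold on independent sigmoid outputs; the logits are made-up numbers, not real model output, and `decode_labels` is a hypothetical helper name.

```python
import math

# Label order assumed to match the usage example on this card.
LABELS = ["SHADDA", "FATHA", "KASRA", "DAMMA", "SUKUN"]


def sigmoid(x):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_labels(logits, threshold=0.5):
    """Return every label whose independent probability exceeds the threshold."""
    return [name for name, z in zip(LABELS, logits) if sigmoid(z) > threshold]


# Hypothetical logits for a single letter (illustrative values only).
example_logits = [2.1, -3.0, 1.4, -2.5, -4.0]
print(decode_labels(example_logits))  # ['SHADDA', 'KASRA']
```

Because each label is thresholded on its own, a letter can receive a Shadda together with a vowel mark, which a single softmax over mutually exclusive classes could not express without adding combined classes.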

Transliterator

This model can be used in conjunction with Levanti Transliterator, which transliterates diacritized Palestinian Arabic text.

Example Usage

from transformers import RobertaForTokenClassification, AutoTokenizer
model = RobertaForTokenClassification.from_pretrained("guymorlan/levanti_arabic2diacritics")
tokenizer = AutoTokenizer.from_pretrained("guymorlan/levanti_arabic2diacritics")

label2diacritic = {0: 'ّ', # SHADDA (U+0651)
                   1: 'َ', # FATHA (U+064E)
                   2: 'ِ', # KASRA (U+0650)
                   3: 'ُ', # DAMMA (U+064F)
                   4: 'ْ'} # SUKUN (U+0652)


def arabic2diacritics(text, model, tokenizer):
    """Return `text` with the predicted diacritics inserted after each letter."""
    tokens = tokenizer(text, return_tensors="pt")
    # One row of boolean label flags per character; drop the BOS and EOS positions.
    preds = (model(**tokens).logits.sigmoid() > 0.5)[0][1:-1]
    new_text = []
    for p, c in zip(preds, text):
        new_text.append(c)
        # Vowel marks and sukun (labels 1-4) are written first.
        for i in range(1, 5):
            if p[i]:
                new_text.append(label2diacritic[i])
        # Shadda (label 0) is appended last.
        if p[0]:
            new_text.append(label2diacritic[0])
    return "".join(new_text)

text = "بديش اروح عالمدرسة بكرا"
arabic2diacritics(text, model, tokenizer)
Out[1]: 'بِدِّيْش اْرُوْح عَالْمَدْرَسِة بُكْرَا'
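
As a quick sanity check on output like the above, one can verify that the model only inserted marks and left the base letters untouched. The sketch below is not part of the original card: `strip_diacritics` and `is_consistent` are hypothetical helper names, and the check assumes the raw input contained no diacritics to begin with.

```python
# The five marks the model can emit: shadda, fatha, kasra, damma, sukun.
DIACRITICS = {"\u0651", "\u064E", "\u0650", "\u064F", "\u0652"}


def strip_diacritics(text):
    """Remove shadda, fatha, kasra, damma and sukun from `text`."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def is_consistent(raw, diacritized):
    """True if `diacritized` equals `raw` once the five marks are removed."""
    return strip_diacritics(diacritized) == raw


# The last word of the example, written with escapes so each mark is visible:
# ba, kaf, ra, alef, then the same letters with damma, sukun and fatha added.
raw = "\u0628\u0643\u0631\u0627"
marked = "\u0628\u064F\u0643\u0652\u0631\u064E\u0627"
print(is_consistent(raw, marked))  # True
```

The same `strip_diacritics` helper can also be applied to partially diacritized text before passing it to the model, since the model is described as operating on raw text.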

Attribution

Created by Guy Mor-Lan.
Contact: guy.mor AT mail.huji.ac.il

Model size: 87.5M params (Safetensors, tensor type F32)