Model Card for tibetan-phonetic-transliteration

This model is a text2text generation model for phonetic transliteration of Tibetan script.

Model Details

Model Description

Developed by: billingsmoore
Model type: text2text generation
Language(s) (NLP): Tibetan
License: Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
Finetuned from model: 'google-t5/t5-small'

Model Sources

Repository: https://github.com/billingsmoore/MLotsawa

Uses

The intended use of this model is to provide phonetic transliteration of Tibetan script, typically as part of a larger Tibetan translation ecosystem.

Direct Use

To use the model for transliteration in a Python script, use the transformers library as follows:

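from transformers import pipeline

transliterator = pipeline('translation', model='billingsmoore/tibetan-phonetic-transliteration')

# Example input (a common greeting); replace with your own Unicode Tibetan string.
tibetan_text = "བཀྲ་ཤིས་བདེ་ལེགས།"

# The pipeline returns a list of dicts; the output text is under 'translation_text'.
transliterated_text = transliterator(tibetan_text)[0]['translation_text']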
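When several lines need to be transliterated at once, the pipeline can be called with a list of strings, and it returns one dictionary per input. The sketch below shows one way to unpack that into plain strings. The helper name `transliterate_lines` is hypothetical, and it accepts any callable with the pipeline's output shape, so it can be tried with a stub before downloading the model.

```python
from typing import Callable, Iterable, List


def transliterate_lines(transliterator: Callable, lines: Iterable[str]) -> List[str]:
    """Run a translation-style pipeline over several lines and return plain strings.

    `transliterator` is assumed to behave like a transformers translation
    pipeline: called with a list of strings, it returns a list of dicts
    that each carry the output under 'translation_text'.
    """
    # Skip blank lines so the model is never called on empty input.
    cleaned = [line.strip() for line in lines if line.strip()]
    if not cleaned:
        return []
    results = transliterator(cleaned)
    return [result["translation_text"] for result in results]
```

With the real model, the first argument would be the object returned by `pipeline('translation', model='billingsmoore/tibetan-phonetic-transliteration')`.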

Downstream Use

The model can be finetuned for a specific use case using the following code.

from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForSeq2Seq, AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer, Adafactor
from accelerate import Accelerator

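# Replace the placeholder with your dataset; it needs 'bo' (Tibetan) and 'phon' (phonetic) columns.
dataset = load_dataset("<your dataset>")
# Hold out 10% of the training split for evaluation.
dataset = dataset['train'].train_test_split(test_size=0.1)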

checkpoint = "billingsmoore/tibetan-phonetic-transliteration"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, device_map="auto")
# Pass the model object (not the checkpoint name) so the collator can prepare decoder inputs.
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

source_lang = 'bo'
target_lang = 'phon'

def preprocess_function(examples):
    # In batched mode, each column arrives as a list of strings.
    inputs = examples[source_lang]
    targets = examples[target_lang]

    # Tokenize Tibetan inputs and phonetic targets together; targets become the labels.
    model_inputs = tokenizer(inputs, text_target=targets, max_length=256, truncation=True, padding="max_length")

    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True)

optimizer = Adafactor(
    model.parameters(), 
    scale_parameter=True, 
    relative_step=False, 
    warmup_init=False, 
    lr=3e-4
)

accelerator = Accelerator()
model, optimizer = accelerator.prepare(model, optimizer)

training_args = Seq2SeqTrainingArguments(
    output_dir=".",
    auto_find_batch_size=True,
    predict_with_generate=True,
    fp16=False,
    push_to_hub=False,
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
    num_train_epochs=5
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test'],
    tokenizer=tokenizer,
    optimizers=(optimizer, None),
    data_collator=data_collator
)

trainer.train()
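The fine-tuning code above reads its source text from a 'bo' column and its target text from a 'phon' column. If your own pairs are held in memory or in another format, one option is to write them out as JSON Lines with those two keys. The following is a minimal sketch using only the standard library; the function name `write_pairs_jsonl` and the file name are hypothetical.

```python
import json
from pathlib import Path
from typing import Iterable, Tuple


def write_pairs_jsonl(pairs: Iterable[Tuple[str, str]], path: str) -> int:
    """Write (Tibetan, phonetic) pairs as JSON Lines with 'bo' and 'phon' keys.

    Returns the number of records written.
    """
    count = 0
    with Path(path).open("w", encoding="utf-8") as handle:
        for tibetan, phonetic in pairs:
            record = {"bo": tibetan, "phon": phonetic}
            # ensure_ascii=False keeps the Tibetan script readable in the file.
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

A file written this way can be loaded with `load_dataset("json", data_files="pairs.jsonl")`, which yields a 'train' split that the code above then divides into training and evaluation sets.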

Bias, Risks, and Limitations

This model was trained exclusively on material from the Tibetan Buddhist canon and thus on Literary Tibetan. It may not perform satisfactorily on texts from other corpora or on other dialects of Tibetan.

Recommendations

For users who wish to use the model for other texts, I recommend further finetuning on your own dataset using the instructions above.
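One way to judge whether the model is adequate on a new corpus before committing to fine-tuning is to compare its output against a handful of reference transliterations. The sketch below computes a character error rate from Levenshtein distance in plain Python. The choice of metric and the function names are assumptions made for illustration; they are not part of the original training or evaluation setup.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Total edit distance divided by total reference length."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    total_chars = sum(len(ref) for ref in references)
    if total_chars == 0:
        raise ValueError("references must contain at least one character")
    total_edits = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    return total_edits / total_chars
```

A rate near zero on your sample suggests the model may be usable as is; a noticeably higher rate is a signal that fine-tuning on your own data is worth trying.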

Training Details

This model was trained on 98,597 pairs of text. The first member of each pair is a line of Unicode Tibetan text, and the second (the target) is the phonetic transliteration of the first. This dataset was scraped from Lotsawa House and is released on Kaggle under the same license as the texts from which it is sourced. The dataset and further information are available on both Kaggle and Hugging Face.

This model was trained for five epochs. Further information regarding training can be found in the documentation of the MLotsawa repository.

Model Card Contact

billingsmoore [at] gmail [dot] com
