|
--- |
|
library_name: transformers |
|
tags: [] |
|
--- |
|
|
|
# Model Card for MLotsawa |
|
|
|
This is a neural machine translation model for translating Literary Tibetan to English.
|
|
|
The model's name is taken from 'machine learning' (ML) and 'Lotsawa' (the Tibetan term for a translator).
|
|
|
The model expects transliterated (Wylie) Tibetan as input and outputs an English translation.
|
|
|
The model was evaluated using the BLEU metric, with a final score of 83.4374 on evaluation data. |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
This model is a finetuned T5 model with 770 million parameters. |
|
|
|
- **Developed by:** billingsmoore |
|
- **Model type:** Sequence-to-sequence translation model (finetuned T5)
|
- **Language(s) (NLP):** Tibetan, English |
|
- **License:** [Attribution-NonCommercial 4.0 International](https://creativecommons.org/licenses/by-nc/4.0/) |
|
- **Finetuned from model:** [google-t5/t5-large](https://huggingface.co/google-t5/t5-large)
|
|
|
### Model Sources
|
|
|
- **Repository:** [MLotsawa on Github](https://github.com/billingsmoore/MLotsawa) |
|
|
|
## Uses |
|
|
|
This model is intended to be used as the translation model in the larger MLotsawa software, but can also be used in a Jupyter notebook or Python script. |
|
|
|
### Direct Use |
|
|
|
To use this model for translation, you can run the following code:
|
|
|
```python |
|
from transformers import pipeline |
|
|
|
translator = pipeline('translation', 'billingsmoore/mlotsawa') |
|
|
|
input_text = "<your transliterated Tibetan text>"
|
|
|
translation = translator(input_text) |
|
|
|
print(translation) |
|
``` |
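
The pipeline returns a list with one dictionary per input; the English text is stored under the `translation_text` key (e.g. `translation[0]['translation_text']`).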
|
|
|
Note that if your input text is not already transliterated, you can convert Unicode Tibetan into Wylie transliteration using the pyewts library like so:
|
|
|
```python |
|
import pyewts |
|
|
|
converter = pyewts.pyewts() |
|
|
|
input_text = "<your unicode Tibetan text>"
|
|
|
transliterated = converter.toWylie(input_text) |
|
``` |
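
Putting the two steps together, a minimal end-to-end sketch (using the same placeholder input as above) looks like this:

```python
import pyewts
from transformers import pipeline

# Transliterate Unicode Tibetan to Wylie, then translate to English.
converter = pyewts.pyewts()
translator = pipeline('translation', 'billingsmoore/mlotsawa')

input_text = "<your unicode Tibetan text>"
transliterated = converter.toWylie(input_text)

translation = translator(transliterated)
print(translation[0]['translation_text'])
```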
|
|
|
### Downstream Use |
|
|
|
The model can be further finetuned using the following code: |
|
|
|
```python |
|
from datasets import load_dataset |
|
from transformers import ( |
|
AutoTokenizer, DataCollatorForSeq2Seq, |
|
AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, |
|
Seq2SeqTrainer, EarlyStoppingCallback, Adafactor |
|
) |
|
import evaluate |
|
import numpy as np |
|
|
|
|
dataset = load_dataset("<path_to_your_dataset>")  # parallel Tibetan-English data
|
|
|
checkpoint = "billingsmoore/mlotsawa" |
|
tokenizer = AutoTokenizer.from_pretrained(checkpoint) |
|
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint) |
|
|
|
source_lang = 'bo' |
|
target_lang = 'en' |
|
prefix = "translate Tibetan to English: " |
|
|
|
def preprocess_function(examples): |
|
|
|
inputs = [prefix + example[source_lang] for example in examples['translation']] |
|
targets = [example[target_lang] for example in examples['translation']] |
|
|
|
model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True) |
|
|
|
return model_inputs |
|
|
|
tokenized_dataset = dataset.map(preprocess_function, batched=True) |
|
|
|
metric = evaluate.load("sacrebleu") |
|
|
|
def postprocess_text(preds, labels): |
|
preds = [pred.strip() for pred in preds] |
|
labels = [[label.strip()] for label in labels] |
|
|
|
return preds, labels |
|
|
|
|
|
def compute_metrics(eval_preds): |
|
preds, labels = eval_preds |
|
if isinstance(preds, tuple): |
|
preds = preds[0] |
|
decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True) |
|
|
|
labels = np.where(labels != -100, labels, tokenizer.pad_token_id) |
|
decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True) |
|
|
|
decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels) |
|
|
|
result = metric.compute(predictions=decoded_preds, references=decoded_labels) |
|
result = {"bleu": result["score"]} |
|
|
|
prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds] |
|
result["gen_len"] = np.mean(prediction_lens) |
|
result = {k: round(v, 4) for k, v in result.items()} |
|
return result |
|
|
|
early_stop = EarlyStoppingCallback()  # halt training when the eval metric stops improving
|
|
|
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, device_map="auto") |
|
|
|
optimizer = Adafactor( |
|
model.parameters(), |
|
scale_parameter=True, |
|
relative_step=False, |
|
warmup_init=False, |
|
lr=3e-4 |
|
) |
|
|
|
training_args = Seq2SeqTrainingArguments( |
|
output_dir=".", |
|
auto_find_batch_size=True, |
|
predict_with_generate=True, |
|
    fp16=False,  # set to True to enable mixed-precision training on supported GPUs
|
push_to_hub=False, |
|
eval_strategy='epoch', |
|
save_strategy='epoch', |
|
load_best_model_at_end=True |
|
) |
|
|
|
trainer = Seq2SeqTrainer( |
|
model=model, |
|
args=training_args, |
|
train_dataset=tokenized_dataset['train'], |
|
eval_dataset=tokenized_dataset['test'], |
|
tokenizer=tokenizer, |
|
optimizers=(optimizer, None), |
|
data_collator=data_collator, |
|
compute_metrics=compute_metrics, |
|
callbacks=[early_stop] |
|
) |
|
|
|
trainer.train() |
|
``` |
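
Once training completes, the finetuned model can be saved locally (the output path below is a placeholder):

```python
trainer.save_model("<path/to/save/model>")  # writes the finetuned model to the given directory
```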
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
[Training Data for this project is available here.](https://www.kaggle.com/datasets/billingsmoore/classical-tibetan-to-english-translation-dataset) |
|
|
|
This dataset consists of 100,000 pairs of sentences or phrases. The first member of each pair is a sentence or phrase in Classical Tibetan. The second member is the English translation of the first. |
|
|
|
The pairs are drawn from texts sourced from Lotsawa House (lotsawahouse.org) and are offered under the same license as the original texts.
|
|
|
This data was scraped, cleaned, and formatted programmatically. |
|
|
|
### Training Procedure |
|
|
|
This model was trained for 6 epochs on the dataset described above. |
|
|
|
#### Training Hyperparameters |
|
|
|
- This model was trained using the Adafactor optimizer with a learning rate of 2e-5. |
|
|
|
## Evaluation |
|
|
|
The evaluation metric for this model was the BLEU score. BLEU (Bilingual Evaluation Understudy) scores measure the quality of machine-generated translations by comparing them to human-provided reference translations. The score ranges from 0 to 100, where 100 represents a perfect match with the reference translations. It evaluates the precision of n-grams (word sequences) in the generated text, with higher scores indicating closer alignment to the reference translations. A brevity penalty is applied to discourage translations that are too short.
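
For illustration (the sentences below are invented, not taken from the evaluation data), a BLEU score can be computed with the same sacrebleu metric used in the finetuning code above:

```python
import evaluate

metric = evaluate.load("sacrebleu")

# Hypothetical prediction and reference, for illustration only.
predictions = ["may all beings be endowed with happiness"]
references = [["may all beings be endowed with happiness and the causes of happiness"]]

result = metric.compute(predictions=predictions, references=references)
print(result["score"])  # a value between 0 and 100
```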
|
|
|
The final BLEU score was 83.4374. |
|
|