---
language: en
tags:
- roberta
license: mit
---

# RoBERTa base model fine-tuned for pronoun mask filling

This is RoBERTa base fine-tuned for mask filling restricted to pronouns.
Its purpose is to post-process machine-translated text, where sentence-level
translation may not have enough context to choose the correct pronoun.

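As a rough illustration of the fill-mask setup described above, the sketch below replaces each pronoun in a sentence with RoBERTa's `<mask>` token, one occurrence at a time, so that a masked language model could re-predict it from context. The helper `mask_each_pronoun` and the `PRONOUNS` list are hypothetical and chosen only for this example; they are not taken from `pronoun_fixer.py`, and the pronoun set the model was actually trained on is not stated here.

```python
import re
from typing import List, Tuple

# Assumed, illustrative pronoun list (not the model's documented vocabulary).
PRONOUNS = ("he", "she", "him", "her", "his", "hers", "they", "them", "their")

# RoBERTa tokenizers use "<mask>" as the mask token.
MASK_TOKEN = "<mask>"

_PRONOUN_RE = re.compile(r"\b(" + "|".join(PRONOUNS) + r")\b", re.IGNORECASE)


def mask_each_pronoun(text: str) -> List[Tuple[str, str]]:
    """Return one (masked_text, original_pronoun) pair per pronoun occurrence.

    Only a single pronoun is masked in each variant, so the rest of the
    sentence stays available as context for the prediction.
    """
    variants = []
    for match in _PRONOUN_RE.finditer(text):
        masked = text[:match.start()] + MASK_TOKEN + text[match.end():]
        variants.append((masked, match.group(0)))
    return variants


if __name__ == "__main__":
    for masked, original in mask_each_pronoun("Cadence Lee thought he was a normal girl."):
        print(original, "->", masked)
```

Each masked variant could then be passed to a fill-mask pipeline, and the top-ranked pronoun compared with the original one.
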
This model was trained on 10B tokens of literature (a private light novel and book dataset, plus books1 and 20% of books3 from The Pile).

This model achieves 88% top-1 accuracy when evaluated with a sliding window of 512 tokens (84% without a sliding window).

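The exact windowing scheme behind these numbers is not spelled out here. One plausible reading, sketched below as an assumption, is that each masked pronoun is scored with up to 512 tokens of surrounding context, centered on the pronoun where possible. The helpers `window_around` and `top1_accuracy` are hypothetical names used only for this illustration. With a real RoBERTa tokenizer, the special tokens also count toward the 512-token limit.

```python
from typing import Sequence, Tuple


def window_around(num_tokens: int, index: int, size: int = 512) -> Tuple[int, int]:
    """Return [start, end) bounds of a window of at most `size` tokens containing `index`.

    The window is centered on `index` where possible and shifted at the
    edges of the sequence so that it stays as full as it can.
    """
    if not 0 <= index < num_tokens:
        raise IndexError("index out of range")
    start = max(0, index - size // 2)
    end = min(num_tokens, start + size)
    # If the right edge was hit, shift the window left to keep it full.
    start = max(0, end - size)
    return start, end


def top1_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of positions where the top-ranked prediction equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


if __name__ == "__main__":
    print(window_around(2000, 1000))
    print(top1_accuracy(["she", "he", "they"], ["she", "she", "they"]))
```

Compared with cutting the text into fixed chunks, a centered window gives pronouns near a chunk boundary context on both sides, which is one possible explanation for the gap between the two reported figures.
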
### How to use

Use `fix_pronouns_in_text` from [pronoun_fixer.py](https://huggingface.co/thefrigidliquidation/roberta-base-pronouns/blob/main/pronoun_fixer.py). The file must be importable from your script, for example by placing it in the same directory.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

import pronoun_fixer  # pronoun_fixer.py from the model repository


# Text produced by sentence-level machine translation, where the pronoun was
# ambiguous in the source language and is wrong in the target language.
MTL_TEXT = """
Cadence Lee thought he was a normal girl, perhaps a little well to do, but not exceptionally so.
"""

# Use the GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
pronoun_checkpoint = "thefrigidliquidation/roberta-base-pronouns"
pronoun_model = AutoModelForMaskedLM.from_pretrained(pronoun_checkpoint).to(device)
pronoun_tokenizer = AutoTokenizer.from_pretrained(pronoun_checkpoint)
unmasker = FillMaskPipeline(model=pronoun_model, tokenizer=pronoun_tokenizer, device=device, top_k=10)

fixed_text = pronoun_fixer.fix_pronouns_in_text(unmasker, pronoun_tokenizer, MTL_TEXT)

print(fixed_text)
# Cadence Lee thought she was a normal girl, perhaps a little well to do, but not exceptionally so.
# now the pronoun is fixed
```