---
language: en
tags:
- roberta
license: mit
---

# RoBERTa base model fine-tuned on pronoun fill masking

This is RoBERTa base fine-tuned to fill masked pronouns.
The model is intended to post-process machine-translated text, where
sentence-level translation may lack the context needed to choose the
correct pronoun.
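
The post-processing step boils down to masking each pronoun in the translated text and letting the model re-predict it with full context. Below is a minimal sketch of just the masking step; the real logic lives in `pronoun_fixer.py`, and the `mask_pronouns` helper and pronoun list here are illustrative, not the actual implementation:

```python
import re

# Illustrative pronoun list; the set used by pronoun_fixer.py may differ.
PRONOUNS = ["he", "she", "him", "her", "his", "hers", "himself", "herself"]
_PATTERN = re.compile(r"\b(" + "|".join(PRONOUNS) + r")\b", re.IGNORECASE)


def mask_pronouns(text: str, mask_token: str = "<mask>"):
    """Yield (masked_text, original_pronoun) pairs, one per pronoun occurrence."""
    for m in _PATTERN.finditer(text):
        yield text[: m.start()] + mask_token + text[m.end():], m.group(0)


for masked, original in mask_pronouns("He said she was late."):
    print(original, "->", masked)
# He -> <mask> said she was late.
# she -> He said <mask> was late.
```

Each masked variant can then be passed to the fill-mask pipeline, and the highest-scoring pronoun candidate substituted back in.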

This model was trained on 10B tokens of literature (a private light novel and book dataset, plus books1 and 20% of books3 from The Pile).

This model achieves 88% top-1 accuracy when evaluated with a sliding window of 512 tokens (84% without a sliding window).
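
A sliding window here means running the model over overlapping 512-token spans so each pronoun is scored with as much surrounding context as possible, rather than being truncated at a hard 512-token boundary. A hypothetical sketch of how such spans could be generated (the stride value below is an assumption, not the evaluation's actual setting):

```python
def sliding_windows(n_tokens: int, window: int = 512, stride: int = 256):
    """Yield (start, end) token spans covering n_tokens with overlap.

    Illustrative only; the evaluation's exact windowing is not shown here.
    """
    start = 0
    while True:
        end = min(start + window, n_tokens)
        yield start, end
        if end == n_tokens:
            break
        start += stride
```

For example, a 1,000-token text would be split into spans (0, 512), (256, 768), and (512, 1000), so a pronoun near a span boundary still gets context from a neighboring span.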

### How to use

Download [pronoun_fixer.py](https://huggingface.co/thefrigidliquidation/roberta-base-pronouns/blob/main/pronoun_fixer.py) from the model repo and use its `fix_pronouns_in_text` helper.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline
import pronoun_fixer


# text produced by sentence level machine translation where the pronoun was ambiguous in the source language
# and is wrong in the target language
MTL_TEXT = """
Cadence Lee thought he was a normal girl, perhaps a little well to do, but not exceptionally so.
"""

device = 'cuda'
pronoun_checkpoint = "thefrigidliquidation/roberta-base-pronouns"
pronoun_model = AutoModelForMaskedLM.from_pretrained(pronoun_checkpoint).to(device)
pronoun_tokenizer = AutoTokenizer.from_pretrained(pronoun_checkpoint)
unmasker = FillMaskPipeline(model=pronoun_model, tokenizer=pronoun_tokenizer, device=device, top_k=10)

fixed_text = pronoun_fixer.fix_pronouns_in_text(unmasker, pronoun_tokenizer, MTL_TEXT)

print(fixed_text)
# Cadence Lee thought she was a normal girl, perhaps a little well to do, but not exceptionally so.
# now the pronoun is fixed
```