BlackKakapo committed on
Commit
95147f7
1 Parent(s): 971d965

Update README.md

  1. README.md +75 -1
README.md CHANGED
@@ -1,3 +1,77 @@
  ---
- license: apache-2.0
+ annotations_creators: []
+ language:
+ - ro
+ language_creators:
+ - machine-generated
+ license:
+ - apache-2.0
+ multilinguality:
+ - monolingual
+ pretty_name: BlackKakapo/t5-base-paraphrase-ro
+ size_categories:
+ - 10K<n<100K
+ source_datasets:
+ - original
+ tags: []
+ task_categories:
+ - text2text-generation
+ task_ids: []
  ---
+ # Romanian paraphrase
+
+ ![v1.0](https://img.shields.io/badge/V.1-03.08.2022-brightgreen)
+
+ A t5-base model fine-tuned for paraphrasing Romanian text. Since there was no Romanian dataset for paraphrasing, I created my own [dataset](https://huggingface.co/datasets/BlackKakapo/paraphrase-ro-v1), which contains ~60k examples.
+
+ ### How to use
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+
+ tokenizer = AutoTokenizer.from_pretrained("BlackKakapo/t5-base-paraphrase-ro")
+ model = AutoModelForSeq2SeqLM.from_pretrained("BlackKakapo/t5-base-paraphrase-ro")
+ ```
+
+ ### Or
+
+ ```python
+ from transformers import T5ForConditionalGeneration, T5TokenizerFast
+
+ model = T5ForConditionalGeneration.from_pretrained("BlackKakapo/t5-base-paraphrase-ro")
+ tokenizer = T5TokenizerFast.from_pretrained("BlackKakapo/t5-base-paraphrase-ro")
+ ```
+
+ ### Generate
+
+ ```python
+ import torch
+
+ # Run on GPU when available; the model and its inputs must share a device.
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ model = model.to(device)
+
+ text = "Am impresia că fac multe greșeli."
+
+ # Tokenize a single sentence; no padding is needed for one input.
+ encoding = tokenizer(text, return_tensors="pt")
+ input_ids = encoding["input_ids"].to(device)
+ attention_mask = encoding["attention_mask"].to(device)
+
+ # Sample five candidates with top-k / top-p sampling.
+ outputs = model.generate(
+     input_ids=input_ids,
+     attention_mask=attention_mask,
+     do_sample=True,
+     max_length=256,
+     top_k=10,
+     top_p=0.9,
+     early_stopping=False,
+     num_return_sequences=5,
+ )
+
+ final_outputs = []
+ for output in outputs:
+     text_para = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
+
+     # Keep the first candidate that differs from the input, then stop.
+     if text_para.lower() != text.lower() and text_para not in final_outputs:
+         final_outputs.append(text_para)
+         break
+
+ print(final_outputs)
+ ```
+ ### Output
+
+ ```out
+ ['Cred că fac multe greșeli.']
+ ```
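
The filtering step in the Generate snippet can also be written as a small standalone helper if you want to keep every distinct candidate instead of only the first. The following is a minimal sketch and not part of this commit: `filter_paraphrases` is a hypothetical name, and the last candidate string is invented for illustration, not model output.

```python
from typing import Iterable, List


def filter_paraphrases(source: str, candidates: Iterable[str]) -> List[str]:
    """Keep candidates that differ from the source and from each other.

    Comparison is case-insensitive and ignores surrounding whitespace.
    (Hypothetical helper, not part of the model repository.)
    """
    seen = {source.strip().lower()}
    kept: List[str] = []
    for candidate in candidates:
        key = candidate.strip().lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(candidate.strip())
    return kept


candidates = [
    "Am impresia că fac multe greșeli.",  # identical to the input, dropped
    "Cred că fac multe greșeli.",
    "cred că fac multe greșeli.",         # case-only duplicate, dropped
    "Mi se pare că greșesc des.",         # invented example candidate
]
print(filter_paraphrases("Am impresia că fac multe greșeli.", candidates))
```

With the decoded strings from `model.generate` passed in as `candidates`, this would return all distinct paraphrases rather than stopping at the first one.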