---
language: ro
license: apache-2.0
---

# MT5x-base-romanian

This is a pretrained [mt5x](https://github.com/google-research/multilingual-t5) base model (390M parameters).

Training was performed with the span corruption task on a clean 80GB Romanian text corpus for 4M total steps, using these [scripts](https://github.com/dumitrescustefan/t5x_models) and starting from the public 1M-step mt5x-base checkpoint. The model was trained with an encoder sequence length of 512 and a decoder sequence length of 256, and it uses the same mt5x vocabulary as the 1M multilingual checkpoint.
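
For intuition, this is roughly what a span-corruption training pair looks like: random spans in the input are replaced by sentinel tokens, and the target reconstructs the masked spans. A small sketch of the T5-style sentinel format (the sentence is made up; the actual pretraining pipeline is in the t5x scripts linked above):

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('dumitrescustefan/mt5x-base-romanian')

# "Acesta este un test simplu." with two spans masked out;
# the target ends with a final sentinel, following the T5 convention
corrupted_input = "Acesta <extra_id_0> un test <extra_id_1>."
target = "<extra_id_0> este <extra_id_1> simplu <extra_id_2>"

# Sentinel tokens are part of the mt5x vocabulary, so each maps to a single id
print(tokenizer(corrupted_input).input_ids)
print(tokenizer(target).input_ids)
```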

#### **IMPORTANT** This model was pretrained on the span corruption LM task, meaning it is **not usable** for any downstream task **without finetuning** first!
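
As a starting point, here is a minimal finetuning sketch. The task prefix, example pair, and learning rate are made-up placeholders, not part of this model card; a real setup needs a dataset, batching, and an evaluation loop:

```python
import torch
from transformers import MT5ForConditionalGeneration, T5Tokenizer

# MT5ForConditionalGeneration adds the LM head needed for seq2seq finetuning
model = MT5ForConditionalGeneration.from_pretrained('dumitrescustefan/mt5x-base-romanian')
tokenizer = T5Tokenizer.from_pretrained('dumitrescustefan/mt5x-base-romanian')

# Hypothetical example pair; replace with your own task data
inputs = tokenizer("rezuma: Acesta este un text foarte lung.", return_tensors="pt")
labels = tokenizer(text_target="Un text lung.", return_tensors="pt").input_ids

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

outputs = model(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, labels=labels)
outputs.loss.backward()  # labels are shifted internally to compute cross-entropy
optimizer.step()
```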

### How to load an mt5x model

```python
from transformers import MT5Model, T5Tokenizer

# Bare encoder-decoder (no LM head) plus the matching SentencePiece tokenizer
model = MT5Model.from_pretrained('dumitrescustefan/mt5x-base-romanian')
tokenizer = T5Tokenizer.from_pretrained('dumitrescustefan/mt5x-base-romanian')

input_text = "Acesta este un test."
target_text = "Acesta este"
inputs = tokenizer(input_text, return_tensors="pt")
labels = tokenizer(text_target=target_text, return_tensors="pt")

# Feed the target ids as decoder inputs to get decoder hidden states
outputs = model(input_ids=inputs["input_ids"], decoder_input_ids=labels["input_ids"])
hidden_states = outputs.last_hidden_state
print(hidden_states.shape)  # [1, 4, 768]: batch, decoder tokens, hidden size
```

Remember to always sanitize your text! Replace the cedilla letters ``ş`` and ``ţ`` with their comma-below counterparts:
```python
text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
```
because the model was **not** trained on the cedilla forms ``ş`` and ``ţ``. If you skip this step, performance will drop due to ``<UNK>`` tokens and an increased number of tokens per word.
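
A quick way to check the effect on your own text (illustrative; the word chosen here is arbitrary):

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('dumitrescustefan/mt5x-base-romanian')

# Cedilla form vs. comma-below form of the same word
print(tokenizer.tokenize("înţelegere"))  # cedilla ţ: split into more (or unknown) pieces
print(tokenizer.tokenize("înțelegere"))  # comma-below ț: tokenized normally
```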

### Acknowledgements

We'd like to thank [TPU Research Cloud](https://sites.research.google/trc/about/) for providing the TPUv4 cores we used to train these models!

### Authors

Yours truly,

_[Stefan Dumitrescu](https://github.com/dumitrescustefan), [Mihai Ilie](https://github.com/iliemihai) and [Per Egil Kummervold](https://huggingface.co/north)_