---
language: ro
license: apache-2.0
---

# MT5x-base-romanian

This is a pretrained [mt5x](https://github.com/google-research/multilingual-t5) base model (390M parameters).

Training was performed with the span corruption task on a clean 80GB Romanian text corpus for 4M total steps with these [scripts](https://github.com/dumitrescustefan/t5x_models), starting from the public 1M-step mt5x-base checkpoint. The model was trained with an encoder sequence length of 512 and a decoder sequence length of 256; it uses the same mt5x vocabulary as the 1M-step multilingual checkpoint.
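
For intuition, span corruption masks contiguous spans of the input with sentinel tokens and asks the decoder to reconstruct them. A minimal pure-Python sketch of the input/target format (the `span_corrupt` helper and the fixed spans are illustrative only; real pretraining samples spans randomly and operates on sentencepiece tokens):

```python
# Minimal illustration of the T5-style span corruption objective.
# Sentinel names follow the T5 convention (<extra_id_N>); the spans
# here are fixed for illustration, not sampled as in real pretraining.
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel; return (input, target)."""
    corrupted, target = [], []
    prev = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        corrupted += tokens[prev:start] + [sentinel]  # keep text, mask the span
        target += [sentinel] + tokens[start:end]      # target reconstructs the span
        prev = end
    corrupted += tokens[prev:]
    target += [f"<extra_id_{len(spans)}>"]            # final sentinel ends the target
    return corrupted, target

tokens = "Acesta este un test simplu de corupere".split()
inp, tgt = span_corrupt(tokens, [(1, 2), (4, 6)])
print(" ".join(inp))  # Acesta <extra_id_0> un test <extra_id_1> corupere
print(" ".join(tgt))  # <extra_id_0> este <extra_id_1> simplu de <extra_id_2>
```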

#### **IMPORTANT** This model was pretrained on the span corruption task only, so it is **not usable** for any downstream task **without finetuning** first!
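
For finetuning, load the checkpoint into a sequence-to-sequence head such as `MT5ForConditionalGeneration`. A minimal single-training-step sketch (to stay runnable without downloading weights, it uses a tiny randomly-initialized `MT5Config` and random token ids; for real finetuning swap in `from_pretrained('dumitrescustefan/mt5x-base-romanian')` and properly tokenized data):

```python
import torch
from transformers import MT5Config, MT5ForConditionalGeneration

# Tiny randomly-initialized model so this sketch runs without downloading
# weights. For actual finetuning, use instead:
# model = MT5ForConditionalGeneration.from_pretrained('dumitrescustefan/mt5x-base-romanian')
config = MT5Config(vocab_size=1000, d_model=64, d_kv=32, d_ff=128,
                   num_layers=2, num_heads=2)
model = MT5ForConditionalGeneration(config)

input_ids = torch.randint(0, 1000, (1, 16))  # stand-in for tokenized source text
labels = torch.randint(0, 1000, (1, 8))      # stand-in for tokenized targets

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss = model(input_ids=input_ids, labels=labels).loss  # cross-entropy over targets
loss.backward()
optimizer.step()
print(float(loss))
```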

### How to load an mt5x model

```python
from transformers import MT5Model, T5Tokenizer

model = MT5Model.from_pretrained('dumitrescustefan/mt5x-base-romanian')
tokenizer = T5Tokenizer.from_pretrained('dumitrescustefan/mt5x-base-romanian')
input_text = "Acesta este un test."
target_text = "Acesta este"
inputs = tokenizer(input_text, return_tensors="pt")
labels = tokenizer(text_target=target_text, return_tensors="pt")

outputs = model(input_ids=inputs["input_ids"], decoder_input_ids=labels["input_ids"])
hidden_states = outputs.last_hidden_state
print(hidden_states.shape)  # this will print [1, 4, 768]
```

Remember to always sanitize your text! Replace the cedilla letters ``ş`` and ``ţ`` with their comma-below counterparts:
```python
text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
```
because the model was **not** trained on cedilla ``ş`` and ``ţ``. If you skip this step, performance will degrade due to ``<UNK>``s and an increased number of tokens per word.
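
The chained replacements can also be wrapped in a small helper (a hypothetical `sanitize` function using `str.translate`, equivalent to the `replace` calls above):

```python
# Map cedilla-below characters to the correct comma-below Romanian letters.
CEDILLA_TO_COMMA = str.maketrans("şţŞŢ", "șțȘȚ")

def sanitize(text: str) -> str:
    """Normalize Romanian cedilla letters to comma-below letters."""
    return text.translate(CEDILLA_TO_COMMA)

print(sanitize("Aceştia şi-au dorit şanţuri"))  # Aceștia și-au dorit șanțuri
```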

### Acknowledgements

We'd like to thank [TPU Research Cloud](https://sites.research.google/trc/about/) for providing the TPUv4 cores we used to train these models!

### Authors

Yours truly,

_[Stefan Dumitrescu](https://github.com/dumitrescustefan), [Mihai Ilie](https://github.com/iliemihai) and [Per Egil Kummervold](https://huggingface.co/north)_