	---
	language: ro
	license: apache-2.0
	---

	# MT5x-base-romanian

	This is a pretrained [mt5x](https://github.com/google-research/multilingual-t5) base model (390M parameters).

	Training was performed with the span corruption task on a clean 80GB Romanian text corpus for 4M total steps with these [scripts](https://github.com/dumitrescustefan/t5x_models), starting from the 1M public mt5x-base checkpoint. The model was trained with an encoder sequence length of 512 and a decoder sequence length of 256; it has the same mt5x vocabulary as the 1M multilingual checkpoint.
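
To make the span corruption objective concrete, here is a minimal sketch of how one training pair can be formed. It is an illustration only, not the code from the linked training scripts. The function name `span_corrupt`, the fixed example spans and the whitespace tokenization are assumptions, and the `<extra_id_N>` sentinel names follow the Hugging Face T5 tokenizer convention. Real pretraining samples the spans randomly over subword tokens.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel and build the matching target.

    `spans` must be sorted and non-overlapping; `end` is exclusive.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # keep the uncorrupted text
        inputs.append(sentinel)              # mark where a span was dropped
        targets.append(sentinel)             # the target repeats the sentinel...
        targets.extend(tokens[start:end])    # ...followed by the dropped tokens
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # a final sentinel closes the target
    return inputs, targets


# Hypothetical example with whitespace "tokens" instead of real subword pieces
tokens = "Acesta este un test foarte simplu .".split()
corrupted, target = span_corrupt(tokens, [(1, 2), (4, 6)])
print(" ".join(corrupted))
print(" ".join(target))
```

The encoder sees the corrupted sequence and the decoder learns to emit the target, which is why the raw checkpoint produces sentinel-delimited fragments rather than task outputs.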

	#### IMPORTANT: This model was pretrained only on the span corruption task, so it is not usable for any downstream task without fine-tuning first!

	### How to load an mt5x model

	```python
	from transformers import MT5Model, T5Tokenizer

	# Repository id as listed on the Hugging Face Hub page for this model
	model = MT5Model.from_pretrained('dumitrescustefan/mt5-base-romanian')
	tokenizer = T5Tokenizer.from_pretrained('dumitrescustefan/mt5-base-romanian')

	input_text = "Acesta este un test."
	target_text = "Acesta este"
	inputs = tokenizer(input_text, return_tensors="pt")
	labels = tokenizer(text_target=target_text, return_tensors="pt")

	# MT5Model is the bare encoder-decoder: it returns decoder hidden states, not logits
	outputs = model(input_ids=inputs["input_ids"], decoder_input_ids=labels["input_ids"])
	hidden_states = outputs.last_hidden_state
	print(hidden_states.shape)  # this will print torch.Size([1, 4, 768])
	```
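
Since the checkpoint needs fine-tuning before it is useful, a single training step might look like the sketch below. This is a hedged illustration, not the authors' recipe: `finetune_step` and `demo` are hypothetical helpers, the learning rate is a placeholder, and the toy text pair stands in for a real dataset. The 512 and 256 truncation limits mirror the encoder and decoder sequence lengths stated above. `MT5ForConditionalGeneration` is the `transformers` class that adds the language-modeling head and computes the loss when `labels` are passed.

```python
MAX_SOURCE_LEN = 512  # encoder sequence length used in pretraining
MAX_TARGET_LEN = 256  # decoder sequence length used in pretraining


def finetune_step(model, tokenizer, optimizer, source_text, target_text):
    """Run one gradient step on a single (source, target) pair and return the loss."""
    inputs = tokenizer(
        source_text, max_length=MAX_SOURCE_LEN, truncation=True, return_tensors="pt"
    )
    labels = tokenizer(
        text_target=target_text, max_length=MAX_TARGET_LEN, truncation=True, return_tensors="pt"
    )["input_ids"]
    # Passing `labels` makes the model build decoder inputs and compute cross-entropy
    outputs = model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        labels=labels,
    )
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()


def demo(model_name):
    """Hypothetical driver; it downloads the checkpoint, so it is never called on import."""
    import torch
    from transformers import MT5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = MT5ForConditionalGeneration.from_pretrained(model_name)
    model.train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # placeholder learning rate
    return finetune_step(model, tokenizer, optimizer, "Acesta este un test.", "Acesta este")
```

Call `demo(...)` with the same repository id used in the loading example. A real run would loop over batches from a task dataset and pad the labels appropriately.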

	Remember to always sanitize your text! Replace the cedilla letters ``ş`` and ``ţ`` with their comma-below counterparts ``ș`` and ``ț``:
	```python
	text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
	```
	This is needed because the model was not trained on the cedilla forms ``ş`` and ``ţ``. If you skip this step, performance will drop because of ``<UNK>`` tokens and a higher number of tokens per word.
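
The same replacement can be wrapped in a small reusable helper. The sketch below is one possible way to do it; the name `sanitize` and the translation-table approach are assumptions rather than part of the original scripts. The code points are written as escapes so the cedilla and comma-below forms cannot be confused in fonts that render them identically.

```python
# Map cedilla letters (legacy encoding) to the comma-below letters used in Romanian
_CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
    "\u015e": "\u0218",  # S with cedilla -> S with comma below
    "\u0162": "\u021a",  # T with cedilla -> T with comma below
})


def sanitize(text: str) -> str:
    """Return `text` with the four cedilla letters replaced in a single pass."""
    return text.translate(_CEDILLA_TO_COMMA)


# The input uses cedilla letters; the printed text uses comma-below letters
print(sanitize("\u015etiin\u0163\u0103 \u015fi \u0163ar\u0103"))
```

Apply the helper to every string before tokenization, both during fine-tuning and at inference time.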

	### Acknowledgements

	We'd like to thank [TPU Research Cloud](https://sites.research.google/trc/about/) for providing the TPUv4 cores we used to train these models!

	### Authors

	Yours truly,

	_[Stefan Dumitrescu](https://github.com/dumitrescustefan), [Mihai Ilie](https://github.com/iliemihai) and [Per Egil Kummervold](https://huggingface.co/north)_