TUKE-KEMT/slovak-t5-base

Slovak T5 Base

Monolingual Slovak model, trained from scratch on web data.

This model has to be fine-tuned for a specific task. It does not support any instructions or prefixes yet.

After fine-tuning, it is suitable for tasks such as:

Question answering
Summarization
Generation of synthetic data
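Because the model ships without task prefixes, the input format for fine-tuning is up to the user. The following is a minimal sketch of preparing a question-answering example, assuming the checkpoint loads through the standard `transformers` seq2seq classes. The `otázka:` / `kontext:` markers, the helper names, and the length limits are hypothetical choices for illustration, not something the model was trained with.

```python
MODEL_ID = "TUKE-KEMT/slovak-t5-base"


def build_qa_pair(question, context, answer):
    """Turn one QA sample into a text-to-text pair.

    The 'otázka:' / 'kontext:' markers are an assumed convention;
    the pretrained model does not know any prefixes.
    """
    source = f"otázka: {question} kontext: {context}"
    return {"source": source, "target": answer}


def load_model_and_tokenizer(model_id=MODEL_ID):
    # Imported lazily so the formatting helper works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    return model, tokenizer


def encode_pair(tokenizer, pair, max_input_length=512, max_target_length=64):
    # Tokenize the source text, then attach the tokenized target as labels.
    encoded = tokenizer(pair["source"], max_length=max_input_length, truncation=True)
    labels = tokenizer(
        text_target=pair["target"], max_length=max_target_length, truncation=True
    )
    encoded["labels"] = labels["input_ids"]
    return encoded
```

The encoded examples can then be passed to any standard seq2seq training loop. The 512-token input limit mirrors the pretraining input length listed under Hyperparameters; the target limit of 64 is an arbitrary assumption.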

Training data

Trained on the Slovak subset of the mc4 dataset using NanoT5 with default settings.

The training corpus contains 14B tokens in total after deduplication.

It consists of the Slovak data from:

mc4
Oscar
Wikipedia
custom collection of newspaper articles
custom collection of web pages
Slovak part of the European Parliament Proceedings

Hyperparameters:

Input length: 512 tokens
Effective Batch Size: 128
Steps: 200000
Optimizer: Adafactor
Scheduler: Legacy
Learning Rate: 0.2
Gradient clip: 1
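As a rough sanity check, the hyperparameters above can be combined to estimate how many input tokens the model saw during pretraining. This sketch assumes every sequence is filled to the full 512 tokens and that the 14B corpus size is counted with the same tokenizer; neither is stated in the model card, so the result is only an approximation.

```python
# Values taken from the hyperparameter list and the corpus description.
steps = 200_000
effective_batch_size = 128
input_length = 512
corpus_tokens = 14e9

# Assumption: every sequence in every batch is fully packed to input_length.
tokens_per_step = effective_batch_size * input_length
total_input_tokens = steps * tokens_per_step
approx_epochs = total_input_tokens / corpus_tokens

print(f"tokens per step:    {tokens_per_step:,}")      # 65,536
print(f"total input tokens: {total_input_tokens:,}")   # 13,107,200,000
print(f"approx. epochs:     {approx_epochs:.2f}")      # 0.94
```

Under these assumptions, training covers roughly 13.1B input tokens, or a little under one pass over the corpus.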

Evaluation

After fine-tuning for question answering on SK-QuAD, the compared models achieve:

Slovak T5 Base: 71.31 F1
umT5 Base: 69.22 F1
mT5 Base: 65.29 F1
mT0 Base: 65.17 F1
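For readers unfamiliar with the metric, the sketch below shows a simplified token-overlap F1 in the style of SQuAD evaluation. It is an assumption that the scores above were computed this way; the official SQuAD script additionally strips punctuation and English articles, which is omitted here. The function name `token_f1` is illustrative.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # If either side is empty, the score is 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A dataset-level score is typically the mean of this value over all questions, taking the maximum over reference answers when a question has several.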

Bias

The model is published as is. We made no specific attempt to clean up the training data.

License

Free for scientific and commercial use under the terms of: cc-by-sa-4.0

Credits

Daniel Hládek @ KEMT FIE TUKE

Safetensors

Model size: 383M params
Tensor type: F32
