Slovak T5 Base

Monolingual Slovak model, trained from scratch on web data.

This model has to be fine-tuned for a specific task; it does not support any instructions or prefixes yet.

After fine-tuning, it is suitable for tasks such as:

  • Question answering
  • Summarization
  • Generation of synthetic data
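
Because the checkpoint has no built-in task prefixes, the input format is chosen at fine-tuning time. The sketch below shows one possible way to turn a question answering record into a text-to-text pair and to load the checkpoint with Hugging Face Transformers. The `otázka:` / `kontext:` markers and the helper names are illustrative assumptions, not something the model was pretrained on; the model ID is the repository name shown on this page.

```python
MODEL_ID = "TUKE-KEMT/slovak-t5-base"  # repository name as shown on this model page


def format_qa_example(question, context, answer):
    """Build one (input, target) text pair for seq2seq fine-tuning.

    The 'otázka:' / 'kontext:' markers are an arbitrary choice for this sketch;
    the pretrained model does not know them until it is fine-tuned.
    """
    source = f"otázka: {question.strip()} kontext: {context.strip()}"
    target = answer.strip()
    return source, target


def load_model(model_id=MODEL_ID):
    """Download tokenizer and weights (needs `transformers` and network access)."""
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    return tokenizer, model


if __name__ == "__main__":
    source, target = format_qa_example(
        "Aké je hlavné mesto Slovenska?",
        "Bratislava je hlavné mesto Slovenska.",
        "Bratislava",
    )
    print(source)
    print(target)
```

Whatever format is used for fine-tuning has to be reused unchanged at inference time, since the model learns it only from the fine-tuning data.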

Training data

Trained on the Slovak subset of the mc4 dataset using NanoT5 with default settings.

The training corpus totals 14B tokens after deduplication.

It consists of the Slovak data from:

  • mc4
  • Oscar
  • Wikipedia
  • custom collection of newspaper articles
  • custom collection of web pages
  • Slovak part of the European Parliament Proceedings
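
The card does not describe how the sources were deduplicated. As an illustration only, exact document-level deduplication across overlapping sources (mc4 and Oscar are both derived from Common Crawl) can be sketched by hashing whitespace-normalized text. The function name and the normalization rule are assumptions made for this sketch, not the authors' pipeline.

```python
import hashlib


def deduplicate(documents):
    """Yield documents whose normalized text has not been seen before."""
    seen = set()
    for text in documents:
        # Collapse whitespace and lowercase so trivial variants hash identically.
        normalized = " ".join(text.split()).lower()
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield text


docs = ["Dobrý deň,  svet.", "dobrý deň, svet.", "Iný dokument."]
unique_docs = list(deduplicate(docs))
print(len(unique_docs))
```

Exact hashing only removes verbatim copies; near-duplicate pages would need a fuzzy method such as MinHash, which is outside the scope of this sketch.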

Hyperparameters:

  • Input length: 512 tokens
  • Effective Batch Size: 128
  • Steps: 200000
  • Optimizer: Adafactor
  • Scheduler: Legacy
  • Learning Rate: 0.2
  • Gradient clip: 1
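
As a rough sanity check, the hyperparameters above can be turned into a token budget. The calculation below is a back-of-the-envelope sketch: it assumes every input sequence is filled to the full 512 tokens, counts encoder inputs only, and assumes the 14B corpus figure is measured with the same tokenizer.

```python
INPUT_LENGTH = 512        # tokens per input sequence
EFFECTIVE_BATCH = 128     # sequences per optimizer step
STEPS = 200_000           # optimizer steps
CORPUS_TOKENS = 14e9      # corpus size after deduplication

tokens_per_step = INPUT_LENGTH * EFFECTIVE_BATCH
tokens_seen = tokens_per_step * STEPS
epochs = tokens_seen / CORPUS_TOKENS

print(f"{tokens_per_step:,} input tokens per step")        # 65,536
print(f"{tokens_seen / 1e9:.1f}B input tokens in total")   # 13.1B
print(f"about {epochs:.2f} passes over the corpus")        # about 0.94
```

Under these assumptions the run covers slightly less than one pass over the corpus.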

Evaluation

After fine-tuning for question answering on SK-QUAD, the results are:

  • Slovak T5 Base: 71.31 F1
  • Umt5 Base: 69.22 F1
  • Mt5 Base: 65.29 F1
  • Mt0 Base: 65.17 F1
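
The card does not state which F1 variant is reported. Assuming it is the token-overlap F1 commonly used for SQuAD-style datasets, a per-answer score can be sketched as follows. The simplified normalization (lowercasing and stripping ASCII punctuation) is an assumption and may differ from the official SK-QUAD evaluation script.

```python
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two answers."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(round(token_f1("v Bratislave", "Bratislave"), 3))  # 0.667
```

A dataset-level figure of this kind is usually the mean per-question score multiplied by 100.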

Bias

The model is published as is. We did not make any specific attempt to clean up the training data.

License

Free for scientific and commercial use under the terms of the CC BY-SA 4.0 license (cc-by-sa-4.0).

Credits

  • Daniel Hládek @ KEMT FIE TUKE

Model size

  • 383M params
  • Tensor type: F32 (Safetensors)