---
license: apache-2.0
datasets:
- HuggingFaceFW/fineweb-2
language:
- tr
tags:
- turkish
- ul2
- t5
---

# BERT5urk

![BERT5urk](bert5urk_logo.png)

This repository hosts BERT5urk, a new Turkish T5 model with 1.42B parameters.

BERT5urk is part of the [Turkish Model Zoo](https://github.com/stefan-it/turkish-bert) family and pretrained using the awesome
[T5X](https://github.com/google-research/t5x) library with the [UL2](https://arxiv.org/abs/2205.05131) objective.

Inspired by the great [Finnish T5 and UL2 models](https://huggingface.co/Finnish-NLP/ul2-base-nl36-finnish) from the [Finnish NLP](https://huggingface.co/Finnish-NLP)
group, BERT5urk also uses UL2 and the efficient T5 architecture proposed in the ["Scale Efficiently"](https://arxiv.org/abs/2109.10686) paper. Many thanks
to the [Finnish NLP](https://huggingface.co/Finnish-NLP) group for open-sourcing their pretraining code and models!
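
A minimal usage sketch with 🤗 Transformers, assuming the hosted checkpoint is a Transformers-compatible T5 export (the sentinel-token span filling below is standard T5 behavior and may need adjustment for this UL2 checkpoint):

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Load the checkpoint (assumes a Transformers-compatible T5 export).
tokenizer = AutoTokenizer.from_pretrained("stefan-it/bert5urk")
model = T5ForConditionalGeneration.from_pretrained("stefan-it/bert5urk")

# Span-filling example with a T5 sentinel token; the input means
# "The capital of Turkey is the city of <extra_id_0>."
input_ids = tokenizer(
    "Türkiye'nin başkenti <extra_id_0> şehridir.", return_tensors="pt"
).input_ids
outputs = model.generate(input_ids, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```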

# Pretraining Data

BERT5urk uses the Turkish part of the amazing [FineWeb2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) corpus.
Only documents with a language score higher than 0.99 are selected for the final pretraining corpus, which has a total size of 262GB.

We train a SentencePiece-based vocabulary on a 3GB corpus built from randomly chosen documents of the pretraining corpus.
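
A rough sketch of the filtering and vocabulary-training steps; the `tur_Latn` config name and the `language_score` column follow the FineWeb2 dataset card, while the sample size and vocabulary size below are assumptions:

```python
import sentencepiece as spm
from datasets import load_dataset

# Stream the Turkish FineWeb2 split and keep only high-confidence documents.
fineweb_tr = load_dataset(
    "HuggingFaceFW/fineweb-2", name="tur_Latn", split="train", streaming=True
)
filtered = fineweb_tr.filter(lambda doc: doc["language_score"] > 0.99)

# Dump a random sample to plain text for vocabulary training
# (the actual 3GB sampling logic is omitted here for brevity).
with open("spm_corpus.txt", "w", encoding="utf-8") as f:
    for doc in filtered.take(100_000):  # illustrative sample size
        f.write(doc["text"].replace("\n", " ") + "\n")

# Train a unigram SentencePiece model; the vocab size is an assumption.
spm.SentencePieceTrainer.train(
    input="spm_corpus.txt",
    model_prefix="bert5urk",
    vocab_size=32_000,
    model_type="unigram",
)
```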

# Pretraining

BERT5urk was pretrained with the awesome [T5X](https://github.com/google-research/t5x) library. Some pretraining highlights:

* One-shot pretraining (pretraining without any training crashes) was possible on a v3-32 TPU Pod and took 16.56 days
* The model was pretrained for 2M steps with an input and output sequence length of 512 and a batch size of 128 (see the token-count estimate below)
* The resulting model has 1.42B parameters
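
As a back-of-the-envelope check, these settings translate into roughly 131B input tokens seen during pretraining:

```python
steps, batch_size, seq_len = 2_000_000, 128, 512
tokens = steps * batch_size * seq_len  # 131,072,000,000
print(f"≈{tokens / 1e9:.0f}B input tokens")  # ≈131B
```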

# Evaluation

Detailed evaluations can be found in the [Turkish Model Zoo](https://github.com/stefan-it/turkish-bert) repository. Additionally, we also fine-tuned
[TURNA](https://huggingface.co/boun-tabi-LMG/TURNA), another Turkish T5 model with 1.14B parameters, for comparison.

## Encoder-only Results

For experiments on named entity recognition (NER) and part-of-speech (PoS) tagging we use the awesome [Flair](https://github.com/flairNLP/flair) library and fine-tune only the encoder of BERT5urk and TURNA
(a minimal fine-tuning sketch follows the table). The overall performance can be seen in the following table:

| Model Name                                                                                                | Overall Development | Overall Test |
|-----------------------------------------------------------------------------------------------------------|--------------------:|-------------:|
| [BERTurk (cased, 128k)](https://huggingface.co/dbmdz/bert-base-turkish-128k-cased)                        |               89.72 |        90.05 |
| [BERTurk (uncased, 128k)](https://huggingface.co/dbmdz/bert-base-turkish-128k-uncased)                    |               89.25 |        89.95 |
| [BERTurk (cased, 32k)](https://huggingface.co/dbmdz/bert-base-turkish-cased)                              |               88.98 |        89.49 |
| [BERTurk (uncased, 32k)](https://huggingface.co/dbmdz/bert-base-turkish-uncased)                          |               89.28 |        89.67 |
| [ConvBERTurk (cased)](https://huggingface.co/dbmdz/convbert-base-turkish-cased)                           |           **90.06** |        90.27 |
| [ConvBERTurk mC4 (cased)](https://huggingface.co/dbmdz/convbert-base-turkish-mc4-cased)                   |               90.03 |        90.09 |
| [ConvBERTurk mC4 (uncased)](https://huggingface.co/dbmdz/convbert-base-turkish-mc4-uncased)               |               89.76 |        89.97 |
| [DistilBERTurk (cased)](https://huggingface.co/dbmdz/distilbert-base-turkish-cased)                       |               87.95 |        88.16 |
| [ELECTRA Base (cased)](https://huggingface.co/dbmdz/electra-base-turkish-cased-discriminator)             |               89.08 |        89.91 |
| [ELECTRA Base mC4 (cased)](https://huggingface.co/dbmdz/electra-base-turkish-mc4-cased-discriminator)     |               89.24 |        90.03 |
| [ELECTRA Base mC4 (uncased)](https://huggingface.co/dbmdz/electra-base-turkish-mc4-uncased-discriminator) |               89.09 |        89.62 |
| [ELECTRA Small (cased)](https://huggingface.co/dbmdz/electra-small-turkish-cased-discriminator)           |               87.27 |        88.28 |
| [BERT5urk](https://huggingface.co/stefan-it/bert5urk)                                                     |               89.96 |        90.26 |
| [TURNA](https://huggingface.co/boun-tabi-LMG/TURNA)                                                       |               88.81 |        89.36 |
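
The following is a minimal Flair sketch for the encoder-only setup, shown here for PoS tagging on Universal Dependencies Turkish; the corpus choice and hyperparameters are illustrative assumptions, not the exact configuration behind the numbers above:

```python
from flair.datasets import UD_TURKISH
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# PoS tagging on Universal Dependencies Turkish; NER works analogously.
corpus = UD_TURKISH()
label_dict = corpus.make_label_dictionary(label_type="upos")

# For encoder-decoder models, Flair's word embeddings only use the
# encoder stack, which enables this encoder-only comparison.
embeddings = TransformerWordEmbeddings("stefan-it/bert5urk", fine_tune=True)

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=label_dict,
    tag_type="upos",
    use_crf=False,
)

trainer = ModelTrainer(tagger, corpus)
trainer.fine_tune(
    "taggers/bert5urk-upos",
    learning_rate=5e-5,   # illustrative hyperparameters
    mini_batch_size=16,
)
```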

## Encoder-decoder Results

We tried to replicate the results from the [TURNA](https://arxiv.org/abs/2401.14373) paper using the [TURNA fine-tuning](https://github.com/boun-tabi-LMG/turkish-lm-tuner) library.
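
As we cannot vouch for the exact turkish-lm-tuner API here, the sketch below shows an equivalent fine-tuning setup with plain 🤗 Transformers; the dataset file and column names are hypothetical placeholders:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("stefan-it/bert5urk")
model = AutoModelForSeq2SeqLM.from_pretrained("stefan-it/bert5urk")

# Hypothetical paraphrase dataset with "source"/"target" columns.
dataset = load_dataset("csv", data_files={"train": "tatoeba_train.csv"})

def preprocess(batch):
    model_inputs = tokenizer(batch["source"], truncation=True, max_length=512)
    labels = tokenizer(text_target=batch["target"], truncation=True, max_length=512)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=["source", "target"])

args = Seq2SeqTrainingArguments(
    output_dir="bert5urk-tatoeba",
    learning_rate=1e-4,              # illustrative hyperparameters
    per_device_train_batch_size=8,
    num_train_epochs=3,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```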

### Paraphrasing - Tatoeba

We fine-tune five models each for TURNA and BERT5urk with different seeds and report the average score. Additionally, the score from the TURNA paper
is also shown in the following table (a metric-computation sketch follows the table):

| Model                                                            | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor |
|:-----------------------------------------------------------------|------------:|------------:|------------:|----------:|------------:|
| [TURNA](https://arxiv.org/abs/2401.14373) (paper)                | 90.22       | 80.23       | 88.95       | 71.14     | 87.56       |
| [TURNA](https://huggingface.co/boun-tabi-LMG/TURNA) (replicated) | 90.36       | 80.50       | 89.10       | 71.48     | 87.63       |
| [BERT5urk](https://huggingface.co/stefan-it/bert5urk)            | 90.47       | 80.78       | 89.21       | 71.89     | 87.74       |
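
The reported metrics can be computed with the 🤗 [evaluate](https://github.com/huggingface/evaluate) library; note that tokenization and casing details may differ from the TURNA paper's evaluation:

```python
import evaluate

predictions = ["örnek bir cümle"]  # model outputs (illustrative)
references = ["örnek bir tümce"]   # gold paraphrases (illustrative)

rouge = evaluate.load("rouge")
bleu = evaluate.load("sacrebleu")
meteor = evaluate.load("meteor")

print(rouge.compute(predictions=predictions, references=references))
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(meteor.compute(predictions=predictions, references=references))
```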

### Paraphrasing - OpenSubtitles

We fine-tune TURNA and BERT5urk with only one seed (due to resource limitations) and report scores (incl. scores from the TURNA paper):

| Model                                                            | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor |
|:-----------------------------------------------------------------|------------:|------------:|------------:|----------:|------------:|
| [TURNA](https://arxiv.org/abs/2401.14373) (paper)                | 78.43       | 63.58       | 76.81       | 51.47     | 74.79       |
| [TURNA](https://huggingface.co/boun-tabi-LMG/TURNA) (replicated) | 78.36       | 63.42       | 76.71       | 51.39     | 74.94       |
| [BERT5urk](https://huggingface.co/stefan-it/bert5urk)            | 78.56       | 63.80       | 76.95       | 51.74     | 75.07       |

### Title Generation - TrNews

We fine-tune TURNA and BERT5urk with only one seed (due to resource limitations) and report scores (incl. scores from the TURNA paper):

| Model                                                            | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor |
|:-----------------------------------------------------------------|------------:|------------:|------------:|----------:|------------:|
| [TURNA](https://arxiv.org/abs/2401.14373) (paper)                | 36.47       | 22.88       | 35.47       | 12.64     | 23.62       |
| [TURNA](https://huggingface.co/boun-tabi-LMG/TURNA) (replicated) | 41.65       | 27.60       | 36.77       | 18.60     | 34.55       |
| [BERT5urk](https://huggingface.co/stefan-it/bert5urk)            | 41.79       | 27.77       | 37.00       | 19.08     | 34.69       |

### Summarization - TrNews

We fine-tune TURNA and BERT5urk with only one seed (due to resource limitations) and report scores (incl. scores from the TURNA paper):

| Model                                                            | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor |
|:-----------------------------------------------------------------|------------:|------------:|------------:|----------:|------------:|
| [TURNA](https://arxiv.org/abs/2401.14373) (paper)                | 41.77       | 27.81       | 36.99       | 19.05     | 34.61       |
| [TURNA](https://huggingface.co/boun-tabi-LMG/TURNA) (replicated) | 40.75       | 26.82       | 35.88       | 18.00     | 33.91       |
| [BERT5urk](https://huggingface.co/stefan-it/bert5urk)            | 41.00       | 27.08       | 36.24       | 18.78     | 23.96       |

# Acknowledgments

Research supported with Cloud TPUs from Google's [TPU Research Cloud](https://sites.research.google/trc/about/) (TRC).
Many thanks for providing access to the TPUs over many years ❤️

Made from the Bavarian Oberland with ❤️ and 🥨.