---
license: other
language:
- tr
library_name: transformers
pipeline_tag: text2text-generation
inference: false
---
<!--
inference:
parameters:
temperature: 10
repetition_penalty: 10
top_p: 0.5
temperature: 0.7
repetition_penalty: 100
top_p: 0.9
-->
# Model Card for TURNA
<!-- Provide a quick summary of what the model is/does. -->
TURNA is a Turkish language model based on the UL2 framework, suitable for both understanding and generation tasks.
Evaluations across three generation and five understanding tasks in Turkish show that TURNA outperforms several multilingual models and competes with monolingual Turkish models in understanding tasks.
The model is shared with the public to be used solely for non-commercial academic research purposes.
## Model Details
- 36 encoder and 36 decoder layers
- 16 attention heads
- 1024-dimensional token embeddings
- Feed-forward (multi-layer perceptron) layers with 2816 hidden dimensions and gated GELU activations
- The parameters of the input and classification layers are not shared
- 1.1B parameters
- Unigram subword tokenizer trained on a 10 GB corpus of random subsets of OSCAR, OPUS, and Wikipedia
- Vocabulary size: 32,000 tokens + 128 special tokens
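For orientation only, these hyperparameters map roughly onto a `transformers` T5/UL2-style configuration as sketched below; `d_kv` and the exact vocabulary size of 32,128 are inferred from the numbers above rather than confirmed by this card.

```python
from transformers import T5Config

# Sketch of TURNA's architecture as a T5/UL2-style config (not the official config file).
config = T5Config(
    vocab_size=32_128,             # 32,000 subword tokens + 128 special tokens (assumed exact value)
    d_model=1024,                  # token embedding / hidden dimension
    d_ff=2816,                     # feed-forward hidden dimension
    num_layers=36,                 # encoder layers
    num_decoder_layers=36,         # decoder layers
    num_heads=16,                  # attention heads
    d_kv=64,                       # assumed: d_model / num_heads
    feed_forward_proj="gated-gelu",
    tie_word_embeddings=False,     # input and classification parameters are not shared
)
print(config)
```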
### Model Description
<!-- Provide a longer summary of what this model is. -->
- **Developed by:** Bogazici University Computer Engineering Department TABILAB (special thanks to VNGRS-AI for sharing their tokenizer)
- **Funded by:** We thank the Google TPU Research Cloud program for providing us with credits to pretrain our model on TPU v3-8 machines. We are grateful to TETAM and BOUN CMPE for providing access to the GPU cluster used in fine-tuning and evaluation experiments.
<!-- - **Shared by [optional]:** [More Information Needed] -->
- **Model type:** Transformer-based encoder-decoder
- **Language(s) (NLP):** Turkish
- **License:** The model is shared with the public to be used solely for non-commercial academic research purposes.
### Model Sources
<!-- Provide the basic links for the model. -->
- **Repository:** [Training code](https://github.com/boun-tabi-LMG/turna), [Finetuning library](https://github.com/boun-tabi-LMG/turkish-lm-tuner)
- **Paper:** [Arxiv paper](https://arxiv.org/abs/2401.14373)
## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
### Direct Use
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
This model can be used for research purposes. Given an input text, the model predicts the tokens that follow.
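A minimal sketch of loading the model with `transformers` for generation; the checkpoint id below is assumed to refer to this repository, so adjust it if you load from a local path.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Assumed checkpoint id for this repository; replace with a local path if needed.
model_name = "boun-tabi-LMG/TURNA"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Feed some Turkish text and let the model continue it.
inputs = tokenizer("Türkiye'nin başkenti", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=True, top_p=0.9, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```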
### Downstream Use
<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
This model can be finetuned using [our library](https://github.com/boun-tabi-LMG/turkish-lm-tuner) to solve your custom task involving the Turkish language.
It can also be further trained to be more helpful, less harmful, and better suited to dialog use cases.
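If you prefer to stay within plain `transformers` rather than the library above, a fine-tuning run looks roughly like the sketch below; the dataset, column names, and hyperparameters are placeholders rather than recommendations from this card, and the checkpoint id is assumed.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "boun-tabi-LMG/TURNA"  # assumed checkpoint id for this repository
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Placeholder dataset with "input" / "target" text columns; swap in your own Turkish task data.
dataset = load_dataset("json", data_files={"train": "train.json"})

def preprocess(batch):
    enc = tokenizer(batch["input"], max_length=512, truncation=True)
    enc["labels"] = tokenizer(text_target=batch["target"], max_length=512, truncation=True)["input_ids"]
    return enc

tokenized = dataset["train"].map(preprocess, batched=True, remove_columns=dataset["train"].column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="turna-finetuned",
        per_device_train_batch_size=8,
        learning_rate=1e-4,
        num_train_epochs=3,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```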
### Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
Any commercial or malicious activity.
## Bias, Risks, and Limitations
We refer to Flan-T5's [official model card](https://arxiv.org/pdf/2210.11416.pdf):
> Language models, including Flan-T5, can potentially be used for language generation in a harmful way, according to Rae et al. (2021). Flan-T5 should not be used directly in any application, without a prior assessment of safety and fairness concerns specific to the application.
### Ethical considerations and risks
> ... (ed. The model) is fine-tuned on a large corpus of text data that was not filtered for explicit content or assessed for existing biases. As a result the model itself is potentially vulnerable to generating equivalently inappropriate content or replicating inherent biases in the underlying data.
### Known Limitations
> ... (ed. The model) has not been tested in real world applications.
### Sensitive Use
> ... (ed. The model) should not be applied for any unacceptable use cases, e.g., generation of abusive speech.
## How to Get Started with the Model
You can find the technical guidance at our library's Github [page](https://github.com/boun-tabi-LMG/turkish-lm-tuner).
## Training Details
- The pretraining was performed with Mixture-of-Denoisers (MoD)
- This version of the model was trained for 1,740,000 steps
- Batch size: 48
- Input and output lengths: 512
- Effectively exposed to 42.7B tokens
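The token count is consistent with the figures above, assuming each step consumes a full batch of full-length sequences:

```python
steps = 1_740_000   # pretraining steps
batch_size = 48     # sequences per step
seq_len = 512       # input length in tokens

tokens = steps * batch_size * seq_len
print(f"{tokens:,} tokens ≈ {tokens / 1e9:.2f}B")  # ≈ 42.76B, i.e. the ~42.7B reported above
```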
Refer to the paper for more information.
## Evaluation
We have not yet evaluated the model for biases.
However, we performed finetuning for several understanding and generation tasks:
- Paraphrasing: TAT and OST ([source](https://aclanthology.org/2022.icnlsp-1.14.pdf))
- Summarization and news title generation: [TRNews](https://dl.acm.org/doi/10.1007/s10579-021-09568-y) and [MLSUM](https://arxiv.org/pdf/2004.14900v1.pdf)
- Named Entity Recognition: [WikiANN](https://www.aclweb.org/anthology/P19-1015) and [MilliyetNER](https://doi.org/10.1017/S135132490200284X)
- Part of Speech tagging: Two Universal Dependencies Turkish Treebanks, [IMST](https://universaldependencies.org/treebanks/tr_imst/index.html), [BOUN](https://universaldependencies.org/treebanks/tr_boun/index.html).
- Semantic Textual Similarity: [STSb-tr](https://doi.org/10.18653/v1/2021.gem-1.3)
- Natural language inference: [NLI-TR](https://doi.org/10.18653/v1/2020.emnlp-main.662)
- Text classification: [Product reviews](https://huggingface.co/datasets/turkish_product_reviews), [TTC4900](https://doi.org/10.5505/pajes.2018.15931), and [Tweet sentiments](https://ieeexplore.ieee.org/document/8554037)
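All of these tasks are cast as text-to-text problems, so a fine-tuned checkpoint can be queried exactly like the base model. A hedged sketch follows; the checkpoint id is hypothetical, so look up the actually released task models in the turkish-lm-tuner repository.

```python
from transformers import pipeline

# Hypothetical id of a TURNA checkpoint fine-tuned for news summarization;
# check the turkish-lm-tuner repository for the real released names.
summarizer = pipeline("text2text-generation", model="boun-tabi-LMG/turna_summarization_example")

article = "..."  # a Turkish news article goes here
print(summarizer(article, max_new_tokens=128)[0]["generated_text"])
```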
Refer to the paper for more information.
## Environmental Impact
<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- **Hardware Type:** TPU v3-8
- **Hours used:** About 400 hours
- **Cloud Provider:** Google Cloud
- **Compute Region:** europe-west4-a
- **Carbon Emitted:** 64.52 kg CO2eq
## Technical Specifications
Refer to the paper for more information.
## Citation
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
**BibTeX:**
Coming soon!
**APA:**
Coming soon!
## Model Card Authors
Paper authors.
## Model Card Contact
Onur Güngör
<!--datasets:
- batubayk/TR-News
- mlsum
- mrbesher/tr-paraphrase-opensubtitles2018
- mrbesher/tr-paraphrase-tatoeba
- figenfikri/stsb_tr
- nli_tr
- ttc4900
- turkish_product_reviews-->