---
license: mit
base_model: openai-community/gpt2
tags:
- generated_from_trainer
metrics:
- accuracy
model-index:
- name: gpt2-lang-ident
results: []
pipeline_tag: text-classification
language:
- af
- am
- ar
- as
- az
- ba
- be
- bg
- bn
- ca
- ceb
- ckb
- cs
- cy
- da
- de
- dv
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- ka
- kk
- kn
- ku
- ky
- la
- lb
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- mt
- my
- nds
- ne
- nl
- nn
- no
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sah
- sd
- si
- sk
- sl
- sq
- sr
- sv
- sw
- ta
- te
- tg
- th
- tk
- tl
- tr
- tt
- ug
- uk
- ur
- vi
- yi
---
# gpt2-lang-ident
This model is a fine-tuned version of [openai-community/gpt2](https://huggingface.co/openai-community/gpt2), trained on sentences sampled from the `stanford-oval/ccnews` and `qanastek/EMEA-V3` datasets.
It achieves the following results on the evaluation set:
- Loss: 0.1210
- Accuracy: 0.9721
## Model description
This model is trained to predict the language of the input text.
## Intended uses & limitations
The model can predict the following 90 languages:
```
[
"af", "am", "ar", "as", "az", "ba", "be", "bg", "bn", "ca",
"ceb", "ckb", "cs", "cy", "da", "de", "dv", "el", "en", "eo",
"es", "et", "eu", "fa", "fi", "fr", "fy", "ga", "gd", "gl",
"gu", "he", "hi", "hr", "hu", "hy", "id", "is", "it", "ja",
"ka", "kk", "kn", "ku", "ky", "la", "lb", "lt", "lv", "mg",
"mk", "ml", "mn", "mr", "mt", "my", "nds", "ne", "nl", "nn",
"no", "or", "pa", "pl", "ps", "pt", "ro", "ru", "sah", "sd",
"si", "sk", "sl", "sq", "sr", "sv", "sw", "ta", "te", "tg",
"th", "tk", "tl", "tr", "tt", "ug", "uk", "ur", "vi", "yi"
]
```
How to use:
```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    pipeline,
)

checkpoint = "nie3e/gpt2-lang-ident"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# top_k=5 returns the five most likely languages with their scores
pipe = pipeline(
    task="text-classification",
    model=model,
    tokenizer=tokenizer,
    top_k=5,
)

# Polish input: "This is a model for language identification!"
result = pipe("To jest model służący do identyfikacji języka!")
print(result)
```
```
[[{'label': 'pl', 'score': 0.9999653100967407}, {'label': 'sr', 'score': 1.5228776646836195e-05}, {'label': 'hr', 'score': 1.057955432770541e-05}, {'label': 'bn', 'score': 1.590750912328076e-06}, {'label': 'cs', 'score': 1.3942196801508544e-06}]]
```
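For each input, the pipeline returns a list of `{'label', 'score'}` dicts. A minimal sketch of reducing that list to a single decision follows. The helper name `pick_language` and the `min_score` threshold of 0.5 are illustrative assumptions, not part of the model or of `transformers`.

```python
from typing import Dict, List, Optional


def pick_language(candidates: List[Dict], min_score: float = 0.5) -> Optional[str]:
    """Return the highest-scoring label, or None if it falls below min_score."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: c["score"])
    return best["label"] if best["score"] >= min_score else None


# Top three entries copied from the example output above.
example = [
    {"label": "pl", "score": 0.9999653100967407},
    {"label": "sr", "score": 1.5228776646836195e-05},
    {"label": "hr", "score": 1.057955432770541e-05},
]

print(pick_language(example))  # pl
```

With the pipeline from the usage example, this would be called as `pick_language(result[0])`. Returning `None` for low-confidence inputs is one possible design; the right threshold depends on the application.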
## Training and evaluation data
<details><summary>Training data ([lang]: count)</summary>
[bn]: 1947
[ar]: 1947
[vi]: 1947
[uk]: 1947
[kn]: 1947
[mr]: 1947
[id]: 1947
[te]: 1947
[no]: 1947
[ru]: 1947
[he]: 1947
[az]: 1947
[ca]: 1946
[fa]: 1946
[hi]: 1946
[th]: 1946
[tr]: 1946
[mk]: 1946
[ta]: 1945
[sq]: 1945
[ur]: 1942
[gu]: 1939
[ml]: 1936
[is]: 1738
[de]: 1543
[da]: 1521
[fi]: 1461
[el]: 1431
[nl]: 1424
[fr]: 1408
[cs]: 1401
[es]: 1397
[en]: 1394
[lt]: 1392
[hu]: 1379
[pt]: 1375
[lv]: 1373
[it]: 1360
[pl]: 1355
[sk]: 1355
[et]: 1348
[sl]: 1328
[sv]: 1300
[bg]: 1278
[mt]: 1234
[ro]: 1218
[kk]: 1179
[hy]: 1176
[or]: 1112
[pa]: 780
[sr]: 744
[as]: 735
[hr]: 722
[ne]: 626
[gl]: 566
[ckb]: 563
[ka]: 560
[ug]: 485
[ky]: 453
[eu]: 351
[ps]: 311
[tl]: 307
[fy]: 290
[mn]: 289
[si]: 244
[cy]: 214
[nn]: 212
[ku]: 195
[tg]: 176
[am]: 141
[tt]: 121
[ja]: 104
[lb]: 93
[tk]: 72
[be]: 64
[sw]: 45
[af]: 44
[my]: 40
[ceb]: 35
[la]: 33
[dv]: 20
[ba]: 19
[ga]: 19
[eo]: 19
[gd]: 16
[mg]: 15
[yi]: 14
[sah]: 14
[sd]: 11
[nds]: 11
</details>
<details><summary>Eval data ([lang]: count)</summary>
[te]: 195
[mk]: 195
[bn]: 195
[uk]: 195
[hi]: 195
[ar]: 195
[sq]: 195
[kn]: 195
[tr]: 195
[ca]: 195
[az]: 195
[fa]: 195
[ru]: 195
[mr]: 195
[id]: 195
[no]: 195
[vi]: 195
[th]: 195
[he]: 195
[gu]: 194
[ml]: 194
[ta]: 194
[ur]: 194
[is]: 174
[de]: 154
[da]: 152
[fi]: 146
[el]: 143
[nl]: 142
[fr]: 141
[es]: 140
[cs]: 140
[en]: 139
[lt]: 139
[hu]: 138
[lv]: 137
[pt]: 137
[it]: 136
[et]: 135
[pl]: 135
[sk]: 135
[sl]: 133
[sv]: 130
[bg]: 128
[mt]: 123
[ro]: 122
[hy]: 118
[kk]: 118
[or]: 111
[pa]: 78
[sr]: 74
[as]: 74
[hr]: 72
[ne]: 63
[gl]: 57
[ckb]: 56
[ka]: 56
[ug]: 49
[ky]: 45
[eu]: 35
[ps]: 31
[tl]: 31
[mn]: 29
[fy]: 29
[si]: 24
[nn]: 21
[cy]: 21
[ku]: 19
[tg]: 18
[am]: 14
[tt]: 12
[ja]: 10
[lb]: 9
[tk]: 7
[be]: 6
[my]: 4
[sw]: 4
[af]: 4
[ceb]: 3
[la]: 3
[ba]: 2
[dv]: 2
[eo]: 2
[gd]: 2
[ga]: 2
[mg]: 1
[sd]: 1
[nds]: 1
[yi]: 1
[sah]: 1
</details>
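Both splits are heavily imbalanced: the largest training classes have 1947 sentences, while `sd` and `nds` have 11 each. The sketch below parses `[lang]: count` lines like those above and flags sparsely covered languages. The regular expression, the function names, and the cutoff of 100 sentences are assumptions made for illustration.

```python
import re

# Matches lines such as "[bn]: 1947"
LINE_RE = re.compile(r"\[(\w+)\]:\s*(\d+)")


def parse_counts(text):
    """Turn '[lang]: count' lines into a {lang: count} dict."""
    return {lang: int(n) for lang, n in LINE_RE.findall(text)}


def low_resource(counts, cutoff=100):
    """Languages with fewer than `cutoff` sentences, sorted by code."""
    return sorted(lang for lang, n in counts.items() if n < cutoff)


# A short excerpt of the training counts listed above.
excerpt = """
[bn]: 1947
[pl]: 1355
[ja]: 104
[lb]: 93
[nds]: 11
"""

counts = parse_counts(excerpt)
print(low_resource(counts))                          # ['lb', 'nds']
print(max(counts.values()) / min(counts.values()))   # 177.0
```

Several languages have a single evaluation sentence, so the overall accuracy mostly reflects the well-represented languages; per-language performance on the sparse classes is not established by these numbers.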
## Training procedure
GPU: RTX 3090 \
Training time: 1h 53min
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 8
- eval_batch_size: 4
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 10
- mixed_precision_training: Native AMP
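
Two of these values can be checked by hand. `total_train_batch_size` is the per-device batch size multiplied by the gradient accumulation steps (a single GPU is assumed, consistent with the RTX 3090 noted above), and a `linear` scheduler decays the learning rate toward zero over the run. The sketch assumes no warmup, since none is listed, and takes 28120 total steps from the last row of the training results; `linear_lr` is a hypothetical helper, not a `transformers` function.

```python
train_batch_size = 8
gradient_accumulation_steps = 4
num_gpus = 1  # assumption: single-GPU training

# Samples contributing to each optimizer update
total_train_batch_size = train_batch_size * gradient_accumulation_steps * num_gpus
print(total_train_batch_size)  # 32


def linear_lr(step, base_lr=2e-5, total_steps=28120):
    """Learning rate under linear decay to zero, assuming no warmup."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return base_lr * remaining


print(linear_lr(0))      # 2e-05
print(linear_lr(28120))  # 0.0
```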
### Training results
| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|:-------------:|:-----:|:-----:|:---------------:|:--------:|
| 0.2833 | 1.0 | 2812 | 0.2004 | 0.94 |
| 0.168 | 2.0 | 5625 | 0.1567 | 0.954 |
| 0.1131 | 3.0 | 8437 | 0.1429 | 0.9586 |
| 0.0832 | 4.0 | 11250 | 0.1257 | 0.967 |
| 0.0635 | 5.0 | 14062 | 0.1222 | 0.9682 |
| 0.0479 | 6.0 | 16875 | 0.1214 | 0.9704 |
| 0.0361 | 7.0 | 19687 | 0.1255 | 0.9712 |
| 0.0258 | 8.0 | 22500 | 0.1178 | 0.9712 |
| 0.0243 | 9.0 | 25312 | 0.1223 | 0.9724 |
| 0.0171 | 10.0 | 28120 | 0.1210 | 0.9721 |
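
Validation loss and accuracy do not peak at the same epoch, which matters if a checkpoint is selected by one metric rather than the other. A small sketch that reads the table above (values copied by hand; the tuple layout is an illustrative choice):

```python
# (epoch, validation_loss, accuracy), copied from the table above
rows = [
    (1, 0.2004, 0.94),
    (2, 0.1567, 0.954),
    (3, 0.1429, 0.9586),
    (4, 0.1257, 0.967),
    (5, 0.1222, 0.9682),
    (6, 0.1214, 0.9704),
    (7, 0.1255, 0.9712),
    (8, 0.1178, 0.9712),
    (9, 0.1223, 0.9724),
    (10, 0.1210, 0.9721),
]

best_by_loss = min(rows, key=lambda r: r[1])
best_by_accuracy = max(rows, key=lambda r: r[2])

print(best_by_loss)      # (8, 0.1178, 0.9712)
print(best_by_accuracy)  # (9, 0.1223, 0.9724)
```

The results reported at the top of this card (loss 0.1210, accuracy 0.9721) correspond to the final epoch, not to either of these.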
### Framework versions
- Transformers 4.36.2
- Pytorch 2.1.2+cu121
- Datasets 2.16.1
- Tokenizers 0.15.0
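
To compare a local environment against these versions, something like the following could be used. It is a sketch only: the distribution names (`torch` for PyTorch in particular) are assumptions about how the packages were installed, and `compare_versions` is a hypothetical helper.

```python
from importlib import metadata

# Versions listed above, keyed by assumed pip distribution name
EXPECTED = {
    "transformers": "4.36.2",
    "torch": "2.1.2+cu121",
    "datasets": "2.16.1",
    "tokenizers": "0.15.0",
}


def compare_versions(expected):
    """Map each package to (expected, installed); installed is None if missing."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None
        report[package] = (wanted, installed)
    return report


for package, (wanted, installed) in compare_versions(EXPECTED).items():
    status = "ok" if installed == wanted else "differs"
    print(f"{package}: expected {wanted}, installed {installed} ({status})")
```

A version mismatch does not necessarily mean the model will fail to load; the check only shows where the environment differs from the one used for training.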