---
license: mit
base_model: openai-community/gpt2
tags:
- generated_from_trainer
metrics:
- accuracy
model-index:
- name: gpt2-lang-ident
  results: []
pipeline_tag: text-classification
language:
- af
- am
- ar
- as
- az
- ba
- be
- bg
- bn
- ca
- ceb
- ckb
- cs
- cy
- da
- de
- dv
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- ka
- kk
- kn
- ku
- ky
- la
- lb
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- mt
- my
- nds
- ne
- nl
- nn
- no
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sah
- sd
- si
- sk
- sl
- sq
- sr
- sv
- sw
- ta
- te
- tg
- th
- tk
- tl
- tr
- tt
- ug
- uk
- ur
- vi
- yi
---

# gpt2-lang-ident

This model is a fine-tuned version of [openai-community/gpt2](https://huggingface.co/openai-community/gpt2) on sentences sampled from the `stanford-oval/ccnews` and `qanastek/EMEA-V3` datasets.
It achieves the following results on the evaluation set:
- Loss: 0.1210
- Accuracy: 0.9721

## Model description

This model is trained to predict the language of the input text.

## Intended uses & limitations

The model can predict the following 90 languages:
```
[
    "af", "am", "ar", "as", "az", "ba", "be", "bg", "bn", "ca",
    "ceb", "ckb", "cs", "cy", "da", "de", "dv", "el", "en", "eo",
    "es", "et", "eu", "fa", "fi", "fr", "fy", "ga", "gd", "gl",
    "gu", "he", "hi", "hr", "hu", "hy", "id", "is", "it", "ja",
    "ka", "kk", "kn", "ku", "ky", "la", "lb", "lt", "lv", "mg",
    "mk", "ml", "mn", "mr", "mt", "my", "nds", "ne", "nl", "nn",
    "no", "or", "pa", "pl", "ps", "pt", "ro", "ru", "sah", "sd",
    "si", "sk", "sl", "sq", "sr", "sv", "sw", "ta", "te", "tg",
    "th", "tk", "tl", "tr", "tt", "ug", "uk", "ur", "vi", "yi"
]
```

How to use:
```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          pipeline)

checkpoint = "nie3e/gpt2-lang-ident"

# Load the fine-tuned classifier and its tokenizer from the Hub.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# top_k=5 returns the five highest-scoring language labels.
pipe = pipeline(
    task="text-classification",
    model=model,
    tokenizer=tokenizer,
    top_k=5
)

# Polish input: "This is a model for language identification!"
result = pipe("To jest model służący do identyfikacji języka!")
print(result)
```
```
[[{'label': 'pl', 'score': 0.9999653100967407}, {'label': 'sr', 'score': 1.5228776646836195e-05}, {'label': 'hr', 'score': 1.057955432770541e-05}, {'label': 'bn', 'score': 1.590750912328076e-06}, {'label': 'cs', 'score': 1.3942196801508544e-06}]]
```

## Training and evaluation data

Training data (examples per language):

| Language | Count |
|:--------:|:-----:|
| bn | 1947 |
| ar | 1947 |
| vi | 1947 |
| uk | 1947 |
| kn | 1947 |
| mr | 1947 |
| id | 1947 |
| te | 1947 |
| no | 1947 |
| ru | 1947 |
| he | 1947 |
| az | 1947 |
| ca | 1946 |
| fa | 1946 |
| hi | 1946 |
| th | 1946 |
| tr | 1946 |
| mk | 1946 |
| ta | 1945 |
| sq | 1945 |
| ur | 1942 |
| gu | 1939 |
| ml | 1936 |
| is | 1738 |
| de | 1543 |
| da | 1521 |
| fi | 1461 |
| el | 1431 |
| nl | 1424 |
| fr | 1408 |
| cs | 1401 |
| es | 1397 |
| en | 1394 |
| lt | 1392 |
| hu | 1379 |
| pt | 1375 |
| lv | 1373 |
| it | 1360 |
| pl | 1355 |
| sk | 1355 |
| et | 1348 |
| sl | 1328 |
| sv | 1300 |
| bg | 1278 |
| mt | 1234 |
| ro | 1218 |
| kk | 1179 |
| hy | 1176 |
| or | 1112 |
| pa | 780 |
| sr | 744 |
| as | 735 |
| hr | 722 |
| ne | 626 |
| gl | 566 |
| ckb | 563 |
| ka | 560 |
| ug | 485 |
| ky | 453 |
| eu | 351 |
| ps | 311 |
| tl | 307 |
| fy | 290 |
| mn | 289 |
| si | 244 |
| cy | 214 |
| nn | 212 |
| ku | 195 |
| tg | 176 |
| am | 141 |
| tt | 121 |
| ja | 104 |
| lb | 93 |
| tk | 72 |
| be | 64 |
| sw | 45 |
| af | 44 |
| my | 40 |
| ceb | 35 |
| la | 33 |
| dv | 20 |
| ba | 19 |
| ga | 19 |
| eo | 19 |
| gd | 16 |
| mg | 15 |
| yi | 14 |
| sah | 14 |
| sd | 11 |
| nds | 11 |

Eval data (examples per language):

| Language | Count |
|:--------:|:-----:|
| te | 195 |
| mk | 195 |
| bn | 195 |
| uk | 195 |
| hi | 195 |
| ar | 195 |
| sq | 195 |
| kn | 195 |
| tr | 195 |
| ca | 195 |
| az | 195 |
| fa | 195 |
| ru | 195 |
| mr | 195 |
| id | 195 |
| no | 195 |
| vi | 195 |
| th | 195 |
| he | 195 |
| gu | 194 |
| ml | 194 |
| ta | 194 |
| ur | 194 |
| is | 174 |
| de | 154 |
| da | 152 |
| fi | 146 |
| el | 143 |
| nl | 142 |
| fr | 141 |
| es | 140 |
| cs | 140 |
| en | 139 |
| lt | 139 |
| hu | 138 |
| lv | 137 |
| pt | 137 |
| it | 136 |
| et | 135 |
| pl | 135 |
| sk | 135 |
| sl | 133 |
| sv | 130 |
| bg | 128 |
| mt | 123 |
| ro | 122 |
| hy | 118 |
| kk | 118 |
| or | 111 |
| pa | 78 |
| sr | 74 |
| as | 74 |
| hr | 72 |
| ne | 63 |
| gl | 57 |
| ckb | 56 |
| ka | 56 |
| ug | 49 |
| ky | 45 |
| eu | 35 |
| ps | 31 |
| tl | 31 |
| mn | 29 |
| fy | 29 |
| si | 24 |
| nn | 21 |
| cy | 21 |
| ku | 19 |
| tg | 18 |
| am | 14 |
| tt | 12 |
| ja | 10 |
| lb | 9 |
| tk | 7 |
| be | 6 |
| my | 4 |
| sw | 4 |
| af | 4 |
| ceb | 3 |
| la | 3 |
| ba | 2 |
| dv | 2 |
| eo | 2 |
| gd | 2 |
| ga | 2 |
| mg | 1 |
| sd | 1 |
| nds | 1 |
| yi | 1 |
| sah | 1 |

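
The per-language counts are heavily imbalanced, which matters when reading the overall accuracy: several languages (for example `mg`, `sd`, `nds`, `yi`, `sah`) have a single evaluation example. The snippet below is a minimal sketch of how the `[lang]: count` pairs could be parsed and the imbalance quantified. The helper name `parse_counts` is hypothetical and not part of this repository, and only a handful of training entries are copied in for brevity.

```python
import re

# A few training entries copied from this card; extend with the full
# list as needed.
raw = "[bn]: 1947 [en]: 1394 [pl]: 1355 [ja]: 104 [sd]: 11 [nds]: 11"


def parse_counts(text: str) -> dict[str, int]:
    """Parse '[lang]: count' pairs into a {lang: count} mapping.

    Hypothetical helper for illustration only.
    """
    return {lang: int(n) for lang, n in re.findall(r"\[(\w+)\]:\s*(\d+)", text)}


counts = parse_counts(raw)

# Ratio between the largest and smallest class in this sample.
imbalance_ratio = max(counts.values()) / min(counts.values())

print(counts["pl"])       # 1355
print(imbalance_ratio)    # 177.0
```

With a ratio this large between the most and least represented languages, per-language metrics are likely to be more informative than the single aggregate accuracy for low-resource languages.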
### Training procedure

GPU: RTX 3090 \
Training time: 1h 53min

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 8
- eval_batch_size: 4
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 10
- mixed_precision_training: Native AMP

### Training results

| Training Loss | Epoch | Step  | Validation Loss | Accuracy |
|:-------------:|:-----:|:-----:|:---------------:|:--------:|
| 0.2833        | 1.0   | 2812  | 0.2004          | 0.94     |
| 0.168         | 2.0   | 5625  | 0.1567          | 0.954    |
| 0.1131        | 3.0   | 8437  | 0.1429          | 0.9586   |
| 0.0832        | 4.0   | 11250 | 0.1257          | 0.967    |
| 0.0635        | 5.0   | 14062 | 0.1222          | 0.9682   |
| 0.0479        | 6.0   | 16875 | 0.1214          | 0.9704   |
| 0.0361        | 7.0   | 19687 | 0.1255          | 0.9712   |
| 0.0258        | 8.0   | 22500 | 0.1178          | 0.9712   |
| 0.0243        | 9.0   | 25312 | 0.1223          | 0.9724   |
| 0.0171        | 10.0  | 28120 | 0.1210          | 0.9721   |

### Framework versions

- Transformers 4.36.2
- Pytorch 2.1.2+cu121
- Datasets 2.16.1
- Tokenizers 0.15.0
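
The step counts in the training results follow from the hyperparameters listed in this card. The sketch below reproduces that arithmetic under two assumptions that the card does not state explicitly: that the training set size equals the sum of the per-language training counts (90,000), and that one optimizer step is taken per full group of accumulated batches.

```python
import math

# Values taken from the hyperparameter list in this card.
train_batch_size = 8
gradient_accumulation_steps = 4
num_epochs = 10

# Assumption: training set size is the sum of the per-language
# training counts listed in this card.
num_train_examples = 90_000

# Effective batch size seen by each optimizer step.
effective_batch_size = train_batch_size * gradient_accumulation_steps

# Mini-batches the data loader yields per epoch.
batches_per_epoch = math.ceil(num_train_examples / train_batch_size)

# Assumption: one optimizer step per full group of accumulated batches.
steps_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_steps = steps_per_epoch * num_epochs

print(effective_batch_size)  # 32
print(batches_per_epoch)     # 11250
print(steps_per_epoch)       # 2812
print(total_steps)           # 28120
```

Under these assumptions the effective batch size matches `total_train_batch_size: 32`, and the total of 28120 steps matches the final row of the results table.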