# Model Card for Kyrgyz BERT Tokenizer
This is a WordPiece-based BERT tokenizer trained specifically for the Kyrgyz language. It was developed to support Kyrgyz NLP applications, including text classification, translation, and morphological analysis. The tokenizer was trained on a large corpus from various Kyrgyz text sources.
## Model Details

### Model Description
- Developed by: Metinov Adilet
- Funded by: Self-funded (MetinLab)
- Shared by: metinovadilet
- Model type: WordPiece Tokenizer (BERT-style)
- Language(s) (NLP): Kyrgyz (ky)
- License: MIT
- Finetuned from model: N/A (trained from scratch)
### Model Sources
- Repository: metinovadilet/bert-kyrgyz-tokenizer
- Paper: N/A
- Demo: N/A
## Uses

### Direct Use
This tokenizer can be used directly for NLP tasks such as:
- Tokenizing Kyrgyz texts for training language models
- Preparing data for Kyrgyz BERT training or fine-tuning
- Kyrgyz text segmentation and wordpiece-based analysis
### Downstream Use
- Can be used as the tokenizer for BERT-based models trained on Kyrgyz text (see the sketch below)
- Supports various NLP applications such as sentiment analysis, morphological modeling, and machine translation
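As a sketch of this downstream workflow, the snippet below pairs the tokenizer with a BERT encoder for sequence classification. The checkpoint name `metinovadilet/bert-kyrgyz` and the two-label setup are placeholder assumptions, not a published model; substitute your own Kyrgyz BERT weights.

```python
from transformers import BertTokenizerFast, BertForSequenceClassification

# Load the tokenizer from this repository.
tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")

# Placeholder checkpoint name -- replace with your own Kyrgyz BERT weights.
model = BertForSequenceClassification.from_pretrained(
    "metinovadilet/bert-kyrgyz", num_labels=2
)

# Tokenize a small batch; padding/truncation yield fixed-size tensors.
batch = tokenizer(
    ["Бул сонун кино болду.", "Кино такыр жаккан жок."],
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt",
)

outputs = model(**batch)
print(outputs.logits.shape)  # (batch_size, num_labels)
```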
### Out-of-Scope Use
- This tokenizer is not optimized for multilingual text; it is designed for Kyrgyz-only corpora.
- It may not work well for transliterated or mixed-script text (e.g., combining Latin and Cyrillic scripts); a quick check is sketched below.
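A quick way to observe this limitation is to measure how often the tokenizer falls back to `[UNK]` on a given input. This is a minimal sketch, not part of the released tooling; the `unk_rate` helper is introduced here purely for illustration.

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")

def unk_rate(text: str) -> float:
    """Fraction of tokens mapped to [UNK] -- a rough vocabulary-coverage signal."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    if not ids:
        return 0.0
    return ids.count(tokenizer.unk_token_id) / len(ids)

# Compare native Cyrillic Kyrgyz against a Latin transliteration.
print(unk_rate("Бул кыргыз тилинде жазылган текст."))
print(unk_rate("Bul kyrgyz tilinde jazylgan tekst."))
```

A noticeably higher `[UNK]` rate on the transliterated line would indicate that the vocabulary does not cover that script well.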
## Bias, Risks, and Limitations
- The tokenizer is limited by the training corpus, meaning rare words, dialectal forms, and domain-specific terms may not be well represented.
- As with most tokenizers, it may exhibit biases from the source text, particularly around gender, ethnicity, or socio-political context.

### Recommendations
Users should be aware of potential biases and evaluate performance for their specific application; one simple diagnostic is sketched below. If biases or inefficiencies are found, fine-tuning or training with a more diverse corpus is recommended.
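One such diagnostic is subword fertility, the average number of WordPiece tokens per whitespace-separated word: unusually high fertility on your domain text suggests the vocabulary under-represents it. The sketch below is illustrative only; the `fertility` helper is not part of the released code.

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")

def fertility(sentences: list[str]) -> float:
    """Average number of WordPiece tokens per whitespace-separated word."""
    n_tokens = n_words = 0
    for sentence in sentences:
        ids = tokenizer(sentence, add_special_tokens=False)["input_ids"]
        n_tokens += len(ids)
        n_words += len(sentence.split())
    return n_tokens / max(n_words, 1)

sample = ["Бул кыргыз тилинде жазылган текст."]
print(f"Fertility: {fertility(sample):.2f}")
```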
## How to Get Started with the Model
Use the code below to get started with the tokenizer.
```python
from transformers import BertTokenizerFast

# Load the pretrained Kyrgyz tokenizer from the Hugging Face Hub.
tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")

text = "Бул кыргыз тилинде жазылган текст."

# return_offsets_mapping gives the character span of each token.
tokens = tokenizer(text, return_offsets_mapping=True)

print("Input Text:", text)
print("Tokens:", tokenizer.convert_ids_to_tokens(tokens["input_ids"]))
print("Token IDs:", tokens["input_ids"])
print("Offsets:", tokens["offset_mapping"])
```
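Note that `return_offsets_mapping` is only supported by fast (Rust-backed) tokenizer classes such as `BertTokenizerFast`; the pure-Python `BertTokenizer` raises an error if it is requested.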
## Training Details

### Training Data
The training corpus is not disclosable.
## Technical Specifications
### Model Architecture and Objective
- Architecture: WordPiece-based BERT tokenizer
- Objective: Efficient tokenization for Kyrgyz NLP applications
### Compute Infrastructure
#### Hardware
- GPU: NVIDIA RTX 3090 (24 GB VRAM)
- CPU: Intel Core i5-13400F

#### Software
- Python 3.10
- Transformers (Hugging Face)
- Tokenizers (Hugging Face)
## Citation
If you use this tokenizer, please cite:

```bibtex
@misc{bert-kyrgyz-tokenizer,
  author = {Metinov Adilet},
  title  = {BERT Kyrgyz Tokenizer},
  year   = {2025},
  url    = {https://huggingface.co/metinovadilet/bert-kyrgyz-tokenizer},
  note   = {Trained at MetinLab}
}
```
## Model Card Contact
For questions or issues, contact MetinLab by email: [email protected]