Model Card for Kyrgyz BERT Tokenizer

This is a WordPiece-based BERT tokenizer trained specifically for the Kyrgyz language. It was developed to support Kyrgyz NLP applications, including text classification, translation, and morphological analysis. The tokenizer was trained on a large corpus from various Kyrgyz text sources.

Model Details

Model Description

  • Developed by: Metinov Adilet
  • Funded by: Self-funded (MetinLab)
  • Shared by: metinovadilet
  • Model type: WordPiece Tokenizer (BERT-style)
  • Language(s) (NLP): Kyrgyz (ky)
  • License: MIT
  • Finetuned from model: N/A (trained from scratch)

Model Sources

  • Repository: metinovadilet/bert-kyrgyz-tokenizer
  • Paper: N/A
  • Demo: N/A

Uses

Direct Use

This tokenizer can be used directly for NLP tasks such as:

  • Tokenizing Kyrgyz texts for training language models (see the batching sketch after this list)
  • Preparing data for Kyrgyz BERT training or fine-tuning
  • Kyrgyz text segmentation and wordpiece-based analysis
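
For language-model training, Kyrgyz texts are typically tokenized in batches with padding and truncation. The snippet below is a minimal sketch of that workflow; the sample sentences, max_length=32, and the PyTorch tensor format are illustrative choices, not values prescribed by this card.

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")

# Illustrative Kyrgyz sentences; substitute your own corpus
sentences = [
    "Бишкек - Кыргызстандын борбору.",
    "Кыргыз тили - түрк тилдеринин бири.",
]

# Pad/truncate to a fixed length so the batch can be fed to a model directly
batch = tokenizer(
    sentences,
    padding="max_length",
    truncation=True,
    max_length=32,          # arbitrary length chosen for this example
    return_tensors="pt",    # PyTorch tensors; use "np" for NumPy
)

print(batch["input_ids"].shape)       # (2, 32)
print(batch["attention_mask"].shape)  # (2, 32)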

Downstream Use

  • Can be used as the tokenizer for BERT-based models trained on Kyrgyz text (a from-scratch configuration is sketched below)
  • Supports various NLP applications such as sentiment analysis, morphological modeling, and machine translation
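
As a sketch of the downstream path, the snippet below wires the tokenizer into a small BERT masked-language-model configuration for from-scratch training. All architecture sizes here (hidden_size, layer and head counts) are illustrative assumptions, not the configuration of any released MetinLab model.

from transformers import BertConfig, BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")

# vocab_size and pad_token_id must match the tokenizer;
# the remaining sizes are arbitrary small-model choices
config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=256,
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=1024,
    pad_token_id=tokenizer.pad_token_id,
)

model = BertForMaskedLM(config)
print(f"Parameters: {model.num_parameters():,}")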

Out-of-Scope Use

  • This tokenizer is not optimized for multilingual text; it is designed for Kyrgyz-only corpora.
  • It may not work well for transliterated or mixed-script text (e.g., text combining Latin and Cyrillic scripts), as the check below illustrates.
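
One quick way to see the mixed-script limitation is to compare how much of a text maps to the [UNK] token under Cyrillic Kyrgyz versus a Latin transliteration. The unk_ratio helper below is a hypothetical function written for this example, and the exact ratios depend on the trained vocabulary; treat the output as a rough signal, not a benchmark.

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")

def unk_ratio(text: str) -> float:
    """Fraction of tokens mapped to [UNK] - a rough out-of-vocabulary signal."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    if not ids:
        return 0.0
    return ids.count(tokenizer.unk_token_id) / len(ids)

print(unk_ratio("Бул кыргыз тилинде жазылган текст."))   # Cyrillic Kyrgyz: expected low
print(unk_ratio("Bul kyrgyz tilinde jazylgan tekst."))   # Latin transliteration: may be higher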

Bias, Risks, and Limitations

  • The tokenizer is limited by its training corpus, meaning rare words, dialectal forms, and domain-specific terms may not be well represented.
  • As with most tokenizers, it may reproduce biases from the source text, particularly around gender, ethnicity, or socio-political context.

Recommendations

Users should be aware of potential biases and evaluate performance for their specific application; a simple coverage check is sketched after the getting-started example below. If biases or inefficiencies are found, fine-tuning or training with a more diverse corpus is recommended.

How to Get Started with the Model

Use the code below to get started with the model.

from transformers import BertTokenizerFast

# Load the pretrained Kyrgyz WordPiece tokenizer from the Hugging Face Hub
tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")

text = "Бул кыргыз тилинде жазылган текст."

# return_offsets_mapping gives character spans per token (fast tokenizers only)
tokens = tokenizer(text, return_offsets_mapping=True)

print("Input Text:", text)
print("Tokens:", tokenizer.convert_ids_to_tokens(tokens["input_ids"]))
print("Token IDs:", tokens["input_ids"])
print("Offsets:", tokens["offset_mapping"])

Training Details

Training Data: not disclosable.

Technical Specifications

Model Architecture and Objective

  • Architecture: WordPiece-based BERT tokenizer
  • Objective: Efficient tokenization for Kyrgyz NLP applications (vocabulary details can be inspected as shown below)
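
The vocabulary size and special tokens are not listed in this card, but they can be inspected directly, as in this short sketch.

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")

# Inspect the learned WordPiece vocabulary and BERT-style special tokens
print("Vocab size:", tokenizer.vocab_size)
print("Special tokens:", tokenizer.all_special_tokens)  # e.g. [PAD], [UNK], [CLS], [SEP], [MASK]
print("Sample entries:", list(tokenizer.get_vocab())[:10])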

Compute Infrastructure

Hardware

  • GPU: NVIDIA RTX 3090 (24GB VRAM)
  • CPU: Intel Core i5-13400F

Software

  • Python 3.10
  • Transformers (Hugging Face)
  • Tokenizers (Hugging Face)

Citation

If you use this tokenizer, please cite:

@misc{bert-kyrgyz-tokenizer,
  author = {Metinov Adilet},
  title  = {BERT Kyrgyz Tokenizer},
  year   = {2025},
  url    = {https://huggingface.co/metinovadilet/bert-kyrgyz-tokenizer},
  note   = {Trained at MetinLab}
}

Model Card Contact

For questions or issues, contact MetinLab via email: [email protected]

This tokenizer was made in collaboration with UlutsoftLLC.
