---
library_name: transformers
license: apache-2.0
tags:
- turkish
- tokenizer
- byte-pair-encoding
- nlp
- linguistics
---

# Model Card for Turkish Byte Pair Encoding Tokenizer

This repository provides a tokenizer specifically designed for the Turkish language. It includes 256,000 Turkish word roots and all Turkish suffixes in both lowercase and uppercase forms, extended with approximately 207,000 additional tokens learned via Byte Pair Encoding (BPE). The tokenizer is intended to improve tokenization quality for NLP tasks involving Turkish text.

## Model Details

### Model Description

This tokenizer was developed to handle the complex morphology and agglutinative nature of the Turkish language. By combining a comprehensive set of word roots and suffixes with BPE, it provides efficient tokenization that preserves linguistic structure and keeps the vocabulary size manageable for downstream tasks.

- **Developed by:** Ali Arda Fincan
- **Model type:** Tokenizer (Byte Pair Encoding & pre-defined Turkish words)
- **Language(s) (NLP):** Turkish
- **License:** Apache-2.0

### Model Sources

- **Repository:** umarigan/turkish_corpus_small

### Direct Use

This tokenizer can be used directly to tokenize Turkish text for tasks such as text classification, translation, or sentiment analysis. It handles the linguistic properties of Turkish efficiently, making it suitable for tasks requiring morphological analysis or general text processing.

### Downstream Use

The tokenizer can be extended with additional tokens or integrated into NLP pipelines for Turkish language processing, including model training and inference.

### Out-of-Scope Use

The tokenizer is not designed for non-Turkish languages or for domains whose specialized terminology is not covered by its vocabulary.

## Bias, Risks, and Limitations

Although this tokenizer is optimized for Turkish, biases may arise if the training data contains imbalances or stereotypes. It may also perform suboptimally on highly informal or domain-specific text.

### Recommendations

Users should evaluate the tokenizer on their own datasets and tasks to identify any biases or limitations. Additional preprocessing or token adjustments may be required for optimal results.

## How to Get Started with the Model

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("aliarda/turkish_tokenizer")

# Example usage: tokenize a Turkish sentence into subword tokens.
text = "Türkçe metin işleme için bir örnek."
tokens = tokenizer.tokenize(text)
print(tokens)
```
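
Beyond `tokenize`, calling the tokenizer directly returns the input IDs expected by `transformers` models, and `decode` maps IDs back to text. The snippet below is a minimal sketch of that round trip with the same checkpoint; exact token boundaries and IDs depend on the tokenizer's vocabulary and configuration.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("aliarda/turkish_tokenizer")

text = "Türkçe metin işleme için bir örnek."

# Calling the tokenizer returns input IDs (and an attention mask) ready for a model.
encoded = tokenizer(text)
print(encoded["input_ids"])

# Inspect the token strings behind the IDs, then decode back to text.
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
print(tokenizer.decode(encoded["input_ids"], skip_special_tokens=True))
```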