Improved LLaMA 2 Tokenizer with Persian Language Support

Model Description

This tokenizer is an improved version of the LLaMA 2 tokenizer, specifically enhanced to provide better support for the Persian language. It combines the original LLaMA 2 tokenizer with a custom tokenizer trained on the Persian Wikipedia corpus, resulting in improved tokenization for Persian text while maintaining support for other languages.

Key Features

  • Enhanced support for Persian language tokenization
  • Maintained multilingual capabilities of the original LLaMA 2 tokenizer
  • Improved handling of Persian-specific characters and word structures
  • Larger vocabulary size to accommodate Persian tokens (a quick vocabulary check follows this list)
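
As a quick sanity check of the expanded vocabulary, the snippet below loads the tokenizer and prints its size; it should be close to the final vocabulary size of 36954 documented under Training Procedure (the exact number reported by len() can differ slightly depending on how added special tokens are counted).

from transformers import AutoTokenizer

# Load the merged tokenizer from the Hub (same repository as in the Usage section below)
tokenizer = AutoTokenizer.from_pretrained("amirakhlaghiqqq/llama2-persian-tokenizer")

# The merged vocabulary is noticeably larger than the original LLaMA 2
# vocabulary of 32000 tokens; this card reports 36954 after merging.
print(len(tokenizer))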

Training Data

The tokenizer was created using the following steps:

  1. A separate BPE tokenizer with 5000 merges was trained on the Persian Wikipedia corpus to capture Persian-specific tokenization patterns (an illustrative training sketch follows this list).
  2. This Persian-specific tokenizer was then merged with the original LLaMA 2 tokenizer.
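
The training script itself is not part of this card. As an illustration only, a 5000-symbol BPE model could be trained on a plain-text dump of Persian Wikipedia with SentencePiece roughly as follows; the corpus path and training options here are assumptions, not the settings actually used.

import sentencepiece as spm

# Train a small BPE model on a plain-text Persian Wikipedia dump.
# "fa_wiki.txt" is a placeholder path; corpus preparation is not described in this card.
spm.SentencePieceTrainer.train(
    input="fa_wiki.txt",
    model_prefix="persian_bpe",
    vocab_size=5000,
    model_type="bpe",
    character_coverage=1.0,  # keep all Persian characters in the vocabulary
)

# The result is persian_bpe.model / persian_bpe.vocab, which can then be
# merged into the LLaMA 2 tokenizer (see Training Procedure below).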

Training Procedure

  1. Persian Wikipedia Tokenizer Training:

    • Corpus: Persian Wikipedia dump
    • Tokenization algorithm: BPE
    • Vocabulary size: 5000
  2. Merging with the LLaMA 2 Tokenizer (see the merge sketch after this list):

    • Base tokenizer: LLaMA 2 tokenizer
    • Final vocabulary size: 36954
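
The merge script is likewise not included in the card. A common approach, used by public LLaMA vocabulary-extension scripts, is to append the new SentencePiece pieces to the LLaMA 2 model proto while skipping pieces that already exist; the sketch below follows that pattern, with placeholder paths, and is not necessarily the exact procedure used here. The resulting size is consistent with the reported 36954 tokens (the 32000 base tokens plus the non-overlapping Persian pieces).

import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2
from transformers import LlamaTokenizer

# Load the base LLaMA 2 tokenizer (gated repository, access required) and the
# newly trained Persian BPE model; both identifiers are placeholders.
llama_tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
persian_sp = spm.SentencePieceProcessor(model_file="persian_bpe.model")

# Parse both tokenizers as SentencePiece model protos.
llama_proto = sp_pb2.ModelProto()
llama_proto.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())
persian_proto = sp_pb2.ModelProto()
persian_proto.ParseFromString(persian_sp.serialized_model_proto())

# Append Persian pieces that the LLaMA 2 vocabulary does not already contain.
existing = {p.piece for p in llama_proto.pieces}
for piece in persian_proto.pieces:
    if piece.piece not in existing:
        new_piece = sp_pb2.ModelProto.SentencePiece()
        new_piece.piece = piece.piece
        new_piece.score = 0.0
        llama_proto.pieces.append(new_piece)

# Save the merged model and wrap it as a Hugging Face tokenizer.
with open("merged_llama2_persian.model", "wb") as f:
    f.write(llama_proto.SerializeToString())

merged = LlamaTokenizer(vocab_file="merged_llama2_persian.model")
merged.save_pretrained("llama2-persian-tokenizer")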

Usage

To use this tokenizer with the Hugging Face Transformers library:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("amirakhlaghiqqq/llama2-persian-tokenizer")

# Example usage: tokenize a Persian sentence
text = "این یک مثال به زبان فارسی است."  # "This is an example in the Persian language."
tokens = tokenizer(text)
print(tokens)  # dictionary with input_ids and attention_mask
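
To see the effect on Persian text, you can compare the number of tokens this tokenizer produces against the original LLaMA 2 tokenizer. The sketch below assumes you have access to the gated meta-llama/Llama-2-7b-hf repository for the baseline; the exact counts depend on the sentence.

from transformers import AutoTokenizer

persian_tok = AutoTokenizer.from_pretrained("amirakhlaghiqqq/llama2-persian-tokenizer")
# Baseline comparison requires access to the gated LLaMA 2 repository.
base_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

text = "این یک مثال به زبان فارسی است."  # "This is an example in the Persian language."

# The merged tokenizer should need fewer tokens for Persian text, since
# frequent Persian subwords are now single vocabulary entries.
print("merged  :", len(persian_tok.tokenize(text)))
print("original:", len(base_tok.tokenize(text)))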