# Aranizer | Arabic Tokenization with SentencePiece & BPE
A collection of six Arabic tokenizers with different vocabulary sizes, based on SentencePiece and BPE encodings, suitable for training LLMs.
Aranizer is an Arabic BPE-based tokenizer designed for efficient and versatile tokenization.
The Aranizer tokenizers have achieved state-of-the-art results on the Arabic Tokenizers Leaderboard on Hugging Face.
The Aranizer tokenizer can be easily loaded using the `transformers` library from Hugging Face. Below is an example of how to load and use the tokenizer in your Python project:
```python
from transformers import AutoTokenizer

# Load the Aranizer tokenizer
tokenizer = AutoTokenizer.from_pretrained("riotu-lab/Aranizer-PBE-32k")

# Example usage
text = "اكتب النص العربي"
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print("Tokens:", tokens)
print("Token IDs:", token_ids)
```
## Citation
```bibtex
@article{koubaa2024arabiangpt,
  title     = {ArabianGPT: Native Arabic GPT-based Large Language Model},
  author    = {Koubaa, Anis and Ammar, Adel and Ghouti, Lahouari and Necar, Omer and Sibaee, Serry},
  year      = {2024},
  publisher = {Preprints}
}
```