# Minimind Odia Tokenizer
This is the Minimind Odia Tokenizer, a Byte-Pair Encoding (BPE) tokenizer trained specifically for the Odia language. It was trained on the `OdiaGenAIdata/pre_train_odia_data_processed` dataset and is designed to handle Odia text for natural language processing (NLP) tasks.
The tokenizer is available for easy integration into your projects directly from Hugging Face.
## How to Use
To use this tokenizer in your own project, follow the steps below:
### 1. Install the Necessary Libraries

Ensure that you have the Hugging Face `transformers` and `tokenizers` libraries installed:

```bash
pip install transformers tokenizers
```
### 2. Load the Tokenizer

```python
from transformers import AutoTokenizer

# Load the tokenizer from Hugging Face
tokenizer = AutoTokenizer.from_pretrained("shantipriya/minimind_odia_tokenizer")

# Example Odia text
text = "ତୁମେ କେଉଁଠାରୁ ଆସିଛ?"

# Tokenize the input text
encoded_text = tokenizer(text)

# Print tokenized output
print("Tokenized text:", encoded_text)
```
### 3. Tokenizer Features
- **Vocabulary Size**: The tokenizer has a vocabulary of 150,000 tokens, optimized for handling a diverse range of Odia words.
- **Special Tokens**: Includes essential special tokens such as `<unk>`, `<s>`, and `</s>` for managing unknown tokens and marking the beginning and end of sequences, respectively.
- **Preprocessing**: The tokenizer employs a byte-level pre-tokenizer to handle byte-level encoding, ensuring efficient tokenization for various Odia texts.
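The byte-level BPE pipeline described above can be illustrated with the `tokenizers` library. The sketch below trains a tiny tokenizer from scratch on two sample sentences; the corpus and vocabulary size here are placeholders for illustration, not the actual training setup of this tokenizer:

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# A small byte-level BPE tokenizer, analogous to the pipeline described above.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

# Train on a toy Odia corpus with the same special tokens as the model card lists.
trainer = trainers.BpeTrainer(
    vocab_size=500,
    special_tokens=["<unk>", "<s>", "</s>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
corpus = ["ତୁମେ କେଉଁଠାରୁ ଆସିଛ?", "ମୁଁ ପୃଥିବୀରୁ ଆସିଛି"]
tokenizer.train_from_iterator(corpus, trainer)

# Byte-level BPE round-trips text losslessly: encode then decode.
enc = tokenizer.encode("ତୁମେ କେଉଁଠାରୁ ଆସିଛ?")
print(enc.tokens)
print(tokenizer.decode(enc.ids))
```

Because every UTF-8 byte is covered by the byte-level alphabet, decoding always reconstructs the original text exactly, which is why no Odia character can ever fall back to `<unk>`.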
### 4. Example Usage
Here’s how you can use the tokenizer for encoding and decoding Odia text:
```python
from transformers import AutoTokenizer

# Load the pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained("shantipriya/minimind_odia_tokenizer")

# Example Odia messages
messages = [
    {"role": "system", "content": "ଆପଣ ଜଣେ ସହାୟକ ଏବଂ ଆପଣଙ୍କର ଭଲ ଐତିହାସିକ ଜ୍ଞାନ ଅଛି।"},
    {"role": "user", "content": "ତୁମେ କେଉଁଠାରୁ ଆସିଛ?"},
    {"role": "assistant", "content": "ମୁଁ ପୃଥିବୀରୁ ଆସିଛି"}
]
```
Example output:

```text
Tokenized input (ID list): [414, 329, 225, 316, 221, 318, 309, 274, 531, 222, 303, 596, 1506, 223, 237, 221, 238, 223, 226, 272, 222, 312, 221, 227, 676, 295, 276, 229, 231, 225, 232, 225, 292, 275, 300, 221, 224, 229, 349, 223, 255, 33, 249, 463, 236, 299, 247, 223, 230, 246, 224, 229, 349, 223, 255, 223]
Tokenization time: 0.0548 seconds
Decoded output: ଆପଣ ଜଣେ ସହାୟକ ଏବଂ ଆପଣଙ୍କର ଭଲ ଐତିହାସିକ ଜ୍ଞାନ ଅଛି। ତୁମେ କେଉଁଠାରୁ ଆସିଛ? ମୁଁ ପୃଥିବୀରୁ ଆସିଛି
Number of tokens: 56
```

Comparison of original and decoded text:

```text
Original Text: ଆପଣ ଜଣେ ସହାୟକ ଏବଂ ଆପଣଙ୍କର ଭଲ ଐତିହାସିକ ଜ୍ଞାନ ଅଛି। ତୁମେ କେଉଁଠାରୁ ଆସିଛ? ମୁଁ ପୃଥିବୀରୁ ଆସିଛି
Decoded Text:  ଆପଣ ଜଣେ ସହାୟକ ଏବଂ ଆପଣଙ୍କର ଭଲ ଐତିହାସିକ ଜ୍ଞାନ ଅଛି। ତୁମେ କେଉଁଠାରୁ ଆସିଛ? ମୁଁ ପୃଥିବୀରୁ ଆସିଛି
```
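Note that the decoded output is simply the three message contents joined by single spaces. The sketch below assumes that the messages are flattened with a plain space-join before tokenization; the actual preprocessing or chat template may differ:

```python
# Hypothetical flattening of the chat messages into a single string
# (assumed here to be a plain space-join; the real template may differ).
messages = [
    {"role": "system", "content": "ଆପଣ ଜଣେ ସହାୟକ ଏବଂ ଆପଣଙ୍କର ଭଲ ଐତିହାସିକ ଜ୍ଞାନ ଅଛି।"},
    {"role": "user", "content": "ତୁମେ କେଉଁଠାରୁ ଆସିଛ?"},
    {"role": "assistant", "content": "ମୁଁ ପୃଥିବୀରୁ ଆସିଛି"},
]
flat_text = " ".join(m["content"] for m in messages)
print(flat_text)  # matches the "Decoded output" line above
```

Passing `flat_text` to `tokenizer(...)` would then produce a single ID list like the one shown.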
### 5. Dataset Used
The tokenizer was trained on the `OdiaGenAIdata/pre_train_odia_data_processed` dataset, which contains a large corpus of Odia text suitable for training NLP models.
### 6. Tokenizer Comparison
| Tokenizer | Vocabulary Size | Tokenization Time (seconds) | Number of Tokens | Original Text | Decoded Output | Tokenized Output (Odia Characters) |
|---|---|---|---|---|---|---|
| shantipriya/minimind_odia_tokenizer | 150,000 | 0.563647 | 181 | ଓଡ଼ିଆ ଭାଷା ଏକ ଇଣ୍ଡୋ-ଆର୍ୟାନ୍ ଭାଷା... | ଓଡ଼ିଆ ଭାଷା ଏକ ଇଣ୍ଡୋ-ଆର୍ୟାନ୍ ଭାଷା... | `['Ċ', 'à¬ĵଡ', '଼ି', 'à¬Ĩ', 'Ġà¬Ń', ...]` |
| ai4bharat/indic-bert | 200,000 | 2.369656 | 110 | ଓଡ଼ିଆ ଭାଷା ଏକ ଇଣ୍ଡୋ-ଆର୍ୟାନ୍ ଭାଷା... | ଓଡଆ ଭଷ ଏକ ଇଣଡ-ଆରୟନ ଭଷ... | `['[CLS]', '▁ଓ', 'ଡ', 'ଆ', '▁ଭ', 'ଷ', ...]` |
| facebook/m2m100_418M | 128,104 | 0.358011 | 90 | ଓଡ଼ିଆ ଭାଷା ଏକ ଇଣ୍ଡୋ-ଆର୍ୟାନ୍ ଭାଷା... | ଓଡ଼ିଆ ଭାଷା ଏକ ଇଣ୍ଡୋ-ଆର୍ୟାନ୍ ଭାଷା... | `['en', '▁ଓଡ଼ିଆ', '▁ଭାଷ', 'ା', '▁ଏକ', ...]` |
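The pieces in the last column for the Minimind tokenizer look like mojibake (e.g. `'à¬ĵ'`) because byte-level BPE maps each UTF-8 byte to a printable stand-in character, following the scheme popularized by GPT-2. The sketch below reimplements that byte-to-character mapping for illustration; it is not the tokenizer's own code:

```python
def bytes_to_unicode():
    """Map every byte 0-255 to a printable Unicode character (GPT-2 scheme)."""
    # Printable Latin-1 bytes map to themselves ...
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    # ... the remaining (non-printable) bytes are shifted up to U+0100 and beyond.
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

mapping = bytes_to_unicode()

# A single Odia character is three UTF-8 bytes, so it surfaces
# as three stand-in characters in the token strings.
visible = "".join(mapping[b] for b in "ଓ".encode("utf-8"))
print(visible)  # → 'à¬ĵ', the prefix seen in the table above
```

This is purely a display convention: the byte-level decoder inverts the mapping, which is why the decoded output in the table is identical to the original text.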
#### a. Vocabulary Size
- shantipriya/minimind_odia_tokenizer: 150,000 tokens (Optimized for Odia)
- ai4bharat/indic-bert: 200,000 tokens (Supports multiple Indic languages)
- facebook/m2m100_418M: 128,104 tokens (Multilingual support)
#### b. Tokenization Time
- facebook/m2m100_418M: Fastest at 0.3580s
- shantipriya/minimind_odia_tokenizer: Moderate at 0.5636s
- ai4bharat/indic-bert: Slowest at 2.3697s
#### c. Number of Tokens
- shantipriya/minimind_odia_tokenizer: 181 tokens (Granular tokenization)
- ai4bharat/indic-bert: 110 tokens (Compact tokenization)
- facebook/m2m100_418M: 90 tokens (Efficient tokenization for multilingual data)
#### d. Best for Specific Use Cases

- **For Odia text**: shantipriya/minimind_odia_tokenizer is the best choice due to its precision in tokenizing Odia words.
- **For multilingual support**: facebook/m2m100_418M is ideal for handling many languages efficiently.
- **For Indic languages**: ai4bharat/indic-bert offers robust support for multiple Indic languages but is slower than the other two.
### 7. License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
## Contributors
## Acknowledgements
- **MiniMind**: Repository and scripts related to the project.
- **OdiaGenAI**: The dataset used for training the tokenizer is part of the Odia Generative AI project.
- **AMD**: The tokenizer was trained on an AMD MI 250 machine to optimize performance and efficiency.
For questions or issues, feel free to open an issue or contact the repository maintainer.