# Minimind Odia Tokenizer
This is the Minimind Odia Tokenizer, a Byte-Pair Encoding (BPE) tokenizer trained specifically for the Odia language. It was trained on the `OdiaGenAIdata/pre_train_odia_data_processed` dataset and is designed to handle Odia text for natural language processing (NLP) tasks.
The tokenizer is available for easy integration into your projects directly from Hugging Face.
## How to Use
To use this tokenizer in your own project, follow the steps below:
### 1. Install the Necessary Libraries

Ensure that you have the Hugging Face `transformers` and `tokenizers` libraries installed:

```bash
pip install transformers tokenizers
```
### 2. Load the Tokenizer

```python
from transformers import AutoTokenizer

# Load the tokenizer from Hugging Face
tokenizer = AutoTokenizer.from_pretrained("shantipriya/minimind_odia_tokenizer")

# Example Odia text
text = "ତୁମେ କେଉଁଠାରୁ ଆସିଛ?"

# Tokenize the input text
encoded_text = tokenizer(text)

# Print tokenized output
print("Tokenized text:", encoded_text)
```
### 3. Tokenizer Features
- **Vocabulary Size**: The tokenizer has a vocabulary of 150,000 tokens, optimized for handling a diverse range of Odia words.
- **Special Tokens**: Includes essential special tokens such as `<unk>`, `<s>`, and `</s>` for managing unknown tokens and marking the beginning and end of sequences, respectively.
- **Preprocessing**: The tokenizer employs a byte-level pre-tokenizer to handle byte-level encoding, ensuring efficient tokenization for various Odia texts.
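The byte-level BPE pipeline described above can be illustrated with the `tokenizers` library. The sketch below trains a tiny tokenizer from scratch on two sample sentences; the corpus and vocabulary size here are placeholders for illustration, not the actual training setup of this tokenizer:

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# A small byte-level BPE tokenizer, analogous to the pipeline described above.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

# Train on a toy Odia corpus with the same special tokens as the model card lists.
trainer = trainers.BpeTrainer(
    vocab_size=500,
    special_tokens=["<unk>", "<s>", "</s>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
corpus = ["ତୁମେ କେଉଁଠାରୁ ଆସିଛ?", "ମୁଁ ପୃଥିବୀରୁ ଆସିଛି"]
tokenizer.train_from_iterator(corpus, trainer)

# Byte-level BPE round-trips text losslessly: encode then decode.
enc = tokenizer.encode("ତୁମେ କେଉଁଠାରୁ ଆସିଛ?")
print(enc.tokens)
print(tokenizer.decode(enc.ids))
```

Because every UTF-8 byte is covered by the byte-level alphabet, decoding always reconstructs the original text exactly, which is why no Odia character can ever fall back to `<unk>`.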
### 4. Example Usage
Here’s how you can use the tokenizer for encoding and decoding Odia text:
```python
from transformers import AutoTokenizer

# Load the pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained("shantipriya/minimind_odia_tokenizer")

# Example Odia messages
messages = [
    {"role": "system", "content": "ଆପଣ ଜଣେ ସହାୟକ ଏବଂ ଆପଣଙ୍କର ଭଲ ଐତିହାସିକ ଜ୍ଞାନ ଅଛି।"},
    {"role": "user", "content": "ତୁମେ କେଉଁଠାରୁ ଆସିଛ?"},
    {"role": "assistant", "content": "ମୁଁ ପୃଥିବୀରୁ ଆସିଛି"}
]
```
Example output:

```text
Tokenized input (ID list): [414, 329, 225, 316, 221, 318, 309, 274, 531, 222, 303, 596, 1506, 223, 237, 221, 238, 223, 226, 272, 222, 312, 221, 227, 676, 295, 276, 229, 231, 225, 232, 225, 292, 275, 300, 221, 224, 229, 349, 223, 255, 33, 249, 463, 236, 299, 247, 223, 230, 246, 224, 229, 349, 223, 255, 223]
Tokenization time: 0.0548 seconds
Decoded output: ଆପଣ ଜଣେ ସହାୟକ ଏବଂ ଆପଣଙ୍କର ଭଲ ଐତିହାସିକ ଜ୍ଞାନ ଅଛି। ତୁମେ କେଉଁଠାରୁ ଆସିଛ? ମୁଁ ପୃଥିବୀରୁ ଆସିଛି
Number of tokens: 56
```

Comparison of original and decoded text:

```text
Original Text: ଆପଣ ଜଣେ ସହାୟକ ଏବଂ ଆପଣଙ୍କର ଭଲ ଐତିହାସିକ ଜ୍ଞାନ ଅଛି। ତୁମେ କେଉଁଠାରୁ ଆସିଛ? ମୁଁ ପୃଥିବୀରୁ ଆସିଛି
Decoded Text:  ଆପଣ ଜଣେ ସହାୟକ ଏବଂ ଆପଣଙ୍କର ଭଲ ଐତିହାସିକ ଜ୍ଞାନ ଅଛି। ତୁମେ କେଉଁଠାରୁ ଆସିଛ? ମୁଁ ପୃଥିବୀରୁ ଆସିଛି
```
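Note that the decoded output is simply the three message contents joined by single spaces. The sketch below assumes that the messages are flattened with a plain space-join before tokenization; the actual preprocessing or chat template may differ:

```python
# Hypothetical flattening of the chat messages into a single string
# (assumed here to be a plain space-join; the real template may differ).
messages = [
    {"role": "system", "content": "ଆପଣ ଜଣେ ସହାୟକ ଏବଂ ଆପଣଙ୍କର ଭଲ ଐତିହାସିକ ଜ୍ଞାନ ଅଛି।"},
    {"role": "user", "content": "ତୁମେ କେଉଁଠାରୁ ଆସିଛ?"},
    {"role": "assistant", "content": "ମୁଁ ପୃଥିବୀରୁ ଆସିଛି"},
]
flat_text = " ".join(m["content"] for m in messages)
print(flat_text)  # matches the "Decoded output" line above
```

Passing `flat_text` to `tokenizer(...)` would then produce a single ID list like the one shown.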
### 5. Dataset Used
The tokenizer was trained on the `OdiaGenAIdata/pre_train_odia_data_processed` dataset, which contains a large corpus of Odia text suitable for training NLP models.
### 6. Tokenizer Comparison
| Tokenizer | Vocabulary Size | Tokenization Time (seconds) | Number of Tokens | Original Text | Decoded Output | Tokenized Output (Odia Characters) |
|---|---|---|---|---|---|---|
| shantipriya/minimind_odia_tokenizer | 150,000 | 0.563647 | 181 | ଓଡ଼ିଆ ଭାଷା ଏକ ଇଣ୍ଡୋ-ଆର୍ୟାନ୍ ଭାଷା... | ଓଡ଼ିଆ ଭାଷା ଏକ ଇଣ୍ଡୋ-ଆର୍ୟାନ୍ ଭାଷା... | `['Ċ', 'à¬ĵଡ', '଼ି', 'à¬Ĩ', 'Ġà¬Ń', ...]` |
| ai4bharat/indic-bert | 200,000 | 2.369656 | 110 | ଓଡ଼ିଆ ଭାଷା ଏକ ଇଣ୍ଡୋ-ଆର୍ୟାନ୍ ଭାଷା... | ଓଡଆ ଭଷ ଏକ ଇଣଡ-ଆରୟନ ଭଷ... | `['[CLS]', '▁ଓ', 'ଡ', 'ଆ', '▁ଭ', 'ଷ', ...]` |
| facebook/m2m100_418M | 128,104 | 0.358011 | 90 | ଓଡ଼ିଆ ଭାଷା ଏକ ଇଣ୍ଡୋ-ଆର୍ୟାନ୍ ଭାଷା... | ଓଡ଼ିଆ ଭାଷା ଏକ ଇଣ୍ଡୋ-ଆର୍ୟାନ୍ ଭାଷା... | `['en', '▁ଓଡ଼ିଆ', '▁ଭାଷ', 'ା', '▁ଏକ', ...]` |
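The pieces in the last column for the Minimind tokenizer look like mojibake (e.g. `'à¬ĵ'`) because byte-level BPE maps each UTF-8 byte to a printable stand-in character, following the scheme popularized by GPT-2. The sketch below reimplements that byte-to-character mapping for illustration; it is not the tokenizer's own code:

```python
def bytes_to_unicode():
    """Map every byte 0-255 to a printable Unicode character (GPT-2 scheme)."""
    # Printable Latin-1 bytes map to themselves ...
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    # ... the remaining (non-printable) bytes are shifted up to U+0100 and beyond.
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

mapping = bytes_to_unicode()

# A single Odia character is three UTF-8 bytes, so it surfaces
# as three stand-in characters in the token strings.
visible = "".join(mapping[b] for b in "ଓ".encode("utf-8"))
print(visible)  # → 'à¬ĵ', the prefix seen in the table above
```

This is purely a display convention: the byte-level decoder inverts the mapping, which is why the decoded output in the table is identical to the original text.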
#### a. Vocabulary Size
- shantipriya/minimind_odia_tokenizer: 150,000 tokens (Optimized for Odia)
- ai4bharat/indic-bert: 200,000 tokens (Supports multiple Indic languages)
- facebook/m2m100_418M: 128,104 tokens (Multilingual support)
#### b. Tokenization Time
- facebook/m2m100_418M: Fastest at 0.3580s
- shantipriya/minimind_odia_tokenizer: Moderate at 0.5636s
- ai4bharat/indic-bert: Slowest at 2.3697s
#### c. Number of Tokens
- shantipriya/minimind_odia_tokenizer: 181 tokens (Granular tokenization)
- ai4bharat/indic-bert: 110 tokens (Compact tokenization)
- facebook/m2m100_418M: 90 tokens (Efficient tokenization for multilingual data)
#### d. Best for Specific Use Cases

- **For Odia text**: shantipriya/minimind_odia_tokenizer is the best choice due to its precision in tokenizing Odia words.
- **For multilingual support**: facebook/m2m100_418M is ideal for handling many languages efficiently.
- **For Indic languages**: ai4bharat/indic-bert offers robust support for multiple Indic languages but is slower than the other two.
### 7. License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
## Contributors
## Acknowledgements
- **MiniMind**: Repository and scripts related to the project.
- **OdiaGenAI**: The dataset used for training the tokenizer is part of the Odia Generative AI project.
- **AMD**: The tokenizer was trained on an AMD MI 250 machine to optimize performance and efficiency.
For questions or issues, feel free to open an issue or contact the repository maintainer.