Minimind Odia Tokenizer

This is the Minimind Odia Tokenizer, a Byte-Pair Encoding (BPE) tokenizer trained specifically for the Odia language. It was trained on the OdiaGenAIdata/pre_train_odia_data_processed dataset and is designed to handle Odia text for natural language processing (NLP) tasks.

The tokenizer is available for easy integration into your projects directly from Hugging Face.

How to Use

To use this tokenizer in your own project, follow the steps below:

1. Install the Necessary Libraries

Ensure that you have the Hugging Face transformers and tokenizers libraries installed:

pip install transformers tokenizers

2. Load the Tokenizer

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("shantipriya/minimind_odia_tokenizer")

# Example Odia text
text = "ତୁମେ କେଉଁଠାରୁ ଆସିଛ?"

# Tokenize the input text
encoded_text = tokenizer(text)

# Print tokenized output
print("Tokenized text:", encoded_text)

3. Tokenizer Features

  • Vocabulary Size: The tokenizer has a vocabulary of 150,000 tokens, optimized for handling a diverse range of Odia words.
  • Special Tokens: Includes the special tokens <unk>, <s>, and </s>, which mark unknown tokens and the beginning and end of sequences, respectively.
  • Preprocessing: The tokenizer uses a byte-level pre-tokenizer, so any input is first decomposed into bytes; every character of Odia text can therefore be represented, and nothing falls outside the vocabulary.
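The byte-level BPE scheme described above can be illustrated with a toy merge loop in plain Python. This is a conceptual sketch only: the real tokenizer was trained with the Hugging Face tokenizers library, and the two-word corpus and the merge count here are made up for illustration.

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Count adjacent symbol pairs across all sequences; return the most frequent."""
    pairs = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])  # concatenate the two byte strings
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

# Byte-level pre-tokenization: start from individual UTF-8 bytes, as a
# Byte-Level pre-tokenizer would.
corpus = ["ଓଡ଼ିଆ", "ଓଡ଼ିଶା"]
sequences = [[bytes([b]) for b in word.encode("utf-8")] for word in corpus]

# Learn a handful of merges: repeatedly fuse the most frequent adjacent pair.
for _ in range(5):
    pair = most_frequent_pair(sequences)
    if pair is None:
        break
    sequences = [merge_pair(seq, pair) for seq in sequences]
```

After a few merges each word is represented by far fewer symbols than its raw byte count, while the underlying bytes are preserved exactly — the same property that lets a byte-level BPE tokenizer round-trip arbitrary Odia text.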

4. Example Usage

Here’s how you can use the tokenizer for encoding and decoding Odia text:

from transformers import AutoTokenizer

# Load the pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained("shantipriya/minimind_odia_tokenizer")

# Example Odia messages (roughly: "You are an assistant and you have good
# historical knowledge." / "Where are you from?" / "I come from Earth")
messages = [
    {"role": "system", "content": "ଆପଣ ଜଣେ ସହାୟକ ଏବଂ ଆପଣଙ୍କର ଭଲ ଐତିହାସିକ ଜ୍ଞାନ ଅଛି।"},
    {"role": "user", "content": "ତୁମେ କେଉଁଠାରୁ ଆସିଛ?"},
    {"role": "assistant", "content": "ମୁଁ ପୃଥିବୀରୁ ଆସିଛି"}
]

Example output:

Tokenized input (ID list): [414, 329, 225, 316, 221, 318, 309, 274, 531, 222, 303, 596, 1506, 223, 237, 221, 238, 223, 226, 272, 222, 312, 221, 227, 676, 295, 276, 229, 231, 225, 232, 225, 292, 275, 300, 221, 224, 229, 349, 223, 255, 33, 249, 463, 236, 299, 247, 223, 230, 246, 224, 229, 349, 223, 255, 223]

Tokenization time: 0.0548 seconds

Decoded output:  ଆପଣ ଜଣେ ସହାୟକ ଏବଂ ଆପଣଙ୍କର ଭଲ ଐତିହାସିକ ଜ୍ଞାନ ଅଛି। ତୁମେ କେଉଁଠାରୁ ଆସିଛ? ମୁଁ ପୃଥିବୀରୁ ଆସିଛି

Number of tokens: 56

Comparison of Original and Decoded Text:

Original Text: ଆପଣ ଜଣେ ସହାୟକ ଏବଂ ଆପଣଙ୍କର ଭଲ ଐତିହାସିକ ଜ୍ଞାନ ଅଛି। ତୁମେ କେଉଁଠାରୁ ଆସିଛ? ମୁଁ ପୃଥିବୀରୁ ଆସିଛି

Decoded Text:  ଆପଣ ଜଣେ ସହାୟକ ଏବଂ ଆପଣଙ୍କର ଭଲ ଐତିହାସିକ ଜ୍ଞାନ ଅଛି। ତୁମେ କେଉଁଠାରୁ ଆସିଛ? ମୁଁ ପୃଥିବୀରୁ ଆସିଛି
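The timing, token-count, and round-trip comparison shown above can be reproduced with a small helper. The sketch below is illustrative: `benchmark_tokenizer` and the whitespace stand-in are hypothetical names introduced here so the example runs without downloading a model; with the real tokenizer you would pass `tokenizer.encode` and `tokenizer.decode` instead.

```python
import time

def benchmark_tokenizer(encode, decode, text):
    """Encode `text`, time the call, decode back, and report round-trip fidelity."""
    start = time.perf_counter()
    ids = encode(text)
    elapsed = time.perf_counter() - start
    decoded = decode(ids)
    return {
        "num_tokens": len(ids),
        "seconds": elapsed,
        "round_trip_ok": decoded.strip() == text.strip(),
        "decoded": decoded,
    }

# Stand-in whitespace "tokenizer" so this sketch runs offline; with the real
# tokenizer use: benchmark_tokenizer(tokenizer.encode, tokenizer.decode, text)
vocab = {}
def toy_encode(text):
    return [vocab.setdefault(w, len(vocab)) for w in text.split()]
def toy_decode(ids):
    inverse = {i: w for w, i in vocab.items()}
    return " ".join(inverse[i] for i in ids)

report = benchmark_tokenizer(toy_encode, toy_decode, "ତୁମେ କେଉଁଠାରୁ ଆସିଛ?")
```

The `round_trip_ok` flag is the same check made in the comparison above: a tokenizer suited to Odia should decode back to the original text without dropping characters.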

5. Dataset Used

The tokenizer was trained on the OdiaGenAIdata/pre_train_odia_data_processed dataset, which contains a large corpus of Odia text suitable for training NLP models.

6. Tokenizer Comparison

| Tokenizer | Vocabulary Size | Tokenization Time (seconds) | Number of Tokens | Original Text | Decoded Output | Tokenized Output (Odia Characters) |
| --- | --- | --- | --- | --- | --- | --- |
| shantipriya/minimind_odia_tokenizer | 150,000 | 0.563647 | 181 | ଓଡ଼ିଆ ଭାଷା ଏକ ଇଣ୍ଡୋ-ଆର୍ୟାନ୍ ଭାଷା... | ଓଡ଼ିଆ ଭାଷା ଏକ ଇଣ୍ଡୋ-ଆର୍ୟାନ୍ ଭାଷା... | ['Ċ', 'à¬ĵଡ', '଼ି', 'à¬Ĩ', 'Ġà¬Ń', ...] |
| ai4bharat/indic-bert | 200,000 | 2.369656 | 110 | ଓଡ଼ିଆ ଭାଷା ଏକ ଇଣ୍ଡୋ-ଆର୍ୟାନ୍ ଭାଷା... | ଓଡଆ ଭଷ ଏକ ଇଣଡ-ଆରୟନ ଭଷ... | ['[CLS]', '▁ଓ', 'ଡ', 'ଆ', '▁ଭ', 'ଷ', ...] |
| facebook/m2m100_418M | 128,104 | 0.358011 | 90 | ଓଡ଼ିଆ ଭାଷା ଏକ ଇଣ୍ଡୋ-ଆର୍ୟାନ୍ ଭାଷା... | ଓଡ଼ିଆ ଭାଷା ଏକ ଇଣ୍ଡୋ-ଆର୍ୟାନ୍ ଭାଷା... | ['en', '▁ଓଡ଼ିଆ', '▁ଭାଷ', 'ା', '▁ଏକ', ...] |

a. Vocabulary Size

  • shantipriya/minimind_odia_tokenizer: 150,000 tokens (Optimized for Odia)
  • ai4bharat/indic-bert: 200,000 tokens (Supports multiple Indic languages)
  • facebook/m2m100_418M: 128,104 tokens (Multilingual support)

b. Tokenization Time

  • facebook/m2m100_418M: Fastest at 0.3580s
  • shantipriya/minimind_odia_tokenizer: Moderate at 0.5636s
  • ai4bharat/indic-bert: Slowest at 2.3697s

c. Number of Tokens

  • shantipriya/minimind_odia_tokenizer: 181 tokens (most fine-grained segmentation of the sample text)
  • ai4bharat/indic-bert: 110 tokens (more compact, but note the lossy decoding above)
  • facebook/m2m100_418M: 90 tokens (most compact encoding of this sample)
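As a quick sanity check on the token counts above, the relative sequence lengths can be computed directly from the numbers reported in the table (a toy calculation over the published figures, not a new benchmark):

```python
# Token counts from the comparison table for the same sample text
token_counts = {
    "shantipriya/minimind_odia_tokenizer": 181,
    "ai4bharat/indic-bert": 110,
    "facebook/m2m100_418M": 90,
}

# Express each count relative to the most compact tokenizer on this sample
baseline = min(token_counts.values())
ratios = {name: count / baseline for name, count in token_counts.items()}
```

For this sample, the Odia-specific tokenizer produces roughly twice as many tokens as m2m100; whether that granularity helps depends on the downstream model, and compactness alone does not capture decoding fidelity (see the lossy indic-bert output above).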

d. Best for Specific Use Cases

  • For Odia text: shantipriya/minimind_odia_tokenizer is the best choice due to its precision in tokenizing Odia words.
  • For multilingual support: facebook/m2m100_418M is ideal for handling many languages efficiently.
  • For Indic languages: ai4bharat/indic-bert offers robust support for multiple Indic languages but is slower in comparison to the other two.

7. License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Contributors

Acknowledgements

  • MiniMind: Repository and scripts related to the project.

  • OdiaGenAI: The dataset used for training the tokenizer is part of the Odia Generative AI project.

  • AMD: The tokenizer was trained on an AMD MI250 machine.

For questions or issues, feel free to open an issue or contact the repository maintainer.
