---
license: llama3
---

In this experiment I trained a tokenizer that supports multiple Indian languages and merged and extended the Llama-3 tokenizer with it.

## STEP 1:

I sampled data from the multilingual (7 Indic languages) [aloobun/dhpileIN](https://huggingface.co/datasets/aloobun/dhpileIN) dataset and [trained](https://github.com/aloobun/IN-L3-Tokenizer/blob/main/train.py) a SentencePiece tokenizer on it.
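
A minimal sketch of this training step is below; the corpus path and every hyperparameter are illustrative assumptions, the actual settings live in [train.py](https://github.com/aloobun/IN-L3-Tokenizer/blob/main/train.py).

```python
# Sketch of the SentencePiece training call; the corpus path and all
# hyperparameters are illustrative assumptions, not the values in train.py.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="dhpileIN_sample.txt",   # hypothetical sampled multilingual corpus
    model_prefix="indic_sp",       # writes indic_sp.model and indic_sp.vocab
    vocab_size=32000,              # illustrative vocabulary size
    character_coverage=1.0,        # full coverage matters for Indic scripts
    model_type="bpe",              # assumption; SentencePiece defaults to unigram
)
```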

## STEP 2:

I evaluated the tokenizer's performance on:

- Unicode coverage.
- Token distribution.
- Tokenization complexity across different scripts.
- Encoding and decoding capabilities.
- Edge cases (e.g., special characters, numbers).

## STEP 2.1:

The first [test](https://github.com/aloobun/IN-L3-Tokenizer/blob/main/test_suite_step_2_1.py) reports detailed results on Unicode coverage, token distribution visualization, and tokenization complexity across scripts.
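
As a rough illustration of what such a coverage check can look like (the real tests are in the linked script), the sketch below encodes every code point of a few standard Unicode blocks and counts how many avoid the unknown piece. The model path is the hypothetical one from the STEP 1 sketch.

```python
# Sketch: count, per script block, how many code points the tokenizer
# encodes without producing the unknown piece. The ranges are the
# standard Unicode blocks; the model path is an assumed placeholder.
import sentencepiece as spm

SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali":    (0x0980, 0x09FF),
    "Gujarati":   (0x0A80, 0x0AFF),
    "Tamil":      (0x0B80, 0x0BFF),
    "Telugu":     (0x0C00, 0x0C7F),
    "Kannada":    (0x0C80, 0x0CFF),
    "Malayalam":  (0x0D00, 0x0D7F),
}

sp = spm.SentencePieceProcessor(model_file="indic_sp.model")
unk_id = sp.unk_id()

for script, (lo, hi) in SCRIPT_RANGES.items():
    chars = [chr(cp) for cp in range(lo, hi + 1)]
    covered = sum(unk_id not in sp.encode(c) for c in chars)
    print(f"{script}: {covered}/{len(chars)} code points covered")
```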

## STEP 2.2:

The second [script](https://github.com/aloobun/IN-L3-Tokenizer/blob/main/test_step_2_2.py) tests the tokenizer's encoding and decoding capabilities.
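
The core of such a test is an encode/decode round trip. A minimal sketch, assuming the hypothetical indic_sp.model from the STEP 1 sketch (the sample sentence is the Hindi one from the output below):

```python
# Sketch of the encode/decode round trip the script checks;
# indic_sp.model is the hypothetical model from the STEP 1 sketch.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="indic_sp.model")

text = "नमस्ते, मैं भारत से हूँ। दिल्ली बहुत बड़ा शहर है।"
ids = sp.encode(text)
pieces = sp.encode(text, out_type=str)

print(f"Original Text Length: {len(text)} characters")
print(f"Token IDs Count: {len(ids)}")
print(f"Token Strings: {pieces}")
print(f"Text Reconstruction: {sp.decode(ids) == text}")
```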

Example output:

```
Bengali Analysis:
Original Text Length: 48 characters
Token IDs Count: 11
Token Strings: ['▁আমি', '▁বাংলাদেশ', '▁থেকে', '▁এসে', 'ছি', '।', '▁কলকাতা', '▁একটি', '▁সুন্দর', '▁শহর', '।']
Text Reconstruction: True

Hindi Analysis:
Original Text Length: 49 characters
Token IDs Count: 15
Token Strings: ['▁नम', 'स्ते', ',', '▁मैं', '▁भारत', '▁से', '▁हू', 'ँ', '।', '▁दिल्ली', '▁बहुत', '▁बड़ा', '▁शहर', '▁है', '।']
Text Reconstruction: True

Kannada Analysis:
Original Text Length: 53 characters
Token IDs Count: 13
Token Strings: ['▁ನಾನು', '▁ಬೆಂಗಳೂರಿ', 'ನಿಂದ', '▁ಬಂದ', 'ಿದ್ದೇನೆ', '।', '▁ಕನ್ನಡ', '▁ಒಂದು', '▁ಸೋ', 'ಂಪ', 'ಿನ', '▁ಭಾಷೆ', '।']
Text Reconstruction: True

Malayalam Analysis:
Original Text Length: 47 characters
Token IDs Count: 15
Token Strings: ['▁ഞ', 'ാ', 'ൻ', '▁കേരള', 'ത്തി', 'ൽ', '▁നിന്നാണ്', '.', '▁കൊച്ചി', '▁ഒരു', '▁സുന്ദ', 'ര', '▁നഗ', 'രം', '.']
Text Reconstruction: True

Telugu Analysis:
Original Text Length: 53 characters
Token IDs Count: 10
Token Strings: ['▁నేను', '▁తెలంగాణ', '▁నుంచి', '▁వచ్చ', 'ాను', '.', '▁హైదరాబాద్', '▁అద్భుతమైన', '▁నగరం', '.']
Text Reconstruction: True

Tamil Analysis:
Original Text Length: 54 characters
Token IDs Count: 13
Token Strings: ['▁நான்', '▁தமிழ்நா', 'ட்டை', 'ச்', '▁சேர்ந்த', 'வன்', '.', '▁சென்னை', '▁ஒரு', '▁பெரிய', '▁நக', 'ரம்', '.']
Text Reconstruction: True

Gujarati Analysis:
Original Text Length: 50 characters
Token IDs Count: 12
Token Strings: ['▁હું', '▁ગુજરાત', '▁થી', '▁આવ્યો', '▁છું', '।', '▁અમદાવાદ', '▁એક', '▁સુંદર', '▁શહેર', '▁છે', '।']
Text Reconstruction: True
```

## STEP 3:

This [script](https://github.com/aloobun/IN-L3-Tokenizer/blob/main/merge_step_3.py) merges the new Indic tokens into the Llama-3 tokenizer, extending its vocabulary.

The script ensures that:

- No duplicate tokens are added.
- Tokens aren't excessively long.
- New tokens are correctly integrated.
- Token ID mappings stay consistent.
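
For illustration, one way to sketch this merge is with Hugging Face's `add_tokens()`; this is an assumption about the mechanics, not necessarily how [merge_step_3.py](https://github.com/aloobun/IN-L3-Tokenizer/blob/main/merge_step_3.py) integrates the vocabularies, and the paths and length cap are made up.

```python
# Sketch of extending the Llama-3 tokenizer with the new Indic pieces.
# Using add_tokens() is an assumption about the mechanics; paths and the
# length cap are illustrative, the actual logic is in merge_step_3.py.
import sentencepiece as spm
from transformers import AutoTokenizer

MAX_TOKEN_LEN = 32  # assumed cap against excessively long tokens

llama_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
sp = spm.SentencePieceProcessor(model_file="indic_sp.model")

existing = set(llama_tok.get_vocab())  # duplicate check against Llama-3's vocab
candidates = []
for i in range(sp.get_piece_size()):
    if sp.is_unknown(i) or sp.is_control(i):
        continue  # skip <unk>, <s>, </s> and friends
    piece = sp.id_to_piece(i).replace("▁", " ")  # SentencePiece space marker
    if piece not in existing and 0 < len(piece) <= MAX_TOKEN_LEN:
        candidates.append(piece)

added = llama_tok.add_tokens(candidates)
print(f"Added {added} tokens; new vocab size: {len(llama_tok)}")
llama_tok.save_pretrained("IN-L3-Tokenizer")
```

Note that `add_tokens()` registers the pieces as added tokens rather than as true BPE merges; it is simply the easiest way to honor the duplicate and length checks above.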

I feel there is some unnecessary bloat in the script, such as the token validation and redundant test methods. I'm still working out how to improve this and will update as soon as I make progress.

Here's a comparison of subword **fertility** scores (the average number of tokens produced per word; lower is better) between [sarvam-1](https://huggingface.co/sarvamai/sarvam-1) and this tokenizer.

| Language  | sarvam-1 | IN-Llama-3-Tokenizer |
|-----------|----------|----------------------|
| Bengali   | 1.7      | 3.52                 |
| Gujarati  | 2.784313 | 3.588235             |
| Hindi     | 1.583333 | 2.933333             |
| Kannada   | 2.571428 | 3.976190             |
| Malayalam | 3.487804 | 4.365853             |
| Tamil     | 2.767441 | 3.860465             |
| Telugu    | 2.372093 | 3.511627             |
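
A sketch of how such fertility scores can be computed, assuming simple whitespace word-splitting; "IN-L3-Tokenizer" is the hypothetical directory saved in the STEP 3 sketch, and the sample sentences reuse the STEP 2.2 examples.

```python
# Sketch of a subword-fertility computation: average tokens per word.
# Whitespace splitting and the sample sentences are simplifying
# assumptions; "IN-L3-Tokenizer" is the directory from the STEP 3 sketch.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("IN-L3-Tokenizer")

samples = {
    "Hindi":  "नमस्ते, मैं भारत से हूँ। दिल्ली बहुत बड़ा शहर है।",
    "Telugu": "నేను తెలంగాణ నుంచి వచ్చాను. హైదరాబాద్ అద్భుతమైన నగరం.",
}

for lang, text in samples.items():
    n_words = len(text.split())
    n_tokens = len(tok.tokenize(text))
    print(f"{lang}: fertility = {n_tokens / n_words:.6f}")
```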