---
license: llama3
---
In this experiment, I trained a tokenizer that supports multiple Indian languages and used it to merge with and extend the Llama-3 tokenizer.
## STEP 1:
I sampled data from the multilingual (7 Indic languages) [aloobun/dhpileIN](https://huggingface.co/datasets/aloobun/dhpileIN) dataset and [trained](https://github.com/aloobun/IN-L3-Tokenizer/blob/main/train.py) a SentencePiece tokenizer.
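The training step can be sketched as below. The `sample_balanced` helper, the `unigram` model type, and the `character_coverage` value are my assumptions for illustration, not taken from `train.py`:

```python
import random

def sample_balanced(corpora, per_lang, seed=0):
    """Draw an equal number of lines from each language split so that
    no single script dominates the tokenizer's training data.
    `corpora` maps a language code to a list of text lines."""
    rng = random.Random(seed)
    sampled = []
    for lang, lines in sorted(corpora.items()):
        k = min(per_lang, len(lines))
        sampled.extend(rng.sample(lines, k))
    rng.shuffle(sampled)
    return sampled

def train_tokenizer(input_file, model_prefix, vocab_size=32000):
    # Deferred import so the sampling helper above works even when
    # sentencepiece is not installed.
    import sentencepiece as spm
    spm.SentencePieceTrainer.train(
        input=input_file,
        model_prefix=model_prefix,
        vocab_size=vocab_size,
        model_type="unigram",       # assumption; train.py may use BPE instead
        character_coverage=0.9995,  # keep rare Indic codepoints in the vocab
    )
```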
## STEP 2:
I evaluated the tokenizer's performance on:
- Unicode coverage.
- Token distribution.
- Tokenization complexity across different scripts.
- Encoding and decoding capabilities.
- Edge cases (e.g., special characters, numbers).
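The Unicode-coverage part of this evaluation can be sketched as follows; the script ranges and the `script_coverage` helper are illustrative assumptions, not code from the actual test suite:

```python
# Unicode block ranges for the seven scripts (assumption: the real test
# suite may use different block boundaries or unicodedata lookups).
SCRIPT_RANGES = {
    "Bengali": (0x0980, 0x09FF),
    "Devanagari": (0x0900, 0x097F),
    "Gujarati": (0x0A80, 0x0AFF),
    "Kannada": (0x0C80, 0x0CFF),
    "Malayalam": (0x0D00, 0x0D7F),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
}

def script_coverage(vocab):
    """Count how many vocabulary pieces contain at least one character
    from each script's Unicode block."""
    counts = {name: 0 for name in SCRIPT_RANGES}
    for piece in vocab:
        for name, (lo, hi) in SCRIPT_RANGES.items():
            if any(lo <= ord(ch) <= hi for ch in piece):
                counts[name] += 1
    return counts
```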
## STEP 2.1:
The first [test](https://github.com/aloobun/IN-L3-Tokenizer/blob/main/test_suite_step_2_1.py) gives detailed results on Unicode coverage, token distribution visualization, and tokenization complexity across scripts.
## STEP 2.2:
The second [script](https://github.com/aloobun/IN-L3-Tokenizer/blob/main/test_step_2_2.py) tests the encoding and decoding capabilities.
Example output:
```
Bengali Analysis:
Original Text Length: 48 characters
Token IDs Count: 11
Token Strings: ['▁আমি', '▁বাংলাদেশ', '▁থেকে', '▁এসে', 'ছি', '।', '▁কলকাতা', '▁একটি', '▁সুন্দর', '▁শহর', '।']
Text Reconstruction: True
Hindi Analysis:
Original Text Length: 49 characters
Token IDs Count: 15
Token Strings: ['▁नम', 'स्ते', ',', '▁मैं', '▁भारत', '▁से', '▁हू', 'ँ', '।', '▁दिल्ली', '▁बहुत', '▁बड़ा', '▁शहर', '▁है', '।']
Text Reconstruction: True
Kannada Analysis:
Original Text Length: 53 characters
Token IDs Count: 13
Token Strings: ['▁ನಾನು', '▁ಬೆಂಗಳೂರಿ', 'ನಿಂದ', '▁ಬಂದ', 'ಿದ್ದೇನೆ', '।', '▁ಕನ್ನಡ', '▁ಒಂದು', '▁ಸೋ', 'ಂಪ', 'ಿನ', '▁ಭಾಷೆ', '।']
Text Reconstruction: True
Malayalam Analysis:
Original Text Length: 47 characters
Token IDs Count: 15
Token Strings: ['▁ഞ', 'ാ', 'ൻ', '▁കേരള', 'ത്തി', 'ൽ', '▁നിന്നാണ്', '.', '▁കൊച്ചി', '▁ഒരു', '▁സുന്ദ', 'ര', '▁നഗ', 'രം', '.']
Text Reconstruction: True
Telugu Analysis:
Original Text Length: 53 characters
Token IDs Count: 10
Token Strings: ['▁నేను', '▁తెలంగాణ', '▁నుంచి', '▁వచ్చ', 'ాను', '.', '▁హైదరాబాద్', '▁అద్భుతమైన', '▁నగరం', '.']
Text Reconstruction: True
Tamil Analysis:
Original Text Length: 54 characters
Token IDs Count: 13
Token Strings: ['▁நான்', '▁தமிழ்நா', 'ட்டை', 'ச்', '▁சேர்ந்த', 'வன்', '.', '▁சென்னை', '▁ஒரு', '▁பெரிய', '▁நக', 'ரம்', '.']
Text Reconstruction: True
Gujarati Analysis:
Original Text Length: 50 characters
Token IDs Count: 12
Token Strings: ['▁હું', '▁ગુજરાત', '▁થી', '▁આવ્યો', '▁છું', '।', '▁અમદાવાદ', '▁એક', '▁સુંદર', '▁શહેર', '▁છે', '।']
Text Reconstruction: True
```
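The round-trip analysis behind this output can be sketched with a tokenizer-agnostic helper; the `analyze` function and its callable interface are my assumptions (the real script works against the SentencePiece model directly):

```python
def analyze(text, encode, decode):
    """Report character count, token count, and whether decoding the
    token sequence reconstructs the original text exactly.
    `encode` maps text -> token list; `decode` maps token list -> text."""
    tokens = encode(text)
    return {
        "chars": len(text),
        "tokens": len(tokens),
        "roundtrip": decode(tokens) == text,
    }
```

With a real SentencePiece model, `encode`/`decode` would be the processor's `encode` and `decode` methods wrapped in lambdas.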
## STEP 3:
This [script](https://github.com/aloobun/IN-L3-Tokenizer/blob/main/merge_step_3.py) merges the trained SentencePiece vocabulary into the Llama-3 tokenizer, extending it.
The script ensures:
- No duplicate tokens are added.
- Tokens aren't excessively long.
- New tokens are correctly integrated.
- Token mappings remain consistent.
I feel there is some unnecessary bloat in the script, like token validation and redundant test methods. I'm still working on improvements and will update as soon as I have any progress.
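The filtering step can be sketched roughly as follows; `select_new_tokens`, the `max_len` cutoff, and the deferred `transformers` call are assumptions for illustration, not the actual `merge_step_3.py` logic:

```python
def select_new_tokens(candidate_pieces, base_vocab, max_len=16):
    """Filter candidate pieces before extending the base tokenizer:
    skip pieces already in the base vocab, duplicates within the
    candidates, overly long pieces, and empty markers."""
    seen = set(base_vocab)
    selected = []
    for piece in candidate_pieces:
        if piece in seen or len(piece) > max_len or not piece.strip("▁"):
            continue
        seen.add(piece)
        selected.append(piece)
    return selected

def extend_llama3(new_pieces):
    # Deferred import; assumes `transformers` is installed and you have
    # access to the gated meta-llama/Meta-Llama-3-8B repo.
    from transformers import AutoTokenizer
    tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
    added = tok.add_tokens(new_pieces)  # returns how many were actually added
    return tok, added
```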
Here's a comparison of subword **fertility** scores (average tokens produced per word; lower is better) between [sarvam-1](https://huggingface.co/sarvamai/sarvam-1) and this tokenizer.
|Language|sarvam-1|IN-Llama-3-Tokenizer|
|--------|--------|--------------------|
|Bengali|1.7|3.52|
|Gujarati|2.784313|3.588235|
|Hindi|1.583333|2.933333|
|Kannada|2.571428|3.976190|
|Malayalam|3.487804|4.365853|
|Tamil|2.767441|3.860465|
|Telugu|2.372093|3.511627|
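A fertility score like the ones above can be computed as tokens per whitespace-separated word; this minimal sketch is an assumption about the methodology, not the exact script used for the table:

```python
def fertility(texts, encode):
    """Subword fertility: total tokens produced divided by total
    whitespace-separated words across the corpus."""
    words = sum(len(t.split()) for t in texts)
    tokens = sum(len(encode(t)) for t in texts)
    return tokens / words if words else float("nan")
```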