Update README.md
README.md (CHANGED)
@@ -10,7 +10,8 @@ tags:
 - gemma
 ---
 
-
+# rishiraj/gemma-2-9b-bn
+This repository extends the `google/gemma-2-9b` tokenizer by training it on Bengali text. The original tokenizer splits many Bengali words into subword components, leading to inefficiency and loss of meaning. Our extended Bengali tokenizer better preserves word integrity, tokenizing more effectively with fewer splits, ensuring more meaningful representation of the text.
 
 ## Token Information
 
@@ -31,8 +32,8 @@ While Bengali is very expressive and flexible, it hasn't undergone as much globa
 
 | Tokenizer | Output |
 |----------------------------|----------------------------------------------------------------------------------------------------------------------|
-| `
-| `
+| `google/gemma-2-9b` | ['আ', 'মি', '▁এক', 'জন', '▁ভ', 'াল', 'ো', '▁', 'ছে', 'লে', '▁এবং', '▁আম', 'ি', '▁ফ', 'ু', 'ট', 'ব', 'ল', '▁খ', 'েল', 'তে', '▁প', 'ছ', 'ন্দ', '▁কর', 'ি'] |
+| `rishiraj/gemma-2-9b-bn` | ['আমি', '▁একজন', '▁ভালো', '▁ছেলে', '▁এবং', '▁আমি', '▁ফুটবল', '▁খেলতে', '▁পছন্দ', '▁করি'] |
 
 ## Usage
 
@@ -47,8 +48,4 @@ While Bengali is very expressive and flexible, it hasn't undergone as much globa
 tokenizer = AutoTokenizer.from_pretrained("rishiraj/gemma-2-9b-bn")
 tokens = tokenizer.tokenize("আমি একজন ভালো ছেলে এবং আমি ফুটবল খেলতে পছন্দ করি")
 print(tokens)
-```
-
-## Conclusion
-
-The original `gemma_tokenizer` splits many Bengali words into subword components, leading to inefficiency and loss of meaning. Our extended Bengali tokenizer better preserves word integrity, tokenizing more effectively with fewer splits, ensuring more meaningful representation of the text.
+```
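For a quick side-by-side check of the claim in the new README, a minimal sketch along these lines loads both tokenizers and compares their output on the sentence from the Token Information table. It assumes both checkpoints are reachable via `transformers.AutoTokenizer`; the base `google/gemma-2-9b` repository is gated on the Hub, so authentication may be required.

```python
from transformers import AutoTokenizer

text = "আমি একজন ভালো ছেলে এবং আমি ফুটবল খেলতে পছন্দ করি"

# Assumption: both tokenizers load via AutoTokenizer.from_pretrained;
# google/gemma-2-9b is gated and may need `huggingface-cli login` first.
for name in ("google/gemma-2-9b", "rishiraj/gemma-2-9b-bn"):
    tokenizer = AutoTokenizer.from_pretrained(name)
    tokens = tokenizer.tokenize(text)
    print(f"{name}: {len(tokens)} tokens")
    print(tokens)
```

On this sentence the table lists 26 subword pieces for the base tokenizer versus 10 whole-word pieces for the extended one, which is the efficiency gain the README describes.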