Commit aa96e23 (verified) by rishiraj · parent: e77e072 · "Update README.md" · files changed: README.md (+5 −8)
tags:
 - gemma
---

# rishiraj/gemma-2-9b-bn

This repository extends the `google/gemma-2-9b` tokenizer by training it on Bengali text. The original tokenizer splits many Bengali words into subword components, leading to inefficiency and a loss of meaning. Our extended Bengali tokenizer better preserves word integrity, tokenizing more effectively with fewer splits and ensuring a more meaningful representation of the text.

## Token Information

| Tokenizer                  | Output                                                                                                                 |
|----------------------------|------------------------------------------------------------------------------------------------------------------------|
| `google/gemma-2-9b`        | ['আ', 'মি', '▁এক', 'জন', '▁ভ', 'াল', 'ো', '▁', 'ছে', 'লে', '▁এবং', '▁আম', 'ি', '▁ফ', 'ু', 'ট', 'ব', 'ল', '▁খ', 'েল', 'তে', '▁প', 'ছ', 'ন্দ', '▁কর', 'ি'] |
| `rishiraj/gemma-2-9b-bn`   | ['আমি', '▁একজন', '▁ভালো', '▁ছেলে', '▁এবং', '▁আমি', '▁ফুটবল', '▁খেলতে', '▁পছন্দ', '▁করি'] |

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rishiraj/gemma-2-9b-bn")
tokens = tokenizer.tokenize("আমি একজন ভালো ছেলে এবং আমি ফুটবল খেলতে পছন্দ করি")
print(tokens)
```
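The gain in the comparison table can be quantified by simply counting tokens. The sketch below uses the token lists copied from the table (not live tokenizer calls, which require access to the gated Gemma models), so the counts reflect exactly the example sentence shown above:

```python
# Token lists copied verbatim from the comparison table above
# (no tokenizer download needed for this comparison).
gemma_tokens = ['আ', 'মি', '▁এক', 'জন', '▁ভ', 'াল', 'ো', '▁', 'ছে', 'লে',
                '▁এবং', '▁আম', 'ি', '▁ফ', 'ু', 'ট', 'ব', 'ল', '▁খ', 'েল',
                'তে', '▁প', 'ছ', 'ন্দ', '▁কর', 'ি']
bn_tokens = ['আমি', '▁একজন', '▁ভালো', '▁ছেলে', '▁এবং',
             '▁আমি', '▁ফুটবল', '▁খেলতে', '▁পছন্দ', '▁করি']

print(len(gemma_tokens))  # 26
print(len(bn_tokens))     # 10
print(f"{len(gemma_tokens) / len(bn_tokens):.1f}x fewer tokens")  # 2.6x
```

Fewer tokens per sentence means Bengali text consumes less of the model's context window and fewer tokens per generation, which is the practical payoff of extending the tokenizer.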