update
Files changed:
- README.md +15 -0
- app.py +9 -1
- compression_app.py +11 -11
- compression_util.py +1 -1
- requirements.txt +12 -12
- stats/character_stats.json +21 -0
- stats/compression_rate.json +48 -0
- utils/lang_util.py +2 -2
- vocab.py +1 -1
README.md CHANGED
@@ -9,9 +9,24 @@ app_file: app.py
 pinned: false
 datasets:
 - cc100
+tags:
+- tokenizer
+short_description: Compare different tokenizers in char-level and byte-level.
 ---
 
 
 
 
 Please visit our GitHub repo for more information: https://github.com/xu-song/tokenizer-arena
+
+
+## Run gradio demo
+
+```
+python app.py
+```
+
+
+
+## ss
+
app.py CHANGED
@@ -12,13 +12,21 @@ if auth_token:
     login(token=auth_token)
 
 
-title = '<div align="center">Tokenizer Arena ⚔️</div>'
+# title = '<div align="center">Tokenizer Arena ⚔️</div>'
+title = """
+<div align="center">
+<span style="background-color: rgb(254, 226, 226);">Token</span><span style="background-color: rgb(220, 252, 231);">ization</span>
+<span style="background-color: rgb(219, 234, 254);"> Arena</span>
+<span style="background-color: rgb(254, 249, 195);"> ⚔️</span>
+</div>
+"""
 interface_list = [playground_tab, compression_tab, character_tab]
 tab_names = [" ⚔️ Playground", "🏆 Compression Leaderboard", "📊 Character Statistics"]
 
 # interface_list = [compression_tab, character_tab]
 # tab_names = ["🏆 Compression Leaderboard", "📊 Character Statistics"]
 
+
 with gr.Blocks(css="css/style.css", js="js/onload.js") as demo:
     gr.HTML(
         f"<h1 style='text-align: center; margin-bottom: 1rem'>{title}</h1>"
compression_app.py CHANGED
@@ -35,21 +35,21 @@ The encoding and decoding process can be formulated as
 decoded_text = tokenizer.decode(token_ids)  # reconstructed text
 ```
 
-
+**Lossless**<br>
 Lossless tokenization preserves the exact original text, i.e. `decoded_text = input_text`. There are mainly two causes of compression loss.
 
-
-
-
-
-
-
-
-
+1. `OOV`: Most lossy tokenizers produce many out-of-vocabulary (OOV) words. 👉 Check the OOV and
+tokenization loss of [bert](https://huggingface.co/spaces/eson/tokenizer-arena/blob/main/stats/compression_rate/google-bert.bert-base-cased%20%40%20cc100.zh-Hans.diff.json) and
+[t5](https://huggingface.co/spaces/eson/tokenizer-arena/blob/main/stats/compression_rate/google-t5.t5-large%20%40%20cc100.es.diff.json).
+2. `Normalization`: Even if a tokenizer has no OOV, it can still be lossy due to text normalization. For example, qwen performs [unicode normalization](https://github.com/huggingface/transformers/blob/v4.42.3/src/transformers/models/qwen2/tokenization_qwen2.py#L338) in the encoding process and
+llama performs [clean_up_tokenization_spaces](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/tokenizer_config.json#L2053) in the decoding process,
+which may introduce slight differences into the reconstructed text. 👉 Check the tokenization loss of
+[qwen](https://huggingface.co/spaces/eson/tokenizer-arena/raw/main/stats/compression_rate/Qwen.Qwen1.5-1.8B%20@%20cc100.ja.diff.json) and
+[llama](https://huggingface.co/spaces/eson/tokenizer-arena/raw/main/stats/compression_rate/meta-llama.Meta-Llama-3.1-405B%20@%20cc100.en.diff.json).
 
-
 
-
+
+**Compression Rate**<br>
 There are mainly two types of metric to represent the `input_text`:
 - `char-level`: the number of characters in the given text
 - `byte-level`: the number of bytes in the given text.
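To make the loss check and the two metrics concrete, here is a minimal sketch of the idea described in the added text. It is not the space's own implementation; `gpt2` is only a stand-in tokenizer, and the leaderboard's exact rate definitions may differ.

```python
# Minimal sketch: losslessness check plus char-level and byte-level compression rates.
# Assumes only the `transformers` library; "gpt2" is a stand-in for any tokenizer on the Hub.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

input_text = "Hello, world! 你好,世界!"
token_ids = tokenizer.encode(input_text, add_special_tokens=False)
decoded_text = tokenizer.decode(token_ids)  # reconstructed text

lossless = decoded_text == input_text  # False if OOV or normalization changed the text
chars_per_token = len(input_text) / len(token_ids)                  # char-level view
bytes_per_token = len(input_text.encode("utf-8")) / len(token_ids)  # byte-level view

print(f"lossless={lossless}  chars/token={chars_per_token:.2f}  bytes/token={bytes_per_token:.2f}")
```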
compression_util.py CHANGED
@@ -318,4 +318,4 @@ def main():
 
 
 if __name__ == "__main__":
-    main()
+    main()
requirements.txt CHANGED
@@ -1,12 +1,12 @@
-gradio>=4.38.1
-transformers
-sentencepiece
-tiktoken
-icetk
-torch
-nltk
-boto3
-protobuf==4.25.3
-ai2-olmo
-ipadic
-fugashi
+gradio>=4.38.1
+transformers>4.40.0
+sentencepiece
+tiktoken
+icetk
+torch
+nltk
+boto3
+protobuf==4.25.3
+ai2-olmo
+ipadic
+fugashi
stats/character_stats.json CHANGED
@@ -1936,5 +1936,26 @@
         "len(ja-kana)": "1,2,11",
         "num(ko)": 4492,
         "len(ko)": "1,3,6"
+    },
+    "allenai/OLMo-7B-hf": {
+        "tokenizer": "<a target=\"_blank\" href=\"https://huggingface.co/allenai/OLMo-7B-hf\" style=\"color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;\">OLMo-7B-hf</a>",
+        "organization": "Allen AI",
+        "vocab_size": 50280,
+        "num(digit)": 2036,
+        "len(digit)": "1,3,35",
+        "num(space)": 29019,
+        "len(space)": "1,7,512",
+        "num(ar)": 94,
+        "len(ar)": "1,2,4",
+        "num(zh)": 313,
+        "len(zh)": "1,1,2",
+        "num(ja)": 480,
+        "len(ja)": "1,1,4",
+        "num(ja-kana)": 167,
+        "len(ja-kana)": "1,1,4",
+        "num(ko)": 25,
+        "len(ko)": "1,1,2",
+        "num(la)": 48651,
+        "len(la)": "1,6,512"
     }
 }
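For readers wondering where a count like `num(zh)` comes from: the exact rules live in the space's statistics code, but the idea is to scan the vocabulary for tokens whose decoded text contains characters of a given script. A rough, illustrative sketch follows; the Unicode range and the per-token decoding below are assumptions, so the result will only approximate the 313 recorded above.

```python
# Illustrative approximation of a per-script vocab count such as num(zh).
# Not the space's actual statistics logic; the range and decoding are assumptions.
import re
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B-hf")
zh_char = re.compile(r"[\u4e00-\u9fff]")  # CJK Unified Ideographs

zh_token_count = sum(
    1
    for token in tokenizer.get_vocab()
    if zh_char.search(tokenizer.convert_tokens_to_string([token]))
)
print(zh_token_count)  # the table above records num(zh) = 313 for this tokenizer
```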
stats/compression_rate.json CHANGED
@@ -10306,5 +10306,53 @@
         "oov_ratio": 0.0,
         "_oov_charset": "[]",
         "lossless": true
+    },
+    "allenai/OLMo-7B-hf @ cc100/en": {
+        "tokenizer": "<a target=\"_blank\" href=\"https://huggingface.co/allenai/OLMo-7B-hf\" style=\"color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;\">OLMo-7B-hf</a>",
+        "organization": "Allen AI",
+        "vocab_size": 50280,
+        "_n_bytes": 1124813,
+        "_n_tokens": 259357,
+        "_n_chars": 1121360,
+        "_n_oov_chars": 0,
+        "oov_ratio": 0.0,
+        "_oov_charset": "[]",
+        "lossless": false
+    },
+    "allenai/OLMo-7B-hf @ cc100/zh-Hans": {
+        "tokenizer": "<a target=\"_blank\" href=\"https://huggingface.co/allenai/OLMo-7B-hf\" style=\"color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;\">OLMo-7B-hf</a>",
+        "organization": "Allen AI",
+        "vocab_size": 50280,
+        "_n_bytes": 2633047,
+        "_n_tokens": 1220529,
+        "_n_chars": 927311,
+        "_n_oov_chars": 0,
+        "oov_ratio": 0.0,
+        "_oov_charset": "[]",
+        "lossless": false
+    },
+    "allenai/OLMo-7B-hf @ cc100/fr": {
+        "tokenizer": "<a target=\"_blank\" href=\"https://huggingface.co/allenai/OLMo-7B-hf\" style=\"color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;\">OLMo-7B-hf</a>",
+        "organization": "Allen AI",
+        "vocab_size": 50280,
+        "_n_bytes": 1540504,
+        "_n_tokens": 458961,
+        "_n_chars": 1484970,
+        "_n_oov_chars": 0,
+        "oov_ratio": 0.0,
+        "_oov_charset": "[]",
+        "lossless": false
+    },
+    "allenai/OLMo-7B-hf @ cc100/es": {
+        "tokenizer": "<a target=\"_blank\" href=\"https://huggingface.co/allenai/OLMo-7B-hf\" style=\"color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;\">OLMo-7B-hf</a>",
+        "organization": "Allen AI",
+        "vocab_size": 50280,
+        "_n_bytes": 1664455,
+        "_n_tokens": 494577,
+        "_n_chars": 1630297,
+        "_n_oov_chars": 0,
+        "oov_ratio": 0.0,
+        "_oov_charset": "[]",
+        "lossless": false
+    }
 }
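The new counters translate directly into compression rates. A small sketch, assuming the field names mean what they suggest (`_n_bytes`, `_n_chars`, and `_n_tokens` are corpus totals for each cc100 sample):

```python
# Back-of-the-envelope compression rates from the raw counters above.
import json

with open("stats/compression_rate.json") as f:
    stats = json.load(f)

en = stats["allenai/OLMo-7B-hf @ cc100/en"]
print(en["_n_bytes"] / en["_n_tokens"])  # ≈ 4.34 bytes per token on cc100/en
print(en["_n_chars"] / en["_n_tokens"])  # ≈ 4.32 chars per token on cc100/en

zh = stats["allenai/OLMo-7B-hf @ cc100/zh-Hans"]
print(zh["_n_chars"] / zh["_n_tokens"])  # ≈ 0.76 chars per token on cc100/zh-Hans
```

In other words, the OLMo vocabulary compresses English far better than Simplified Chinese, which is consistent with the small `num(zh)` count recorded in character_stats.json.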
utils/lang_util.py CHANGED
@@ -12,7 +12,7 @@
 此外,有些语言(如法语和西班牙语)在某些情况下可能共享特定的重音符号,这可能导致一个字符串被错误地识别为多种语言。
 
 ## common language
-English | 简体中文 | 繁體中文 | 한국어 | Español | 日本語 | हिन्दी | Русский | Рortuguês | తెలుగు | Français | Deutsch | Tiếng Việt
+English | 简体中文 | 繁體中文 | 한국어 | Español | 日本語 | हिन्दी | Русский | Рortuguês | తెలుగు | Français | Deutsch | Tiếng Việt
 """
 
 import re
@@ -85,4 +85,4 @@ if __name__ == "__main__":
 
     for s, expected in test_strings.items():
         # print(f"'{s}' === Detected lang: {detect_language(s)} === Expected: {expected}")
-        print(f"'{s}'\nDetected lang: {detect_language_by_unicode(s)}\nExpected lang: {expected}")
+        print(f"'{s}'\nDetected lang: {detect_language_by_unicode(s)}\nExpected lang: {expected}")
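The Chinese docstring line in this hunk notes that some languages (for example French and Spanish) can share particular accent marks, so a string may be misidentified as several languages; range-based detection of CJK text has the same ambiguity. A toy sketch of the Unicode-range idea, not the repo's actual `detect_language_by_unicode`:

```python
# Toy Unicode-range detector; the ranges below are illustrative assumptions, not the repo's tables.
import re

UNICODE_RANGES = {
    "zh": r"[\u4e00-\u9fff]",       # CJK Unified Ideographs
    "ja-kana": r"[\u3040-\u30ff]",  # Hiragana and Katakana
    "ko": r"[\uac00-\ud7af]",       # Hangul syllables
    "ar": r"[\u0600-\u06ff]",       # Arabic
    "ru": r"[\u0400-\u04ff]",       # Cyrillic
}

def detect_scripts(text):
    """Return every script whose characters appear in the text."""
    return [lang for lang, pattern in UNICODE_RANGES.items() if re.search(pattern, text)]

print(detect_scripts("Tokenizer Arena 分词"))  # ['zh']
print(detect_scripts("こんにちは世界"))         # ['zh', 'ja-kana'], since kanji fall in the zh range
```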
vocab.py CHANGED
@@ -367,7 +367,7 @@ _all_tokenizer_config = [
     TokenizerConfig("deepseek-ai/DeepSeek-V2", org="DeepSeek"),
     TokenizerConfig("google/gemma-7b", org="Google"),
     TokenizerConfig("google/gemma-2-9b", org="Google"),
-    TokenizerConfig("allenai/OLMo-7B", org="Allen AI"),
+    TokenizerConfig("allenai/OLMo-7B-hf", org="Allen AI"),
     TokenizerConfig("HuggingFaceH4/zephyr-7b-beta", org="HuggingFace"),
     TokenizerConfig("ai21labs/Jamba-v0.1", org="AI21"),
     TokenizerConfig("databricks/dbrx-instruct", org="Databricks"),
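The only change here swaps `allenai/OLMo-7B` for `allenai/OLMo-7B-hf`, the Transformers-native repo. A quick, hedged sanity check that the renamed entry loads and lines up with the stats recorded above; note that `len(tokenizer)` and `tokenizer.vocab_size` can differ depending on added special tokens.

```python
# Sanity check for the renamed config entry; assumes transformers > 4.40.0, as pinned in requirements.txt.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B-hf")
print(len(tokenizer), tokenizer.vocab_size)  # character_stats.json above records vocab_size = 50280
```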