Commit: add access token
Files changed:
- README.md (+2, -70)
- app.py (+9, -3)
- character_util.py (+2, -2)
- compression_app.py (+1, -1)
- compression_util.py (+2, -2)
README.md CHANGED

```diff
@@ -1,6 +1,6 @@
 ---
 title: Tokenizer Arena
-emoji:
+emoji: 📚
 colorFrom: red
 colorTo: gray
 sdk: gradio
@@ -12,72 +12,4 @@ datasets:
 ---
 
 
-
-## Compression Rate
-
-On the [cc-100](https://huggingface.co/datasets/cc100) dataset, we take 10,000 samples per language and measure the compression rate of different tokenizers.
-
-> Compression rate example:
-> llama3 has an expanded vocabulary and therefore a higher compression rate. For the same 1 TB of Simplified Chinese corpus, llama tokenizes it into 0.56 trillion tokens, while llama3 needs only 0.31 trillion.
-
-| tokenizer | vocab_size | t_bytes/t_tokens | t_tokens/t_bytes | n_chars/n_tokens |
-|:----------|-----------:|-----------------:|-----------------:|-----------------:|
-| llama     |      32000 |              1.8 |             0.56 |              0.7 |
-| llama3    |     128000 |              3.2 |             0.31 |             1.24 |
-
-These numbers can be reproduced with the following script:
-```sh
-python utils/compress_rate_util.py
-```
-
-
-<details> <summary>English compression rate</summary>
-Compression rate computed on the English dataset cc100-en.
-
-| tokenizer | vocab_size | g_bytes/b_tokens | b_tokens/g_bytes | t_bytes/t_tokens | t_tokens/t_bytes | n_chars/n_tokens |
-|:----------|-----------:|-----------------:|-----------------:|-----------------:|-----------------:|-----------------:|
-| amber     |      32000 |             3.56 |             0.28 |             3.47 |             0.29 |             3.81 |
-| aya_101   |     250100 |              3.3 |              0.3 |             3.22 |             0.31 |             3.53 |
-| baichuan  |      64000 |             3.74 |             0.27 |             3.65 |             0.27 |                4 |
-| baichuan2 |     125696 |             3.89 |             0.26 |              3.8 |             0.26 |             4.17 |
-
-</details>
-
-<details> <summary>Simplified Chinese compression rate</summary>
-Compression rate computed on the Simplified Chinese dataset cc100-zh-Hans.
-
-| tokenizer | vocab_size | g_bytes/b_tokens | b_tokens/g_bytes | t_bytes/t_tokens | t_tokens/t_bytes | n_chars/n_tokens |
-|:----------|-----------:|-----------------:|-----------------:|-----------------:|-----------------:|-----------------:|
-| amber     |      32000 |             1.84 |             0.54 |              1.8 |             0.56 |              0.7 |
-| aya_101   |     250100 |             3.89 |             0.26 |             3.79 |             0.26 |             1.47 |
-| baichuan  |      64000 |             3.92 |             0.26 |             3.82 |             0.26 |             1.48 |
-
-</details>
-
-
-## Reference
-
-- Getting the most out of your tokenizer for pre-training and domain adaptation
-- Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca
-- blog
-  - https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
-  - https://huggingface.co/docs/transformers/tokenizer_summary#sentencepiece
-  - https://www.huaxiaozhuan.com/%E5%B7%A5%E5%85%B7/huggingface_transformer/chapters/1_tokenizer.html
-  - https://zhuanlan.zhihu.com/p/652520262
-  - https://github.com/QwenLM/Qwen/blob/main/tokenization_note_zh.md
-  - https://tonybaloney.github.io/posts/cjk-chinese-japanese-korean-llm-ai-best-practices.html
-- demo
-  - https://huggingface.co/spaces/Xenova/the-tokenizer-playground
-  - https://github.com/dqbd/tiktokenizer
-  - https://chat.lmsys.org/?leaderboard
-  - https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
-- paper
-  - ss
+Please visit our GitHub repo for more information: https://github.com/xu-song/tokenizer-arena
```
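Note on the removed compression-rate section: its figures follow from simple ratios (1 TB of Simplified Chinese at 1.8 bytes/token ≈ 0.56T tokens for llama; at 3.2 bytes/token ≈ 0.31T for llama3). Below is a minimal sketch of how such ratios can be measured on a text sample. It is not the repo's `utils/compress_rate_util.py`; the model ids, sample text, and metric names are illustrative assumptions.

```python
# Minimal sketch (assumption: not the repo's actual script) of bytes/token,
# tokens/byte, and chars/token for a tokenizer on a small text sample.
from transformers import AutoTokenizer

def compress_rate(model_id: str, text: str) -> dict:
    tok = AutoTokenizer.from_pretrained(model_id)
    n_tokens = len(tok.encode(text, add_special_tokens=False))
    n_bytes = len(text.encode("utf-8"))
    n_chars = len(text)
    return {
        "bytes/token": round(n_bytes / n_tokens, 2),  # higher = fewer tokens per byte = better compression
        "tokens/byte": round(n_tokens / n_bytes, 2),
        "chars/token": round(n_chars / n_tokens, 2),
    }

# Model ids are examples; gated repos (e.g. Llama 3) need the HF_TOKEN login added in this commit.
sample = "同样的一段简体中文语料。The same passage in English for comparison."
for model_id in ["huggyllama/llama-7b", "meta-llama/Meta-Llama-3-8B"]:
    print(model_id, compress_rate(model_id, sample))
```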
app.py CHANGED

```diff
@@ -1,13 +1,19 @@
-
+import os
 from playground_app import demo as playground_tab
 from compression_app import demo as compression_tab
 from character_app import demo as character_tab
 from patcher.gr_interface import TabbedInterface
+from huggingface_hub import login
+
+auth_token = os.environ.get('HF_TOKEN', None)
+if auth_token:
+    login(token=auth_token)
 
 
+# encoding speed, decoding speed, character categories (zh, num, etc., with regex support), supported languages
 demo = TabbedInterface(
     [playground_tab, compression_tab, character_tab],
-    [" ⚔️ Playground", "🏆 Compression Leaderboard", "📊 Character Statistics"],
+    [" ⚔️ Playground", "🏆 Compression Leaderboard", "📊 Character Statistics"],
     title='<div align="center">Tokenizer Arena ⚔️</div>',
     css="css/style.css"
 )
@@ -15,4 +21,4 @@ demo = TabbedInterface(
 demo.load(js=open("js/onload.js", "r", encoding="utf-8").read())
 
 if __name__ == "__main__":
-    demo.launch()
+    demo.launch()
```
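The new `HF_TOKEN` login is what allows the Space to download tokenizers from gated repositories. A hedged sketch of the intended effect follows; the gated model id is an example assumption, and on Spaces the token is normally configured as a repository secret rather than hard-coded.

```python
# Sketch of what the HF_TOKEN login enables (assumptions: the token is stored as a
# Space secret and the account has been granted access to the gated repo below).
import os
from huggingface_hub import login
from transformers import AutoTokenizer

token = os.environ.get("HF_TOKEN")
if token:
    login(token=token)  # subsequent Hub downloads reuse this credential

# Example gated repo; without the login above, from_pretrained would raise an
# authorization error for this model id.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
print(tokenizer.tokenize("Tokenizer Arena ⚔️"))
```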
character_util.py CHANGED

```diff
@@ -1,7 +1,7 @@
 """
 TODO:
-1.
-2.
+1. add more language
+2. check space count of bert
 3. add token_impl
 4.
 """
```
compression_app.py CHANGED

```diff
@@ -82,7 +82,7 @@ with gr.Blocks() as demo:
         # "- `g_bytes/b_tokens` measures how many gigabytes corpus per billion tokens.\n"
         # "- `t_bytes/t_tokens` measures how many terabytes corpus per trillion tokens.\n"
         "- `char/token` measures how many chars per token on the tokenized corpus.\n"
-        "- `oov_ratio`: out-of-vocabulary ratio on the selected corpus\n\n"
+        "- `oov_ratio`: out-of-vocabulary ratio on the selected corpus, 👉 get [oov charset](https://huggingface.co/spaces/eson/tokenizer-arena/raw/main/stats/compression_rate.json)\n\n"
         "You can reproduce this procedure with [compression_util.py](https://huggingface.co/spaces/eson/tokenizer-arena/blob/main/compression_util.py)."
     )
 
```
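The leaderboard text now documents `oov_ratio` next to `char/token`. The sketch below shows one plausible way such per-corpus statistics could be computed; it is an illustrative approximation using an encode/decode round-trip heuristic for OOV detection, not the Space's actual `compression_util.py` logic.

```python
# Rough sketch (assumption, for illustration only) of char/token and an OOV ratio.
from transformers import AutoTokenizer

def corpus_stats(model_id: str, texts: list[str]) -> dict:
    tok = AutoTokenizer.from_pretrained(model_id)
    n_chars = sum(len(t) for t in texts)
    n_tokens = 0
    all_chars: set[str] = set()
    oov_chars: set[str] = set()
    for text in texts:
        ids = tok.encode(text, add_special_tokens=False)
        n_tokens += len(ids)
        decoded = tok.decode(ids)
        all_chars |= set(text)
        # Characters that do not survive the encode/decode round trip are
        # counted as out-of-vocabulary here (a rough heuristic).
        oov_chars |= {ch for ch in set(text) if ch not in decoded}
    return {
        "char/token": round(n_chars / n_tokens, 2),
        "oov_ratio": round(len(oov_chars) / max(len(all_chars), 1), 4),
        "oov_charset": "".join(sorted(oov_chars)),
    }

print(corpus_stats("gpt2", ["Hello world!", "你好,世界"]))
```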
compression_util.py CHANGED

```diff
@@ -1,9 +1,9 @@
 """
 
-
-English data: glue cnn_dailymail gigaword
+## TODO
 code:
 math:
+whitespace:
 
 """
 
```