title: Tokenizer Arena
emoji: ⚡
colorFrom: red
colorTo: gray
sdk: gradio
sdk_version: 3.41.2
app_file: app.py
pinned: false

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
## ss
## TODO
- Search bar
## Statistics
## vocabsize
- Increasing the vocabulary size improves the compression rate; the side effect is more computation and memory (see "Getting the most out of your tokenizer for pre-training and domain adaptation").

https://huggingface.co/spaces/yenniejun/tokenizers-languages
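To make the memory side of this trade-off concrete, here is a minimal sketch that estimates how the embedding parameters grow with vocabulary size. The hidden size of 4096, the 2 bytes per parameter (fp16), and the untied input/output embeddings are illustrative assumptions, not values taken from any model listed in this README; the function names are hypothetical.

```python
def embedding_params(vocab_size: int, hidden_size: int, tied: bool = False) -> int:
    """Parameter count of the input embedding plus the output projection.

    If the two matrices are tied (shared), only one is counted.
    """
    matrices = 1 if tied else 2
    return matrices * vocab_size * hidden_size


def embedding_memory_mib(
    vocab_size: int,
    hidden_size: int,
    bytes_per_param: int = 2,  # assumption: fp16 weights
    tied: bool = False,
) -> float:
    """Memory for the embedding parameters, in MiB."""
    return embedding_params(vocab_size, hidden_size, tied) * bytes_per_param / 1024**2


if __name__ == "__main__":
    # Assumed hidden size; vocabulary sizes are examples only.
    for vocab_size in (32000, 64000, 151851):
        mib = embedding_memory_mib(vocab_size, hidden_size=4096)
        print(f"vocab_size={vocab_size:>7}  embedding memory ~ {mib:.0f} MiB")
```

Under these assumptions the cost grows linearly with the vocabulary size, so a larger vocabulary buys its compression gain with a proportionally larger embedding and output layer.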
## gradio app
- https://arena.lmsys.org/
## lang
## number
## diff
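## Compression Rate
**Introduction**

We tokenize the CC-100 corpus with each tokenizer and report the compression rate. `g_bytes/b_tokens` is gigabytes of text per billion tokens, `t_bytes/t_tokens` is terabytes of text per trillion tokens, and `b_tokens/g_bytes` is billions of tokens per gigabyte of text (lower means better compression).
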
| tokenizer | vocab_size | g_bytes/b_tokens | t_bytes/t_tokens | b_tokens/g_bytes |
|:----------------------------|-------------:|-------------------:|-------------------:|-------------------:|
| amber | 32000 | 1.84 | 1.8 | 0.54 |
| aya_101 | 250100 | 3.89 | 3.79 | 0.26 |
| baichuan | 64000 | 3.92 | 3.82 | 0.26 |
| baichuan2 | 125696 | 4.53 | 4.42 | 0.22 |
| bert_base_cased | 28996 | 2.73 | 2.66 | 0.37 |
| bert_base_chinese | 21128 | 2.74 | 2.67 | 0.37 |
| bert_base_uncased | 30522 | 2.73 | 2.67 | 0.37 |
| bloom | 250680 | 4.28 | 4.18 | 0.23 |
| byt5_small | 256 | 0.93 | 0.91 | 1.08 |
| character_glm_6b | 64794 | 4.2 | 4.1 | 0.24 |
| chatglm2_6b | 64794 | 4.2 | 4.1 | 0.24 |
| chatglm3_6b | 64798 | 4.2 | 4.1 | 0.24 |
| chatglm_6b | 150344 | 4.65 | 4.54 | 0.22 |
| chatyuan_large_v2 | 32128 | 4.34 | 4.24 | 0.23 |
| chinese_llama | 49953 | 3.93 | 3.84 | 0.25 |
| chinese_llama2 | 55296 | 3.92 | 3.83 | 0.26 |
| code_davinci_002 | 50281 | 1.31 | 1.28 | 0.77 |
| crystal_coder | 32000 | 1.86 | 1.81 | 0.54 |
| deepseek_coder_33b_instruct | 32000 | 3.4 | 3.32 | 0.29 |
| deepseek_llm_7b_base | 100000 | 4.05 | 3.96 | 0.25 |
| falcon_180b | 65024 | 2.18 | 2.13 | 0.46 |
| falcon_7b | 65024 | 2.18 | 2.13 | 0.46 |
| fastchat_t5_3b | 32000 | 13.7 | 13.38 | 0.07 |
| flan_t5_base | 32100 | 14.13 | 13.8 | 0.07 |
| gemma_7b | 256000 | 3.82 | 3.73 | 0.26 |
| gpt2 | 50257 | 1.31 | 1.28 | 0.77 |
| gpt2_chinese | 21128 | 2.73 | 2.66 | 0.37 |
| gpt_35_turbo | 100277 | 2.26 | 2.21 | 0.44 |
| gpt_4 | 100277 | 2.26 | 2.21 | 0.44 |
| gpt_nexo_20b | 50254 | 2.01 | 1.96 | 0.5 |
| internlm2_chat_7b | 92544 | 4.23 | 4.13 | 0.24 |
| internlm2_math_7b | 92544 | 4.23 | 4.13 | 0.24 |
| internlm_chat_7b | 103168 | 4.23 | 4.14 | 0.24 |
| internlm_xcomposer_7b | 103168 | 4.23 | 4.14 | 0.24 |
| kplug | 10261 | 2.72 | 2.65 | 0.37 |
| llama | 32000 | 1.84 | 1.8 | 0.54 |
| llama2 | 32000 | 1.84 | 1.8 | 0.54 |
| mistral_7b | 32000 | 2.36 | 2.3 | 0.42 |
| mixtral_8_7b | 32000 | 2.36 | 2.3 | 0.42 |
| mobilebert_uncased | 30522 | 2.73 | 2.67 | 0.37 |
| moss | 106029 | 4.4 | 4.3 | 0.23 |
| mt5_large | 250100 | 3.89 | 3.79 | 0.26 |
| olmo_7b | 50280 | 2.01 | 1.96 | 0.5 |
| orion_14b_chat | 84608 | 4.63 | 4.52 | 0.22 |
| phi_1 | 50257 | 1.31 | 1.28 | 0.77 |
| phi_2 | 50257 | 1.31 | 1.28 | 0.77 |
| pko_t5_large | 50258 | 0.97 | 0.95 | 1.03 |
| prompt_clue | 32128 | 4.34 | 4.24 | 0.23 |
| qwen1_5_14b_chat | 151643 | 4.16 | 4.06 | 0.24 |
| qwen_1_8b_chat | 151851 | 4.16 | 4.06 | 0.24 |
| qwen_72b_chat | 151851 | 4.16 | 4.06 | 0.24 |
| qwen_7b_chat | 151851 | 4.16 | 4.06 | 0.24 |
| roberta_chinese_clue | 8021 | 2.7 | 2.64 | 0.37 |
| skywork_13b_base | 65519 | 3.69 | 3.61 | 0.27 |
| skywork_13b_math | 65519 | 3.69 | 3.61 | 0.27 |
| solar_10_7b | 32000 | 2.36 | 2.3 | 0.42 |
| starchat_alpha | 49152 | 2.78 | 2.72 | 0.36 |
| switch_c_2048 | 32100 | 14.13 | 13.8 | 0.07 |
| t5_base | 32100 | 14.13 | 13.8 | 0.07 |
| t5_large | 32100 | 14.13 | 13.8 | 0.07 |
| t5_small | 32100 | 14.13 | 13.8 | 0.07 |
| text_davinci_003 | 50281 | 1.31 | 1.28 | 0.77 |
| tigerbot_13b_chat_v2 | 60512 | 4.25 | 4.15 | 0.24 |
| tigerbot_70b_chat_v4_4k | 65107 | 4.25 | 4.15 | 0.24 |
| wizardcoder_15b_v1 | 49152 | 2.78 | 2.72 | 0.36 |
| wizardcoder_python_7b_v1 | 32000 | 1.84 | 1.8 | 0.54 |
| wizardlm_7b_v1 | 32000 | 1.84 | 1.8 | 0.54 |
| wizardmath_70b_v1 | 32000 | 1.84 | 1.8 | 0.54 |
| xlm_roberta | 250002 | 3.96 | 3.86 | 0.25 |
| yi_34b | 64000 | 4.17 | 4.07 | 0.24 |
| yi_6b | 64000 | 4.17 | 4.07 | 0.24 |
| yi_vl34b | 64000 | 4.11 | 4.02 | 0.24 |
| zephyr_7b_beta | 32000 | 2.36 | 2.3 | 0.42 |
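
The three metric columns in the table above can be derived from a raw byte count and a raw token count. The sketch below shows one way to do that. The unit convention (bytes in binary units, 1 GB = 1024³ bytes; tokens in decimal units, 1 billion = 1000³) is an assumption inferred from the roughly 1.024 ratio between the `g_bytes/b_tokens` and `t_bytes/t_tokens` columns, not something this README states. The function names and the byte-level `encode` stand-in are hypothetical; in practice `encode` would be a real tokenizer's encode function.

```python
from typing import Callable, Dict, Iterable, Sequence, Tuple

# Assumed unit convention (inferred from the table, not stated in the source).
GIB = 1024**3        # bytes per "gigabyte"
TIB = 1024**4        # bytes per "terabyte"
BILLION = 1000**3    # tokens per "billion"
TRILLION = 1000**4   # tokens per "trillion"


def count_bytes_and_tokens(
    texts: Iterable[str],
    encode: Callable[[str], Sequence[int]],
) -> Tuple[int, int]:
    """Sum UTF-8 byte length and token count over a corpus."""
    n_bytes = 0
    n_tokens = 0
    for text in texts:
        n_bytes += len(text.encode("utf-8"))
        n_tokens += len(encode(text))
    return n_bytes, n_tokens


def compression_metrics(n_bytes: int, n_tokens: int) -> Dict[str, float]:
    """Convert raw counts into the three columns used in the table."""
    return {
        "g_bytes/b_tokens": round((n_bytes / GIB) / (n_tokens / BILLION), 2),
        "t_bytes/t_tokens": round((n_bytes / TIB) / (n_tokens / TRILLION), 2),
        "b_tokens/g_bytes": round((n_tokens / BILLION) / (n_bytes / GIB), 2),
    }


def byte_level_encode(text: str) -> Sequence[int]:
    """Hypothetical stand-in tokenizer: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


if __name__ == "__main__":
    corpus = ["hello world", "分词器"]
    n_bytes, n_tokens = count_bytes_and_tokens(corpus, byte_level_encode)
    print(compression_metrics(n_bytes, n_tokens))
```

With exactly one token per byte, this yields 0.93 for `g_bytes/b_tokens` and 0.91 for `t_bytes/t_tokens`, which is in line with the `byt5_small` row and is the main reason for assuming the unit convention above.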

**Conclusion**

Larger vocabulary sizes generally improve the compression rate, at the cost of more computation and memory.
## Reference
- Getting the most out of your tokenizer for pre-training and domain adaptation
- Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca
- https://huggingface.co/spaces/Xenova/the-tokenizer-playground |