---
title: Tokenizer Arena
emoji: ⚡
colorFrom: red
colorTo: gray
sdk: gradio
sdk_version: 3.41.2
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference


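## ss


## TODO

- Search bar


## Statistics


## vocabsize

- Increasing the vocabulary size improves the compression rate; the side effect is higher compute and memory cost (see "Getting the most out of your tokenizer for pre-training and domain adaptation").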
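To make the memory side of the vocabulary-size trade-off concrete, here is a minimal sketch that estimates the size of the token embedding matrix alone for a few vocabulary sizes listed in the Compress Rate table. The hidden size of 4096 and the 2 bytes per parameter (fp16) are illustrative assumptions, not properties of the listed models, and the helper names are hypothetical.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Number of parameters in one vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size


HIDDEN_SIZE = 4096     # assumed for illustration only
BYTES_PER_PARAM = 2    # assumed fp16 storage


def embedding_megabytes(vocab_size: int,
                        hidden_size: int = HIDDEN_SIZE,
                        bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Approximate storage of the embedding matrix in megabytes (10**6 bytes)."""
    return embedding_params(vocab_size, hidden_size) * bytes_per_param / 1e6


# Vocabulary sizes taken from the Compress Rate table.
for name, vocab_size in [("llama", 32000), ("gpt_4", 100277), ("gemma_7b", 256000)]:
    print(f"{name:10s} vocab={vocab_size:>7d}  embedding ~ {embedding_megabytes(vocab_size):8.1f} MB")
```

The cost grows linearly with the vocabulary size, and a model with an untied output projection of the same shape pays it twice.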


https://huggingface.co/spaces/yenniejun/tokenizers-languages


## gradio app

- https://arena.lmsys.org/


## lang


## number


## diff






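## Compress Rate

### Introduction

We tokenize text from the CC-100 corpus with each tokenizer; the table in this section reports each tokenizer's vocabulary size and compression rate.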
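As a rough illustration of how such a measurement can be made, the sketch below counts UTF-8 bytes and tokens over a list of texts and reports bytes per token. It is not the script used to produce the table: `compression_stats` and `whitespace_encode` are hypothetical names, and the whitespace splitter is only a toy stand-in for a real tokenizer's encode function.

```python
from typing import Callable, Iterable, List


def compression_stats(texts: Iterable[str], encode: Callable[[str], List]) -> dict:
    """Accumulate UTF-8 byte and token counts over texts for one tokenizer."""
    n_bytes = 0
    n_tokens = 0
    for text in texts:
        n_bytes += len(text.encode("utf-8"))
        n_tokens += len(encode(text))
    return {
        "n_bytes": n_bytes,
        "n_tokens": n_tokens,
        # More bytes per token means the tokenizer compresses the text better.
        "bytes_per_token": n_bytes / n_tokens if n_tokens else float("nan"),
    }


def whitespace_encode(text: str) -> List[str]:
    """Toy stand-in for a tokenizer: split on whitespace."""
    return text.split()


sample = ["hello world", "分词器 压缩率"]
stats = compression_stats(sample, whitespace_encode)
print(stats)
```

Any callable that maps a string to a sequence of tokens can be passed as `encode`, so the same loop can compare several tokenizers on the same texts.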

| tokenizer | vocab_size | g_bytes/b_tokens | t_bytes/t_tokens | b_tokens/g_bytes |
|:----------------------------|-------------:|-------------------:|-------------------:|-------------------:|
| amber | 32000 | 1.84 | 1.8 | 0.54 |
| aya_101 | 250100 | 3.89 | 3.79 | 0.26 |
| baichuan | 64000 | 3.92 | 3.82 | 0.26 |
| baichuan2 | 125696 | 4.53 | 4.42 | 0.22 |
| bert_base_cased | 28996 | 2.73 | 2.66 | 0.37 |
| bert_base_chinese | 21128 | 2.74 | 2.67 | 0.37 |
| bert_base_uncased | 30522 | 2.73 | 2.67 | 0.37 |
| bloom | 250680 | 4.28 | 4.18 | 0.23 |
| byt5_small | 256 | 0.93 | 0.91 | 1.08 |
| character_glm_6b | 64794 | 4.2 | 4.1 | 0.24 |
| chatglm2_6b | 64794 | 4.2 | 4.1 | 0.24 |
| chatglm3_6b | 64798 | 4.2 | 4.1 | 0.24 |
| chatglm_6b | 150344 | 4.65 | 4.54 | 0.22 |
| chatyuan_large_v2 | 32128 | 4.34 | 4.24 | 0.23 |
| chinese_llama | 49953 | 3.93 | 3.84 | 0.25 |
| chinese_llama2 | 55296 | 3.92 | 3.83 | 0.26 |
| code_davinci_002 | 50281 | 1.31 | 1.28 | 0.77 |
| crystal_coder | 32000 | 1.86 | 1.81 | 0.54 |
| deepseek_coder_33b_instruct | 32000 | 3.4 | 3.32 | 0.29 |
| deepseek_llm_7b_base | 100000 | 4.05 | 3.96 | 0.25 |
| falcon_180b | 65024 | 2.18 | 2.13 | 0.46 |
| falcon_7b | 65024 | 2.18 | 2.13 | 0.46 |
| fastchat_t5_3b | 32000 | 13.7 | 13.38 | 0.07 |
| flan_t5_base | 32100 | 14.13 | 13.8 | 0.07 |
| gemma_7b | 256000 | 3.82 | 3.73 | 0.26 |
| gpt2 | 50257 | 1.31 | 1.28 | 0.77 |
| gpt2_chinese | 21128 | 2.73 | 2.66 | 0.37 |
| gpt_35_turbo | 100277 | 2.26 | 2.21 | 0.44 |
| gpt_4 | 100277 | 2.26 | 2.21 | 0.44 |
| gpt_nexo_20b | 50254 | 2.01 | 1.96 | 0.5 |
| internlm2_chat_7b | 92544 | 4.23 | 4.13 | 0.24 |
| internlm2_math_7b | 92544 | 4.23 | 4.13 | 0.24 |
| internlm_chat_7b | 103168 | 4.23 | 4.14 | 0.24 |
| internlm_xcomposer_7b | 103168 | 4.23 | 4.14 | 0.24 |
| kplug | 10261 | 2.72 | 2.65 | 0.37 |
| llama | 32000 | 1.84 | 1.8 | 0.54 |
| llama2 | 32000 | 1.84 | 1.8 | 0.54 |
| mistral_7b | 32000 | 2.36 | 2.3 | 0.42 |
| mixtral_8_7b | 32000 | 2.36 | 2.3 | 0.42 |
| mobilebert_uncased | 30522 | 2.73 | 2.67 | 0.37 |
| moss | 106029 | 4.4 | 4.3 | 0.23 |
| mt5_large | 250100 | 3.89 | 3.79 | 0.26 |
| olmo_7b | 50280 | 2.01 | 1.96 | 0.5 |
| orion_14b_chat | 84608 | 4.63 | 4.52 | 0.22 |
| phi_1 | 50257 | 1.31 | 1.28 | 0.77 |
| phi_2 | 50257 | 1.31 | 1.28 | 0.77 |
| pko_t5_large | 50258 | 0.97 | 0.95 | 1.03 |
| prompt_clue | 32128 | 4.34 | 4.24 | 0.23 |
| qwen1_5_14b_chat | 151643 | 4.16 | 4.06 | 0.24 |
| qwen_1_8b_chat | 151851 | 4.16 | 4.06 | 0.24 |
| qwen_72b_chat | 151851 | 4.16 | 4.06 | 0.24 |
| qwen_7b_chat | 151851 | 4.16 | 4.06 | 0.24 |
| roberta_chinese_clue | 8021 | 2.7 | 2.64 | 0.37 |
| skywork_13b_base | 65519 | 3.69 | 3.61 | 0.27 |
| skywork_13b_math | 65519 | 3.69 | 3.61 | 0.27 |
| solar_10_7b | 32000 | 2.36 | 2.3 | 0.42 |
| starchat_alpha | 49152 | 2.78 | 2.72 | 0.36 |
| switch_c_2048 | 32100 | 14.13 | 13.8 | 0.07 |
| t5_base | 32100 | 14.13 | 13.8 | 0.07 |
| t5_large | 32100 | 14.13 | 13.8 | 0.07 |
| t5_small | 32100 | 14.13 | 13.8 | 0.07 |
| text_davinci_003 | 50281 | 1.31 | 1.28 | 0.77 |
| tigerbot_13b_chat_v2 | 60512 | 4.25 | 4.15 | 0.24 |
| tigerbot_70b_chat_v4_4k | 65107 | 4.25 | 4.15 | 0.24 |
| wizardcoder_15b_v1 | 49152 | 2.78 | 2.72 | 0.36 |
| wizardcoder_python_7b_v1 | 32000 | 1.84 | 1.8 | 0.54 |
| wizardlm_7b_v1 | 32000 | 1.84 | 1.8 | 0.54 |
| wizardmath_70b_v1 | 32000 | 1.84 | 1.8 | 0.54 |
| xlm_roberta | 250002 | 3.96 | 3.86 | 0.25 |
| yi_34b | 64000 | 4.17 | 4.07 | 0.24 |
| yi_6b | 64000 | 4.17 | 4.07 | 0.24 |
| yi_vl34b | 64000 | 4.11 | 4.02 | 0.24 |
| zephyr_7b_beta | 32000 | 2.36 | 2.3 | 0.42 |
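The column names are not defined in this README. One reading that appears consistent with the reported numbers is that `g_bytes` and `t_bytes` are binary gigabytes and terabytes (2^30 and 2^40 bytes), while `b_tokens` and `t_tokens` are billions and trillions of tokens. The sketch below works through that assumption for a hypothetical corpus sized to match the `amber` row; `table_columns` is a hypothetical helper, not code from this repository.

```python
GIB = 2**30   # assumed: g_bytes counts binary gigabytes
TIB = 2**40   # assumed: t_bytes counts binary terabytes


def table_columns(n_bytes: int, n_tokens: int) -> dict:
    """Derive the three ratio columns from raw byte and token counts."""
    g_bytes = n_bytes / GIB
    t_bytes = n_bytes / TIB
    b_tokens = n_tokens / 1e9    # billions of tokens
    t_tokens = n_tokens / 1e12   # trillions of tokens
    return {
        "g_bytes/b_tokens": round(g_bytes / b_tokens, 2),
        "t_bytes/t_tokens": round(t_bytes / t_tokens, 2),
        "b_tokens/g_bytes": round(b_tokens / g_bytes, 2),
    }


# Hypothetical corpus: about 1.84 GiB of text encoded into one billion tokens.
row = table_columns(n_bytes=int(1.84 * GIB), n_tokens=10**9)
print(row)
```

Under this reading the first two columns differ only by a constant factor of 1.024 (2^10 / 10^3), and the third column is the reciprocal of the first, so a higher value in the first two columns means better compression.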


### Conclusion

Larger vocabulary sizes tend to give a higher compression rate.



## Reference

- Getting the most out of your tokenizer for pre-training and domain adaptation
- Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca
- https://huggingface.co/spaces/Xenova/the-tokenizer-playground
