update
- compression_app.py +1 -1
- requirements.txt +1 -1
compression_app.py
CHANGED
@@ -37,7 +37,7 @@ The encoding and decoding process can be formulated as
 - **Lossless** <br>
   Lossless tokenization preserves the exact original text, i.e. `decoded_text = input_text`.

-  - Most lossy tokenizers get many out-of-vocabulary tokens. 👉 Check the [oov of bert-base-uncased](https://huggingface.co/spaces/eson/tokenizer-arena/
+  - Most lossy tokenizers get many out-of-vocabulary tokens. 👉 Check the [oov of bert-base-uncased](https://huggingface.co/spaces/eson/tokenizer-arena/blob/main/stats/compression_rate/google-bert.bert-base-cased%20%40%20cc100.zh-Hans.diff.json).
   - Some other tokenizers have no oov, but still be lossy due to text normalization. For example qwen performs [unicode normalization](https://github.com/huggingface/transformers/blob/v4.42.3/src/transformers/models/qwen2/tokenization_qwen2.py#L338),
     which may bring some [slight difference](https://huggingface.co/spaces/eson/tokenizer-arena/raw/main/stats/compression_rate/Qwen.Qwen1.5-1.8B%20@%20cc100.ja.diff.json) to the reconstructed text.

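The hunk above documents the lossless vs. lossy distinction that the Space measures. As a minimal sketch of the `decoded_text = input_text` round-trip check it describes (not part of this commit; it assumes `transformers` is installed, and the helper name `is_lossless` is hypothetical):

```python
# Minimal sketch: check whether a tokenizer round-trips text exactly,
# i.e. decoded_text == input_text. `is_lossless` is a hypothetical helper,
# not an API of this Space.
from transformers import AutoTokenizer

def is_lossless(tokenizer_name: str, input_text: str) -> bool:
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    token_ids = tokenizer.encode(input_text, add_special_tokens=False)
    decoded_text = tokenizer.decode(token_ids)
    return decoded_text == input_text

if __name__ == "__main__":
    text = "Héllo, 世界!"
    # bert-base-uncased lowercases and maps unseen characters to [UNK],
    # so the round-trip usually fails; byte-level BPE tokenizers such as
    # gpt2 usually reconstruct the input exactly.
    print(is_lossless("bert-base-uncased", text))
    print(is_lossless("gpt2", text))
```

Note that a tokenizer with no out-of-vocabulary tokens can still fail this check: one that applies NFC normalization (as the Qwen2 tokenizer linked in the hunk does) may return slightly different text for inputs containing decomposed Unicode sequences.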
requirements.txt
CHANGED
@@ -1,4 +1,4 @@
-gradio>=4.
+gradio>=4.38.1
 transformers
 sentencepiece
 tiktoken