Vocabulary size: 250680, according to https://huggingface.co/bigscience/bloom#preprocessing
"vocab_size": 250880
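The two numbers differ by 200. A small sketch of that gap, assuming (the notes do not say this) that the tokenizer's ids run contiguously from 0 and that `vocab_size` is the number of ids the model accepts. `ids_without_token` is a hypothetical helper, not part of any library:

```python
TOKENIZER_VOCAB_SIZE = 250680  # figure quoted from the model card
CONFIG_VOCAB_SIZE = 250880     # the "vocab_size" value noted above


def ids_without_token(tokenizer_size: int, config_size: int) -> range:
    """Ids allowed by config_size that the tokenizer never produces.

    Assumes tokenizer ids are contiguous and start at 0.
    """
    return range(tokenizer_size, config_size)


spare = ids_without_token(TOKENIZER_VOCAB_SIZE, CONFIG_VOCAB_SIZE)
print(len(spare), spare.start, spare[-1])
```

If the assumption holds, these ids would simply never appear in encoded text.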
## OOV
Some spaces are not encoded; see `test_oov.py` for details.
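One way to look for this kind of loss is a round-trip check: encode, decode, and compare with the input. The sketch below is not the contents of `test_oov.py`. `find_round_trip_failures`, `toy_encode` and `toy_decode` are hypothetical names, and the toy tokenizer only stands in for a real one so the example runs without downloads:

```python
from typing import Callable, Dict, Iterable, List


def find_round_trip_failures(
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    samples: Iterable[str],
) -> List[str]:
    """Return the samples that do not survive encode -> decode unchanged."""
    return [s for s in samples if decode(encode(s)) != s]


# Toy stand-in tokenizer: it splits on whitespace, so leading spaces and
# runs of spaces are dropped. A real tokenizer would be plugged in instead.
_vocab: Dict[str, int] = {}


def toy_encode(text: str) -> List[int]:
    return [_vocab.setdefault(word, len(_vocab)) for word in text.split()]


def toy_decode(ids: List[int]) -> str:
    inverse = {i: w for w, i in _vocab.items()}
    return " ".join(inverse[i] for i in ids)


failures = find_round_trip_failures(toy_encode, toy_decode, ["a b", "a  b", " a"])
print(failures)
```

With a real tokenizer, `encode` and `decode` would be thin wrappers around its own encode and decode methods.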
## Chinese vocabulary
How many ids does one Chinese character map to?
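A rough bound can be worked out without the tokenizer. The config in these notes uses a `ByteLevel` pre-tokenizer, so a character is first turned into its UTF-8 bytes. Assuming every single byte has its own id (usual for byte-level BPE, but not checked here), one character takes at least 1 id and at most as many ids as it has bytes. `utf8_byte_count` is a hypothetical helper:

```python
def utf8_byte_count(ch: str) -> int:
    """Number of UTF-8 bytes in ch, i.e. the upper bound on its id count."""
    return len(ch.encode("utf-8"))


# Common Chinese characters, plus one from CJK Extension B (U+20000).
for ch in ["中", "文", "\U00020000"]:
    print(repr(ch), utf8_byte_count(ch))
```

The exact count for a given character depends on which byte merges exist in the vocabulary, so it has to be measured with the real tokenizer.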
##
```
"pre_tokenizer": {
  "type": "Sequence",
  "pretokenizers": [
    {
      "type": "Split",
      "pattern": {
        "Regex": " ?[^(\\s|[.,!?…。,、।۔،])]+"
      },
      "behavior": "Isolated",
      "invert": false
    },
    {
      "type": "ByteLevel",
      "add_prefix_space": false,
      "trim_offsets": true,
      "use_regex": false
    }
  ]
},
"post_processor": {
  "type": "ByteLevel",
  "add_prefix_space": true,
  "trim_offsets": false,
  "use_regex": false
},
```
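To see what the `Split` step does, here is a rough pure-Python emulation. It does not call the `tokenizers` library. It rests on two assumptions: the nested brackets in the original regex act as a single negated character class (so `(`, `)`, `|`, whitespace and the listed punctuation are all excluded), and `"behavior": "Isolated"` keeps both the matches and the text between them as separate pieces. `PATTERN` is a flattened rewrite of the regex for Python's `re`, and `split_isolated` is a hypothetical helper:

```python
import re
from typing import List

# Flattened version of the Split regex for Python's re module (assumption:
# equivalent to the nested character class in the original pattern).
PATTERN = re.compile(r" ?[^\s().,!?…。,、।۔،|]+")


def split_isolated(text: str) -> List[str]:
    """Split text into regex matches and the gaps between them, in order."""
    pieces: List[str] = []
    pos = 0
    for match in PATTERN.finditer(text):
        if match.start() > pos:
            pieces.append(text[pos:match.start()])  # unmatched gap
        pieces.append(match.group())                # matched run
        pos = match.end()
    if pos < len(text):
        pieces.append(text[pos:])                   # trailing gap
    return pieces


print(split_isolated("Hello, world!"))
print(split_isolated("你好,世界。"))
```

Under these assumptions a word keeps at most one leading space, and punctuation or extra spaces become pieces of their own before the `ByteLevel` step runs.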