ๆฅๆฌ่ชใใผใฟใปใใใง train ใใ Tokenizer ใงใ.
ๅไฝใงใฎๅฉ็จใฏๆณๅฎใใฆใใใ, LLaMa Tokenizer ใชใฉใซใใผใธใใฆๅฉ็จใใใฎใๆณๅฎใใฆใใพใ.
Training script
train_jp_tokenizer.py
ใๅ็
งใใ ใใ.
Trained tokenizer
tokenizer-cc100-ja.json
cc100 ja ใใผใฟใปใใใใใฎใพใพ(normalize ใชใฉ้ฉ็จใใใซ) train ใใใใฎ. vocab size 30000.
TODO
- Normalize ใใๆฅๆฌ่ชใใญในใใซๅฏพใใฆ train ใใ
- ใใผใธใใ Tokenizer ใใขใใใญใผใใใ
Inference Providers
NEW
This model isn't deployed by any Inference Provider.
๐
Ask for provider support
HF Inference deployability: The model has no library tag.