Japanese

A tokenizer trained on a Japanese dataset.

ๅ˜ไฝ“ใงใฎๅˆฉ็”จใฏๆƒณๅฎšใ—ใฆใŠใ‚‰ใš, LLaMa Tokenizer ใชใฉใซใƒžใƒผใ‚ธใ—ใฆๅˆฉ็”จใ™ใ‚‹ใฎใ‚’ๆƒณๅฎšใ—ใฆใ„ใพใ™.

Training script

See train_jp_tokenizer.py.
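A minimal sketch of what such a training setup might look like, assuming the Hugging Face `tokenizers` library with a byte-level BPE model; the actual script may differ, and the in-memory corpus here stands in for streaming the cc100 ja dataset:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer

# Illustrative corpus; the real script would stream the cc100 ja dataset.
corpus = [
    "日本語のテキストです。",
    "トークナイザを学習します。",
]

# Byte-level BPE with no normalizer, matching the "as-is" training
# described in this card.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = ByteLevel()
trainer = BpeTrainer(vocab_size=30000, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

tokenizer.save("tokenizer-cc100-ja.json")
```

On a corpus this small the learned vocabulary will be far below the 30000-token target; the `vocab_size` cap only binds on a real dataset.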

Trained tokenizer

  • tokenizer-cc100-ja.json — trained on the cc100 ja dataset as-is (i.e. without applying normalization, etc.). Vocab size: 30000.
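The JSON file can be loaded directly with the `tokenizers` library. To keep this sketch self-contained it first trains and saves a tiny stand-in file; in practice you would download the released tokenizer-cc100-ja.json and load it the same way:

```python
import os
import tempfile

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Stand-in for the released file: train a tiny tokenizer and save it,
# so the loading step below can run anywhere.
tiny = Tokenizer(BPE(unk_token="[UNK]"))
tiny.train_from_iterator(["日本語のテキストです。"], BpeTrainer(special_tokens=["[UNK]"]))
path = os.path.join(tempfile.mkdtemp(), "tokenizer-cc100-ja.json")
tiny.save(path)

# Loading the trained tokenizer from its JSON file.
tokenizer = Tokenizer.from_file(path)
encoding = tokenizer.encode("日本語")
print(encoding.tokens)
```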

TODO

  • Train on normalized Japanese text
  • Upload the merged tokenizer
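For the first TODO item, a normalizer could be attached to the tokenizer before training. NFKC is a common choice for Japanese text (it folds full-width/half-width variants), though whether NFKC is the intended normalization is an assumption:

```python
from tokenizers import Tokenizer, normalizers
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Attach NFKC normalization; every input is normalized before
# pre-tokenization, both at training time and at encoding time.
tokenizer.normalizer = normalizers.NFKC()

# Full-width "Ａｂｃ" is folded to ASCII "Abc".
print(tokenizer.normalizer.normalize_str("Ａｂｃ"))
```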