Can I train this model and build a tokenizer with Japanese datasets?

#3
by Nguyen667201 - opened

Can I train this model and build a tokenizer with Japanese datasets?
I get the following error when I try running process_asr_text_tokenizer.py:

  File "/home/team_voice/pdnguyen/finetune-fast-conformer/process_asr_text_tokenizer.py", line 380, in <module>
    main()
  File "/home/team_voice/pdnguyen/finetune-fast-conformer/process_asr_text_tokenizer.py", line 354, in main
    tokenizer_path = __process_data(
  File "/home/team_voice/pdnguyen/finetune-fast-conformer/process_asr_text_tokenizer.py", line 291, in __process_data
    tokenizer_path, vocab_path = create_spt_model(
  File "/home/team_voice/miniconda3/envs/training-asr/lib/python3.10/site-packages/nemo/collections/common/tokenizers/sentencepiece_tokenizer.py", line 374, in create_spt_model
    sentencepiece.SentencePieceTrainer.Train(cmd)
  File "/home/team_voice/miniconda3/envs/training-asr/lib/python3.10/site-packages/sentencepiece/__init__.py", line 1047, in Train
    SentencePieceTrainer._Train(arg=arg, **kwargs)
  File "/home/team_voice/miniconda3/envs/training-asr/lib/python3.10/site-packages/sentencepiece/__init__.py", line 1003, in _Train
    return SentencePieceTrainer._TrainFromString(arg)
  File "/home/team_voice/miniconda3/envs/training-asr/lib/python3.10/site-packages/sentencepiece/__init__.py", line 981, in _TrainFromString
    return _sentencepiece.SentencePieceTrainer__TrainFromString(arg)
RuntimeError: Internal: src/trainer_interface.cc(431) [!sentences_.empty()] "
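For context: the `!sentences_.empty()` check in SentencePiece fires when the trainer reads an empty training corpus, which usually means the text file assembled from the manifests contains no usable lines (wrong path, wrong manifest field, or a filtering step that dropped everything) rather than anything specific to Japanese. A minimal sketch of a pre-check, assuming a hypothetical `data_file` path standing in for whatever text file your tokenizer script feeds to SentencePiece:

```python
# Count the non-blank lines SentencePiece would train on; a count of 0
# reproduces the "[!sentences_.empty()]" failure, so check this before
# launching tokenizer training. The demo corpus below is illustrative.
from pathlib import Path

def count_training_sentences(data_file: str) -> int:
    """Return the number of non-blank lines in the tokenizer training file."""
    text = Path(data_file).read_text(encoding="utf-8")
    return sum(1 for line in text.splitlines() if line.strip())

if __name__ == "__main__":
    import os
    import tempfile

    # Write a tiny Japanese sample to a temp file to show the check works
    # on non-ASCII text.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as f:
        f.write("こんにちは世界\n今日はいい天気ですね\n")
        path = f.name

    n = count_training_sentences(path)
    os.unlink(path)
    print(n)  # → 2; if this prints 0 for your real file, fix the data first
```

If the file is non-empty and the error persists, it is worth checking how the training text is extracted from the manifests; for Japanese corpora the SentencePiece `--character_coverage` setting is also commonly lowered (e.g. to around 0.9995) because of the large character set, though that affects vocabulary quality, not this particular error.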
