WIP: frozen turbo encoder + 2 decoder layers. Trained for 2^19 steps at batch size 8 (~160 hours on a 3060). Almost certainly undertrained.
Goals
- Japanese transcription
- Focus on anime adjacent domain
- No hallucination
- Drop-in replacement (trained 50% with prompt, 25% with <|notimestamps|>)
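A rough sketch of what the prompt / notimestamps mix in the last goal could look like when building decoder prefixes. The actual training code is not shown here, so everything below is an assumption: the helper names `sample_flags` and `build_prefix` are hypothetical, the two draws are assumed to be independent, and the prompt is kept as a plain string instead of token ids. Only the special tokens follow the standard Whisper layout.

```python
import random

# Hypothetical rates, read off "trained 50% with prompt, 25% notimestamps".
PROMPT_RATE = 0.5
NO_TIMESTAMPS_RATE = 0.25


def sample_flags(rng: random.Random) -> tuple[bool, bool]:
    """Draw (use_prompt, no_timestamps); the two draws are assumed independent."""
    use_prompt = rng.random() < PROMPT_RATE
    no_timestamps = rng.random() < NO_TIMESTAMPS_RATE
    return use_prompt, no_timestamps


def build_prefix(prompt: str, use_prompt: bool, no_timestamps: bool) -> list[str]:
    """Return a Whisper-style decoder prefix as a list of token strings."""
    prefix = []
    if use_prompt and prompt:
        # Previous-context prompt goes before <|startoftranscript|>.
        prefix += ["<|startofprev|>", prompt]
    prefix += ["<|startoftranscript|>", "<|ja|>", "<|transcribe|>"]
    if no_timestamps:
        # Without this token the decoder is expected to emit timestamp tokens.
        prefix.append("<|notimestamps|>")
    return prefix


rng = random.Random(0)
use_prompt, no_timestamps = sample_flags(rng)
print(build_prefix("前の発話", use_prompt, no_timestamps))
```

Training on all four combinations is what would let the model be used with or without a prompt and with or without timestamps at inference time.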
Acknowledgements
- Train sets: OOPPEENN, Reazon, Common Voice 19, 小虫哥_, deepghs
- Validation sets: simon3000, grider-withourai, kotoba-tech
- Test sets: KitsuneX07, TEDxJP
Test set
| | air | himanatsu | kanon | proseka | sakuuta | tedxjp |
---|---|---|---|---|---|---|
turbo_b1 | 25.8 | 60.6 | 22.5 | 13.1 | 21.1 | 10.8 |
turbo_b5 | 20.9 | 48.3 | 19.1 | 11.8 | 18.9 | |
turbo_b1_nt | 25.8 | 61.6 | 23.1 | 13.6 | 20.4 | |
turbo_b5_nt | 17.1 | 25.8 | 23.5 | 9.4 | 12.5 | |
anime_b1 | 15.9 | 20.2 | 12.8 | 8.9 | 10.9 | 41.8 |
anime_b5 | 14.4 | 18.3 | 12.6 | 8.6 | 10.0 | |
anime_b1_n5 | 15.0 | 18.4 | 12.7 | 8.9 | 10.1 | |
anime_b5_n5 | 14.4 | 18.1 | 12.5 | 8.6 | 10.0 | |
anime_b1_nt | 14.4 | 18.7 | 11.4 | 8.3 | 10.1 | |
anime_b5_nt | 13.4 | 17.5 | 11.4 | 8.1 | 9.6 | |
b1 | 15.6 | 20.1 | 11.8 | 8.8 | 10.5 | 11.5 |
b5 | 15.2 | 19.8 | 11.6 | 8.8 | 10.7 | |
b1_nt | 15.6 | 20.1 | 11.9 | 8.7 | 10.5 | |
b5_nt | 15.3 | 19.4 | 11.8 | 8.6 | 10.5 | |
- b1: beam_size=1
- b5: beam_size=5
- n5: no_repeat_ngram_size=5
- nt: <|notimestamps|>
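The suffixes in the row names can be read mechanically. Below is a minimal sketch that turns a row name into decoding keyword arguments. The function name `decode_options` is hypothetical, and mapping `nt` to `without_timestamps=True` assumes faster-whisper style `transcribe` arguments; the tool used for the short-form numbers is not stated here.

```python
import re


def decode_options(config: str) -> dict:
    """Map a row name such as 'anime_b5_n5' to decoding keyword arguments."""
    options = {}
    for part in config.split("_"):
        if m := re.fullmatch(r"b(\d+)", part):
            options["beam_size"] = int(m.group(1))
        elif m := re.fullmatch(r"n(\d+)", part):
            options["no_repeat_ngram_size"] = int(m.group(1))
        elif part == "nt":
            # Assumed equivalent of decoding with <|notimestamps|>.
            options["without_timestamps"] = True
        # Any other part (e.g. 'turbo', 'anime') is treated as the model name.
    return options


for row in ("turbo_b1_nt", "anime_b5_n5", "b1"):
    print(row, decode_options(row))
```

A row with no model prefix (`b1`, `b5_nt`) refers to this model.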
Anime sets: equal to or worse than anime-whisper, better than turbo (out of domain).
tedxjp: 273 videos from TEDxJP-10K with YouTube subtitles, used to test long-form transcription with faster-whisper.
Slightly worse than turbo. Kotoba and anime-whisper are not trained for long form.
Validation set
Used only for hyperparameter optimization.
| | bluearchive | genshin5.1 | nekopara | genshin | starrail | reazon | jsut | cv8 | cv19 | jsl | loopers | tedx10 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
large-v3_b1 | 12.2 | 10.1 | 70.8 | 11.9 | 10.0 | 16.0 | 7.1 | 8.6 | 15.1 | 12.2 | | 7.7 |
large-v3_b5 | 11.0 | 10.0 | 63.7 | 11.6 | 9.8 | 14.1 | 7.1 | 8.3 | 14.8 | 11.0 | | |
large-v2_b1 | 14.4 | | 103.4 | 18.3 | 12.9 | 31.6 | 8.2 | 9.8 | 18.5 | 18.0 | | 8.0 |
large-v2_b5 | 12.7 | | 100.9 | 16.8 | 12.9 | 28.0 | 8.0 | 9.5 | 17.5 | 16.2 | | |
turbo_b1 | 12.8 | 11.1 | 72.3 | 11.6 | 11.1 | 11.6 | 7.3 | 9.6 | 17.5 | 12.0 | 28.0 | 7.9 |
turbo_b5 | 10.4 | 10.0 | 64.3 | 12.0 | 10.2 | 10.4 | 7.2 | 9.1 | 16.6 | 10.8 | 20.2 | 8.8 |
kotoba-v1_b1 | 8.5 | 9.4 | 27.8 | 9.9 | 10.3 | 12.7 | 8.4 | 9.5 | 17.1 | 12.2 | 34.9 | |
kotoba-v1_b5 | 8.4 | 9.3 | 27.8 | 9.8 | 10.3 | 12.3 | 8.3 | 9.3 | 16.7 | 12.1 | ||
kotoba-v2_b1 | 8.5 | 9.6 | 27.7 | 10.2 | 10.4 | 11.6 | 8.2 | 9.2 | 16.9 | 12.3 | 25.3 | |
kotoba-v2_b5 | 8.6 | 9.5 | 27.7 | 10.1 | 10.5 | 11.4 | 8.2 | 9.0 | 16.6 | 12.2 | ||
kotoba-bi_b1 | 8.9 | 10.1 | 28.1 | 10.5 | 10.8 | 17.5 | 9.1 | 9.8 | 17.5 | 12.7 | 27.8 | |
kotoba-bi_b5 | 8.8 | 10.0 | 28.0 | 10.5 | 10.7 | 17.1 | 9.1 | 9.6 | 17.2 | 12.6 | ||
anime_b1 | 7.5 | 11.5 | 24.7 | 11.0 | 11.2 | 30.1 | 8.0 | 10.0 | 19.1 | 9.0 | 18.9 | 32.0 |
anime_b5 | 7.2 | 10.4 | 22.0 | 10.3 | 10.4 | 26.6 | 7.8 | 9.8 | 18.8 | 8.5 | 15.3 | 51.8 |
b1 | 6.9 | 6.3 | 22.8 | 6.7 | 7.4 | 16.2 | 7.1 | 8.9 | 17.1 | 8.5 | 14.7 | 8.2 |
b5 | 7.5 | 6.2 | 22.8 | 6.6 | 7.3 | 15.7 | 7.0 | 8.7 | 17.0 | 8.5 | 14.5 | 9.1 |
- bluearchive.wiki: beam 5 is worse due to extra use of kana. Learnt from the MiHoYo games?
- genshin5.1: Trained on 5.0, new audio from 5.1, possible minor overlap.
- nekopara: Hallucination test. anime-whisper would be better if not for increased hallucination. The OpenAI models are unusable.
- genshin/starrail: Mostly in the train set.
- reazon: Significantly higher CER from transcribing background/secondary audio.
- jsut: Surprisingly good?
- cv8: cv19 train includes some of cv8 test.
- cv19: No contamination, struggles with accents.
- jsl: Anime set.
- loopers: Anime set, has hallucination prone audio.
- tedx10: 10-video subset of tedxjp. See the comments under the test set. Here b1 = batched and b5 = sequential, both with beam_size=1, temperature=0, condition_on_previous_text=False.
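The reazon note mentions CER, so the table values are presumably character error rates in percent. A minimal sketch of that metric follows, assuming plain Levenshtein distance over characters divided by the reference length. Any text normalization applied before scoring (punctuation, whitespace, kana/kanji variants) is not stated here and is left out.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


print(cer("こんにちは", "こんばんは"))  # two substitutions out of five characters
print(cer("はい", "はいはいはい"))      # repeated output: four insertions
```

Because insertions count against the reference length, repeated or hallucinated output can push CER above 100, which is how a score such as 103.4 on nekopara can arise.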