WIP: frozen turbo encoder + 2 decoder layers. Trained for 2^19 steps at batch size 8 (~160 hours on a 3060). Almost certainly undertrained.

Goals

  • Japanese transcription
  • Focus on anime-adjacent domains
  • No hallucination
  • Drop-in replacement (trained with a prompt 50% of the time and with <|notimestamps|> 25% of the time)
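
The 50% / 25% mix in the last goal can be pictured as a per-example coin flip when the decoder input is built. The sketch below is not the training code: it assumes the two choices are drawn independently, it uses Whisper's standard special tokens as plain strings, and `build_decoder_prefix` and its arguments are hypothetical names.

```python
import random

# Whisper special tokens, written as strings for illustration.
SOT_PREV = "<|startofprev|>"
SOT = "<|startoftranscript|>"
LANG_JA = "<|ja|>"
TRANSCRIBE = "<|transcribe|>"
NO_TS = "<|notimestamps|>"

def build_decoder_prefix(prompt_text, rng, p_prompt=0.5, p_notimestamps=0.25):
    """Return the tokens that precede the transcript for one training example.

    Assumption: the prompt and <|notimestamps|> choices are independent draws.
    """
    prefix = []
    if prompt_text and rng.random() < p_prompt:
        # Previous-context prompt goes before <|startoftranscript|>.
        prefix += [SOT_PREV, prompt_text]
    prefix += [SOT, LANG_JA, TRANSCRIBE]
    if rng.random() < p_notimestamps:
        prefix.append(NO_TS)
    return prefix

rng = random.Random(0)
samples = [build_decoder_prefix("前の文", rng) for _ in range(10_000)]
prompt_rate = sum(SOT_PREV in s for s in samples) / len(samples)
nt_rate = sum(NO_TS in s for s in samples) / len(samples)
print(f"prompt: {prompt_rate:.2f}, notimestamps: {nt_rate:.2f}")
```

Mixing the formats this way is what would let a single checkpoint be called with or without a prompt and with or without timestamps.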

Acknowledgements

  • Train sets: OOPPEENN, Reazon, Common Voice 19, 小虫哥_, deepghs
  • Validation sets: simon3000, grider-withourai, kotoba-tech
  • Test sets: KitsuneX07, TEDxJP

Test set

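| Config | air | himanatsu | kanon | proseka | sakuuta | tedxjp |
|---|---|---|---|---|---|---|
| turbo_b1 | 25.8 | 60.6 | 22.5 | 13.1 | 21.1 | 10.8 |
| turbo_b5 | 20.9 | 48.3 | 19.1 | 11.8 | 18.9 | - |
| turbo_b1_nt | 25.8 | 61.6 | 23.1 | 13.6 | 20.4 | - |
| turbo_b5_nt | 17.1 | 25.8 | 23.5 | 9.4 | 12.5 | - |
| anime_b1 | 15.9 | 20.2 | 12.8 | 8.9 | 10.9 | 41.8 |
| anime_b5 | 14.4 | 18.3 | 12.6 | 8.6 | 10.0 | - |
| anime_b1_n5 | 15.0 | 18.4 | 12.7 | 8.9 | 10.1 | - |
| anime_b5_n5 | 14.4 | 18.1 | 12.5 | 8.6 | 10.0 | - |
| anime_b1_nt | 14.4 | 18.7 | 11.4 | 8.3 | 10.1 | - |
| anime_b5_nt | 13.4 | 17.5 | 11.4 | 8.1 | 9.6 | - |
| b1 | 15.6 | 20.1 | 11.8 | 8.8 | 10.5 | 11.5 |
| b5 | 15.2 | 19.8 | 11.6 | 8.8 | 10.7 | - |
| b1_nt | 15.6 | 20.1 | 11.9 | 8.7 | 10.5 | - |
| b5_nt | 15.3 | 19.4 | 11.8 | 8.6 | 10.5 | - |
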
  • b1: beam_size=1
  • b5: beam_size=5
  • n5: no_repeat_ngram_size=5
  • nt: <|notimestamps|>
  • Anime sets: equal to or worse than anime-whisper, better than turbo (out of domain).
  • tedxjp: 273 videos from TEDxJP-10K with YouTube subtitles, used for long-form evaluation with faster-whisper.
  • Long form is slightly worse than turbo. Kotoba/anime-whisper are not trained for long form.
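
Each row label combines a model name with the decoding suffixes listed above. As a reading aid, the sketch below expands a label into decoding options. `parse_label` is a hypothetical helper, and mapping `nt` to `without_timestamps=True` assumes a faster-whisper style `transcribe` call; the scripts actually used for the evaluation are not shown here.

```python
import re

def parse_label(label):
    """Split a row label such as 'anime_b5_n5' into (model, options).

    A label with no model part (e.g. 'b1_nt') refers to this model and
    is returned with an empty model name.
    """
    model_parts, options = [], {}
    for part in label.split("_"):
        if m := re.fullmatch(r"b(\d+)", part):
            options["beam_size"] = int(m.group(1))
        elif m := re.fullmatch(r"n(\d+)", part):
            options["no_repeat_ngram_size"] = int(m.group(1))
        elif part == "nt":
            # Assumption: <|notimestamps|> corresponds to without_timestamps=True.
            options["without_timestamps"] = True
        else:
            model_parts.append(part)
    return "_".join(model_parts), options

for label in ["turbo_b1", "anime_b5_n5", "b5_nt"]:
    print(label, "->", parse_label(label))
```

Under the same assumption, the resulting dictionary could be passed as keyword arguments to a faster-whisper `transcribe` call.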

Validation set

Used only for hyperparameter optimization.

bluearchive genshin5.1 nekopara genshin starrail reazon jsut cv8 cv19 jsl loopers tedx10
large-v3_b1 12.2 10.1 70.8 11.9 10.0 16.0 7.1 8.6 15.1 12.2 7.7
large-v3_b5 11.0 10.0 63.7 11.6 9.8 14.1 7.1 8.3 14.8 11.0
large-v2_b1 14.4 103.4 18.3 12.9 31.6 8.2 9.8 18.5 18.0 8.0
large-v2_b5 12.7 100.9 16.8 12.9 28.0 8.0 9.5 17.5 16.2
turbo_b1 12.8 11.1 72.3 11.6 11.1 11.6 7.3 9.6 17.5 12.0 28.0 7.9
turbo_b5 10.4 10.0 64.3 12.0 10.2 10.4 7.2 9.1 16.6 10.8 20.2 8.8
kotoba-v1_b1 8.5 9.4 27.8 9.9 10.3 12.7 8.4 9.5 17.1 12.2 34.9
kotoba-v1_b5 8.4 9.3 27.8 9.8 10.3 12.3 8.3 9.3 16.7 12.1
kotoba-v2_b1 8.5 9.6 27.7 10.2 10.4 11.6 8.2 9.2 16.9 12.3 25.3
kotoba-v2_b5 8.6 9.5 27.7 10.1 10.5 11.4 8.2 9.0 16.6 12.2
kotoba-bi_b1 8.9 10.1 28.1 10.5 10.8 17.5 9.1 9.8 17.5 12.7 27.8
kotoba-bi_b5 8.8 10.0 28.0 10.5 10.7 17.1 9.1 9.6 17.2 12.6
anime_b1 7.5 11.5 24.7 11.0 11.2 30.1 8.0 10.0 19.1 9.0 18.9 32.0
anime_b5 7.2 10.4 22.0 10.3 10.4 26.6 7.8 9.8 18.8 8.5 15.3 51.8
b1 6.9 6.3 22.8 6.7 7.4 16.2 7.1 8.9 17.1 8.5 14.7 8.2
b5 7.5 6.2 22.8 6.6 7.3 15.7 7.0 8.7 17.0 8.5 14.5 9.1
  • bluearchive.wiki: Beam 5 is worse because it uses more kana. Possibly learnt from the miHoYo games?
  • genshin5.1: Trained on 5.0; this set is new audio from 5.1, so minor overlap is possible.
  • nekopara: Hallucination test. anime-whisper would be better if not for increased hallucination. The OpenAI models are unusable.
  • genshin/starrail: Mostly in the train set.
  • reazon: Significantly higher CER from transcribing background/secondary audio.
  • jsut: Surprisingly good?
  • cv8: The cv19 train split includes some of the cv8 test split.
  • cv19: No contamination; struggles with accents.
  • jsl: Anime set.
  • loopers: Anime set with hallucination-prone audio.
  • tedxjp (tedx10): 10-video subset. See the comments under the test set. For this column, b1 = batched and b5 = sequential; beam_size=1, temperature=0, condition_on_previous_text=False.
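
The notes above refer to CER (character error rate). For readers unfamiliar with the metric, here is a minimal sketch: the character-level edit distance between hypothesis and reference, divided by the reference length. It assumes the table values are CER in percent; the text normalization used for the reported scores is not stated, so none is applied here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate in percent; exceeds 100 when hyp has many insertions."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)

# One dropped character out of ten.
print(cer("今日はいい天気ですね", "今日はいい天気です"))
# Hallucinated extra text on a short reference pushes CER above 100.
print(cer("はい", "はい、そうですね。ありがとう"))
```

Because insertions count against a fixed reference length, hallucinated text on short utterances can produce scores above 100.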