WIP: turbo encoder frozen + 2 decoder layers. Trained for 2^19 steps at batch size 8 (~160 hours on a 3060). Almost certainly undertrained.

Goals

Acknowledgements

	air	himanatsu	kanon	proseka	sakuuta	tedxjp
turbo_b1	25.8	60.6	22.5	13.1	21.1	10.8
turbo_b5	20.9	48.3	19.1	11.8	18.9
turbo_b1_nt	25.8	61.6	23.1	13.6	20.4
turbo_b5_nt	17.1	25.8	23.5	9.4	12.5
anime_b1	15.9	20.2	12.8	8.9	10.9	41.8
anime_b5	14.4	18.3	12.6	8.6	10.0
anime_b1_n5	15.0	18.4	12.7	8.9	10.1
anime_b5_n5	14.4	18.1	12.5	8.6	10.0
anime_b1_nt	14.4	18.7	11.4	8.3	10.1
anime_b5_nt	13.4	17.5	11.4	8.1	9.6

b1	15.6	20.1	11.8	8.8	10.5	11.5
b5	15.2	19.8	11.6	8.8	10.7
b1_nt	15.6	20.1	11.9	8.7	10.5
b5_nt	15.3	19.4	11.8	8.6	10.5

b1: beam_size=1
b5: beam_size=5
n5: no_repeat_ngram_size=5
nt: <|notimestamps|>
Anime sets: equal to or worse than anime-whisper, better than turbo (which is out of domain).
tedxjp: 273 videos from TEDxJP-10K with YouTube subtitles, used for long-form evaluation with faster-whisper.
Slightly worse than turbo on tedxjp. Kotoba and anime-whisper are not trained for long-form audio.
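
The row suffixes above (b1, b5, n5, nt) can be read as decoding options. A minimal sketch of that mapping follows. The function name `decode_options` is hypothetical, and the keyword names (`beam_size`, `no_repeat_ngram_size`, `without_timestamps`) are chosen to resemble faster-whisper's transcribe arguments; treating `nt` as `without_timestamps=True` is an assumption, not something these notes confirm.

```python
def decode_options(config_name: str) -> dict:
    """Map a row label such as 'anime_b5_nt' to decoding keyword arguments.

    Hypothetical helper: the defaults and keyword names are assumptions
    made for illustration, not the settings used to produce the tables.
    """
    options = {
        "beam_size": 1,
        "no_repeat_ngram_size": 0,    # 0 means the constraint is disabled
        "without_timestamps": False,
    }
    for token in config_name.split("_"):
        if token in ("b1", "b5"):
            options["beam_size"] = int(token[1:])
        elif token == "n5":
            options["no_repeat_ngram_size"] = 5
        elif token == "nt":
            # Assumed equivalent of prompting with <|notimestamps|>
            options["without_timestamps"] = True
        # Any other token (model name such as 'anime' or 'turbo') is ignored.
    return options


if __name__ == "__main__":
    for name in ("turbo_b1", "anime_b1_n5", "anime_b5_nt", "b5"):
        print(name, decode_options(name))
```

The resulting dict could be unpacked into a transcription call, provided the backend in use actually accepts those keyword names.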

Used only for hyperparameter optimization.

	bluearchive	genshin5.1	nekopara	genshin	starrail	reazon	jsut	cv8	cv19	jsl	loopers	tedx10
large-v3_b1	12.2	10.1	70.8	11.9	10.0	16.0	7.1	8.6	15.1	12.2		7.7
large-v3_b5	11.0	10.0	63.7	11.6	9.8	14.1	7.1	8.3	14.8	11.0
large-v2_b1		14.4	103.4	18.3	12.9	31.6	8.2	9.8	18.5	18.0		8.0
large-v2_b5		12.7	100.9	16.8	12.9	28.0	8.0	9.5	17.5	16.2
turbo_b1	12.8	11.1	72.3	11.6	11.1	11.6	7.3	9.6	17.5	12.0	28.0	7.9
turbo_b5	10.4	10.0	64.3	12.0	10.2	10.4	7.2	9.1	16.6	10.8	20.2	8.8
kotoba-v1_b1	8.5	9.4	27.8	9.9	10.3	12.7	8.4	9.5	17.1	12.2		34.9
kotoba-v1_b5	8.4	9.3	27.8	9.8	10.3	12.3	8.3	9.3	16.7	12.1
kotoba-v2_b1	8.5	9.6	27.7	10.2	10.4	11.6	8.2	9.2	16.9	12.3		25.3
kotoba-v2_b5	8.6	9.5	27.7	10.1	10.5	11.4	8.2	9.0	16.6	12.2
kotoba-bi_b1	8.9	10.1	28.1	10.5	10.8	17.5	9.1	9.8	17.5	12.7		27.8
kotoba-bi_b5	8.8	10.0	28.0	10.5	10.7	17.1	9.1	9.6	17.2	12.6
anime_b1	7.5	11.5	24.7	11.0	11.2	30.1	8.0	10.0	19.1	9.0	18.9	32.0
anime_b5	7.2	10.4	22.0	10.3	10.4	26.6	7.8	9.8	18.8	8.5	15.3	51.8

b1	6.9	6.3	22.8	6.7	7.4	16.2	7.1	8.9	17.1	8.5	14.7	8.2
b5	7.5	6.2	22.8	6.6	7.3	15.7	7.0	8.7	17.0	8.5	14.5	9.1

bluearchive.wiki: beam 5 is worse because of increased kana usage. Learnt from MiHoYo games?
genshin5.1: Trained on 5.0, new audio from 5.1, possible minor overlap.
nekopara: Hallucination test. anime-whisper would be better if not for increased hallucination. The OpenAI models are unusable.
genshin/starrail: Mostly in the train set.
reazon: Significantly higher CER from transcribing background/secondary audio.
jsut: Surprisingly good?
cv8: cv19 train includes some of cv8 test.
cv19: No contamination, struggles with accents.
jsl: Anime set.
loopers: Anime set, has hallucination prone audio.
tedx10: 10-video subset of tedxjp. See comments in test set. For this column, b1=batched and b5=sequential, both with beam_size=1, temperature=0, condition_on_previous_text=False.
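
The notes above refer to CER (character error rate). As a rough illustration of how such a score behaves, here is a minimal sketch that computes CER as character-level edit distance divided by reference length, in percent. The function names are hypothetical, and any text normalization applied before scoring for the tables (punctuation, kana/kanji handling) is not stated in these notes, so none is applied here. The second example shows why a hallucinating model can score above 100, as in the nekopara column.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent (no normalization; an assumption)."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Two substituted characters out of five
    print(cer("こんにちは", "こんばんは"))
    # Repeated (hallucinated) output inflates CER past 100
    print(cer("はい", "はいはいはいはい"))
```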