Update README.md
README.md
CHANGED
@@ -60,11 +60,15 @@ model-index:
 ---
 
 # Kotoba-Whisper
-_Kotoba-Whisper_ is a collection of distilled [Whisper](https://arxiv.org/abs/2212.04356) models for Japanese ASR
-
-
-
-
+_Kotoba-Whisper_ is a collection of distilled [Whisper](https://arxiv.org/abs/2212.04356) models for Japanese ASR, developed through the collaboration between
+[Asahi Ushio](https://asahiushio.com) and [Kotoba Technologies](https://www.kotoba.tech/).
+Following the original work of distil-whisper ([Robust Knowledge Distillation via Large-Scale Pseudo Labelling](https://arxiv.org/abs/2311.00430)),
+we employ OpenAI's [Whisper large-v3](https://huggingface.co/openai/whisper-large-v3) as the teacher model, and the student model consists of the full encoder of the
+teacher large-v3 model and a two-layer decoder initialized from the first and last layers of the large-v3 decoder.
+Kotoba-Whisper is **6.3x faster than large-v3**, while retaining an error rate as low as that of large-v3.
+
+As the initial version, we release ***kotoba-whisper-v1.0***, trained on the `large` subset of [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech)
+(the largest speech-transcription paired dataset in Japanese, extracted from Japanese TV audio recordings),
 which amounts to 1,253 hours of audio with 16,861,235 characters of transcriptions (5 seconds of audio with 18 text tokens on average), after
 transcriptions with a WER higher than 10 are removed (see the [WER Filter](https://huggingface.co/distil-whisper/distil-large-v3#wer-filter) for details).
 The model was trained for 8 epochs with batch size 256 at a sampling rate of 16kHz, and the training and evaluation code to reproduce kotoba-whisper is available at [https://github.com/kotoba-tech/kotoba-whisper](https://github.com/kotoba-tech/kotoba-whisper).
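The student construction in this hunk follows the distil-whisper recipe: keep the teacher's full encoder and initialize a two-layer decoder from the teacher's first and last decoder layers. A minimal sketch of that recipe with `transformers` (an illustration only; the actual kotoba-whisper training code is in the linked repository):

```python
# Sketch of the distil-whisper-style student initialization described above:
# full encoder from whisper-large-v3, two decoder layers taken from the
# teacher's first and last decoder layers. Illustration, not the real scripts.
from transformers import WhisperConfig, WhisperForConditionalGeneration

teacher = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

# Student: same architecture as the teacher, but with only two decoder layers.
config = WhisperConfig.from_pretrained("openai/whisper-large-v3")
config.decoder_layers = 2
student = WhisperForConditionalGeneration(config)

# The student keeps the teacher's full encoder.
student.model.encoder.load_state_dict(teacher.model.encoder.state_dict())

# Decoder layers come from the teacher's first and last decoder layers.
student.model.decoder.layers[0].load_state_dict(teacher.model.decoder.layers[0].state_dict())
student.model.decoder.layers[1].load_state_dict(teacher.model.decoder.layers[-1].state_dict())

# Shared decoder components (token/position embeddings, final layer norm) copy over as well.
student.model.decoder.embed_tokens.load_state_dict(teacher.model.decoder.embed_tokens.state_dict())
student.model.decoder.embed_positions.load_state_dict(teacher.model.decoder.embed_positions.state_dict())
student.model.decoder.layer_norm.load_state_dict(teacher.model.decoder.layer_norm.state_dict())
```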
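The WER filter referenced in the context lines can be sketched as follows, assuming the `evaluate` library and hypothetical field names, and ignoring the text normalization the real distil-whisper pipeline applies before scoring:

```python
# Sketch of the WER filter: drop training pairs whose Whisper pseudo-label
# differs from the original ReazonSpeech transcript by more than 10% WER.
# Field names ("text", "pseudo_label") are hypothetical.
import evaluate

wer_metric = evaluate.load("wer")

def passes_wer_filter(reference: str, pseudo_label: str, threshold: float = 10.0) -> bool:
    wer = 100 * wer_metric.compute(references=[reference], predictions=[pseudo_label])
    return wer <= threshold

# Usage, e.g. with a datasets.Dataset:
# dataset = dataset.filter(lambda ex: passes_wer_filter(ex["text"], ex["pseudo_label"]))
```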
@@ -77,8 +81,8 @@ the Japanese subset from [CommonVoice 8.0](https://huggingface.co/datasets/commo
 
 | Model | CommonVoice 8.0 (Japanese) | JSUT Basic 5000 | ReazonSpeech Test |
 |:------------------------------------------------------------------------------------------------|---------------------------:|----------------:|------------------:|
-| [**kotoba-tech/kotoba-whisper-v1.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) |
-| [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 8.52 | 7.18 | 15.18 |
+| [**kotoba-tech/kotoba-whisper-v1.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) | 9.44 | 8.48 | **12.60** |
+| [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | **8.52** | **7.18** | 15.18 |
 | [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 11.34 | 9.87 | 29.56 |
 | [openai/whisper-small](https://huggingface.co/openai/whisper-small) | 15.26 | 14.22 | 34.29 |
 | [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 46.86 | 35.69 | 96.69 |
@@ -87,8 +91,8 @@ the Japanese subset from [CommonVoice 8.0](https://huggingface.co/datasets/commo
 
 | Model | CommonVoice 8.0 (Japanese) | JSUT Basic 5000 | ReazonSpeech Test |
 |:------------------------------------------------------------------------------------------------|---------------------------:|----------------:|------------------:|
-| [**kotoba-tech/kotoba-whisper-v1.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) |
-| [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 55.41 | 59.34 | 60.23 |
+| [**kotoba-tech/kotoba-whisper-v1.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) | 59.27 | 64.36 | **56.62** |
+| [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | **55.41** | **59.34** | 60.23 |
 | [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 63.64 | 69.52 | 76.04 |
 | [openai/whisper-small](https://huggingface.co/openai/whisper-small) | 74.21 | 82.02 | 82.99 |
 | [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 93.78 | 97.72 | 94.85 |
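Scores like those in the two tables above (character-level error rates in the first, word-level in the second, judging by their ranges) can be computed with the `evaluate` library; a minimal sketch with placeholder data, not the repository's evaluation code:

```python
# Sketch: compute CER/WER between model transcriptions and gold references,
# as reported in the benchmark tables above. Strings below are placeholders.
import evaluate

cer_metric = evaluate.load("cer")
wer_metric = evaluate.load("wer")

predictions = ["placeholder model transcription"]
references = ["placeholder gold transcription"]

print("CER:", 100 * cer_metric.compute(predictions=predictions, references=references))
print("WER:", 100 * wer_metric.compute(predictions=predictions, references=references))
```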
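For reference, the released checkpoint runs with the standard `transformers` ASR pipeline, which is also the simplest way to sanity-check the speed-up over large-v3 claimed above; a minimal usage sketch (the audio path is a placeholder):

```python
# Sketch: transcribe a 16kHz Japanese audio file with kotoba-whisper-v1.0
# via the standard transformers pipeline. "sample.wav" is a placeholder path.
import torch
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="kotoba-tech/kotoba-whisper-v1.0",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device="cuda:0" if torch.cuda.is_available() else "cpu",
)

result = pipe("sample.wav", generate_kwargs={"language": "japanese", "task": "transcribe"})
print(result["text"])
```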