Update README.md
README.md
CHANGED
@@ -60,11 +60,15 @@ model-index:
 ---
 
 # Kotoba-Whisper
-_Kotoba-Whisper_ is a collection of distilled [Whisper](https://arxiv.org/abs/2212.04356) models for Japanese ASR
-
-
-
-
+_Kotoba-Whisper_ is a collection of distilled [Whisper](https://arxiv.org/abs/2212.04356) models for Japanese ASR, developed through the collaboration between
+[Asahi Ushio](https://asahiushio.com) and [Kotoba Technologies](https://www.kotoba.tech/).
+Following the original work of distil-whisper ([Robust Knowledge Distillation via Large-Scale Pseudo Labelling](https://arxiv.org/abs/2311.00430)),
+we employ OpenAI's [Whisper large-v3](https://huggingface.co/openai/whisper-large-v3) as the teacher model, and the student model consists of the full encoder of the
+teacher large-v3 model and a two-layer decoder initialized from the first and last layers of the large-v3 decoder.
+Kotoba-Whisper is **6.3x faster than large-v3**, while retaining an error rate as low as that of large-v3.
+
+As the initial version, we release ***kotoba-whisper-v1.0***, trained on the `large` subset of [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech)
+(the largest speech-transcription paired dataset in Japanese, extracted from Japanese TV audio recordings),
 which amounts to 1,253 hours of audio with 16,861,235 characters of transcriptions (5 seconds of audio with 18 text tokens on average), after
 transcriptions with a WER higher than 10 are removed (see the [WER Filter](https://huggingface.co/distil-whisper/distil-large-v3#wer-filter) for details).
 The model was trained for 8 epochs with batch size 256 at a sampling rate of 16kHz, and the training and evaluation code to reproduce kotoba-whisper is available at [https://github.com/kotoba-tech/kotoba-whisper](https://github.com/kotoba-tech/kotoba-whisper).
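The student construction in this hunk follows the distil-whisper recipe: keep the teacher's full encoder and initialize a two-layer decoder from the teacher's first and last decoder layers. A minimal sketch of that recipe with `transformers` (an illustration only; the actual kotoba-whisper training code is in the linked repository):

```python
# Sketch of the distil-whisper-style student initialization described above:
# full encoder from whisper-large-v3, two decoder layers taken from the
# teacher's first and last decoder layers. Illustration, not the real scripts.
from transformers import WhisperConfig, WhisperForConditionalGeneration

teacher = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

# Student: same architecture as the teacher, but with only two decoder layers.
config = WhisperConfig.from_pretrained("openai/whisper-large-v3")
config.decoder_layers = 2
student = WhisperForConditionalGeneration(config)

# The student keeps the teacher's full encoder.
student.model.encoder.load_state_dict(teacher.model.encoder.state_dict())

# Decoder layers come from the teacher's first and last decoder layers.
student.model.decoder.layers[0].load_state_dict(teacher.model.decoder.layers[0].state_dict())
student.model.decoder.layers[1].load_state_dict(teacher.model.decoder.layers[-1].state_dict())

# Shared decoder components (token/position embeddings, final layer norm) copy over as well.
student.model.decoder.embed_tokens.load_state_dict(teacher.model.decoder.embed_tokens.state_dict())
student.model.decoder.embed_positions.load_state_dict(teacher.model.decoder.embed_positions.state_dict())
student.model.decoder.layer_norm.load_state_dict(teacher.model.decoder.layer_norm.state_dict())
```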
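The WER filter referenced in the context lines can be sketched as follows, assuming the `evaluate` library and hypothetical field names, and ignoring the text normalization the real distil-whisper pipeline applies before scoring:

```python
# Sketch of the WER filter: drop training pairs whose Whisper pseudo-label
# differs from the original ReazonSpeech transcript by more than 10% WER.
# Field names ("text", "pseudo_label") are hypothetical.
import evaluate

wer_metric = evaluate.load("wer")

def passes_wer_filter(reference: str, pseudo_label: str, threshold: float = 10.0) -> bool:
    wer = 100 * wer_metric.compute(references=[reference], predictions=[pseudo_label])
    return wer <= threshold

# Usage, e.g. with a datasets.Dataset:
# dataset = dataset.filter(lambda ex: passes_wer_filter(ex["text"], ex["pseudo_label"]))
```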
@@ -77,8 +81,8 @@ the Japanese subset from [CommonVoice 8.0](https://huggingface.co/datasets/commo
 
 | Model | CommonVoice 8.0 (Japanese) | JSUT Basic 5000 | ReazonSpeech Test |
 |:------------------------------------------------------------------------------------------------|---------------------------:|----------------:|------------------:|
-| [**kotoba-tech/kotoba-whisper-v1.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) |
-| [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 8.52 | 7.18 | 15.18 |
+| [**kotoba-tech/kotoba-whisper-v1.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) | 9.44 | 8.48 | **12.60** |
+| [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | **8.52** | **7.18** | 15.18 |
 | [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 11.34 | 9.87 | 29.56 |
 | [openai/whisper-small](https://huggingface.co/openai/whisper-small) | 15.26 | 14.22 | 34.29 |
 | [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 46.86 | 35.69 | 96.69 |
@@ -87,8 +91,8 @@ the Japanese subset from [CommonVoice 8.0](https://huggingface.co/datasets/commo
 
 | Model | CommonVoice 8.0 (Japanese) | JSUT Basic 5000 | ReazonSpeech Test |
 |:------------------------------------------------------------------------------------------------|---------------------------:|----------------:|------------------:|
-| [**kotoba-tech/kotoba-whisper-v1.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) |
-| [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 55.41 | 59.34 | 60.23 |
+| [**kotoba-tech/kotoba-whisper-v1.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) | 59.27 | 64.36 | **56.62** |
+| [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | **55.41** | **59.34** | 60.23 |
 | [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 63.64 | 69.52 | 76.04 |
 | [openai/whisper-small](https://huggingface.co/openai/whisper-small) | 74.21 | 82.02 | 82.99 |
 | [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 93.78 | 97.72 | 94.85 |
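Scores like those in the two tables above (character-level error rates in the first, word-level in the second, judging by their ranges) can be computed with the `evaluate` library; a minimal sketch with placeholder data, not the repository's evaluation code:

```python
# Sketch: compute CER/WER between model transcriptions and gold references,
# as reported in the benchmark tables above. Strings below are placeholders.
import evaluate

cer_metric = evaluate.load("cer")
wer_metric = evaluate.load("wer")

predictions = ["placeholder model transcription"]
references = ["placeholder gold transcription"]

print("CER:", 100 * cer_metric.compute(predictions=predictions, references=references))
print("WER:", 100 * wer_metric.compute(predictions=predictions, references=references))
```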
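For reference, the released checkpoint runs with the standard `transformers` ASR pipeline, which is also the simplest way to sanity-check the speed-up over large-v3 claimed above; a minimal usage sketch (the audio path is a placeholder):

```python
# Sketch: transcribe a 16kHz Japanese audio file with kotoba-whisper-v1.0
# via the standard transformers pipeline. "sample.wav" is a placeholder path.
import torch
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="kotoba-tech/kotoba-whisper-v1.0",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device="cuda:0" if torch.cuda.is_available() else "cpu",
)

result = pipe("sample.wav", generate_kwargs={"language": "japanese", "task": "transcribe"})
print(result["text"])
```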