Automatic Speech Recognition
Transformers
Safetensors
Japanese
whisper
audio
hf-asr-leaderboard
Eval Results
Inference Endpoints
asahi417 commited on
Commit
39a6bd4
·
verified ·
1 Parent(s): 4921477

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +13 -9
README.md CHANGED
@@ -60,11 +60,15 @@ model-index:
60
  ---
61
 
62
  # Kotoba-Whisper
63
- _Kotoba-Whisper_ is a collection of distilled [Whisper](https://arxiv.org/abs/2212.04356) models for Japanese ASR. Following the original work of distil-whisper ([Robust Knowledge Distillation via Large-Scale Pseudo Labelling](https://arxiv.org/abs/2311.00430)),
64
- we employ OpenAI's [Whisper large-v3](https://huggingface.co/openai/whisper-large-v3) as the teacher model, and the student model that consists the full encoder of the
65
- teacher whisper model, and a decoder with two layers initialized from the first and last layer of the whisper model.
66
-
67
- As the initial version, we release ***kotoba-whisper-v1.0*** trained on the `large` subset of [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech),
 
 
 
 
68
  which amounts 1,253 hours of audio with 16,861,235 characters of transcriptions (5 sec audio with 18 text tokens in average) after
69
  those transcriptions more than 10 WER are removed (see [WER Filter](https://huggingface.co/distil-whisper/distil-large-v3#wer-filter) for detail).
70
  The model was trained for 8 epochs with batch size 256 with sampling rate of 16kHz, and the training and evaluation code to reproduce kotoba-whisper is available at [https://github.com/kotoba-tech/kotoba-whisper](https://github.com/kotoba-tech/kotoba-whisper).
@@ -77,8 +81,8 @@ the Japanese subset from [CommonVoice 8.0](https://huggingface.co/datasets/commo
77
 
78
  | Model | CommonVoice 8.0 (Japanese) | JSUT Basic 5000 | ReazonSpeech Test |
79
  |:------------------------------------------------------------------------------------------------|---------------------------:|----------------:|------------------:|
80
- | [**kotoba-tech/kotoba-whisper-v1.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) | **9.44** | **8.48** | **12.60** |
81
- | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 8.52 | 7.18 | 15.18 |
82
  | [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 11.34 | 9.87 | 29.56 |
83
  | [openai/whisper-small](https://huggingface.co/openai/whisper-small) | 15.26 | 14.22 | 34.29 |
84
  | [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 46.86 | 35.69 | 96.69 |
@@ -87,8 +91,8 @@ the Japanese subset from [CommonVoice 8.0](https://huggingface.co/datasets/commo
87
 
88
  | Model | CommonVoice 8.0 (Japanese) | JSUT Basic 5000 | ReazonSpeech Test |
89
  |:------------------------------------------------------------------------------------------------|---------------------------:|----------------:|------------------:|
90
- | [**kotoba-tech/kotoba-whisper-v1.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) | **59.27** | **64.36** | **56.62** |
91
- | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 55.41 | 59.34 | 60.23 |
92
  | [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 63.64 | 69.52 | 76.04 |
93
  | [openai/whisper-small](https://huggingface.co/openai/whisper-small) | 74.21 | 82.02 | 82.99 |
94
  | [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 93.78 | 97.72 | 94.85 |
 
60
  ---
61
 
62
  # Kotoba-Whisper
63
+ _Kotoba-Whisper_ is a collection of distilled [Whisper](https://arxiv.org/abs/2212.04356) models for Japanese ASR, developed through the collaboration bewteen
64
+ [Asahi Ushio](https://asahiushio.com) and [Kotoba Technologies](https://www.kotoba.tech/).
65
+ Following the original work of distil-whisper ([Robust Knowledge Distillation via Large-Scale Pseudo Labelling](https://arxiv.org/abs/2311.00430)),
66
+ we employ OpenAI's [Whisper large-v3](https://huggingface.co/openai/whisper-large-v3) as the teacher model, and the student model consists the full encoder of the
67
+ teacher large-v3 model and the decoder with two layers initialized from the first and last layer of the large-v3 model.
68
+ Kotoba-Whisper is **6.3x faster than large-v3**, while retaining as low error rate as the large-v3.
69
+
70
+ As the initial version, we release ***kotoba-whisper-v1.0*** trained on the `large` subset of [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech)
71
+ (the largest speech-transcription paired dataset in Japanese extracted from Japanese TV audio recordings),
72
  which amounts 1,253 hours of audio with 16,861,235 characters of transcriptions (5 sec audio with 18 text tokens in average) after
73
  those transcriptions more than 10 WER are removed (see [WER Filter](https://huggingface.co/distil-whisper/distil-large-v3#wer-filter) for detail).
74
  The model was trained for 8 epochs with batch size 256 with sampling rate of 16kHz, and the training and evaluation code to reproduce kotoba-whisper is available at [https://github.com/kotoba-tech/kotoba-whisper](https://github.com/kotoba-tech/kotoba-whisper).
 
81
 
82
  | Model | CommonVoice 8.0 (Japanese) | JSUT Basic 5000 | ReazonSpeech Test |
83
  |:------------------------------------------------------------------------------------------------|---------------------------:|----------------:|------------------:|
84
+ | [**kotoba-tech/kotoba-whisper-v1.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) | 9.44 | 8.48 | **12.60** |
85
+ | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | **8.52** | **7.18** | 15.18 |
86
  | [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 11.34 | 9.87 | 29.56 |
87
  | [openai/whisper-small](https://huggingface.co/openai/whisper-small) | 15.26 | 14.22 | 34.29 |
88
  | [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 46.86 | 35.69 | 96.69 |
 
91
 
92
  | Model | CommonVoice 8.0 (Japanese) | JSUT Basic 5000 | ReazonSpeech Test |
93
  |:------------------------------------------------------------------------------------------------|---------------------------:|----------------:|------------------:|
94
+ | [**kotoba-tech/kotoba-whisper-v1.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) | 59.27 | 64.36 | **56.62** |
95
+ | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | **55.41** | **59.34** | 60.23 |
96
  | [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 63.64 | 69.52 | 76.04 |
97
  | [openai/whisper-small](https://huggingface.co/openai/whisper-small) | 74.21 | 82.02 | 82.99 |
98
  | [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 93.78 | 97.72 | 94.85 |