asahi417 committed on
Commit
c9c5c56
·
verified ·
1 Parent(s): 32e4188

Update README.md

Files changed (1)
  1. README.md +75 -374
README.md CHANGED
@@ -64,70 +64,48 @@ _Kotoba-Whisper_ is a collection of distilled [Whisper](https://arxiv.org/abs/22
64
  we employ OpenAI's [Whisper large-v3](https://huggingface.co/openai/whisper-large-v3) as the teacher model, and a student model that consists of the full encoder of the
65
  teacher whisper model and a decoder with two layers initialized from the first and last layers of the whisper model.
66
  As the initial version, we release ***kotoba-whisper-v1.0*** trained on the `large` subset of [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech),
67
- which amounts to 1,253 hours of audio with 16,861,235 characters of transcriptions (5 sec audio with 18 text tokens on average).
68
- Kotoba-whisper-v1.0 is competitive with or even outperforms the largest whisper model in Japanese ASR benchmarks, while being 6.3 times faster than the whisper model.
 
69
 
70
 
71
-
72
- ## Table of Contents
73
- Since the sequential algorithm is the "de-facto" transcription algorithm across the most popular Whisper libraries
74
- (Whisper cpp, Faster-Whisper, OpenAI Whisper), this distilled model is designed to be compatible with these libraries.
75
- You can expect significant performance gains by switching from previous Distil-Whisper checkpoints to distil-large-v3
76
- when using these libraries. For convenience, the weights for the most popular libraries are already converted,
77
- with instructions for getting started below.
78
-
79
- 1. [Evaluation Results](#evaluation-results)
80
- 2. [Transformers Usage](#transformers-usage)
81
- * [Short-Form Transcription](#short-form-transcription)
82
- * [Sequential Long-Form](#sequential-long-form)
83
- * [Chunked Long-Form](#chunked-long-form)
84
- * [Speculative Decoding](#speculative-decoding)
85
- * [Additional Speed and Memory Improvements](#additional-speed--memory-improvements)
86
- 2. [Library Integrations](#library-integrations)
87
- * [Whisper cpp](#whispercpp)
88
- * [Faster Whisper](#faster-whisper)
89
- 3. [Model Details](#model-details)
90
-
91
-
92
- ## Evaluation Results
93
- ***kotoba-whisper-v1.0*** achieves better CER and WER than the [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) in the in-domain held-out test set from ReazonSpeech, and
94
  achieves competitive CER and WER on the out-of-domain test set including [JSUT basic 5000](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) and
95
  the Japanese subset from [CommonVoice 8.0](https://huggingface.co/datasets/common_voice).
96
 
97
- ### CER
 
98
 
99
  | Model | CommonVoice 8.0 (Japanese) | JSUT Basic 5000 | ReazonSpeech Test |
100
  |:------------------------------------------------------------------------------------------------|---------------------------:|----------------:|------------------:|
101
- | [***kotoba-tech/kotoba-whisper-v1.0***](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) | 9.44 | 8.48 | 12.60 |
102
  | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 8.52 | 7.18 | 15.18 |
103
  | [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 11.34 | 9.87 | 29.56 |
104
  | [openai/whisper-small](https://huggingface.co/openai/whisper-small) | 15.26 | 14.22 | 34.29 |
105
  | [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 46.86 | 35.69 | 96.69 |
106
 
107
- ### WER
108
 
109
  | Model | CommonVoice 8.0 (Japanese) | JSUT Basic 5000 | ReazonSpeech Test |
110
  |:------------------------------------------------------------------------------------------------|---------------------------:|----------------:|------------------:|
111
- | [***kotoba-tech/kotoba-whisper-v1.0***](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) | 59.27 | 64.36 | 56.62 |
112
  | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 55.41 | 59.34 | 60.23 |
113
  | [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 63.64 | 69.52 | 76.04 |
114
  | [openai/whisper-small](https://huggingface.co/openai/whisper-small) | 74.21 | 82.02 | 82.99 |
115
  | [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 93.78 | 97.72 | 94.85 |
116
 
117
- ### Latency
118
- As kotoba-whisper uses the same architecture as [distil-whisper/distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3),
119
  it inherits the benefit of the improved latency compared to [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3)
120
  (**6.3x faster than large-v3**, see the table below taken from [distil-whisper/distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3)).
121
 
122
- | Model | Params / M | Rel. Latency |
123
- |------------------------------------------------------------------------------|------------|--------------|
124
  | **[kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0)**| **756** | **6.3** |
125
- | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 1550 | 1.0 |
126
 
127
 
128
  ## Transformers Usage
129
-
130
- distil-large-v3 is supported in the Hugging Face 🤗 Transformers library from version 4.39 onwards. To run the model, first
131
  install the latest version of Transformers. For this example, we'll also install 🤗 Datasets to load a toy audio dataset
132
  from the Hugging Face Hub:
133
 
@@ -137,7 +115,6 @@ pip install --upgrade transformers accelerate datasets[audio]
137
  ```
138
 
139
  ### Short-Form Transcription
140
-
141
  The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
142
  class to transcribe short-form audio files (< 30-seconds) as follows:
143
 
@@ -146,19 +123,15 @@ import torch
146
  from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
147
  from datasets import load_dataset
148
 
149
-
150
- device = "cuda:0" if torch.cuda.is_available() else "cpu"
151
- torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
152
-
153
  model_id = "kotoba-tech/kotoba-whisper-v1.0"
 
 
154
 
155
- model = AutoModelForSpeechSeq2Seq.from_pretrained(
156
- model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
157
- )
158
  model.to(device)
159
-
160
  processor = AutoProcessor.from_pretrained(model_id)
161
-
162
  pipe = pipeline(
163
  "automatic-speech-recognition",
164
  model=model,
@@ -169,110 +142,55 @@ pipe = pipeline(
169
  device=device,
170
  )
171
 
172
- dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
 
173
  sample = dataset[0]["audio"]
174
 
 
175
  result = pipe(sample)
176
  print(result["text"])
177
  ```
178
 
179
- To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
180
  ```diff
181
  - result = pipe(sample)
182
  + result = pipe("audio.mp3")
183
  ```
184
 
185
- For segment-level timestamps, pass the argument `return_timestamps=True` and return the `"chunks"` output:
186
  ```python
187
  result = pipe(sample, return_timestamps=True)
188
  print(result["chunks"])
189
  ```
190
 
191
- <details>
192
-
193
- <summary> For more control over the generation parameters, use the model + processor API directly: </summary>
194
-
195
- Ad-hoc generation arguments can be passed to `model.generate`, including `num_beams` for beam-search, `return_timestamps`
196
- for segment-level timestamps, and `prompt_ids` for prompting. See the [docstrings](https://huggingface.co/docs/transformers/en/model_doc/whisper#transformers.WhisperForConditionalGeneration.generate)
197
- for more details.
198
-
199
- ```python
200
- import torch
201
- from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
202
- from datasets import Audio, load_dataset
203
-
204
-
205
- device = "cuda:0" if torch.cuda.is_available() else "cpu"
206
- torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
207
-
208
- model_id = "kotoba-tech/kotoba-whisper-v1.0"
209
-
210
- model = AutoModelForSpeechSeq2Seq.from_pretrained(
211
- model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
212
- )
213
- model.to(device)
214
-
215
- processor = AutoProcessor.from_pretrained(model_id)
216
-
217
- dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
218
- dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
219
- sample = dataset[0]["audio"]
220
-
221
- input_features = processor(
222
- sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
223
- ).input_features
224
-
225
- input_features = input_features.to(device, dtype=torch_dtype)
226
-
227
- gen_kwargs = {
228
- "max_new_tokens": 128,
229
- "num_beams": 1,
230
- "return_timestamps": False,
231
- }
232
-
233
- pred_ids = model.generate(input_features, **gen_kwargs)
234
- pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=gen_kwargs["return_timestamps"])
235
-
236
- print(pred_text)
237
- ```
238
-
239
- </details>
240
-
241
  ### Sequential Long-Form
242
-
243
- Unlike previous Distil-Whisper releases, distil-large-v3 is specifically designed to be compatible with OpenAI's sequential
244
- long-form transcription algorithm. This algorithm uses a sliding window for buffered inference of long audio files (> 30-seconds),
245
- and returns more accurate transcriptions compared to the [chunked long-form algorithm](#chunked-long-form).
246
-
247
  The sequential long-form algorithm should be used in either of the following scenarios:
 
248
  1. Transcription accuracy is the most important factor, and latency is less of a consideration
249
  2. You are transcribing **batches** of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate
250
 
251
  If you are transcribing single long audio files and latency is the most important factor, you should use the chunked algorithm
252
  described [below](#chunked-long-form). For a detailed explanation of the different algorithms, refer to Section 5 of
253
- the [Distil-Whisper paper](https://arxiv.org/pdf/2311.00430.pdf).
254
-
255
- The [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
256
  class can be used to transcribe long audio files with the sequential algorithm as follows:
257
 
258
  ```python
259
  import torch
 
260
  from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
261
  from datasets import load_dataset
262
 
263
-
264
- device = "cuda:0" if torch.cuda.is_available() else "cpu"
265
- torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
266
-
267
  model_id = "kotoba-tech/kotoba-whisper-v1.0"
 
 
268
 
269
- model = AutoModelForSpeechSeq2Seq.from_pretrained(
270
- model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
271
- )
272
  model.to(device)
273
-
274
  processor = AutoProcessor.from_pretrained(model_id)
275
-
276
  pipe = pipeline(
277
  "automatic-speech-recognition",
278
  model=model,
@@ -283,75 +201,19 @@ pipe = pipeline(
283
  device=device,
284
  )
285
 
286
- dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
287
- sample = dataset[0]["audio"]
 
288
 
 
289
  result = pipe(sample)
290
  print(result["text"])
291
  ```
292
 
293
- <details>
294
-
295
- <summary> For more control over the generation parameters, use the model + processor API directly: </summary>
296
-
297
- ```python
298
- import torch
299
- from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
300
- from datasets import Audio, load_dataset
301
-
302
-
303
- device = "cuda:0" if torch.cuda.is_available() else "cpu"
304
- torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
305
-
306
- model_id = "kotoba-tech/kotoba-whisper-v1.0"
307
-
308
- model = AutoModelForSpeechSeq2Seq.from_pretrained(
309
- model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
310
- )
311
- model.to(device)
312
-
313
- processor = AutoProcessor.from_pretrained(model_id)
314
-
315
- dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
316
- dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
317
- sample = dataset[0]["audio"]
318
-
319
- inputs = processor(
320
- sample["array"],
321
- sampling_rate=sample["sampling_rate"],
322
- return_tensors="pt",
323
- truncation=False,
324
- padding="longest",
325
- return_attention_mask=True,
326
- )
327
- inputs = inputs.to(device, dtype=torch_dtype)
328
-
329
- gen_kwargs = {
330
- "max_new_tokens": 448,
331
- "num_beams": 1,
332
- "condition_on_prev_tokens": False,
333
- "compression_ratio_threshold": 1.35, # zlib compression ratio threshold (in token space)
334
- "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
335
- "logprob_threshold": -1.0,
336
- "no_speech_threshold": 0.6,
337
- "return_timestamps": True,
338
- }
339
-
340
- pred_ids = model.generate(**inputs, **gen_kwargs)
341
- pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=False)
342
-
343
- print(pred_text)
344
- ```
345
-
346
- </details>
347
 
348
  ### Chunked Long-Form
349
-
350
- distil-large-v3 remains compatible with the Transformers chunked long-form algorithm. This algorithm should be used when
351
- a single large audio file is being transcribed and the fastest possible inference is required. In such circumstances,
352
- the chunked algorithm is up to 9x faster than OpenAI's sequential long-form implementation (see Table 7 of the
353
- [Distil-Whisper paper](https://arxiv.org/pdf/2311.00430.pdf)).
354
-
355
  To enable chunking, pass the `chunk_length_s` parameter to the `pipeline`. For distil-large-v3, a chunk length of 25-seconds
356
  is optimal. To activate batching over long audio files, pass the argument `batch_size`:
357
 
@@ -360,19 +222,15 @@ import torch
360
  from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
361
  from datasets import load_dataset
362
 
363
-
364
- device = "cuda:0" if torch.cuda.is_available() else "cpu"
365
- torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
366
-
367
  model_id = "kotoba-tech/kotoba-whisper-v1.0"
 
 
368
 
369
- model = AutoModelForSpeechSeq2Seq.from_pretrained(
370
- model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
371
- )
372
  model.to(device)
373
-
374
  processor = AutoProcessor.from_pretrained(model_id)
375
-
376
  pipe = pipeline(
377
  "automatic-speech-recognition",
378
  model=model,
@@ -385,17 +243,17 @@ pipe = pipeline(
385
  device=device,
386
  )
387
 
388
- dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
389
- sample = dataset[0]["audio"]
 
390
 
 
391
  result = pipe(sample)
392
  print(result["text"])
393
  ```
394
 
395
-
396
  ### Additional Speed & Memory Improvements
397
-
398
- You can apply additional speed and memory improvements to Distil-Whisper to further reduce the inference speed and VRAM
399
  requirements. These optimisations primarily target the attention kernel, swapping it from an eager implementation to a
400
  more efficient flash attention version.
401
 
@@ -438,106 +296,10 @@ Once a valid PyTorch version is installed, SDPA is activated by default. It can
438
  + model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, attn_implementation="sdpa")
439
  ```
440
 
441
- ## Library Integrations
442
-
443
- ### Whisper.cpp
444
-
445
- Distil-Whisper can be run with the [Whisper.cpp](https://github.com/ggerganov/whisper.cpp) package with the original
446
- sequential long-form transcription algorithm. In a provisional benchmark on Mac M1, distil-large-v3 is over 5x faster
447
- than Whisper large-v3, while performing to within 0.8% WER over long-form audio.
448
-
449
- Steps for getting started:
450
-
451
- 1. Clone the Whisper.cpp repository:
452
- ```
453
- git clone https://github.com/ggerganov/whisper.cpp.git
454
- cd whisper.cpp
455
- ```
456
- 2. Install the Hugging Face Hub Python package:
457
- ```bash
458
- pip install --upgrade huggingface_hub
459
- ```
460
- And download the GGML weights for distil-large-v3 using the following Python snippet:
461
-
462
- ```python
463
- from huggingface_hub import hf_hub_download
464
-
465
- hf_hub_download(repo_id='kotoba-tech/kotoba-whisper-v1.0-ggml', filename='ggml-distil-large-v3.bin', local_dir='./models')
466
- ```
467
-
468
- Note that if you do not have a Python environment set-up, you can also download the weights directly with `wget`:
469
-
470
- ```bash
471
- wget https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0-ggml/resolve/main/ggml-distil-large-v3.bin -P ./models
472
- ```
473
-
474
- 3. Run inference using the provided sample audio:
475
-
476
- ```bash
477
- make -j && ./main -m models/ggml-distil-large-v3.bin -f samples/jfk.wav
478
- ```
479
-
480
- ### Faster-Whisper
481
-
482
- Faster-Whisper is a reimplementation of Whisper using [CTranslate2](https://github.com/OpenNMT/CTranslate2/), a fast
483
- inference engine for Transformer models.
484
-
485
- First, install the Faster-Whisper package according to the [official instructions](https://github.com/SYSTRAN/faster-whisper#installation).
486
- For this example, we'll also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub:
487
-
488
- ```bash
489
- pip install --upgrade pip
490
- pip install --upgrade git+https://github.com/SYSTRAN/faster-whisper datasets[audio]
491
- ```
492
-
493
- The following code snippet loads the distil-large-v3 model and runs inference on an example file from the LibriSpeech ASR
494
- dataset:
495
-
496
- ```python
497
- import torch
498
- from faster_whisper import WhisperModel
499
- from datasets import load_dataset
500
-
501
- # define our torch configuration
502
- device = "cuda:0" if torch.cuda.is_available() else "cpu"
503
- compute_type = "float16" if torch.cuda.is_available() else "float32"
504
-
505
- # load model on GPU if available, else cpu
506
- model = WhisperModel("distil-large-v3", device=device, compute_type=compute_type)
507
-
508
- # load toy dataset for example
509
- dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
510
- sample = dataset[1]["audio"]["path"]
511
-
512
- segments, info = model.transcribe(sample, beam_size=1)
513
-
514
- for segment in segments:
515
- print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
516
- ```
517
-
518
- To transcribe a local audio file, simply pass the path to the audio file as the `audio` argument to transcribe:
519
-
520
- ```python
521
- segments, info = model.transcribe("audio.mp3", beam_size=1)
522
- ```
523
-
524
 
525
  ## Model Details
 
526
 
527
- Distil-Whisper inherits the encoder-decoder architecture from Whisper. The encoder maps a sequence of speech vector
528
- inputs to a sequence of hidden-state vectors. The decoder auto-regressively predicts text tokens, conditional on all
529
- previous tokens and the encoder hidden-states. Consequently, the encoder is only run forward once, whereas the decoder
530
- is run as many times as the number of tokens generated. In practice, this means the decoder accounts for over 90% of
531
- total inference time. Thus, to optimise for latency, the focus is on minimising the inference time of the decoder.
532
-
533
- To distill the Whisper model, we reduce the number of decoder layers while keeping the encoder fixed.
534
- The encoder (shown in green) is entirely copied from the teacher to the student and frozen during training.
535
- The student's decoder consists of a subset of the teacher decoder layers, which are initialised from maximally spaced layers.
536
- The model is then trained on a weighted sum of the KL divergence and pseudo-label loss terms.
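As a rough illustration of the initialisation described above, the sketch below picks maximally spaced teacher decoder layers for a student of a given depth. The helper name `maximally_spaced_layers` and the 32-layer teacher depth are assumptions made for this example; they are not taken from the Distil-Whisper or Kotoba-Whisper training code, which may select layers differently.

```python
def maximally_spaced_layers(teacher_layers: int, student_layers: int) -> list[int]:
    """Return indices of teacher layers to copy into the student (hypothetical helper)."""
    if not 1 <= student_layers <= teacher_layers:
        raise ValueError("student_layers must be between 1 and teacher_layers")
    if student_layers == 1:
        # With a single layer, keep the last teacher layer (an assumption of this sketch).
        return [teacher_layers - 1]
    step = (teacher_layers - 1) / (student_layers - 1)
    # Spread the picks evenly; the first and last teacher layers are always included.
    return [round(i * step) for i in range(student_layers)]


# Assumed depths: 32 decoder layers in the teacher, 2 in the student.
print(maximally_spaced_layers(32, 2))  # [0, 31]
```

With two student layers, the evenly spaced picks reduce to the first and last teacher layers, which matches the initialisation stated for kotoba-whisper-v1.0.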
537
-
538
- <p align="center">
539
- <img src="https://huggingface.co/datasets/distil-whisper/figures/resolve/main/architecture.png?raw=true" width="600"/>
540
- </p>
541
 
542
  ## Evaluation
543
 
@@ -557,123 +319,62 @@ Evaluation can then be run end-to-end with the following example:
557
 
558
  ```python
559
  from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
560
- from datasets import load_dataset
561
  from evaluate import load
562
  import torch
563
  from tqdm import tqdm
564
 
565
- # define our torch configuration
566
- device = "cuda:0" if torch.cuda.is_available() else "cpu"
567
- torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
568
-
569
  model_id = "kotoba-tech/kotoba-whisper-v1.0"
 
 
 
 
 
570
 
571
- # load the model + processor
572
- model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, use_safetensors=True, low_cpu_mem_usage=True)
573
- model = model.to(device)
574
  processor = AutoProcessor.from_pretrained(model_id)
575
 
576
- # load the dataset with streaming mode
577
- dataset = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)
 
 
578
 
579
- # define the evaluation metric
580
- wer_metric = load("wer")
581
 
582
  def inference(batch):
583
  # 1. Pre-process the audio data to log-mel spectrogram inputs
584
  audio = [sample["array"] for sample in batch["audio"]]
585
  input_features = processor(audio, sampling_rate=batch["audio"][0]["sampling_rate"], return_tensors="pt").input_features
586
  input_features = input_features.to(device, dtype=torch_dtype)
587
-
588
  # 2. Auto-regressively generate the predicted token ids
589
  pred_ids = model.generate(input_features, max_new_tokens=128)
590
-
591
  # 3. Decode the token ids to the final transcription
592
  batch["transcription"] = processor.batch_decode(pred_ids, skip_special_tokens=True)
593
- batch["reference"] = batch["text"]
594
  return batch
595
 
596
- # batch size 16 inference
597
- dataset = dataset.map(function=inference, batched=True, batch_size=16)
598
 
 
599
  all_transcriptions = []
600
  all_references = []
601
-
602
- # iterate over the dataset and run inference
603
  for result in tqdm(dataset, desc="Evaluating..."):
604
  all_transcriptions.append(result["transcription"])
605
  all_references.append(result["reference"])
606
 
607
  # normalize predictions and references
608
- all_transcriptions = [processor.normalize(transcription) for transcription in all_transcriptions]
609
- all_references = [processor.normalize(reference) for reference in all_references]
610
-
611
- # compute the WER metric
612
- wer = 100 * wer_metric.compute(predictions=all_transcriptions, references=all_references)
613
- print(wer)
614
 
 
 
 
 
615
  ```
616
- **Print Output:**
617
- ```
618
- 2.428920763531516
619
- ```
620
-
621
-
622
- ## Data
623
-
624
- Distil-Whisper is trained on 22,000 hours of audio data from nine open-source, permissively licensed speech datasets on the
625
- Hugging Face Hub:
626
-
627
- | Dataset | Size / h | Speakers | Domain | Licence |
628
- |-----------------------------------------------------------------------------------------|----------|----------|-----------------------------|-----------------|
629
- | [People's Speech](https://huggingface.co/datasets/MLCommons/peoples_speech) | 12,000 | unknown | Internet Archive | CC-BY-SA-4.0 |
630
- | [Common Voice 13](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0) | 3,000 | unknown | Narrated Wikipedia | CC0-1.0 |
631
- | [GigaSpeech](https://huggingface.co/datasets/speechcolab/gigaspeech) | 2,500 | unknown | Audiobook, podcast, YouTube | apache-2.0 |
632
- | Fisher | 1,960 | 11,900 | Telephone conversations | LDC |
633
- | [LibriSpeech](https://huggingface.co/datasets/librispeech_asr) | 960 | 2,480 | Audiobooks | CC-BY-4.0 |
634
- | [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) | 540 | 1,310 | European Parliament | CC0 |
635
- | [TED-LIUM](https://huggingface.co/datasets/LIUM/tedlium) | 450 | 2,030 | TED talks | CC-BY-NC-ND 3.0 |
636
- | SwitchBoard | 260 | 540 | Telephone conversations | LDC |
637
- | [AMI](https://huggingface.co/datasets/edinburghcstr/ami) | 100 | unknown | Meetings | CC-BY-4.0 |
638
- ||||||
639
- | **Total** | 21,770 | 18,260+ | | |
640
-
641
- The combined dataset spans 10 distinct domains and over 50k speakers. The diversity of this dataset is crucial to ensuring
642
- the distilled model is robust to audio distributions and noise.
643
-
644
- The audio data is then pseudo-labelled using the Whisper large-v3 model: we use Whisper to generate predictions for all
645
- the audio in our training set and use these as the target labels during training. Using pseudo-labels ensures that the
646
- transcriptions are consistently formatted across datasets and provides sequence-level distillation signal during training.
647
-
648
- ## WER Filter
649
-
650
- The Whisper pseudo-label predictions are subject to mis-transcriptions and hallucinations. To ensure we only train on
651
- accurate pseudo-labels, we employ a simple WER heuristic during training. First, we normalise the Whisper pseudo-labels
652
- and the ground truth labels provided by each dataset. We then compute the WER between these labels. If the WER exceeds
653
- a specified threshold, we discard the training example. Otherwise, we keep it for training.
654
-
655
- Section 9.2 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430) demonstrates the effectiveness of this filter
656
- for improving downstream performance of the distilled model. We also partially attribute Distil-Whisper's robustness to
657
- hallucinations to this filter.
658
-
659
- ## Training
660
-
661
- The model was trained for 80,000 optimisation steps (or 11 epochs) with batch size 256. The Tensorboard training logs can
662
- be found under: https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0/tensorboard?params=scalars#frame
663
-
664
- ## Results
665
-
666
- The distilled model performs to within 1.5% WER of Whisper large-v3 on out-of-distribution (OOD) short-form audio, within
667
- 1% WER on sequential long-form decoding, and outperforms large-v3 by 0.1% on chunked long-form. This performance gain is
668
- attributed to lower hallucinations.
669
-
670
- For a detailed per-dataset breakdown of the evaluation results, refer to Tables 16 and 17 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430)
671
-
672
- Distil-Whisper is also evaluated on the [ESB benchmark](https://arxiv.org/abs/2210.13352) datasets as part of the [OpenASR leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard),
673
- where it performs to within 0.2% WER of Whisper.
674
 
675
- ## Reproducing Kotoba-Whisper
676
- Training and evaluation code to reproduce Kotoba-Whisper is available at the repository: [TBA](TBA).
677
 
678
  ## Acknowledgements
679
  * OpenAI for the Whisper [model](https://huggingface.co/openai/whisper-large-v3).
 
64
  we employ OpenAI's [Whisper large-v3](https://huggingface.co/openai/whisper-large-v3) as the teacher model, and a student model that consists of the full encoder of the
65
  teacher whisper model and a decoder with two layers initialized from the first and last layers of the whisper model.
66
  As the initial version, we release ***kotoba-whisper-v1.0*** trained on the `large` subset of [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech),
67
+ which amounts to 1,253 hours of audio with 16,861,235 characters of transcriptions (5 sec audio with 18 text tokens on average) after
68
+ transcriptions with a WER above 10 are removed (see [WER Filter](https://huggingface.co/distil-whisper/distil-large-v3#wer-filter)).
69
+ The model was trained for 8 epochs with a batch size of 256 and a sampling rate of 16kHz, and the training and evaluation code to reproduce kotoba-whisper is available at [https://github.com/kotoba-tech/kotoba-whisper](https://github.com/kotoba-tech/kotoba-whisper).
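The WER filter mentioned above can be sketched as follows: compute the word error rate between each reference transcription and its pseudo-label, and drop the pair when the rate exceeds the threshold. This is a minimal sketch under stated assumptions, not the training code. The helpers `edit_distance`, `word_error_rate` and `wer_filter` are hypothetical, text is split on whitespace (Japanese text would first need a word segmenter and normalisation, both omitted here), and the threshold of 10 is read as a percentage.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent, with whitespace tokenisation (an assumption of this sketch)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 100.0
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def wer_filter(pairs, threshold: float = 10.0):
    """Keep (reference, pseudo_label) pairs whose WER does not exceed the threshold."""
    return [(ref, hyp) for ref, hyp in pairs if word_error_rate(ref, hyp) <= threshold]


pairs = [
    ("a b c d e f g h i j", "a b c d e f g h i j"),  # WER 0: kept
    ("a b c d e f g h i j", "a b c d e f g h i x"),  # WER 10: kept
    ("a b c d e", "a x c y e"),                      # WER 40: dropped
]
print(len(wer_filter(pairs)))  # 2
```

Filtering on the disagreement between pseudo-labels and references removes examples where the teacher likely mis-transcribed or hallucinated, at the cost of discarding some training audio.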
70
 
71
 
72
+ Kotoba-whisper-v1.0 achieves better CER and WER than [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) on the in-domain held-out test set from ReazonSpeech, and
73
  achieves competitive CER and WER on the out-of-domain test set including [JSUT basic 5000](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) and
74
  the Japanese subset from [CommonVoice 8.0](https://huggingface.co/datasets/common_voice).
75
 
76
+
77
+ - ***CER***
78
 
79
  | Model | CommonVoice 8.0 (Japanese) | JSUT Basic 5000 | ReazonSpeech Test |
80
  |:------------------------------------------------------------------------------------------------|---------------------------:|----------------:|------------------:|
81
+ | [**kotoba-tech/kotoba-whisper-v1.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) | **9.44** | **8.48** | **12.60** |
82
  | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 8.52 | 7.18 | 15.18 |
83
  | [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 11.34 | 9.87 | 29.56 |
84
  | [openai/whisper-small](https://huggingface.co/openai/whisper-small) | 15.26 | 14.22 | 34.29 |
85
  | [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 46.86 | 35.69 | 96.69 |
86
 
87
+ - ***WER***
88
 
89
  | Model | CommonVoice 8.0 (Japanese) | JSUT Basic 5000 | ReazonSpeech Test |
90
  |:------------------------------------------------------------------------------------------------|---------------------------:|----------------:|------------------:|
91
+ | [**kotoba-tech/kotoba-whisper-v1.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) | **59.27** | **64.36** | **56.62** |
92
  | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 55.41 | 59.34 | 60.23 |
93
  | [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 63.64 | 69.52 | 76.04 |
94
  | [openai/whisper-small](https://huggingface.co/openai/whisper-small) | 74.21 | 82.02 | 82.99 |
95
  | [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 93.78 | 97.72 | 94.85 |
96
 
97
+ - ***Latency***: As kotoba-whisper uses the same architecture as [distil-whisper/distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3),
 
98
  it inherits the benefit of the improved latency compared to [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3)
99
  (**6.3x faster than large-v3**, see the table below taken from [distil-whisper/distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3)).
100
 
101
+ | Model | Params / M | Rel. Latency |
102
+ |----------------------------------------------------------------------------------------------|------------|--------------|
103
  | **[kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0)**| **756** | **6.3** |
104
+ | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 1550 | 1.0 |
105
 
106
 
107
  ## Transformers Usage
108
+ Kotoba-Whisper is supported in the Hugging Face 🤗 Transformers library from version 4.39 onwards. To run the model, first
 
109
  install the latest version of Transformers. For this example, we'll also install 🤗 Datasets to load a toy audio dataset
110
  from the Hugging Face Hub:
111
 
 
115
  ```
116
 
117
  ### Short-Form Transcription
 
118
  The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
119
  class to transcribe short-form audio files (< 30 seconds) as follows:
120
 
 
123
  from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
124
  from datasets import load_dataset
125
 
126
+ # config
 
 
 
127
  model_id = "kotoba-tech/kotoba-whisper-v1.0"
128
+ torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
129
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
130
 
131
+ # load model
132
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
 
133
  model.to(device)
 
134
  processor = AutoProcessor.from_pretrained(model_id)
 
135
  pipe = pipeline(
136
  "automatic-speech-recognition",
137
  model=model,
 
142
  device=device,
143
  )
144
 
145
+ # load sample audio
146
+ dataset = load_dataset("japanese-asr/ja_asr.common_voice_8_0", split="test")
147
  sample = dataset[0]["audio"]
148
 
149
+ # run inference
150
  result = pipe(sample)
151
  print(result["text"])
152
  ```
153
 
154
+ - To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
155
  ```diff
156
  - result = pipe(sample)
157
  + result = pipe("audio.mp3")
158
  ```
159
 
160
+ - For segment-level timestamps, pass the argument `return_timestamps=True` and return the `"chunks"` output:
161
  ```python
162
  result = pipe(sample, return_timestamps=True)
163
  print(result["chunks"])
164
  ```
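Each entry of `result["chunks"]` is a dictionary with a `"timestamp"` pair (start and end, in seconds) and the segment `"text"`. As a minimal sketch, the helper below turns such a list into readable lines. The function name `format_chunks` and the sample data are hypothetical and only for illustration; the end timestamp is treated as possibly missing (`None`).

```python
def format_chunks(chunks):
    """Render pipeline-style chunks as '[start -> end] text' lines."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        # The end of a segment may be unknown, so guard against None.
        end_label = f"{end:.2f}" if end is not None else "?"
        lines.append(f"[{start:.2f} -> {end_label}] {chunk['text'].strip()}")
    return lines


# Hypothetical output shaped like `result["chunks"]`.
example_chunks = [
    {"timestamp": (0.0, 1.5), "text": " こんにちは"},
    {"timestamp": (1.5, None), "text": " ありがとう"},
]
for line in format_chunks(example_chunks):
    print(line)
```

With a real pipeline result, `format_chunks(result["chunks"])` would be the equivalent call.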
165
 
166
  ### Sequential Long-Form
167
+ Kotoba-whisper is designed to be compatible with OpenAI's sequential long-form transcription algorithm. This algorithm uses a sliding window for buffered
168
+ inference of long audio files (> 30 seconds), and returns more accurate transcriptions than the [chunked long-form algorithm](#chunked-long-form).
 
 
 
169
  The sequential long-form algorithm should be used in either of the following scenarios:
170
+
171
  1. Transcription accuracy is the most important factor, and latency is less of a consideration
172
  2. You are transcribing **batches** of long audio files, in which case the latency of the sequential algorithm is comparable to that of the chunked algorithm, while being up to 0.5% WER more accurate
173
 
174
  If you are transcribing single long audio files and latency is the most important factor, you should use the chunked algorithm
175
  described [below](#chunked-long-form). For a detailed explanation of the different algorithms, refer to Section 5 of
176
+ the [Distil-Whisper paper](https://arxiv.org/pdf/2311.00430.pdf). The [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
 
 
177
  class can be used to transcribe long audio files with the sequential algorithm as follows:
178
 
179
  ```python
180
  import torch
181
+ import numpy as np
182
  from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
183
  from datasets import load_dataset
184
 
185
+ # config
 
 
 
186
  model_id = "kotoba-tech/kotoba-whisper-v1.0"
187
+ torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
188
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
189
 
190
+ # load model
191
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
 
192
  model.to(device)
 
193
  processor = AutoProcessor.from_pretrained(model_id)
 
194
  pipe = pipeline(
195
  "automatic-speech-recognition",
196
  model=model,
 
201
  device=device,
202
  )
203
 
204
+ # load sample audio (concatenate instances to create one long audio clip)
205
+ dataset = load_dataset("japanese-asr/ja_asr.common_voice_8_0", split="test")
206
+ sample = {"array": np.concatenate([i["array"] for i in dataset[:20]["audio"]]), "sampling_rate": dataset[0]['audio']['sampling_rate'], "path": "tmp"}
207
 
208
+ # run inference
209
  result = pipe(sample)
210
  print(result["text"])
211
  ```
212
 
213
 
214
  ### Chunked Long-Form
215
+ This algorithm should be used when a single large audio file is being transcribed and the fastest possible inference is required. In such circumstances,
216
+ the chunked algorithm is up to 9x faster than OpenAI's sequential long-form implementation (see Table 7 of the [Distil-Whisper paper](https://arxiv.org/pdf/2311.00430.pdf)).
 
 
 
 
217
  To enable chunking, pass the `chunk_length_s` parameter to the `pipeline`. For distil-large-v3, whose architecture Kotoba-Whisper shares, a chunk length of 25 seconds
218
  is optimal. To activate batching over long audio files, pass the argument `batch_size`:
219
 
 
222
  import numpy as np  # needed for np.concatenate below
  from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
223
  from datasets import load_dataset
224
 
225
+ # config
 
 
 
226
  model_id = "kotoba-tech/kotoba-whisper-v1.0"
227
+ torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
228
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
229
 
230
+ # load model
231
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
 
232
  model.to(device)
 
233
  processor = AutoProcessor.from_pretrained(model_id)
 
234
  pipe = pipeline(
235
  "automatic-speech-recognition",
236
  model=model,
 
243
  device=device,
244
  )
245
 
246
+ # load sample audio (concatenate instances to create one long audio clip)
247
+ dataset = load_dataset("japanese-asr/ja_asr.common_voice_8_0", split="test")
248
+ sample = {"array": np.concatenate([i["array"] for i in dataset[:20]["audio"]]), "sampling_rate": dataset[0]['audio']['sampling_rate'], "path": "tmp"}
249
 
250
+ # run inference
251
  result = pipe(sample)
252
  print(result["text"])
253
  ```
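To build intuition for what `chunk_length_s` does, the sketch below computes overlapping windows over a long recording. It is a simplified illustration, not the Transformers implementation: the function `chunk_windows` is hypothetical, and it assumes a symmetric overlap of one sixth of the chunk length on each side, which is the documented default for the pipeline's `stride_length_s`.

```python
def chunk_windows(total_s, chunk_length_s=25.0, stride_s=None):
    """Return (start, end) windows, in seconds, covering `total_s` of audio."""
    if stride_s is None:
        # Assumption: mirror the pipeline's documented default overlap.
        stride_s = chunk_length_s / 6
    # Consecutive windows advance by the chunk length minus both overlaps.
    step = chunk_length_s - 2 * stride_s
    if step <= 0:
        raise ValueError("stride_s must be smaller than half of chunk_length_s")
    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        start += step
    return windows


# A 60-second file split into 30-second chunks with 5 seconds of overlap per side.
print(chunk_windows(60.0, chunk_length_s=30.0, stride_s=5.0))
```

Because the windows are independent of one another, they can be batched together, which is where the speed-up of the chunked algorithm over the sequential one comes from.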
254
 
 
255
  ### Additional Speed & Memory Improvements
256
+ You can apply additional speed and memory improvements to further reduce inference latency and VRAM
 
257
  requirements. These optimisations primarily target the attention kernel, swapping it from an eager implementation to a
258
  more efficient flash attention version.
259
 
 
296
  + model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, attn_implementation="sdpa")
297
  ```
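As a rough sketch of how one might choose a value for `attn_implementation` at runtime, the hypothetical helper below prefers Flash Attention 2 when the `flash_attn` package can be found and a GPU is available, and falls back to `"sdpa"` otherwise. The helper name and the selection rule are assumptions made for illustration; check the Transformers documentation for the exact hardware and version requirements of each backend.

```python
import importlib.util


def pick_attn_implementation(has_cuda: bool) -> str:
    """Pick an attention backend name to pass as `attn_implementation`."""
    # Assumption: Flash Attention 2 is only usable with the `flash_attn`
    # package installed and a CUDA device present.
    flash_installed = importlib.util.find_spec("flash_attn") is not None
    if has_cuda and flash_installed:
        return "flash_attention_2"
    return "sdpa"


print(pick_attn_implementation(has_cuda=False))
```

In the snippets above, the result could be passed as `attn_implementation=pick_attn_implementation(torch.cuda.is_available())`.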
298
 
299
 
300
  ## Model Details
301
+ See [https://huggingface.co/distil-whisper/distil-large-v3#model-details](https://huggingface.co/distil-whisper/distil-large-v3#model-details).
302
 
303
 
304
  ## Evaluation
305
 
 
319
 
320
  ```python
321
  from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
322
+ from datasets import load_dataset, features
323
  from evaluate import load
324
  import torch
325
  from tqdm import tqdm
326
 
327
+ # config
 
 
 
328
  model_id = "kotoba-tech/kotoba-whisper-v1.0"
329
+ torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
330
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
331
+ audio_column = 'audio'
332
+ text_column = 'transcription'
333
+ batch_size = 16
334
 
335
+ # load model
336
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
337
+ model.to(device)
338
  processor = AutoProcessor.from_pretrained(model_id)
339
 
340
+ # load the dataset and resample the audio to the model's sampling rate (16 kHz)
341
+ dataset = load_dataset("japanese-asr/ja_asr.common_voice_8_0", split="test")
342
+ dataset = dataset.cast_column(audio_column, features.Audio(sampling_rate=processor.feature_extractor.sampling_rate))
343
+ dataset = dataset.select([0, 1, 2, 3, 4, 5, 6])
344
 
345
+ # preprocess and batch the dataset
 
346
 
347
  def inference(batch):
348
  # 1. Pre-process the audio data to log-mel spectrogram inputs
349
  audio = [sample["array"] for sample in batch["audio"]]
350
  input_features = processor(audio, sampling_rate=batch["audio"][0]["sampling_rate"], return_tensors="pt").input_features
351
  input_features = input_features.to(device, dtype=torch_dtype)
 
352
  # 2. Auto-regressively generate the predicted token ids
353
  pred_ids = model.generate(input_features, max_new_tokens=128)
 
354
  # 3. Decode the token ids to the final transcription
355
  batch["reference"] = batch[text_column]  # copy first: text_column is also "transcription"
356
+ batch["transcription"] = processor.batch_decode(pred_ids, skip_special_tokens=True)
357
  return batch
358
 
359
+ dataset = dataset.map(function=inference, batched=True, batch_size=batch_size)
 
360
 
361
+ # collect the predictions and references produced by `inference`
362
  all_transcriptions = []
363
  all_references = []
 
 
364
  for result in tqdm(dataset, desc="Evaluating..."):
365
  all_transcriptions.append(result["transcription"])
366
  all_references.append(result["reference"])
367
 
368
  # normalize predictions and references
369
+ all_transcriptions = [transcription.replace(" ", "") for transcription in all_transcriptions]
370
+ all_references = [reference.replace(" ", "") for reference in all_references]
 
 
 
 
371
 
372
+ # compute the CER metric
373
+ cer_metric = load("cer")
374
+ cer = 100 * cer_metric.compute(predictions=all_transcriptions, references=all_references)
375
+ print(cer)
376
  ```
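To make the metric concrete, here is a minimal pure-Python sketch of character error rate: the total character-level edit distance divided by the total number of reference characters, after removing spaces as in the normalization step above. The functions `edit_distance` and `corpus_cer` are hypothetical illustrations; the `cer` metric loaded through `evaluate` may differ in implementation details.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_cer(references, predictions):
    """CER in percent over a corpus, ignoring spaces."""
    refs = [r.replace(" ", "") for r in references]
    hyps = [p.replace(" ", "") for p in predictions]
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return 100 * total_edits / total_chars


# One substituted character out of five reference characters.
print(corpus_cer(["こんにちは"], ["こんにちわ"]))
```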
377
 
 
 
378
 
379
  ## Acknowledgements
380
  * OpenAI for the Whisper [model](https://huggingface.co/openai/whisper-large-v3).