Tags: Automatic Speech Recognition · Transformers · Safetensors · Japanese · whisper · audio · hf-asr-leaderboard · Eval Results · Inference Endpoints
asahi417 committed (verified) · Commit 3b0c26a · Parent: 6a1eb4a

Update README.md

Files changed (1): README.md (+42 -101)
README.md CHANGED
@@ -123,53 +123,49 @@ class to transcribe short-form audio files (< 30-seconds) as follows:

  ```python
  import torch
- from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
+ from transformers import pipeline
  from datasets import load_dataset, Audio

  # config
  model_id = "kotoba-tech/kotoba-whisper-v1.0"
  torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
  device = "cuda:0" if torch.cuda.is_available() else "cpu"
+ model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
+ generate_kwargs = {"language": "japanese", "task": "transcribe"}

  # load model
- model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
- model.to(device)
- processor = AutoProcessor.from_pretrained(model_id)
  pipe = pipeline(
      "automatic-speech-recognition",
-     model=model,
-     tokenizer=processor.tokenizer,
-     feature_extractor=processor.feature_extractor,
-     max_new_tokens=128,
+     model=model_id,
      torch_dtype=torch_dtype,
      device=device,
+     model_kwargs=model_kwargs
  )

  # load sample audio & downsample to 16kHz
  dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
- dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
  sample = dataset[0]["audio"]

  # run inference
- result = pipe(sample)
+ result = pipe(sample, generate_kwargs=generate_kwargs)
  print(result["text"])
  ```

  - To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline (make sure the audio is sampled at 16kHz):
  ```diff
- - result = pipe(sample)
- + result = pipe("audio.mp3")
+ - result = pipe(sample, generate_kwargs=generate_kwargs)
+ + result = pipe("audio.mp3", generate_kwargs=generate_kwargs)
  ```

  - For segment-level timestamps, pass the argument `return_timestamps=True` and return the `"chunks"` output:
  ```python
- result = pipe(sample, return_timestamps=True)
+ result = pipe(sample, return_timestamps=True, generate_kwargs=generate_kwargs)
  print(result["chunks"])
  ```

- ### Sequential Long-Form
- Kotoba-whisper is designed to be compatible with OpenAI's sequential long-form transcription algorithm. This algorithm uses a sliding window for buffered
+ ***Sequential Long-Form:*** Kotoba-whisper is designed to be compatible with OpenAI's sequential long-form transcription algorithm. This algorithm uses a sliding window for buffered
  inference of long audio files (> 30-seconds), and returns more accurate transcriptions compared to the [chunked long-form algorithm](#chunked-long-form).
+ By default, long audio files passed to the pipeline are transcribed with the sequential long-form algorithm.
  The sequential long-form algorithm should be used in either of the following scenarios:

  1. Transcription accuracy is the most important factor, and latency is less of a consideration
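
For local files that are not already at 16kHz, the pipeline also accepts a raw waveform, so the audio can be resampled before the call. A minimal sketch (not part of this commit), assuming `librosa` is installed and reusing the `pipe` and `generate_kwargs` objects from the updated snippet above:

```python
import librosa

# load a local file and resample it to the 16kHz expected by the Whisper feature extractor
array, sampling_rate = librosa.load("audio.mp3", sr=16000)

# the ASR pipeline accepts a dict carrying the raw waveform and its sampling rate
result = pipe({"array": array, "sampling_rate": sampling_rate}, generate_kwargs=generate_kwargs)
print(result["text"])
```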
@@ -180,41 +176,6 @@ described [below](#chunked-long-form). For a detailed explanation of the differe
  the [Distil-Whisper paper](https://arxiv.org/pdf/2311.00430.pdf). The [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
  class can be used to transcribe long audio files with the sequential algorithm as follows:

- ```python
- import torch
- import numpy as np
- from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
- from datasets import load_dataset
-
- # config
- model_id = "kotoba-tech/kotoba-whisper-v1.0"
- torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
- device = "cuda:0" if torch.cuda.is_available() else "cpu"
-
- # load model
- model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
- model.to(device)
- processor = AutoProcessor.from_pretrained(model_id)
- pipe = pipeline(
-     "automatic-speech-recognition",
-     model=model,
-     tokenizer=processor.tokenizer,
-     feature_extractor=processor.feature_extractor,
-     max_new_tokens=128,
-     torch_dtype=torch_dtype,
-     device=device,
- )
-
- # load sample audio (concatenate instances to create a long audio)
- dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
- dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
- sample = {"array": np.concatenate([i["array"] for i in dataset[:20]["audio"]]), "sampling_rate": dataset[0]['audio']['sampling_rate'], "path": "tmp"}
-
- # run inference
- result = pipe(sample)
- print(result["text"])
- ```
-

  ### Chunked Long-Form
  This algorithm should be used when a single large audio file is being transcribed and the fastest possible inference is required. In such circumstances,
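
With the dedicated sequential long-form snippet removed, the long-form behaviour now comes from the default noted in the first hunk: a long input passed without `chunk_length_s` is decoded sequentially. A minimal sketch of that usage (not part of this commit), rebuilt from the removed block above and assuming the `pipe` and `generate_kwargs` objects from the updated short-form snippet are in scope:

```python
import numpy as np
from datasets import load_dataset

# concatenate several test-set utterances to build a single long (> 30s) audio input
dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
sample = {
    "array": np.concatenate([i["array"] for i in dataset[:20]["audio"]]),
    "sampling_rate": dataset[0]["audio"]["sampling_rate"],
}

# no chunk_length_s is passed, so the audio is transcribed with sequential long-form decoding
result = pipe(sample, generate_kwargs=generate_kwargs)
print(result["text"])
```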
@@ -224,37 +185,33 @@ is optimal. To activate batching over long audio files, pass the argument `batch

  ```python
  import torch
- from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
+ from transformers import pipeline
  from datasets import load_dataset

  # config
  model_id = "kotoba-tech/kotoba-whisper-v1.0"
  torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
  device = "cuda:0" if torch.cuda.is_available() else "cpu"
+ model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
+ generate_kwargs = {"language": "japanese", "task": "transcribe"}

  # load model
- model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
- model.to(device)
- processor = AutoProcessor.from_pretrained(model_id)
  pipe = pipeline(
      "automatic-speech-recognition",
-     model=model,
-     tokenizer=processor.tokenizer,
-     feature_extractor=processor.feature_extractor,
-     max_new_tokens=128,
-     chunk_length_s=25,
-     batch_size=16,
+     model=model_id,
      torch_dtype=torch_dtype,
      device=device,
+     model_kwargs=model_kwargs,
+     chunk_length_s=25,
+     batch_size=16
  )

  # load sample audio (concatenate instances to create a long audio)
  dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
- dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
- sample = {"array": np.concatenate([i["array"] for i in dataset[:20]["audio"]]), "sampling_rate": dataset[0]['audio']['sampling_rate'], "path": "tmp"}
+ sample = {"array": np.concatenate([i["array"] for i in dataset[:20]["audio"]]), "sampling_rate": dataset[0]['audio']['sampling_rate']}

  # run inference
- result = pipe(sample)
+ result = pipe(sample, generate_kwargs=generate_kwargs)
  print(result["text"])
  ```
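
One caveat with the updated snippet in this hunk: it still calls `np.concatenate` but no longer has a numpy import in the same block (the removed sequential long-form example carried it). When running the chunked long-form snippet on its own, add the import:

```python
import numpy as np  # required by the np.concatenate call in the chunked long-form example
```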
 
@@ -263,34 +220,41 @@ Kotoba-whisper can generate transcription with prompting as below:

  ```python
  import torch
- from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
+ from transformers import pipeline
  from datasets import load_dataset, Audio

  # config
  model_id = "kotoba-tech/kotoba-whisper-v1.0"
  torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
  device = "cuda:0" if torch.cuda.is_available() else "cpu"
+ model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
+ generate_kwargs = {"language": "japanese", "task": "transcribe"}

  # load model
- model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
- model.to(device)
- processor = AutoProcessor.from_pretrained(model_id)
+ pipe = pipeline(
+     "automatic-speech-recognition",
+     model=model_id,
+     torch_dtype=torch_dtype,
+     device=device,
+     model_kwargs=model_kwargs
+ )
+

  # load sample audio & downsample to 16kHz
  dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
- dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
- input_features = processor(dataset[10]["audio"]["array"], return_tensors="pt").input_features

  # --- Without prompt ---
- output_without_prompt = model.generate(input_features)
- print(processor.decode(output_without_prompt[0]))
- # <|startoftranscript|><|ko|><|transcribe|><|notimestamps|>81歳、力強い走りに変わってきます。<|endoftext|>
+ result = pipe(dataset[10]["audio"], generate_kwargs=generate_kwargs)
+ print(result['text'])
+ # 81歳、力強い走りに変わってきます。

  # --- With prompt ---: Let's change `81` to `91`.
- prompt_ids = processor.get_prompt_ids("91歳", return_tensors="pt")
- output_with_prompt = model.generate(input_features, prompt_ids=prompt_ids)
- print(processor.decode(output_with_prompt[0]))
- # <|startofprev|> 91歳<|startoftranscript|><|ko|><|transcribe|><|notimestamps|> あっぶったでもスルガさん、91歳、力強い走りに変わってきます。<|endoftext|>
+ prompt = "91歳"
+ generate_kwargs['prompt_ids'] = pipe.tokenizer.get_prompt_ids(prompt, return_tensors="pt").to(device)
+ result = pipe(dataset[10]["audio"], generate_kwargs=generate_kwargs)
+ result['text'] = result['text'][1 + len(prompt) + 1:]  # the prompt is prepended to the decoded text, so remove it
+ print(result['text'])
+ # あっぶったでもスルガさん、91歳、力強い走りに変わってきます。
  ```

  ### Additional Speed & Memory Improvements
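
The updated prompt example strips the echoed prompt with `result['text'][1 + len(prompt) + 1:]`, which assumes the decoded text begins with a space, the prompt, and one more separator character. A slightly more defensive variant (an illustrative sketch, not the model card's code), reusing `result` and `prompt` from the snippet above:

```python
text = result["text"]
# drop the echoed prompt only if it is actually present at the start of the decoded text
if text.lstrip().startswith(prompt):
    text = text.lstrip()[len(prompt):].lstrip()
print(text)
```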
@@ -310,31 +274,8 @@ pip install flash-attn --no-build-isolation
  Then pass `attn_implementation="flash_attention_2"` to `from_pretrained`:

  ```diff
- - model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
- + model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, attn_implementation="flash_attention_2")
- ```
-
- #### Torch Scale-Product-Attention (SDPA)
-
- If your GPU does not support Flash Attention, we recommend making use of PyTorch [scaled dot-product attention (SDPA)](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html).
- This attention implementation is activated **by default** for PyTorch versions 2.1.1 or greater. To check
- whether you have a compatible PyTorch version, run the following Python code snippet:
-
- ```python
- from transformers.utils import is_torch_sdpa_available
-
- print(is_torch_sdpa_available())
- ```
-
- If the above returns `True`, you have a valid version of PyTorch installed and SDPA is activated by default. If it
- returns `False`, you need to upgrade your PyTorch version according to the [official instructions](https://pytorch.org/get-started/locally/)
-
- Once a valid PyTorch version is installed, SDPA is activated by default. It can also be set explicitly by specifying
- `attn_implementation="sdpa"` as follows:
-
- ```diff
- - model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
- + model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, attn_implementation="sdpa")
+ - model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
+ + model_kwargs = {"attn_implementation": "flash_attention_2"} if torch.cuda.is_available() else {}
  ```

 