	---
	language:
	- pl
	tags:
	- audio
	- automatic-speech-recognition
	- transformers.js
	pipeline_tag: automatic-speech-recognition
	license: mit
	library_name: transformers
	---

	# Polish Distil-Whisper: distil-large-v3

	Distil-Whisper was proposed in the paper [Robust Knowledge Distillation via Large-Scale Pseudo Labelling](https://arxiv.org/abs/2311.00430).

	It is a distilled version of the Whisper model that is 3 times faster and 49% smaller. This is the repository for distil-large-v3-pl, a distilled variant of [Whisper large-v3](https://huggingface.co/openai/whisper-large-v3).


	## Usage

	Distil-Whisper is supported in Hugging Face 🤗 Transformers from version 4.35 onwards. To run the model, first
	install the latest version of the Transformers library. For this example, we'll also install 🤗 Datasets to load a
	sample audio dataset from the Hugging Face Hub:

	```bash
	pip install --upgrade pip
	pip install --upgrade transformers accelerate datasets[audio]
	```

	### Short-Form Transcription

	The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
	class to transcribe short-form audio files (< 30 seconds) as follows:

	```python
	import torch
	from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
	from datasets import load_dataset


	# Use the GPU with half precision when available, otherwise fall back to CPU in float32
	device = "cuda:0" if torch.cuda.is_available() else "cpu"
	torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

	model_id = "Aspik101/distil-whisper-large-v3-pl"

	# Load the model weights and move them to the selected device
	model = AutoModelForSpeechSeq2Seq.from_pretrained(
	    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
	)
	model.to(device)

	# The processor bundles the tokenizer and the feature extractor
	processor = AutoProcessor.from_pretrained(model_id)

	pipe = pipeline(
	    "automatic-speech-recognition",
	    model=model,
	    tokenizer=processor.tokenizer,
	    feature_extractor=processor.feature_extractor,
	    max_new_tokens=128,
	    torch_dtype=torch_dtype,
	    device=device,
	)

	# Take the first Polish test sample from Common Voice 13
	dataset = load_dataset("mozilla-foundation/common_voice_13_0", "pl", split="test")
	sample = dataset[0]["audio"]

	result = pipe(sample)
	print(result["text"])
	```

	To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
	```diff
	- result = pipe(sample)
	+ result = pipe("audio.mp3")
	```

	### Long-Form Transcription

	Distil-Whisper uses a chunked algorithm to transcribe long-form audio files (> 30 seconds). In practice, this chunked long-form algorithm
	is 9x faster than the sequential algorithm proposed by OpenAI in the Whisper paper (see Table 7 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430)).

	To enable chunking, pass the `chunk_length_s` parameter to the `pipeline`. For Distil-Whisper, a chunk length of 15 seconds
	is optimal. To activate batching, pass the argument `batch_size`:

	```python
	import torch
	from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
	from datasets import load_dataset


	# Use the GPU with half precision when available, otherwise fall back to CPU in float32
	device = "cuda:0" if torch.cuda.is_available() else "cpu"
	torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

	model_id = "Aspik101/distil-whisper-large-v3-pl"

	# Load the model weights and move them to the selected device
	model = AutoModelForSpeechSeq2Seq.from_pretrained(
	    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
	)
	model.to(device)

	# The processor bundles the tokenizer and the feature extractor
	processor = AutoProcessor.from_pretrained(model_id)

	pipe = pipeline(
	    "automatic-speech-recognition",
	    model=model,
	    tokenizer=processor.tokenizer,
	    feature_extractor=processor.feature_extractor,
	    max_new_tokens=128,
	    chunk_length_s=15,  # split the audio into 15-second chunks
	    batch_size=16,  # transcribe up to 16 chunks per forward pass
	    torch_dtype=torch_dtype,
	    device=device,
	)

	# Common Voice clips are typically only a few seconds long;
	# pass a longer recording to actually exercise chunking
	dataset = load_dataset("mozilla-foundation/common_voice_13_0", "pl", split="test")
	sample = dataset[0]["audio"]

	result = pipe(sample)
	print(result["text"])
	```
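
To make the chunking above more concrete, the following is a minimal sketch of how a long recording could be split into overlapping windows. It is an illustration only, not the pipeline's actual code: `chunk_windows` is a hypothetical helper, and the overlap of `chunk_length_s / 6` on each side is an assumption about the pipeline's default stride.

```python
def chunk_windows(duration_s, chunk_length_s=15.0, stride_s=None):
    """Return (start, end) times in seconds of overlapping chunks covering the audio.

    Hypothetical helper for illustration; not part of Transformers.
    """
    if duration_s <= 0:
        return []
    if stride_s is None:
        # Assumption: each chunk overlaps its neighbours by chunk_length_s / 6 per side
        stride_s = chunk_length_s / 6
    # Consecutive chunks start this many seconds apart
    step = chunk_length_s - 2 * stride_s
    if step <= 0:
        raise ValueError("chunk_length_s must be larger than twice the stride")

    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break  # the last chunk reaches the end of the audio
        start += step
    return windows


if __name__ == "__main__":
    # Print the windows for a 60-second recording
    for start, end in chunk_windows(60.0):
        print(f"{start:5.1f}s -> {end:5.1f}s")
```

Under these assumptions, a 60-second recording yields six overlapping windows, so with `batch_size=16` all of them would fit in a single batch.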

	<!---
	Tip: The pipeline can also be used to transcribe an audio file from a remote URL, for example:

	```python
	result = pipe("https://huggingface.co/datasets/sanchit-gandhi/librispeech_long/resolve/main/audio.wav")
	```
	--->