	---
	license: apache-2.0
	language:
	- ru
	library_name: transformers
	pipeline_tag: automatic-speech-recognition
	tags:
	- asr
	- Pytorch
	- pruned
	- audio
	- automatic-speech-recognition
	metrics:
	- cer
	- wer
	model-index:
	- name: Whisper Small Pruned for Russian
	  results:
	  - task:
	      name: Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: Common Voice 15.0 (Russian part, test)
	      type: mozilla-foundation/common_voice_15_0
	      args: ru
	    metrics:
	    - name: WER
	      type: wer
	      value: 24.98
	  - task:
	      name: Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: Common Voice 15.0 (Russian part, test)
	      type: mozilla-foundation/common_voice_15_0
	      args: ru
	    metrics:
	    - name: WER (without punctuation)
	      type: wer
	      value: 17.48
	---

	# Whisper-small-ru-pruned

	## Model info
	This is a pruned version of the [openai/whisper-small](https://huggingface.co/openai/whisper-small) model in which only Russian tokens are kept.
	Pruning was done without any fine-tuning, using the method described in [this post](https://medium.com/m/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fhow-to-adapt-a-multilingual-t5-model-for-a-single-language-b9f94f3d9c90).
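
The core idea of vocabulary pruning, keeping only the embedding rows that belong to retained tokens and renumbering them, can be sketched roughly as follows. This is a minimal pure-Python illustration on a toy matrix, not the author's code; `prune_embedding` and the toy values are hypothetical names introduced here.

```python
def prune_embedding(weights, kept_ids):
    """Keep only the embedding rows for `kept_ids` and renumber them.

    Returns the smaller matrix and a mapping old_id -> new_id.
    """
    kept_ids = sorted(set(kept_ids))
    new_weights = [list(weights[i]) for i in kept_ids]
    old_to_new = {old: new for new, old in enumerate(kept_ids)}
    return new_weights, old_to_new


# Toy "embedding matrix": 6 tokens, 2 dimensions each (hypothetical values).
toy_weights = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1], [5.0, 5.1]]

pruned, old_to_new = prune_embedding(toy_weights, kept_ids=[5, 0, 3])
print(len(pruned))   # 3 rows remain
print(old_to_new)    # {0: 0, 3: 1, 5: 2}
```

In a real model, the same row selection would presumably be applied to the decoder token embedding and the output projection, and the tokenizer vocabulary would be rewritten to use the new ids.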

	## Size
	Less than 10% of the original tokens were kept: the special Whisper tokens (no language tokens except `<|ru|>` and `<|en|>`, and no timestamp tokens), the 200 most popular tokens from the tokenizer, and the 4000 most popular Russian tokens, determined by tokenizing a Russian text corpus.
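
A frequency-based selection of this kind could look like the sketch below. It is only an illustration under assumptions: `select_tokens` is a hypothetical helper, the token ids are made up, and it covers just the "special tokens plus most frequent corpus tokens" part of the procedure.

```python
from collections import Counter


def select_tokens(tokenized_corpus, special_ids, top_k):
    """Return sorted ids: all special tokens plus the `top_k` most frequent others."""
    counts = Counter(tok for sentence in tokenized_corpus for tok in sentence)
    special = set(special_ids)
    frequent = [tok for tok, _ in counts.most_common() if tok not in special][:top_k]
    return sorted(special | set(frequent))


# Hypothetical token ids produced by running a tokenizer over a small corpus.
corpus_ids = [[7, 8, 9, 7], [7, 9, 10], [11, 7, 9]]
kept = select_tokens(corpus_ids, special_ids=[0, 1], top_k=2)
print(kept)  # [0, 1, 7, 9]
```

Special tokens are always kept regardless of how often they appear in the corpus, since decoding depends on them.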

	The model is about 15% smaller than the original whisper-small:

	| | openai/whisper-small | waveletdeboshir/whisper-small-ru-pruned |
	| :------ | :------ | :------ |
	| n of parameters | 242 M | 205 M |
	| n of parameters (with proj_out layer) | 281 M | 208 M |
	| model file size | 967 MB | 834 MB |
	| vocab_size | 51865 | 4207 |
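
The parameter savings in the table can be checked with simple arithmetic, since pruning the vocabulary only shrinks matrices of shape `vocab_size x d_model`. The sketch below assumes the whisper-small hidden size of 768; the helper name `embedding_params` is introduced here for illustration.

```python
D_MODEL = 768           # hidden size of whisper-small
ORIGINAL_VOCAB = 51865
PRUNED_VOCAB = 4207


def embedding_params(vocab_size, d_model=D_MODEL):
    """Number of weights in a vocab_size x d_model matrix."""
    return vocab_size * d_model


saved = embedding_params(ORIGINAL_VOCAB) - embedding_params(PRUNED_VOCAB)
print(saved)                       # 36601344 weights saved per matrix
print(round(saved / 1e6, 1))       # 36.6 (millions)
print(round(2 * saved / 1e6, 1))   # 73.2 when proj_out is counted separately
```

These figures agree with the table within rounding: 242 M - 205 M = 37 M for the embedding alone, and 281 M - 208 M = 73 M when the `proj_out` layer is counted as well.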

	## Usage
	The model can be used in the same way as the original Whisper:

	```python
	>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
	>>> import torchaudio

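	>>> # load audio
	>>> wav, sr = torchaudio.load("audio.wav")
	>>> # Whisper expects 16 kHz audio, so resample if the file uses another rate
	>>> if sr != 16000:
	...     wav = torchaudio.functional.resample(wav, sr, 16000)
	...     sr = 16000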

	>>> # load model and processor
	>>> processor = WhisperProcessor.from_pretrained("waveletdeboshir/whisper-small-ru-pruned")
	>>> model = WhisperForConditionalGeneration.from_pretrained("waveletdeboshir/whisper-small-ru-pruned")

	>>> input_features = processor(wav[0], sampling_rate=sr, return_tensors="pt").input_features

	>>> # generate token ids
	>>> predicted_ids = model.generate(input_features)
	>>> # decode token ids to text
	>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
	>>> transcription
	['<|startoftranscript|><|ru|><|transcribe|><|notimestamps|> Начинаем работу.<|endoftext|>']

	```
	The context tokens can be removed from the start of the transcription by setting `skip_special_tokens=True`.

	## Other pruned whisper models
	* [waveletdeboshir/whisper-tiny-ru-pruned](https://huggingface.co/waveletdeboshir/whisper-tiny-ru-pruned)
	* [waveletdeboshir/whisper-base-ru-pruned](https://huggingface.co/waveletdeboshir/whisper-base-ru-pruned)

	## Metrics
	| metric | dataset | openai/whisper-small | waveletdeboshir/whisper-small-ru-pruned |
	| :------ | :------ | :------ | :------ |
	| WER* | golos-test-crowd | 0.3358 | 0.3471 |
	| CER* | golos-test-crowd | 0.1561 | 0.1444 |
	| WER* | common_voice_15_0_test | 0.1749 | 0.1748 |
	| WER | common_voice_15_0_test | 0.2492 | 0.2498 |

	*Metrics were computed after text normalization
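
To illustrate why normalization changes the scores, here is a minimal sketch of WER and CER based on Levenshtein distance. The normalization shown (lowercasing and stripping ASCII punctuation) is an assumption made for the example; the exact normalization used for the table above is not specified, and all function names are hypothetical.

```python
import string


def normalize(text):
    """Lowercase and strip ASCII punctuation (an assumed normalization)."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference length."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


ref = "Начинаем работу."
hyp = "начинаем работу"
print(wer(ref, hyp))                        # 1.0: case and punctuation count as errors
print(wer(normalize(ref), normalize(hyp)))  # 0.0 after normalization
```

Without normalization, differences in capitalization and punctuation count as word errors, which is consistent with the unnormalized WER on Common Voice being higher than the normalized one.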

	You can fine-tune this model on your own data to achieve better performance.

	## Colab for vocab pruning
	See https://github.com/waveletdeboshir/whisper-lang-remover