waveletdeboshir/whisper-large-v3-no-numbers



	---
	base_model:
	- openai/whisper-large-v3
	language:
	- en
	- zh
	- de
	- es
	- ru
	- ko
	- fr
	- ja
	- pt
	- tr
	- pl
	- ca
	- nl
	- ar
	- sv
	- it
	- id
	- hi
	- fi
	- vi
	- he
	- uk
	- el
	- ms
	- cs
	- ro
	- da
	- hu
	- ta
	- 'no'
	- th
	- ur
	- hr
	- bg
	- lt
	- la
	- mi
	- ml
	- cy
	- sk
	- te
	- fa
	- lv
	- bn
	- sr
	- az
	- sl
	- kn
	- et
	- mk
	- br
	- eu
	- is
	- hy
	- ne
	- mn
	- bs
	- kk
	- sq
	- sw
	- gl
	- mr
	- pa
	- si
	- km
	- sn
	- yo
	- so
	- af
	- oc
	- ka
	- be
	- tg
	- sd
	- gu
	- am
	- yi
	- lo
	- uz
	- fo
	- ht
	- ps
	- tk
	- nn
	- mt
	- sa
	- lb
	- my
	- bo
	- tl
	- mg
	- as
	- tt
	- haw
	- ln
	- ha
	- ba
	- jw
	- su
	library_name: transformers
	license: apache-2.0
	pipeline_tag: automatic-speech-recognition
	tags:
	- asr
	- Pytorch
	- pruned
	- audio
	- automatic-speech-recognition
	---

	# Whisper-large-v3-no-numbers

	## Model info
	This is a version of the [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) model with the number tokens removed (token ids corresponding to numbers are excluded).
	No fine-tuning was used.

	Spoken numbers are transcribed as words rather than digits, which can be useful for TTS data preparation.

	Example: instead of "25", this model transcribes the phrase as "twenty five".
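
To make the idea of "token ids corresponding to numbers" concrete, here is a minimal sketch on a toy vocabulary. It assumes that a "number token" is any token whose text contains an ASCII digit; the exact criterion used for this model is not stated here. The names `toy_vocab` and `find_number_token_ids` are hypothetical and exist only for illustration.

```python
import re

# Hypothetical toy vocabulary (token text -> token id). The real Whisper
# vocabulary is far larger; this is only an illustration.
toy_vocab = {
    " twenty": 0,
    " five": 1,
    "25": 2,
    " 2": 3,
    "5": 4,
    " years": 5,
    ".": 6,
}


def find_number_token_ids(vocab):
    """Return the sorted ids of tokens whose text contains an ASCII digit.

    Assumption: "number token" means "contains a digit". The criterion
    actually used to build the model may differ.
    """
    return sorted(idx for token, idx in vocab.items() if re.search(r"[0-9]", token))


number_ids = find_number_token_ids(toy_vocab)
# Tokens that would remain available once the number tokens are excluded.
kept_tokens = [token for token, idx in toy_vocab.items() if idx not in number_ids]

print(number_ids)
print(kept_tokens)
```

With the digit tokens unavailable, a decoder can only spell "25" using word tokens such as " twenty" and " five", which matches the behaviour described above.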

	## Usage
	`transformers` version `4.45.2`

	The model can be used in the same way as the original Whisper:

	```python
	>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
	>>> import torchaudio

	>>> # load audio
	>>> wav, sr = torchaudio.load("audio.wav")
	>>> # resample if necessary
	>>> wav = torchaudio.functional.resample(wav, sr, 16000)

	>>> # load model and processor
	>>> processor = WhisperProcessor.from_pretrained("waveletdeboshir/whisper-large-v3-no-numbers")
	>>> model = WhisperForConditionalGeneration.from_pretrained("waveletdeboshir/whisper-large-v3-no-numbers")

	>>> input_features = processor(wav[0], sampling_rate=16000, return_tensors="pt").input_features

	>>> # generate token ids
	>>> predicted_ids = model.generate(input_features)
	>>> # decode token ids to text
	>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
	>>> transcription
	['<|startoftranscript|><|en|><|transcribe|><|notimestamps|> Twenty seven years. <|endoftext|>']

	```
	The context tokens can be removed from the start of the transcription by setting `skip_special_tokens=True`.
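
If a transcription has already been decoded with `skip_special_tokens=False`, the markers can also be stripped afterwards. The following is a minimal sketch, assuming every marker has the `<|...|>` form seen in the example output. `strip_special_tokens` is a hypothetical helper and not part of `transformers`; decoding with `skip_special_tokens=True` remains the simpler route.

```python
import re

# Matches Whisper-style markers such as <|en|> or <|endoftext|>.
SPECIAL_TOKEN = re.compile(r"<\|[^|]*\|>")


def strip_special_tokens(text):
    """Remove <|...|> markers from a decoded string and trim outer whitespace."""
    return SPECIAL_TOKEN.sub("", text).strip()


raw = "<|startoftranscript|><|en|><|transcribe|><|notimestamps|> Twenty seven years. <|endoftext|>"
clean = strip_special_tokens(raw)
print(clean)
```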