---
license: apache-2.0
language:
- en
- zh
- de
- es
- ru
- ko
- fr
- ja
- pt
- tr
- pl
- ca
- nl
- ar
- sv
- it
- id
- hi
- fi
- vi
- he
- uk
- el
- ms
- cs
- ro
- da
- hu
- ta
- 'no'
- th
- ur
- hr
- bg
- lt
- la
- mi
- ml
- cy
- sk
- te
- fa
- lv
- bn
- sr
- az
- sl
- kn
- et
- mk
- br
- eu
- is
- hy
- ne
- mn
- bs
- kk
- sq
- sw
- gl
- mr
- pa
- si
- km
- sn
- yo
- so
- af
- oc
- ka
- be
- tg
- sd
- gu
- am
- yi
- lo
- uz
- fo
- ht
- ps
- tk
- nn
- mt
- sa
- lb
- my
- bo
- tl
- mg
- as
- tt
- haw
- ln
- ha
- ba
- jw
- su
tags:
- audio
- automatic-speech-recognition
base_model: openai/whisper-small
pipeline_tag: automatic-speech-recognition
---

# Whisper-small OpenVINO IR

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours
of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains **without** the need
for fine-tuning.

Whisper was proposed in the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356)
by Alec Radford et al. from OpenAI. The original code repository can be found [here](https://github.com/openai/whisper).

**Disclaimer**: Content for this model card has partly been copied and pasted from [this model card](https://huggingface.co/openai/whisper-small).

# Model details

Whisper is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model.



| Model Type                | Parameters | n_audio_ctx | n_audio_state | n_audio_head | n_audio_layer | n_text_ctx | n_text_state | n_text_head | n_text_layer | n_mels | n_vocab |
|---------------------------|------------|-------------|---------------|--------------|---------------|------------|--------------|-------------|--------------|--------|---------|
| whisper-tiny              | 39 M       | 1500        | 384           | 6            | 4             | 448        | 384          | 6           | 4            | 80     | 51865   |
| whisper-base              | 74 M       | 1500        | 512           | 8            | 6             | 448        | 512          | 8           | 6            | 80     | 51865   |
| **whisper-small**         | 244 M      | 1500        | 768           | 12           | 12            | 448        | 768          | 12          | 12           | 80     | 51865   |
| whisper-medium            | 769 M      | 1500        | 1024          | 16           | 24            | 448        | 1024         | 16          | 24           | 80     | 51865   |
| whisper-large-v1          | 1550 M     | 1500        | 1280          | 20           | 32            | 448        | 1280         | 20          | 32           | 80     | 51865   |
| whisper-large-v2          | 1550 M     | 1500        | 1280          | 20           | 32            | 448        | 1280         | 20          | 32           | 80     | 51865   |
| distil-whisper-large-v2   | 756 M      | 1500        | 1280          | 20           | 32            | 448        | 1280         | 20          | 2            | 80     | 51865   |
| whisper-large-v3          | 1550 M     | 1500        | 1280          | 20           | 32            | 448        | 1280         | 20          | 32           | 128    | 51866   |
| distil-whisper-large-v3   | 756 M      | 1500        | 1280          | 20           | 32            | 448        | 1280         | 20          | 2            | 128    | 51866   |
| whisper-large-v3-turbo    | 809 M      | 1500        | 1280          | 20           | 32            | 448        | 1280         | 20          | 4            | 128    | 51866   |
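
The encoder-side columns of the table can be sanity-checked with a little arithmetic. The sketch below is illustrative only: `AudioDims` and `encoder_positions` are hypothetical helpers, and the front-end constants (16 kHz sampling, a 160-sample hop, 30-second windows, a stride-2 convolutional stem) are assumptions taken from the reference Whisper implementation rather than from this model card. Under those assumptions it shows how `n_audio_ctx = 1500` arises and how wide each attention head is for whisper-small.

```python
from dataclasses import dataclass

# Assumed front-end constants from the reference Whisper implementation;
# they are not stated in this model card.
SAMPLE_RATE = 16_000   # audio samples per second
HOP_LENGTH = 160       # samples between consecutive mel frames (10 ms)
CHUNK_SECONDS = 30     # length of the audio window fed to the encoder
CONV_STRIDE = 2        # downsampling factor of the encoder's convolutional stem


@dataclass(frozen=True)
class AudioDims:
    """Encoder-side columns of the table (hypothetical helper)."""
    n_mels: int
    n_audio_ctx: int
    n_audio_state: int
    n_audio_head: int

    @property
    def head_dim(self) -> int:
        # Width of one attention head: hidden size split evenly across heads.
        return self.n_audio_state // self.n_audio_head


def encoder_positions(seconds: int = CHUNK_SECONDS) -> int:
    """Number of encoder positions produced for `seconds` of audio."""
    mel_frames = seconds * SAMPLE_RATE // HOP_LENGTH
    return mel_frames // CONV_STRIDE


# Values copied from the whisper-small row of the table.
small = AudioDims(n_mels=80, n_audio_ctx=1500, n_audio_state=768, n_audio_head=12)

print(small.head_dim)                             # 64
print(encoder_positions() == small.n_audio_ctx)   # True
print(1000 * CHUNK_SECONDS / small.n_audio_ctx)   # 20.0 (milliseconds per position)
```

The same head width of 64 follows for every row of the table, since `n_audio_state` and `n_audio_head` scale together (for example 384 / 6 and 1280 / 20).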