---
language:
  - en
  - it
license: apache-2.0
base_model: openai/whisper-small
tags:
  - hf-asr-leaderboard
  - generated_from_trainer
datasets:
  - screevoai/code-switch
metrics:
  - wer
model-index:
  - name: Heero-STT-Model
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Code-Switch Dataset
          type: screevoai/code-switch
          config: None
          split: None
          args: None
        metrics:
          - name: Wer
            type: wer
            value: 4.446809768789546
---

# Heero-STT-Model

This model is a fine-tuned version of openai/whisper-small on the screevoai/code-switch dataset. It achieves the following results on the evaluation set:

- Loss: 0.0895
- Wer: 4.4468

## Training results

| Training Loss | Epoch | Step | Validation Loss | Wer    |
|---------------|-------|------|-----------------|--------|
| 0.0345        | 3     | 1250 | 0.0895          | 4.4468 |
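The Wer column is word error rate in percent: the word-level edit distance (substitutions, insertions, and deletions) divided by the number of reference words. The figure above comes from the trainer's own metric; the following is only a minimal, dependency-free sketch of how that computation works:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count, in percent."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return 100.0 * dp[len(ref)][len(hyp)] / len(ref)
```

For example, one substitution in a four-word reference gives a WER of 25.0.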

## Libraries to Install

```bash
pip install transformers datasets safetensors librosa huggingface-hub
```

## Authentication needed before running the script

Run one of the following, depending on your environment:

- Terminal: `huggingface-cli login`

- Jupyter notebook:

  ```python
  >>> from huggingface_hub import notebook_login
  >>> notebook_login()
  ```

NOTE: Copy and paste the token from your Hugging Face account: Settings > Access Tokens > create a new token, or copy an existing one.

## Script

```python
>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
>>> from datasets import load_dataset
>>> import librosa
>>> import requests
>>> from io import BytesIO

>>> # Load model and processor
>>> processor = WhisperProcessor.from_pretrained("screevoai/heero-small-v1")
>>> model = WhisperForConditionalGeneration.from_pretrained("screevoai/heero-small-v1")
>>> model.config.forced_decoder_ids = None

>>> # Load the dataset and pick a sample
>>> ds = load_dataset("screevoai/code-switch", split="test")
>>> sample_url = ds[2]["audio_file_path"]  # change the row index to test other audio files

>>> # Download the audio file
>>> response = requests.get(sample_url)
>>> response.raise_for_status()
>>> audio_file_data = BytesIO(response.content)

>>> # Resample the audio to 16 kHz, the rate Whisper expects
>>> audio, sr = librosa.load(audio_file_data, sr=None)
>>> audio_resampled = librosa.resample(audio, orig_sr=sr, target_sr=16000)

>>> # Convert to log-Mel input features
>>> processed_audio = processor(audio_resampled, sampling_rate=16000, return_tensors="pt")
>>> input_features = processed_audio["input_features"]

>>> # Generate a transcription with the model
>>> output_ids = model.generate(input_features, max_new_tokens=400)
>>> transcription = processor.batch_decode(output_ids, skip_special_tokens=True)[0]

>>> print(transcription)
```
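The `librosa.resample` call above handles the rate conversion with a high-quality polyphase filter. Conceptually, resampling just maps the signal onto a new time grid; a dependency-free sketch using linear interpolation (`np.interp`), for illustration only and assuming a mono float array:

```python
import numpy as np

def naive_resample(audio: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
    """Linearly interpolate a mono signal onto a target-rate time grid.

    Illustration of the idea only; librosa's resampler applies proper
    anti-aliasing filtering and should be used in practice.
    """
    duration = len(audio) / orig_sr
    n_out = int(round(duration * target_sr))
    t_in = np.linspace(0.0, duration, num=len(audio), endpoint=False)
    t_out = np.linspace(0.0, duration, num=n_out, endpoint=False)
    return np.interp(t_out, t_in, audio)
```

For example, one second of 44.1 kHz audio (44,100 samples) comes out as 16,000 samples at the 16 kHz rate the model expects.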