SWRA (SWARA)

SWRA (SWARA) is a Speech to Text Transformer (S2T) model trained by @binarybardakshat for automatic speech recognition (ASR).

Model Description

SWRA (SWARA) is an end-to-end sequence-to-sequence transformer model. It is trained with standard autoregressive cross-entropy loss and generates the transcripts autoregressively.

How to Use

As this is a standard sequence-to-sequence transformer model, you can use the generate method to generate the transcripts by passing the speech features to the model.

Note: The Speech2TextProcessor object uses torchaudio to extract the filter bank features. Make sure to install the torchaudio package before running this example.

Note: The feature extractor depends on torchaudio and the tokenizer depends on sentencepiece, so be sure to install those packages before running the examples.

You could either install those as extra speech dependencies with pip install transformers"[speech, sentencepiece]" or install the packages separately with pip install torchaudio sentencepiece.

import torch
from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration
from datasets import load_dataset

model = Speech2TextForConditionalGeneration.from_pretrained("binarybardakshat/swra-swara")
processor = Speech2TextProcessor.from_pretrained("binarybardakshat/swra-swara")

ds = load_dataset(
    "patrickvonplaten/librispeech_asr_dummy",
    "clean",
    split="validation"
)

input_features = processor(
    ds[0]["audio"]["array"],
    sampling_rate=16_000,
    return_tensors="pt"
).input_features  # Batch size 1
generated_ids = model.generate(input_features=input_features)

transcription = processor.batch_decode(generated_ids)

#### Evaluation on LibriSpeech Test

The following script shows how to evaluate this model on the [LibriSpeech](https://huggingface.co/datasets/librispeech_asr)
*"clean"* and *"other"* test dataset.

```python
from datasets import load_dataset
from evaluate import load
from transformers import Speech2TextForConditionalGeneration, Speech2TextProcessor

librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")  # change to "other" for other test dataset
wer = load("wer")

model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr").to("cuda")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr", do_upper_case=True)

def map_to_pred(batch):
    features = processor(batch["audio"]["array"], sampling_rate=16000, padding=True, return_tensors="pt")
    input_features = features.input_features.to("cuda")
    attention_mask = features.attention_mask.to("cuda")

    gen_tokens = model.generate(input_features=input_features, attention_mask=attention_mask)
    batch["transcription"] = processor.batch_decode(gen_tokens, skip_special_tokens=True)[0]
    return batch

result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])

print("WER:", wer.compute(predictions=result["transcription"], references=result["text"]))

Result (WER):

"clean"	"other"
4.3	9.0

Training data

The S2T-SMALL-LIBRISPEECH-ASR is trained on LibriSpeech ASR Corpus, a dataset consisting of approximately 1000 hours of 16kHz read English speech.

Downloads last month: 3

Safetensors

Model size

29.5M params

Tensor type

F32

Dataset used to train Binarybardakshat/SWRA

Evaluation results

Test WER on LibriSpeech (clean)
test set self-reported

4.300
Test WER on LibriSpeech (other)
test set self-reported

9.000