Model Card

Model Summary
Use
Training
License
Citations

Model Summary

This model was fine-tuned as part of the Speech Translation 3-Week Mentorship by Yasmin Moslem.

Use

Intended use

The model was fine-tuned on Ukrainian speech (source) paired with English text (target) and can be used for speech-to-text translation from Ukrainian into English.

Generation

The model accepts mono-channel audio with a sampling rate of 16 kHz.

import torch
import torchaudio

from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model = WhisperForConditionalGeneration.from_pretrained('whisper-uk2en-speech-translation')
processor = WhisperProcessor.from_pretrained('whisper-uk2en-speech-translation')

# Audio files in `datasets` format
test_dataset = load_dataset('your-dataset-name-goes-here', split='test')
sample = test_dataset[123]['audio']
inputs = processor(sample['array'].squeeze(), sampling_rate=16000, return_tensors='pt', return_attention_mask=True)
with torch.inference_mode():
    predictions = model.generate(**inputs)
sample['translation'] = processor.batch_decode(predictions, skip_special_tokens=True)[0].strip()

# Standalone audio files
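waveform, _ = torchaudio.load('ukrainian_speech.wav')
# torchaudio returns a (channels, samples) tensor; the processor expects a 1-D array
inputs = processor(waveform.squeeze(0).numpy(), sampling_rate=16000, return_tensors='pt', return_attention_mask=True)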
with torch.inference_mode():
    predictions = model.generate(**inputs)
print(processor.batch_decode(predictions, skip_special_tokens=True)[0].strip())

Attribution & Other Requirements

The following datasets, all licensed under the CC-BY-4.0 license, were used for fine-tuning the model:

google/fleurs (fully authentic)
skypro1111/elevenlabs_dataset (fully synthetic)
MLCommons/ml_spoken_words (authentic + synthetic)

The FLEURS dataset contains only authentic human speech and translations. For the elevenlabs dataset, the Ukrainian text was generated by ChatGPT and then voiced by the ElevenLabs TTS model, and the transcripts were machine-translated into English by Azure Translator. In the ML Spoken Words dataset, the Ukrainian speech and transcripts are authentic human data, while the English text was machine-translated from Ukrainian by Azure Translator. NOTE: The English translations were not human-verified or proofread due to time limitations and, as such, may contain mistakes and inaccuracies.

Total (train): 10390 samples (10 hours 45 minutes 12 seconds)
Total (dev): 2058 samples (1 hour 36 minutes 7 seconds)
Total (test): 2828 samples (3 hours 1 minute 28 seconds)
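
To put these totals in perspective, the average clip length per split can be derived from the sample counts and durations. This is a small sketch using only the figures listed above; the variable names are illustrative.

```python
from datetime import timedelta

# (number of samples, total duration) per split, as reported above
splits = {
    "train": (10390, timedelta(hours=10, minutes=45, seconds=12)),
    "dev": (2058, timedelta(hours=1, minutes=36, seconds=7)),
    "test": (2828, timedelta(hours=3, minutes=1, seconds=28)),
}

# Average duration of a single sample in seconds
avg_seconds = {
    name: duration.total_seconds() / count
    for name, (count, duration) in splits.items()
}

for name, seconds in avg_seconds.items():
    print(f"{name}: {seconds:.2f} s per sample")
```

The averages come out at roughly 3 to 4 seconds per clip, well within the 30-second window that Whisper processes at a time.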

Training

The model was fine-tuned on a mix of authentic and synthetic speech paired with text translations, on a T4 GPU in Google Colab, with the following training parameters:

learning_rate: 1e-6
batch_size: 32
num_train_epochs: 3 (975 training steps)
warmup_steps: 0
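
The step count for three epochs follows from the training split size and the batch size. The sketch below reproduces it, assuming a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; these assumptions are not stated in the original.

```python
import math

train_samples = 10390  # size of the training split
batch_size = 32
num_train_epochs = 3

# The last batch of each epoch is smaller than 32 but still counts as a step
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(steps_per_epoch, total_steps)
```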

The table below shows the training and validation losses and the BLEU score on the development set during fine-tuning. The model converged at step 900, or approximately epoch 3: validation loss reached its minimum (1.1996) and BLEU its maximum (24.95) there, and both were slightly worse at steps 1000 and 1100, which suggests the onset of overfitting.

Step	Training loss	Validation loss	BLEU
100	2.491100	2.007935	21.813000
200	1.600800	1.383696	23.344800
300	1.430900	1.309672	23.846300
400	1.320600	1.268230	23.911000
500	1.289200	1.248684	24.192300
600	1.243800	1.239911	24.385900
700	1.194200	1.207502	23.941100
800	1.170800	1.211733	24.888100
900	1.143800	1.199629	24.946900
1000	1.153400	1.206929	24.919100
1100	1.119200	1.201825	24.597300
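
Checkpoint selection from such a log can be sketched as follows. The rows are copied from the table above; choosing by lowest validation loss and by highest BLEU are two common criteria, shown here as an illustration and not necessarily the exact procedure that was used.

```python
# (step, training loss, validation loss, BLEU) rows copied from the table above
log = [
    (100, 2.491100, 2.007935, 21.813000),
    (200, 1.600800, 1.383696, 23.344800),
    (300, 1.430900, 1.309672, 23.846300),
    (400, 1.320600, 1.268230, 23.911000),
    (500, 1.289200, 1.248684, 24.192300),
    (600, 1.243800, 1.239911, 24.385900),
    (700, 1.194200, 1.207502, 23.941100),
    (800, 1.170800, 1.211733, 24.888100),
    (900, 1.143800, 1.199629, 24.946900),
    (1000, 1.153400, 1.206929, 24.919100),
    (1100, 1.119200, 1.201825, 24.597300),
]

best_by_loss = min(log, key=lambda row: row[2])  # lowest validation loss
best_by_bleu = max(log, key=lambda row: row[3])  # highest BLEU on the dev set

print("Lowest validation loss at step", best_by_loss[0])
print("Highest BLEU at step", best_by_bleu[0])
```

Both criteria point to the same checkpoint for this run.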

Evaluation

Both the original and the fine-tuned checkpoints were evaluated on the test split of the dataset. The evaluation metrics are BLEU and ChrF++ as implemented in the sacrebleu library.

Model	BLEU	ChrF++
`whisper-small`	16.36	43.81
`checkpoint-900`	22.34	48.1

Fine-tuning improved BLEU by almost 6 points (16.36 to 22.34) and ChrF++ by about 4.3 points (43.81 to 48.1) over the baseline. (For comparison, checkpoints 800 and 1100 scored BLEU 22.11 and 21.83 and ChrF++ 47.81 and 47.8, respectively.)

License

The fine-tuned model is released under the Apache-2.0 license, the same license as the original openai/whisper-small checkpoint.

Citations

@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

@article{fleurs2022arxiv,
  title = {FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech},
  author = {Conneau, Alexis and Ma, Min and Khanuja, Simran and Zhang, Yu and Axelrod, Vera and Dalmia, Siddharth and Riesa, Jason and Rivera, Clara and Bapna, Ankur},
  journal={arXiv preprint arXiv:2205.12446},
  url = {https://arxiv.org/abs/2205.12446},
  year = {2022},
}

@misc{synthetic_tts_dataset,
  author = {@skypro1111},
  title = {Synthetic TTS Dataset for Training Models},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  url = {https://github.com/skypro1111/pflowtts_pytorch_uk}
}

@inproceedings{mazumder2021multilingual,
  title={Multilingual Spoken Words Corpus},
  author={Mazumder, Mark and Chitlangia, Sharad and Banbury, Colby and Kang, Yiping and Ciro, Juan Manuel and Achorn, Keith and Galvez, Daniel and Sabini, Mark and Mattson, Peter and Kanter, David and others},
  booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)},
  year={2021}
}
