---
license: apache-2.0
datasets:
- oovword/speech-translation-uk-en
language:
- uk
- en
metrics:
- bleu
- chrf
base_model:
- openai/whisper-small
pipeline_tag: translation
inference: true
library_name: transformers
tags:
- speech-translation
---

# Model Card

1. [Model Summary](#model-summary)
2. [Use](#use)
3. [Training](#training)
4. [License](#license)
5. [Citations](#citations)

## Model Summary

This model is a fine-tuned version of `openai/whisper-small` for Ukrainian-to-English speech translation. It was fine-tuned as part of the Speech Translation 3-Week Mentorship by Yasmin Moslem.

## Use

### Intended use

The model was trained on Ukrainian speech (source) paired with English text (target) and can be used for speech-to-text translation from Ukrainian into English.

### Generation

The model accepts mono-channel audio with a sampling rate of 16 kHz.

```python
import torch
import torchaudio

from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model = WhisperForConditionalGeneration.from_pretrained('whisper-uk2en-speech-translation')
processor = WhisperProcessor.from_pretrained('whisper-uk2en-speech-translation')

# Audio files in `datasets` format
test_dataset = load_dataset('your-dataset-name-goes-here', split='test')
sample = test_dataset[123]['audio']
inputs = processor(sample['array'].squeeze(), sampling_rate=16000, return_tensors='pt', return_attention_mask=True)
with torch.inference_mode():
    predictions = model.generate(**inputs)
sample['translation'] = processor.batch_decode(predictions, skip_special_tokens=True)[0].strip()

# Standalone audio files
waveform, sample_rate = torchaudio.load('ukrainian_speech.wav')
# `torchaudio.load` returns a (channels, frames) tensor: downmix to mono
waveform = waveform.mean(dim=0)
# Resample to the 16 kHz rate the model expects, if needed
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors='pt', return_attention_mask=True)
with torch.inference_mode():
    predictions = model.generate(**inputs)
print(processor.batch_decode(predictions, skip_special_tokens=True)[0].strip())
```
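
Before passing a standalone file to the model, it can be useful to confirm that it meets the mono, 16 kHz requirement stated above. The following is a minimal sketch using only the Python standard library; the helper name `check_wav_format` is hypothetical and not part of this model or of `transformers`, and the check only covers PCM WAV files that the `wave` module can read.

```python
import wave

EXPECTED_SAMPLE_RATE = 16000  # 16 kHz, as required by the model
EXPECTED_CHANNELS = 1         # mono


def check_wav_format(path):
    """Return (is_ok, channels, sample_rate) for a PCM WAV file.

    Hypothetical helper: `is_ok` is True only for mono, 16 kHz audio.
    """
    with wave.open(path, 'rb') as wav_file:
        channels = wav_file.getnchannels()
        sample_rate = wav_file.getframerate()
    is_ok = channels == EXPECTED_CHANNELS and sample_rate == EXPECTED_SAMPLE_RATE
    return is_ok, channels, sample_rate


# Example usage (the file name is a placeholder):
# is_ok, channels, sample_rate = check_wav_format('ukrainian_speech.wav')
```

If the check fails, the audio should be downmixed and resampled before it is passed to the processor.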

### Attribution & Other Requirements

The following datasets, all licensed under CC-BY-4.0, were used to fine-tune the model:

- [`google/fleurs`](https://huggingface.co/datasets/google/fleurs) (fully authentic)
- [`skypro1111/elevenlabs_dataset`](https://huggingface.co/datasets/skypro1111/elevenlabs_dataset) (fully synthetic)
- [`MLCommons/ml_spoken_words`](https://huggingface.co/datasets/MLCommons/ml_spoken_words) (authentic + synthetic)

The FLEURS dataset contains only authentic human speech and translations.
For the `elevenlabs` dataset, the Ukrainian text was generated by ChatGPT and then voiced by the `elevenlabs` TTS model; the transcripts were machine-translated into English by Azure Translator.
In the ML Spoken Words dataset, the Ukrainian speech and transcripts are authentic human data; the English text was machine-translated from Ukrainian by Azure Translator.

**NOTE:** The English translations were not human-verified or proofread due to time limitations and may therefore contain mistakes and inaccuracies.

Totals per split:

- train: 10390 samples, 10 hours 45 minutes 12 seconds
- dev: 2058 samples, 1 hour 36 minutes 7 seconds
- test: 2828 samples, 3 hours 1 minute 28 seconds
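
As a quick sanity check on the totals above, the average clip length per split can be derived from the sample counts and durations. This is only an illustrative calculation from the figures listed in this card; the variable names are arbitrary.

```python
# (sample count, (hours, minutes, seconds)) per split, as listed above
splits = {
    'train': (10390, (10, 45, 12)),
    'dev': (2058, (1, 36, 7)),
    'test': (2828, (3, 1, 28)),
}


def to_seconds(hours, minutes, seconds):
    """Convert an h/m/s duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


# Average duration of a single sample in each split, in seconds
average_seconds = {
    name: to_seconds(*duration) / count
    for name, (count, duration) in splits.items()
}

for name, average in average_seconds.items():
    print(f'{name}: {average:.2f} s per sample')
```

This gives roughly 3.73 s per sample for train, 2.80 s for dev, and 3.85 s for test, so the model was fine-tuned mostly on short utterances.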

## Training

The model was fine-tuned on a mix of authentic and synthetic speech paired with English text translations, using a T4 GPU in Google Colab and the following training parameters:

- learning_rate: 1e-6
- batch_size: 32
- num_train_epochs: 3 (975 training steps)
- warmup_steps: 0
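
The step count in the list above follows from the training set size and the batch size. The sketch below reproduces it under the assumption of a single device, no gradient accumulation, and a final partial batch that is kept in each epoch; none of these settings is stated explicitly in this card.

```python
import math

train_samples = 10390   # training set size reported in this card
batch_size = 32
num_train_epochs = 3

# Assumption: one optimizer step per batch, last partial batch included
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(f'{steps_per_epoch} steps per epoch, {total_steps} steps in total')
```

With these values, each epoch takes 325 steps, which matches the 975 training steps reported for 3 epochs.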

## License

The fine-tuned model is released under the Apache-2.0 license, the same license as the original `openai/whisper-small` checkpoint.

## Citations

```
@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

@article{fleurs2022arxiv,
  title = {FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech},
  author = {Conneau, Alexis and Ma, Min and Khanuja, Simran and Zhang, Yu and Axelrod, Vera and Dalmia, Siddharth and Riesa, Jason and Rivera, Clara and Bapna, Ankur},
  journal={arXiv preprint arXiv:2205.12446},
  url = {https://arxiv.org/abs/2205.12446},
  year = {2022},
}

@misc{synthetic_tts_dataset,
  author = {@skypro1111},
  title = {Synthetic TTS Dataset for Training Models},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  url = {https://github.com/skypro1111/pflowtts_pytorch_uk}
}

@inproceedings{mazumder2021multilingual,
  title={Multilingual Spoken Words Corpus},
  author={Mazumder, Mark and Chitlangia, Sharad and Banbury, Colby and Kang, Yiping and Ciro, Juan Manuel and Achorn, Keith and Galvez, Daniel and Sabini, Mark and Mattson, Peter and Kanter, David and others},
  booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)},
  year={2021}
}
```