---
library_name: transformers
license: apache-2.0
base_model: openai/whisper-small
tags:
- generated_from_trainer
model-index:
- name: whisper-small-indo-eng
  results: []
---


# whisper-small-indo-eng


## Model description

This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) on the [cobrayyxx/FLEURS_INDO-ENG_Speech_Translation](https://huggingface.co/datasets/cobrayyxx/FLEURS_INDO-ENG_Speech_Translation) dataset.
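
As a usage illustration, here is a minimal inference sketch with the `transformers` pipeline (the audio path and generation settings are assumptions; the evaluation reported below uses faster-whisper instead):

```python
from transformers import pipeline

# Load the fine-tuned checkpoint for speech-to-text generation
asr = pipeline("automatic-speech-recognition", model="cobrayyxx/whisper-small-indo-eng")

# Translate Indonesian speech to English text ("audio.wav" is a placeholder path)
result = asr("audio.wav", generate_kwargs={"language": "en", "task": "translate"})
print(result["text"])
```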

## Dataset: FLEURS_INDO-ENG_Speech_Translation

This model was fine-tuned using the `cobrayyxx/FLEURS_INDO-ENG_Speech_Translation` dataset, a speech translation dataset for the **Indonesian → English** direction. The dataset is part of the FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech) collection and is specifically designed for speech-to-text translation tasks.

### Key Features
- **audio**: The audio clip in Indonesian (Bahasa Indonesia).
- **text_indo**: The transcription of the audio in Indonesian.
- **text_en**: The English translation of the transcription.

### Dataset Usage
- **Training Data**: Used to fine-tune the Whisper model for Indonesian → English speech-to-text translation.
- **Validation Data**: Used to evaluate the performance of the model during training.
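
A minimal sketch of loading the dataset with the `datasets` library (the split names are an assumption based on the usage above):

```python
from datasets import load_dataset

# Load the Indonesian -> English speech translation dataset from the Hub
fleurs_dataset = load_dataset("cobrayyxx/FLEURS_INDO-ENG_Speech_Translation")

# Inspect one validation example: raw waveform plus both transcriptions
sample = fleurs_dataset["validation"][0]
print(sample["text_indo"])             # Indonesian transcription
print(sample["text_en"])               # English translation
print(sample["audio"]["array"].shape)  # audio as a NumPy array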

## Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- optimizer: AdamW (`adamw_torch`) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 100
- mixed_precision_training: Native AMP
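
For illustration, these settings roughly correspond to the following `Seq2SeqTrainingArguments` sketch (the `output_dir` and anything not listed above are assumptions):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-indo-eng",  # assumed output directory
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    learning_rate=1e-5,
    seed=42,
    optim="adamw_torch",                  # AdamW with default betas/epsilon
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=100,
    fp16=True,                            # Native AMP mixed precision
)
```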

## Model Evaluation
The baseline and fine-tuned models were evaluated with the BLEU and CHRF metrics on the validation dataset.
The fine-tuned model improves over the baseline on both metrics (+1.79 BLEU, +8.74 CHRF).

| Model            | BLEU Score | CHRF Score |
|------------------|------------|------------|
| Baseline Model   | 33.03      | 52.71      |
| Fine-Tuned Model | **34.82**  | **61.45**  |

### Evaluation Details
- **BLEU**: Measures the overlap between predicted and reference text based on n-grams.
- **CHRF**: Uses character n-grams for evaluation, making it particularly suitable for morphologically rich languages.
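
For reference, a toy invocation (hypothetical sentences) showing how both corpus-level scores are computed with sacrebleu, the library used in the evaluation code below:

```python
from sacrebleu.metrics import BLEU, CHRF

# Two hypotheses and one reference stream parallel to them
hypotheses = ["the weather is nice today", "she reads a book"]
references = [["the weather is good today", "she is reading a book"]]

print(f"BLEU: {BLEU().corpus_score(hypotheses, references).score:.2f}")
print(f"CHRF: {CHRF().corpus_score(hypotheses, references).score:.2f}")
```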

### Reproduction Steps
After [training](https://huggingface.co/blog/fine-tune-whisper) the model and pushing it to the Hugging Face Hub, a few steps are needed before it can be evaluated:
1. Push the tokenizer manually by creating it with `WhisperTokenizerFast` (saving with `legacy_format=False` produces the `tokenizer.json` that the conversion step below copies):
    ```python
    from transformers import WhisperTokenizerFast

    # Create the tokenizer from the base model, configured for translation to English
    tokenizer = WhisperTokenizerFast.from_pretrained("openai/whisper-small", language="en", task="translate")

    # Save the tokenizer locally
    tokenizer.save_pretrained("whisper-small-indo-eng", legacy_format=False)

    # Push the tokenizer to the Hugging Face Hub
    tokenizer.push_to_hub("cobrayyxx/whisper-small-indo-eng")
    ```
2. Convert the Transformers model to a model compatible with CTranslate2 (see the [faster-whisper model conversion docs](https://github.com/SYSTRAN/faster-whisper?tab=readme-ov-file#model-conversion)):
    ```bash
    ct2-transformers-converter --model cobrayyxx/whisper-small-indo-eng --output_dir cobrayyxx/whisper-small-indo-eng-ct2 --copy_files tokenizer.json preprocessor_config.json --quantization float16
    ```
3. Load the converted model, in this case `cobrayyxx/whisper-small-indo-eng-ct2`, with faster-whisper's `WhisperModel` (shown at the top of the snippet below).
4. Run the evaluation, using faster-whisper for inference and sacrebleu for the metrics:
    ```python
    import numpy as np
    from faster_whisper import WhisperModel
    from sacrebleu.metrics import BLEU, CHRF
    from tqdm import tqdm

    # Load the CTranslate2-converted model once, outside the prediction loop
    model = WhisperModel("cobrayyxx/whisper-small-indo-eng-ct2",
                         device="cuda", compute_type="float16")

    def predict(audio_array):
        segments, info = model.transcribe(audio_array,
                                          beam_size=5,
                                          language="en",
                                          vad_filter=True)
        return segments, info

    def metric_calculation(dataset):
        val_data = dataset["validation"]
        bleu = BLEU()
        chrf = CHRF()
        lst_pred = []
        lst_gold = []
        for data in tqdm(val_data):
            gold_standard = data["text_en"].lower().strip()
            audio_array = data["audio"]["array"]
            # Ensure the waveform is a 1-D float32 array
            audio_array = np.ravel(audio_array).astype(np.float32)
            pred_segments, pred_info = predict(audio_array)
            prediction_text = " ".join(segment.text for segment in pred_segments).lower().strip()
            lst_pred.append(prediction_text)
            lst_gold.append(gold_standard)
        # sacrebleu expects a list of reference streams, each parallel to the hypotheses
        bleu_score = bleu.corpus_score(lst_pred, [lst_gold]).score
        chrf_score = chrf.corpus_score(lst_pred, [lst_gold]).score

        return bleu_score, chrf_score
    ```
    Now run the evaluation:
    ```python
    # fleurs_dataset is the dataset loaded earlier with load_dataset
    finetuned_bleu_score, finetuned_chrf_score = metric_calculation(fleurs_dataset)
    print(finetuned_bleu_score, finetuned_chrf_score)
    ```
## Framework versions

- Transformers 4.46.3
- Pytorch 2.5.1+cu121
- Datasets 3.2.0
- Tokenizers 0.20.3

## Reference
- https://huggingface.co/blog/fine-tune-whisper

## Credits
Huge thanks to [Yasmin Moslem](https://huggingface.co/ymoslem) for mentoring me.