# speecht5_finetuned
This model is a fine-tuned version of microsoft/speecht5_tts on the LJ Speech dataset. It achieves the following results on the evaluation set:
- Pretrained Model: WER = 0.2130, CER = 0.0802
- Fine-tuned Model: WER = 0.1427, CER = 0.0285
## Model description
The model was obtained by fine-tuning SpeechT5 on the LJ Speech dataset and achieves a lower Word Error Rate (WER) and Character Error Rate (CER) on speech-to-text evaluation than the base SpeechT5 checkpoint.
## Intended uses & limitations
- Designed for English speech-to-text transcription.
- Works well for clean speech but may struggle with noisy environments.
- Requires further fine-tuning for domain-specific vocabularies.
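A minimal transcription sketch is shown below. It assumes the checkpoint is published on the Hub as `farazashraf/speecht5_finetuned_enhanced` and that it loads through the standard SpeechT5 speech-to-text classes in `transformers`; if the weights were exported differently, swap in the matching class.

```python
import soundfile as sf
from transformers import SpeechT5Processor, SpeechT5ForSpeechToText

# Assumption: the checkpoint exposes the standard SpeechT5 speech-to-text head.
model_id = "farazashraf/speecht5_finetuned_enhanced"
processor = SpeechT5Processor.from_pretrained(model_id)
model = SpeechT5ForSpeechToText.from_pretrained(model_id)

# SpeechT5 expects 16 kHz mono audio; LJ Speech ships at 22.05 kHz, so resample first.
speech, sampling_rate = sf.read("clean_english_clip_16khz.wav")  # hypothetical input file
inputs = processor(audio=speech, sampling_rate=16000, return_tensors="pt")

predicted_ids = model.generate(**inputs, max_length=200)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```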
## Training and evaluation data
- Dataset: LJ Speech
- Training samples: 13,100 audio clips (approximately 24 hours of speech)
- Validation: Split from LJ Speech dataset
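For illustration, the split can be reproduced along these lines with the `datasets` library; the `keithito/lj_speech` Hub id and the 90/10 hold-out ratio are assumptions, since the exact validation fraction is not stated above.

```python
from datasets import load_dataset, Audio

# Assumption: the keithito/lj_speech Hub dataset, which ships a single
# "train" split of 13,100 clips, so the validation set is carved out manually.
dataset = load_dataset("keithito/lj_speech", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))  # SpeechT5 uses 16 kHz

splits = dataset.train_test_split(test_size=0.1, seed=42)  # assumed 90/10 split
train_ds, eval_ds = splits["train"], splits["test"]
print(len(train_ds), "training clips,", len(eval_ds), "validation clips")
```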
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 4
- eval_batch_size: 2
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 32
- optimizer: AdamW (betas=(0.9, 0.999), epsilon=1e-08)
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 100
- training_steps: 1500
- mixed_precision_training: Native AMP
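These settings map onto `Seq2SeqTrainingArguments` roughly as follows. The output directory and the 100-step eval/save cadence (matching the results table below) are illustrative assumptions; the remaining values mirror the list above, and the AdamW betas/epsilon and linear schedule are the `transformers` defaults.

```python
from transformers import Seq2SeqTrainingArguments

# AdamW with betas=(0.9, 0.999) and eps=1e-8 plus a linear LR schedule are the
# transformers defaults, so they need no explicit flags here.
training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5_finetuned",   # assumed output directory
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,     # effective batch size 4 x 8 = 32
    warmup_steps=100,
    max_steps=1500,
    seed=42,
    fp16=True,                         # native AMP mixed precision
    eval_strategy="steps",             # assumed: evaluate every 100 steps
    eval_steps=100,
    save_steps=100,
    logging_steps=100,
)
```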
### Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 0.468         | 0.3053 | 100  | 0.4286          |
| 0.449         | 0.6107 | 200  | 0.4043          |
| 0.4338        | 0.9160 | 300  | 0.3968          |
| 0.4229        | 1.2198 | 400  | 0.3898          |
| 0.4208        | 1.5252 | 500  | 0.3879          |
| 0.4216        | 1.8305 | 600  | 0.3859          |
| 0.4149        | 2.1344 | 700  | 0.3821          |
| 0.4127        | 2.4397 | 800  | 0.3809          |
| 0.4104        | 2.7450 | 900  | 0.3787          |
| 0.4006        | 3.0489 | 1000 | 0.3767          |
| 0.4047        | 3.3542 | 1100 | 0.3757          |
| 0.4011        | 3.6595 | 1200 | 0.3741          |
| 0.4006        | 3.9649 | 1300 | 0.3726          |
| 0.3991        | 4.2687 | 1400 | 0.3723          |
| 0.4009        | 4.5740 | 1500 | 0.3722          |
## SpeechT5 Model Comparison

| Model      | WER    | CER    |
|:-----------|:------:|:------:|
| Pretrained | 0.2130 | 0.0802 |
| Fine-tuned | 0.1427 | 0.0285 |
## Example Transcriptions

**Fine-tuned Model - File: LJLJ001-0003.wav**
- Ground Truth: for although the Chinese took impressions from wood blocks engraved in relief for centuries before the woodcutters of the Netherlands, by a similar process
- Transcribed: although the Chinese took impressions from wood blocks engraved in relief for centuries before the wood cutters of the Netherlands by a similar process
- WER: 0.1667, CER: 0.0387
**Fine-tuned Model - File: LJLJ001-0001.wav**
- Ground Truth: printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the exhibition
- Transcribed: printing in the only sense with which we are at present concerned differs from most if not from all the arts and crafts represented in the exhibition
- WER: 0.0741, CER: 0.0132
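For reference, the per-file scores can be reproduced with a standard WER/CER computation. The sketch below applies the `jiwer` package to the second example; the card does not say which scoring library was used, but jiwer's defaults (punctuation kept, repeated whitespace collapsed, spaces counted for CER) yield the reported values for this clip.

```python
import jiwer

reference = ("printing, in the only sense with which we are at present concerned, "
             "differs from most if not from all the arts and crafts represented in the exhibition")
hypothesis = ("printing in the only sense with which we are at present concerned "
              "differs from most if not from all the arts and crafts represented in the exhibition")

# Punctuation is not stripped, so "printing," vs "printing" and "concerned," vs
# "concerned" count as 2 word errors out of 27 words: WER = 2/27 ≈ 0.0741.
print(f"WER: {jiwer.wer(reference, hypothesis):.4f}")
print(f"CER: {jiwer.cer(reference, hypothesis):.4f}")
```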
## Framework versions
- Transformers 4.50.0
- Pytorch 2.6.0+cu124
- Datasets 3.4.1
- Tokenizers 0.21.1