speecht5_finetuned

This model is a fine-tuned version of microsoft/speecht5_tts on the LJ Speech dataset. It achieves the following results on the evaluation set:

  • Pretrained Model: WER = 0.2130, CER = 0.0802
  • Fine-tuned Model: WER = 0.1427, CER = 0.0285
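
WER is the number of word-level substitutions, deletions, and insertions divided by the number of words in the reference; CER is the same ratio computed over characters. Lower is better for both.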

Model description

Fine-tuning on the LJ Speech dataset improves speech-to-text performance, yielding a lower Word Error Rate (WER) and Character Error Rate (CER) than the base SpeechT5 model.
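
For quick experimentation, the checkpoint can be loaded with the Transformers SpeechT5 classes. The sketch below is illustrative only: it assumes the checkpoint is compatible with the speech-to-text classes (SpeechT5Processor / SpeechT5ForSpeechToText), which the card does not confirm, and the repo id is taken from the hosting page.

```python
# Minimal transcription sketch: assumes this checkpoint works with the
# SpeechT5 speech-to-text classes (an assumption, not confirmed by the card).
import torch
import torchaudio
from transformers import SpeechT5Processor, SpeechT5ForSpeechToText

repo_id = "farazashraf/speecht5_finetuned_enhanced"
processor = SpeechT5Processor.from_pretrained(repo_id)
model = SpeechT5ForSpeechToText.from_pretrained(repo_id)

# SpeechT5 expects 16 kHz mono audio.
waveform, sr = torchaudio.load("LJ001-0003.wav")
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

inputs = processor(audio=waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    predicted_ids = model.generate(**inputs, max_length=200)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```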

Intended uses & limitations

  • Designed for English speech-to-text transcription.
  • Performs well on clean speech but may struggle in noisy environments.
  • Requires further fine-tuning for domain-specific vocabularies.

Training and evaluation data

  • Dataset: LJ Speech
  • Training samples: 13,100 audio clips (approximately 24 hours of speech)
  • Validation: a held-out split of the LJ Speech dataset (loaded as sketched below)
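
A rough sketch of this data setup with the datasets library; the hub id keithito/lj_speech, the 90/10 split ratio, and the seed are illustrative assumptions, since the card only states that validation was split from LJ Speech:

```python
from datasets import load_dataset

# LJ Speech: 13,100 clips (~24 hours) of single-speaker English read speech.
# "keithito/lj_speech" is an assumed hub id for the dataset.
dataset = load_dataset("keithito/lj_speech", split="train")

# Assumed 90/10 split; the card does not specify the ratio or seed.
splits = dataset.train_test_split(test_size=0.1, seed=42)
train_ds, eval_ds = splits["train"], splits["test"]
print(len(train_ds), len(eval_ds))  # 11790 1310 under these assumptions
```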

Training procedure

Training hyperparameters

The following hyperparameters were used during training (mirrored in the sketch after this list):

  • learning_rate: 0.0001
  • train_batch_size: 4
  • eval_batch_size: 2
  • seed: 42
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 32
  • optimizer: AdamW (betas=(0.9, 0.999), epsilon=1e-08)
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 100
  • training_steps: 1500
  • mixed_precision_training: Native AMP
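
Expressed as Transformers Seq2SeqTrainingArguments, the configuration above corresponds roughly to the following sketch; output_dir is an illustrative placeholder, and the AdamW betas and epsilon listed above are the library defaults:

```python
from transformers import Seq2SeqTrainingArguments

# Mirrors the hyperparameter list above; output_dir is a placeholder.
training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5_finetuned",
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,  # effective train batch size: 4 * 8 = 32
    warmup_steps=100,
    max_steps=1500,
    lr_scheduler_type="linear",
    seed=42,
    fp16=True,  # native AMP mixed precision
    # AdamW with betas=(0.9, 0.999) and eps=1e-8 is the Trainer default.
)
```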

Training results

| Training Loss | Epoch  | Step | Validation Loss |
|---------------|--------|------|-----------------|
| 0.468         | 0.3053 | 100  | 0.4286          |
| 0.449         | 0.6107 | 200  | 0.4043          |
| 0.4338        | 0.9160 | 300  | 0.3968          |
| 0.4229        | 1.2198 | 400  | 0.3898          |
| 0.4208        | 1.5252 | 500  | 0.3879          |
| 0.4216        | 1.8305 | 600  | 0.3859          |
| 0.4149        | 2.1344 | 700  | 0.3821          |
| 0.4127        | 2.4397 | 800  | 0.3809          |
| 0.4104        | 2.7450 | 900  | 0.3787          |
| 0.4006        | 3.0489 | 1000 | 0.3767          |
| 0.4047        | 3.3542 | 1100 | 0.3757          |
| 0.4011        | 3.6595 | 1200 | 0.3741          |
| 0.4006        | 3.9649 | 1300 | 0.3726          |
| 0.3991        | 4.2687 | 1400 | 0.3723          |
| 0.4009        | 4.5740 | 1500 | 0.3722          |
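
Validation loss fell steadily from 0.4286 at step 100 to 0.3722 at step 1500 (roughly 4.6 epochs), with most of the improvement arriving in the first 500 steps.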

SpeechT5 Model Comparison

| Model      | WER    | CER    |
|------------|--------|--------|
| Pretrained | 0.2130 | 0.0802 |
| Fine-tuned | 0.1427 | 0.0285 |
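
Relative to the pretrained checkpoint, fine-tuning cuts WER by about 33% (0.2130 → 0.1427) and CER by about 64% (0.0802 → 0.0285).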

Example Transcriptions

📢 Fine-tuned Model - File: LJ001-0003.wav
🔹 Ground Truth: for although the Chinese took impressions from wood blocks engraved in relief for centuries before the woodcutters of the Netherlands, by a similar process
🔹 Transcribed: although the Chinese took impressions from wood blocks engraved in relief for centuries before the wood cutters of the Netherlands by a similar process
✅ WER: 0.1667, CER: 0.0387

📢 Fine-tuned Model - File: LJ001-0001.wav
🔹 Ground Truth: printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the exhibition
🔹 Transcribed: printing in the only sense with which we are at present concerned differs from most if not from all the arts and crafts represented in the exhibition
✅ WER: 0.0741, CER: 0.0132
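
Scores like these can be reproduced with the jiwer package, assuming the defaults jiwer applies (whitespace word tokenization, with punctuation and case left intact, which is what the reported numbers suggest):

```python
# pip install jiwer
import jiwer

# First example above (LJ001-0003).
reference = ("for although the Chinese took impressions from wood blocks "
             "engraved in relief for centuries before the woodcutters of "
             "the Netherlands, by a similar process")
hypothesis = ("although the Chinese took impressions from wood blocks "
              "engraved in relief for centuries before the wood cutters of "
              "the Netherlands by a similar process")

print(f"WER: {jiwer.wer(reference, hypothesis):.4f}")  # expected ~0.1667
print(f"CER: {jiwer.cer(reference, hypothesis):.4f}")  # expected ~0.0387
```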

Framework versions

  • Transformers 4.50.0
  • Pytorch 2.6.0+cu124
  • Datasets 3.4.1
  • Tokenizers 0.21.1