speecht5_finetuned

This model is a fine-tuned version of microsoft/speecht5_tts on the LJ Speech dataset. It achieves the following results on the evaluation set:

  • Pretrained Model: WER = 0.2130, CER = 0.0802
  • Fine-tuned Model: WER = 0.1427, CER = 0.0285
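
WER is the number of word-level substitutions, deletions, and insertions divided by the number of words in the reference; CER is the same ratio computed over characters. Lower is better for both.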

Model description

Fine-tuning on the LJ Speech dataset improves speech-to-text performance, yielding a lower Word Error Rate (WER) and Character Error Rate (CER) than the base SpeechT5 model.
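
For quick experimentation, the checkpoint can be loaded with the Transformers SpeechT5 classes. The sketch below is illustrative only: it assumes the checkpoint is compatible with the speech-to-text classes (SpeechT5Processor / SpeechT5ForSpeechToText), which the card does not confirm, and the repo id is taken from the hosting page.

```python
# Minimal transcription sketch: assumes this checkpoint works with the
# SpeechT5 speech-to-text classes (an assumption, not confirmed by the card).
import torch
import torchaudio
from transformers import SpeechT5Processor, SpeechT5ForSpeechToText

repo_id = "farazashraf/speecht5_finetuned_enhanced"
processor = SpeechT5Processor.from_pretrained(repo_id)
model = SpeechT5ForSpeechToText.from_pretrained(repo_id)

# SpeechT5 expects 16 kHz mono audio.
waveform, sr = torchaudio.load("LJ001-0003.wav")
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

inputs = processor(audio=waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    predicted_ids = model.generate(**inputs, max_length=200)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```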

Intended uses & limitations

  • Designed for English speech-to-text transcription.
  • Performs well on clean speech but may struggle in noisy environments.
  • Requires further fine-tuning for domain-specific vocabularies.

Training and evaluation data

  • Dataset: LJ Speech
  • Training samples: 13,100 audio clips (approximately 24 hours of speech)
  • Validation: a held-out split of the LJ Speech dataset (loaded as sketched below)
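
A rough sketch of this data setup with the datasets library; the hub id keithito/lj_speech, the 90/10 split ratio, and the seed are illustrative assumptions, since the card only states that validation was split from LJ Speech:

```python
from datasets import load_dataset

# LJ Speech: 13,100 clips (~24 hours) of single-speaker English read speech.
# "keithito/lj_speech" is an assumed hub id for the dataset.
dataset = load_dataset("keithito/lj_speech", split="train")

# Assumed 90/10 split; the card does not specify the ratio or seed.
splits = dataset.train_test_split(test_size=0.1, seed=42)
train_ds, eval_ds = splits["train"], splits["test"]
print(len(train_ds), len(eval_ds))  # 11790 1310 under these assumptions
```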

Training procedure

Training hyperparameters

The following hyperparameters were used during training (mirrored in the sketch after this list):

  • learning_rate: 0.0001
  • train_batch_size: 4
  • eval_batch_size: 2
  • seed: 42
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 32
  • optimizer: AdamW (betas=(0.9, 0.999), epsilon=1e-08)
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 100
  • training_steps: 1500
  • mixed_precision_training: Native AMP
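
Expressed as Transformers Seq2SeqTrainingArguments, the configuration above corresponds roughly to the following sketch; output_dir is an illustrative placeholder, and the AdamW betas and epsilon listed above are the library defaults:

```python
from transformers import Seq2SeqTrainingArguments

# Mirrors the hyperparameter list above; output_dir is a placeholder.
training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5_finetuned",
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,  # effective train batch size: 4 * 8 = 32
    warmup_steps=100,
    max_steps=1500,
    lr_scheduler_type="linear",
    seed=42,
    fp16=True,  # native AMP mixed precision
    # AdamW with betas=(0.9, 0.999) and eps=1e-8 is the Trainer default.
)
```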

Training results

| Training Loss | Epoch  | Step | Validation Loss |
|---------------|--------|------|-----------------|
| 0.468         | 0.3053 | 100  | 0.4286          |
| 0.449         | 0.6107 | 200  | 0.4043          |
| 0.4338        | 0.9160 | 300  | 0.3968          |
| 0.4229        | 1.2198 | 400  | 0.3898          |
| 0.4208        | 1.5252 | 500  | 0.3879          |
| 0.4216        | 1.8305 | 600  | 0.3859          |
| 0.4149        | 2.1344 | 700  | 0.3821          |
| 0.4127        | 2.4397 | 800  | 0.3809          |
| 0.4104        | 2.7450 | 900  | 0.3787          |
| 0.4006        | 3.0489 | 1000 | 0.3767          |
| 0.4047        | 3.3542 | 1100 | 0.3757          |
| 0.4011        | 3.6595 | 1200 | 0.3741          |
| 0.4006        | 3.9649 | 1300 | 0.3726          |
| 0.3991        | 4.2687 | 1400 | 0.3723          |
| 0.4009        | 4.5740 | 1500 | 0.3722          |
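
Validation loss fell steadily from 0.4286 at step 100 to 0.3722 at step 1500 (roughly 4.6 epochs), with most of the improvement arriving in the first 500 steps.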

SpeechT5 Model Comparison

| Model      | WER    | CER    |
|------------|--------|--------|
| Pretrained | 0.2130 | 0.0802 |
| Fine-tuned | 0.1427 | 0.0285 |
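
Relative to the pretrained checkpoint, fine-tuning cuts WER by about 33% (0.2130 → 0.1427) and CER by about 64% (0.0802 → 0.0285).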

Example Transcriptions

📢 Fine-tuned Model - File: LJ001-0003.wav
🔹 Ground Truth: for although the Chinese took impressions from wood blocks engraved in relief for centuries before the woodcutters of the Netherlands, by a similar process
🔹 Transcribed: although the Chinese took impressions from wood blocks engraved in relief for centuries before the wood cutters of the Netherlands by a similar process
✅ WER: 0.1667, CER: 0.0387

📢 Fine-tuned Model - File: LJ001-0001.wav
🔹 Ground Truth: printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the exhibition
🔹 Transcribed: printing in the only sense with which we are at present concerned differs from most if not from all the arts and crafts represented in the exhibition
✅ WER: 0.0741, CER: 0.0132
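
Scores like these can be reproduced with the jiwer package, assuming the defaults jiwer applies (whitespace word tokenization, with punctuation and case left intact, which is what the reported numbers suggest):

```python
# pip install jiwer
import jiwer

# First example above (LJ001-0003).
reference = ("for although the Chinese took impressions from wood blocks "
             "engraved in relief for centuries before the woodcutters of "
             "the Netherlands, by a similar process")
hypothesis = ("although the Chinese took impressions from wood blocks "
              "engraved in relief for centuries before the wood cutters of "
              "the Netherlands by a similar process")

print(f"WER: {jiwer.wer(reference, hypothesis):.4f}")  # expected ~0.1667
print(f"CER: {jiwer.cer(reference, hypothesis):.4f}")  # expected ~0.0387
```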

Framework versions

  • Transformers 4.50.0
  • Pytorch 2.6.0+cu124
  • Datasets 3.4.1
  • Tokenizers 0.21.1