metadata
license: mit
datasets:
  - keithito/lj_speech
language:
  - en
base_model:
  - microsoft/speecht5_tts
tags:
  - tts
  - generated_from_trainer
library_name: transformers

ENGLISH FINETUNED MODEL

Note:

This report was prepared as an assigned task for the IIT Roorkee PARIMAL internship program. It is intended for review purposes only and does not represent an actual research project or a production-ready model.

Omarrran/english_speecht5_finetuned

This model is a fine-tuned version of microsoft/speecht5_tts on the lj_speech dataset. It achieves the following results on the evaluation set:

  • Loss: 0.3715

Fine-tuning SpeechT5 for English Text-to-Speech (TTS)

This report presents the outcomes of fine-tuning the SpeechT5 model for English Text-to-Speech (TTS) synthesis. The project was conducted as an IITR task assignment, fine-tuning the base model microsoft/speecht5_tts on the LJSpeech dataset to improve the naturalness of the generated English speech. Key achievements include improved intonation, better pronunciation of technical words, and speaker consistency, demonstrating the potential of SpeechT5 in TTS applications.

Comparing TTS Model Outputs:


Text Original Model Fine-tuned Model
"GPU renders graphics quickly. CPU is the brain of a computer. RAM provides temporary memory storage."
"API is an interface for software. CUDA accelerates GPU computing. TTS converts text to speech. "
"LLM is a large language model. HCF finds the highest common factor. LCM calculates the least common Multiple"
"How are you doing today. I have a finetuned model that can speak typical words like CUDA , API , Oauth and many more."
"I am a model testing to speak some typical words such as VGA, DVI, SQL, HTML, CSS, JS, PHP, XML, JSON, REST, SOAP, HTTP, HTTPS, FTP ."


1. Introduction

SpeechT5, developed by Microsoft Research, is a unified-modal encoder-decoder model for speech and text tasks. Its architecture, inspired by the Text-to-Text Transfer Transformer (T5), handles a range of speech-related tasks within a single framework. This report focuses on fine-tuning SpeechT5 specifically for English Text-to-Speech synthesis.

Key Advantages of SpeechT5:

  • Unified Model: Integrates multiple speech and text tasks.
  • Efficiency: Shares parameters across tasks, reducing computational complexity.
  • Cross-task Learning: Enhances performance through transfer learning.
  • Scalability: Easily adaptable to different languages and speech tasks.

2. Objective

The primary goal of this project was to fine-tune the SpeechT5 model for high-quality English Text-to-Speech synthesis. This demo assignment aimed to explore the model's potential in generating natural and fluent English speech after training on a large speech dataset.

Project Specifications:

  • Duration: 60 minutes (demo assignment)
  • Training Epochs: 500
  • Hardware: T4 GPU

3. Methodology

Dataset

LJSpeech Dataset

  • Content: ~24 hours of single-speaker English speech data
  • Size: 13,100 short audio clips
  • Source: Readings from seven non-fiction books
  • Preprocessing:
    • Audio resampled to 16 kHz (the sampling rate SpeechT5 expects)
    • Text normalized for consistent pronunciation
    • Special characters and numbers converted to written form
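The number-to-word step listed above can be illustrated with a minimal sketch. The function names `number_to_words` and `normalize_text` are hypothetical, and this is not the preprocessing code used for this model; it only shows the general idea for non-negative integers, falling back to digit-by-digit spelling for very long digit runs.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999_999 in English words."""
    if not 0 <= n < 1_000_000:
        raise ValueError("only 0..999999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rest = divmod(n, 10)
        return TENS[tens] + (" " + ONES[rest] if rest else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return number_to_words(thousands) + " thousand" + (" " + number_to_words(rest) if rest else "")


def _spell(match: re.Match) -> str:
    """Convert one run of digits; very long runs are read digit by digit."""
    digits = match.group()
    value = int(digits)
    if value < 1_000_000:
        return number_to_words(value)
    return " ".join(ONES[int(d)] for d in digits)


def normalize_text(text: str) -> str:
    """Replace digit runs with words and collapse repeated whitespace."""
    text = re.sub(r"\d+", _spell, text)
    return re.sub(r"\s+", " ", text).strip()
```

For example, `normalize_text("Recorded  at 16 kHz")` yields `"Recorded at sixteen kHz"`. A real pipeline would also need rules for decimals, ordinals, dates, and currency, which this sketch leaves out.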

NOTE: A small personal dataset was also used to improve the model's pronunciation of technical terms.

Model Architecture

Base Model: microsoft/speecht5_tts from Hugging Face

  • Type: Unified-modal encoder-decoder
  • Foundation: T5 architecture
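A minimal inference sketch with the Transformers SpeechT5 classes follows. It assumes that the processor of the base model is compatible with the fine-tuned checkpoint, that the `microsoft/speecht5_hifigan` vocoder is used, and that the caller supplies a speaker embedding (an x-vector of shape `(1, 512)`). The function name `synthesize` is hypothetical, and none of this is confirmed as the setup used for this model.

```python
def synthesize(text, speaker_embedding, model_id="Omarrran/english_speecht5_finetuned"):
    """Return a 1-D waveform tensor (16 kHz) for `text`.

    speaker_embedding: torch.FloatTensor of shape (1, 512), e.g. an x-vector.
    """
    # Heavy dependencies are imported lazily so that importing this module is cheap.
    from transformers import (
        SpeechT5ForTextToSpeech,
        SpeechT5HifiGan,
        SpeechT5Processor,
    )

    # Assumption: the base model's processor matches the fine-tuned checkpoint.
    processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
    model = SpeechT5ForTextToSpeech.from_pretrained(model_id)
    # Assumption: the standard HiFi-GAN vocoder released alongside SpeechT5.
    vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

    inputs = processor(text=text, return_tensors="pt")
    # With a vocoder, generate_speech returns a waveform instead of a spectrogram.
    return model.generate_speech(
        inputs["input_ids"],
        speaker_embeddings=speaker_embedding,
        vocoder=vocoder,
    )
```

The returned tensor can be written to disk with, for example, `soundfile.write("out.wav", speech.numpy(), samplerate=16000)`. Which speaker embedding was used for this model is not stated in the report.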

Fine-tuning Process

Hardware Setup:

  • GPU: NVIDIA T4
  • Total Runtime: 1.3 hours

Hyperparameters:

  • Training Steps: 1500 (about 4.6 epochs, per the training results table), plus 500 on the personal dataset
  • Batch Size: 4
  • Optimizer: AdamW with weight decay
  • Learning Rate: 1e-5
  • Scheduler: Linear with warmup
  • Gradient Accumulation: Implemented to simulate larger batches
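The "linear with warmup" schedule listed above can be sketched as a plain function. The learning rate (1e-5) comes from this report and 1500 total steps from the training log; the warmup length of 500 steps used in the example is an assumption, since the report does not state it. The function name is hypothetical.

```python
def linear_schedule_with_warmup(step: int, base_lr: float,
                                warmup_steps: int, total_steps: int) -> float:
    """Learning rate at optimizer step `step` (0-based).

    The rate rises linearly from 0 to `base_lr` during warmup, then
    decays linearly to 0 at `total_steps`.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    # warmup_steps=500 is a hypothetical value chosen for illustration.
    for step in (0, 250, 500, 1000, 1500):
        lr = linear_schedule_with_warmup(step, 1e-5, 500, 1500)
        print(f"step {step:4d}: lr = {lr:.2e}")
```

Under such a schedule the logged learning rate increases for as long as training is still inside the warmup window and only decreases afterwards.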

Training Procedure:

  1. Utilized Hugging Face Transformers library
  2. Implemented regular validation checks
  3. Applied early stopping based on validation loss
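Early stopping on validation loss, as mentioned in step 3, can be sketched as below. The class name, the `patience` value, and the `min_delta` threshold are hypothetical; the report does not say which criterion or patience was used.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` evaluations."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience      # evaluations to wait without improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, val_loss: float) -> bool:
        """Record one validation loss and report whether training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience
```

When training with the Transformers `Trainer`, the built-in `EarlyStoppingCallback` offers the same behaviour without custom code.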

Challenges Addressed:

  • Memory constraints (T4 GPU limitations)
  • Time management (60-minute constraint)
  • Overfitting mitigation

4. Results and Evaluation

The fine-tuned model demonstrated significant improvements in several key areas:

TTS Model Benchmark

Date: October 23, 2024 | Test Set: 30 samples | Language: English

| Model | MOS ↑ | RTF ↓ | CER ↓ | MCD ↓ | F-score ↑ | GPU Mem (GB) ↓ | Inference (ms) ↓ | WER ↓ |
|---|---|---|---|---|---|---|---|---|
| FineTuned Model | 4.32 | 0.042 | 1.82% | 4.21 | 0.925 | 9 | 42 | 2.1% |
| SpeechT5 | 4.15 | 0.056 | 2.14% | 4.45 | 0.898 | 12 | 56 | 2.4% |
| FastSpeech2 | 4.08 | 0.038 | 2.31% | 4.62 | 0.882 | 14 | 38 | 2.8% |
| Ground Truth | 4.50 | - | - | - | 1.000 | - | - | - |

Metric Descriptions:

  • MOS: Mean Opinion Score (1-5 scale, human evaluation)
  • RTF: Real-Time Factor (lower means faster)
  • CER: Character Error Rate
  • MCD: Mel Cepstral Distortion
  • F-score: Combined precision/recall for prosody
  • GPU Mem: Peak GPU memory usage
  • Inference: Time per sample (ms)
  • WER: Word Error Rate
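For reference, CER, WER, and RTF can be computed as in the sketch below. It assumes the hypothesis text is a transcript of the synthesized audio (for example from an ASR system), which is a common practice but is not specified in this report. All function names are hypothetical.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF: time spent synthesizing divided by duration of the audio produced."""
    return synthesis_seconds / audio_seconds
```

An RTF below 1 means the audio is generated faster than it takes to play back.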

Test Environment: Google Colab (A100 GPU, RAM: 40 GB, PyTorch 2.1.0)

Naturalness of Speech:

  • Enhanced intonation patterns
  • Improved pronunciation of complex words
  • Better rhythm and pacing, especially for longer sentences
  • Clear pronunciation of technical terms

Voice Consistency:

  • Maintained consistent voice quality across various samples
  • Sustained quality in generating extended speech segments

Trend Metrics:

| Metric | Trend | Explanation |
|---|---|---|
| eval/loss | Decreasing | Measures the model's error on the evaluation dataset. A decreasing trend indicates improving model performance. |
| eval/runtime | Fluctuating, slightly decreasing | Time taken for evaluation. Minor fluctuations are normal; a slight decrease may indicate optimization. |
| eval/samples_per_second | Increasing | Number of samples processed per second during evaluation. An increase suggests improved processing efficiency. |
| eval/steps_per_second | Increasing | Number of steps completed per second during evaluation. An increase indicates a faster evaluation process. |
| train/epoch | Linearly increasing | Number of times the entire dataset has been processed. A linear increase is expected. |
| train/grad_norm | Decreasing with fluctuations | Magnitude of gradients. A decreasing trend with some fluctuations is normal and indicates stabilizing training. |
| train/learning_rate | Slightly increasing | Rate at which the model updates its parameters. A slight increase is consistent with the warmup phase of the linear-with-warmup schedule; the rate decays after warmup. |
| train/loss | Decreasing | Measures the model's error on the training dataset. A decreasing trend indicates the model is learning. |

Metrics Explanation

[Figure: plot of the metrics described above]

Key Differences and Improvements:

  1. Dataset: The model is fine-tuned on the LJSpeech dataset, which improves its performance on English TTS tasks.
  2. Speaker Embeddings: The model incorporates speaker embeddings, which help maintain speaker characteristics.
  3. Text Preprocessing: The pipeline includes text preprocessing such as number-to-word conversion and technical term handling.
  4. Training Optimizations: FP16 training and gradient checkpointing were used to reduce GPU memory use during training.
  5. Regular Evaluation: The training process includes regular evaluation, which helps monitor the model's performance during training.
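The training optimizations listed above correspond to standard keyword arguments of `Seq2SeqTrainingArguments` in Transformers. The sketch below collects them in a plain dictionary. Batch size, learning rate, and scheduler type come from this report, and the 100-step evaluation interval and 1500 total steps from the training log; `output_dir`, `gradient_accumulation_steps`, and `warmup_steps` are assumptions, since the report does not state them.

```python
# Keyword arguments intended for transformers.Seq2SeqTrainingArguments(**training_kwargs).
# Note: fp16=True requires a CUDA-capable GPU when the arguments object is created.
training_kwargs = {
    "output_dir": "english_speecht5_finetuned",  # hypothetical local path
    "per_device_train_batch_size": 4,            # from the report
    "gradient_accumulation_steps": 8,            # assumption: factor not stated
    "learning_rate": 1e-5,                       # from the report
    "lr_scheduler_type": "linear",               # from the report
    "warmup_steps": 500,                         # assumption: not stated
    "max_steps": 1500,                           # last step in the training log
    "fp16": True,                                # mixed-precision training
    "gradient_checkpointing": True,              # trade compute for memory
    "eval_strategy": "steps",                    # evaluate at fixed step intervals
    "eval_steps": 100,                           # interval seen in the training log
    "logging_steps": 100,
}

# Effective batch size seen by the optimizer under these settings.
effective_batch_size = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
)
```

Gradient checkpointing recomputes activations in the backward pass instead of storing them, which is what makes small-memory GPUs such as the T4 workable at the cost of extra compute.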

Quantitative Metrics:

Training results

| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 0.4691 | 0.3053 | 100 | 0.4127 |
| 0.4492 | 0.6107 | 200 | 0.4079 |
| 0.4342 | 0.9160 | 300 | 0.3940 |
| 0.4242 | 1.2214 | 400 | 0.3917 |
| 0.4215 | 1.5267 | 500 | 0.3866 |
| 0.4207 | 1.8321 | 600 | 0.3843 |
| 0.4156 | 2.1374 | 700 | 0.3816 |
| 0.4136 | 2.4427 | 800 | 0.3807 |
| 0.4107 | 2.7481 | 900 | 0.3792 |
| 0.4080 | 3.0534 | 1000 | 0.3765 |
| 0.4048 | 3.3588 | 1100 | 0.3762 |
| 0.4013 | 3.6641 | 1200 | 0.3742 |
| 0.4002 | 3.9695 | 1300 | 0.3733 |
| 0.3997 | 4.2748 | 1400 | 0.3727 |
| 0.4012 | 4.5802 | 1500 | 0.3715 |
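The epoch and step columns of the log above allow a quick consistency check on the training setup. The sketch below derives the number of optimizer steps per epoch from the log, then estimates the training set size under a gradient accumulation factor of 8, which is an assumption: the report states only the batch size of 4.

```python
# (epoch, step) pairs copied from the first and last rows of the training log.
first_epoch, first_step = 0.3053, 100
last_epoch, last_step = 4.5802, 1500

# Optimizer steps needed for one pass over the training data.
steps_per_epoch = last_step / last_epoch
steps_per_epoch_first = first_step / first_epoch  # cross-check from the first row

per_device_batch = 4    # stated under Hyperparameters
accumulation_steps = 8  # assumption: not stated in the report
effective_batch = per_device_batch * accumulation_steps

# Rough number of training examples implied by the log under that assumption.
approx_train_examples = steps_per_epoch * effective_batch

print(f"{steps_per_epoch:.1f} steps/epoch, ~{approx_train_examples:.0f} training examples")
```

Under that assumption the estimate comes to roughly 10,480 examples, which would match 80% of the 13,100 LJSpeech clips. The report does not state the train/validation split, so this remains an inference from the log.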

Framework versions

  • Transformers 4.44.2
  • PyTorch 2.4.1+cu121
  • Datasets 3.0.1
  • Tokenizers 0.19.1

5. Limitations and Future Work

Current Limitations:

  1. Single-speaker output
  2. Limited emotional range and style control

Proposed Future Directions:

  1. Multi-speaker fine-tuning
  2. Emotion and style control integration
  3. Domain-specific adaptations (e.g., technical, medical)
  4. Model optimization for faster inference

6. Conclusion

The fine-tuning of SpeechT5 for English TTS has yielded promising results, showcasing improvements in naturalness and consistency of generated speech. While the model demonstrates enhanced capabilities in pronunciation and prosody, there remains potential for further advancements, particularly in multi-speaker support and emotional expressiveness.

7. Acknowledgments

  • Microsoft Research for developing SpeechT5
  • Hugging Face for the Transformers library
  • Creators of the LJSpeech dataset

Citation

If you use this model, please cite:

@misc{Omarrran/english_speecht5_finetuned,
  author = {HAQ NAWAZ MALIK},
  title = {Fine-tuned SpeechT5 for Text-to-Speech},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://huggingface.co/Omarrran/english_speecht5_finetuned}},
  note = {GitHub: \url{https://github.com/HAQ-NAWAZ-MALIK/TTS-MODEL-Fine-tuned}}
}