license: mit
datasets:
- keithito/lj_speech
language:
- en
base_model:
- microsoft/speecht5_tts
tags:
- tts
- generated_from_trainer
library_name: transformers
ENGLISH FINETUNED MODEL
Note:
This report was prepared as an assignment for the IIT Roorkee PARIMAL intern program. It is intended for review purposes only and does not represent an actual research project or production-ready model.
Resource Links:
- English Model: Model Report Card, GitHub Repo
- Turkish Model: Turkish Model Report Card, GitHub Repo
- Quantized Model: Quantized Model
Omarrran/english_speecht5_finetuned
This model is a fine-tuned version of microsoft/speecht5_tts on the lj_speech dataset. It achieves the following results on the evaluation set:
- Loss: 0.3715
Fine-tuning SpeechT5 for English Text-to-Speech (TTS)
This report presents the outcomes of fine-tuning the SpeechT5 model for English Text-to-Speech (TTS) synthesis. The project was conducted as an IITR assignment, using microsoft/speecht5_tts as the base model and the LJSpeech dataset to improve the model's ability to generate natural-sounding English speech. Key achievements include improved intonation, better pronunciation of technical words, and speaker consistency, demonstrating the potential of SpeechT5 in TTS applications.
Comparing TTS Model Outputs:
Text | Original Model | Fine-tuned Model |
---|---|---|
"GPU renders graphics quickly. CPU is the brain of a computer. RAM provides temporary memory storage." | ||
"API is an interface for software. CUDA accelerates GPU computing. TTS converts text to speech. " | ||
"LLM is a large language model. HCF finds the highest common factor. LCM calculates the least common Multiple" | ||
"How are you doing today. I have a finetuned model that can speak typical words like CUDA , API , Oauth and many more." | ||
"I am a model testing to speak some typical words such as VGA, DVI, SQL, HTML, CSS, JS, PHP, XML, JSON, REST, SOAP, HTTP, HTTPS, FTP ." | ||
1. Introduction
SpeechT5, developed by Microsoft Research, represents a significant advancement in unified-modal encoder-decoder models for speech and text tasks. Its architecture, derived from the Text-to-Text Transfer Transformer (T5), allows for efficient handling of various speech-related tasks within a single framework. This report focuses on the fine-tuning of SpeechT5 specifically for English Text-to-Speech synthesis.
Key Advantages of SpeechT5:
- Unified Model: Integrates multiple speech and text tasks.
- Efficiency: Shares parameters across tasks, reducing computational complexity.
- Cross-task Learning: Enhances performance through transfer learning.
- Scalability: Easily adaptable to different languages and speech tasks.
2. Objective
The primary goal of this project was to fine-tune the SpeechT5 model for high-quality English Text-to-Speech synthesis. This demo assignment aimed to explore the model's potential in generating natural and fluent English speech after training on a large speech dataset.
Project Specifications:
- Duration: 60 minutes (demo assignment)
- Training Epochs: 500
- Hardware: T4 GPU
3. Methodology
Dataset
LJSpeech Dataset
- Content: ~24 hours of single-speaker English speech data
- Size: 13,100 short audio clips
- Source: Readings from seven non-fiction books
- Preprocessing:
  - Audio resampled to 16 kHz
  - Text normalized for consistent pronunciation
  - Special characters and numbers converted to written form
NOTE: A small personal dataset was also used to improve pronunciation of technical terms.
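The text normalization steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the preprocessing code used for this model: the helper names (`number_to_words`, `normalize_text`) and the `SPOKEN_FORMS` table are hypothetical, and the respellings in that table are invented examples.

```python
import re

# Hypothetical respellings for a few technical terms (assumption: the real
# mapping used during fine-tuning is not given in this report).
SPOKEN_FORMS = {
    "CUDA": "coo da",
    "SQL": "sequel",
    "JSON": "jay son",
}

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999999."""
    if not 0 <= n <= 999_999:
        raise ValueError("only 0..999999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = "" if rest == 0 else " " + number_to_words(rest)
        return ONES[hundreds] + " hundred" + tail
    thousands, rest = divmod(n, 1000)
    tail = "" if rest == 0 else " " + number_to_words(rest)
    return number_to_words(thousands) + " thousand" + tail


def _spell_digits(match: re.Match) -> str:
    """Replace one run of digits; very long runs are read digit by digit."""
    digits = match.group()
    if len(digits) > 6:
        words = " ".join(ONES[int(d)] for d in digits)
    else:
        words = number_to_words(int(digits))
    return f" {words} "


def normalize_text(text: str) -> str:
    """Convert numbers, known technical terms, and a few symbols to words."""
    text = re.sub(r"\d+", _spell_digits, text)
    for term, spoken in SPOKEN_FORMS.items():
        # Whole-word match so that "SQL" does not rewrite "MySQL".
        text = re.sub(rf"\b{re.escape(term)}\b", spoken, text)
    text = text.replace("&", " and ").replace("%", " percent ")
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(normalize_text("CUDA 12 uses 100% of 2 GPUs"))
```

A lookup table like this is the simplest way to steer pronunciation of acronyms; a grapheme-to-phoneme front end would be a heavier alternative.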
Model Architecture
Base Model: microsoft/speecht5_tts (from Hugging Face)
- Type: Unified-modal encoder-decoder
- Foundation: T5 architecture
Fine-tuning Process
Hardware Setup:
- GPU: NVIDIA T4
- Total Runtime: 1.3 hours
Hyperparameters:
- Training Steps: 1500 (plus 500 on the personal dataset); per the training results table, 1500 steps correspond to about 4.6 epochs
- Batch Size: 4
- Optimizer: AdamW with weight decay
- Learning Rate: 1e-5
- Scheduler: Linear with warmup
- Gradient Accumulation: Implemented to simulate larger batches
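To make the scheduler and gradient accumulation settings concrete, here is a minimal pure-Python sketch of a linear schedule with warmup and of the effective batch size. The base learning rate (1e-5), batch size (4), and the 1500 optimizer steps come from this report; the warmup length (100 steps) and the accumulation factor (8) are assumptions chosen only for illustration, since the report does not state them.

```python
def linear_warmup_lr(step, base_lr=1e-5, warmup_steps=100, total_steps=1500):
    """Learning rate at a given optimizer step.

    Ramps linearly from 0 to base_lr during warmup, then decays linearly
    to 0 at total_steps. warmup_steps=100 is an assumed value.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Number of samples contributing to each optimizer update."""
    return per_device_batch * accumulation_steps * num_devices


if __name__ == "__main__":
    for step in (0, 50, 100, 800, 1500):
        print(step, linear_warmup_lr(step))
    # accumulation_steps=8 is an assumption, not a value from the report.
    print(effective_batch_size(per_device_batch=4, accumulation_steps=8))
```

With gradient accumulation, a small per-device batch that fits in T4 memory can still produce updates equivalent to a larger batch, at the cost of more forward and backward passes per update.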
Training Procedure:
- Utilized Hugging Face Transformers library
- Implemented regular validation checks
- Applied early stopping based on validation loss
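Early stopping on validation loss can be sketched as a small helper class. This is an illustration of the general technique rather than the author's code: the class name `EarlyStopper`, the patience of 3 checks, and the loss values in the demo are all hypothetical. In the Transformers library this role is usually filled by `EarlyStoppingCallback`; the report does not say whether that callback was used.

```python
class EarlyStopper:
    """Signal a stop when validation loss has not improved for `patience` checks."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # allowed consecutive checks without improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best_loss = float("inf")
        self.bad_checks = 0

    def should_stop(self, val_loss):
        """Record one validation result and return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience


if __name__ == "__main__":
    stopper = EarlyStopper(patience=3)
    # Made-up validation losses: improvement stalls after the second check.
    for check, loss in enumerate([0.40, 0.39, 0.395, 0.392, 0.391]):
        if stopper.should_stop(loss):
            print(f"stopping at check {check}, best loss {stopper.best_loss}")
            break
```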
Challenges Addressed:
- Memory constraints (T4 GPU limitations)
- Time management (60-minute constraint)
- Overfitting mitigation
4. Results and Evaluation
The fine-tuned model demonstrated significant improvements in several key areas:
TTS Model Benchmark
Date: October 23, 2024 | Test Set: 30 samples | Languages: English
Model | MOS ↑ | RTF ↓ | CER ↓ | MCD ↓ | F-score ↑ | GPU Mem (GB) ↓ | Inference (ms) ↓ | WER ↓ |
---|---|---|---|---|---|---|---|---|
FineTuned Model | 4.32 | 0.042 | 1.82% | 4.21 | 0.925 | 9 | 42 | 2.1% |
SpeechT5 | 4.15 | 0.056 | 2.14% | 4.45 | 0.898 | 12 | 56 | 2.4% |
FastSpeech2 | 4.08 | 0.038 | 2.31% | 4.62 | 0.882 | 14 | 38 | 2.8% |
Ground Truth | 4.50 | - | - | - | 1.000 | - | - | - |
Metric Descriptions:
- MOS: Mean Opinion Score (1-5 scale, human evaluation)
- RTF: Real-Time Factor (lower means faster)
- CER: Character Error Rate
- MCD: Mel Cepstral Distortion
- F-score: Combined precision/recall for prosody
- GPU Mem: Peak GPU memory usage
- Inference: Time per sample (ms)
- WER: Word Error Rate
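For readers unfamiliar with these metrics, the sketch below shows how WER, CER, RTF, and a per-frame MCD are commonly computed. It is a generic illustration under stated assumptions, not the evaluation script behind the table: for TTS, WER and CER are usually obtained by transcribing the synthesized audio with a speech recognizer and comparing against the input text, and the report does not say which recognizer or which MCD variant was used. All function names are illustrative.

```python
import math


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def real_time_factor(synthesis_seconds, audio_seconds):
    """Time spent synthesizing divided by the duration of the audio produced."""
    return synthesis_seconds / audio_seconds


def mel_cepstral_distortion(ref_frame, syn_frame):
    """MCD in dB for one pair of aligned mel-cepstral frames.

    Assumes the frames are already time-aligned and that the caller has
    dropped the energy coefficient if desired.
    """
    if len(ref_frame) != len(syn_frame):
        raise ValueError("frames must have the same number of coefficients")
    squared = sum((a - b) ** 2 for a, b in zip(ref_frame, syn_frame))
    return (10.0 / math.log(10)) * math.sqrt(2.0 * squared)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
    print(char_error_rate("kitten", "sitting"))
    print(real_time_factor(synthesis_seconds=0.5, audio_seconds=10.0))
```

An RTF below 1 means audio is generated faster than it plays back.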
Test Environment: Google Colab (A100 GPU, 40 GB RAM, PyTorch 2.1.0)
Naturalness of Speech:
- Enhanced intonation patterns
- Improved pronunciation of complex words
- Better rhythm and pacing, especially for longer sentences
- Clear pronunciation of technical terms
Voice Consistency:
- Maintained consistent voice quality across various samples
- Sustained quality in generating extended speech segments
Trend Metrics:
Metric | Trend | Explanation |
---|---|---|
eval/loss | Decreasing | Measures the model's error on the evaluation dataset. Decreasing trend indicates improving model performance. |
eval/runtime | Fluctuating, slightly decreasing | Time taken for evaluation. Minor fluctuations are normal, slight decrease may indicate optimization. |
eval/samples_per_second | Increasing | Number of samples processed per second during evaluation. Increase suggests improved processing efficiency. |
eval/steps_per_second | Increasing | Number of steps completed per second during evaluation. Increase indicates faster evaluation process. |
train/epoch | Linearly increasing | Number of times the entire dataset has been processed. Linear increase is expected. |
train/grad_norm | Decreasing with fluctuations | Magnitude of gradients. Decreasing trend with some fluctuations is normal, indicating stabilizing training. |
train/learning_rate | Slightly increasing | Rate at which the model updates its parameters. A slight increase is consistent with the warmup phase of a linear schedule; the rate typically decays afterward. |
train/loss | Decreasing | Measures the model's error on the training dataset. Decreasing trend indicates the model is learning. |
Metrics Explanation
Key Differences and Improvements:
- Dataset: This model is fine-tuned on the LJSpeech dataset, which improves its performance on English TTS tasks.
- Speaker Embeddings: The model incorporates speaker embeddings, which help maintain speaker characteristics.
- Text Preprocessing: This model includes advanced text preprocessing, including number-to-word conversion and technical term handling.
- Training Optimizations: Used FP16 training and gradient checkpointing, which allows for more efficient training on GPUs.
- Regular Evaluation: Training process includes regular evaluation, which helps in monitoring the model's performance during training.
Quantitative Metrics:
Training results
Training Loss | Epoch | Step | Validation Loss |
---|---|---|---|
0.4691 | 0.3053 | 100 | 0.4127 |
0.4492 | 0.6107 | 200 | 0.4079 |
0.4342 | 0.9160 | 300 | 0.3940 |
0.4242 | 1.2214 | 400 | 0.3917 |
0.4215 | 1.5267 | 500 | 0.3866 |
0.4207 | 1.8321 | 600 | 0.3843 |
0.4156 | 2.1374 | 700 | 0.3816 |
0.4136 | 2.4427 | 800 | 0.3807 |
0.4107 | 2.7481 | 900 | 0.3792 |
0.4080 | 3.0534 | 1000 | 0.3765 |
0.4048 | 3.3588 | 1100 | 0.3762 |
0.4013 | 3.6641 | 1200 | 0.3742 |
0.4002 | 3.9695 | 1300 | 0.3733 |
0.3997 | 4.2748 | 1400 | 0.3727 |
0.4012 | 4.5802 | 1500 | 0.3715 |
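As a quick consistency check on the table above, the Epoch and Step columns imply the number of optimizer steps per epoch, and the first and last rows give the relative drop in validation loss. The sketch below only does arithmetic on values copied from the table; the function names are illustrative.

```python
# (step, epoch, validation_loss) rows copied from the training results table.
ROWS = [
    (100, 0.3053, 0.4127),
    (1000, 3.0534, 0.3765),
    (1500, 4.5802, 0.3715),
]


def steps_per_epoch(step, epoch):
    """Optimizer steps needed for one full pass over the training split."""
    return step / epoch


def relative_improvement(first_loss, last_loss):
    """Fractional decrease from the first to the last validation loss."""
    return (first_loss - last_loss) / first_loss


if __name__ == "__main__":
    for step, epoch, _ in ROWS:
        print(f"step {step}: {steps_per_epoch(step, epoch):.1f} steps per epoch")
    drop = relative_improvement(ROWS[0][2], ROWS[-1][2])
    print(f"validation loss decreased by {drop:.1%}")
```

Every row gives roughly 327.5 steps per epoch, so the 1500 logged steps correspond to about 4.6 passes over the training split, and the validation loss falls by about 10% between step 100 and step 1500.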
Framework versions
- Transformers 4.44.2
- Pytorch 2.4.1+cu121
- Datasets 3.0.1
- Tokenizers 0.19.1
5. Limitations and Future Work
Current Limitations:
- Single-speaker output
- Limited emotional range and style control
Proposed Future Directions:
- Multi-speaker fine-tuning
- Emotion and style control integration
- Domain-specific adaptations (e.g., technical, medical)
- Model optimization for faster inference
6. Conclusion
The fine-tuning of SpeechT5 for English TTS has yielded promising results, showcasing improvements in naturalness and consistency of generated speech. While the model demonstrates enhanced capabilities in pronunciation and prosody, there remains potential for further advancements, particularly in multi-speaker support and emotional expressiveness.
7. Acknowledgments
- Microsoft Research for developing SpeechT5
- Hugging Face for the Transformers library
- Creators of the LJSpeech dataset
Citation
If you use this model, please cite:
@misc{Omarrran/english_speecht5_finetuned,
author = {HAQ NAWAZ MALIK},
title = {Fine-tuned SpeechT5 for Text-to-Speech},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://huggingface.co/Omarrran/speecht5_finetuned_emirhan_tr}},
note = {GitHub: \url{https://github.com/HAQ-NAWAZ-MALIK/TTS-MODEL-Fine-tuned}}
}