RamananR/Ratan_Tata_SpeechT5_Voice_Cloning_Model

@@ -13,36 +13,36 @@ tags:
   - tts
 ---
-Ratan Tata SpeechT5 Voice Cloning Model
 This model is a Text-to-Speech (TTS) system built on the SpeechT5 architecture and trained on the Ratan Tata TTS Dataset to generate high-quality synthetic speech resembling the voice of Ratan Tata. The dataset and model pay tribute to his legacy, preserving his voice through cutting-edge AI technology.
-Model Information
-    Model Architecture: SpeechT5 (Text-to-Speech)
-    Training Dataset: Ratan Tata TTS Dataset (English)
-    Checkpoints: 60,000 steps
-    Framework: PyTorch
-    Model Size: Approximately 1.9GB
-    License: OpenRAIL
-Dataset Summary
 This model was trained on over 2,800 seconds (~48 minutes) of high-quality speech samples from Ratan Tata, with detailed transcriptions for each audio file. The audio data was pre-processed, converted to a uniform format, and aligned with the corresponding text to ensure optimal training performance.
-Model Performance
-    Voice Quality: The model replicates the unique tone, cadence, and voice texture of Ratan Tata with high accuracy, making it suitable for various voice cloning applications.
-    Sample Rate: 16 kHz (consistent with the training data)
-    Audio Channels: Mono
-    Bit Depth: 16-bit
-    Precision: High-quality synthesis using SpeechT5
-How to Use the Model
 This model can be used for TTS and voice synthesis tasks. It loads through the SpeechT5 classes in Hugging Face transformers, together with the microsoft/speecht5_hifigan vocoder and an x-vector speaker embedding, and can be integrated into projects that generate speech in Ratan Tata’s voice from text.
-'''
 from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
 from speechbrain.pretrained import EncoderClassifier
 from IPython.display import Audio
@@ -53,87 +53,68 @@ import os, torchaudio
 import numpy as np
 import torch
-processor = SpeechT5Processor.from_pretrained("checkpoint-60000")#Replace with the model folder
 processor.tokenizer.split_special_tokens = True
-model = SpeechT5ForTextToSpeech.from_pretrained("checkpoint-60000")#Replace with the model folder
 vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
 embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
 speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)
 spk_model_name = "speechbrain/spkrec-xvect-voxceleb"
 device = "cuda" if torch.cuda.is_available() else "cpu"
 speaker_model = EncoderClassifier.from_hparams(
     source=spk_model_name,
     run_opts={"device": device},
     savedir=os.path.join("/tmp", spk_model_name),
 )
-signal, fs =torchaudio.load('wavs/converted_ratan_tata_tts_200.wav')#replace a voice of ratan tata
-speaker_embeddings = speaker_model.encode_batch(signal)  # Directly passing signal as a tensor, no need to wrap in torch.tensor
-speaker_embeddings = torch.nn.functional.normalize(speaker_embeddings, dim=2)  # Normalize the embeddings
-speaker_embeddings = speaker_embeddings.squeeze().cpu().numpy()  # Squeeze and convert to numpy array
-speaker_embeddings = torch.tensor(np.array([speaker_embeddings]))  # Convert back to tensor if necessary
-input_text=''' This is Generated Audio,
-India, a land of ancient wisdom and boundless potential, stands at the cusp of a new era. Our youth, the vibrant heartbeat of our nation, hold the key to unlocking this potential. They are the digital natives, the innovators, the dreamers who will shape the India of tomorrow.
-Knowledge is the most powerful weapon in today's world. It's not just about education, but about the ability to think critically, to adapt, and to innovate. Our youth, with their thirst for knowledge and access to technology, have the potential to become global leaders.
-The power of India lies in its diversity. It is our diversity that makes us unique, that fuels our creativity, and that drives our progress. Our youth, with their understanding of different cultures and perspectives, can bridge divides and foster unity.
-Technology is the catalyst for change. It has the power to transform lives, to create opportunities, and to address challenges. Our youth, with their expertise in technology, can develop solutions that benefit society as a whole.
-I believe in the potential of India's youth. I believe in their ability to build a nation that is prosperous, inclusive, and sustainable. Let us empower them, support their dreams, and provide them with the resources they need to succeed. Together, we can create an India that is a beacon of hope for the world.
-This is Generated Audio,
- '''
-def split_text_by_length(text, max_length=60):#from the paper speech_t5 max char length 120 char "max_length=60"
-    # Splits the text into chunks of max_length, preserving words
     words = text.split()
     result = []
     current_line = []
     for word in words:
-        # Check if adding the next word exceeds the maximum length
         if current_line and len(' '.join(current_line + [word])) > max_length:
             result.append(' '.join(current_line))
             current_line = [word]
         else:
             current_line.append(word)
-    # Add the last remaining part
     if current_line:
         result.append(' '.join(current_line))
     return result
-splited_text=split_text_by_length(input_text,max_length=80)
-print(splited_text)
 all_speech = []
 for i in splited_text:
     inputs = processor(text=i, return_tensors="pt")
-    speech_chunk = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
     if isinstance(speech_chunk, torch.Tensor):
         speech_chunk = speech_chunk.cpu().numpy()
-    # Apply noise reduction to each speech chunk
     reduced_noise_chunk = nr.reduce_noise(y=speech_chunk, sr=16000)  # assuming 16kHz sample rate
     all_speech.append(reduced_noise_chunk)
-concatenated_speech = np.concatenate(all_speech)# Concatenate the noise-reduced speech chunks
-Audio(concatenated_speech, rate=16000)# Display the final audio with noise reduced
-''

@@ -13,36 +13,36 @@ tags:
   - tts
 ---
+# Ratan Tata SpeechT5 Voice Cloning Model
 This model is a Text-to-Speech (TTS) system built on the SpeechT5 architecture and trained on the Ratan Tata TTS Dataset to generate high-quality synthetic speech resembling the voice of Ratan Tata. The dataset and model pay tribute to his legacy, preserving his voice through cutting-edge AI technology.
+## Model Information
+- **Model Architecture:** SpeechT5 (Text-to-Speech)
+- **Training Dataset:** Ratan Tata TTS Dataset (English)
+- **Checkpoint:** 60,000 training steps
+- **Framework:** PyTorch
+- **Model Size:** Approximately 1.9 GB
+- **License:** OpenRAIL
+## Dataset Summary
 This model was trained on over 2,800 seconds (~48 minutes) of high-quality speech samples from Ratan Tata, with detailed transcriptions for each audio file. The audio data was pre-processed, converted to a uniform format, and aligned with the corresponding text to ensure optimal training performance.
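The "uniform format" claim can be spot-checked before training. The following is a minimal sketch, not the author's preprocessing code: it assumes the clips are PCM WAV files and that the target format is the 16 kHz, mono, 16-bit format listed in this model card. The function name `summarize_wavs` is hypothetical.

```python
import wave

# Assumed target format, taken from the model card: 16 kHz, mono, 16-bit PCM.
TARGET_RATE = 16000
TARGET_CHANNELS = 1
TARGET_SAMPLE_WIDTH = 2  # bytes per sample (16-bit)


def summarize_wavs(paths):
    """Return (total_seconds, mismatched_paths) for a list of PCM WAV files."""
    total_seconds = 0.0
    mismatched = []
    for path in paths:
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
            # Duration of one clip = number of frames / frames per second.
            total_seconds += wav.getnframes() / rate
            uniform = (
                rate == TARGET_RATE
                and wav.getnchannels() == TARGET_CHANNELS
                and wav.getsampwidth() == TARGET_SAMPLE_WIDTH
            )
            if not uniform:
                mismatched.append(path)
    return total_seconds, mismatched
```

Called on the dataset's clip paths (for example, everything under a `wavs/` folder), the first return value can be compared with the stated total of over 2,800 seconds, and the second lists any clips that still need conversion.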
+## Model Performance
+- **Voice Quality:** The model replicates the unique tone, cadence, and voice texture of Ratan Tata with high accuracy, making it suitable for various voice cloning applications.
+- **Sample Rate:** 16 kHz (consistent with the training data)
+- **Audio Channels:** Mono
+- **Bit Depth:** 16-bit
+- **Precision:** High-quality synthesis using SpeechT5
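To keep generated audio in the format listed above, the waveform can be written as 16 kHz, mono, 16-bit PCM. This is a minimal sketch using only the Python standard library; it assumes the waveform is a sequence of floats in the range [-1.0, 1.0], and the helper name `write_wav_16bit_mono` is hypothetical rather than part of this repository.

```python
import math
import struct
import wave

SAMPLE_RATE = 16000  # Hz, as stated in the model card


def write_wav_16bit_mono(path, samples, sample_rate=SAMPLE_RATE):
    """Write float samples in [-1.0, 1.0] as 16-bit mono PCM."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip to avoid integer overflow
        frames += struct.pack("<h", int(round(s * 32767)))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))


if __name__ == "__main__":
    # Hypothetical stand-in for model output: one second of a 440 Hz tone.
    tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
            for n in range(SAMPLE_RATE)]
    write_wav_16bit_mono("example_output.wav", tone)
```

A one-dimensional NumPy array of floats, such as the concatenated speech produced by the usage snippet, can be passed as `samples` directly because the function only iterates over the values.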
+## How to Use the Model
 This model can be used for TTS and voice synthesis tasks. It loads through the SpeechT5 classes in Hugging Face transformers, together with the microsoft/speecht5_hifigan vocoder and an x-vector speaker embedding, and can be integrated into projects that generate speech in Ratan Tata’s voice from text.
+```python
 from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
 from speechbrain.pretrained import EncoderClassifier
 from IPython.display import Audio
@@ -53,87 +53,68 @@ import os, torchaudio
 import numpy as np
 import torch
+# Load the processor and model
+processor = SpeechT5Processor.from_pretrained("checkpoint-60000") # Replace with the model folder
 processor.tokenizer.split_special_tokens = True
+model = SpeechT5ForTextToSpeech.from_pretrained("checkpoint-60000") # Replace with the model folder
 vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
+# Load a default speaker embedding (replaced below by the embedding computed from a Ratan Tata recording)
 embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
 speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)
+# Load the speaker model
 spk_model_name = "speechbrain/spkrec-xvect-voxceleb"
 device = "cuda" if torch.cuda.is_available() else "cpu"
 speaker_model = EncoderClassifier.from_hparams(
     source=spk_model_name,
     run_opts={"device": device},
     savedir=os.path.join("/tmp", spk_model_name),
 )
+# Load a reference recording of Ratan Tata and compute its normalized x-vector speaker embedding
+signal, fs = torchaudio.load('wavs/converted_ratan_tata_tts_200.wav') # Replace with a Ratan Tata voice file
+speaker_embeddings = speaker_model.encode_batch(signal)
+speaker_embeddings = torch.nn.functional.normalize(speaker_embeddings, dim=2).squeeze().cpu().numpy()
+speaker_embeddings = torch.tensor(np.array([speaker_embeddings]))
+# Define input text
+input_text = '''
+This is Generated Audio.
+India, a land of ancient wisdom and boundless potential, stands at the cusp of a new era. Our youth, the vibrant heartbeat of our nation, hold the key to unlocking this potential...
+'''
+# Split text into chunks based on character length
+def split_text_by_length(text, max_length=60):
     words = text.split()
     result = []
     current_line = []
     for word in words:
         if current_line and len(' '.join(current_line + [word])) > max_length:
             result.append(' '.join(current_line))
             current_line = [word]
         else:
             current_line.append(word)
     if current_line:
         result.append(' '.join(current_line))
     return result
+splited_text = split_text_by_length(input_text, max_length=80)
+# Generate speech for each text chunk and apply noise reduction
 all_speech = []
 for i in splited_text:
     inputs = processor(text=i, return_tensors="pt")
+    speech_chunk = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
     if isinstance(speech_chunk, torch.Tensor):
         speech_chunk = speech_chunk.cpu().numpy()
     reduced_noise_chunk = nr.reduce_noise(y=speech_chunk, sr=16000)  # assuming 16kHz sample rate
     all_speech.append(reduced_noise_chunk)
+# Concatenate all speech chunks
+concatenated_speech = np.concatenate(all_speech)
+# Play the final audio
+Audio(concatenated_speech, rate=16000)
+```