---
license: openrail
datasets:
- RamananR/Ratan_Tata_TTS_Data_English
language:
- en
pipeline_tag: text-to-speech
library_name: "speechbrain, ipython, datasets, noisereduce, soundfile, os, torchaudio, torch, transformers"
tags:
- ratan
- tata
- voice-cloning
- tts
---

# Ratan Tata SpeechT5 Voice Cloning Model

This model is a Text-to-Speech (TTS) system built on the SpeechT5 architecture and trained on the Ratan Tata TTS Dataset to generate high-quality synthetic speech resembling the voice of Ratan Tata. The dataset and model pay tribute to his legacy, preserving his voice through cutting-edge AI technology.

## Model Information

- **Model Architecture:** SpeechT5 (Text-to-Speech)
- **Training Dataset:** Ratan Tata TTS Dataset (English)
- **Checkpoint:** 60,000 training steps
- **Framework:** PyTorch
- **Model Size:** Approximately 1.9 GB
- **License:** OpenRAIL

## Dataset Summary

This model was trained on over 2,800 seconds (~48 minutes) of high-quality speech samples from Ratan Tata, with a detailed transcription for each audio file. The audio was pre-processed, converted to a uniform format, and aligned with the corresponding text to ensure optimal training performance.

## Model Performance

- **Voice Quality:** The model replicates the unique tone, cadence, and voice texture of Ratan Tata with high accuracy, making it suitable for various voice cloning applications.
- **Sample Rate:** 16 kHz (consistent with the training data)
- **Audio Channels:** Mono
- **Bit Depth:** 16-bit
- **Precision:** High-quality synthesis using SpeechT5

## How to Use the Model

You can use this model for a variety of TTS and voice synthesis tasks. It can be integrated into a standard TTS pipeline to generate Ratan Tata’s voice from any input text. The example below loads the checkpoint with the Hugging Face Transformers SpeechT5 classes, derives a speaker embedding from a reference recording, and synthesizes the text chunk by chunk.

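The reference recording used for the speaker embedding should presumably match the format listed under Model Performance (16 kHz, mono, 16-bit). The following is a minimal sketch of a pre-flight check using only Python's standard `wave` module. The helper name `check_wav_format` and the constants are hypothetical and not part of the model's own code, and the check assumes an uncompressed PCM WAV file.

```python
import wave

# Target format taken from the Model Performance section of this card.
EXPECTED_SAMPLE_RATE = 16_000  # Hz
EXPECTED_CHANNELS = 1          # mono
EXPECTED_SAMPLE_WIDTH = 2      # bytes per sample, i.e. 16-bit PCM


def check_wav_format(path):
    """Return (ok, details) for a PCM WAV file.

    Hypothetical helper for illustration: `ok` is True only when the file
    is 16 kHz, mono, 16-bit; `details` holds the values that were read.
    """
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        details = {
            "sample_rate": sample_rate,
            "channels": wav_file.getnchannels(),
            "sample_width": wav_file.getsampwidth(),
            "duration_s": wav_file.getnframes() / sample_rate,
        }
    ok = (
        details["sample_rate"] == EXPECTED_SAMPLE_RATE
        and details["channels"] == EXPECTED_CHANNELS
        and details["sample_width"] == EXPECTED_SAMPLE_WIDTH
    )
    return ok, details
```

For example, `ok, details = check_wav_format("wavs/converted_ratan_tata_tts_200.wav")` reports whether that reference clip already matches the expected format. If it does not, one option is to resample and downmix it first, for instance with `torchaudio.functional.resample`.
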
```python
import os

import noisereduce as nr
import numpy as np
import soundfile as sf
import torch
import torchaudio
from datasets import load_dataset
from IPython.display import Audio
from speechbrain.pretrained import EncoderClassifier
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

# Load the processor, the fine-tuned model, and the HiFi-GAN vocoder
processor = SpeechT5Processor.from_pretrained("checkpoint-60000")  # Replace with the model folder
processor.tokenizer.split_special_tokens = True
model = SpeechT5ForTextToSpeech.from_pretrained("checkpoint-60000")  # Replace with the model folder
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Load a default speaker embedding (overwritten below by the Ratan Tata embedding)
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

# Load the speaker encoder used to compute x-vector embeddings
spk_model_name = "speechbrain/spkrec-xvect-voxceleb"
device = "cuda" if torch.cuda.is_available() else "cpu"
speaker_model = EncoderClassifier.from_hparams(
    source=spk_model_name,
    run_opts={"device": device},
    savedir=os.path.join("/tmp", spk_model_name),
)

# Load a Ratan Tata voice file and compute its normalized speaker embedding
signal, fs = torchaudio.load("wavs/converted_ratan_tata_tts_200.wav")  # Replace with a Ratan Tata voice file
with torch.no_grad():
    speaker_embeddings = speaker_model.encode_batch(signal)
    speaker_embeddings = torch.nn.functional.normalize(speaker_embeddings, dim=2).squeeze().cpu().numpy()
speaker_embeddings = torch.tensor(np.array([speaker_embeddings]))  # shape: (1, embedding_dim)

# Define input text
input_text = '''
This is Generated Audio. India, a land of ancient wisdom and boundless potential, stands at the cusp of a new era. Our youth, the vibrant heartbeat of our nation, hold the key to unlocking this potential...
'''


# Split text into chunks of whole words, each at most max_length characters
def split_text_by_length(text, max_length=60):
    words = text.split()
    result = []
    current_line = []
    for word in words:
        # Start a new chunk when adding the word would exceed the limit;
        # the current_line check avoids emitting an empty chunk.
        if current_line and len(" ".join(current_line + [word])) > max_length:
            result.append(" ".join(current_line))
            current_line = [word]
        else:
            current_line.append(word)
    if current_line:
        result.append(" ".join(current_line))
    return result


text_chunks = split_text_by_length(input_text, max_length=80)

# Generate speech for each text chunk and apply noise reduction
all_speech = []
for chunk in text_chunks:
    inputs = processor(text=chunk, return_tensors="pt")
    speech_chunk = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
    if isinstance(speech_chunk, torch.Tensor):
        speech_chunk = speech_chunk.cpu().numpy()
    reduced_noise_chunk = nr.reduce_noise(y=speech_chunk, sr=16000)  # assuming 16 kHz sample rate
    all_speech.append(reduced_noise_chunk)

# Concatenate all speech chunks
concatenated_speech = np.concatenate(all_speech)

# Optionally save the result to disk (example file name)
# sf.write("output.wav", concatenated_speech, 16000)

# Play the final audio
Audio(concatenated_speech, rate=16000)
```