Update README.md
README.md
CHANGED
@@ -13,36 +13,36 @@ tags:
 - tts
 ---
 
-
-
-Ratan Tata SpeechT5 Voice Cloning Model
 
 This model is a Text-to-Speech (TTS) system using SpeechT5 architecture, trained on the Ratan Tata TTS Dataset to generate high-quality synthetic speech resembling the voice of Ratan Tata. The dataset and model pay tribute to his legacy, preserving his voice through cutting-edge AI technology.
-Model Information
 
-
-Training Dataset: Ratan Tata TTS Dataset (English)
-Checkpoints: 60,000 steps
-Framework: PyTorch
-Model Size: Approximately 1.9GB
-License: OpenRAIL
 
-
 
 This model was trained on over 2,800 seconds (~48 minutes) of high-quality speech samples from Ratan Tata, with detailed transcriptions for each audio file. The audio data was pre-processed, converted to a uniform format, and aligned with the corresponding text to ensure optimal training performance.
-Model Performance
 
-
-
-
-
-
 
-How to Use the Model
 
 You can use this model for a variety of TTS and voice synthesis tasks. It is designed to work with any standard TTS pipeline and can be integrated into projects for generating Ratan Tata’s voice in any text-based scenario.
 
-
 from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
 from speechbrain.pretrained import EncoderClassifier
 from IPython.display import Audio
@@ -53,87 +53,68 @@ import os, torchaudio
 import numpy as np
 import torch
 
-
-processor = SpeechT5Processor.from_pretrained("checkpoint-60000")#Replace with the model folder
 processor.tokenizer.split_special_tokens = True
-model = SpeechT5ForTextToSpeech.from_pretrained("checkpoint-60000")#Replace with the model folder
 vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
 embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
 speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)
 
-
 spk_model_name = "speechbrain/spkrec-xvect-voxceleb"
-
 device = "cuda" if torch.cuda.is_available() else "cpu"
 speaker_model = EncoderClassifier.from_hparams(
     source=spk_model_name,
     run_opts={"device": device},
     savedir=os.path.join("/tmp", spk_model_name),
 )
-signal, fs =torchaudio.load('wavs/converted_ratan_tata_tts_200.wav')#replace a voice of ratan tata
-
-speaker_embeddings = speaker_model.encode_batch(signal) # Directly passing signal as a tensor, no need to wrap in torch.tensor
-speaker_embeddings = torch.nn.functional.normalize(speaker_embeddings, dim=2) # Normalize the embeddings
-speaker_embeddings = speaker_embeddings.squeeze().cpu().numpy() # Squeeze and convert to numpy array
-speaker_embeddings = torch.tensor(np.array([speaker_embeddings])) # Convert back to tensor if necessary
-
-
-input_text=''' This is Generated Audio,
-India, a land of ancient wisdom and boundless potential, stands at the cusp of a new era. Our youth, the vibrant heartbeat of our nation, hold the key to unlocking this potential. They are the digital natives, the innovators, the dreamers who will shape the India of tomorrow.
-
-Knowledge is the most powerful weapon in today's world. It's not just about education, but about the ability to think critically, to adapt, and to innovate. Our youth, with their thirst for knowledge and access to technology, have the potential to become global leaders.
-
-The power of India lies in its diversity. It is our diversity that makes us unique, that fuels our creativity, and that drives our progress. Our youth, with their understanding of different cultures and perspectives, can bridge divides and foster unity.
 
-
-
-
-
-
 
 
-
-
     words = text.split()
     result = []
     current_line = []
-
     for word in words:
-        # Check if adding the next word exceeds the maximum length
         if len(' '.join(current_line + [word])) > max_length:
             result.append(' '.join(current_line))
             current_line = [word]
         else:
             current_line.append(word)
-
-    # Add the last remaining part
     if current_line:
         result.append(' '.join(current_line))
-
     return result
 
 
-
-splited_text=split_text_by_length(input_text,max_length=80)
-print(splited_text)
-
 all_speech = []
-
 for i in splited_text:
-
     inputs = processor(text=i, return_tensors="pt")
-    speech_chunk = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
     if isinstance(speech_chunk, torch.Tensor):
         speech_chunk = speech_chunk.cpu().numpy()
 
-    # Apply noise reduction to each speech chunk
     reduced_noise_chunk = nr.reduce_noise(y=speech_chunk, sr=16000) # assuming 16kHz sample rate
-
     all_speech.append(reduced_noise_chunk)
 
 
-
-
-
-Audio(concatenated_speech, rate=16000)# Display the final audio with noise reduced
-''

@@ -13,36 +13,36 @@ tags:
 - tts
 ---
 
+# Ratan Tata SpeechT5 Voice Cloning Model
 
 This model is a Text-to-Speech (TTS) system using SpeechT5 architecture, trained on the Ratan Tata TTS Dataset to generate high-quality synthetic speech resembling the voice of Ratan Tata. The dataset and model pay tribute to his legacy, preserving his voice through cutting-edge AI technology.
 
+## Model Information
 
+- **Model Architecture:** SpeechT5 (Text-to-Speech)
+- **Training Dataset:** Ratan Tata TTS Dataset (English)
+- **Checkpoints:** 60,000 steps
+- **Framework:** PyTorch
+- **Model Size:** Approximately 1.9GB
+- **License:** OpenRAIL
+
+## Dataset Summary
 
 This model was trained on over 2,800 seconds (~48 minutes) of high-quality speech samples from Ratan Tata, with detailed transcriptions for each audio file. The audio data was pre-processed, converted to a uniform format, and aligned with the corresponding text to ensure optimal training performance.
 
+## Model Performance
+
+- **Voice Quality:** The model replicates the unique tone, cadence, and voice texture of Ratan Tata with high accuracy, making it suitable for various voice cloning applications.
+- **Sample Rate:** 16 kHz (consistent with the training data)
+- **Audio Channels:** Mono
+- **Bit Depth:** 16-bit
+- **Precision:** High-quality synthesis using SpeechT5
 
+## How to Use the Model
 
 You can use this model for a variety of TTS and voice synthesis tasks. It is designed to work with any standard TTS pipeline and can be integrated into projects for generating Ratan Tata’s voice in any text-based scenario.
 
+```python
 from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
 from speechbrain.pretrained import EncoderClassifier
 from IPython.display import Audio
@@ -53,87 +53,68 @@ import os, torchaudio
 import numpy as np
 import torch
 
+# Load the processor and model
+processor = SpeechT5Processor.from_pretrained("checkpoint-60000")  # Replace with the model folder
 processor.tokenizer.split_special_tokens = True
+model = SpeechT5ForTextToSpeech.from_pretrained("checkpoint-60000")  # Replace with the model folder
 vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
+
+# Load speaker embeddings dataset
 embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
 speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)
 
+# Load the speaker model
 spk_model_name = "speechbrain/spkrec-xvect-voxceleb"
 device = "cuda" if torch.cuda.is_available() else "cpu"
 speaker_model = EncoderClassifier.from_hparams(
     source=spk_model_name,
     run_opts={"device": device},
     savedir=os.path.join("/tmp", spk_model_name),
 )
 
+# Load and process the Ratan Tata voice file
+signal, fs = torchaudio.load('wavs/converted_ratan_tata_tts_200.wav')  # Replace with a Ratan Tata voice file
+speaker_embeddings = speaker_model.encode_batch(signal)
+speaker_embeddings = torch.nn.functional.normalize(speaker_embeddings, dim=2).squeeze().cpu().numpy()
+speaker_embeddings = torch.tensor(np.array([speaker_embeddings]))
 
+# Define input text
+input_text = '''
+This is Generated Audio.
+India, a land of ancient wisdom and boundless potential, stands at the cusp of a new era. Our youth, the vibrant heartbeat of our nation, hold the key to unlocking this potential...
+'''
 
+# Split text into chunks based on character length
+def split_text_by_length(text, max_length=60):
     words = text.split()
     result = []
     current_line = []
     for word in words:
         if len(' '.join(current_line + [word])) > max_length:
             result.append(' '.join(current_line))
             current_line = [word]
         else:
             current_line.append(word)
     if current_line:
         result.append(' '.join(current_line))
     return result
 
+splited_text = split_text_by_length(input_text, max_length=80)
 
+# Generate speech for each text chunk and apply noise reduction
 all_speech = []
 for i in splited_text:
     inputs = processor(text=i, return_tensors="pt")
+    speech_chunk = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
+
     if isinstance(speech_chunk, torch.Tensor):
         speech_chunk = speech_chunk.cpu().numpy()
 
     reduced_noise_chunk = nr.reduce_noise(y=speech_chunk, sr=16000) # assuming 16kHz sample rate
     all_speech.append(reduced_noise_chunk)
 
+# Concatenate all speech chunks
+concatenated_speech = np.concatenate(all_speech)
 
+# Play the final audio
+Audio(concatenated_speech, rate=16000)
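
The text-chunking step in the README snippet can be tried on its own, without loading any model. The sketch below is a standalone variant of `split_text_by_length`, not the code from this commit: the name `chunk_words` and the sample sentence are illustrative assumptions. It adds one guard that the README version lacks, so that a single word longer than `max_length` does not produce an empty first chunk.

```python
def chunk_words(text, max_length=80):
    """Greedily pack whitespace-separated words into chunks.

    Hypothetical variant of the README's split_text_by_length helper.
    Each chunk is at most max_length characters, except when a single
    word is itself longer than max_length (it is kept whole).
    """
    chunks = []
    current = []
    for word in text.split():
        candidate = " ".join(current + [word])
        # Only flush when there is something to flush; this avoids
        # appending an empty string if the very first word is too long.
        if current and len(candidate) > max_length:
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    # Emit whatever is left over after the last word.
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "India, a land of ancient wisdom and boundless potential, stands at the cusp of a new era."
for chunk in chunk_words(sample, max_length=40):
    print(len(chunk), chunk)
```

Each printed chunk could then be passed to the processor in place of the items of `splited_text`. Shorter chunks mean more calls to `generate_speech`, and they may add audible seams where the pieces are concatenated.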