RamananR committed on
Commit fd3f640 · verified · 1 Parent(s): 274c007

Update README.md

Files changed (1)
  1. README.md +45 -64
README.md CHANGED
@@ -13,36 +13,36 @@ tags:
  - tts
  ---

-
-
- Ratan Tata SpeechT5 Voice Cloning Model

  This model is a Text-to-Speech (TTS) system using SpeechT5 architecture, trained on the Ratan Tata TTS Dataset to generate high-quality synthetic speech resembling the voice of Ratan Tata. The dataset and model pay tribute to his legacy, preserving his voice through cutting-edge AI technology.
- Model Information

- Model Architecture: SpeechT5 (Text-to-Speech)
- Training Dataset: Ratan Tata TTS Dataset (English)
- Checkpoints: 60,000 steps
- Framework: PyTorch
- Model Size: Approximately 1.9GB
- License: OpenRAIL

- Dataset Summary

  This model was trained on over 2,800 seconds (~48 minutes) of high-quality speech samples from Ratan Tata, with detailed transcriptions for each audio file. The audio data was pre-processed, converted to a uniform format, and aligned with the corresponding text to ensure optimal training performance.
- Model Performance

- Voice Quality: The model replicates the unique tone, cadence, and voice texture of Ratan Tata with high accuracy, making it suitable for various voice cloning applications.
- Sample Rate: 16 kHz (consistent with the training data)
- Audio Channels: Mono
- Bit Depth: 16-bit
- Precision: High-quality synthesis using SpeechT5

- How to Use the Model

  You can use this model for a variety of TTS and voice synthesis tasks. It is designed to work with any standard TTS pipeline and can be integrated into projects for generating Ratan Tata’s voice in any text-based scenario.

- '''
  from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
  from speechbrain.pretrained import EncoderClassifier
  from IPython.display import Audio
@@ -53,87 +53,68 @@ import os, torchaudio
  import numpy as np
  import torch

-
- processor = SpeechT5Processor.from_pretrained("checkpoint-60000")#Replace with the model folder
  processor.tokenizer.split_special_tokens = True
- model = SpeechT5ForTextToSpeech.from_pretrained("checkpoint-60000")#Replace with the model folder
  vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
  embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
  speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

-
  spk_model_name = "speechbrain/spkrec-xvect-voxceleb"
-
  device = "cuda" if torch.cuda.is_available() else "cpu"
  speaker_model = EncoderClassifier.from_hparams(
      source=spk_model_name,
      run_opts={"device": device},
      savedir=os.path.join("/tmp", spk_model_name),
  )
- signal, fs =torchaudio.load('wavs/converted_ratan_tata_tts_200.wav')#replace a voice of ratan tata
-
- speaker_embeddings = speaker_model.encode_batch(signal) # Directly passing signal as a tensor, no need to wrap in torch.tensor
- speaker_embeddings = torch.nn.functional.normalize(speaker_embeddings, dim=2) # Normalize the embeddings
- speaker_embeddings = speaker_embeddings.squeeze().cpu().numpy() # Squeeze and convert to numpy array
- speaker_embeddings = torch.tensor(np.array([speaker_embeddings])) # Convert back to tensor if necessary
-
-
- input_text=''' This is Generated Audio,
- India, a land of ancient wisdom and boundless potential, stands at the cusp of a new era. Our youth, the vibrant heartbeat of our nation, hold the key to unlocking this potential. They are the digital natives, the innovators, the dreamers who will shape the India of tomorrow.
-
- Knowledge is the most powerful weapon in today's world. It's not just about education, but about the ability to think critically, to adapt, and to innovate. Our youth, with their thirst for knowledge and access to technology, have the potential to become global leaders.
-
- The power of India lies in its diversity. It is our diversity that makes us unique, that fuels our creativity, and that drives our progress. Our youth, with their understanding of different cultures and perspectives, can bridge divides and foster unity.

- Technology is the catalyst for change. It has the power to transform lives, to create opportunities, and to address challenges. Our youth, with their expertise in technology, can develop solutions that benefit society as a whole.
-
- I believe in the potential of India's youth. I believe in their ability to build a nation that is prosperous, inclusive, and sustainable. Let us empower them, support their dreams, and provide them with the resources they need to succeed. Together, we can create an India that is a beacon of hope for the world.
- This is Generated Audio,
- '''


- def split_text_by_length(text, max_length=60):#from the paper speech_t5 max char length 120 char "max_length=60"
-     # Splits the text into chunks of max_length, preserving words
      words = text.split()
      result = []
      current_line = []
-
      for word in words:
-         # Check if adding the next word exceeds the maximum length
          if len(' '.join(current_line + [word])) > max_length:
              result.append(' '.join(current_line))
              current_line = [word]
          else:
              current_line.append(word)
-
-     # Add the last remaining part
      if current_line:
          result.append(' '.join(current_line))
-
      return result


-
- splited_text=split_text_by_length(input_text,max_length=80)
- print(splited_text)
-
  all_speech = []
-
  for i in splited_text:
-
      inputs = processor(text=i, return_tensors="pt")
-     speech_chunk = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
      if isinstance(speech_chunk, torch.Tensor):
          speech_chunk = speech_chunk.cpu().numpy()

-     # Apply noise reduction to each speech chunk
      reduced_noise_chunk = nr.reduce_noise(y=speech_chunk, sr=16000) # assuming 16kHz sample rate
-
      all_speech.append(reduced_noise_chunk)


- concatenated_speech = np.concatenate(all_speech)# Concatenate the noise-reduced speech chunks
-
-
- Audio(concatenated_speech, rate=16000)# Display the final audio with noise reduced
- ''
 
  - tts
  ---

+ # Ratan Tata SpeechT5 Voice Cloning Model

  This model is a Text-to-Speech (TTS) system using SpeechT5 architecture, trained on the Ratan Tata TTS Dataset to generate high-quality synthetic speech resembling the voice of Ratan Tata. The dataset and model pay tribute to his legacy, preserving his voice through cutting-edge AI technology.

+ ## Model Information

+ - **Model Architecture:** SpeechT5 (Text-to-Speech)
+ - **Training Dataset:** Ratan Tata TTS Dataset (English)
+ - **Checkpoints:** 60,000 steps
+ - **Framework:** PyTorch
+ - **Model Size:** Approximately 1.9GB
+ - **License:** OpenRAIL
+
+ ## Dataset Summary

  This model was trained on over 2,800 seconds (~48 minutes) of high-quality speech samples from Ratan Tata, with detailed transcriptions for each audio file. The audio data was pre-processed, converted to a uniform format, and aligned with the corresponding text to ensure optimal training performance.

+ ## Model Performance
+
+ - **Voice Quality:** The model replicates the unique tone, cadence, and voice texture of Ratan Tata with high accuracy, making it suitable for various voice cloning applications.
+ - **Sample Rate:** 16 kHz (consistent with the training data)
+ - **Audio Channels:** Mono
+ - **Bit Depth:** 16-bit
+ - **Precision:** High-quality synthesis using SpeechT5
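
As a rough illustration of the output format listed above (16 kHz, mono, 16-bit), the following sketch writes floating-point samples to a WAV file using only Python's standard library and then reads the header back. The helper names `write_pcm16_mono` and `describe_wav`, the test tone, and the output path are illustrative assumptions; none of them come from the README or the model repository.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 16_000  # Hz, the rate stated in the model card


def write_pcm16_mono(path, samples, sample_rate=SAMPLE_RATE):
    """Write float samples in [-1.0, 1.0] as a mono 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip to the valid range
        frames += struct.pack("<h", int(round(s * 32767)))  # little-endian int16
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)  # mono
        wav.setsampwidth(2)  # 2 bytes per sample = 16 bits
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))


def describe_wav(path):
    """Return (channels, bit_depth, sample_rate, duration_seconds)."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels(),
            wav.getsampwidth() * 8,
            wav.getframerate(),
            wav.getnframes() / wav.getframerate(),
        )


# Hypothetical stand-in for model output: half a second of a 440 Hz tone.
tone = [0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE // 2)]

demo_path = os.path.join(tempfile.gettempdir(), "speecht5_format_demo.wav")
write_pcm16_mono(demo_path, tone)
print(describe_wav(demo_path))  # (1, 16, 16000, 0.5)
```

The same `write_pcm16_mono` call should also accept a one-dimensional NumPy array of floats, since it only iterates over the samples, though that path is not exercised here.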

+ ## How to Use the Model

  You can use this model for a variety of TTS and voice synthesis tasks. It is designed to work with any standard TTS pipeline and can be integrated into projects for generating Ratan Tata’s voice in any text-based scenario.

+ ```python
  from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
  from speechbrain.pretrained import EncoderClassifier
  from IPython.display import Audio
 
  import numpy as np
  import torch

+ # Load the processor and model
+ processor = SpeechT5Processor.from_pretrained("checkpoint-60000") # Replace with the model folder
  processor.tokenizer.split_special_tokens = True
+ model = SpeechT5ForTextToSpeech.from_pretrained("checkpoint-60000") # Replace with the model folder
  vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
+
+ # Load speaker embeddings dataset
  embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
  speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

+ # Load the speaker model
  spk_model_name = "speechbrain/spkrec-xvect-voxceleb"
  device = "cuda" if torch.cuda.is_available() else "cpu"
  speaker_model = EncoderClassifier.from_hparams(
      source=spk_model_name,
      run_opts={"device": device},
      savedir=os.path.join("/tmp", spk_model_name),
  )

+ # Load and process the Ratan Tata voice file
+ signal, fs = torchaudio.load('wavs/converted_ratan_tata_tts_200.wav') # Replace with a Ratan Tata voice file
+ speaker_embeddings = speaker_model.encode_batch(signal)
+ speaker_embeddings = torch.nn.functional.normalize(speaker_embeddings, dim=2).squeeze().cpu().numpy()
+ speaker_embeddings = torch.tensor(np.array([speaker_embeddings]))

+ # Define input text
+ input_text = '''
+ This is Generated Audio.
+ India, a land of ancient wisdom and boundless potential, stands at the cusp of a new era. Our youth, the vibrant heartbeat of our nation, hold the key to unlocking this potential...
+ '''

+ # Split text into chunks based on character length
+ def split_text_by_length(text, max_length=60):
      words = text.split()
      result = []
      current_line = []
      for word in words:
          if len(' '.join(current_line + [word])) > max_length:
              result.append(' '.join(current_line))
              current_line = [word]
          else:
              current_line.append(word)
      if current_line:
          result.append(' '.join(current_line))
      return result

+ splited_text = split_text_by_length(input_text, max_length=80)

+ # Generate speech for each text chunk and apply noise reduction
  all_speech = []
  for i in splited_text:
      inputs = processor(text=i, return_tensors="pt")
+     speech_chunk = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
+
      if isinstance(speech_chunk, torch.Tensor):
          speech_chunk = speech_chunk.cpu().numpy()

      reduced_noise_chunk = nr.reduce_noise(y=speech_chunk, sr=16000) # assuming 16kHz sample rate
      all_speech.append(reduced_noise_chunk)

+ # Concatenate all speech chunks
+ concatenated_speech = np.concatenate(all_speech)

+ # Play the final audio
+ Audio(concatenated_speech, rate=16000)
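
The word-preserving chunking step in the README can be checked in isolation, without loading any model. The sketch below is a lightly modified variant and not the committed code: it adds a guard (`if current and ...`) so that a single word longer than `max_length` does not produce an empty leading chunk, which the README version appears to do in that edge case. The `sample` sentence and the chosen `max_length` are illustrative assumptions.

```python
def split_text_by_length(text, max_length=60):
    """Greedily pack whole words into chunks of at most max_length characters.

    Variant of the README helper: a word longer than max_length becomes
    its own chunk instead of being preceded by an empty string.
    """
    chunks, current = [], []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and len(candidate) > max_length:
            chunks.append(" ".join(current))  # flush the full chunk
            current = [word]                  # start the next one
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))      # flush the remainder
    return chunks


sample = "India stands at the cusp of a new era. Our youth hold the key."
chunks = split_text_by_length(sample, max_length=20)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Joining the chunks with single spaces reproduces the whitespace-normalized input, so no words are dropped or reordered before each chunk is passed to the synthesis loop.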