---
datasets:
- galsenai/anta_women_tts
language:
- wo
base_model:
- coqui/XTTS-v2
tags:
- nlp
- tts
- speech
---
# Wolof Text To Speech
This is a text-to-speech model that lets you create a synthetic voice speaking Wolof from any textual input in that language. The model is based on XTTS v2 and was trained on Wolof-TTS data cleaned by the GalsenAI Lab.
## Checkpoint ID
To download the model, you'll need the `gdown` utility (included in the Git project's dependencies) and the model ID indicated in the `checkpoint-id` YAML file (cf. the *Files and versions* section above).
Then, use the command below to download the model checkpoint:

```bash
gdown <Checkpoint ID>
```
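If you prefer to script the download, `gdown` also exposes a Python API. The sketch below assumes the checkpoint-id file is named `checkpoint-id.yaml` and holds a single `id` key; adapt both to the actual file in the repository:

```python
# Minimal sketch: download the checkpoint from Python rather than the CLI.
# The file name "checkpoint-id.yaml" and the "id" key are assumptions here;
# adjust them to match the actual checkpoint-id file.
import gdown
import yaml

with open("checkpoint-id.yaml") as f:
    checkpoint_id = yaml.safe_load(f)["id"]

gdown.download(id=checkpoint_id, output="galsenai-xtts-wo-checkpoints.zip")
```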
## Usage

### Configurations
Start by cloning the project:

```bash
git clone https://github.com/Galsenaicommunity/Wolof-TTS.git
```
Then, install the dependencies:

```bash
cd Wolof-TTS/notebooks/Models/xTTS\ v2
pip install -r requirements.txt
```
**IMPORTANT**: You don't need to install the TTS library; a modified version is already included in the project's Git repository.
You can now download the model checkpoint with `gdown` as indicated previously and unzip it:

```bash
unzip galsenai-xtts-wo-checkpoints.zip && rm galsenai-xtts-wo-checkpoints.zip
```
**Attention**: the model is over 7 GB in size.
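Before loading the model, you can optionally verify that everything was extracted where the loading code expects it. A minimal sketch, with the paths taken from the *Model Loading* snippet below:

```python
# Optional sanity check: confirm the checkpoint layout expected by the
# loading code in the "Model Loading" section below.
import os

root_path = "galsenai-xtts-wo-checkpoints/"
expected = [
    "Anta_GPT_XTTS_Wo/best_model_89250.pth",
    "Anta_GPT_XTTS_Wo/config.json",
    "XTTS_v2.0_original_model_files/vocab.json",
]
for rel in expected:
    path = os.path.join(root_path, rel)
    print(("OK     " if os.path.exists(path) else "MISSING"), path)
```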
### Model Loading
```python
import os

import torch
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Paths to the downloaded checkpoint files
root_path = "../../../../galsenai-xtts-wo-checkpoints/"
checkpoint_path = os.path.join(root_path, "Anta_GPT_XTTS_Wo")
model_path = "best_model_89250.pth"
device = "cuda:0" if torch.cuda.is_available() else "cpu"

xtts_checkpoint = os.path.join(checkpoint_path, model_path)
xtts_config = os.path.join(checkpoint_path, "config.json")
xtts_vocab = os.path.join(root_path, "XTTS_v2.0_original_model_files", "vocab.json")

# Load the model
config = XttsConfig()
config.load_json(xtts_config)
XTTS_MODEL = Xtts.init_from_config(config)
XTTS_MODEL.load_checkpoint(
    config,
    checkpoint_path=xtts_checkpoint,
    vocab_path=xtts_vocab,
    use_deepspeed=False,
)
XTTS_MODEL.to(device)
print("Model loaded successfully!")
```
### Model Inference
XTTS can clone any voice from a sample of just 6 seconds. An audio sample from the training set is used as the reference, and therefore as the output voice of the TTS. You can change it to any voice you wish, as long as you comply with data protection regulations. Any use contrary to Senegalese law is strictly forbidden, and GalsenAI accepts no liability in such cases. By using this model, you agree to comply with Senegalese law and not to make any use of it that could cause abuse of, or damage to, anyone.
```python
from IPython.display import Audio

# Sample audio of the voice that will be used by the TTS.
# You can replace it with any audio of at least 6 s duration.
reference = root_path + "anta_sample.wav"
Audio(reference, rate=44100)
```
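If you swap in your own reference clip, it can be worth checking first that it is long enough for cloning. A minimal sketch using `soundfile` (not part of the original workflow):

```python
# Optional: check that the reference clip meets the ~6 s cloning requirement.
import soundfile as sf

info = sf.info(reference)
duration = info.frames / info.samplerate
print(f"Reference duration: {duration:.1f} s")
if duration < 6:
    print("Warning: XTTS voice cloning expects at least ~6 s of audio.")
```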
Synthetic voice generation from a text:
```python
text = "Màngi tuddu Aadama, di baat bii waa Galsen A.I defar ngir wax ak yéen ci wolof!"

# Build the speaker conditioning from the reference audio
gpt_cond_latent, speaker_embedding = XTTS_MODEL.get_conditioning_latents(
    audio_path=[reference],
    gpt_cond_len=XTTS_MODEL.config.gpt_cond_len,
    max_ref_length=XTTS_MODEL.config.max_ref_len,
    sound_norm_refs=XTTS_MODEL.config.sound_norm_refs,
)

# Generate the speech
result = XTTS_MODEL.inference(
    text=text.lower(),
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    do_sample=False,
    speed=1.06,
    language="wo",
    enable_text_splitting=True,
)
```
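You can listen to the result directly in the notebook before exporting it; XTTS v2 generates audio at 24 kHz:

```python
# Play the generated waveform in the notebook (XTTS v2 outputs 24 kHz audio).
from IPython.display import Audio

Audio(result["wav"], rate=24000)
```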
You can then export the output audio:
```python
import soundfile as sf

sample_rate = 24000  # XTTS v2 generates 24 kHz audio
generated_audio = "generated_audio.wav"
sf.write(generated_audio, result["wav"], sample_rate)
```
A notebook is available at this link, enabling you to test the model quickly.
## LIMITATIONS
The model was trained on the cleaned Wolof-TTS data, which includes pauses made during recording. This behavior is reflected in the final model, and pauses may occur randomly during inference. To remedy this, you can use the `removesilence.py` wrapper included in the repository to remove certain silences and mitigate the problem.
```python
from removesilence import detect_silence, remove_silence

# Silence identification
lst = detect_silence(generated_audio)
print(lst)

# Silence removal
output_audio = "audio_without_silence.wav"
remove_silence(generated_audio, lst, output_audio)
```
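You can then compare both versions in the notebook:

```python
# Listen to the output before and after silence removal.
from IPython.display import Audio, display

display(Audio(generated_audio))
display(Audio(output_audio))
```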
As the dataset used contains almost no French or English terms, the model will struggle to synthesize code-mixed text correctly; the same goes for numbers.
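A simple guard is to scan the input for digits before synthesis so they can be spelled out in Wolof by hand. A purely illustrative sketch (the example sentence is hypothetical):

```python
# Illustrative pre-flight check: flag digits the model is unlikely to
# pronounce correctly so they can be spelled out in Wolof beforehand.
import re

def flag_digits(text: str) -> list[str]:
    return re.findall(r"\d+", text)

risky = flag_digits("Xale bi am na 10 at.")  # hypothetical input sentence
if risky:
    print(f"Consider spelling out in Wolof: {risky}")
```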
## ACKNOWLEDGEMENT
This work was made possible thanks to the computational support of Caytu Robotics.
GalsenAI disclaims all liability for any use of this voice synthesizer in contravention of the regulations governing the protection of personal data and all laws in force in Senegal.
Please mention GalsenAI in all source code, repositories, and communications when using this tool.
If you have any questions, please contact us at contact[at]galsen[dot]ai.
## CREDITS
- The raw data has been organised and made available by Alwaly.
- The training notebook was set up by Mouhamed Sarr (Loloskii).
- The model training on GCP (A100 40GB), the implementation of the silence-removal script (based on this article), and this notebook were carried out by Derguene.