---
datasets:
  - galsenai/anta_women_tts
language:
  - wo
base_model:
  - coqui/XTTS-v2
tags:
  - nlp
  - tts
  - speech
---

# Wolof Text To Speech

This is a text-to-speech model that creates a synthetic voice speaking Wolof from any Wolof text input. The model is based on xTTS v2 and was trained on Wolof-TTS data cleaned by the GalsenAI Lab.

## Checkpoint ID

To download the model, you'll need the `gdown` utility (included in the Git project dependencies) and the model ID given in the `checkpoint-id` yaml file (see the Files and versions section above).
Then, use the command below to download the model checkpoint:

```bash
gdown <Checkpoint ID>
```
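
If you prefer to stay in Python (e.g. in a notebook), `gdown` also exposes a Python API; a minimal sketch, where the checkpoint ID is a placeholder for the value from the `checkpoint-id` yaml file:

```python
import gdown

# Placeholder: use the real ID from the checkpoint-id yaml file
checkpoint_id = "<Checkpoint ID>"
gdown.download(id=checkpoint_id, output="galsenai-xtts-wo-checkpoints.zip", quiet=False)
```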

## Usage

### Configurations

Start by cloning the project:

```bash
git clone https://github.com/Galsenaicommunity/Wolof-TTS.git
```

Then, install the dependencies:

```bash
cd Wolof-TTS/notebooks/Models/xTTS\ v2
pip install -r requirements.txt
```

**IMPORTANT:** You don't need to install the TTS library; a modified version is already included in the project's Git repository.

You can now download the model checkpoint with `gdown`, as shown previously, and unzip it:

```bash
unzip galsenai-xtts-wo-checkpoints.zip && rm galsenai-xtts-wo-checkpoints.zip
```

**Attention:** The model checkpoint is over 7 GB in size.
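
If you'd rather do this step from Python, a stdlib-only equivalent of the unzip command above:

```python
import os
import zipfile

# Extract the checkpoint archive, then delete it to save disk space
with zipfile.ZipFile("galsenai-xtts-wo-checkpoints.zip") as archive:
    archive.extractall(".")
os.remove("galsenai-xtts-wo-checkpoints.zip")
```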

### Model Loading

```python
import torch, torchaudio, os
import numpy as np

from tqdm import tqdm
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

root_path       = "../../../../galsenai-xtts-wo-checkpoints/"
checkpoint_path = root_path + "Anta_GPT_XTTS_Wo"
model_path      = "best_model_89250.pth"

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Paths to the fine-tuned checkpoint, its config, and the original xTTS v2 vocabulary
xtts_checkpoint = os.path.join(checkpoint_path, model_path)
xtts_config     = os.path.join(checkpoint_path, "config.json")
xtts_vocab      = root_path + "XTTS_v2.0_original_model_files/vocab.json"

# Load model
config     = XttsConfig()
config.load_json(xtts_config)
XTTS_MODEL = Xtts.init_from_config(config)
XTTS_MODEL.load_checkpoint(config,
                           checkpoint_path = xtts_checkpoint,
                           vocab_path      = xtts_vocab,
                           use_deepspeed   = False)
XTTS_MODEL.to(device)

print("Model loaded successfully!")
```

### Model Inference

xTTS can clone any voice from a sample as short as 6 seconds. An audio sample from the training set is used here as the reference, and therefore as the output voice of the TTS. You can change it to any voice you wish, as long as you comply with data protection regulations.

Any use contrary to Senegalese law is strictly forbidden, and GalsenAI accepts no liability in such cases. By using this model, you agree to comply with Senegalese law and not to make any use of it that could abuse or harm anyone.

```python
from IPython.display import Audio

# Sample audio of the voice that will be used by the TTS.
# You can replace it with any audio at least 6 s long.
reference = root_path + "anta_sample.wav"
Audio(reference, rate=44100)
```

Synthetic voice generation from a text:

text = "Màngi tuddu Aadama, di baat bii waa Galsen A.I defar ngir wax ak yéen ci wolof!"

gpt_cond_latent, speaker_embedding = XTTS_MODEL.get_conditioning_latents(
    audio_path      = [reference],
    gpt_cond_len    = XTTS_MODEL.config.gpt_cond_len,
    max_ref_length  = XTTS_MODEL.config.max_ref_len,
    sound_norm_refs = XTTS_MODEL.config.sound_norm_refs)

result = XTTS_MODEL.inference(
    text              = text.lower(),
    gpt_cond_latent   = gpt_cond_latent,
    speaker_embedding = speaker_embedding,
    do_sample         = False,
    speed             = 1.06,
    language          = "wo",
    enable_text_splitting=True
)

You can then export the output audio:

```python
import soundfile as sf

generated_audio = "generated_audio.wav"
# `result["wav"]` holds the generated waveform; xTTS v2 outputs audio at 24 kHz
sf.write(generated_audio, result["wav"], 24000)
```
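
To listen to the result directly in a notebook, you can reuse `IPython.display.Audio` from earlier on the waveform returned by the inference call:

```python
from IPython.display import Audio

# Play the generated waveform in the notebook (24 kHz output)
Audio(result["wav"], rate=24000)
```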

A notebook is available at this link, enabling you to test the model quickly.

## Limitations

The model was trained on the cleaned Wolof-TTS data, which includes pauses during recording. This behavior carries over to the final model, and pauses may occur randomly during inference. To remedy this, you can use the `removesilence.py` wrapper included in the repository to trim some of these silences and mitigate the problem.

```python
from removesilence import detect_silence, remove_silence

# Identify silent segments in the generated audio
lst = detect_silence(generated_audio)
print(lst)

# Remove them and write the cleaned file
output_audio = "audio_without_silence.wav"
remove_silence(generated_audio, lst, output_audio)
```

As the dataset used contains almost no French or English terms, the model will struggle to correctly synthesize code-mixed text; the same goes for numbers. A small input check, such as the sketch below, can help catch these cases before synthesis.
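
A minimal, hypothetical pre-check you could run before calling the model (the helper name and the example sentence are illustrative; spelling digits out in Wolof is left to the caller):

```python
import re

# Hypothetical helper: flag tokens the model is likely to mispronounce.
# Digits should be spelled out in Wolof before calling XTTS_MODEL.inference.
def check_tts_input(text: str) -> list:
    digits = re.findall(r"\d+", text)
    if digits:
        print(f"Warning: spell these numbers out in Wolof before synthesis: {digits}")
    return digits

check_tts_input("Am na 25 téere.")  # warns about "25"
```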

## Acknowledgement

This work was made possible thanks to the computational support of Caytu Robotics.
GalsenAI disclaims all liability for any use of this voice synthesizer that contravenes the regulations governing the protection of personal data and the laws in force in Senegal. Please credit GalsenAI in all source code, repositories, and communications when using this tool.

If you have any questions, please contact us at contact[at]galsen[dot]ai.

## Credits