---
datasets:
- galsenai/anta_women_tts
language:
- wo
base_model:
- coqui/XTTS-v2
tags:
- nlp
- tts
- speech
---

# Wolof Text To Speech

This is a text-to-speech model allowing you to create a synthetic voice speaking `Wolof` from any textual input in the same language. The model is based on [xTTS V2](https://huggingface.co/coqui/XTTS-v2) and has been trained on [Wolof-TTS](https://huggingface.co/datasets/galsenai/anta_women_tts/) data cleaned by the [GalsenAI Lab](https://huggingface.co/galsenai).

## Checkpoint ID

To download the model, you'll need the [gdown](https://github.com/wkentaro/gdown) utility included in the [Git project](https://github.com/Galsenaicommunity/Wolof-TTS) dependencies, as well as the model ID indicated in the [checkpoint-id](checkpoint-id.yml) YAML file (see the `Files and versions` tab above). Then, use the command below to download the model checkpoint, replacing `<CHECKPOINT_ID>` with the ID from `checkpoint-id.yml`:

```sh
gdown <CHECKPOINT_ID>
```

## Usage

### Configurations

Start by cloning the project:

```sh
git clone https://github.com/Galsenaicommunity/Wolof-TTS.git
```

Then, install the dependencies:

```sh
cd Wolof-TTS/notebooks/Models/xTTS\ v2
pip install -r requirements.txt
```

> `IMPORTANT`: You don't need to install the TTS library; [a modified version](https://github.com/anhnh2002/XTTSv2-Finetuning-for-New-Languages/tree/main) is already included in the project's Git repository.

You can now download the model checkpoint with `gdown`, as indicated previously, and unzip it:

```sh
unzip galsenai-xtts-wo-checkpoints.zip && rm galsenai-xtts-wo-checkpoints.zip
```

> Attention: the model is over 7 GB in size.

### Model Loading

```py
import torch, torchaudio, os
import numpy as np
from tqdm import tqdm
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

root_path       = "../../../../galsenai-xtts-wo-checkpoints/"
checkpoint_path = root_path + "Anta_GPT_XTTS_Wo"
model_path      = "best_model_89250.pth"
device          = "cuda:0" if torch.cuda.is_available() else "cpu"

xtts_checkpoint = os.path.join(checkpoint_path, model_path)
xtts_config     = os.path.join(checkpoint_path, "config.json")
xtts_vocab      = root_path + "XTTS_v2.0_original_model_files/vocab.json"

# Load the model
config = XttsConfig()
config.load_json(xtts_config)
XTTS_MODEL = Xtts.init_from_config(config)
XTTS_MODEL.load_checkpoint(config,
                           checkpoint_path = xtts_checkpoint,
                           vocab_path      = xtts_vocab,
                           use_deepspeed   = False)
XTTS_MODEL.to(device)
print("Model loaded successfully!")
```

### Model Inference

xTTS can clone a voice from a sample as short as 6 seconds. An audio sample from the training set is used as the `reference`, and therefore as the output voice of the TTS. You can change it to any voice you wish, as long as you comply with data protection regulations.

> Any use contrary to Senegalese law is strictly forbidden, and GalsenAI accepts no liability in such cases. By using this model, you agree to comply with Senegalese law and not to make any use of it that could cause abuse or harm to anyone.

```py
from IPython.display import Audio

# Sample audio of the voice that will be used by the TTS.
# You can replace it with any audio of at least 6s duration.
reference = root_path + "anta_sample.wav"
Audio(reference, rate=44100)
```
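Before generating, it can be useful to verify that your reference clip actually meets the ~6 s minimum mentioned above. Below is a minimal sketch using `torchaudio` (already imported in the loading step); it assumes the `reference` variable defined in the previous block, and the 6-second threshold is simply the rule of thumb stated above:

```py
import torchaudio  # already imported in the model-loading block

# Sanity-check the reference clip duration (assumes `reference` is defined above)
info = torchaudio.info(reference)
duration_s = info.num_frames / info.sample_rate
print(f"Reference clip: {duration_s:.2f}s at {info.sample_rate} Hz")
if duration_s < 6.0:
    print("Warning: xTTS voice cloning expects a reference of at least ~6 seconds.")
```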
Synthetic voice generation from a `text`:

```py
text = "Màngi tuddu Aadama, di baat bii waa Galsen A.I defar ngir wax ak yéen ci wolof!"

# Extract the speaker conditioning latents from the reference audio
gpt_cond_latent, speaker_embedding = XTTS_MODEL.get_conditioning_latents(
    audio_path      = [reference],
    gpt_cond_len    = XTTS_MODEL.config.gpt_cond_len,
    max_ref_length  = XTTS_MODEL.config.max_ref_len,
    sound_norm_refs = XTTS_MODEL.config.sound_norm_refs)

# Run inference in Wolof
result = XTTS_MODEL.inference(
    text                  = text.lower(),
    gpt_cond_latent       = gpt_cond_latent,
    speaker_embedding     = speaker_embedding,
    do_sample             = False,
    speed                 = 1.06,
    language              = "wo",
    enable_text_splitting = True
)
```

You can then export the output audio:

```py
import soundfile as sf

generated_audio = "generated_audio.wav"
audio_signal = result["wav"]  # synthesized waveform returned by the inference call
sample_rate  = 24000          # XTTS v2 generates audio at 24 kHz
sf.write(generated_audio, audio_signal, sample_rate)
```

A notebook enabling you to quickly test the model is available at [this link](https://colab.research.google.com/drive/1AAhAtWyFjGpLGWrXaeK04BWc1BlIkNBf?usp=sharing).

## LIMITATIONS

The model was trained on the [cleaned Wolof-TTS data](https://huggingface.co/datasets/galsenai/anta_women_tts/), which includes pauses made during recording. This behavior carries over to the final model, and pauses may occur at random during inference. To mitigate this, you can use the `removesilence.py` wrapper included in the repository to remove some of these silences:

```py
from removesilence import detect_silence, remove_silence

# Identify silent segments
lst = detect_silence(generated_audio)
print(lst)

# Remove them
output_audio = "audio_without_silence.wav"
remove_silence(generated_audio, lst, output_audio)
```

As the dataset contains almost no French or English terms, the model will struggle to correctly synthesize [code-mixed](https://en.wikipedia.org/wiki/Code-mixing) text; the same goes for numbers.

## ACKNOWLEDGEMENT

This work was made possible thanks to the computational support of [Caytu Robotics](https://caytu.com/).

GalsenAI disclaims all liability for any use of this voice synthesizer in contravention of the regulations governing the protection of personal data and of any laws in force in Senegal.

__Please mention GalsenAI in all source code, repositories, and communications when using this tool.__

If you have any questions, please contact us at `contact[at]galsen[dot]ai`.

## CREDITS

* The [raw data](https://huggingface.co/datasets/galsenai/wolof_tts) was organized and made available by [Alwaly](https://huggingface.co/Alwaly).
* The [training notebook](https://github.com/Galsenaicommunity/Wolof-TTS/blob/main/notebooks/Models/xTTS%20v2/xTTS_v2_fine_tunnig_on_single_wolof_tts_dataset.ipynb) was set up by [Mouhamed Sarr (Loloskii)](https://github.com/mohaskii).
* The model training on [GCP](https://cloud.google.com/) (`A100 40GB`), the silence-suppression script (based on [this article](https://onkar-patil.medium.com/how-to-remove-silence-from-an-audio-using-python-50fd2c00557d)), and this notebook were carried out by [Derguene](https://huggingface.co/derguene).