---
datasets:
- galsenai/anta_women_tts
language:
- wo
base_model:
- coqui/XTTS-v2
tags:
- nlp
- tts
- speech
---

# Wolof Text To Speech

This is a text-to-speech model allowing you to create a synthetic voice speaking `Wolof` from any textual input in the same language. The model is based on [xTTS V2](https://huggingface.co/coqui/XTTS-v2) and has been trained on [Wolof-TTS](https://huggingface.co/datasets/galsenai/anta_women_tts/) data cleaned by the [GalsenAI Lab](https://huggingface.co/galsenai).

## Checkpoint ID

To download the model, you'll need the [gdown](https://github.com/wkentaro/gdown) utility included in the [Git project](https://github.com/Galsenaicommunity/Wolof-TTS) dependencies, as well as the model ID indicated in the [checkpoint-id](checkpoint-id.yml) YAML file (see the `Files and versions` tab above). Then, use the command below to download the model checkpoint, replacing `<CHECKPOINT_ID>` with the ID from `checkpoint-id.yml`:

```sh
gdown <CHECKPOINT_ID>
```

## Usage

### Configurations

Start by cloning the project:

```sh
git clone https://github.com/Galsenaicommunity/Wolof-TTS.git
```

Then, install the dependencies:

```sh
cd Wolof-TTS/notebooks/Models/xTTS\ v2
pip install -r requirements.txt
```

> `IMPORTANT`: You don't need to install the TTS library; [a modified version](https://github.com/anhnh2002/XTTSv2-Finetuning-for-New-Languages/tree/main) is already included in the project's Git repository.

You can now download the model checkpoint with `gdown`, as indicated previously, and unzip it:

```sh
unzip galsenai-xtts-wo-checkpoints.zip && rm galsenai-xtts-wo-checkpoints.zip
```

> Attention: the model is over 7 GB in size.

### Model Loading

```py
import torch, torchaudio, os
import numpy as np
from tqdm import tqdm
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

root_path       = "../../../../galsenai-xtts-wo-checkpoints/"
checkpoint_path = root_path + "Anta_GPT_XTTS_Wo"
model_path      = "best_model_89250.pth"
device          = "cuda:0" if torch.cuda.is_available() else "cpu"

xtts_checkpoint = os.path.join(checkpoint_path, model_path)
xtts_config     = os.path.join(checkpoint_path, "config.json")
xtts_vocab      = root_path + "XTTS_v2.0_original_model_files/vocab.json"

# Load the model
config = XttsConfig()
config.load_json(xtts_config)
XTTS_MODEL = Xtts.init_from_config(config)
XTTS_MODEL.load_checkpoint(config,
                           checkpoint_path = xtts_checkpoint,
                           vocab_path      = xtts_vocab,
                           use_deepspeed   = False)
XTTS_MODEL.to(device)
print("Model loaded successfully!")
```

### Model Inference

xTTS can clone a voice from a sample as short as 6 seconds. An audio sample from the training set is used as the `reference`, and therefore as the output voice of the TTS. You can change it to any voice you wish, as long as you comply with data protection regulations.

> Any use contrary to Senegalese law is strictly forbidden, and GalsenAI accepts no liability in such cases. By using this model, you agree to comply with Senegalese law and not to make any use of it that could cause abuse or harm to anyone.

```py
from IPython.display import Audio

# Sample audio of the voice that will be used by the TTS.
# You can replace it with any audio of at least 6s duration.
reference = root_path + "anta_sample.wav"
Audio(reference, rate=44100)
```
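Before generating, it can be useful to verify that your reference clip actually meets the ~6 s minimum mentioned above. Below is a minimal sketch using `torchaudio` (already imported in the loading step); it assumes the `reference` variable defined in the previous block, and the 6-second threshold is simply the rule of thumb stated above:

```py
import torchaudio  # already imported in the model-loading block

# Sanity-check the reference clip duration (assumes `reference` is defined above)
info = torchaudio.info(reference)
duration_s = info.num_frames / info.sample_rate
print(f"Reference clip: {duration_s:.2f}s at {info.sample_rate} Hz")
if duration_s < 6.0:
    print("Warning: xTTS voice cloning expects a reference of at least ~6 seconds.")
```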
Synthetic voice generation from a `text`:

```py
text = "Màngi tuddu Aadama, di baat bii waa Galsen A.I defar ngir wax ak yéen ci wolof!"

# Extract the speaker conditioning latents from the reference audio
gpt_cond_latent, speaker_embedding = XTTS_MODEL.get_conditioning_latents(
    audio_path      = [reference],
    gpt_cond_len    = XTTS_MODEL.config.gpt_cond_len,
    max_ref_length  = XTTS_MODEL.config.max_ref_len,
    sound_norm_refs = XTTS_MODEL.config.sound_norm_refs)

# Run inference in Wolof
result = XTTS_MODEL.inference(
    text                  = text.lower(),
    gpt_cond_latent       = gpt_cond_latent,
    speaker_embedding     = speaker_embedding,
    do_sample             = False,
    speed                 = 1.06,
    language              = "wo",
    enable_text_splitting = True
)
```

You can then export the output audio:

```py
import soundfile as sf

generated_audio = "generated_audio.wav"
audio_signal = result["wav"]  # synthesized waveform returned by the inference call
sample_rate  = 24000          # XTTS v2 generates audio at 24 kHz
sf.write(generated_audio, audio_signal, sample_rate)
```

A notebook enabling you to quickly test the model is available at [this link](https://colab.research.google.com/drive/1AAhAtWyFjGpLGWrXaeK04BWc1BlIkNBf?usp=sharing).

## LIMITATIONS

The model was trained on the [cleaned Wolof-TTS data](https://huggingface.co/datasets/galsenai/anta_women_tts/), which includes pauses made during recording. This behavior carries over to the final model, and pauses may occur at random during inference. To mitigate this, you can use the `removesilence.py` wrapper included in the repository to remove some of these silences:

```py
from removesilence import detect_silence, remove_silence

# Identify silent segments
lst = detect_silence(generated_audio)
print(lst)

# Remove them
output_audio = "audio_without_silence.wav"
remove_silence(generated_audio, lst, output_audio)
```

As the dataset contains almost no French or English terms, the model will struggle to correctly synthesize [code-mixed](https://en.wikipedia.org/wiki/Code-mixing) text; the same goes for numbers.

## ACKNOWLEDGEMENT

This work was made possible thanks to the computational support of [Caytu Robotics](https://caytu.com/).

GalsenAI disclaims all liability for any use of this voice synthesizer in contravention of the regulations governing the protection of personal data and of any laws in force in Senegal.

__Please mention GalsenAI in all source code, repositories, and communications when using this tool.__

If you have any questions, please contact us at `contact[at]galsen[dot]ai`.

## CREDITS

* The [raw data](https://huggingface.co/datasets/galsenai/wolof_tts) was organized and made available by [Alwaly](https://huggingface.co/Alwaly).
* The [training notebook](https://github.com/Galsenaicommunity/Wolof-TTS/blob/main/notebooks/Models/xTTS%20v2/xTTS_v2_fine_tunnig_on_single_wolof_tts_dataset.ipynb) was set up by [Mouhamed Sarr (Loloskii)](https://github.com/mohaskii).
* The model training on [GCP](https://cloud.google.com/) (`A100 40GB`), the silence-suppression script (based on [this article](https://onkar-patil.medium.com/how-to-remove-silence-from-an-audio-using-python-50fd2c00557d)), and this notebook were carried out by [Derguene](https://huggingface.co/derguene).