---
language: sw
license: apache-2.0
tags:
- tensorflowtts
- audio
- text-to-speech
- mel-to-wav
inference: false
datasets:
- bookbot/sw-TZ-Victoria
- bookbot/sw-TZ-Victoria-syllables-word
- bookbot/sw-TZ-Victoria-v2
- bookbot/sw-TZ-VictoriaNeural-upsampled-48kHz
---
# MB-MelGAN HiFi PostNets SW v4
MB-MelGAN HiFi PostNets SW v4 is a mel-to-wav model based on the [MB-MelGAN](https://arxiv.org/abs/2005.05106) architecture with a [HiFi-GAN](https://arxiv.org/abs/2010.05646) discriminator. This model was trained from scratch on real and synthetic audio datasets. Instead of training on ground-truth waveform spectrograms, it was trained on the generated PostNet spectrograms of [LightSpeech MFA SW v4](https://huggingface.co/bookbot/lightspeech-mfa-sw-v4). The list of speakers includes:
- sw-TZ-Victoria
- sw-TZ-Victoria-syllables-word
- sw-TZ-Victoria-v2
- sw-TZ-VictoriaNeural-upsampled-48kHz
This model was trained using the [TensorFlowTTS](https://github.com/TensorSpeech/TensorFlowTTS) framework. All training was done on an RTX 4090 GPU. All scripts needed for training can be found in this [GitHub fork](https://github.com/bookbot-hive/TensorFlowTTS), and the [training metrics](https://huggingface.co/bookbot/mb-melgan-hifi-postnets-sw-v4/tensorboard) were logged via TensorBoard.
## Model
| Model | Config | SR (Hz) | Mel range (Hz) | FFT / Hop / Win (pt) | #steps |
| ------------------------------- | ----------------------------------------------------------------------------------------- | ------- | -------------- | -------------------- | ------ |
| `mb-melgan-hifi-postnets-sw-v4` | [Link](https://huggingface.co/bookbot/mb-melgan-hifi-postnets-sw-v4/blob/main/config.yml) | 44100 | 20-11025 | 2048 / 512 / None | 1M |
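
The sampling rate and hop size in the table fix the mel frame timing. As a quick sketch (numbers taken from the config above), the implied frame step and frame rate are:

```python
# Timing implied by the config: sr and hop come from the table above.
sr = 44100   # sampling rate (Hz)
hop = 512    # hop size (samples)

hop_ms = hop / sr * 1000    # duration of one mel frame step, ~11.61 ms
frames_per_sec = sr / hop   # mel frames per second of audio, ~86.13

print(round(hop_ms, 2), round(frames_per_sec, 2))
```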
## Training Procedure
<details>
<summary>Feature Extraction Setting</summary>

```yaml
sampling_rate: 44100
hop_size: 512    # Hop size.
format: "npy"
```
</details>
<details>
<summary>Generator Network Architecture Setting</summary>

```yaml
model_type: "multiband_melgan_generator"
multiband_melgan_generator_params:
    out_channels: 4              # Number of output channels (= number of subbands).
    kernel_size: 7               # Kernel size of initial and final conv layers.
    filters: 384                 # Initial number of channels for conv layers.
    upsample_scales: [8, 4, 4]   # List of upsampling scales.
    stack_kernel_size: 3         # Kernel size of dilated conv layers in residual stack.
    stacks: 4                    # Number of stacks in a single residual stack module.
    is_weight_norm: false        # Whether to use weight normalization.
```
</details>
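
A useful consistency check implied by the settings above: the product of the generator's upsampling scales and the PQMF subband count must equal the hop size, so each mel frame synthesizes exactly one hop of audio. A minimal sketch, using the values from this config:

```python
# Sanity check (sketch): total upsampling of the generator times the PQMF
# subband count must equal hop_size from the feature-extraction setting.
upsample_scales = [8, 4, 4]   # from multiband_melgan_generator_params
out_channels = 4              # number of subbands; PQMF synthesis upsamples by 4

total = out_channels
for s in upsample_scales:
    total *= s

print(total)  # 512, matching hop_size
```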
<details>
<summary>Discriminator Network Architecture Setting</summary>

```yaml
multiband_melgan_discriminator_params:
    out_channels: 1                          # Number of output channels.
    scales: 3                                # Number of multi-scales.
    downsample_pooling: "AveragePooling1D"   # Pooling type for the input downsampling.
    downsample_pooling_params:               # Parameters of the above pooling function.
        pool_size: 4
        strides: 2
    kernel_sizes: [5, 3]                     # List of kernel sizes.
    filters: 16                              # Number of channels of the initial conv layer.
    max_downsample_filters: 512              # Maximum number of channels of downsampling layers.
    downsample_scales: [4, 4, 4]             # List of downsampling scales.
    nonlinear_activation: "LeakyReLU"        # Nonlinear activation function.
    nonlinear_activation_params:             # Parameters of the nonlinear activation function.
        alpha: 0.2
    is_weight_norm: false                    # Whether to use weight normalization.
hifigan_discriminator_params:
    out_channels: 1                          # Number of output channels.
    period_scales: [3, 5, 7, 11, 17, 23, 37] # List of period scales.
    n_layers: 5                              # Number of layers in each period discriminator.
    kernel_size: 5                           # Kernel size.
    strides: 3                               # Strides.
    filters: 8                               # Initial conv filters of each period discriminator.
    filter_scales: 4                         # Filter scales.
    max_filters: 512                         # Maximum filters of the period discriminator's convs.
    is_weight_norm: false                    # Whether to use weight normalization.
```
</details>
<details>
<summary>STFT Loss Setting</summary>

```yaml
stft_loss_params:
    fft_lengths: [1024, 2048, 512]   # List of FFT sizes for STFT-based loss.
    frame_steps: [120, 240, 50]      # List of hop sizes for STFT-based loss.
    frame_lengths: [600, 1200, 240]  # List of window lengths for STFT-based loss.
subband_stft_loss_params:
    fft_lengths: [384, 683, 171]     # List of FFT sizes for STFT-based loss.
    frame_steps: [30, 60, 10]        # List of hop sizes for STFT-based loss.
    frame_lengths: [150, 300, 60]    # List of window lengths for STFT-based loss.
```
</details>
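
The parameters above configure a multi-resolution STFT loss: magnitude spectra are compared at three FFT/hop/window resolutions. A minimal numpy sketch of the idea (not the framework's implementation; `stft_mag` and `multires_stft_loss` are illustrative names):

```python
import numpy as np

def stft_mag(x, fft_len, hop, win_len):
    """Magnitude STFT with a Hann window (illustrative sketch)."""
    window = np.hanning(win_len)
    frames = [
        np.abs(np.fft.rfft(x[start:start + win_len] * window, n=fft_len))
        for start in range(0, len(x) - win_len + 1, hop)
    ]
    return np.array(frames)

def multires_stft_loss(y, y_hat,
                       fft_lengths=(1024, 2048, 512),
                       frame_steps=(120, 240, 50),
                       frame_lengths=(600, 1200, 240)):
    """Average spectral-convergence + log-magnitude loss over resolutions."""
    total = 0.0
    for n_fft, hop, win in zip(fft_lengths, frame_steps, frame_lengths):
        S = stft_mag(y, n_fft, hop, win)
        S_hat = stft_mag(y_hat, n_fft, hop, win)
        sc = np.linalg.norm(S - S_hat) / (np.linalg.norm(S) + 1e-8)
        log_mag = np.mean(np.abs(np.log(S + 1e-8) - np.log(S_hat + 1e-8)))
        total += sc + log_mag
    return total / len(fft_lengths)
```

The loss is zero when generated and target waveforms match, and grows as their spectra diverge at any of the three resolutions.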
<details>
<summary>Adversarial Loss Setting</summary>

```yaml
lambda_feat_match: 10.0  # Loss balancing coefficient for feature matching loss.
lambda_adv: 2.5          # Loss balancing coefficient for adversarial loss.
```
</details>
<details>
<summary>Data Loader Setting</summary>

```yaml
batch_size: 32               # Batch size per GPU, assuming gradient_accumulation_steps == 1.
eval_batch_size: 16
batch_max_steps: 8192        # Length of each audio clip in a training batch; must be divisible by hop_size.
batch_max_steps_valid: 8192  # Length of each audio clip for validation; must be divisible by hop_size.
remove_short_samples: true   # Whether to remove samples shorter than batch_max_steps.
allow_cache: true            # Whether to cache the dataset; if true, requires CPU memory.
is_shuffle: true             # Whether to shuffle the dataset after each epoch.
```
</details>
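
The divisibility requirement above ensures each training window covers a whole number of mel frames. A quick check with this config's values:

```python
# batch_max_steps must be divisible by hop_size so that each training
# window corresponds to an integer number of mel frames.
hop_size = 512
batch_max_steps = 8192

assert batch_max_steps % hop_size == 0
frames_per_window = batch_max_steps // hop_size
print(frames_per_window)  # 16 mel frames per 8192-sample window
```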
<details>
<summary>Optimizer & Scheduler Setting</summary>

```yaml
generator_optimizer_params:
    lr_fn: "PiecewiseConstantDecay"
    lr_params:
        boundaries: [100000, 150000, 400000, 500000, 600000, 700000]
        values: [0.0005, 0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]
    amsgrad: false
discriminator_optimizer_params:
    lr_fn: "PiecewiseConstantDecay"
    lr_params:
        boundaries: [100000, 200000, 300000, 400000, 500000]
        values: [0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]
    amsgrad: false
gradient_accumulation_steps: 1
```
</details>
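
`PiecewiseConstantDecay` holds each learning rate constant until its boundary step is passed, so `values` has one more entry than `boundaries`. A small sketch of that lookup (illustrative, not the TensorFlow implementation):

```python
def piecewise_lr(step, boundaries, values):
    """Sketch of PiecewiseConstantDecay: values[i] applies while
    step <= boundaries[i]; past the last boundary, values[-1] applies.
    Requires len(values) == len(boundaries) + 1."""
    for boundary, value in zip(boundaries, values):
        if step <= boundary:
            return value
    return values[-1]

# Generator schedule from the config above:
gen_boundaries = [100000, 150000, 400000, 500000, 600000, 700000]
gen_values = [0.0005, 0.00025, 0.000125, 0.0000625,
              0.00003125, 0.000015625, 0.000001]

print(piecewise_lr(50000, gen_boundaries, gen_values))   # initial LR, 0.0005
print(piecewise_lr(800000, gen_boundaries, gen_values))  # final LR, 0.000001
```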
<details>
<summary>Interval Setting</summary>

```yaml
discriminator_train_start_steps: 200000  # Step at which discriminator training begins.
train_max_steps: 1000000                 # Total number of training steps.
save_interval_steps: 20000               # Interval (in steps) between checkpoint saves.
eval_interval_steps: 5000                # Interval (in steps) between evaluations.
log_interval_steps: 200                  # Interval (in steps) between training-log records.
```
</details>
<details>
<summary>Other Setting</summary>

```yaml
num_save_intermediate_results: 1  # Number of batches to be saved as intermediate results.
```
</details>
## How to Use
```py
import soundfile as sf
import tensorflow as tf
from tensorflow_tts.inference import TFAutoModel, AutoProcessor
lightspeech = TFAutoModel.from_pretrained("bookbot/lightspeech-mfa-sw-v4")
processor = AutoProcessor.from_pretrained("bookbot/lightspeech-mfa-sw-v4")
mb_melgan = TFAutoModel.from_pretrained("bookbot/mb-melgan-hifi-postnets-sw-v4")
text, speaker_name = "Hello World.", "sw-TZ-Victoria"
input_ids = processor.text_to_sequence(text)
mel, _, _ = lightspeech.inference(
input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
speaker_ids=tf.convert_to_tensor(
[processor.speakers_map[speaker_name]], dtype=tf.int32
),
speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
)
audio = mb_melgan.inference(mel)[0, :, 0]
sf.write("./audio.wav", audio, 44100, "PCM_16")
```
## Disclaimer
Consider the biases present in the training datasets, which may carry over into this model's outputs.
## Authors
MB-MelGAN HiFi PostNets SW v4 was trained and evaluated by [David Samuel Setiawan](https://davidsamuell.github.io/) and [Wilson Wongso](https://wilsonwongso.dev/). All computation and development were done on local machines.
## Framework versions
- TensorFlowTTS 1.8
- TensorFlow 2.12.0