---

language: sw
license: apache-2.0
tags:
  - tensorflowtts
  - audio
  - text-to-speech
  - mel-to-wav
inference: false
datasets:
  - bookbot/sw-TZ-Victoria
  - bookbot/sw-TZ-Victoria-syllables-word
  - bookbot/sw-TZ-Victoria-v2
  - bookbot/sw-TZ-VictoriaNeural-upsampled-48kHz
---


# MB-MelGAN HiFi PostNets SW v4

MB-MelGAN HiFi PostNets SW v4 is a mel-to-wav model based on the [MB-MelGAN](https://arxiv.org/abs/2005.05106) architecture with a [HiFi-GAN](https://arxiv.org/abs/2010.05646) discriminator. This model was trained from scratch on real and synthetic audio datasets. Instead of training on ground-truth waveform spectrograms, this model was trained on the generated PostNet spectrograms of [LightSpeech MFA SW v4](https://huggingface.co/bookbot/lightspeech-mfa-sw-v4). The list of speakers includes:

- sw-TZ-Victoria
- sw-TZ-Victoria-syllables-word
- sw-TZ-Victoria-v2
- sw-TZ-VictoriaNeural-upsampled-48kHz

This model was trained using the [TensorFlowTTS](https://github.com/TensorSpeech/TensorFlowTTS) framework. All training was done on an RTX 4090 GPU. All scripts used for training can be found in this [GitHub fork](https://github.com/bookbot-hive/TensorFlowTTS), and the [training metrics](https://huggingface.co/bookbot/mb-melgan-hifi-postnets-sw-v4/tensorboard) were logged via TensorBoard.

## Model

| Model                           | Config                                                                                    | SR (Hz) | Mel range (Hz) | FFT / Hop / Win (pt) | #steps |
| ------------------------------- | ----------------------------------------------------------------------------------------- | ------- | -------------- | -------------------- | ------ |
| `mb-melgan-hifi-postnets-sw-v4` | [Link](https://huggingface.co/bookbot/mb-melgan-hifi-postnets-sw-v4/blob/main/config.yml) | 44,100  | 20-11025       | 2048 / 512 / None    | 1M     |
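
For reference, these settings correspond to a fairly standard log-mel front end. Below is a minimal sketch of matching feature extraction using `librosa`; `n_mels` is not listed in the table, so the value used here (80) is an assumption, and the exact log compression is defined in `config.yml`.

```py
import librosa
import numpy as np

# Load audio at the model's sampling rate.
audio, sr = librosa.load("sample.wav", sr=44100)

# Magnitude mel spectrogram with the table's parameters. win_length=None
# falls back to n_fft (2048). n_mels=80 is an assumption, not from the table.
mel = librosa.feature.melspectrogram(
    y=audio,
    sr=sr,
    n_fft=2048,
    hop_length=512,
    win_length=None,
    power=1.0,
    n_mels=80,
    fmin=20,
    fmax=11025,
)
log_mel = np.log10(np.maximum(mel, 1e-10))  # (n_mels, frames)
```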

## Training Procedure

<details>
  <summary>Feature Extraction Setting</summary>

    sampling_rate: 44100
    hop_size: 512 # Hop size.
    format: "npy"


</details>

<details>
  <summary>Generator Network Architecture Setting</summary>

    model_type: "multiband_melgan_generator"

    multiband_melgan_generator_params:
        out_channels: 4 # Number of output channels (number of subbands).
        kernel_size: 7 # Kernel size of initial and final conv layers.
        filters: 384 # Initial number of channels for conv layers.
        upsample_scales: [8, 4, 4] # List of upsampling scales.
        stack_kernel_size: 3 # Kernel size of dilated conv layers in residual stack.
        stacks: 4 # Number of stacks in a single residual stack module.
        is_weight_norm: false # Use weight-norm or not.


</details>
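
As a quick sanity check on these settings: each mel frame is upsampled by the product of `upsample_scales` within every subband, and PQMF synthesis of the `out_channels` subbands multiplies that by 4, which must equal the feature-extraction `hop_size`.

```py
upsample_scales = [8, 4, 4]
out_channels = 4  # number of PQMF subbands

per_subband = 1
for scale in upsample_scales:
    per_subband *= scale  # 8 * 4 * 4 = 128 samples per mel frame, per subband

assert per_subband * out_channels == 512  # matches hop_size above
```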

<details>
  <summary>Discriminator Network Architecture Setting</summary>

    multiband_melgan_discriminator_params:
        out_channels: 1 # Number of output channels.
        scales: 3 # Number of multi-scales.
        downsample_pooling: "AveragePooling1D" # Pooling type for the input downsampling.
        downsample_pooling_params: # Parameters of the above pooling function.
            pool_size: 4
            strides: 2
        kernel_sizes: [5, 3] # List of kernel sizes.
        filters: 16 # Number of channels of the initial conv layer.
        max_downsample_filters: 512 # Maximum number of channels of downsampling layers.
        downsample_scales: [4, 4, 4] # List of downsampling scales.
        nonlinear_activation: "LeakyReLU" # Nonlinear activation function.
        nonlinear_activation_params: # Parameters of nonlinear activation function.
            alpha: 0.2
        is_weight_norm: false # Use weight-norm or not.

    hifigan_discriminator_params:
        out_channels: 1 # Number of output channels.
        period_scales: [3, 5, 7, 11, 17, 23, 37] # List of period scales.
        n_layers: 5 # Number of layers of each period discriminator.
        kernel_size: 5 # Kernel size.
        strides: 3 # Strides.
        filters: 8 # Initial conv filters of each period discriminator.
        filter_scales: 4 # Filter scales.
        max_filters: 512 # Maximum filters of the period discriminator's convs.
        is_weight_norm: false # Use weight-norm or not.


</details>
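
The HiFi-GAN part of the discriminator judges the waveform through periodic "views": for each of the `period_scales`, the signal is padded and folded so that samples spaced one period apart line up in columns for 2D convolutions. This is a minimal sketch of that reshape only, not the full discriminator.

```py
import tensorflow as tf

def to_period_view(audio: tf.Tensor, period: int) -> tf.Tensor:
    """Fold a (batch, samples) waveform into (batch, frames, period)."""
    length = tf.shape(audio)[1]
    pad = (period - length % period) % period  # pad to a multiple of `period`
    audio = tf.pad(audio, [[0, 0], [0, pad]], mode="REFLECT")
    return tf.reshape(audio, [tf.shape(audio)[0], -1, period])

x = tf.random.normal([1, 8192])
print(to_period_view(x, 7).shape)  # (1, 1171, 7)
```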

<details>
  <summary>STFT Loss Setting</summary>

    stft_loss_params:
        fft_lengths: [1024, 2048, 512] # List of FFT sizes for STFT-based loss.
        frame_steps: [120, 240, 50] # List of hop sizes for STFT-based loss.
        frame_lengths: [600, 1200, 240] # List of window lengths for STFT-based loss.

    subband_stft_loss_params:
        fft_lengths: [384, 683, 171] # List of FFT sizes for STFT-based loss.
        frame_steps: [30, 60, 10] # List of hop sizes for STFT-based loss.
        frame_lengths: [150, 300, 60] # List of window lengths for STFT-based loss.


</details>
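
These parameters define a multi-resolution STFT loss: real and generated waveforms are compared at several STFT resolutions (full-band above, per-subband below). The following is a minimal sketch of the full-band term using the usual spectral convergence plus log-magnitude formulation; the exact TensorFlowTTS implementation may differ in details.

```py
import tensorflow as tf

# (fft_length, frame_step, frame_length) triples from stft_loss_params above.
RESOLUTIONS = [(1024, 120, 600), (2048, 240, 1200), (512, 50, 240)]

def multi_resolution_stft_loss(y: tf.Tensor, y_hat: tf.Tensor) -> tf.Tensor:
    """Average STFT loss over resolutions for (batch, samples) waveforms."""
    loss = 0.0
    for fft_length, frame_step, frame_length in RESOLUTIONS:
        mag = tf.abs(tf.signal.stft(y, frame_length, frame_step, fft_length))
        mag_hat = tf.abs(tf.signal.stft(y_hat, frame_length, frame_step, fft_length))
        sc = tf.norm(mag - mag_hat) / (tf.norm(mag) + 1e-6)  # spectral convergence
        log_mag = tf.reduce_mean(
            tf.abs(tf.math.log(mag + 1e-6) - tf.math.log(mag_hat + 1e-6))
        )
        loss += sc + log_mag
    return loss / len(RESOLUTIONS)
```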

<details>
  <summary>Adversarial Loss Setting</summary>

    lambda_feat_match: 10.0 # Loss balancing coefficient for feature matching loss.
    lambda_adv: 2.5 # Loss balancing coefficient for adversarial loss.


</details>

<details>
  <summary>Data Loader Setting</summary>

    batch_size: 32 # Batch size for each GPU, assuming gradient_accumulation_steps == 1.
    eval_batch_size: 16
    batch_max_steps: 8192 # Length of each audio clip in a training batch. Must be divisible by hop_size.
    batch_max_steps_valid: 8192 # Length of each audio clip for validation. Must be divisible by hop_size.
    remove_short_samples: true # Whether to remove samples shorter than batch_max_steps.
    allow_cache: true # Whether to cache the dataset. If true, it requires CPU memory.
    is_shuffle: true # Shuffle the dataset after each epoch.


</details>

<details>
  <summary>Optimizer & Scheduler Setting</summary>

    generator_optimizer_params:
        lr_fn: "PiecewiseConstantDecay"
        lr_params:
            boundaries: [100000, 150000, 400000, 500000, 600000, 700000]
            values: [0.0005, 0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]
        amsgrad: false

    discriminator_optimizer_params:
        lr_fn: "PiecewiseConstantDecay"
        lr_params:
            boundaries: [100000, 200000, 300000, 400000, 500000]
            values: [0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]
        amsgrad: false

    gradient_accumulation_steps: 1


</details>
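
Both schedules are plain piecewise-constant decays over training steps. A sketch of how the generator schedule above maps onto Keras follows; the optimizer class (Adam) is an assumption for illustration.

```py
import tensorflow as tf

# PiecewiseConstantDecay expects len(values) == len(boundaries) + 1.
lr_schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[100_000, 150_000, 400_000, 500_000, 600_000, 700_000],
    values=[0.0005, 0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001],
)
generator_optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule, amsgrad=False)
```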

<details>
  <summary>Interval Setting</summary>

    discriminator_train_start_steps: 200000 # Step at which to begin training the discriminator.
    train_max_steps: 1000000 # Number of training steps.
    save_interval_steps: 20000 # Interval steps to save checkpoint.
    eval_interval_steps: 5000 # Interval steps to evaluate the network.
    log_interval_steps: 200 # Interval steps to record the training log.


</details>

<details>
  <summary>Other Setting</summary>

    num_save_intermediate_results: 1 # Number of batches to be saved as intermediate results.


</details>

## How to Use

```py
import soundfile as sf
import tensorflow as tf

from tensorflow_tts.inference import TFAutoModel, AutoProcessor

# Load the LightSpeech acoustic model, its processor, and the MB-MelGAN vocoder.
lightspeech = TFAutoModel.from_pretrained("bookbot/lightspeech-mfa-sw-v4")
processor = AutoProcessor.from_pretrained("bookbot/lightspeech-mfa-sw-v4")
mb_melgan = TFAutoModel.from_pretrained("bookbot/mb-melgan-hifi-postnets-sw-v4")

text, speaker_name = "Hello World.", "sw-TZ-Victoria"
input_ids = processor.text_to_sequence(text)

# Text -> PostNet mel spectrogram.
mel, _, _ = lightspeech.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    speaker_ids=tf.convert_to_tensor(
        [processor.speakers_map[speaker_name]], dtype=tf.int32
    ),
    speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
)

# Mel spectrogram -> waveform, then save as 16-bit PCM at 44.1 kHz.
audio = mb_melgan.inference(mel)[0, :, 0]
sf.write("./audio.wav", audio, 44100, "PCM_16")
```

## Disclaimer

Do consider the biases present in the training datasets, which may carry over into the results of this model.

## Authors

MB-MelGAN HiFi PostNets SW v4 was trained and evaluated by [David Samuel Setiawan](https://davidsamuell.github.io/) and [Wilson Wongso](https://wilsonwongso.dev/). All computation and development were done on local machines.

## Framework versions

- TensorFlowTTS 1.8
- TensorFlow 2.12.0