Spaces:
Build error
Build error
ddd
commited on
Commit
•
b93970c
1
Parent(s):
aee7e5a
Add application file
Browse filesThis view is limited to 50 files because it contains too many changes.
See raw diff
- LICENSE +21 -0
- README.md +83 -12
- checkpoints/.gitkeep +0 -0
- configs/config_base.yaml +42 -0
- configs/singing/base.yaml +42 -0
- configs/singing/fs2.yaml +3 -0
- configs/tts/base.yaml +95 -0
- configs/tts/base_zh.yaml +3 -0
- configs/tts/fs2.yaml +80 -0
- configs/tts/hifigan.yaml +21 -0
- configs/tts/lj/base_mel2wav.yaml +3 -0
- configs/tts/lj/base_text2mel.yaml +13 -0
- configs/tts/lj/fs2.yaml +3 -0
- configs/tts/lj/hifigan.yaml +3 -0
- configs/tts/lj/pwg.yaml +3 -0
- configs/tts/pwg.yaml +110 -0
- data/processed/ljspeech/dict.txt +77 -0
- data/processed/ljspeech/metadata_phone.csv +0 -0
- data/processed/ljspeech/mfa_dict.txt +0 -0
- data/processed/ljspeech/phone_set.json +1 -0
- data_gen/singing/binarize.py +398 -0
- data_gen/tts/base_binarizer.py +224 -0
- data_gen/tts/bin/binarize.py +20 -0
- data_gen/tts/binarizer_zh.py +59 -0
- data_gen/tts/data_gen_utils.py +347 -0
- data_gen/tts/txt_processors/base_text_processor.py +8 -0
- data_gen/tts/txt_processors/en.py +78 -0
- data_gen/tts/txt_processors/zh.py +41 -0
- data_gen/tts/txt_processors/zh_g2pM.py +72 -0
- docs/README-SVS-opencpop-cascade.md +111 -0
- docs/README-SVS-opencpop-e2e.md +106 -0
- docs/README-SVS-popcs.md +63 -0
- docs/README-SVS.md +44 -0
- docs/README-TTS.md +63 -0
- docs/README-zh.md +212 -0
- inference/svs/base_svs_infer.py +265 -0
- inference/svs/ds_cascade.py +54 -0
- inference/svs/ds_e2e.py +67 -0
- inference/svs/gradio/gradio_settings.yaml +19 -0
- inference/svs/gradio/infer.py +91 -0
- inference/svs/opencpop/cpop_pinyin2ph.txt +418 -0
- inference/svs/opencpop/map.py +8 -0
- modules/__init__.py +0 -0
- modules/commons/common_layers.py +668 -0
- modules/commons/espnet_positional_embedding.py +113 -0
- modules/commons/ssim.py +391 -0
- modules/diffsinger_midi/fs2.py +118 -0
- modules/fastspeech/fs2.py +255 -0
- modules/fastspeech/pe.py +149 -0
- modules/fastspeech/tts_modules.py +357 -0
LICENSE
ADDED
@@ -0,0 +1,21 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
MIT License
|
2 |
+
|
3 |
+
Copyright (c) 2021 Jinglin Liu
|
4 |
+
|
5 |
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
6 |
+
of this software and associated documentation files (the "Software"), to deal
|
7 |
+
in the Software without restriction, including without limitation the rights
|
8 |
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
9 |
+
copies of the Software, and to permit persons to whom the Software is
|
10 |
+
furnished to do so, subject to the following conditions:
|
11 |
+
|
12 |
+
The above copyright notice and this permission notice shall be included in all
|
13 |
+
copies or substantial portions of the Software.
|
14 |
+
|
15 |
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
16 |
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
17 |
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
18 |
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
19 |
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
20 |
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
21 |
+
SOFTWARE.
|
README.md
CHANGED
@@ -1,12 +1,83 @@
|
|
1 |
-
|
2 |
-
|
3 |
-
|
4 |
-
|
5 |
-
|
6 |
-
|
7 |
-
|
8 |
-
|
9 |
-
|
10 |
-
|
11 |
-
|
12 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
# DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
|
2 |
+
[![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2105.02446)
|
3 |
+
[![GitHub Stars](https://img.shields.io/github/stars/MoonInTheRiver/DiffSinger?style=social)](https://github.com/MoonInTheRiver/DiffSinger)
|
4 |
+
[![downloads](https://img.shields.io/github/downloads/MoonInTheRiver/DiffSinger/total.svg)](https://github.com/MoonInTheRiver/DiffSinger/releases)
|
5 |
+
| [![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-blue)](https://huggingface.co/spaces/NATSpeech/DiffSpeech)
|
6 |
+
|
7 |
+
This repository is the official PyTorch implementation of our AAAI-2022 [paper](https://arxiv.org/abs/2105.02446), in which we propose DiffSinger (for Singing-Voice-Synthesis) and DiffSpeech (for Text-to-Speech).
|
8 |
+
|
9 |
+
<table style="width:100%">
|
10 |
+
<tr>
|
11 |
+
<th>DiffSinger/DiffSpeech at training</th>
|
12 |
+
<th>DiffSinger/DiffSpeech at inference</th>
|
13 |
+
</tr>
|
14 |
+
<tr>
|
15 |
+
<td><img src="resources/model_a.png" alt="Training" height="300"></td>
|
16 |
+
<td><img src="resources/model_b.png" alt="Inference" height="300"></td>
|
17 |
+
</tr>
|
18 |
+
</table>
|
19 |
+
|
20 |
+
:tada: :tada: :tada: **Updates**:
|
21 |
+
- Mar.2, 2022: [MIDI-new-version](docs/README-SVS-opencpop-e2e.md): A substantial improvement :sparkles:
|
22 |
+
- Mar.1, 2022: [NeuralSVB](https://github.com/MoonInTheRiver/NeuralSVB), for singing voice beautifying, has been released :sparkles: :sparkles: :sparkles: .
|
23 |
+
- Feb.13, 2022: [NATSpeech](https://github.com/NATSpeech/NATSpeech), the improved code framework, which contains the implementations of DiffSpeech and our NeurIPS-2021 work [PortaSpeech](https://openreview.net/forum?id=xmJsuh8xlq) has been released :sparkles: :sparkles: :sparkles:.
|
24 |
+
- Jan.29, 2022: support [MIDI-old-version](docs/README-SVS-opencpop-cascade.md) SVS. :construction: :pick: :hammer_and_wrench:
|
25 |
+
- Jan.13, 2022: support SVS, release PopCS dataset.
|
26 |
+
- Dec.19, 2021: support TTS. [HuggingFace🤗 Demo](https://huggingface.co/spaces/NATSpeech/DiffSpeech)
|
27 |
+
|
28 |
+
:rocket: **News**:
|
29 |
+
- Feb.24, 2022: Our new work, NeuralSVB was accepted by ACL-2022 [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2202.13277). [Demo Page](https://neuralsvb.github.io).
|
30 |
+
- Dec.01, 2021: DiffSinger was accepted by AAAI-2022.
|
31 |
+
- Sep.29, 2021: Our recent work `PortaSpeech: Portable and High-Quality Generative Text-to-Speech` was accepted by NeurIPS-2021 [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2109.15166) .
|
32 |
+
- May.06, 2021: We submitted DiffSinger to Arxiv [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2105.02446).
|
33 |
+
|
34 |
+
## Environments
|
35 |
+
```sh
|
36 |
+
conda create -n your_env_name python=3.8
|
37 |
+
source activate your_env_name
|
38 |
+
pip install -r requirements_2080.txt (GPU 2080Ti, CUDA 10.2)
|
39 |
+
or pip install -r requirements_3090.txt (GPU 3090, CUDA 11.4)
|
40 |
+
```
|
41 |
+
|
42 |
+
## Documents
|
43 |
+
- [Run DiffSpeech (TTS version)](docs/README-TTS.md).
|
44 |
+
- [Run DiffSinger (SVS version)](docs/README-SVS.md).
|
45 |
+
|
46 |
+
## Tensorboard
|
47 |
+
```sh
|
48 |
+
tensorboard --logdir_spec exp_name
|
49 |
+
```
|
50 |
+
<table style="width:100%">
|
51 |
+
<tr>
|
52 |
+
<td><img src="resources/tfb.png" alt="Tensorboard" height="250"></td>
|
53 |
+
</tr>
|
54 |
+
</table>
|
55 |
+
|
56 |
+
## Audio Demos
|
57 |
+
Old audio samples can be found in our [demo page](https://diffsinger.github.io/). Audio samples generated by this repository are listed here:
|
58 |
+
|
59 |
+
### TTS audio samples
|
60 |
+
Speech samples (test set of LJSpeech) can be found in [resources/demos_1213](https://github.com/MoonInTheRiver/DiffSinger/blob/master/resources/demos_1213).
|
61 |
+
|
62 |
+
### SVS audio samples
|
63 |
+
Singing samples (test set of PopCS) can be found in [resources/demos_0112](https://github.com/MoonInTheRiver/DiffSinger/blob/master/resources/demos_0112).
|
64 |
+
|
65 |
+
## Citation
|
66 |
+
@article{liu2021diffsinger,
|
67 |
+
title={Diffsinger: Singing voice synthesis via shallow diffusion mechanism},
|
68 |
+
author={Liu, Jinglin and Li, Chengxi and Ren, Yi and Chen, Feiyang and Liu, Peng and Zhao, Zhou},
|
69 |
+
journal={arXiv preprint arXiv:2105.02446},
|
70 |
+
volume={2},
|
71 |
+
year={2021}}
|
72 |
+
|
73 |
+
|
74 |
+
## Acknowledgements
|
75 |
+
Our codes are based on the following repos:
|
76 |
+
* [denoising-diffusion-pytorch](https://github.com/lucidrains/denoising-diffusion-pytorch)
|
77 |
+
* [PyTorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning)
|
78 |
+
* [ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN)
|
79 |
+
* [HifiGAN](https://github.com/jik876/hifi-gan)
|
80 |
+
* [espnet](https://github.com/espnet/espnet)
|
81 |
+
* [DiffWave](https://github.com/lmnt-com/diffwave)
|
82 |
+
|
83 |
+
Also thanks [Keon Lee](https://github.com/keonlee9420/DiffSinger) for fast implementation of our work.
|
checkpoints/.gitkeep
ADDED
File without changes
|
configs/config_base.yaml
ADDED
@@ -0,0 +1,42 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
# task
|
2 |
+
binary_data_dir: ''
|
3 |
+
work_dir: '' # experiment directory.
|
4 |
+
infer: false # infer
|
5 |
+
seed: 1234
|
6 |
+
debug: false
|
7 |
+
save_codes:
|
8 |
+
- configs
|
9 |
+
- modules
|
10 |
+
- tasks
|
11 |
+
- utils
|
12 |
+
- usr
|
13 |
+
|
14 |
+
#############
|
15 |
+
# dataset
|
16 |
+
#############
|
17 |
+
ds_workers: 1
|
18 |
+
test_num: 100
|
19 |
+
valid_num: 100
|
20 |
+
endless_ds: false
|
21 |
+
sort_by_len: true
|
22 |
+
|
23 |
+
#########
|
24 |
+
# train and eval
|
25 |
+
#########
|
26 |
+
load_ckpt: ''
|
27 |
+
save_ckpt: true
|
28 |
+
save_best: false
|
29 |
+
num_ckpt_keep: 3
|
30 |
+
clip_grad_norm: 0
|
31 |
+
accumulate_grad_batches: 1
|
32 |
+
log_interval: 100
|
33 |
+
num_sanity_val_steps: 5 # steps of validation at the beginning
|
34 |
+
check_val_every_n_epoch: 10
|
35 |
+
val_check_interval: 2000
|
36 |
+
max_epochs: 1000
|
37 |
+
max_updates: 160000
|
38 |
+
max_tokens: 31250
|
39 |
+
max_sentences: 100000
|
40 |
+
max_eval_tokens: -1
|
41 |
+
max_eval_sentences: -1
|
42 |
+
test_input_dir: ''
|
configs/singing/base.yaml
ADDED
@@ -0,0 +1,42 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
base_config:
|
2 |
+
- configs/tts/base.yaml
|
3 |
+
- configs/tts/base_zh.yaml
|
4 |
+
|
5 |
+
|
6 |
+
datasets: []
|
7 |
+
test_prefixes: []
|
8 |
+
test_num: 0
|
9 |
+
valid_num: 0
|
10 |
+
|
11 |
+
pre_align_cls: data_gen.singing.pre_align.SingingPreAlign
|
12 |
+
binarizer_cls: data_gen.singing.binarize.SingingBinarizer
|
13 |
+
pre_align_args:
|
14 |
+
use_tone: false # for ZH
|
15 |
+
forced_align: mfa
|
16 |
+
use_sox: true
|
17 |
+
hop_size: 128 # Hop size.
|
18 |
+
fft_size: 512 # FFT size.
|
19 |
+
win_size: 512 # FFT size.
|
20 |
+
max_frames: 8000
|
21 |
+
fmin: 50 # Minimum freq in mel basis calculation.
|
22 |
+
fmax: 11025 # Maximum frequency in mel basis calculation.
|
23 |
+
pitch_type: frame
|
24 |
+
|
25 |
+
hidden_size: 256
|
26 |
+
mel_loss: "ssim:0.5|l1:0.5"
|
27 |
+
lambda_f0: 0.0
|
28 |
+
lambda_uv: 0.0
|
29 |
+
lambda_energy: 0.0
|
30 |
+
lambda_ph_dur: 0.0
|
31 |
+
lambda_sent_dur: 0.0
|
32 |
+
lambda_word_dur: 0.0
|
33 |
+
predictor_grad: 0.0
|
34 |
+
use_spk_embed: true
|
35 |
+
use_spk_id: false
|
36 |
+
|
37 |
+
max_tokens: 20000
|
38 |
+
max_updates: 400000
|
39 |
+
num_spk: 100
|
40 |
+
save_f0: true
|
41 |
+
use_gt_dur: true
|
42 |
+
use_gt_f0: true
|
configs/singing/fs2.yaml
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
base_config:
|
2 |
+
- configs/tts/fs2.yaml
|
3 |
+
- configs/singing/base.yaml
|
configs/tts/base.yaml
ADDED
@@ -0,0 +1,95 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
# task
|
2 |
+
base_config: configs/config_base.yaml
|
3 |
+
task_cls: ''
|
4 |
+
#############
|
5 |
+
# dataset
|
6 |
+
#############
|
7 |
+
raw_data_dir: ''
|
8 |
+
processed_data_dir: ''
|
9 |
+
binary_data_dir: ''
|
10 |
+
dict_dir: ''
|
11 |
+
pre_align_cls: ''
|
12 |
+
binarizer_cls: data_gen.tts.base_binarizer.BaseBinarizer
|
13 |
+
pre_align_args:
|
14 |
+
use_tone: true # for ZH
|
15 |
+
forced_align: mfa
|
16 |
+
use_sox: false
|
17 |
+
txt_processor: en
|
18 |
+
allow_no_txt: false
|
19 |
+
denoise: false
|
20 |
+
binarization_args:
|
21 |
+
shuffle: false
|
22 |
+
with_txt: true
|
23 |
+
with_wav: false
|
24 |
+
with_align: true
|
25 |
+
with_spk_embed: true
|
26 |
+
with_f0: true
|
27 |
+
with_f0cwt: true
|
28 |
+
|
29 |
+
loud_norm: false
|
30 |
+
endless_ds: true
|
31 |
+
reset_phone_dict: true
|
32 |
+
|
33 |
+
test_num: 100
|
34 |
+
valid_num: 100
|
35 |
+
max_frames: 1550
|
36 |
+
max_input_tokens: 1550
|
37 |
+
audio_num_mel_bins: 80
|
38 |
+
audio_sample_rate: 22050
|
39 |
+
hop_size: 256 # For 22050Hz, 275 ~= 12.5 ms (0.0125 * sample_rate)
|
40 |
+
win_size: 1024 # For 22050Hz, 1100 ~= 50 ms (If None, win_size: fft_size) (0.05 * sample_rate)
|
41 |
+
fmin: 80 # Set this to 55 if your speaker is male! if female, 95 should help taking off noise. (To test depending on dataset. Pitch info: male~[65, 260], female~[100, 525])
|
42 |
+
fmax: 7600 # To be increased/reduced depending on data.
|
43 |
+
fft_size: 1024 # Extra window size is filled with 0 paddings to match this parameter
|
44 |
+
min_level_db: -100
|
45 |
+
num_spk: 1
|
46 |
+
mel_vmin: -6
|
47 |
+
mel_vmax: 1.5
|
48 |
+
ds_workers: 4
|
49 |
+
|
50 |
+
#########
|
51 |
+
# model
|
52 |
+
#########
|
53 |
+
dropout: 0.1
|
54 |
+
enc_layers: 4
|
55 |
+
dec_layers: 4
|
56 |
+
hidden_size: 384
|
57 |
+
num_heads: 2
|
58 |
+
prenet_dropout: 0.5
|
59 |
+
prenet_hidden_size: 256
|
60 |
+
stop_token_weight: 5.0
|
61 |
+
enc_ffn_kernel_size: 9
|
62 |
+
dec_ffn_kernel_size: 9
|
63 |
+
ffn_act: gelu
|
64 |
+
ffn_padding: 'SAME'
|
65 |
+
|
66 |
+
|
67 |
+
###########
|
68 |
+
# optimization
|
69 |
+
###########
|
70 |
+
lr: 2.0
|
71 |
+
warmup_updates: 8000
|
72 |
+
optimizer_adam_beta1: 0.9
|
73 |
+
optimizer_adam_beta2: 0.98
|
74 |
+
weight_decay: 0
|
75 |
+
clip_grad_norm: 1
|
76 |
+
|
77 |
+
|
78 |
+
###########
|
79 |
+
# train and eval
|
80 |
+
###########
|
81 |
+
max_tokens: 30000
|
82 |
+
max_sentences: 100000
|
83 |
+
max_eval_sentences: 1
|
84 |
+
max_eval_tokens: 60000
|
85 |
+
train_set_name: 'train'
|
86 |
+
valid_set_name: 'valid'
|
87 |
+
test_set_name: 'test'
|
88 |
+
vocoder: pwg
|
89 |
+
vocoder_ckpt: ''
|
90 |
+
profile_infer: false
|
91 |
+
out_wav_norm: false
|
92 |
+
save_gt: false
|
93 |
+
save_f0: false
|
94 |
+
gen_dir_name: ''
|
95 |
+
use_denoise: false
|
configs/tts/base_zh.yaml
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
pre_align_args:
|
2 |
+
txt_processor: zh_g2pM
|
3 |
+
binarizer_cls: data_gen.tts.binarizer_zh.ZhBinarizer
|
configs/tts/fs2.yaml
ADDED
@@ -0,0 +1,80 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
base_config: configs/tts/base.yaml
|
2 |
+
task_cls: tasks.tts.fs2.FastSpeech2Task
|
3 |
+
|
4 |
+
# model
|
5 |
+
hidden_size: 256
|
6 |
+
dropout: 0.1
|
7 |
+
encoder_type: fft # fft|tacotron|tacotron2|conformer
|
8 |
+
encoder_K: 8 # for tacotron encoder
|
9 |
+
decoder_type: fft # fft|rnn|conv|conformer
|
10 |
+
use_pos_embed: true
|
11 |
+
|
12 |
+
# duration
|
13 |
+
predictor_hidden: -1
|
14 |
+
predictor_kernel: 5
|
15 |
+
predictor_layers: 2
|
16 |
+
dur_predictor_kernel: 3
|
17 |
+
dur_predictor_layers: 2
|
18 |
+
predictor_dropout: 0.5
|
19 |
+
|
20 |
+
# pitch and energy
|
21 |
+
use_pitch_embed: true
|
22 |
+
pitch_type: ph # frame|ph|cwt
|
23 |
+
use_uv: true
|
24 |
+
cwt_hidden_size: 128
|
25 |
+
cwt_layers: 2
|
26 |
+
cwt_loss: l1
|
27 |
+
cwt_add_f0_loss: false
|
28 |
+
cwt_std_scale: 0.8
|
29 |
+
|
30 |
+
pitch_ar: false
|
31 |
+
#pitch_embed_type: 0q
|
32 |
+
pitch_loss: 'l1' # l1|l2|ssim
|
33 |
+
pitch_norm: log
|
34 |
+
use_energy_embed: false
|
35 |
+
|
36 |
+
# reference encoder and speaker embedding
|
37 |
+
use_spk_id: false
|
38 |
+
use_split_spk_id: false
|
39 |
+
use_spk_embed: false
|
40 |
+
use_var_enc: false
|
41 |
+
lambda_commit: 0.25
|
42 |
+
ref_norm_layer: bn
|
43 |
+
pitch_enc_hidden_stride_kernel:
|
44 |
+
- 0,2,5 # conv_hidden_size, conv_stride, conv_kernel_size. conv_hidden_size=0: use hidden_size
|
45 |
+
- 0,2,5
|
46 |
+
- 0,2,5
|
47 |
+
dur_enc_hidden_stride_kernel:
|
48 |
+
- 0,2,3 # conv_hidden_size, conv_stride, conv_kernel_size. conv_hidden_size=0: use hidden_size
|
49 |
+
- 0,2,3
|
50 |
+
- 0,1,3
|
51 |
+
|
52 |
+
|
53 |
+
# mel
|
54 |
+
mel_loss: l1:0.5|ssim:0.5 # l1|l2|gdl|ssim or l1:0.5|ssim:0.5
|
55 |
+
|
56 |
+
# loss lambda
|
57 |
+
lambda_f0: 1.0
|
58 |
+
lambda_uv: 1.0
|
59 |
+
lambda_energy: 0.1
|
60 |
+
lambda_ph_dur: 1.0
|
61 |
+
lambda_sent_dur: 1.0
|
62 |
+
lambda_word_dur: 1.0
|
63 |
+
predictor_grad: 0.1
|
64 |
+
|
65 |
+
# train and eval
|
66 |
+
pretrain_fs_ckpt: ''
|
67 |
+
warmup_updates: 2000
|
68 |
+
max_tokens: 32000
|
69 |
+
max_sentences: 100000
|
70 |
+
max_eval_sentences: 1
|
71 |
+
max_updates: 120000
|
72 |
+
num_valid_plots: 5
|
73 |
+
num_test_samples: 0
|
74 |
+
test_ids: []
|
75 |
+
use_gt_dur: false
|
76 |
+
use_gt_f0: false
|
77 |
+
|
78 |
+
# exp
|
79 |
+
dur_loss: mse # huber|mol
|
80 |
+
norm_type: gn
|
configs/tts/hifigan.yaml
ADDED
@@ -0,0 +1,21 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
base_config: configs/tts/pwg.yaml
|
2 |
+
task_cls: tasks.vocoder.hifigan.HifiGanTask
|
3 |
+
resblock: "1"
|
4 |
+
adam_b1: 0.8
|
5 |
+
adam_b2: 0.99
|
6 |
+
upsample_rates: [ 8,8,2,2 ]
|
7 |
+
upsample_kernel_sizes: [ 16,16,4,4 ]
|
8 |
+
upsample_initial_channel: 128
|
9 |
+
resblock_kernel_sizes: [ 3,7,11 ]
|
10 |
+
resblock_dilation_sizes: [ [ 1,3,5 ], [ 1,3,5 ], [ 1,3,5 ] ]
|
11 |
+
|
12 |
+
lambda_mel: 45.0
|
13 |
+
|
14 |
+
max_samples: 8192
|
15 |
+
max_sentences: 16
|
16 |
+
|
17 |
+
generator_params:
|
18 |
+
lr: 0.0002 # Generator's learning rate.
|
19 |
+
aux_context_window: 0 # Context window size for auxiliary feature.
|
20 |
+
discriminator_optimizer_params:
|
21 |
+
lr: 0.0002 # Discriminator's learning rate.
|
configs/tts/lj/base_mel2wav.yaml
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
raw_data_dir: 'data/raw/LJSpeech-1.1'
|
2 |
+
processed_data_dir: 'data/processed/ljspeech'
|
3 |
+
binary_data_dir: 'data/binary/ljspeech_wav'
|
configs/tts/lj/base_text2mel.yaml
ADDED
@@ -0,0 +1,13 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
raw_data_dir: 'data/raw/LJSpeech-1.1'
|
2 |
+
processed_data_dir: 'data/processed/ljspeech'
|
3 |
+
binary_data_dir: 'data/binary/ljspeech'
|
4 |
+
pre_align_cls: data_gen.tts.lj.pre_align.LJPreAlign
|
5 |
+
|
6 |
+
pitch_type: cwt
|
7 |
+
mel_loss: l1
|
8 |
+
num_test_samples: 20
|
9 |
+
test_ids: [ 68, 70, 74, 87, 110, 172, 190, 215, 231, 294,
|
10 |
+
316, 324, 402, 422, 485, 500, 505, 508, 509, 519 ]
|
11 |
+
use_energy_embed: false
|
12 |
+
test_num: 523
|
13 |
+
valid_num: 348
|
configs/tts/lj/fs2.yaml
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
base_config:
|
2 |
+
- configs/tts/fs2.yaml
|
3 |
+
- configs/tts/lj/base_text2mel.yaml
|
configs/tts/lj/hifigan.yaml
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
base_config:
|
2 |
+
- configs/tts/hifigan.yaml
|
3 |
+
- configs/tts/lj/base_mel2wav.yaml
|
configs/tts/lj/pwg.yaml
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
base_config:
|
2 |
+
- configs/tts/pwg.yaml
|
3 |
+
- configs/tts/lj/base_mel2wav.yaml
|
configs/tts/pwg.yaml
ADDED
@@ -0,0 +1,110 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
base_config: configs/tts/base.yaml
|
2 |
+
task_cls: tasks.vocoder.pwg.PwgTask
|
3 |
+
|
4 |
+
binarization_args:
|
5 |
+
with_wav: true
|
6 |
+
with_spk_embed: false
|
7 |
+
with_align: false
|
8 |
+
test_input_dir: ''
|
9 |
+
|
10 |
+
###########
|
11 |
+
# train and eval
|
12 |
+
###########
|
13 |
+
max_samples: 25600
|
14 |
+
max_sentences: 5
|
15 |
+
max_eval_sentences: 1
|
16 |
+
max_updates: 1000000
|
17 |
+
val_check_interval: 2000
|
18 |
+
|
19 |
+
|
20 |
+
###########################################################
|
21 |
+
# FEATURE EXTRACTION SETTING #
|
22 |
+
###########################################################
|
23 |
+
sampling_rate: 22050 # Sampling rate.
|
24 |
+
fft_size: 1024 # FFT size.
|
25 |
+
hop_size: 256 # Hop size.
|
26 |
+
win_length: null # Window length.
|
27 |
+
# If set to null, it will be the same as fft_size.
|
28 |
+
window: "hann" # Window function.
|
29 |
+
num_mels: 80 # Number of mel basis.
|
30 |
+
fmin: 80 # Minimum freq in mel basis calculation.
|
31 |
+
fmax: 7600 # Maximum frequency in mel basis calculation.
|
32 |
+
format: "hdf5" # Feature file format. "npy" or "hdf5" is supported.
|
33 |
+
|
34 |
+
###########################################################
|
35 |
+
# GENERATOR NETWORK ARCHITECTURE SETTING #
|
36 |
+
###########################################################
|
37 |
+
generator_params:
|
38 |
+
in_channels: 1 # Number of input channels.
|
39 |
+
out_channels: 1 # Number of output channels.
|
40 |
+
kernel_size: 3 # Kernel size of dilated convolution.
|
41 |
+
layers: 30 # Number of residual block layers.
|
42 |
+
stacks: 3 # Number of stacks i.e., dilation cycles.
|
43 |
+
residual_channels: 64 # Number of channels in residual conv.
|
44 |
+
gate_channels: 128 # Number of channels in gated conv.
|
45 |
+
skip_channels: 64 # Number of channels in skip conv.
|
46 |
+
aux_channels: 80 # Number of channels for auxiliary feature conv.
|
47 |
+
# Must be the same as num_mels.
|
48 |
+
aux_context_window: 2 # Context window size for auxiliary feature.
|
49 |
+
# If set to 2, previous 2 and future 2 frames will be considered.
|
50 |
+
dropout: 0.0 # Dropout rate. 0.0 means no dropout applied.
|
51 |
+
use_weight_norm: true # Whether to use weight norm.
|
52 |
+
# If set to true, it will be applied to all of the conv layers.
|
53 |
+
upsample_net: "ConvInUpsampleNetwork" # Upsampling network architecture.
|
54 |
+
upsample_params: # Upsampling network parameters.
|
55 |
+
upsample_scales: [4, 4, 4, 4] # Upsampling scales. Prodcut of these must be the same as hop size.
|
56 |
+
use_pitch_embed: false
|
57 |
+
|
58 |
+
###########################################################
|
59 |
+
# DISCRIMINATOR NETWORK ARCHITECTURE SETTING #
|
60 |
+
###########################################################
|
61 |
+
discriminator_params:
|
62 |
+
in_channels: 1 # Number of input channels.
|
63 |
+
out_channels: 1 # Number of output channels.
|
64 |
+
kernel_size: 3 # Number of output channels.
|
65 |
+
layers: 10 # Number of conv layers.
|
66 |
+
conv_channels: 64 # Number of chnn layers.
|
67 |
+
bias: true # Whether to use bias parameter in conv.
|
68 |
+
use_weight_norm: true # Whether to use weight norm.
|
69 |
+
# If set to true, it will be applied to all of the conv layers.
|
70 |
+
nonlinear_activation: "LeakyReLU" # Nonlinear function after each conv.
|
71 |
+
nonlinear_activation_params: # Nonlinear function parameters
|
72 |
+
negative_slope: 0.2 # Alpha in LeakyReLU.
|
73 |
+
|
74 |
+
###########################################################
|
75 |
+
# STFT LOSS SETTING #
|
76 |
+
###########################################################
|
77 |
+
stft_loss_params:
|
78 |
+
fft_sizes: [1024, 2048, 512] # List of FFT size for STFT-based loss.
|
79 |
+
hop_sizes: [120, 240, 50] # List of hop size for STFT-based loss
|
80 |
+
win_lengths: [600, 1200, 240] # List of window length for STFT-based loss.
|
81 |
+
window: "hann_window" # Window function for STFT-based loss
|
82 |
+
use_mel_loss: false
|
83 |
+
|
84 |
+
###########################################################
|
85 |
+
# ADVERSARIAL LOSS SETTING #
|
86 |
+
###########################################################
|
87 |
+
lambda_adv: 4.0 # Loss balancing coefficient.
|
88 |
+
|
89 |
+
###########################################################
|
90 |
+
# OPTIMIZER & SCHEDULER SETTING #
|
91 |
+
###########################################################
|
92 |
+
generator_optimizer_params:
|
93 |
+
lr: 0.0001 # Generator's learning rate.
|
94 |
+
eps: 1.0e-6 # Generator's epsilon.
|
95 |
+
weight_decay: 0.0 # Generator's weight decay coefficient.
|
96 |
+
generator_scheduler_params:
|
97 |
+
step_size: 200000 # Generator's scheduler step size.
|
98 |
+
gamma: 0.5 # Generator's scheduler gamma.
|
99 |
+
# At each step size, lr will be multiplied by this parameter.
|
100 |
+
generator_grad_norm: 10 # Generator's gradient norm.
|
101 |
+
discriminator_optimizer_params:
|
102 |
+
lr: 0.00005 # Discriminator's learning rate.
|
103 |
+
eps: 1.0e-6 # Discriminator's epsilon.
|
104 |
+
weight_decay: 0.0 # Discriminator's weight decay coefficient.
|
105 |
+
discriminator_scheduler_params:
|
106 |
+
step_size: 200000 # Discriminator's scheduler step size.
|
107 |
+
gamma: 0.5 # Discriminator's scheduler gamma.
|
108 |
+
# At each step size, lr will be multiplied by this parameter.
|
109 |
+
discriminator_grad_norm: 1 # Discriminator's gradient norm.
|
110 |
+
disc_start_steps: 40000 # Number of steps to start to train discriminator.
|
data/processed/ljspeech/dict.txt
ADDED
@@ -0,0 +1,77 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
! !
|
2 |
+
, ,
|
3 |
+
. .
|
4 |
+
; ;
|
5 |
+
<BOS> <BOS>
|
6 |
+
<EOS> <EOS>
|
7 |
+
? ?
|
8 |
+
AA0 AA0
|
9 |
+
AA1 AA1
|
10 |
+
AA2 AA2
|
11 |
+
AE0 AE0
|
12 |
+
AE1 AE1
|
13 |
+
AE2 AE2
|
14 |
+
AH0 AH0
|
15 |
+
AH1 AH1
|
16 |
+
AH2 AH2
|
17 |
+
AO0 AO0
|
18 |
+
AO1 AO1
|
19 |
+
AO2 AO2
|
20 |
+
AW0 AW0
|
21 |
+
AW1 AW1
|
22 |
+
AW2 AW2
|
23 |
+
AY0 AY0
|
24 |
+
AY1 AY1
|
25 |
+
AY2 AY2
|
26 |
+
B B
|
27 |
+
CH CH
|
28 |
+
D D
|
29 |
+
DH DH
|
30 |
+
EH0 EH0
|
31 |
+
EH1 EH1
|
32 |
+
EH2 EH2
|
33 |
+
ER0 ER0
|
34 |
+
ER1 ER1
|
35 |
+
ER2 ER2
|
36 |
+
EY0 EY0
|
37 |
+
EY1 EY1
|
38 |
+
EY2 EY2
|
39 |
+
F F
|
40 |
+
G G
|
41 |
+
HH HH
|
42 |
+
IH0 IH0
|
43 |
+
IH1 IH1
|
44 |
+
IH2 IH2
|
45 |
+
IY0 IY0
|
46 |
+
IY1 IY1
|
47 |
+
IY2 IY2
|
48 |
+
JH JH
|
49 |
+
K K
|
50 |
+
L L
|
51 |
+
M M
|
52 |
+
N N
|
53 |
+
NG NG
|
54 |
+
OW0 OW0
|
55 |
+
OW1 OW1
|
56 |
+
OW2 OW2
|
57 |
+
OY0 OY0
|
58 |
+
OY1 OY1
|
59 |
+
OY2 OY2
|
60 |
+
P P
|
61 |
+
R R
|
62 |
+
S S
|
63 |
+
SH SH
|
64 |
+
T T
|
65 |
+
TH TH
|
66 |
+
UH0 UH0
|
67 |
+
UH1 UH1
|
68 |
+
UH2 UH2
|
69 |
+
UW0 UW0
|
70 |
+
UW1 UW1
|
71 |
+
UW2 UW2
|
72 |
+
V V
|
73 |
+
W W
|
74 |
+
Y Y
|
75 |
+
Z Z
|
76 |
+
ZH ZH
|
77 |
+
| |
|
data/processed/ljspeech/metadata_phone.csv
ADDED
The diff for this file is too large to render.
See raw diff
|
|
data/processed/ljspeech/mfa_dict.txt
ADDED
The diff for this file is too large to render.
See raw diff
|
|
data/processed/ljspeech/phone_set.json
ADDED
@@ -0,0 +1 @@
|
|
|
|
|
1 |
+
["!", ",", ".", ";", "<BOS>", "<EOS>", "?", "AA0", "AA1", "AA2", "AE0", "AE1", "AE2", "AH0", "AH1", "AH2", "AO0", "AO1", "AO2", "AW0", "AW1", "AW2", "AY0", "AY1", "AY2", "B", "CH", "D", "DH", "EH0", "EH1", "EH2", "ER0", "ER1", "ER2", "EY0", "EY1", "EY2", "F", "G", "HH", "IH0", "IH1", "IH2", "IY0", "IY1", "IY2", "JH", "K", "L", "M", "N", "NG", "OW0", "OW1", "OW2", "OY0", "OY1", "OY2", "P", "R", "S", "SH", "T", "TH", "UH0", "UH1", "UH2", "UW0", "UW1", "UW2", "V", "W", "Y", "Z", "ZH", "|"]
|
data_gen/singing/binarize.py
ADDED
@@ -0,0 +1,398 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import os
|
2 |
+
import random
|
3 |
+
from copy import deepcopy
|
4 |
+
import pandas as pd
|
5 |
+
import logging
|
6 |
+
from tqdm import tqdm
|
7 |
+
import json
|
8 |
+
import glob
|
9 |
+
import re
|
10 |
+
from resemblyzer import VoiceEncoder
|
11 |
+
import traceback
|
12 |
+
import numpy as np
|
13 |
+
import pretty_midi
|
14 |
+
import librosa
|
15 |
+
from scipy.interpolate import interp1d
|
16 |
+
import torch
|
17 |
+
from textgrid import TextGrid
|
18 |
+
|
19 |
+
from utils.hparams import hparams
|
20 |
+
from data_gen.tts.data_gen_utils import build_phone_encoder, get_pitch
|
21 |
+
from utils.pitch_utils import f0_to_coarse
|
22 |
+
from data_gen.tts.base_binarizer import BaseBinarizer, BinarizationError
|
23 |
+
from data_gen.tts.binarizer_zh import ZhBinarizer
|
24 |
+
from data_gen.tts.txt_processors.zh_g2pM import ALL_YUNMU
|
25 |
+
from vocoders.base_vocoder import VOCODERS
|
26 |
+
|
27 |
+
|
28 |
+
class SingingBinarizer(BaseBinarizer):
|
29 |
+
def __init__(self, processed_data_dir=None):
|
30 |
+
if processed_data_dir is None:
|
31 |
+
processed_data_dir = hparams['processed_data_dir']
|
32 |
+
self.processed_data_dirs = processed_data_dir.split(",")
|
33 |
+
self.binarization_args = hparams['binarization_args']
|
34 |
+
self.pre_align_args = hparams['pre_align_args']
|
35 |
+
self.item2txt = {}
|
36 |
+
self.item2ph = {}
|
37 |
+
self.item2wavfn = {}
|
38 |
+
self.item2f0fn = {}
|
39 |
+
self.item2tgfn = {}
|
40 |
+
self.item2spk = {}
|
41 |
+
|
42 |
+
def split_train_test_set(self, item_names):
|
43 |
+
item_names = deepcopy(item_names)
|
44 |
+
test_item_names = [x for x in item_names if any([ts in x for ts in hparams['test_prefixes']])]
|
45 |
+
train_item_names = [x for x in item_names if x not in set(test_item_names)]
|
46 |
+
logging.info("train {}".format(len(train_item_names)))
|
47 |
+
logging.info("test {}".format(len(test_item_names)))
|
48 |
+
return train_item_names, test_item_names
|
49 |
+
|
50 |
+
def load_meta_data(self):
|
51 |
+
for ds_id, processed_data_dir in enumerate(self.processed_data_dirs):
|
52 |
+
wav_suffix = '_wf0.wav'
|
53 |
+
txt_suffix = '.txt'
|
54 |
+
ph_suffix = '_ph.txt'
|
55 |
+
tg_suffix = '.TextGrid'
|
56 |
+
all_wav_pieces = glob.glob(f'{processed_data_dir}/*/*{wav_suffix}')
|
57 |
+
|
58 |
+
for piece_path in all_wav_pieces:
|
59 |
+
item_name = raw_item_name = piece_path[len(processed_data_dir)+1:].replace('/', '-')[:-len(wav_suffix)]
|
60 |
+
if len(self.processed_data_dirs) > 1:
|
61 |
+
item_name = f'ds{ds_id}_{item_name}'
|
62 |
+
self.item2txt[item_name] = open(f'{piece_path.replace(wav_suffix, txt_suffix)}').readline()
|
63 |
+
self.item2ph[item_name] = open(f'{piece_path.replace(wav_suffix, ph_suffix)}').readline()
|
64 |
+
self.item2wavfn[item_name] = piece_path
|
65 |
+
|
66 |
+
self.item2spk[item_name] = re.split('-|#', piece_path.split('/')[-2])[0]
|
67 |
+
if len(self.processed_data_dirs) > 1:
|
68 |
+
self.item2spk[item_name] = f"ds{ds_id}_{self.item2spk[item_name]}"
|
69 |
+
self.item2tgfn[item_name] = piece_path.replace(wav_suffix, tg_suffix)
|
70 |
+
print('spkers: ', set(self.item2spk.values()))
|
71 |
+
self.item_names = sorted(list(self.item2txt.keys()))
|
72 |
+
if self.binarization_args['shuffle']:
|
73 |
+
random.seed(1234)
|
74 |
+
random.shuffle(self.item_names)
|
75 |
+
self._train_item_names, self._test_item_names = self.split_train_test_set(self.item_names)
|
76 |
+
|
77 |
+
@property
|
78 |
+
def train_item_names(self):
|
79 |
+
return self._train_item_names
|
80 |
+
|
81 |
+
@property
|
82 |
+
def valid_item_names(self):
|
83 |
+
return self._test_item_names
|
84 |
+
|
85 |
+
@property
|
86 |
+
def test_item_names(self):
|
87 |
+
return self._test_item_names
|
88 |
+
|
89 |
+
def process(self):
|
90 |
+
self.load_meta_data()
|
91 |
+
os.makedirs(hparams['binary_data_dir'], exist_ok=True)
|
92 |
+
self.spk_map = self.build_spk_map()
|
93 |
+
print("| spk_map: ", self.spk_map)
|
94 |
+
spk_map_fn = f"{hparams['binary_data_dir']}/spk_map.json"
|
95 |
+
json.dump(self.spk_map, open(spk_map_fn, 'w'))
|
96 |
+
|
97 |
+
self.phone_encoder = self._phone_encoder()
|
98 |
+
self.process_data('valid')
|
99 |
+
self.process_data('test')
|
100 |
+
self.process_data('train')
|
101 |
+
|
102 |
+
def _phone_encoder(self):
|
103 |
+
ph_set_fn = f"{hparams['binary_data_dir']}/phone_set.json"
|
104 |
+
ph_set = []
|
105 |
+
if hparams['reset_phone_dict'] or not os.path.exists(ph_set_fn):
|
106 |
+
for ph_sent in self.item2ph.values():
|
107 |
+
ph_set += ph_sent.split(' ')
|
108 |
+
ph_set = sorted(set(ph_set))
|
109 |
+
json.dump(ph_set, open(ph_set_fn, 'w'))
|
110 |
+
print("| Build phone set: ", ph_set)
|
111 |
+
else:
|
112 |
+
ph_set = json.load(open(ph_set_fn, 'r'))
|
113 |
+
print("| Load phone set: ", ph_set)
|
114 |
+
return build_phone_encoder(hparams['binary_data_dir'])
|
115 |
+
|
116 |
+
# @staticmethod
|
117 |
+
# def get_pitch(wav_fn, spec, res):
|
118 |
+
# wav_suffix = '_wf0.wav'
|
119 |
+
# f0_suffix = '_f0.npy'
|
120 |
+
# f0fn = wav_fn.replace(wav_suffix, f0_suffix)
|
121 |
+
# pitch_info = np.load(f0fn)
|
122 |
+
# f0 = [x[1] for x in pitch_info]
|
123 |
+
# spec_x_coor = np.arange(0, 1, 1 / len(spec))[:len(spec)]
|
124 |
+
# f0_x_coor = np.arange(0, 1, 1 / len(f0))[:len(f0)]
|
125 |
+
# f0 = interp1d(f0_x_coor, f0, 'nearest', fill_value='extrapolate')(spec_x_coor)[:len(spec)]
|
126 |
+
# # f0_x_coor = np.arange(0, 1, 1 / len(f0))
|
127 |
+
# # f0_x_coor[-1] = 1
|
128 |
+
# # f0 = interp1d(f0_x_coor, f0, 'nearest')(spec_x_coor)[:len(spec)]
|
129 |
+
# if sum(f0) == 0:
|
130 |
+
# raise BinarizationError("Empty f0")
|
131 |
+
# assert len(f0) == len(spec), (len(f0), len(spec))
|
132 |
+
# pitch_coarse = f0_to_coarse(f0)
|
133 |
+
#
|
134 |
+
# # vis f0
|
135 |
+
# # import matplotlib.pyplot as plt
|
136 |
+
# # from textgrid import TextGrid
|
137 |
+
# # tg_fn = wav_fn.replace(wav_suffix, '.TextGrid')
|
138 |
+
# # fig = plt.figure(figsize=(12, 6))
|
139 |
+
# # plt.pcolor(spec.T, vmin=-5, vmax=0)
|
140 |
+
# # ax = plt.gca()
|
141 |
+
# # ax2 = ax.twinx()
|
142 |
+
# # ax2.plot(f0, color='red')
|
143 |
+
# # ax2.set_ylim(0, 800)
|
144 |
+
# # itvs = TextGrid.fromFile(tg_fn)[0]
|
145 |
+
# # for itv in itvs:
|
146 |
+
# # x = itv.maxTime * hparams['audio_sample_rate'] / hparams['hop_size']
|
147 |
+
# # plt.vlines(x=x, ymin=0, ymax=80, color='black')
|
148 |
+
# # plt.text(x=x, y=20, s=itv.mark, color='black')
|
149 |
+
# # plt.savefig('tmp/20211229_singing_plots_test.png')
|
150 |
+
#
|
151 |
+
# res['f0'] = f0
|
152 |
+
# res['pitch'] = pitch_coarse
|
153 |
+
|
154 |
+
@classmethod
|
155 |
+
def process_item(cls, item_name, ph, txt, tg_fn, wav_fn, spk_id, encoder, binarization_args):
|
156 |
+
if hparams['vocoder'] in VOCODERS:
|
157 |
+
wav, mel = VOCODERS[hparams['vocoder']].wav2spec(wav_fn)
|
158 |
+
else:
|
159 |
+
wav, mel = VOCODERS[hparams['vocoder'].split('.')[-1]].wav2spec(wav_fn)
|
160 |
+
res = {
|
161 |
+
'item_name': item_name, 'txt': txt, 'ph': ph, 'mel': mel, 'wav': wav, 'wav_fn': wav_fn,
|
162 |
+
'sec': len(wav) / hparams['audio_sample_rate'], 'len': mel.shape[0], 'spk_id': spk_id
|
163 |
+
}
|
164 |
+
try:
|
165 |
+
if binarization_args['with_f0']:
|
166 |
+
# cls.get_pitch(wav_fn, mel, res)
|
167 |
+
cls.get_pitch(wav, mel, res)
|
168 |
+
if binarization_args['with_txt']:
|
169 |
+
try:
|
170 |
+
# print(ph)
|
171 |
+
phone_encoded = res['phone'] = encoder.encode(ph)
|
172 |
+
except:
|
173 |
+
traceback.print_exc()
|
174 |
+
raise BinarizationError(f"Empty phoneme")
|
175 |
+
if binarization_args['with_align']:
|
176 |
+
cls.get_align(tg_fn, ph, mel, phone_encoded, res)
|
177 |
+
except BinarizationError as e:
|
178 |
+
print(f"| Skip item ({e}). item_name: {item_name}, wav_fn: {wav_fn}")
|
179 |
+
return None
|
180 |
+
return res
|
181 |
+
|
182 |
+
|
183 |
+
class MidiSingingBinarizer(SingingBinarizer):
|
184 |
+
item2midi = {}
|
185 |
+
item2midi_dur = {}
|
186 |
+
item2is_slur = {}
|
187 |
+
item2ph_durs = {}
|
188 |
+
item2wdb = {}
|
189 |
+
|
190 |
+
def load_meta_data(self):
|
191 |
+
for ds_id, processed_data_dir in enumerate(self.processed_data_dirs):
|
192 |
+
meta_midi = json.load(open(os.path.join(processed_data_dir, 'meta.json'))) # [list of dict]
|
193 |
+
|
194 |
+
for song_item in meta_midi:
|
195 |
+
item_name = raw_item_name = song_item['item_name']
|
196 |
+
if len(self.processed_data_dirs) > 1:
|
197 |
+
item_name = f'ds{ds_id}_{item_name}'
|
198 |
+
self.item2wavfn[item_name] = song_item['wav_fn']
|
199 |
+
self.item2txt[item_name] = song_item['txt']
|
200 |
+
|
201 |
+
self.item2ph[item_name] = ' '.join(song_item['phs'])
|
202 |
+
self.item2wdb[item_name] = [1 if x in ALL_YUNMU + ['AP', 'SP', '<SIL>'] else 0 for x in song_item['phs']]
|
203 |
+
self.item2ph_durs[item_name] = song_item['ph_dur']
|
204 |
+
|
205 |
+
self.item2midi[item_name] = song_item['notes']
|
206 |
+
self.item2midi_dur[item_name] = song_item['notes_dur']
|
207 |
+
self.item2is_slur[item_name] = song_item['is_slur']
|
208 |
+
self.item2spk[item_name] = 'pop-cs'
|
209 |
+
if len(self.processed_data_dirs) > 1:
|
210 |
+
self.item2spk[item_name] = f"ds{ds_id}_{self.item2spk[item_name]}"
|
211 |
+
|
212 |
+
print('spkers: ', set(self.item2spk.values()))
|
213 |
+
self.item_names = sorted(list(self.item2txt.keys()))
|
214 |
+
if self.binarization_args['shuffle']:
|
215 |
+
random.seed(1234)
|
216 |
+
random.shuffle(self.item_names)
|
217 |
+
self._train_item_names, self._test_item_names = self.split_train_test_set(self.item_names)
|
218 |
+
|
219 |
+
@staticmethod
|
220 |
+
def get_pitch(wav_fn, wav, spec, ph, res):
|
221 |
+
wav_suffix = '.wav'
|
222 |
+
# midi_suffix = '.mid'
|
223 |
+
wav_dir = 'wavs'
|
224 |
+
f0_dir = 'f0'
|
225 |
+
|
226 |
+
item_name = '/'.join(os.path.splitext(wav_fn)[0].split('/')[-2:]).replace('_wf0', '')
|
227 |
+
res['pitch_midi'] = np.asarray(MidiSingingBinarizer.item2midi[item_name])
|
228 |
+
res['midi_dur'] = np.asarray(MidiSingingBinarizer.item2midi_dur[item_name])
|
229 |
+
res['is_slur'] = np.asarray(MidiSingingBinarizer.item2is_slur[item_name])
|
230 |
+
res['word_boundary'] = np.asarray(MidiSingingBinarizer.item2wdb[item_name])
|
231 |
+
assert res['pitch_midi'].shape == res['midi_dur'].shape == res['is_slur'].shape, (
|
232 |
+
res['pitch_midi'].shape, res['midi_dur'].shape, res['is_slur'].shape)
|
233 |
+
|
234 |
+
# gt f0.
|
235 |
+
gt_f0, gt_pitch_coarse = get_pitch(wav, spec, hparams)
|
236 |
+
if sum(gt_f0) == 0:
|
237 |
+
raise BinarizationError("Empty **gt** f0")
|
238 |
+
res['f0'] = gt_f0
|
239 |
+
res['pitch'] = gt_pitch_coarse
|
240 |
+
|
241 |
+
@staticmethod
|
242 |
+
def get_align(ph_durs, mel, phone_encoded, res, hop_size=hparams['hop_size'], audio_sample_rate=hparams['audio_sample_rate']):
|
243 |
+
mel2ph = np.zeros([mel.shape[0]], int)
|
244 |
+
startTime = 0
|
245 |
+
|
246 |
+
for i_ph in range(len(ph_durs)):
|
247 |
+
start_frame = int(startTime * audio_sample_rate / hop_size + 0.5)
|
248 |
+
end_frame = int((startTime + ph_durs[i_ph]) * audio_sample_rate / hop_size + 0.5)
|
249 |
+
mel2ph[start_frame:end_frame] = i_ph + 1
|
250 |
+
startTime = startTime + ph_durs[i_ph]
|
251 |
+
|
252 |
+
# print('ph durs: ', ph_durs)
|
253 |
+
# print('mel2ph: ', mel2ph, len(mel2ph))
|
254 |
+
res['mel2ph'] = mel2ph
|
255 |
+
# res['dur'] = None
|
256 |
+
|
257 |
+
@classmethod
|
258 |
+
def process_item(cls, item_name, ph, txt, tg_fn, wav_fn, spk_id, encoder, binarization_args):
|
259 |
+
if hparams['vocoder'] in VOCODERS:
|
260 |
+
wav, mel = VOCODERS[hparams['vocoder']].wav2spec(wav_fn)
|
261 |
+
else:
|
262 |
+
wav, mel = VOCODERS[hparams['vocoder'].split('.')[-1]].wav2spec(wav_fn)
|
263 |
+
res = {
|
264 |
+
'item_name': item_name, 'txt': txt, 'ph': ph, 'mel': mel, 'wav': wav, 'wav_fn': wav_fn,
|
265 |
+
'sec': len(wav) / hparams['audio_sample_rate'], 'len': mel.shape[0], 'spk_id': spk_id
|
266 |
+
}
|
267 |
+
try:
|
268 |
+
if binarization_args['with_f0']:
|
269 |
+
cls.get_pitch(wav_fn, wav, mel, ph, res)
|
270 |
+
if binarization_args['with_txt']:
|
271 |
+
try:
|
272 |
+
phone_encoded = res['phone'] = encoder.encode(ph)
|
273 |
+
except:
|
274 |
+
traceback.print_exc()
|
275 |
+
raise BinarizationError(f"Empty phoneme")
|
276 |
+
if binarization_args['with_align']:
|
277 |
+
cls.get_align(MidiSingingBinarizer.item2ph_durs[item_name], mel, phone_encoded, res)
|
278 |
+
except BinarizationError as e:
|
279 |
+
print(f"| Skip item ({e}). item_name: {item_name}, wav_fn: {wav_fn}")
|
280 |
+
return None
|
281 |
+
return res
|
282 |
+
|
283 |
+
|
284 |
+
class ZhSingingBinarizer(ZhBinarizer, SingingBinarizer):
|
285 |
+
pass
|
286 |
+
|
287 |
+
|
288 |
+
class OpencpopBinarizer(MidiSingingBinarizer):
|
289 |
+
item2midi = {}
|
290 |
+
item2midi_dur = {}
|
291 |
+
item2is_slur = {}
|
292 |
+
item2ph_durs = {}
|
293 |
+
item2wdb = {}
|
294 |
+
|
295 |
+
def split_train_test_set(self, item_names):
|
296 |
+
item_names = deepcopy(item_names)
|
297 |
+
test_item_names = [x for x in item_names if any([x.startswith(ts) for ts in hparams['test_prefixes']])]
|
298 |
+
train_item_names = [x for x in item_names if x not in set(test_item_names)]
|
299 |
+
logging.info("train {}".format(len(train_item_names)))
|
300 |
+
logging.info("test {}".format(len(test_item_names)))
|
301 |
+
return train_item_names, test_item_names
|
302 |
+
|
303 |
+
def load_meta_data(self):
|
304 |
+
raw_data_dir = hparams['raw_data_dir']
|
305 |
+
# meta_midi = json.load(open(os.path.join(raw_data_dir, 'meta.json'))) # [list of dict]
|
306 |
+
utterance_labels = open(os.path.join(raw_data_dir, 'transcriptions.txt')).readlines()
|
307 |
+
|
308 |
+
for utterance_label in utterance_labels:
|
309 |
+
song_info = utterance_label.split('|')
|
310 |
+
item_name = raw_item_name = song_info[0]
|
311 |
+
self.item2wavfn[item_name] = f'{raw_data_dir}/wavs/{item_name}.wav'
|
312 |
+
self.item2txt[item_name] = song_info[1]
|
313 |
+
|
314 |
+
self.item2ph[item_name] = song_info[2]
|
315 |
+
# self.item2wdb[item_name] = list(np.nonzero([1 if x in ALL_YUNMU + ['AP', 'SP'] else 0 for x in song_info[2].split()])[0])
|
316 |
+
self.item2wdb[item_name] = [1 if x in ALL_YUNMU + ['AP', 'SP'] else 0 for x in song_info[2].split()]
|
317 |
+
self.item2ph_durs[item_name] = [float(x) for x in song_info[5].split(" ")]
|
318 |
+
|
319 |
+
self.item2midi[item_name] = [librosa.note_to_midi(x.split("/")[0]) if x != 'rest' else 0
|
320 |
+
for x in song_info[3].split(" ")]
|
321 |
+
self.item2midi_dur[item_name] = [float(x) for x in song_info[4].split(" ")]
|
322 |
+
self.item2is_slur[item_name] = [int(x) for x in song_info[6].split(" ")]
|
323 |
+
self.item2spk[item_name] = 'opencpop'
|
324 |
+
|
325 |
+
print('spkers: ', set(self.item2spk.values()))
|
326 |
+
self.item_names = sorted(list(self.item2txt.keys()))
|
327 |
+
if self.binarization_args['shuffle']:
|
328 |
+
random.seed(1234)
|
329 |
+
random.shuffle(self.item_names)
|
330 |
+
self._train_item_names, self._test_item_names = self.split_train_test_set(self.item_names)
|
331 |
+
|
332 |
+
@staticmethod
|
333 |
+
def get_pitch(wav_fn, wav, spec, ph, res):
|
334 |
+
wav_suffix = '.wav'
|
335 |
+
# midi_suffix = '.mid'
|
336 |
+
wav_dir = 'wavs'
|
337 |
+
f0_dir = 'text_f0_align'
|
338 |
+
|
339 |
+
item_name = os.path.splitext(os.path.basename(wav_fn))[0]
|
340 |
+
res['pitch_midi'] = np.asarray(OpencpopBinarizer.item2midi[item_name])
|
341 |
+
res['midi_dur'] = np.asarray(OpencpopBinarizer.item2midi_dur[item_name])
|
342 |
+
res['is_slur'] = np.asarray(OpencpopBinarizer.item2is_slur[item_name])
|
343 |
+
res['word_boundary'] = np.asarray(OpencpopBinarizer.item2wdb[item_name])
|
344 |
+
assert res['pitch_midi'].shape == res['midi_dur'].shape == res['is_slur'].shape, (res['pitch_midi'].shape, res['midi_dur'].shape, res['is_slur'].shape)
|
345 |
+
|
346 |
+
# gt f0.
|
347 |
+
# f0 = None
|
348 |
+
# f0_suffix = '_f0.npy'
|
349 |
+
# f0fn = wav_fn.replace(wav_suffix, f0_suffix).replace(wav_dir, f0_dir)
|
350 |
+
# pitch_info = np.load(f0fn)
|
351 |
+
# f0 = [x[1] for x in pitch_info]
|
352 |
+
# spec_x_coor = np.arange(0, 1, 1 / len(spec))[:len(spec)]
|
353 |
+
#
|
354 |
+
# f0_x_coor = np.arange(0, 1, 1 / len(f0))[:len(f0)]
|
355 |
+
# f0 = interp1d(f0_x_coor, f0, 'nearest', fill_value='extrapolate')(spec_x_coor)[:len(spec)]
|
356 |
+
# if sum(f0) == 0:
|
357 |
+
# raise BinarizationError("Empty **gt** f0")
|
358 |
+
#
|
359 |
+
# pitch_coarse = f0_to_coarse(f0)
|
360 |
+
# res['f0'] = f0
|
361 |
+
# res['pitch'] = pitch_coarse
|
362 |
+
|
363 |
+
# gt f0.
|
364 |
+
gt_f0, gt_pitch_coarse = get_pitch(wav, spec, hparams)
|
365 |
+
if sum(gt_f0) == 0:
|
366 |
+
raise BinarizationError("Empty **gt** f0")
|
367 |
+
res['f0'] = gt_f0
|
368 |
+
res['pitch'] = gt_pitch_coarse
|
369 |
+
|
370 |
+
@classmethod
|
371 |
+
def process_item(cls, item_name, ph, txt, tg_fn, wav_fn, spk_id, encoder, binarization_args):
|
372 |
+
if hparams['vocoder'] in VOCODERS:
|
373 |
+
wav, mel = VOCODERS[hparams['vocoder']].wav2spec(wav_fn)
|
374 |
+
else:
|
375 |
+
wav, mel = VOCODERS[hparams['vocoder'].split('.')[-1]].wav2spec(wav_fn)
|
376 |
+
res = {
|
377 |
+
'item_name': item_name, 'txt': txt, 'ph': ph, 'mel': mel, 'wav': wav, 'wav_fn': wav_fn,
|
378 |
+
'sec': len(wav) / hparams['audio_sample_rate'], 'len': mel.shape[0], 'spk_id': spk_id
|
379 |
+
}
|
380 |
+
try:
|
381 |
+
if binarization_args['with_f0']:
|
382 |
+
cls.get_pitch(wav_fn, wav, mel, ph, res)
|
383 |
+
if binarization_args['with_txt']:
|
384 |
+
try:
|
385 |
+
phone_encoded = res['phone'] = encoder.encode(ph)
|
386 |
+
except:
|
387 |
+
traceback.print_exc()
|
388 |
+
raise BinarizationError(f"Empty phoneme")
|
389 |
+
if binarization_args['with_align']:
|
390 |
+
cls.get_align(OpencpopBinarizer.item2ph_durs[item_name], mel, phone_encoded, res)
|
391 |
+
except BinarizationError as e:
|
392 |
+
print(f"| Skip item ({e}). item_name: {item_name}, wav_fn: {wav_fn}")
|
393 |
+
return None
|
394 |
+
return res
|
395 |
+
|
396 |
+
|
397 |
+
if __name__ == "__main__":
|
398 |
+
SingingBinarizer().process()
|
data_gen/tts/base_binarizer.py
ADDED
@@ -0,0 +1,224 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import os
|
2 |
+
os.environ["OMP_NUM_THREADS"] = "1"
|
3 |
+
|
4 |
+
from utils.multiprocess_utils import chunked_multiprocess_run
|
5 |
+
import random
|
6 |
+
import traceback
|
7 |
+
import json
|
8 |
+
from resemblyzer import VoiceEncoder
|
9 |
+
from tqdm import tqdm
|
10 |
+
from data_gen.tts.data_gen_utils import get_mel2ph, get_pitch, build_phone_encoder
|
11 |
+
from utils.hparams import set_hparams, hparams
|
12 |
+
import numpy as np
|
13 |
+
from utils.indexed_datasets import IndexedDatasetBuilder
|
14 |
+
from vocoders.base_vocoder import VOCODERS
|
15 |
+
import pandas as pd
|
16 |
+
|
17 |
+
|
18 |
+
class BinarizationError(Exception):
|
19 |
+
pass
|
20 |
+
|
21 |
+
|
22 |
+
class BaseBinarizer:
|
23 |
+
def __init__(self, processed_data_dir=None):
|
24 |
+
if processed_data_dir is None:
|
25 |
+
processed_data_dir = hparams['processed_data_dir']
|
26 |
+
self.processed_data_dirs = processed_data_dir.split(",")
|
27 |
+
self.binarization_args = hparams['binarization_args']
|
28 |
+
self.pre_align_args = hparams['pre_align_args']
|
29 |
+
self.forced_align = self.pre_align_args['forced_align']
|
30 |
+
tg_dir = None
|
31 |
+
if self.forced_align == 'mfa':
|
32 |
+
tg_dir = 'mfa_outputs'
|
33 |
+
if self.forced_align == 'kaldi':
|
34 |
+
tg_dir = 'kaldi_outputs'
|
35 |
+
self.item2txt = {}
|
36 |
+
self.item2ph = {}
|
37 |
+
self.item2wavfn = {}
|
38 |
+
self.item2tgfn = {}
|
39 |
+
self.item2spk = {}
|
40 |
+
for ds_id, processed_data_dir in enumerate(self.processed_data_dirs):
|
41 |
+
self.meta_df = pd.read_csv(f"{processed_data_dir}/metadata_phone.csv", dtype=str)
|
42 |
+
for r_idx, r in self.meta_df.iterrows():
|
43 |
+
item_name = raw_item_name = r['item_name']
|
44 |
+
if len(self.processed_data_dirs) > 1:
|
45 |
+
item_name = f'ds{ds_id}_{item_name}'
|
46 |
+
self.item2txt[item_name] = r['txt']
|
47 |
+
self.item2ph[item_name] = r['ph']
|
48 |
+
self.item2wavfn[item_name] = os.path.join(hparams['raw_data_dir'], 'wavs', os.path.basename(r['wav_fn']).split('_')[1])
|
49 |
+
self.item2spk[item_name] = r.get('spk', 'SPK1')
|
50 |
+
if len(self.processed_data_dirs) > 1:
|
51 |
+
self.item2spk[item_name] = f"ds{ds_id}_{self.item2spk[item_name]}"
|
52 |
+
if tg_dir is not None:
|
53 |
+
self.item2tgfn[item_name] = f"{processed_data_dir}/{tg_dir}/{raw_item_name}.TextGrid"
|
54 |
+
self.item_names = sorted(list(self.item2txt.keys()))
|
55 |
+
if self.binarization_args['shuffle']:
|
56 |
+
random.seed(1234)
|
57 |
+
random.shuffle(self.item_names)
|
58 |
+
|
59 |
+
@property
|
60 |
+
def train_item_names(self):
|
61 |
+
return self.item_names[hparams['test_num']+hparams['valid_num']:]
|
62 |
+
|
63 |
+
@property
|
64 |
+
def valid_item_names(self):
|
65 |
+
return self.item_names[0: hparams['test_num']+hparams['valid_num']] #
|
66 |
+
|
67 |
+
@property
|
68 |
+
def test_item_names(self):
|
69 |
+
return self.item_names[0: hparams['test_num']] # Audios for MOS testing are in 'test_ids'
|
70 |
+
|
71 |
+
def build_spk_map(self):
|
72 |
+
spk_map = set()
|
73 |
+
for item_name in self.item_names:
|
74 |
+
spk_name = self.item2spk[item_name]
|
75 |
+
spk_map.add(spk_name)
|
76 |
+
spk_map = {x: i for i, x in enumerate(sorted(list(spk_map)))}
|
77 |
+
assert len(spk_map) == 0 or len(spk_map) <= hparams['num_spk'], len(spk_map)
|
78 |
+
return spk_map
|
79 |
+
|
80 |
+
def item_name2spk_id(self, item_name):
|
81 |
+
return self.spk_map[self.item2spk[item_name]]
|
82 |
+
|
83 |
+
def _phone_encoder(self):
|
84 |
+
ph_set_fn = f"{hparams['binary_data_dir']}/phone_set.json"
|
85 |
+
ph_set = []
|
86 |
+
if hparams['reset_phone_dict'] or not os.path.exists(ph_set_fn):
|
87 |
+
for processed_data_dir in self.processed_data_dirs:
|
88 |
+
ph_set += [x.split(' ')[0] for x in open(f'{processed_data_dir}/dict.txt').readlines()]
|
89 |
+
ph_set = sorted(set(ph_set))
|
90 |
+
json.dump(ph_set, open(ph_set_fn, 'w'))
|
91 |
+
else:
|
92 |
+
ph_set = json.load(open(ph_set_fn, 'r'))
|
93 |
+
print("| phone set: ", ph_set)
|
94 |
+
return build_phone_encoder(hparams['binary_data_dir'])
|
95 |
+
|
96 |
+
def meta_data(self, prefix):
|
97 |
+
if prefix == 'valid':
|
98 |
+
item_names = self.valid_item_names
|
99 |
+
elif prefix == 'test':
|
100 |
+
item_names = self.test_item_names
|
101 |
+
else:
|
102 |
+
item_names = self.train_item_names
|
103 |
+
for item_name in item_names:
|
104 |
+
ph = self.item2ph[item_name]
|
105 |
+
txt = self.item2txt[item_name]
|
106 |
+
tg_fn = self.item2tgfn.get(item_name)
|
107 |
+
wav_fn = self.item2wavfn[item_name]
|
108 |
+
spk_id = self.item_name2spk_id(item_name)
|
109 |
+
yield item_name, ph, txt, tg_fn, wav_fn, spk_id
|
110 |
+
|
111 |
+
def process(self):
|
112 |
+
os.makedirs(hparams['binary_data_dir'], exist_ok=True)
|
113 |
+
self.spk_map = self.build_spk_map()
|
114 |
+
print("| spk_map: ", self.spk_map)
|
115 |
+
spk_map_fn = f"{hparams['binary_data_dir']}/spk_map.json"
|
116 |
+
json.dump(self.spk_map, open(spk_map_fn, 'w'))
|
117 |
+
|
118 |
+
self.phone_encoder = self._phone_encoder()
|
119 |
+
self.process_data('valid')
|
120 |
+
self.process_data('test')
|
121 |
+
self.process_data('train')
|
122 |
+
|
123 |
+
def process_data(self, prefix):
|
124 |
+
data_dir = hparams['binary_data_dir']
|
125 |
+
args = []
|
126 |
+
builder = IndexedDatasetBuilder(f'{data_dir}/{prefix}')
|
127 |
+
lengths = []
|
128 |
+
f0s = []
|
129 |
+
total_sec = 0
|
130 |
+
if self.binarization_args['with_spk_embed']:
|
131 |
+
voice_encoder = VoiceEncoder().cuda()
|
132 |
+
|
133 |
+
meta_data = list(self.meta_data(prefix))
|
134 |
+
for m in meta_data:
|
135 |
+
args.append(list(m) + [self.phone_encoder, self.binarization_args])
|
136 |
+
num_workers = int(os.getenv('N_PROC', os.cpu_count() // 3))
|
137 |
+
for f_id, (_, item) in enumerate(
|
138 |
+
zip(tqdm(meta_data), chunked_multiprocess_run(self.process_item, args, num_workers=num_workers))):
|
139 |
+
if item is None:
|
140 |
+
continue
|
141 |
+
item['spk_embed'] = voice_encoder.embed_utterance(item['wav']) \
|
142 |
+
if self.binarization_args['with_spk_embed'] else None
|
143 |
+
if not self.binarization_args['with_wav'] and 'wav' in item:
|
144 |
+
print("del wav")
|
145 |
+
del item['wav']
|
146 |
+
builder.add_item(item)
|
147 |
+
lengths.append(item['len'])
|
148 |
+
total_sec += item['sec']
|
149 |
+
if item.get('f0') is not None:
|
150 |
+
f0s.append(item['f0'])
|
151 |
+
builder.finalize()
|
152 |
+
np.save(f'{data_dir}/{prefix}_lengths.npy', lengths)
|
153 |
+
if len(f0s) > 0:
|
154 |
+
f0s = np.concatenate(f0s, 0)
|
155 |
+
f0s = f0s[f0s != 0]
|
156 |
+
np.save(f'{data_dir}/{prefix}_f0s_mean_std.npy', [np.mean(f0s).item(), np.std(f0s).item()])
|
157 |
+
print(f"| {prefix} total duration: {total_sec:.3f}s")
|
158 |
+
|
159 |
+
@classmethod
|
160 |
+
def process_item(cls, item_name, ph, txt, tg_fn, wav_fn, spk_id, encoder, binarization_args):
|
161 |
+
if hparams['vocoder'] in VOCODERS:
|
162 |
+
wav, mel = VOCODERS[hparams['vocoder']].wav2spec(wav_fn)
|
163 |
+
else:
|
164 |
+
wav, mel = VOCODERS[hparams['vocoder'].split('.')[-1]].wav2spec(wav_fn)
|
165 |
+
res = {
|
166 |
+
'item_name': item_name, 'txt': txt, 'ph': ph, 'mel': mel, 'wav': wav, 'wav_fn': wav_fn,
|
167 |
+
'sec': len(wav) / hparams['audio_sample_rate'], 'len': mel.shape[0], 'spk_id': spk_id
|
168 |
+
}
|
169 |
+
try:
|
170 |
+
if binarization_args['with_f0']:
|
171 |
+
cls.get_pitch(wav, mel, res)
|
172 |
+
if binarization_args['with_f0cwt']:
|
173 |
+
cls.get_f0cwt(res['f0'], res)
|
174 |
+
if binarization_args['with_txt']:
|
175 |
+
try:
|
176 |
+
phone_encoded = res['phone'] = encoder.encode(ph)
|
177 |
+
except:
|
178 |
+
traceback.print_exc()
|
179 |
+
raise BinarizationError(f"Empty phoneme")
|
180 |
+
if binarization_args['with_align']:
|
181 |
+
cls.get_align(tg_fn, ph, mel, phone_encoded, res)
|
182 |
+
except BinarizationError as e:
|
183 |
+
print(f"| Skip item ({e}). item_name: {item_name}, wav_fn: {wav_fn}")
|
184 |
+
return None
|
185 |
+
return res
|
186 |
+
|
187 |
+
@staticmethod
|
188 |
+
def get_align(tg_fn, ph, mel, phone_encoded, res):
|
189 |
+
if tg_fn is not None and os.path.exists(tg_fn):
|
190 |
+
mel2ph, dur = get_mel2ph(tg_fn, ph, mel, hparams)
|
191 |
+
else:
|
192 |
+
raise BinarizationError(f"Align not found")
|
193 |
+
if mel2ph.max() - 1 >= len(phone_encoded):
|
194 |
+
raise BinarizationError(
|
195 |
+
f"Align does not match: mel2ph.max() - 1: {mel2ph.max() - 1}, len(phone_encoded): {len(phone_encoded)}")
|
196 |
+
res['mel2ph'] = mel2ph
|
197 |
+
res['dur'] = dur
|
198 |
+
|
199 |
+
@staticmethod
|
200 |
+
def get_pitch(wav, mel, res):
|
201 |
+
f0, pitch_coarse = get_pitch(wav, mel, hparams)
|
202 |
+
if sum(f0) == 0:
|
203 |
+
raise BinarizationError("Empty f0")
|
204 |
+
res['f0'] = f0
|
205 |
+
res['pitch'] = pitch_coarse
|
206 |
+
|
207 |
+
@staticmethod
|
208 |
+
def get_f0cwt(f0, res):
|
209 |
+
from utils.cwt import get_cont_lf0, get_lf0_cwt
|
210 |
+
uv, cont_lf0_lpf = get_cont_lf0(f0)
|
211 |
+
logf0s_mean_org, logf0s_std_org = np.mean(cont_lf0_lpf), np.std(cont_lf0_lpf)
|
212 |
+
cont_lf0_lpf_norm = (cont_lf0_lpf - logf0s_mean_org) / logf0s_std_org
|
213 |
+
Wavelet_lf0, scales = get_lf0_cwt(cont_lf0_lpf_norm)
|
214 |
+
if np.any(np.isnan(Wavelet_lf0)):
|
215 |
+
raise BinarizationError("NaN CWT")
|
216 |
+
res['cwt_spec'] = Wavelet_lf0
|
217 |
+
res['cwt_scales'] = scales
|
218 |
+
res['f0_mean'] = logf0s_mean_org
|
219 |
+
res['f0_std'] = logf0s_std_org
|
220 |
+
|
221 |
+
|
222 |
+
if __name__ == "__main__":
|
223 |
+
set_hparams()
|
224 |
+
BaseBinarizer().process()
|
data_gen/tts/bin/binarize.py
ADDED
@@ -0,0 +1,20 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import os
|
2 |
+
|
3 |
+
os.environ["OMP_NUM_THREADS"] = "1"
|
4 |
+
|
5 |
+
import importlib
|
6 |
+
from utils.hparams import set_hparams, hparams
|
7 |
+
|
8 |
+
|
9 |
+
def binarize():
|
10 |
+
binarizer_cls = hparams.get("binarizer_cls", 'data_gen.tts.base_binarizer.BaseBinarizer')
|
11 |
+
pkg = ".".join(binarizer_cls.split(".")[:-1])
|
12 |
+
cls_name = binarizer_cls.split(".")[-1]
|
13 |
+
binarizer_cls = getattr(importlib.import_module(pkg), cls_name)
|
14 |
+
print("| Binarizer: ", binarizer_cls)
|
15 |
+
binarizer_cls().process()
|
16 |
+
|
17 |
+
|
18 |
+
if __name__ == '__main__':
|
19 |
+
set_hparams()
|
20 |
+
binarize()
|
data_gen/tts/binarizer_zh.py
ADDED
@@ -0,0 +1,59 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import os
|
2 |
+
|
3 |
+
os.environ["OMP_NUM_THREADS"] = "1"
|
4 |
+
|
5 |
+
from data_gen.tts.txt_processors.zh_g2pM import ALL_SHENMU
|
6 |
+
from data_gen.tts.base_binarizer import BaseBinarizer, BinarizationError
|
7 |
+
from data_gen.tts.data_gen_utils import get_mel2ph
|
8 |
+
from utils.hparams import set_hparams, hparams
|
9 |
+
import numpy as np
|
10 |
+
|
11 |
+
|
12 |
+
class ZhBinarizer(BaseBinarizer):
|
13 |
+
@staticmethod
|
14 |
+
def get_align(tg_fn, ph, mel, phone_encoded, res):
|
15 |
+
if tg_fn is not None and os.path.exists(tg_fn):
|
16 |
+
_, dur = get_mel2ph(tg_fn, ph, mel, hparams)
|
17 |
+
else:
|
18 |
+
raise BinarizationError(f"Align not found")
|
19 |
+
ph_list = ph.split(" ")
|
20 |
+
assert len(dur) == len(ph_list)
|
21 |
+
mel2ph = []
|
22 |
+
# 分隔符的时长分配给韵母
|
23 |
+
dur_cumsum = np.pad(np.cumsum(dur), [1, 0], mode='constant', constant_values=0)
|
24 |
+
for i in range(len(dur)):
|
25 |
+
p = ph_list[i]
|
26 |
+
if p[0] != '<' and not p[0].isalpha():
|
27 |
+
uv_ = res['f0'][dur_cumsum[i]:dur_cumsum[i + 1]] == 0
|
28 |
+
j = 0
|
29 |
+
while j < len(uv_) and not uv_[j]:
|
30 |
+
j += 1
|
31 |
+
dur[i - 1] += j
|
32 |
+
dur[i] -= j
|
33 |
+
if dur[i] < 100:
|
34 |
+
dur[i - 1] += dur[i]
|
35 |
+
dur[i] = 0
|
36 |
+
# 声母和韵母等长
|
37 |
+
for i in range(len(dur)):
|
38 |
+
p = ph_list[i]
|
39 |
+
if p in ALL_SHENMU:
|
40 |
+
p_next = ph_list[i + 1]
|
41 |
+
if not (dur[i] > 0 and p_next[0].isalpha() and p_next not in ALL_SHENMU):
|
42 |
+
print(f"assert dur[i] > 0 and p_next[0].isalpha() and p_next not in ALL_SHENMU, "
|
43 |
+
f"dur[i]: {dur[i]}, p: {p}, p_next: {p_next}.")
|
44 |
+
continue
|
45 |
+
total = dur[i + 1] + dur[i]
|
46 |
+
dur[i] = total // 2
|
47 |
+
dur[i + 1] = total - dur[i]
|
48 |
+
for i in range(len(dur)):
|
49 |
+
mel2ph += [i + 1] * dur[i]
|
50 |
+
mel2ph = np.array(mel2ph)
|
51 |
+
if mel2ph.max() - 1 >= len(phone_encoded):
|
52 |
+
raise BinarizationError(f"| Align does not match: {(mel2ph.max() - 1, len(phone_encoded))}")
|
53 |
+
res['mel2ph'] = mel2ph
|
54 |
+
res['dur'] = dur
|
55 |
+
|
56 |
+
|
57 |
+
if __name__ == "__main__":
|
58 |
+
set_hparams()
|
59 |
+
ZhBinarizer().process()
|
data_gen/tts/data_gen_utils.py
ADDED
@@ -0,0 +1,347 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import warnings
|
2 |
+
|
3 |
+
warnings.filterwarnings("ignore")
|
4 |
+
|
5 |
+
import parselmouth
|
6 |
+
import os
|
7 |
+
import torch
|
8 |
+
from skimage.transform import resize
|
9 |
+
from utils.text_encoder import TokenTextEncoder
|
10 |
+
from utils.pitch_utils import f0_to_coarse
|
11 |
+
import struct
|
12 |
+
import webrtcvad
|
13 |
+
from scipy.ndimage.morphology import binary_dilation
|
14 |
+
import librosa
|
15 |
+
import numpy as np
|
16 |
+
from utils import audio
|
17 |
+
import pyloudnorm as pyln
|
18 |
+
import re
|
19 |
+
import json
|
20 |
+
from collections import OrderedDict
|
21 |
+
|
22 |
+
PUNCS = '!,.?;:'
|
23 |
+
|
24 |
+
int16_max = (2 ** 15) - 1
|
25 |
+
|
26 |
+
|
27 |
+
def trim_long_silences(path, sr=None, return_raw_wav=False, norm=True, vad_max_silence_length=12):
|
28 |
+
"""
|
29 |
+
Ensures that segments without voice in the waveform remain no longer than a
|
30 |
+
threshold determined by the VAD parameters in params.py.
|
31 |
+
:param wav: the raw waveform as a numpy array of floats
|
32 |
+
:param vad_max_silence_length: Maximum number of consecutive silent frames a segment can have.
|
33 |
+
:return: the same waveform with silences trimmed away (length <= original wav length)
|
34 |
+
"""
|
35 |
+
|
36 |
+
## Voice Activation Detection
|
37 |
+
# Window size of the VAD. Must be either 10, 20 or 30 milliseconds.
|
38 |
+
# This sets the granularity of the VAD. Should not need to be changed.
|
39 |
+
sampling_rate = 16000
|
40 |
+
wav_raw, sr = librosa.core.load(path, sr=sr)
|
41 |
+
|
42 |
+
if norm:
|
43 |
+
meter = pyln.Meter(sr) # create BS.1770 meter
|
44 |
+
loudness = meter.integrated_loudness(wav_raw)
|
45 |
+
wav_raw = pyln.normalize.loudness(wav_raw, loudness, -20.0)
|
46 |
+
if np.abs(wav_raw).max() > 1.0:
|
47 |
+
wav_raw = wav_raw / np.abs(wav_raw).max()
|
48 |
+
|
49 |
+
wav = librosa.resample(wav_raw, sr, sampling_rate, res_type='kaiser_best')
|
50 |
+
|
51 |
+
vad_window_length = 30 # In milliseconds
|
52 |
+
# Number of frames to average together when performing the moving average smoothing.
|
53 |
+
# The larger this value, the larger the VAD variations must be to not get smoothed out.
|
54 |
+
vad_moving_average_width = 8
|
55 |
+
|
56 |
+
# Compute the voice detection window size
|
57 |
+
samples_per_window = (vad_window_length * sampling_rate) // 1000
|
58 |
+
|
59 |
+
# Trim the end of the audio to have a multiple of the window size
|
60 |
+
wav = wav[:len(wav) - (len(wav) % samples_per_window)]
|
61 |
+
|
62 |
+
# Convert the float waveform to 16-bit mono PCM
|
63 |
+
pcm_wave = struct.pack("%dh" % len(wav), *(np.round(wav * int16_max)).astype(np.int16))
|
64 |
+
|
65 |
+
# Perform voice activation detection
|
66 |
+
voice_flags = []
|
67 |
+
vad = webrtcvad.Vad(mode=3)
|
68 |
+
for window_start in range(0, len(wav), samples_per_window):
|
69 |
+
window_end = window_start + samples_per_window
|
70 |
+
voice_flags.append(vad.is_speech(pcm_wave[window_start * 2:window_end * 2],
|
71 |
+
sample_rate=sampling_rate))
|
72 |
+
voice_flags = np.array(voice_flags)
|
73 |
+
|
74 |
+
# Smooth the voice detection with a moving average
|
75 |
+
def moving_average(array, width):
|
76 |
+
array_padded = np.concatenate((np.zeros((width - 1) // 2), array, np.zeros(width // 2)))
|
77 |
+
ret = np.cumsum(array_padded, dtype=float)
|
78 |
+
ret[width:] = ret[width:] - ret[:-width]
|
79 |
+
return ret[width - 1:] / width
|
80 |
+
|
81 |
+
audio_mask = moving_average(voice_flags, vad_moving_average_width)
|
82 |
+
audio_mask = np.round(audio_mask).astype(np.bool)
|
83 |
+
|
84 |
+
# Dilate the voiced regions
|
85 |
+
audio_mask = binary_dilation(audio_mask, np.ones(vad_max_silence_length + 1))
|
86 |
+
audio_mask = np.repeat(audio_mask, samples_per_window)
|
87 |
+
audio_mask = resize(audio_mask, (len(wav_raw),)) > 0
|
88 |
+
if return_raw_wav:
|
89 |
+
return wav_raw, audio_mask, sr
|
90 |
+
return wav_raw[audio_mask], audio_mask, sr
|
91 |
+
|
92 |
+
|
93 |
+
def process_utterance(wav_path,
|
94 |
+
fft_size=1024,
|
95 |
+
hop_size=256,
|
96 |
+
win_length=1024,
|
97 |
+
window="hann",
|
98 |
+
num_mels=80,
|
99 |
+
fmin=80,
|
100 |
+
fmax=7600,
|
101 |
+
eps=1e-6,
|
102 |
+
sample_rate=22050,
|
103 |
+
loud_norm=False,
|
104 |
+
min_level_db=-100,
|
105 |
+
return_linear=False,
|
106 |
+
trim_long_sil=False, vocoder='pwg'):
|
107 |
+
if isinstance(wav_path, str):
|
108 |
+
if trim_long_sil:
|
109 |
+
wav, _, _ = trim_long_silences(wav_path, sample_rate)
|
110 |
+
else:
|
111 |
+
wav, _ = librosa.core.load(wav_path, sr=sample_rate)
|
112 |
+
else:
|
113 |
+
wav = wav_path
|
114 |
+
|
115 |
+
if loud_norm:
|
116 |
+
meter = pyln.Meter(sample_rate) # create BS.1770 meter
|
117 |
+
loudness = meter.integrated_loudness(wav)
|
118 |
+
wav = pyln.normalize.loudness(wav, loudness, -22.0)
|
119 |
+
if np.abs(wav).max() > 1:
|
120 |
+
wav = wav / np.abs(wav).max()
|
121 |
+
|
122 |
+
# get amplitude spectrogram
|
123 |
+
x_stft = librosa.stft(wav, n_fft=fft_size, hop_length=hop_size,
|
124 |
+
win_length=win_length, window=window, pad_mode="constant")
|
125 |
+
spc = np.abs(x_stft) # (n_bins, T)
|
126 |
+
|
127 |
+
# get mel basis
|
128 |
+
fmin = 0 if fmin == -1 else fmin
|
129 |
+
fmax = sample_rate / 2 if fmax == -1 else fmax
|
130 |
+
mel_basis = librosa.filters.mel(sample_rate, fft_size, num_mels, fmin, fmax)
|
131 |
+
mel = mel_basis @ spc
|
132 |
+
|
133 |
+
if vocoder == 'pwg':
|
134 |
+
mel = np.log10(np.maximum(eps, mel)) # (n_mel_bins, T)
|
135 |
+
else:
|
136 |
+
assert False, f'"{vocoder}" is not in ["pwg"].'
|
137 |
+
|
138 |
+
l_pad, r_pad = audio.librosa_pad_lr(wav, fft_size, hop_size, 1)
|
139 |
+
wav = np.pad(wav, (l_pad, r_pad), mode='constant', constant_values=0.0)
|
140 |
+
wav = wav[:mel.shape[1] * hop_size]
|
141 |
+
|
142 |
+
if not return_linear:
|
143 |
+
return wav, mel
|
144 |
+
else:
|
145 |
+
spc = audio.amp_to_db(spc)
|
146 |
+
spc = audio.normalize(spc, {'min_level_db': min_level_db})
|
147 |
+
return wav, mel, spc
|
148 |
+
|
149 |
+
|
150 |
+
def get_pitch(wav_data, mel, hparams):
|
151 |
+
"""
|
152 |
+
|
153 |
+
:param wav_data: [T]
|
154 |
+
:param mel: [T, 80]
|
155 |
+
:param hparams:
|
156 |
+
:return:
|
157 |
+
"""
|
158 |
+
time_step = hparams['hop_size'] / hparams['audio_sample_rate'] * 1000
|
159 |
+
f0_min = 80
|
160 |
+
f0_max = 750
|
161 |
+
|
162 |
+
if hparams['hop_size'] == 128:
|
163 |
+
pad_size = 4
|
164 |
+
elif hparams['hop_size'] == 256:
|
165 |
+
pad_size = 2
|
166 |
+
else:
|
167 |
+
assert False
|
168 |
+
|
169 |
+
f0 = parselmouth.Sound(wav_data, hparams['audio_sample_rate']).to_pitch_ac(
|
170 |
+
time_step=time_step / 1000, voicing_threshold=0.6,
|
171 |
+
pitch_floor=f0_min, pitch_ceiling=f0_max).selected_array['frequency']
|
172 |
+
lpad = pad_size * 2
|
173 |
+
rpad = len(mel) - len(f0) - lpad
|
174 |
+
f0 = np.pad(f0, [[lpad, rpad]], mode='constant')
|
175 |
+
# mel and f0 are extracted by 2 different libraries. we should force them to have the same length.
|
176 |
+
# Attention: we find that new version of some libraries could cause ``rpad'' to be a negetive value...
|
177 |
+
# Just to be sure, we recommend users to set up the same environments as them in requirements_auto.txt (by Anaconda)
|
178 |
+
delta_l = len(mel) - len(f0)
|
179 |
+
assert np.abs(delta_l) <= 8
|
180 |
+
if delta_l > 0:
|
181 |
+
f0 = np.concatenate([f0, [f0[-1]] * delta_l], 0)
|
182 |
+
f0 = f0[:len(mel)]
|
183 |
+
pitch_coarse = f0_to_coarse(f0)
|
184 |
+
return f0, pitch_coarse
|
185 |
+
|
186 |
+
|
187 |
+
def remove_empty_lines(text):
|
188 |
+
"""remove empty lines"""
|
189 |
+
assert (len(text) > 0)
|
190 |
+
assert (isinstance(text, list))
|
191 |
+
text = [t.strip() for t in text]
|
192 |
+
if "" in text:
|
193 |
+
text.remove("")
|
194 |
+
return text
|
195 |
+
|
196 |
+
|
197 |
+
class TextGrid(object):
|
198 |
+
def __init__(self, text):
|
199 |
+
text = remove_empty_lines(text)
|
200 |
+
self.text = text
|
201 |
+
self.line_count = 0
|
202 |
+
self._get_type()
|
203 |
+
self._get_time_intval()
|
204 |
+
self._get_size()
|
205 |
+
self.tier_list = []
|
206 |
+
self._get_item_list()
|
207 |
+
|
208 |
+
def _extract_pattern(self, pattern, inc):
|
209 |
+
"""
|
210 |
+
Parameters
|
211 |
+
----------
|
212 |
+
pattern : regex to extract pattern
|
213 |
+
inc : increment of line count after extraction
|
214 |
+
Returns
|
215 |
+
-------
|
216 |
+
group : extracted info
|
217 |
+
"""
|
218 |
+
try:
|
219 |
+
group = re.match(pattern, self.text[self.line_count]).group(1)
|
220 |
+
self.line_count += inc
|
221 |
+
except AttributeError:
|
222 |
+
raise ValueError("File format error at line %d:%s" % (self.line_count, self.text[self.line_count]))
|
223 |
+
return group
|
224 |
+
|
225 |
+
def _get_type(self):
|
226 |
+
self.file_type = self._extract_pattern(r"File type = \"(.*)\"", 2)
|
227 |
+
|
228 |
+
def _get_time_intval(self):
|
229 |
+
self.xmin = self._extract_pattern(r"xmin = (.*)", 1)
|
230 |
+
self.xmax = self._extract_pattern(r"xmax = (.*)", 2)
|
231 |
+
|
232 |
+
def _get_size(self):
|
233 |
+
self.size = int(self._extract_pattern(r"size = (.*)", 2))
|
234 |
+
|
235 |
+
def _get_item_list(self):
|
236 |
+
"""Only supports IntervalTier currently"""
|
237 |
+
for itemIdx in range(1, self.size + 1):
|
238 |
+
tier = OrderedDict()
|
239 |
+
item_list = []
|
240 |
+
tier_idx = self._extract_pattern(r"item \[(.*)\]:", 1)
|
241 |
+
tier_class = self._extract_pattern(r"class = \"(.*)\"", 1)
|
242 |
+
if tier_class != "IntervalTier":
|
243 |
+
raise NotImplementedError("Only IntervalTier class is supported currently")
|
244 |
+
tier_name = self._extract_pattern(r"name = \"(.*)\"", 1)
|
245 |
+
tier_xmin = self._extract_pattern(r"xmin = (.*)", 1)
|
246 |
+
tier_xmax = self._extract_pattern(r"xmax = (.*)", 1)
|
247 |
+
tier_size = self._extract_pattern(r"intervals: size = (.*)", 1)
|
248 |
+
for i in range(int(tier_size)):
|
249 |
+
item = OrderedDict()
|
250 |
+
item["idx"] = self._extract_pattern(r"intervals \[(.*)\]", 1)
|
251 |
+
item["xmin"] = self._extract_pattern(r"xmin = (.*)", 1)
|
252 |
+
item["xmax"] = self._extract_pattern(r"xmax = (.*)", 1)
|
253 |
+
item["text"] = self._extract_pattern(r"text = \"(.*)\"", 1)
|
254 |
+
item_list.append(item)
|
255 |
+
tier["idx"] = tier_idx
|
256 |
+
tier["class"] = tier_class
|
257 |
+
tier["name"] = tier_name
|
258 |
+
tier["xmin"] = tier_xmin
|
259 |
+
tier["xmax"] = tier_xmax
|
260 |
+
tier["size"] = tier_size
|
261 |
+
tier["items"] = item_list
|
262 |
+
self.tier_list.append(tier)
|
263 |
+
|
264 |
+
def toJson(self):
|
265 |
+
_json = OrderedDict()
|
266 |
+
_json["file_type"] = self.file_type
|
267 |
+
_json["xmin"] = self.xmin
|
268 |
+
_json["xmax"] = self.xmax
|
269 |
+
_json["size"] = self.size
|
270 |
+
_json["tiers"] = self.tier_list
|
271 |
+
return json.dumps(_json, ensure_ascii=False, indent=2)
|
272 |
+
|
273 |
+
|
274 |
+
def get_mel2ph(tg_fn, ph, mel, hparams):
|
275 |
+
ph_list = ph.split(" ")
|
276 |
+
with open(tg_fn, "r") as f:
|
277 |
+
tg = f.readlines()
|
278 |
+
tg = remove_empty_lines(tg)
|
279 |
+
tg = TextGrid(tg)
|
280 |
+
tg = json.loads(tg.toJson())
|
281 |
+
split = np.ones(len(ph_list) + 1, np.float) * -1
|
282 |
+
tg_idx = 0
|
283 |
+
ph_idx = 0
|
284 |
+
tg_align = [x for x in tg['tiers'][-1]['items']]
|
285 |
+
tg_align_ = []
|
286 |
+
for x in tg_align:
|
287 |
+
x['xmin'] = float(x['xmin'])
|
288 |
+
x['xmax'] = float(x['xmax'])
|
289 |
+
if x['text'] in ['sil', 'sp', '', 'SIL', 'PUNC']:
|
290 |
+
x['text'] = ''
|
291 |
+
if len(tg_align_) > 0 and tg_align_[-1]['text'] == '':
|
292 |
+
tg_align_[-1]['xmax'] = x['xmax']
|
293 |
+
continue
|
294 |
+
tg_align_.append(x)
|
295 |
+
tg_align = tg_align_
|
296 |
+
tg_len = len([x for x in tg_align if x['text'] != ''])
|
297 |
+
ph_len = len([x for x in ph_list if not is_sil_phoneme(x)])
|
298 |
+
assert tg_len == ph_len, (tg_len, ph_len, tg_align, ph_list, tg_fn)
|
299 |
+
while tg_idx < len(tg_align) or ph_idx < len(ph_list):
|
300 |
+
if tg_idx == len(tg_align) and is_sil_phoneme(ph_list[ph_idx]):
|
301 |
+
split[ph_idx] = 1e8
|
302 |
+
ph_idx += 1
|
303 |
+
continue
|
304 |
+
x = tg_align[tg_idx]
|
305 |
+
if x['text'] == '' and ph_idx == len(ph_list):
|
306 |
+
tg_idx += 1
|
307 |
+
continue
|
308 |
+
assert ph_idx < len(ph_list), (tg_len, ph_len, tg_align, ph_list, tg_fn)
|
309 |
+
ph = ph_list[ph_idx]
|
310 |
+
if x['text'] == '' and not is_sil_phoneme(ph):
|
311 |
+
assert False, (ph_list, tg_align)
|
312 |
+
if x['text'] != '' and is_sil_phoneme(ph):
|
313 |
+
ph_idx += 1
|
314 |
+
else:
|
315 |
+
assert (x['text'] == '' and is_sil_phoneme(ph)) \
|
316 |
+
or x['text'].lower() == ph.lower() \
|
317 |
+
or x['text'].lower() == 'sil', (x['text'], ph)
|
318 |
+
split[ph_idx] = x['xmin']
|
319 |
+
if ph_idx > 0 and split[ph_idx - 1] == -1 and is_sil_phoneme(ph_list[ph_idx - 1]):
|
320 |
+
split[ph_idx - 1] = split[ph_idx]
|
321 |
+
ph_idx += 1
|
322 |
+
tg_idx += 1
|
323 |
+
assert tg_idx == len(tg_align), (tg_idx, [x['text'] for x in tg_align])
|
324 |
+
assert ph_idx >= len(ph_list) - 1, (ph_idx, ph_list, len(ph_list), [x['text'] for x in tg_align], tg_fn)
|
325 |
+
mel2ph = np.zeros([mel.shape[0]], np.int)
|
326 |
+
split[0] = 0
|
327 |
+
split[-1] = 1e8
|
328 |
+
for i in range(len(split) - 1):
|
329 |
+
assert split[i] != -1 and split[i] <= split[i + 1], (split[:-1],)
|
330 |
+
split = [int(s * hparams['audio_sample_rate'] / hparams['hop_size'] + 0.5) for s in split]
|
331 |
+
for ph_idx in range(len(ph_list)):
|
332 |
+
mel2ph[split[ph_idx]:split[ph_idx + 1]] = ph_idx + 1
|
333 |
+
mel2ph_torch = torch.from_numpy(mel2ph)
|
334 |
+
T_t = len(ph_list)
|
335 |
+
dur = mel2ph_torch.new_zeros([T_t + 1]).scatter_add(0, mel2ph_torch, torch.ones_like(mel2ph_torch))
|
336 |
+
dur = dur[1:].numpy()
|
337 |
+
return mel2ph, dur
|
338 |
+
|
339 |
+
|
340 |
+
def build_phone_encoder(data_dir):
|
341 |
+
phone_list_file = os.path.join(data_dir, 'phone_set.json')
|
342 |
+
phone_list = json.load(open(phone_list_file))
|
343 |
+
return TokenTextEncoder(None, vocab_list=phone_list, replace_oov=',')
|
344 |
+
|
345 |
+
|
346 |
+
def is_sil_phoneme(p):
|
347 |
+
return not p[0].isalpha()
|
data_gen/tts/txt_processors/base_text_processor.py
ADDED
@@ -0,0 +1,8 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
class BaseTxtProcessor:
|
2 |
+
@staticmethod
|
3 |
+
def sp_phonemes():
|
4 |
+
return ['|']
|
5 |
+
|
6 |
+
@classmethod
|
7 |
+
def process(cls, txt, pre_align_args):
|
8 |
+
raise NotImplementedError
|
data_gen/tts/txt_processors/en.py
ADDED
@@ -0,0 +1,78 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import re
|
2 |
+
from data_gen.tts.data_gen_utils import PUNCS
|
3 |
+
from g2p_en import G2p
|
4 |
+
import unicodedata
|
5 |
+
from g2p_en.expand import normalize_numbers
|
6 |
+
from nltk import pos_tag
|
7 |
+
from nltk.tokenize import TweetTokenizer
|
8 |
+
|
9 |
+
from data_gen.tts.txt_processors.base_text_processor import BaseTxtProcessor
|
10 |
+
|
11 |
+
|
12 |
+
class EnG2p(G2p):
|
13 |
+
word_tokenize = TweetTokenizer().tokenize
|
14 |
+
|
15 |
+
def __call__(self, text):
|
16 |
+
# preprocessing
|
17 |
+
words = EnG2p.word_tokenize(text)
|
18 |
+
tokens = pos_tag(words) # tuples of (word, tag)
|
19 |
+
|
20 |
+
# steps
|
21 |
+
prons = []
|
22 |
+
for word, pos in tokens:
|
23 |
+
if re.search("[a-z]", word) is None:
|
24 |
+
pron = [word]
|
25 |
+
|
26 |
+
elif word in self.homograph2features: # Check homograph
|
27 |
+
pron1, pron2, pos1 = self.homograph2features[word]
|
28 |
+
if pos.startswith(pos1):
|
29 |
+
pron = pron1
|
30 |
+
else:
|
31 |
+
pron = pron2
|
32 |
+
elif word in self.cmu: # lookup CMU dict
|
33 |
+
pron = self.cmu[word][0]
|
34 |
+
else: # predict for oov
|
35 |
+
pron = self.predict(word)
|
36 |
+
|
37 |
+
prons.extend(pron)
|
38 |
+
prons.extend([" "])
|
39 |
+
|
40 |
+
return prons[:-1]
|
41 |
+
|
42 |
+
|
43 |
+
class TxtProcessor(BaseTxtProcessor):
|
44 |
+
g2p = EnG2p()
|
45 |
+
|
46 |
+
@staticmethod
|
47 |
+
def preprocess_text(text):
|
48 |
+
text = normalize_numbers(text)
|
49 |
+
text = ''.join(char for char in unicodedata.normalize('NFD', text)
|
50 |
+
if unicodedata.category(char) != 'Mn') # Strip accents
|
51 |
+
text = text.lower()
|
52 |
+
text = re.sub("[\'\"()]+", "", text)
|
53 |
+
text = re.sub("[-]+", " ", text)
|
54 |
+
text = re.sub(f"[^ a-z{PUNCS}]", "", text)
|
55 |
+
text = re.sub(f" ?([{PUNCS}]) ?", r"\1", text) # !! -> !
|
56 |
+
text = re.sub(f"([{PUNCS}])+", r"\1", text) # !! -> !
|
57 |
+
text = text.replace("i.e.", "that is")
|
58 |
+
text = text.replace("i.e.", "that is")
|
59 |
+
text = text.replace("etc.", "etc")
|
60 |
+
text = re.sub(f"([{PUNCS}])", r" \1 ", text)
|
61 |
+
text = re.sub(rf"\s+", r" ", text)
|
62 |
+
return text
|
63 |
+
|
64 |
+
@classmethod
|
65 |
+
def process(cls, txt, pre_align_args):
|
66 |
+
txt = cls.preprocess_text(txt).strip()
|
67 |
+
phs = cls.g2p(txt)
|
68 |
+
phs_ = []
|
69 |
+
n_word_sep = 0
|
70 |
+
for p in phs:
|
71 |
+
if p.strip() == '':
|
72 |
+
phs_ += ['|']
|
73 |
+
n_word_sep += 1
|
74 |
+
else:
|
75 |
+
phs_ += p.split(" ")
|
76 |
+
phs = phs_
|
77 |
+
assert n_word_sep + 1 == len(txt.split(" ")), (phs, f"\"{txt}\"")
|
78 |
+
return phs, txt
|
data_gen/tts/txt_processors/zh.py
ADDED
@@ -0,0 +1,41 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import re
|
2 |
+
from pypinyin import pinyin, Style
|
3 |
+
from data_gen.tts.data_gen_utils import PUNCS
|
4 |
+
from data_gen.tts.txt_processors.base_text_processor import BaseTxtProcessor
|
5 |
+
from utils.text_norm import NSWNormalizer
|
6 |
+
|
7 |
+
|
8 |
+
class TxtProcessor(BaseTxtProcessor):
|
9 |
+
table = {ord(f): ord(t) for f, t in zip(
|
10 |
+
u':,。!?【】()%#@&1234567890',
|
11 |
+
u':,.!?[]()%#@&1234567890')}
|
12 |
+
|
13 |
+
@staticmethod
|
14 |
+
def preprocess_text(text):
|
15 |
+
text = text.translate(TxtProcessor.table)
|
16 |
+
text = NSWNormalizer(text).normalize(remove_punc=False)
|
17 |
+
text = re.sub("[\'\"()]+", "", text)
|
18 |
+
text = re.sub("[-]+", " ", text)
|
19 |
+
text = re.sub(f"[^ A-Za-z\u4e00-\u9fff{PUNCS}]", "", text)
|
20 |
+
text = re.sub(f"([{PUNCS}])+", r"\1", text) # !! -> !
|
21 |
+
text = re.sub(f"([{PUNCS}])", r" \1 ", text)
|
22 |
+
text = re.sub(rf"\s+", r"", text)
|
23 |
+
return text
|
24 |
+
|
25 |
+
@classmethod
|
26 |
+
def process(cls, txt, pre_align_args):
|
27 |
+
txt = cls.preprocess_text(txt)
|
28 |
+
shengmu = pinyin(txt, style=Style.INITIALS) # https://blog.csdn.net/zhoulei124/article/details/89055403
|
29 |
+
yunmu_finals = pinyin(txt, style=Style.FINALS)
|
30 |
+
yunmu_tone3 = pinyin(txt, style=Style.FINALS_TONE3)
|
31 |
+
yunmu = [[t[0] + '5'] if t[0] == f[0] else t for f, t in zip(yunmu_finals, yunmu_tone3)] \
|
32 |
+
if pre_align_args['use_tone'] else yunmu_finals
|
33 |
+
|
34 |
+
assert len(shengmu) == len(yunmu)
|
35 |
+
phs = ["|"]
|
36 |
+
for a, b, c in zip(shengmu, yunmu, yunmu_finals):
|
37 |
+
if a[0] == c[0]:
|
38 |
+
phs += [a[0], "|"]
|
39 |
+
else:
|
40 |
+
phs += [a[0], b[0], "|"]
|
41 |
+
return phs, txt
|
data_gen/tts/txt_processors/zh_g2pM.py
ADDED
@@ -0,0 +1,72 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import re
|
2 |
+
import jieba
|
3 |
+
from pypinyin import pinyin, Style
|
4 |
+
from data_gen.tts.data_gen_utils import PUNCS
|
5 |
+
from data_gen.tts.txt_processors import zh
|
6 |
+
from g2pM import G2pM
|
7 |
+
|
8 |
+
ALL_SHENMU = ['zh', 'ch', 'sh', 'b', 'p', 'm', 'f', 'd', 't', 'n', 'l', 'g', 'k', 'h', 'j',
|
9 |
+
'q', 'x', 'r', 'z', 'c', 's', 'y', 'w']
|
10 |
+
ALL_YUNMU = ['a', 'ai', 'an', 'ang', 'ao', 'e', 'ei', 'en', 'eng', 'er', 'i', 'ia', 'ian',
|
11 |
+
'iang', 'iao', 'ie', 'in', 'ing', 'iong', 'iu', 'ng', 'o', 'ong', 'ou',
|
12 |
+
'u', 'ua', 'uai', 'uan', 'uang', 'ui', 'un', 'uo', 'v', 'van', 've', 'vn']
|
13 |
+
|
14 |
+
|
15 |
+
class TxtProcessor(zh.TxtProcessor):
|
16 |
+
model = G2pM()
|
17 |
+
|
18 |
+
@staticmethod
|
19 |
+
def sp_phonemes():
|
20 |
+
return ['|', '#']
|
21 |
+
|
22 |
+
@classmethod
|
23 |
+
def process(cls, txt, pre_align_args):
|
24 |
+
txt = cls.preprocess_text(txt)
|
25 |
+
ph_list = cls.model(txt, tone=pre_align_args['use_tone'], char_split=True)
|
26 |
+
seg_list = '#'.join(jieba.cut(txt))
|
27 |
+
assert len(ph_list) == len([s for s in seg_list if s != '#']), (ph_list, seg_list)
|
28 |
+
|
29 |
+
# 加入词边界'#'
|
30 |
+
ph_list_ = []
|
31 |
+
seg_idx = 0
|
32 |
+
for p in ph_list:
|
33 |
+
p = p.replace("u:", "v")
|
34 |
+
if seg_list[seg_idx] == '#':
|
35 |
+
ph_list_.append('#')
|
36 |
+
seg_idx += 1
|
37 |
+
else:
|
38 |
+
ph_list_.append("|")
|
39 |
+
seg_idx += 1
|
40 |
+
if re.findall('[\u4e00-\u9fff]', p):
|
41 |
+
if pre_align_args['use_tone']:
|
42 |
+
p = pinyin(p, style=Style.TONE3, strict=True)[0][0]
|
43 |
+
if p[-1] not in ['1', '2', '3', '4', '5']:
|
44 |
+
p = p + '5'
|
45 |
+
else:
|
46 |
+
p = pinyin(p, style=Style.NORMAL, strict=True)[0][0]
|
47 |
+
|
48 |
+
finished = False
|
49 |
+
if len([c.isalpha() for c in p]) > 1:
|
50 |
+
for shenmu in ALL_SHENMU:
|
51 |
+
if p.startswith(shenmu) and not p.lstrip(shenmu).isnumeric():
|
52 |
+
ph_list_ += [shenmu, p.lstrip(shenmu)]
|
53 |
+
finished = True
|
54 |
+
break
|
55 |
+
if not finished:
|
56 |
+
ph_list_.append(p)
|
57 |
+
|
58 |
+
ph_list = ph_list_
|
59 |
+
|
60 |
+
# 去除静音符号周围的词边界标记 [..., '#', ',', '#', ...]
|
61 |
+
sil_phonemes = list(PUNCS) + TxtProcessor.sp_phonemes()
|
62 |
+
ph_list_ = []
|
63 |
+
for i in range(0, len(ph_list), 1):
|
64 |
+
if ph_list[i] != '#' or (ph_list[i - 1] not in sil_phonemes and ph_list[i + 1] not in sil_phonemes):
|
65 |
+
ph_list_.append(ph_list[i])
|
66 |
+
ph_list = ph_list_
|
67 |
+
return ph_list, txt
|
68 |
+
|
69 |
+
|
70 |
+
if __name__ == '__main__':
|
71 |
+
phs, txt = TxtProcessor.process('他来到了,网易杭研大厦', {'use_tone': True})
|
72 |
+
print(phs)
|
docs/README-SVS-opencpop-cascade.md
ADDED
@@ -0,0 +1,111 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
# DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
|
2 |
+
[![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2105.02446)
|
3 |
+
[![GitHub Stars](https://img.shields.io/github/stars/MoonInTheRiver/DiffSinger?style=social)](https://github.com/MoonInTheRiver/DiffSinger)
|
4 |
+
[![downloads](https://img.shields.io/github/downloads/MoonInTheRiver/DiffSinger/total.svg)](https://github.com/MoonInTheRiver/DiffSinger/releases)
|
5 |
+
|
6 |
+
## DiffSinger (MIDI version SVS)
|
7 |
+
### 0. Data Acquirement
|
8 |
+
For Opencpop dataset: Please strictly follow the instructions of [Opencpop](https://wenet.org.cn/opencpop/). We have no right to give you the access to Opencpop.
|
9 |
+
|
10 |
+
The pipeline below is designed for Opencpop dataset:
|
11 |
+
|
12 |
+
### 1. Preparation
|
13 |
+
|
14 |
+
#### Data Preparation
|
15 |
+
a) Download and extract Opencpop, then create a link to the dataset folder: `ln -s /xxx/opencpop data/raw/`
|
16 |
+
|
17 |
+
b) Run the following scripts to pack the dataset for training/inference.
|
18 |
+
|
19 |
+
```sh
|
20 |
+
export PYTHONPATH=.
|
21 |
+
CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config usr/configs/midi/cascade/opencs/aux_rel.yaml
|
22 |
+
|
23 |
+
# `data/binary/opencpop-midi-dp` will be generated.
|
24 |
+
```
|
25 |
+
|
26 |
+
#### Vocoder Preparation
|
27 |
+
We provide the pre-trained model of [HifiGAN-Singing](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/0109_hifigan_bigpopcs_hop128.zip) which is specially designed for SVS with NSF mechanism.
|
28 |
+
Please unzip this file into `checkpoints` before training your acoustic model.
|
29 |
+
|
30 |
+
(Update: You can also move [a ckpt with more training steps](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/model_ckpt_steps_1512000.ckpt) into this vocoder directory)
|
31 |
+
|
32 |
+
This singing vocoder is trained on ~70 hours singing data, which can be viewed as a universal vocoder.
|
33 |
+
|
34 |
+
#### Exp Name Preparation
|
35 |
+
```bash
|
36 |
+
export MY_FS_EXP_NAME=0302_opencpop_fs_midi
|
37 |
+
export MY_DS_EXP_NAME=0303_opencpop_ds58_midi
|
38 |
+
```
|
39 |
+
|
40 |
+
```
|
41 |
+
.
|
42 |
+
|--data
|
43 |
+
|--raw
|
44 |
+
|--opencpop
|
45 |
+
|--segments
|
46 |
+
|--transcriptions.txt
|
47 |
+
|--wavs
|
48 |
+
|--checkpoints
|
49 |
+
|--MY_FS_EXP_NAME (optional)
|
50 |
+
|--MY_DS_EXP_NAME (optional)
|
51 |
+
|--0109_hifigan_bigpopcs_hop128
|
52 |
+
|--model_ckpt_steps_1512000.ckpt
|
53 |
+
|--config.yaml
|
54 |
+
```
|
55 |
+
|
56 |
+
### 2. Training Example
|
57 |
+
First, you need a pre-trained FFT-Singer checkpoint. You can use the pre-trained model, or train FFT-Singer from scratch, run:
|
58 |
+
```sh
|
59 |
+
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/midi/cascade/opencs/aux_rel.yaml --exp_name $MY_FS_EXP_NAME --reset
|
60 |
+
```
|
61 |
+
|
62 |
+
Then, to train DiffSinger, run:
|
63 |
+
|
64 |
+
```sh
|
65 |
+
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/midi/cascade/opencs/ds60_rel.yaml --exp_name $MY_DS_EXP_NAME --reset
|
66 |
+
```
|
67 |
+
|
68 |
+
Remember to adjust the "fs2_ckpt" parameter in `usr/configs/midi/cascade/opencs/ds60_rel.yaml` to fit your path.
|
69 |
+
|
70 |
+
### 3. Inference Example
|
71 |
+
```sh
|
72 |
+
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/midi/cascade/opencs/ds60_rel.yaml --exp_name $MY_DS_EXP_NAME --reset --infer
|
73 |
+
```
|
74 |
+
|
75 |
+
We also provide:
|
76 |
+
- the pre-trained model of DiffSinger;
|
77 |
+
- the pre-trained model of FFT-Singer;
|
78 |
+
|
79 |
+
They can be found in [here](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/adjust-receptive-field.zip).
|
80 |
+
|
81 |
+
Remember to put the pre-trained models in `checkpoints` directory.
|
82 |
+
|
83 |
+
### 4. Inference from raw inputs
|
84 |
+
```sh
|
85 |
+
python inference/svs/ds_e2e.py --config usr/configs/midi/cascade/opencs/ds60_rel.yaml --exp_name $MY_DS_EXP_NAME
|
86 |
+
```
|
87 |
+
Raw inputs:
|
88 |
+
```
|
89 |
+
inp = {
|
90 |
+
'text': '小酒窝长睫毛AP是你最美的记号',
|
91 |
+
'notes': 'C#4/Db4 | F#4/Gb4 | G#4/Ab4 | A#4/Bb4 F#4/Gb4 | F#4/Gb4 C#4/Db4 | C#4/Db4 | rest | C#4/Db4 | A#4/Bb4 | G#4/Ab4 | A#4/Bb4 | G#4/Ab4 | F4 | C#4/Db4',
|
92 |
+
'notes_duration': '0.407140 | 0.376190 | 0.242180 | 0.509550 0.183420 | 0.315400 0.235020 | 0.361660 | 0.223070 | 0.377270 | 0.340550 | 0.299620 | 0.344510 | 0.283770 | 0.323390 | 0.360340',
|
93 |
+
'input_type': 'word'
|
94 |
+
} # user input: Chinese characters
|
95 |
+
or,
|
96 |
+
inp = {
|
97 |
+
'text': '小酒窝长睫毛AP是你最美的记号',
|
98 |
+
'ph_seq': 'x iao j iu w o ch ang ang j ie ie m ao AP sh i n i z ui m ei d e j i h ao',
|
99 |
+
'note_seq': 'C#4/Db4 C#4/Db4 F#4/Gb4 F#4/Gb4 G#4/Ab4 G#4/Ab4 A#4/Bb4 A#4/Bb4 F#4/Gb4 F#4/Gb4 F#4/Gb4 C#4/Db4 C#4/Db4 C#4/Db4 rest C#4/Db4 C#4/Db4 A#4/Bb4 A#4/Bb4 G#4/Ab4 G#4/Ab4 A#4/Bb4 A#4/Bb4 G#4/Ab4 G#4/Ab4 F4 F4 C#4/Db4 C#4/Db4',
|
100 |
+
'note_dur_seq': '0.407140 0.407140 0.376190 0.376190 0.242180 0.242180 0.509550 0.509550 0.183420 0.315400 0.315400 0.235020 0.361660 0.361660 0.223070 0.377270 0.377270 0.340550 0.340550 0.299620 0.299620 0.344510 0.344510 0.283770 0.283770 0.323390 0.323390 0.360340 0.360340',
|
101 |
+
'is_slur_seq': '0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0',
|
102 |
+
'input_type': 'phoneme'
|
103 |
+
} # input like Opencpop dataset.
|
104 |
+
```
|
105 |
+
|
106 |
+
### 5. Some issues.
|
107 |
+
a) the HifiGAN-Singing is trained on our [vocoder dataset](https://dl.acm.org/doi/abs/10.1145/3474085.3475437) and the training set of [PopCS](https://arxiv.org/abs/2105.02446). Opencpop is the out-of-domain dataset (unseen speaker). This may cause the deterioration of audio quality, and we are considering fine-tuning this vocoder on the training set of Opencpop.
|
108 |
+
|
109 |
+
b) in this version of codes, we used the melody frontend ([lyric + MIDI]->[F0+ph_dur]) to predict F0 contour and phoneme duration.
|
110 |
+
|
111 |
+
c) generated audio demos can be found in [MY_DS_EXP_NAME](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/adjust-receptive-field.zip).
|
docs/README-SVS-opencpop-e2e.md
ADDED
@@ -0,0 +1,106 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
# DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
|
2 |
+
[![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2105.02446)
|
3 |
+
[![GitHub Stars](https://img.shields.io/github/stars/MoonInTheRiver/DiffSinger?style=social)](https://github.com/MoonInTheRiver/DiffSinger)
|
4 |
+
[![downloads](https://img.shields.io/github/downloads/MoonInTheRiver/DiffSinger/total.svg)](https://github.com/MoonInTheRiver/DiffSinger/releases)
|
5 |
+
|
6 |
+
Substantial update: We 1) **abandon** the explicit prediction of the F0 curve; 2) increase the receptive field of the denoiser; 3) make the linguistic encoder more robust.
|
7 |
+
**By doing so, 1) the synthesized recordings are more natural in terms of pitch; 2) the pipeline is simpler.**
|
8 |
+
|
9 |
+
简而言之,把F0曲线的动态性交给生成式模型去捕捉,而不再是以前那样用MSE约束对数域F0。
|
10 |
+
|
11 |
+
## DiffSinger (MIDI version SVS)
|
12 |
+
### 0. Data Acquirement
|
13 |
+
For Opencpop dataset: Please strictly follow the instructions of [Opencpop](https://wenet.org.cn/opencpop/). We have no right to give you the access to Opencpop.
|
14 |
+
|
15 |
+
The pipeline below is designed for Opencpop dataset:
|
16 |
+
|
17 |
+
### 1. Preparation
|
18 |
+
|
19 |
+
#### Data Preparation
|
20 |
+
a) Download and extract Opencpop, then create a link to the dataset folder: `ln -s /xxx/opencpop data/raw/`
|
21 |
+
|
22 |
+
b) Run the following scripts to pack the dataset for training/inference.
|
23 |
+
|
24 |
+
```sh
|
25 |
+
export PYTHONPATH=.
|
26 |
+
CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config usr/configs/midi/cascade/opencs/aux_rel.yaml
|
27 |
+
|
28 |
+
# `data/binary/opencpop-midi-dp` will be generated.
|
29 |
+
```
|
30 |
+
|
31 |
+
#### Vocoder Preparation
|
32 |
+
We provide the pre-trained model of [HifiGAN-Singing](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/0109_hifigan_bigpopcs_hop128.zip) which is specially designed for SVS with NSF mechanism.
|
33 |
+
|
34 |
+
Also, please unzip pre-trained vocoder and [this pendant for vocoder](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/0102_xiaoma_pe.zip) into `checkpoints` before training your acoustic model.
|
35 |
+
|
36 |
+
(Update: You can also move [a ckpt with more training steps](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/model_ckpt_steps_1512000.ckpt) into this vocoder directory)
|
37 |
+
|
38 |
+
This singing vocoder is trained on ~70 hours singing data, which can be viewed as a universal vocoder.
|
39 |
+
|
40 |
+
#### Exp Name Preparation
|
41 |
+
```bash
|
42 |
+
export MY_DS_EXP_NAME=0228_opencpop_ds100_rel
|
43 |
+
```
|
44 |
+
|
45 |
+
```
|
46 |
+
.
|
47 |
+
|--data
|
48 |
+
|--raw
|
49 |
+
|--opencpop
|
50 |
+
|--segments
|
51 |
+
|--transcriptions.txt
|
52 |
+
|--wavs
|
53 |
+
|--checkpoints
|
54 |
+
|--MY_DS_EXP_NAME (optional)
|
55 |
+
|--0109_hifigan_bigpopcs_hop128 (vocoder)
|
56 |
+
|--model_ckpt_steps_1512000.ckpt
|
57 |
+
|--config.yaml
|
58 |
+
```
|
59 |
+
|
60 |
+
### 2. Training Example
|
61 |
+
```sh
|
62 |
+
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/midi/e2e/opencpop/ds100_adj_rel.yaml --exp_name $MY_DS_EXP_NAME --reset
|
63 |
+
```
|
64 |
+
|
65 |
+
### 3. Inference from packed test set
|
66 |
+
```sh
|
67 |
+
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/midi/e2e/opencpop/ds100_adj_rel.yaml --exp_name $MY_DS_EXP_NAME --reset --infer
|
68 |
+
```
|
69 |
+
|
70 |
+
We also provide:
|
71 |
+
- the pre-trained model of DiffSinger;
|
72 |
+
|
73 |
+
They can be found in [here](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/0228_opencpop_ds100_rel.zip).
|
74 |
+
|
75 |
+
Remember to put the pre-trained models in `checkpoints` directory.
|
76 |
+
|
77 |
+
### 4. Inference from raw inputs
|
78 |
+
```sh
|
79 |
+
python inference/svs/ds_e2e.py --config usr/configs/midi/e2e/opencpop/ds100_adj_rel.yaml --exp_name $MY_DS_EXP_NAME
|
80 |
+
```
|
81 |
+
Raw inputs:
|
82 |
+
```
|
83 |
+
inp = {
|
84 |
+
'text': '小酒窝长睫毛AP是你最美的记号',
|
85 |
+
'notes': 'C#4/Db4 | F#4/Gb4 | G#4/Ab4 | A#4/Bb4 F#4/Gb4 | F#4/Gb4 C#4/Db4 | C#4/Db4 | rest | C#4/Db4 | A#4/Bb4 | G#4/Ab4 | A#4/Bb4 | G#4/Ab4 | F4 | C#4/Db4',
|
86 |
+
'notes_duration': '0.407140 | 0.376190 | 0.242180 | 0.509550 0.183420 | 0.315400 0.235020 | 0.361660 | 0.223070 | 0.377270 | 0.340550 | 0.299620 | 0.344510 | 0.283770 | 0.323390 | 0.360340',
|
87 |
+
'input_type': 'word'
|
88 |
+
} # user input: Chinese characters
|
89 |
+
or,
|
90 |
+
inp = {
|
91 |
+
'text': '小酒窝长睫毛AP是你最美的记号',
|
92 |
+
'ph_seq': 'x iao j iu w o ch ang ang j ie ie m ao AP sh i n i z ui m ei d e j i h ao',
|
93 |
+
'note_seq': 'C#4/Db4 C#4/Db4 F#4/Gb4 F#4/Gb4 G#4/Ab4 G#4/Ab4 A#4/Bb4 A#4/Bb4 F#4/Gb4 F#4/Gb4 F#4/Gb4 C#4/Db4 C#4/Db4 C#4/Db4 rest C#4/Db4 C#4/Db4 A#4/Bb4 A#4/Bb4 G#4/Ab4 G#4/Ab4 A#4/Bb4 A#4/Bb4 G#4/Ab4 G#4/Ab4 F4 F4 C#4/Db4 C#4/Db4',
|
94 |
+
'note_dur_seq': '0.407140 0.407140 0.376190 0.376190 0.242180 0.242180 0.509550 0.509550 0.183420 0.315400 0.315400 0.235020 0.361660 0.361660 0.223070 0.377270 0.377270 0.340550 0.340550 0.299620 0.299620 0.344510 0.344510 0.283770 0.283770 0.323390 0.323390 0.360340 0.360340',
|
95 |
+
'is_slur_seq': '0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0',
|
96 |
+
'input_type': 'phoneme'
|
97 |
+
} # input like Opencpop dataset.
|
98 |
+
```
|
99 |
+
|
100 |
+
### 5. Some issues.
|
101 |
+
a) the HifiGAN-Singing is trained on our [vocoder dataset](https://dl.acm.org/doi/abs/10.1145/3474085.3475437) and the training set of [PopCS](https://arxiv.org/abs/2105.02446). Opencpop is the out-of-domain dataset (unseen speaker). This may cause the deterioration of audio quality, and we are considering fine-tuning this vocoder on the training set of Opencpop.
|
102 |
+
|
103 |
+
b) in this version of codes, we used the melody frontend ([lyric + MIDI]->[ph_dur]) to predict phoneme duration. F0 curve is implicitly predicted together with mel-spectrogram.
|
104 |
+
|
105 |
+
c) example [generated audio](https://github.com/MoonInTheRiver/DiffSinger/blob/master/resources/demos_0221/DS/).
|
106 |
+
More generated audio demos can be found in [DiffSinger](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/0228_opencpop_ds100_rel.zip).
|
docs/README-SVS-popcs.md
ADDED
@@ -0,0 +1,63 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
## DiffSinger (SVS version)
|
2 |
+
|
3 |
+
### 0. Data Acquirement
|
4 |
+
- See in [apply_form](https://github.com/MoonInTheRiver/DiffSinger/blob/master/resources/apply_form.md).
|
5 |
+
- Dataset [preview](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/popcs_preview.zip).
|
6 |
+
|
7 |
+
### 1. Preparation
|
8 |
+
#### Data Preparation
|
9 |
+
a) Download and extract PopCS, then create a link to the dataset folder: `ln -s /xxx/popcs/ data/processed/popcs`
|
10 |
+
|
11 |
+
b) Run the following scripts to pack the dataset for training/inference.
|
12 |
+
```sh
|
13 |
+
export PYTHONPATH=.
|
14 |
+
CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config usr/configs/popcs_ds_beta6.yaml
|
15 |
+
# `data/binary/popcs-pmf0` will be generated.
|
16 |
+
```
|
17 |
+
|
18 |
+
#### Vocoder Preparation
|
19 |
+
We provide the pre-trained model of [HifiGAN-Singing](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/0109_hifigan_bigpopcs_hop128.zip) which is specially designed for SVS with NSF mechanism.
|
20 |
+
Please unzip this file into `checkpoints` before training your acoustic model.
|
21 |
+
|
22 |
+
(Update: You can also move [a ckpt with more training steps](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/model_ckpt_steps_1512000.ckpt) into this vocoder directory)
|
23 |
+
|
24 |
+
This singing vocoder is trained on ~70 hours singing data, which can be viewed as a universal vocoder.
|
25 |
+
|
26 |
+
### 2. Training Example
|
27 |
+
First, you need a pre-trained FFT-Singer checkpoint. You can use the [pre-trained model](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/popcs_fs2_pmf0_1230.zip), or train FFT-Singer from scratch, run:
|
28 |
+
|
29 |
+
```sh
|
30 |
+
# First, train fft-singer;
|
31 |
+
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_fs2.yaml --exp_name popcs_fs2_pmf0_1230 --reset
|
32 |
+
# Then, infer fft-singer;
|
33 |
+
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_fs2.yaml --exp_name popcs_fs2_pmf0_1230 --reset --infer
|
34 |
+
```
|
35 |
+
|
36 |
+
Then, to train DiffSinger, run:
|
37 |
+
```sh
|
38 |
+
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_ds_beta6_offline.yaml --exp_name popcs_ds_beta6_offline_pmf0_1230 --reset
|
39 |
+
```
|
40 |
+
|
41 |
+
Remember to adjust the "fs2_ckpt" parameter in `usr/configs/popcs_ds_beta6_offline.yaml` to fit your path.
|
42 |
+
|
43 |
+
### 3. Inference Example
|
44 |
+
```sh
|
45 |
+
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_ds_beta6_offline.yaml --exp_name popcs_ds_beta6_offline_pmf0_1230 --reset --infer
|
46 |
+
```
|
47 |
+
|
48 |
+
We also provide:
|
49 |
+
- the pre-trained model of [DiffSinger](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/popcs_ds_beta6_offline_pmf0_1230.zip);
|
50 |
+
- the pre-trained model of [FFT-Singer](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/popcs_fs2_pmf0_1230.zip) for the shallow diffusion mechanism in DiffSinger;
|
51 |
+
|
52 |
+
Remember to put the pre-trained models in `checkpoints` directory.
|
53 |
+
|
54 |
+
*Note that:*
|
55 |
+
|
56 |
+
- *the original PWG version vocoder in the paper we used has been put into commercial use, so we provide this HifiGAN version vocoder as a substitute.*
|
57 |
+
- *we assume the ground-truth F0 to be given as the pitch information following [1][2][3]. If you want to conduct experiments on MIDI data, you need an external F0 predictor (like [MIDI-old-version](README-SVS-opencpop-cascade.md)) or a joint prediction with spectrograms(like [MIDI-new-version](README-SVS-opencpop-e2e.md)).*
|
58 |
+
|
59 |
+
[1] Adversarially trained multi-singer sequence-to-sequence singing synthesizer. Interspeech 2020.
|
60 |
+
|
61 |
+
[2] SEQUENCE-TO-SEQUENCE SINGING SYNTHESIS USING THE FEED-FORWARD TRANSFORMER. ICASSP 2020.
|
62 |
+
|
63 |
+
[3] DeepSinger : Singing Voice Synthesis with Data Mined From the Web. KDD 2020.
|
docs/README-SVS.md
ADDED
@@ -0,0 +1,44 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
## DiffSinger (SVS version)
|
2 |
+
|
3 |
+
### PART1. [Run DiffSinger on PopCS](README-SVS-popcs.md)
|
4 |
+
In this part, we only focus on spectrum modeling (acoustic model) and assume the ground-truth (GT) F0 to be given as the pitch information following these papers [1][2][3].
|
5 |
+
|
6 |
+
Thus, the pipeline of this part can be summarized as:
|
7 |
+
|
8 |
+
```
|
9 |
+
[lyrics] -> [linguistic representation] (Frontend)
|
10 |
+
[linguistic representation] + [GT F0] + [GT phoneme duration] -> [mel-spectrogram] (Acoustic model)
|
11 |
+
[mel-spectrogram] + [GT F0] -> [waveform] (Vocoder)
|
12 |
+
```
|
13 |
+
|
14 |
+
|
15 |
+
[1] Adversarially trained multi-singer sequence-to-sequence singing synthesizer. Interspeech 2020.
|
16 |
+
|
17 |
+
[2] SEQUENCE-TO-SEQUENCE SINGING SYNTHESIS USING THE FEED-FORWARD TRANSFORMER. ICASSP 2020.
|
18 |
+
|
19 |
+
[3] DeepSinger : Singing Voice Synthesis with Data Mined From the Web. KDD 2020.
|
20 |
+
|
21 |
+
### PART2. [Run DiffSinger on Opencpop](README-SVS-opencpop-cascade.md)
|
22 |
+
Thanks [Opencpop team](https://wenet.org.cn/opencpop/) for releasing their SVS dataset with MIDI label, **Jan.20, 2022**. (Also thanks to my co-author [Yi Ren](https://github.com/RayeRen), who applied for the dataset and did some preprocessing works for this part).
|
23 |
+
|
24 |
+
Since there are elaborately annotated MIDI labels, we are able to supplement the pipeline in PART 1 by adding a naive melody frontend.
|
25 |
+
|
26 |
+
#### 2.1
|
27 |
+
Thus, the pipeline of [this part](README-SVS-opencpop-cascade.md) can be summarized as:
|
28 |
+
|
29 |
+
```
|
30 |
+
[lyrics] + [MIDI] -> [linguistic representation (with MIDI information)] + [predicted F0] + [predicted phoneme duration] (Melody frontend)
|
31 |
+
[linguistic representation] + [predicted F0] + [predicted phoneme duration] -> [mel-spectrogram] (Acoustic model)
|
32 |
+
[mel-spectrogram] + [predicted F0] -> [waveform] (Vocoder)
|
33 |
+
```
|
34 |
+
|
35 |
+
#### 2.2
|
36 |
+
In 2.1, we find that if we predict F0 explicitly in the melody frontend, there will be many bad cases of uv/v prediction. Then, we abandon the explicit prediction of the F0 curve in the melody frontend but make a joint prediction with spectrograms.
|
37 |
+
|
38 |
+
Thus, the pipeline of [this part](README-SVS-opencpop-e2e.md) can be summarized as:
|
39 |
+
```
|
40 |
+
[lyrics] + [MIDI] -> [linguistic representation] + [predicted phoneme duration] (Melody frontend)
|
41 |
+
[linguistic representation (with MIDI information)] + [predicted phoneme duration] -> [mel-spectrogram] (Acoustic model)
|
42 |
+
[mel-spectrogram] -> [predicted F0] (Pitch extractor)
|
43 |
+
[mel-spectrogram] + [predicted F0] -> [waveform] (Vocoder)
|
44 |
+
```
|
docs/README-TTS.md
ADDED
@@ -0,0 +1,63 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
## DiffSpeech (TTS version)
|
2 |
+
### 1. Preparation
|
3 |
+
|
4 |
+
#### Data Preparation
|
5 |
+
a) Download and extract the [LJ Speech dataset](https://keithito.com/LJ-Speech-Dataset/), then create a link to the dataset folder: `ln -s /xxx/LJSpeech-1.1/ data/raw/`
|
6 |
+
|
7 |
+
b) Download and Unzip the [ground-truth duration](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/mfa_outputs.tar) extracted by [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/download/v1.0.1/montreal-forced-aligner_linux.tar.gz): `tar -xvf mfa_outputs.tar; mv mfa_outputs data/processed/ljspeech/`
|
8 |
+
|
9 |
+
c) Run the following scripts to pack the dataset for training/inference.
|
10 |
+
|
11 |
+
```sh
|
12 |
+
export PYTHONPATH=.
|
13 |
+
CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config configs/tts/lj/fs2.yaml
|
14 |
+
|
15 |
+
# `data/binary/ljspeech` will be generated.
|
16 |
+
```
|
17 |
+
|
18 |
+
#### Vocoder Preparation
|
19 |
+
We provide the pre-trained model of [HifiGAN](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/0414_hifi_lj_1.zip) vocoder.
|
20 |
+
Please unzip this file into `checkpoints` before training your acoustic model.
|
21 |
+
|
22 |
+
### 2. Training Example
|
23 |
+
|
24 |
+
First, you need a pre-trained FastSpeech2 checkpoint. You can use the [pre-trained model](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/fs2_lj_1.zip), or train FastSpeech2 from scratch, run:
|
25 |
+
```sh
|
26 |
+
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config configs/tts/lj/fs2.yaml --exp_name fs2_lj_1 --reset
|
27 |
+
```
|
28 |
+
Then, to train DiffSpeech, run:
|
29 |
+
```sh
|
30 |
+
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/lj_ds_beta6.yaml --exp_name lj_ds_beta6_1213 --reset
|
31 |
+
```
|
32 |
+
|
33 |
+
Remember to adjust the "fs2_ckpt" parameter in `usr/configs/lj_ds_beta6.yaml` to fit your path.
|
34 |
+
|
35 |
+
### 3. Inference Example
|
36 |
+
|
37 |
+
```sh
|
38 |
+
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/lj_ds_beta6.yaml --exp_name lj_ds_beta6_1213 --reset --infer
|
39 |
+
```
|
40 |
+
|
41 |
+
We also provide:
|
42 |
+
- the pre-trained model of [DiffSpeech](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/lj_ds_beta6_1213.zip);
|
43 |
+
- the individual pre-trained model of [FastSpeech 2](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/fs2_lj_1.zip) for the shallow diffusion mechanism in DiffSpeech;
|
44 |
+
|
45 |
+
Remember to put the pre-trained models in `checkpoints` directory.
|
46 |
+
|
47 |
+
## Mel Visualization
|
48 |
+
Along vertical axis, DiffSpeech: [0-80]; FastSpeech2: [80-160].
|
49 |
+
|
50 |
+
<table style="width:100%">
|
51 |
+
<tr>
|
52 |
+
<th>DiffSpeech vs. FastSpeech 2</th>
|
53 |
+
</tr>
|
54 |
+
<tr>
|
55 |
+
<td><img src="resources/diffspeech-fs2.png" alt="DiffSpeech-vs-FastSpeech2" height="250"></td>
|
56 |
+
</tr>
|
57 |
+
<tr>
|
58 |
+
<td><img src="resources/diffspeech-fs2-1.png" alt="DiffSpeech-vs-FastSpeech2" height="250"></td>
|
59 |
+
</tr>
|
60 |
+
<tr>
|
61 |
+
<td><img src="resources/diffspeech-fs2-2.png" alt="DiffSpeech-vs-FastSpeech2" height="250"></td>
|
62 |
+
</tr>
|
63 |
+
</table>
|
docs/README-zh.md
ADDED
@@ -0,0 +1,212 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
# DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
|
2 |
+
[![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2105.02446)
|
3 |
+
[![GitHub Stars](https://img.shields.io/github/stars/MoonInTheRiver/DiffSinger?style=social)](https://github.com/MoonInTheRiver/DiffSinger)
|
4 |
+
[![downloads](https://img.shields.io/github/downloads/MoonInTheRiver/DiffSinger/total.svg)](https://github.com/MoonInTheRiver/DiffSinger/releases)
|
5 |
+
| [![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-blue)](https://huggingface.co/spaces/NATSpeech/DiffSpeech)
|
6 |
+
| [English README](../README.md)
|
7 |
+
|
8 |
+
本仓库包含了我们的AAAI-2022 [论文](https://arxiv.org/abs/2105.02446)中提出的DiffSpeech (用于语音合成) 与 DiffSinger (用于歌声合成) 的官方Pytorch实现。
|
9 |
+
|
10 |
+
<table style="width:100%">
|
11 |
+
<tr>
|
12 |
+
<th>DiffSinger/DiffSpeech训练阶段</th>
|
13 |
+
<th>DiffSinger/DiffSpeech推理阶段</th>
|
14 |
+
</tr>
|
15 |
+
<tr>
|
16 |
+
<td><img src="resources/model_a.png" alt="Training" height="300"></td>
|
17 |
+
<td><img src="resources/model_b.png" alt="Inference" height="300"></td>
|
18 |
+
</tr>
|
19 |
+
</table>
|
20 |
+
|
21 |
+
:tada: :tada: :tada: **一些重要更新**:
|
22 |
+
- Mar.2, 2022: [MIDI-新版](README-SVS-opencpop-e2e.md): 重大更新 :sparkles:
|
23 |
+
- Mar.1, 2022: [NeuralSVB](https://github.com/MoonInTheRiver/NeuralSVB), 为了歌声美化任务的代码,开源了 :sparkles: :sparkles: :sparkles: .
|
24 |
+
- Feb.13, 2022: [NATSpeech](https://github.com/NATSpeech/NATSpeech), 一个升级后的代码框架, 包含了DiffSpeech和我们NeurIPS-2021的工作[PortaSpeech](https://openreview.net/forum?id=xmJsuh8xlq) 已经开源! :sparkles: :sparkles: :sparkles:.
|
25 |
+
- Jan.29, 2022: 支持了[MIDI-旧版](README-SVS-opencpop-cascade.md) 版本的歌声合成系统.
|
26 |
+
- Jan.13, 2022: 支持了歌声合成系统, 开源了PopCS数据集.
|
27 |
+
- Dec.19, 2021: 支持了语音合成系统. [HuggingFace🤗 Demo](https://huggingface.co/spaces/NATSpeech/DiffSpeech)
|
28 |
+
|
29 |
+
:rocket: **新闻**:
|
30 |
+
- Feb.24, 2022: 我们的新工作`NeuralSVB` 被 ACL-2022 接收 [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2202.13277). [音频演示](https://neuralsvb.github.io).
|
31 |
+
- Dec.01, 2021: DiffSinger was accepted by AAAI-2022.
|
32 |
+
- Sep.29, 2021: Our new work `PortaSpeech: Portable and High-Quality Generative Text-to-Speech` was accepted by NeurIPS-2021 [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2109.15166).
|
33 |
+
- May.06, 2021: We submitted the DiffSinger paper to the public preprint server Arxiv [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2105.02446).
|
34 |
+
|
35 |
+
## Installing dependencies
|
36 |
+
```sh
|
37 |
+
conda create -n your_env_name python=3.8
|
38 |
+
source activate your_env_name
|
39 |
+
pip install -r requirements_2080.txt (GPU 2080Ti, CUDA 10.2)
|
40 |
+
or pip install -r requirements_3090.txt (GPU 3090, CUDA 11.4)
|
41 |
+
```
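
A quick way to confirm the environment matches the GPU/CUDA assumptions above (a convenience sketch, not part of the repository):

```python
# check_env.py -- minimal environment sanity check (not part of the repo)
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```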
|
42 |
+
|
43 |
+
## DiffSpeech (the TTS version)
|
44 |
+
### 1. Preparation
|
45 |
+
|
46 |
+
#### Data preparation
|
47 |
+
a) Download and extract the [LJ Speech dataset](https://keithito.com/LJ-Speech-Dataset/), then create a soft link: `ln -s /xxx/LJSpeech-1.1/ data/raw/`
|
48 |
+
|
49 |
+
b) Download and extract [the alignments we pre-computed with MFA](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/mfa_outputs.tar): `tar -xvf mfa_outputs.tar; mv mfa_outputs data/processed/ljspeech/`
|
50 |
+
|
51 |
+
c) Pack the dataset with the following script; the resulting binary files are used for the subsequent training and inference.
|
52 |
+
|
53 |
+
```sh
|
54 |
+
export PYTHONPATH=.
|
55 |
+
CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config configs/tts/lj/fs2.yaml
|
56 |
+
|
57 |
+
# `data/binary/ljspeech` will be generated.
|
58 |
+
```
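
An optional sanity check after binarization (not part of the repository; the path is the one stated above):

```python
# Confirm that the binarized LJSpeech folder was created and is non-empty.
import os

binary_dir = 'data/binary/ljspeech'
assert os.path.isdir(binary_dir), f'{binary_dir} was not generated'
print(f'{binary_dir} contains {len(os.listdir(binary_dir))} files:')
for name in sorted(os.listdir(binary_dir)):
    print(' ', name)
```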
|
59 |
+
|
60 |
+
#### Vocoder preparation
|
61 |
+
We provide a pre-trained [HifiGAN](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/0414_hifi_lj_1.zip) vocoder.
|
62 |
+
Please unzip the vocoder files into `checkpoints` before training the acoustic model.
|
63 |
+
|
64 |
+
### 2. Training example
|
65 |
+
|
66 |
+
First you need a pre-trained FastSpeech2 checkpoint. You can use [our pre-trained model](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/fs2_lj_1.zip), or run the following command to train FastSpeech2 from scratch:
|
67 |
+
```sh
|
68 |
+
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config configs/tts/lj/fs2.yaml --exp_name fs2_lj_1 --reset
|
69 |
+
```
|
70 |
+
Then, to train DiffSpeech, run:
|
71 |
+
```sh
|
72 |
+
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/lj_ds_beta6.yaml --exp_name lj_ds_beta6_1213 --reset
|
73 |
+
```
|
74 |
+
|
75 |
+
Remember to adjust the `fs2_ckpt` parameter in `usr/configs/lj_ds_beta6.yaml` to your own path (a quick check is sketched below).
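
A minimal sketch to verify that path before launching training, assuming PyYAML is installed and `fs2_ckpt` is set directly in this config file:

```python
# check_fs2_ckpt.py -- verify the "fs2_ckpt" entry points to an existing file
# (a convenience sketch, not part of the repository)
import os
import yaml

cfg_path = 'usr/configs/lj_ds_beta6.yaml'
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

fs2_ckpt = cfg.get('fs2_ckpt', '')          # assumed to be a top-level key in this file
print('fs2_ckpt =', fs2_ckpt)
assert fs2_ckpt and os.path.exists(fs2_ckpt), 'Please point fs2_ckpt to your FastSpeech2 checkpoint'
```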
|
76 |
+
|
77 |
+
### 3. Inference example
|
78 |
+
|
79 |
+
```sh
|
80 |
+
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/lj_ds_beta6.yaml --exp_name lj_ds_beta6_1213 --reset --infer
|
81 |
+
```
|
82 |
+
|
83 |
+
We also provide:
|
84 |
+
- the pre-trained model of [DiffSpeech](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/lj_ds_beta6_1213.zip);
|
85 |
+
- the pre-trained model of [FastSpeech 2](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/fs2_lj_1.zip), which is used for the shallow diffusion mechanism in DiffSpeech (a toy sketch of this idea follows below);
|
86 |
+
|
87 |
+
Remember to put the pre-trained models in the `checkpoints` directory.
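
The FastSpeech 2 checkpoint is only needed because of the shallow diffusion mechanism. The following is a toy, self-contained illustration of that idea, not the repository's implementation (the real model lives in `usr/diff/shallow_diffusion_tts.py`): instead of running the reverse diffusion from pure noise for all T steps, the auxiliary decoder's mel is diffused forward to an intermediate step K and the reverse process starts there.

```python
# Toy illustration of shallow diffusion; shapes, schedule and the denoiser are placeholders.
import torch

T, K = 100, 51                              # total diffusion steps, shallow start step (K < T)
betas = torch.linspace(1e-4, 0.06, T)
alphas_cumprod = torch.cumprod(1 - betas, dim=0)

def q_sample(x0, t, noise):
    """Diffuse clean data x0 forward to step t."""
    a = alphas_cumprod[t]
    return a.sqrt() * x0 + (1 - a).sqrt() * noise

fs2_mel = torch.randn(1, 80, 500)                         # stand-in for the FastSpeech 2 mel
x = q_sample(fs2_mel, K - 1, torch.randn_like(fs2_mel))   # start point of the reverse process

for t in reversed(range(K)):                # only K (instead of T) reverse steps are needed
    predicted_noise = torch.zeros_like(x)   # placeholder for the denoiser network
    a, b = alphas_cumprod[t], betas[t]
    x = (x - b / (1 - a).sqrt() * predicted_noise) / (1 - b).sqrt()
print('denoised mel shape:', x.shape)
```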
|
88 |
+
|
89 |
+
## DiffSinger (the SVS version)
|
90 |
+
|
91 |
+
### 0. Data access
|
92 |
+
- See the [application form](https://github.com/MoonInTheRiver/DiffSinger/blob/master/resources/apply_form.md).
|
93 |
+
- Dataset [preview](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/popcs_preview.zip).
|
94 |
+
|
95 |
+
### 1. Preparation
|
96 |
+
#### Data preparation
|
97 |
+
a) Download and extract PopCS, then create a soft link: `ln -s /xxx/popcs/ data/processed/popcs`
|
98 |
+
|
99 |
+
b) Pack the dataset with the following script; the resulting binary files are used for the subsequent training and inference.
|
100 |
+
```sh
|
101 |
+
export PYTHONPATH=.
|
102 |
+
CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config usr/configs/popcs_ds_beta6.yaml
|
103 |
+
# `data/binary/popcs-pmf0` will be generated.
|
104 |
+
```
|
105 |
+
|
106 |
+
#### Vocoder preparation
|
107 |
+
We provide a pre-trained [HifiGAN-Singing](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/0109_hifigan_bigpopcs_hop128.zip) model, which is designed specifically for singing voice synthesis and adopts the NSF technique.
|
108 |
+
Please unzip the vocoder files into `checkpoints` before training the acoustic model.
|
109 |
+
|
110 |
+
(Update: you can also put [this checkpoint trained for more steps](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/model_ckpt_steps_1512000.ckpt) into the vocoder folder.)
|
111 |
+
|
112 |
+
This vocoder was trained on a larger dataset of roughly 70 hours, so it can be regarded as a universal vocoder.
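
For reference, inference later picks the newest checkpoint in the vocoder folder automatically (see `build_vocoder` in `inference/svs/base_svs_infer.py`). A standalone sketch of that selection logic, assuming the zip above extracts to the folder name shown:

```python
# Sketch: locate the newest vocoder checkpoint inside the checkpoints folder
# (mirrors the glob/regex logic used by build_vocoder in inference/svs/base_svs_infer.py).
import glob
import re

base_dir = 'checkpoints/0109_hifigan_bigpopcs_hop128'   # assumed extraction folder name
ckpts = sorted(
    glob.glob(f'{base_dir}/model_ckpt_steps_*.ckpt'),
    key=lambda x: int(re.findall(r'model_ckpt_steps_(\d+)\.ckpt', x)[0]))
print('latest vocoder checkpoint:', ckpts[-1] if ckpts else 'none found')
```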
|
113 |
+
|
114 |
+
### 2. Training example
|
115 |
+
First you need a pre-trained FFT-Singer. You can use [our pre-trained model](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/popcs_fs2_pmf0_1230.zip), or train FFT-Singer from scratch with the following script:
|
116 |
+
|
117 |
+
```sh
|
118 |
+
# First, train fft-singer;
|
119 |
+
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_fs2.yaml --exp_name popcs_fs2_pmf0_1230 --reset
|
120 |
+
# Then, infer fft-singer;
|
121 |
+
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_fs2.yaml --exp_name popcs_fs2_pmf0_1230 --reset --infer
|
122 |
+
```
|
123 |
+
|
124 |
+
Then, to train DiffSinger, run:
|
125 |
+
```sh
|
126 |
+
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_ds_beta6_offline.yaml --exp_name popcs_ds_beta6_offline_pmf0_1230 --reset
|
127 |
+
```
|
128 |
+
|
129 |
+
Remember to adjust the `fs2_ckpt` parameter in `usr/configs/popcs_ds_beta6_offline.yaml` to your own path.
|
130 |
+
|
131 |
+
### 3. Inference example
|
132 |
+
```sh
|
133 |
+
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_ds_beta6_offline.yaml --exp_name popcs_ds_beta6_offline_pmf0_1230 --reset --infer
|
134 |
+
```
|
135 |
+
|
136 |
+
We also provide:
|
137 |
+
- the pre-trained model of [DiffSinger](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/popcs_ds_beta6_offline_pmf0_1230.zip);
|
138 |
+
- the pre-trained model of [FFT-Singer](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/popcs_fs2_pmf0_1230.zip), which is used for the shallow diffusion mechanism in DiffSinger;
|
139 |
+
|
140 |
+
Remember to put the pre-trained models in the `checkpoints` directory.
|
141 |
+
|
142 |
+
*Please note:*
|
143 |
+
|
144 |
+
- *The PWG version of the vocoder in our original paper has been put to commercial use, so we provide this HifiGAN version of the vocoder as a substitute.*
|
145 |
+
|
146 |
+
- *This paper assumes that ground-truth F0 is provided for the experiments, as previous works [1][2][3] did, since the focus is on spectrum modeling rather than F0-contour prediction. If you want to experiment on MIDI data and predict the F0 contour from MIDI and lyrics (explicitly or implicitly), please check the documents [MIDI-old-version](README-SVS-opencpop-cascade.md) or [MIDI-new-version](README-SVS-opencpop-e2e.md). The MIDI dataset supported so far: Opencpop. A minimal F0-extraction sketch is given after the references below.*
|
147 |
+
|
148 |
+
[1] Adversarially trained multi-singer sequence-to-sequence singing synthesizer. Interspeech 2020.
|
149 |
+
|
150 |
+
[2] SEQUENCE-TO-SEQUENCE SINGING SYNTHESIS USING THE FEED-FORWARD TRANSFORMER. ICASSP 2020.
|
151 |
+
|
152 |
+
[3] DeepSinger : Singing Voice Synthesis with Data Mined From the Web. KDD 2020.
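
Since the paper assumes ground-truth F0, here is a minimal sketch of extracting an F0 contour from a wav with librosa's pyin. The repository's own preprocessing may use a different extractor; the file name, sample rate, hop size and frequency range here are illustrative only.

```python
# Minimal F0 extraction sketch with librosa.pyin (illustrative parameters only).
import librosa
import numpy as np

wav, sr = librosa.load('example.wav', sr=24000)   # 'example.wav' is a placeholder file
f0, voiced_flag, _ = librosa.pyin(
    wav,
    fmin=librosa.note_to_hz('C2'),
    fmax=librosa.note_to_hz('C7'),
    sr=sr,
    hop_length=128)
f0 = np.nan_to_num(f0)                            # unvoiced frames -> 0
print('frames:', len(f0), 'voiced ratio:', voiced_flag.mean())
```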
|
153 |
+
|
154 |
+
## Tensorboard
|
155 |
+
```sh
|
156 |
+
tensorboard --logdir_spec exp_name
|
157 |
+
```
|
158 |
+
<table style="width:100%">
|
159 |
+
<tr>
|
160 |
+
<td><img src="resources/tfb.png" alt="Tensorboard" height="250"></td>
|
161 |
+
</tr>
|
162 |
+
</table>
|
163 |
+
|
164 |
+
## Mel visualization
|
165 |
+
Along the vertical axis, DiffSpeech occupies mel bins [0-80] and FastSpeech 2 occupies bins [80-160]. (A plotting sketch for producing such stacked comparisons follows the table below.)
|
166 |
+
|
167 |
+
<table style="width:100%">
|
168 |
+
<tr>
|
169 |
+
<th>DiffSpeech vs. FastSpeech 2</th>
|
170 |
+
</tr>
|
171 |
+
<tr>
|
172 |
+
<td><img src="resources/diffspeech-fs2.png" alt="DiffSpeech-vs-FastSpeech2" height="250"></td>
|
173 |
+
</tr>
|
174 |
+
<tr>
|
175 |
+
<td><img src="resources/diffspeech-fs2-1.png" alt="DiffSpeech-vs-FastSpeech2" height="250"></td>
|
176 |
+
</tr>
|
177 |
+
<tr>
|
178 |
+
<td><img src="resources/diffspeech-fs2-2.png" alt="DiffSpeech-vs-FastSpeech2" height="250"></td>
|
179 |
+
</tr>
|
180 |
+
</table>
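
The stacked figures above can be reproduced, roughly, by concatenating the two 80-bin mel spectrograms along the frequency axis before plotting. A sketch with matplotlib, assuming the mels are available as `[T, 80]` numpy arrays (random placeholders here):

```python
# Sketch for plotting a DiffSpeech-vs-FastSpeech2 mel comparison in one image:
# DiffSpeech fills mel bins [0, 80), FastSpeech 2 fills bins [80, 160).
import numpy as np
import matplotlib.pyplot as plt

diffspeech_mel = np.random.randn(400, 80)    # placeholder for the real [T, 80] mel
fastspeech2_mel = np.random.randn(400, 80)   # placeholder for the real [T, 80] mel

stacked = np.concatenate([diffspeech_mel, fastspeech2_mel], axis=1)  # [T, 160]
plt.figure(figsize=(10, 4))
plt.imshow(stacked.T, origin='lower', aspect='auto')
plt.ylabel('mel bin (0-80: DiffSpeech, 80-160: FastSpeech 2)')
plt.xlabel('frame')
plt.savefig('diffspeech-fs2.png')
```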
|
181 |
+
|
182 |
+
## Audio Demos
|
183 |
+
Audio samples can be found on our [demo page](https://diffsinger.github.io/).
|
184 |
+
|
185 |
+
We also put some test-set audio samples generated by DiffSpeech+HifiGAN (marked [P]) and GTmel+HifiGAN (marked [G]) in [resources/demos_1213](../resources/demos_1213).
|
186 |
+
|
187 |
+
(corresponding to this pre-trained checkpoint: [DiffSpeech](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/lj_ds_beta6_1213.zip))
|
188 |
+
|
189 |
+
---
|
190 |
+
:rocket: :rocket: :rocket: **Update:**
|
191 |
+
|
192 |
+
Newly generated singing samples are in [resources/demos_0112](../resources/demos_0112).
|
193 |
+
|
194 |
+
## Citation
|
195 |
+
If this repository is useful for your research or work, please cite the following paper:
|
196 |
+
|
197 |
+
@article{liu2021diffsinger,
|
198 |
+
title={Diffsinger: Singing voice synthesis via shallow diffusion mechanism},
|
199 |
+
author={Liu, Jinglin and Li, Chengxi and Ren, Yi and Chen, Feiyang and Liu, Peng and Zhao, Zhou},
|
200 |
+
journal={arXiv preprint arXiv:2105.02446},
|
201 |
+
volume={2},
|
202 |
+
year={2021}}
|
203 |
+
|
204 |
+
|
205 |
+
## Acknowledgements
|
206 |
+
Our code is based on the following repositories:
|
207 |
+
* [denoising-diffusion-pytorch](https://github.com/lucidrains/denoising-diffusion-pytorch)
|
208 |
+
* [PyTorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning)
|
209 |
+
* [ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN)
|
210 |
+
* [HifiGAN](https://github.com/jik876/hifi-gan)
|
211 |
+
* [espnet](https://github.com/espnet/espnet)
|
212 |
+
* [DiffWave](https://github.com/lmnt-com/diffwave)
|
inference/svs/base_svs_infer.py
ADDED
@@ -0,0 +1,265 @@
1 |
+
import os
|
2 |
+
|
3 |
+
import torch
|
4 |
+
import numpy as np
|
5 |
+
from modules.hifigan.hifigan import HifiGanGenerator
|
6 |
+
from vocoders.hifigan import HifiGAN
|
7 |
+
from inference.svs.opencpop.map import cpop_pinyin2ph_func
|
8 |
+
|
9 |
+
from utils import load_ckpt
|
10 |
+
from utils.hparams import set_hparams, hparams
|
11 |
+
from utils.text_encoder import TokenTextEncoder
|
12 |
+
from pypinyin import pinyin, lazy_pinyin, Style
|
13 |
+
import librosa
|
14 |
+
import glob
|
15 |
+
import re
|
16 |
+
|
17 |
+
|
18 |
+
class BaseSVSInfer:
|
19 |
+
def __init__(self, hparams, device=None):
|
20 |
+
if device is None:
|
21 |
+
device = 'cuda' if torch.cuda.is_available() else 'cpu'
|
22 |
+
self.hparams = hparams
|
23 |
+
self.device = device
|
24 |
+
|
25 |
+
phone_list = ["AP", "SP", "a", "ai", "an", "ang", "ao", "b", "c", "ch", "d", "e", "ei", "en", "eng", "er", "f", "g",
|
26 |
+
"h", "i", "ia", "ian", "iang", "iao", "ie", "in", "ing", "iong", "iu", "j", "k", "l", "m", "n", "o",
|
27 |
+
"ong", "ou", "p", "q", "r", "s", "sh", "t", "u", "ua", "uai", "uan", "uang", "ui", "un", "uo", "v",
|
28 |
+
"van", "ve", "vn", "w", "x", "y", "z", "zh"]
|
29 |
+
self.ph_encoder = TokenTextEncoder(None, vocab_list=phone_list, replace_oov=',')
|
30 |
+
self.pinyin2phs = cpop_pinyin2ph_func()
|
31 |
+
self.spk_map = {'opencpop': 0}
|
32 |
+
|
33 |
+
self.model = self.build_model()
|
34 |
+
self.model.eval()
|
35 |
+
self.model.to(self.device)
|
36 |
+
self.vocoder = self.build_vocoder()
|
37 |
+
self.vocoder.eval()
|
38 |
+
self.vocoder.to(self.device)
|
39 |
+
|
40 |
+
def build_model(self):
|
41 |
+
raise NotImplementedError
|
42 |
+
|
43 |
+
def forward_model(self, inp):
|
44 |
+
raise NotImplementedError
|
45 |
+
|
46 |
+
def build_vocoder(self):
|
47 |
+
base_dir = hparams['vocoder_ckpt']
|
48 |
+
config_path = f'{base_dir}/config.yaml'
|
49 |
+
ckpt = sorted(glob.glob(f'{base_dir}/model_ckpt_steps_*.ckpt'), key=
|
50 |
+
lambda x: int(re.findall(f'{base_dir}/model_ckpt_steps_(\d+).ckpt', x)[0]))[-1]
|
51 |
+
print('| load HifiGAN: ', ckpt)
|
52 |
+
ckpt_dict = torch.load(ckpt, map_location="cpu")
|
53 |
+
config = set_hparams(config_path, global_hparams=False)
|
54 |
+
state = ckpt_dict["state_dict"]["model_gen"]
|
55 |
+
vocoder = HifiGanGenerator(config)
|
56 |
+
vocoder.load_state_dict(state, strict=True)
|
57 |
+
vocoder.remove_weight_norm()
|
58 |
+
vocoder = vocoder.eval().to(self.device)
|
59 |
+
return vocoder
|
60 |
+
|
61 |
+
def run_vocoder(self, c, **kwargs):
|
62 |
+
c = c.transpose(2, 1) # [B, 80, T]
|
63 |
+
f0 = kwargs.get('f0') # [B, T]
|
64 |
+
if f0 is not None and hparams.get('use_nsf'):
|
65 |
+
# f0 = torch.FloatTensor(f0).to(self.device)
|
66 |
+
y = self.vocoder(c, f0).view(-1)
|
67 |
+
else:
|
68 |
+
y = self.vocoder(c).view(-1)
|
69 |
+
# [T]
|
70 |
+
return y[None]
|
71 |
+
|
72 |
+
def preprocess_word_level_input(self, inp):
|
73 |
+
# Pypinyin can't solve polyphonic words
|
74 |
+
text_raw = inp['text'].replace('最长', '最常').replace('长睫毛', '常睫毛') \
|
75 |
+
.replace('那么长', '那么常').replace('多长', '多常') \
|
76 |
+
.replace('很长', '很常') # We hope someone could provide a better g2p module for us by opening pull requests.
|
77 |
+
|
78 |
+
# lyric
|
79 |
+
pinyins = lazy_pinyin(text_raw, strict=False)
|
80 |
+
ph_per_word_lst = [self.pinyin2phs[pinyin.strip()] for pinyin in pinyins if pinyin.strip() in self.pinyin2phs]
|
81 |
+
|
82 |
+
# Note
|
83 |
+
note_per_word_lst = [x.strip() for x in inp['notes'].split('|') if x.strip() != '']
|
84 |
+
mididur_per_word_lst = [x.strip() for x in inp['notes_duration'].split('|') if x.strip() != '']
|
85 |
+
|
86 |
+
if len(note_per_word_lst) == len(ph_per_word_lst) == len(mididur_per_word_lst):
|
87 |
+
print('Pass word-notes check.')
|
88 |
+
else:
|
89 |
+
print('The number of words doesn\'t match the number of notes\' windows. ',
|
90 |
+
'You should split the note(s) for each word by | mark.')
|
91 |
+
print(ph_per_word_lst, note_per_word_lst, mididur_per_word_lst)
|
92 |
+
print(len(ph_per_word_lst), len(note_per_word_lst), len(mididur_per_word_lst))
|
93 |
+
return None
|
94 |
+
|
95 |
+
note_lst = []
|
96 |
+
ph_lst = []
|
97 |
+
midi_dur_lst = []
|
98 |
+
is_slur = []
|
99 |
+
for idx, ph_per_word in enumerate(ph_per_word_lst):
|
100 |
+
# for phs in one word:
|
101 |
+
# single ph like ['ai'] or multiple phs like ['n', 'i']
|
102 |
+
ph_in_this_word = ph_per_word.split()
|
103 |
+
|
104 |
+
# for notes in one word:
|
105 |
+
# single note like ['D4'] or multiple notes like ['D4', 'E4'] which means a 'slur' here.
|
106 |
+
note_in_this_word = note_per_word_lst[idx].split()
|
107 |
+
midi_dur_in_this_word = mididur_per_word_lst[idx].split()
|
108 |
+
# process for the model input
|
109 |
+
# Step 1.
|
110 |
+
# Deal with note of 'not slur' case or the first note of 'slur' case
|
111 |
+
# j ie
|
112 |
+
# F#4/Gb4 F#4/Gb4
|
113 |
+
# 0 0
|
114 |
+
for ph in ph_in_this_word:
|
115 |
+
ph_lst.append(ph)
|
116 |
+
note_lst.append(note_in_this_word[0])
|
117 |
+
midi_dur_lst.append(midi_dur_in_this_word[0])
|
118 |
+
is_slur.append(0)
|
119 |
+
# step 2.
|
120 |
+
# Deal with the 2nd, 3rd... notes of 'slur' case
|
121 |
+
# j ie ie
|
122 |
+
# F#4/Gb4 F#4/Gb4 C#4/Db4
|
123 |
+
# 0 0 1
|
124 |
+
if len(note_in_this_word) > 1: # is_slur = True, we should repeat the YUNMU to match the 2nd, 3rd... notes.
|
125 |
+
for idx in range(1, len(note_in_this_word)):
|
126 |
+
ph_lst.append(ph_in_this_word[1])
|
127 |
+
note_lst.append(note_in_this_word[idx])
|
128 |
+
midi_dur_lst.append(midi_dur_in_this_word[idx])
|
129 |
+
is_slur.append(1)
|
130 |
+
ph_seq = ' '.join(ph_lst)
|
131 |
+
|
132 |
+
if len(ph_lst) == len(note_lst) == len(midi_dur_lst):
|
133 |
+
print(len(ph_lst), len(note_lst), len(midi_dur_lst))
|
134 |
+
print('Pass word-notes check.')
|
135 |
+
else:
|
136 |
+
print('The number of words doesn\'t match the number of notes\' windows. ',
|
137 |
+
'You should split the note(s) for each word by | mark.')
|
138 |
+
return None
|
139 |
+
return ph_seq, note_lst, midi_dur_lst, is_slur
|
140 |
+
|
141 |
+
def preprocess_phoneme_level_input(self, inp):
|
142 |
+
ph_seq = inp['ph_seq']
|
143 |
+
note_lst = inp['note_seq'].split()
|
144 |
+
midi_dur_lst = inp['note_dur_seq'].split()
|
145 |
+
is_slur = inp['is_slur_seq'].split()
|
146 |
+
print(len(note_lst), len(ph_seq.split()), len(midi_dur_lst))
|
147 |
+
if len(note_lst) == len(ph_seq.split()) == len(midi_dur_lst):
|
148 |
+
print('Pass word-notes check.')
|
149 |
+
else:
|
150 |
+
print('The number of words doesn\'t match the number of notes\' windows. ',
|
151 |
+
'You should split the note(s) for each word by | mark.')
|
152 |
+
return None
|
153 |
+
return ph_seq, note_lst, midi_dur_lst, is_slur
|
154 |
+
|
155 |
+
def preprocess_input(self, inp, input_type='word'):
|
156 |
+
"""
|
157 |
+
|
158 |
+
:param inp: {'text': str, 'item_name': (str, optional), 'spk_name': (str, optional)}
|
159 |
+
:return:
|
160 |
+
"""
|
161 |
+
|
162 |
+
item_name = inp.get('item_name', '<ITEM_NAME>')
|
163 |
+
spk_name = inp.get('spk_name', 'opencpop')
|
164 |
+
|
165 |
+
# single spk
|
166 |
+
spk_id = self.spk_map[spk_name]
|
167 |
+
|
168 |
+
# get ph seq, note lst, midi dur lst, is slur lst.
|
169 |
+
if input_type == 'word':
|
170 |
+
ret = self.preprocess_word_level_input(inp)
|
171 |
+
elif input_type == 'phoneme': # like transcriptions.txt in Opencpop dataset.
|
172 |
+
ret = self.preprocess_phoneme_level_input(inp)
|
173 |
+
else:
|
174 |
+
print('Invalid input type.')
|
175 |
+
return None
|
176 |
+
|
177 |
+
if ret:
|
178 |
+
ph_seq, note_lst, midi_dur_lst, is_slur = ret
|
179 |
+
else:
|
180 |
+
print('==========> Preprocess_word_level or phone_level input wrong.')
|
181 |
+
return None
|
182 |
+
|
183 |
+
# convert note lst to midi id; convert note dur lst to midi duration
|
184 |
+
try:
|
185 |
+
midis = [librosa.note_to_midi(x.split("/")[0]) if x != 'rest' else 0
|
186 |
+
for x in note_lst]
|
187 |
+
midi_dur_lst = [float(x) for x in midi_dur_lst]
|
188 |
+
except Exception as e:
|
189 |
+
print(e)
|
190 |
+
print('Invalid Input Type.')
|
191 |
+
return None
|
192 |
+
|
193 |
+
ph_token = self.ph_encoder.encode(ph_seq)
|
194 |
+
item = {'item_name': item_name, 'text': inp['text'], 'ph': ph_seq, 'spk_id': spk_id,
|
195 |
+
'ph_token': ph_token, 'pitch_midi': np.asarray(midis), 'midi_dur': np.asarray(midi_dur_lst),
|
196 |
+
'is_slur': np.asarray(is_slur), }
|
197 |
+
item['ph_len'] = len(item['ph_token'])
|
198 |
+
return item
|
199 |
+
|
200 |
+
def input_to_batch(self, item):
|
201 |
+
item_names = [item['item_name']]
|
202 |
+
text = [item['text']]
|
203 |
+
ph = [item['ph']]
|
204 |
+
txt_tokens = torch.LongTensor(item['ph_token'])[None, :].to(self.device)
|
205 |
+
txt_lengths = torch.LongTensor([txt_tokens.shape[1]]).to(self.device)
|
206 |
+
spk_ids = torch.LongTensor(item['spk_id'])[None, :].to(self.device)
|
207 |
+
|
208 |
+
pitch_midi = torch.LongTensor(item['pitch_midi'])[None, :hparams['max_frames']].to(self.device)
|
209 |
+
midi_dur = torch.FloatTensor(item['midi_dur'])[None, :hparams['max_frames']].to(self.device)
|
210 |
+
is_slur = torch.LongTensor(item['is_slur'])[None, :hparams['max_frames']].to(self.device)
|
211 |
+
|
212 |
+
batch = {
|
213 |
+
'item_name': item_names,
|
214 |
+
'text': text,
|
215 |
+
'ph': ph,
|
216 |
+
'txt_tokens': txt_tokens,
|
217 |
+
'txt_lengths': txt_lengths,
|
218 |
+
'spk_ids': spk_ids,
|
219 |
+
'pitch_midi': pitch_midi,
|
220 |
+
'midi_dur': midi_dur,
|
221 |
+
'is_slur': is_slur
|
222 |
+
}
|
223 |
+
return batch
|
224 |
+
|
225 |
+
def postprocess_output(self, output):
|
226 |
+
return output
|
227 |
+
|
228 |
+
def infer_once(self, inp):
|
229 |
+
inp = self.preprocess_input(inp, input_type=inp['input_type'] if inp.get('input_type') else 'word')
|
230 |
+
output = self.forward_model(inp)
|
231 |
+
output = self.postprocess_output(output)
|
232 |
+
return output
|
233 |
+
|
234 |
+
@classmethod
|
235 |
+
def example_run(cls, inp):
|
236 |
+
from utils.audio import save_wav
|
237 |
+
set_hparams(print_hparams=False)
|
238 |
+
infer_ins = cls(hparams)
|
239 |
+
out = infer_ins.infer_once(inp)
|
240 |
+
os.makedirs('infer_out', exist_ok=True)
|
241 |
+
save_wav(out, f'infer_out/example_out.wav', hparams['audio_sample_rate'])
|
242 |
+
|
243 |
+
|
244 |
+
# if __name__ == '__main__':
|
245 |
+
# debug
|
246 |
+
# a = BaseSVSInfer(hparams)
|
247 |
+
# a.preprocess_input({'text': '你 说 你 不 SP 懂 为 何 在 这 时 牵 手 AP',
|
248 |
+
# 'notes': 'D#4/Eb4 | D#4/Eb4 | D#4/Eb4 | D#4/Eb4 | rest | D#4/Eb4 | D4 | D4 | D4 | D#4/Eb4 | F4 | D#4/Eb4 | D4 | rest',
|
249 |
+
# 'notes_duration': '0.113740 | 0.329060 | 0.287950 | 0.133480 | 0.150900 | 0.484730 | 0.242010 | 0.180820 | 0.343570 | 0.152050 | 0.266720 | 0.280310 | 0.633300 | 0.444590'
|
250 |
+
# })
|
251 |
+
|
252 |
+
# b = {
|
253 |
+
# 'text': '小酒窝长睫毛AP是你最美的记号',
|
254 |
+
# 'notes': 'C#4/Db4 | F#4/Gb4 | G#4/Ab4 | A#4/Bb4 F#4/Gb4 | F#4/Gb4 C#4/Db4 | C#4/Db4 | rest | C#4/Db4 | A#4/Bb4 | G#4/Ab4 | A#4/Bb4 | G#4/Ab4 | F4 | C#4/Db4',
|
255 |
+
# 'notes_duration': '0.407140 | 0.376190 | 0.242180 | 0.509550 0.183420 | 0.315400 0.235020 | 0.361660 | 0.223070 | 0.377270 | 0.340550 | 0.299620 | 0.344510 | 0.283770 | 0.323390 | 0.360340'
|
256 |
+
# }
|
257 |
+
# c = {
|
258 |
+
# 'text': '小酒窝长睫毛AP是你最美的记号',
|
259 |
+
# 'ph_seq': 'x iao j iu w o ch ang ang j ie ie m ao AP sh i n i z ui m ei d e j i h ao',
|
260 |
+
# 'note_seq': 'C#4/Db4 C#4/Db4 F#4/Gb4 F#4/Gb4 G#4/Ab4 G#4/Ab4 A#4/Bb4 A#4/Bb4 F#4/Gb4 F#4/Gb4 F#4/Gb4 C#4/Db4 C#4/Db4 C#4/Db4 rest C#4/Db4 C#4/Db4 A#4/Bb4 A#4/Bb4 G#4/Ab4 G#4/Ab4 A#4/Bb4 A#4/Bb4 G#4/Ab4 G#4/Ab4 F4 F4 C#4/Db4 C#4/Db4',
|
261 |
+
# 'note_dur_seq': '0.407140 0.407140 0.376190 0.376190 0.242180 0.242180 0.509550 0.509550 0.183420 0.315400 0.315400 0.235020 0.361660 0.361660 0.223070 0.377270 0.377270 0.340550 0.340550 0.299620 0.299620 0.344510 0.344510 0.283770 0.283770 0.323390 0.323390 0.360340 0.360340',
|
262 |
+
# 'is_slur_seq': '0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0'
|
263 |
+
# } # input like Opencpop dataset.
|
264 |
+
# a.preprocess_input(b)
|
265 |
+
# a.preprocess_input(c, input_type='phoneme')
|
inference/svs/ds_cascade.py
ADDED
@@ -0,0 +1,54 @@
1 |
+
import torch
|
2 |
+
# from inference.tts.fs import FastSpeechInfer
|
3 |
+
# from modules.tts.fs2_orig import FastSpeech2Orig
|
4 |
+
from inference.svs.base_svs_infer import BaseSVSInfer
|
5 |
+
from utils import load_ckpt
|
6 |
+
from utils.hparams import hparams
|
7 |
+
from usr.diff.shallow_diffusion_tts import GaussianDiffusion
|
8 |
+
from usr.diffsinger_task import DIFF_DECODERS
|
9 |
+
|
10 |
+
class DiffSingerCascadeInfer(BaseSVSInfer):
|
11 |
+
def build_model(self):
|
12 |
+
model = GaussianDiffusion(
|
13 |
+
phone_encoder=self.ph_encoder,
|
14 |
+
out_dims=hparams['audio_num_mel_bins'], denoise_fn=DIFF_DECODERS[hparams['diff_decoder_type']](hparams),
|
15 |
+
timesteps=hparams['timesteps'],
|
16 |
+
K_step=hparams['K_step'],
|
17 |
+
loss_type=hparams['diff_loss_type'],
|
18 |
+
spec_min=hparams['spec_min'], spec_max=hparams['spec_max'],
|
19 |
+
)
|
20 |
+
model.eval()
|
21 |
+
load_ckpt(model, hparams['work_dir'], 'model')
|
22 |
+
return model
|
23 |
+
|
24 |
+
def forward_model(self, inp):
|
25 |
+
sample = self.input_to_batch(inp)
|
26 |
+
txt_tokens = sample['txt_tokens'] # [B, T_t]
|
27 |
+
spk_id = sample.get('spk_ids')
|
28 |
+
with torch.no_grad():
|
29 |
+
output = self.model(txt_tokens, spk_id=spk_id, ref_mels=None, infer=True,
|
30 |
+
pitch_midi=sample['pitch_midi'], midi_dur=sample['midi_dur'],
|
31 |
+
is_slur=sample['is_slur'])
|
32 |
+
mel_out = output['mel_out'] # [B, T,80]
|
33 |
+
f0_pred = output['f0_denorm']
|
34 |
+
wav_out = self.run_vocoder(mel_out, f0=f0_pred)
|
35 |
+
wav_out = wav_out.cpu().numpy()
|
36 |
+
return wav_out[0]
|
37 |
+
|
38 |
+
|
39 |
+
if __name__ == '__main__':
|
40 |
+
inp = {
|
41 |
+
'text': '小酒窝长睫毛AP是你最美的记号',
|
42 |
+
'notes': 'C#4/Db4 | F#4/Gb4 | G#4/Ab4 | A#4/Bb4 F#4/Gb4 | F#4/Gb4 C#4/Db4 | C#4/Db4 | rest | C#4/Db4 | A#4/Bb4 | G#4/Ab4 | A#4/Bb4 | G#4/Ab4 | F4 | C#4/Db4',
|
43 |
+
'notes_duration': '0.407140 | 0.376190 | 0.242180 | 0.509550 0.183420 | 0.315400 0.235020 | 0.361660 | 0.223070 | 0.377270 | 0.340550 | 0.299620 | 0.344510 | 0.283770 | 0.323390 | 0.360340',
|
44 |
+
'input_type': 'word'
|
45 |
+
} # user input: Chinese characters
|
46 |
+
c = {
|
47 |
+
'text': '小酒窝长睫毛AP是你最美的记号',
|
48 |
+
'ph_seq': 'x iao j iu w o ch ang ang j ie ie m ao AP sh i n i z ui m ei d e j i h ao',
|
49 |
+
'note_seq': 'C#4/Db4 C#4/Db4 F#4/Gb4 F#4/Gb4 G#4/Ab4 G#4/Ab4 A#4/Bb4 A#4/Bb4 F#4/Gb4 F#4/Gb4 F#4/Gb4 C#4/Db4 C#4/Db4 C#4/Db4 rest C#4/Db4 C#4/Db4 A#4/Bb4 A#4/Bb4 G#4/Ab4 G#4/Ab4 A#4/Bb4 A#4/Bb4 G#4/Ab4 G#4/Ab4 F4 F4 C#4/Db4 C#4/Db4',
|
50 |
+
'note_dur_seq': '0.407140 0.407140 0.376190 0.376190 0.242180 0.242180 0.509550 0.509550 0.183420 0.315400 0.315400 0.235020 0.361660 0.361660 0.223070 0.377270 0.377270 0.340550 0.340550 0.299620 0.299620 0.344510 0.344510 0.283770 0.283770 0.323390 0.323390 0.360340 0.360340',
|
51 |
+
'is_slur_seq': '0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0',
|
52 |
+
'input_type': 'phoneme'
|
53 |
+
} # input like Opencpop dataset.
|
54 |
+
DiffSingerCascadeInfer.example_run(inp)
|
inference/svs/ds_e2e.py
ADDED
@@ -0,0 +1,67 @@
1 |
+
import torch
|
2 |
+
# from inference.tts.fs import FastSpeechInfer
|
3 |
+
# from modules.tts.fs2_orig import FastSpeech2Orig
|
4 |
+
from inference.svs.base_svs_infer import BaseSVSInfer
|
5 |
+
from utils import load_ckpt
|
6 |
+
from utils.hparams import hparams
|
7 |
+
from usr.diff.shallow_diffusion_tts import GaussianDiffusion
|
8 |
+
from usr.diffsinger_task import DIFF_DECODERS
|
9 |
+
from modules.fastspeech.pe import PitchExtractor
|
10 |
+
import utils
|
11 |
+
|
12 |
+
|
13 |
+
class DiffSingerE2EInfer(BaseSVSInfer):
|
14 |
+
def build_model(self):
|
15 |
+
model = GaussianDiffusion(
|
16 |
+
phone_encoder=self.ph_encoder,
|
17 |
+
out_dims=hparams['audio_num_mel_bins'], denoise_fn=DIFF_DECODERS[hparams['diff_decoder_type']](hparams),
|
18 |
+
timesteps=hparams['timesteps'],
|
19 |
+
K_step=hparams['K_step'],
|
20 |
+
loss_type=hparams['diff_loss_type'],
|
21 |
+
spec_min=hparams['spec_min'], spec_max=hparams['spec_max'],
|
22 |
+
)
|
23 |
+
model.eval()
|
24 |
+
load_ckpt(model, hparams['work_dir'], 'model')
|
25 |
+
|
26 |
+
if hparams.get('pe_enable') is not None and hparams['pe_enable']:
|
27 |
+
self.pe = PitchExtractor().cuda()
|
28 |
+
utils.load_ckpt(self.pe, hparams['pe_ckpt'], 'model', strict=True)
|
29 |
+
self.pe.eval()
|
30 |
+
return model
|
31 |
+
|
32 |
+
def forward_model(self, inp):
|
33 |
+
sample = self.input_to_batch(inp)
|
34 |
+
txt_tokens = sample['txt_tokens'] # [B, T_t]
|
35 |
+
spk_id = sample.get('spk_ids')
|
36 |
+
with torch.no_grad():
|
37 |
+
output = self.model(txt_tokens, spk_id=spk_id, ref_mels=None, infer=True,
|
38 |
+
pitch_midi=sample['pitch_midi'], midi_dur=sample['midi_dur'],
|
39 |
+
is_slur=sample['is_slur'])
|
40 |
+
mel_out = output['mel_out'] # [B, T,80]
|
41 |
+
if hparams.get('pe_enable') is not None and hparams['pe_enable']:
|
42 |
+
f0_pred = self.pe(mel_out)['f0_denorm_pred'] # pe predict from Pred mel
|
43 |
+
else:
|
44 |
+
f0_pred = output['f0_denorm']
|
45 |
+
wav_out = self.run_vocoder(mel_out, f0=f0_pred)
|
46 |
+
wav_out = wav_out.cpu().numpy()
|
47 |
+
return wav_out[0]
|
48 |
+
|
49 |
+
if __name__ == '__main__':
|
50 |
+
inp = {
|
51 |
+
'text': '小酒窝长睫毛AP是你最美的记号',
|
52 |
+
'notes': 'C#4/Db4 | F#4/Gb4 | G#4/Ab4 | A#4/Bb4 F#4/Gb4 | F#4/Gb4 C#4/Db4 | C#4/Db4 | rest | C#4/Db4 | A#4/Bb4 | G#4/Ab4 | A#4/Bb4 | G#4/Ab4 | F4 | C#4/Db4',
|
53 |
+
'notes_duration': '0.407140 | 0.376190 | 0.242180 | 0.509550 0.183420 | 0.315400 0.235020 | 0.361660 | 0.223070 | 0.377270 | 0.340550 | 0.299620 | 0.344510 | 0.283770 | 0.323390 | 0.360340',
|
54 |
+
'input_type': 'word'
|
55 |
+
} # user input: Chinese characters
|
56 |
+
c = {
|
57 |
+
'text': '小酒窝长睫毛AP是你最美的记号',
|
58 |
+
'ph_seq': 'x iao j iu w o ch ang ang j ie ie m ao AP sh i n i z ui m ei d e j i h ao',
|
59 |
+
'note_seq': 'C#4/Db4 C#4/Db4 F#4/Gb4 F#4/Gb4 G#4/Ab4 G#4/Ab4 A#4/Bb4 A#4/Bb4 F#4/Gb4 F#4/Gb4 F#4/Gb4 C#4/Db4 C#4/Db4 C#4/Db4 rest C#4/Db4 C#4/Db4 A#4/Bb4 A#4/Bb4 G#4/Ab4 G#4/Ab4 A#4/Bb4 A#4/Bb4 G#4/Ab4 G#4/Ab4 F4 F4 C#4/Db4 C#4/Db4',
|
60 |
+
'note_dur_seq': '0.407140 0.407140 0.376190 0.376190 0.242180 0.242180 0.509550 0.509550 0.183420 0.315400 0.315400 0.235020 0.361660 0.361660 0.223070 0.377270 0.377270 0.340550 0.340550 0.299620 0.299620 0.344510 0.344510 0.283770 0.283770 0.323390 0.323390 0.360340 0.360340',
|
61 |
+
'is_slur_seq': '0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0',
|
62 |
+
'input_type': 'phoneme'
|
63 |
+
} # input like Opencpop dataset.
|
64 |
+
DiffSingerE2EInfer.example_run(inp)
|
65 |
+
|
66 |
+
|
67 |
+
# python inference/svs/ds_e2e.py --config usr/configs/midi/e2e/opencpop/ds100_adj_rel.yaml --exp_name 0228_opencpop_ds100_rel
|
inference/svs/gradio/gradio_settings.yaml
ADDED
@@ -0,0 +1,19 @@
1 |
+
title: 'DiffSinger'
|
2 |
+
description: |
|
3 |
+
Gradio demo for DiffSinger.
|
4 |
+
|
5 |
+
Please assign a pitch and a duration to every Chinese character; the pitches and durations belonging to different characters must be separated by the | delimiter. Make sure the number of note windows produced by the delimiter matches the number of characters (AP or SP also counts as one character).
|
6 |
+
|
7 |
+
article: |
|
8 |
+
Link to <a href='https://github.com/MoonInTheRiver/DiffSinger' style='color:blue;' target='_blank\'>Github REPO</a>
|
9 |
+
example_inputs:
|
10 |
+
- |-
|
11 |
+
你 说 你 不 SP 懂 为 何 在 这 时 牵 手 AP<sep>D#4/Eb4 | D#4/Eb4 | D#4/Eb4 | D#4/Eb4 | rest | D#4/Eb4 | D4 | D4 | D4 | D#4/Eb4 | F4 | D#4/Eb4 | D4 | rest<sep>0.113740 | 0.329060 | 0.287950 | 0.133480 | 0.150900 | 0.484730 | 0.242010 | 0.180820 | 0.343570 | 0.152050 | 0.266720 | 0.280310 | 0.633300 | 0.444590
|
12 |
+
- |-
|
13 |
+
小酒窝长睫毛AP是你最美的记号<sep>C#4/Db4 | F#4/Gb4 | G#4/Ab4 | A#4/Bb4 F#4/Gb4 | F#4/Gb4 C#4/Db4 | C#4/Db4 | rest | C#4/Db4 | A#4/Bb4 | G#4/Ab4 | A#4/Bb4 | G#4/Ab4 | F4 | C#4/Db4<sep>0.407140 | 0.376190 | 0.242180 | 0.509550 0.183420 | 0.315400 0.235020 | 0.361660 | 0.223070 | 0.377270 | 0.340550 | 0.299620 | 0.344510 | 0.283770 | 0.323390 | 0.360340
|
14 |
+
|
15 |
+
#inference_cls: inference.svs.ds_cascade.DiffSingerCascadeInfer
|
16 |
+
#exp_name: 0303_opencpop_ds58_midi
|
17 |
+
|
18 |
+
inference_cls: inference.svs.ds_e2e.DiffSingerE2EInfer
|
19 |
+
exp_name: 0228_opencpop_ds100_rel
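
The description above requires the `|`-separated note and duration windows to line up with the characters. A small standalone check (not part of the demo code) that parses the first `<sep>` example and verifies the counts:

```python
# Standalone check that an example input is well-formed: the number of note windows
# and duration windows must equal the number of characters (AP/SP count as one "character").
example = ('你 说 你 不 SP 懂 为 何 在 这 时 牵 手 AP'
           '<sep>D#4/Eb4 | D#4/Eb4 | D#4/Eb4 | D#4/Eb4 | rest | D#4/Eb4 | D4 | D4 | D4 | D#4/Eb4 | F4 | D#4/Eb4 | D4 | rest'
           '<sep>0.113740 | 0.329060 | 0.287950 | 0.133480 | 0.150900 | 0.484730 | 0.242010 | 0.180820 | 0.343570 | 0.152050 | 0.266720 | 0.280310 | 0.633300 | 0.444590')

text, notes, durs = example.split('<sep>')
n_chars = len(text.split())                        # this example separates characters with spaces
n_notes = len([w for w in notes.split('|') if w.strip()])
n_durs = len([w for w in durs.split('|') if w.strip()])
assert n_chars == n_notes == n_durs, (n_chars, n_notes, n_durs)
print('windows per field:', n_chars)
```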
|
inference/svs/gradio/infer.py
ADDED
@@ -0,0 +1,91 @@
1 |
+
import importlib
|
2 |
+
import re
|
3 |
+
|
4 |
+
import gradio as gr
|
5 |
+
import yaml
|
6 |
+
from gradio.inputs import Textbox
|
7 |
+
|
8 |
+
from inference.svs.base_svs_infer import BaseSVSInfer
|
9 |
+
from utils.hparams import set_hparams
|
10 |
+
from utils.hparams import hparams as hp
|
11 |
+
import numpy as np
|
12 |
+
|
13 |
+
|
14 |
+
class GradioInfer:
|
15 |
+
def __init__(self, exp_name, inference_cls, title, description, article, example_inputs):
|
16 |
+
self.exp_name = exp_name
|
17 |
+
self.title = title
|
18 |
+
self.description = description
|
19 |
+
self.article = article
|
20 |
+
self.example_inputs = example_inputs
|
21 |
+
pkg = ".".join(inference_cls.split(".")[:-1])
|
22 |
+
cls_name = inference_cls.split(".")[-1]
|
23 |
+
self.inference_cls = getattr(importlib.import_module(pkg), cls_name)
|
24 |
+
|
25 |
+
def greet(self, text, notes, notes_duration):
|
26 |
+
PUNCS = '。?;:'
|
27 |
+
sents = re.split(rf'([{PUNCS}])', text.replace('\n', ','))
|
28 |
+
sents_notes = re.split(rf'([{PUNCS}])', notes.replace('\n', ','))
|
29 |
+
sents_notes_dur = re.split(rf'([{PUNCS}])', notes_duration.replace('\n', ','))
|
30 |
+
|
31 |
+
if sents[-1] not in list(PUNCS):
|
32 |
+
sents = sents + ['']
|
33 |
+
sents_notes = sents_notes + ['']
|
34 |
+
sents_notes_dur = sents_notes_dur + ['']
|
35 |
+
|
36 |
+
audio_outs = []
|
37 |
+
s, n, n_dur = "", "", ""
|
38 |
+
for i in range(0, len(sents), 2):
|
39 |
+
if len(sents[i]) > 0:
|
40 |
+
s += sents[i] + sents[i + 1]
|
41 |
+
n += sents_notes[i] + sents_notes[i+1]
|
42 |
+
n_dur += sents_notes_dur[i] + sents_notes_dur[i+1]
|
43 |
+
if len(s) >= 400 or (i >= len(sents) - 2 and len(s) > 0):
|
44 |
+
audio_out = self.infer_ins.infer_once({
|
45 |
+
'text': s,
|
46 |
+
'notes': n,
|
47 |
+
'notes_duration': n_dur,
|
48 |
+
})
|
49 |
+
audio_out = audio_out * 32767
|
50 |
+
audio_out = audio_out.astype(np.int16)
|
51 |
+
audio_outs.append(audio_out)
|
52 |
+
audio_outs.append(np.zeros(int(hp['audio_sample_rate'] * 0.3)).astype(np.int16))
|
53 |
+
s = ""
|
54 |
+
n = ""
n_dur = ""  # also reset the accumulated durations so they do not leak into the next chunk
|
55 |
+
audio_outs = np.concatenate(audio_outs)
|
56 |
+
return hp['audio_sample_rate'], audio_outs
|
57 |
+
|
58 |
+
def run(self):
|
59 |
+
set_hparams(exp_name=self.exp_name, print_hparams=False)
|
60 |
+
infer_cls = self.inference_cls
|
61 |
+
self.infer_ins: BaseSVSInfer = infer_cls(hp)
|
62 |
+
example_inputs = self.example_inputs
|
63 |
+
for i in range(len(example_inputs)):
|
64 |
+
text, notes, notes_dur = example_inputs[i].split('<sep>')
|
65 |
+
example_inputs[i] = [text, notes, notes_dur]
|
66 |
+
|
67 |
+
iface = gr.Interface(fn=self.greet,
|
68 |
+
inputs=[
|
69 |
+
Textbox(lines=2, placeholder=None, default=example_inputs[0][0], label="input text"),
|
70 |
+
Textbox(lines=2, placeholder=None, default=example_inputs[0][1], label="input note"),
|
71 |
+
Textbox(lines=2, placeholder=None, default=example_inputs[0][2], label="input duration")]
|
72 |
+
,
|
73 |
+
outputs="audio",
|
74 |
+
allow_flagging="never",
|
75 |
+
title=self.title,
|
76 |
+
description=self.description,
|
77 |
+
article=self.article,
|
78 |
+
examples=example_inputs,
|
79 |
+
enable_queue=True)
|
80 |
+
iface.launch(share=True,)# cache_examples=True)
|
81 |
+
|
82 |
+
|
83 |
+
if __name__ == '__main__':
|
84 |
+
gradio_config = yaml.safe_load(open('inference/svs/gradio/gradio_settings.yaml'))
|
85 |
+
g = GradioInfer(**gradio_config)
|
86 |
+
g.run()
|
87 |
+
|
88 |
+
|
89 |
+
# python inference/svs/gradio/infer.py --config usr/configs/midi/cascade/opencs/ds60_rel.yaml --exp_name 0303_opencpop_ds58_midi
|
90 |
+
# python inference/svs/ds_cascade.py --config usr/configs/midi/cascade/opencs/ds60_rel.yaml --exp_name 0303_opencpop_ds58_midi
|
91 |
+
# CUDA_VISIBLE_DEVICES=3 python inference/svs/gradio/infer.py --config usr/configs/midi/e2e/opencpop/ds100_adj_rel.yaml --exp_name 0228_opencpop_ds100_rel
|
inference/svs/opencpop/cpop_pinyin2ph.txt
ADDED
@@ -0,0 +1,418 @@
1 |
+
| a | a |
|
2 |
+
| ai | ai |
|
3 |
+
| an | an |
|
4 |
+
| ang | ang |
|
5 |
+
| ao | ao |
|
6 |
+
| ba | b a |
|
7 |
+
| bai | b ai |
|
8 |
+
| ban | b an |
|
9 |
+
| bang | b ang |
|
10 |
+
| bao | b ao |
|
11 |
+
| bei | b ei |
|
12 |
+
| ben | b en |
|
13 |
+
| beng | b eng |
|
14 |
+
| bi | b i |
|
15 |
+
| bian | b ian |
|
16 |
+
| biao | b iao |
|
17 |
+
| bie | b ie |
|
18 |
+
| bin | b in |
|
19 |
+
| bing | b ing |
|
20 |
+
| bo | b o |
|
21 |
+
| bu | b u |
|
22 |
+
| ca | c a |
|
23 |
+
| cai | c ai |
|
24 |
+
| can | c an |
|
25 |
+
| cang | c ang |
|
26 |
+
| cao | c ao |
|
27 |
+
| ce | c e |
|
28 |
+
| cei | c ei |
|
29 |
+
| cen | c en |
|
30 |
+
| ceng | c eng |
|
31 |
+
| cha | ch a |
|
32 |
+
| chai | ch ai |
|
33 |
+
| chan | ch an |
|
34 |
+
| chang | ch ang |
|
35 |
+
| chao | ch ao |
|
36 |
+
| che | ch e |
|
37 |
+
| chen | ch en |
|
38 |
+
| cheng | ch eng |
|
39 |
+
| chi | ch i |
|
40 |
+
| chong | ch ong |
|
41 |
+
| chou | ch ou |
|
42 |
+
| chu | ch u |
|
43 |
+
| chua | ch ua |
|
44 |
+
| chuai | ch uai |
|
45 |
+
| chuan | ch uan |
|
46 |
+
| chuang | ch uang |
|
47 |
+
| chui | ch ui |
|
48 |
+
| chun | ch un |
|
49 |
+
| chuo | ch uo |
|
50 |
+
| ci | c i |
|
51 |
+
| cong | c ong |
|
52 |
+
| cou | c ou |
|
53 |
+
| cu | c u |
|
54 |
+
| cuan | c uan |
|
55 |
+
| cui | c ui |
|
56 |
+
| cun | c un |
|
57 |
+
| cuo | c uo |
|
58 |
+
| da | d a |
|
59 |
+
| dai | d ai |
|
60 |
+
| dan | d an |
|
61 |
+
| dang | d ang |
|
62 |
+
| dao | d ao |
|
63 |
+
| de | d e |
|
64 |
+
| dei | d ei |
|
65 |
+
| den | d en |
|
66 |
+
| deng | d eng |
|
67 |
+
| di | d i |
|
68 |
+
| dia | d ia |
|
69 |
+
| dian | d ian |
|
70 |
+
| diao | d iao |
|
71 |
+
| die | d ie |
|
72 |
+
| ding | d ing |
|
73 |
+
| diu | d iu |
|
74 |
+
| dong | d ong |
|
75 |
+
| dou | d ou |
|
76 |
+
| du | d u |
|
77 |
+
| duan | d uan |
|
78 |
+
| dui | d ui |
|
79 |
+
| dun | d un |
|
80 |
+
| duo | d uo |
|
81 |
+
| e | e |
|
82 |
+
| ei | ei |
|
83 |
+
| en | en |
|
84 |
+
| eng | eng |
|
85 |
+
| er | er |
|
86 |
+
| fa | f a |
|
87 |
+
| fan | f an |
|
88 |
+
| fang | f ang |
|
89 |
+
| fei | f ei |
|
90 |
+
| fen | f en |
|
91 |
+
| feng | f eng |
|
92 |
+
| fo | f o |
|
93 |
+
| fou | f ou |
|
94 |
+
| fu | f u |
|
95 |
+
| ga | g a |
|
96 |
+
| gai | g ai |
|
97 |
+
| gan | g an |
|
98 |
+
| gang | g ang |
|
99 |
+
| gao | g ao |
|
100 |
+
| ge | g e |
|
101 |
+
| gei | g ei |
|
102 |
+
| gen | g en |
|
103 |
+
| geng | g eng |
|
104 |
+
| gong | g ong |
|
105 |
+
| gou | g ou |
|
106 |
+
| gu | g u |
|
107 |
+
| gua | g ua |
|
108 |
+
| guai | g uai |
|
109 |
+
| guan | g uan |
|
110 |
+
| guang | g uang |
|
111 |
+
| gui | g ui |
|
112 |
+
| gun | g un |
|
113 |
+
| guo | g uo |
|
114 |
+
| ha | h a |
|
115 |
+
| hai | h ai |
|
116 |
+
| han | h an |
|
117 |
+
| hang | h ang |
|
118 |
+
| hao | h ao |
|
119 |
+
| he | h e |
|
120 |
+
| hei | h ei |
|
121 |
+
| hen | h en |
|
122 |
+
| heng | h eng |
|
123 |
+
| hm | h m |
|
124 |
+
| hng | h ng |
|
125 |
+
| hong | h ong |
|
126 |
+
| hou | h ou |
|
127 |
+
| hu | h u |
|
128 |
+
| hua | h ua |
|
129 |
+
| huai | h uai |
|
130 |
+
| huan | h uan |
|
131 |
+
| huang | h uang |
|
132 |
+
| hui | h ui |
|
133 |
+
| hun | h un |
|
134 |
+
| huo | h uo |
|
135 |
+
| ji | j i |
|
136 |
+
| jia | j ia |
|
137 |
+
| jian | j ian |
|
138 |
+
| jiang | j iang |
|
139 |
+
| jiao | j iao |
|
140 |
+
| jie | j ie |
|
141 |
+
| jin | j in |
|
142 |
+
| jing | j ing |
|
143 |
+
| jiong | j iong |
|
144 |
+
| jiu | j iu |
|
145 |
+
| ju | j v |
|
146 |
+
| juan | j van |
|
147 |
+
| jue | j ve |
|
148 |
+
| jun | j vn |
|
149 |
+
| ka | k a |
|
150 |
+
| kai | k ai |
|
151 |
+
| kan | k an |
|
152 |
+
| kang | k ang |
|
153 |
+
| kao | k ao |
|
154 |
+
| ke | k e |
|
155 |
+
| kei | k ei |
|
156 |
+
| ken | k en |
|
157 |
+
| keng | k eng |
|
158 |
+
| kong | k ong |
|
159 |
+
| kou | k ou |
|
160 |
+
| ku | k u |
|
161 |
+
| kua | k ua |
|
162 |
+
| kuai | k uai |
|
163 |
+
| kuan | k uan |
|
164 |
+
| kuang | k uang |
|
165 |
+
| kui | k ui |
|
166 |
+
| kun | k un |
|
167 |
+
| kuo | k uo |
|
168 |
+
| la | l a |
|
169 |
+
| lai | l ai |
|
170 |
+
| lan | l an |
|
171 |
+
| lang | l ang |
|
172 |
+
| lao | l ao |
|
173 |
+
| le | l e |
|
174 |
+
| lei | l ei |
|
175 |
+
| leng | l eng |
|
176 |
+
| li | l i |
|
177 |
+
| lia | l ia |
|
178 |
+
| lian | l ian |
|
179 |
+
| liang | l iang |
|
180 |
+
| liao | l iao |
|
181 |
+
| lie | l ie |
|
182 |
+
| lin | l in |
|
183 |
+
| ling | l ing |
|
184 |
+
| liu | l iu |
|
185 |
+
| lo | l o |
|
186 |
+
| long | l ong |
|
187 |
+
| lou | l ou |
|
188 |
+
| lu | l u |
|
189 |
+
| luan | l uan |
|
190 |
+
| lun | l un |
|
191 |
+
| luo | l uo |
|
192 |
+
| lv | l v |
|
193 |
+
| lve | l ve |
|
194 |
+
| m | m |
|
195 |
+
| ma | m a |
|
196 |
+
| mai | m ai |
|
197 |
+
| man | m an |
|
198 |
+
| mang | m ang |
|
199 |
+
| mao | m ao |
|
200 |
+
| me | m e |
|
201 |
+
| mei | m ei |
|
202 |
+
| men | m en |
|
203 |
+
| meng | m eng |
|
204 |
+
| mi | m i |
|
205 |
+
| mian | m ian |
|
206 |
+
| miao | m iao |
|
207 |
+
| mie | m ie |
|
208 |
+
| min | m in |
|
209 |
+
| ming | m ing |
|
210 |
+
| miu | m iu |
|
211 |
+
| mo | m o |
|
212 |
+
| mou | m ou |
|
213 |
+
| mu | m u |
|
214 |
+
| n | n |
|
215 |
+
| na | n a |
|
216 |
+
| nai | n ai |
|
217 |
+
| nan | n an |
|
218 |
+
| nang | n ang |
|
219 |
+
| nao | n ao |
|
220 |
+
| ne | n e |
|
221 |
+
| nei | n ei |
|
222 |
+
| nen | n en |
|
223 |
+
| neng | n eng |
|
224 |
+
| ng | n g |
|
225 |
+
| ni | n i |
|
226 |
+
| nian | n ian |
|
227 |
+
| niang | n iang |
|
228 |
+
| niao | n iao |
|
229 |
+
| nie | n ie |
|
230 |
+
| nin | n in |
|
231 |
+
| ning | n ing |
|
232 |
+
| niu | n iu |
|
233 |
+
| nong | n ong |
|
234 |
+
| nou | n ou |
|
235 |
+
| nu | n u |
|
236 |
+
| nuan | n uan |
|
237 |
+
| nun | n un |
|
238 |
+
| nuo | n uo |
|
239 |
+
| nv | n v |
|
240 |
+
| nve | n ve |
|
241 |
+
| o | o |
|
242 |
+
| ou | ou |
|
243 |
+
| pa | p a |
|
244 |
+
| pai | p ai |
|
245 |
+
| pan | p an |
|
246 |
+
| pang | p ang |
|
247 |
+
| pao | p ao |
|
248 |
+
| pei | p ei |
|
249 |
+
| pen | p en |
|
250 |
+
| peng | p eng |
|
251 |
+
| pi | p i |
|
252 |
+
| pian | p ian |
|
253 |
+
| piao | p iao |
|
254 |
+
| pie | p ie |
|
255 |
+
| pin | p in |
|
256 |
+
| ping | p ing |
|
257 |
+
| po | p o |
|
258 |
+
| pou | p ou |
|
259 |
+
| pu | p u |
|
260 |
+
| qi | q i |
|
261 |
+
| qia | q ia |
|
262 |
+
| qian | q ian |
|
263 |
+
| qiang | q iang |
|
264 |
+
| qiao | q iao |
|
265 |
+
| qie | q ie |
|
266 |
+
| qin | q in |
|
267 |
+
| qing | q ing |
|
268 |
+
| qiong | q iong |
|
269 |
+
| qiu | q iu |
|
270 |
+
| qu | q v |
|
271 |
+
| quan | q van |
|
272 |
+
| que | q ve |
|
273 |
+
| qun | q vn |
|
274 |
+
| ran | r an |
|
275 |
+
| rang | r ang |
|
276 |
+
| rao | r ao |
|
277 |
+
| re | r e |
|
278 |
+
| ren | r en |
|
279 |
+
| reng | r eng |
|
280 |
+
| ri | r i |
|
281 |
+
| rong | r ong |
|
282 |
+
| rou | r ou |
|
283 |
+
| ru | r u |
|
284 |
+
| rua | r ua |
|
285 |
+
| ruan | r uan |
|
286 |
+
| rui | r ui |
|
287 |
+
| run | r un |
|
288 |
+
| ruo | r uo |
|
289 |
+
| sa | s a |
|
290 |
+
| sai | s ai |
|
291 |
+
| san | s an |
|
292 |
+
| sang | s ang |
|
293 |
+
| sao | s ao |
|
294 |
+
| se | s e |
|
295 |
+
| sen | s en |
|
296 |
+
| seng | s eng |
|
297 |
+
| sha | sh a |
|
298 |
+
| shai | sh ai |
|
299 |
+
| shan | sh an |
|
300 |
+
| shang | sh ang |
|
301 |
+
| shao | sh ao |
|
302 |
+
| she | sh e |
|
303 |
+
| shei | sh ei |
|
304 |
+
| shen | sh en |
|
305 |
+
| sheng | sh eng |
|
306 |
+
| shi | sh i |
|
307 |
+
| shou | sh ou |
|
308 |
+
| shu | sh u |
|
309 |
+
| shua | sh ua |
|
310 |
+
| shuai | sh uai |
|
311 |
+
| shuan | sh uan |
|
312 |
+
| shuang | sh uang |
|
313 |
+
| shui | sh ui |
|
314 |
+
| shun | sh un |
|
315 |
+
| shuo | sh uo |
|
316 |
+
| si | s i |
|
317 |
+
| song | s ong |
|
318 |
+
| sou | s ou |
|
319 |
+
| su | s u |
|
320 |
+
| suan | s uan |
|
321 |
+
| sui | s ui |
|
322 |
+
| sun | s un |
|
323 |
+
| suo | s uo |
|
324 |
+
| ta | t a |
|
325 |
+
| tai | t ai |
|
326 |
+
| tan | t an |
|
327 |
+
| tang | t ang |
|
328 |
+
| tao | t ao |
|
329 |
+
| te | t e |
|
330 |
+
| tei | t ei |
|
331 |
+
| teng | t eng |
|
332 |
+
| ti | t i |
|
333 |
+
| tian | t ian |
|
334 |
+
| tiao | t iao |
|
335 |
+
| tie | t ie |
|
336 |
+
| ting | t ing |
|
337 |
+
| tong | t ong |
|
338 |
+
| tou | t ou |
|
339 |
+
| tu | t u |
|
340 |
+
| tuan | t uan |
|
341 |
+
| tui | t ui |
|
342 |
+
| tun | t un |
|
343 |
+
| tuo | t uo |
|
344 |
+
| wa | w a |
|
345 |
+
| wai | w ai |
|
346 |
+
| wan | w an |
|
347 |
+
| wang | w ang |
|
348 |
+
| wei | w ei |
|
349 |
+
| wen | w en |
|
350 |
+
| weng | w eng |
|
351 |
+
| wo | w o |
|
352 |
+
| wu | w u |
|
353 |
+
| xi | x i |
|
354 |
+
| xia | x ia |
|
355 |
+
| xian | x ian |
|
356 |
+
| xiang | x iang |
|
357 |
+
| xiao | x iao |
|
358 |
+
| xie | x ie |
|
359 |
+
| xin | x in |
|
360 |
+
| xing | x ing |
|
361 |
+
| xiong | x iong |
|
362 |
+
| xiu | x iu |
|
363 |
+
| xu | x v |
|
364 |
+
| xuan | x van |
|
365 |
+
| xue | x ve |
|
366 |
+
| xun | x vn |
|
367 |
+
| ya | y a |
|
368 |
+
| yan | y an |
|
369 |
+
| yang | y ang |
|
370 |
+
| yao | y ao |
|
371 |
+
| ye | y e |
|
372 |
+
| yi | y i |
|
373 |
+
| yin | y in |
|
374 |
+
| ying | y ing |
|
375 |
+
| yo | y o |
|
376 |
+
| yong | y ong |
|
377 |
+
| you | y ou |
|
378 |
+
| yu | y v |
|
379 |
+
| yuan | y van |
|
380 |
+
| yue | y ve |
|
381 |
+
| yun | y vn |
|
382 |
+
| za | z a |
|
383 |
+
| zai | z ai |
|
384 |
+
| zan | z an |
|
385 |
+
| zang | z ang |
|
386 |
+
| zao | z ao |
|
387 |
+
| ze | z e |
|
388 |
+
| zei | z ei |
|
389 |
+
| zen | z en |
|
390 |
+
| zeng | z eng |
|
391 |
+
| zha | zh a |
|
392 |
+
| zhai | zh ai |
|
393 |
+
| zhan | zh an |
|
394 |
+
| zhang | zh ang |
|
395 |
+
| zhao | zh ao |
|
396 |
+
| zhe | zh e |
|
397 |
+
| zhei | zh ei |
|
398 |
+
| zhen | zh en |
|
399 |
+
| zheng | zh eng |
|
400 |
+
| zhi | zh i |
|
401 |
+
| zhong | zh ong |
|
402 |
+
| zhou | zh ou |
|
403 |
+
| zhu | zh u |
|
404 |
+
| zhua | zh ua |
|
405 |
+
| zhuai | zh uai |
|
406 |
+
| zhuan | zh uan |
|
407 |
+
| zhuang | zh uang |
|
408 |
+
| zhui | zh ui |
|
409 |
+
| zhun | zh un |
|
410 |
+
| zhuo | zh uo |
|
411 |
+
| zi | z i |
|
412 |
+
| zong | z ong |
|
413 |
+
| zou | z ou |
|
414 |
+
| zu | z u |
|
415 |
+
| zuan | z uan |
|
416 |
+
| zui | z ui |
|
417 |
+
| zun | z un |
|
418 |
+
| zuo | z uo |
|
inference/svs/opencpop/map.py
ADDED
@@ -0,0 +1,8 @@
1 |
+
def cpop_pinyin2ph_func():
|
2 |
+
# In the README file of opencpop dataset, they defined a "pinyin to phoneme mapping table"
|
3 |
+
pinyin2phs = {'AP': 'AP', 'SP': 'SP'}
|
4 |
+
with open('inference/svs/opencpop/cpop_pinyin2ph.txt') as rf:
|
5 |
+
for line in rf.readlines():
|
6 |
+
elements = [x.strip() for x in line.split('|') if x.strip() != '']
|
7 |
+
pinyin2phs[elements[0]] = elements[1]
|
8 |
+
return pinyin2phs
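
A hedged usage sketch (run from the repository root so the relative path inside `cpop_pinyin2ph_func` resolves): convert a short lyric to the opencpop-style phoneme set via pypinyin, mirroring what `preprocess_word_level_input` in `base_svs_infer.py` does.

```python
# Example: map a lyric to opencpop-style phonemes using the table above.
from pypinyin import lazy_pinyin
from inference.svs.opencpop.map import cpop_pinyin2ph_func

pinyin2phs = cpop_pinyin2ph_func()
lyric = '小酒窝'
pinyins = lazy_pinyin(lyric, strict=False)                 # ['xiao', 'jiu', 'wo']
phs = [pinyin2phs[p] for p in pinyins if p in pinyin2phs]  # ['x iao', 'j iu', 'w o']
print(' '.join(phs))
```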
|
modules/__init__.py
ADDED
File without changes
|
modules/commons/common_layers.py
ADDED
@@ -0,0 +1,668 @@
1 |
+
import math
|
2 |
+
import torch
|
3 |
+
from torch import nn
|
4 |
+
from torch.nn import Parameter
|
5 |
+
import torch.onnx.operators
|
6 |
+
import torch.nn.functional as F
|
7 |
+
import utils
|
8 |
+
|
9 |
+
|
10 |
+
class Reshape(nn.Module):
|
11 |
+
def __init__(self, *args):
|
12 |
+
super(Reshape, self).__init__()
|
13 |
+
self.shape = args
|
14 |
+
|
15 |
+
def forward(self, x):
|
16 |
+
return x.view(self.shape)
|
17 |
+
|
18 |
+
|
19 |
+
class Permute(nn.Module):
|
20 |
+
def __init__(self, *args):
|
21 |
+
super(Permute, self).__init__()
|
22 |
+
self.args = args
|
23 |
+
|
24 |
+
def forward(self, x):
|
25 |
+
return x.permute(self.args)
|
26 |
+
|
27 |
+
|
28 |
+
class LinearNorm(torch.nn.Module):
|
29 |
+
def __init__(self, in_dim, out_dim, bias=True, w_init_gain='linear'):
|
30 |
+
super(LinearNorm, self).__init__()
|
31 |
+
self.linear_layer = torch.nn.Linear(in_dim, out_dim, bias=bias)
|
32 |
+
|
33 |
+
torch.nn.init.xavier_uniform_(
|
34 |
+
self.linear_layer.weight,
|
35 |
+
gain=torch.nn.init.calculate_gain(w_init_gain))
|
36 |
+
|
37 |
+
def forward(self, x):
|
38 |
+
return self.linear_layer(x)
|
39 |
+
|
40 |
+
|
41 |
+
class ConvNorm(torch.nn.Module):
|
42 |
+
def __init__(self, in_channels, out_channels, kernel_size=1, stride=1,
|
43 |
+
padding=None, dilation=1, bias=True, w_init_gain='linear'):
|
44 |
+
super(ConvNorm, self).__init__()
|
45 |
+
if padding is None:
|
46 |
+
assert (kernel_size % 2 == 1)
|
47 |
+
padding = int(dilation * (kernel_size - 1) / 2)
|
48 |
+
|
49 |
+
self.conv = torch.nn.Conv1d(in_channels, out_channels,
|
50 |
+
kernel_size=kernel_size, stride=stride,
|
51 |
+
padding=padding, dilation=dilation,
|
52 |
+
bias=bias)
|
53 |
+
|
54 |
+
torch.nn.init.xavier_uniform_(
|
55 |
+
self.conv.weight, gain=torch.nn.init.calculate_gain(w_init_gain))
|
56 |
+
|
57 |
+
def forward(self, signal):
|
58 |
+
conv_signal = self.conv(signal)
|
59 |
+
return conv_signal
|
60 |
+
|
61 |
+
|
62 |
+
def Embedding(num_embeddings, embedding_dim, padding_idx=None):
|
63 |
+
m = nn.Embedding(num_embeddings, embedding_dim, padding_idx=padding_idx)
|
64 |
+
nn.init.normal_(m.weight, mean=0, std=embedding_dim ** -0.5)
|
65 |
+
if padding_idx is not None:
|
66 |
+
nn.init.constant_(m.weight[padding_idx], 0)
|
67 |
+
return m
|
68 |
+
|
69 |
+
|
70 |
+
def LayerNorm(normalized_shape, eps=1e-5, elementwise_affine=True, export=False):
|
71 |
+
if not export and torch.cuda.is_available():
|
72 |
+
try:
|
73 |
+
from apex.normalization import FusedLayerNorm
|
74 |
+
return FusedLayerNorm(normalized_shape, eps, elementwise_affine)
|
75 |
+
except ImportError:
|
76 |
+
pass
|
77 |
+
return torch.nn.LayerNorm(normalized_shape, eps, elementwise_affine)
|
78 |
+
|
79 |
+
|
80 |
+
def Linear(in_features, out_features, bias=True):
|
81 |
+
m = nn.Linear(in_features, out_features, bias)
|
82 |
+
nn.init.xavier_uniform_(m.weight)
|
83 |
+
if bias:
|
84 |
+
        nn.init.constant_(m.bias, 0.)
    return m


class SinusoidalPositionalEmbedding(nn.Module):
    """This module produces sinusoidal positional embeddings of any length.

    Padding symbols are ignored.
    """

    def __init__(self, embedding_dim, padding_idx, init_size=1024):
        super().__init__()
        self.embedding_dim = embedding_dim
        self.padding_idx = padding_idx
        self.weights = SinusoidalPositionalEmbedding.get_embedding(
            init_size,
            embedding_dim,
            padding_idx,
        )
        self.register_buffer('_float_tensor', torch.FloatTensor(1))

    @staticmethod
    def get_embedding(num_embeddings, embedding_dim, padding_idx=None):
        """Build sinusoidal embeddings.

        This matches the implementation in tensor2tensor, but differs slightly
        from the description in Section 3.5 of "Attention Is All You Need".
        """
        half_dim = embedding_dim // 2
        emb = math.log(10000) / (half_dim - 1)
        emb = torch.exp(torch.arange(half_dim, dtype=torch.float) * -emb)
        emb = torch.arange(num_embeddings, dtype=torch.float).unsqueeze(1) * emb.unsqueeze(0)
        emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1).view(num_embeddings, -1)
        if embedding_dim % 2 == 1:
            # zero pad
            emb = torch.cat([emb, torch.zeros(num_embeddings, 1)], dim=1)
        if padding_idx is not None:
            emb[padding_idx, :] = 0
        return emb

    def forward(self, input, incremental_state=None, timestep=None, positions=None, **kwargs):
        """Input is expected to be of size [bsz x seqlen]."""
        bsz, seq_len = input.shape[:2]
        max_pos = self.padding_idx + 1 + seq_len
        if self.weights is None or max_pos > self.weights.size(0):
            # recompute/expand embeddings if needed
            self.weights = SinusoidalPositionalEmbedding.get_embedding(
                max_pos,
                self.embedding_dim,
                self.padding_idx,
            )
        self.weights = self.weights.to(self._float_tensor)

        if incremental_state is not None:
            # positions is the same for every token when decoding a single step
            pos = timestep.view(-1)[0] + 1 if timestep is not None else seq_len
            return self.weights[self.padding_idx + pos, :].expand(bsz, 1, -1)

        positions = utils.make_positions(input, self.padding_idx) if positions is None else positions
        return self.weights.index_select(0, positions.view(-1)).view(bsz, seq_len, -1).detach()

    def max_positions(self):
        """Maximum number of supported positions."""
        return int(1e5)  # an arbitrary large number


class ConvTBC(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, padding=0):
        super(ConvTBC, self).__init__()
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        self.padding = padding

        self.weight = torch.nn.Parameter(torch.Tensor(
            self.kernel_size, in_channels, out_channels))
        self.bias = torch.nn.Parameter(torch.Tensor(out_channels))

    def forward(self, input):
        return torch.conv_tbc(input.contiguous(), self.weight, self.bias, self.padding)


class MultiheadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads, kdim=None, vdim=None, dropout=0., bias=True,
                 add_bias_kv=False, add_zero_attn=False, self_attention=False,
                 encoder_decoder_attention=False):
        super().__init__()
        self.embed_dim = embed_dim
        self.kdim = kdim if kdim is not None else embed_dim
        self.vdim = vdim if vdim is not None else embed_dim
        self.qkv_same_dim = self.kdim == embed_dim and self.vdim == embed_dim

        self.num_heads = num_heads
        self.dropout = dropout
        self.head_dim = embed_dim // num_heads
        assert self.head_dim * num_heads == self.embed_dim, "embed_dim must be divisible by num_heads"
        self.scaling = self.head_dim ** -0.5

        self.self_attention = self_attention
        self.encoder_decoder_attention = encoder_decoder_attention

        assert not self.self_attention or self.qkv_same_dim, 'Self-attention requires query, key and ' \
                                                             'value to be of the same size'

        if self.qkv_same_dim:
            self.in_proj_weight = Parameter(torch.Tensor(3 * embed_dim, embed_dim))
        else:
            self.k_proj_weight = Parameter(torch.Tensor(embed_dim, self.kdim))
            self.v_proj_weight = Parameter(torch.Tensor(embed_dim, self.vdim))
            self.q_proj_weight = Parameter(torch.Tensor(embed_dim, embed_dim))

        if bias:
            self.in_proj_bias = Parameter(torch.Tensor(3 * embed_dim))
        else:
            self.register_parameter('in_proj_bias', None)

        self.out_proj = nn.Linear(embed_dim, embed_dim, bias=bias)

        if add_bias_kv:
            self.bias_k = Parameter(torch.Tensor(1, 1, embed_dim))
            self.bias_v = Parameter(torch.Tensor(1, 1, embed_dim))
        else:
            self.bias_k = self.bias_v = None

        self.add_zero_attn = add_zero_attn

        self.reset_parameters()

        self.enable_torch_version = False
        if hasattr(F, "multi_head_attention_forward"):
            self.enable_torch_version = True
        else:
            self.enable_torch_version = False
        self.last_attn_probs = None

    def reset_parameters(self):
        if self.qkv_same_dim:
            nn.init.xavier_uniform_(self.in_proj_weight)
        else:
            nn.init.xavier_uniform_(self.k_proj_weight)
            nn.init.xavier_uniform_(self.v_proj_weight)
            nn.init.xavier_uniform_(self.q_proj_weight)

        nn.init.xavier_uniform_(self.out_proj.weight)
        if self.in_proj_bias is not None:
            nn.init.constant_(self.in_proj_bias, 0.)
            nn.init.constant_(self.out_proj.bias, 0.)
        if self.bias_k is not None:
            nn.init.xavier_normal_(self.bias_k)
        if self.bias_v is not None:
            nn.init.xavier_normal_(self.bias_v)

    def forward(
            self,
            query, key, value,
            key_padding_mask=None,
            incremental_state=None,
            need_weights=True,
            static_kv=False,
            attn_mask=None,
            before_softmax=False,
            need_head_weights=False,
            enc_dec_attn_constraint_mask=None,
            reset_attn_weight=None
    ):
        """Input shape: Time x Batch x Channel

        Args:
            key_padding_mask (ByteTensor, optional): mask to exclude
                keys that are pads, of shape `(batch, src_len)`, where
                padding elements are indicated by 1s.
            need_weights (bool, optional): return the attention weights,
                averaged over heads (default: False).
            attn_mask (ByteTensor, optional): typically used to
                implement causal attention, where the mask prevents the
                attention from looking forward in time (default: None).
            before_softmax (bool, optional): return the raw attention
                weights and values before the attention softmax.
            need_head_weights (bool, optional): return the attention
                weights for each head. Implies *need_weights*. Default:
                return the average attention weights over all heads.
        """
        if need_head_weights:
            need_weights = True

        tgt_len, bsz, embed_dim = query.size()
        assert embed_dim == self.embed_dim
        assert list(query.size()) == [tgt_len, bsz, embed_dim]

        if self.enable_torch_version and incremental_state is None and not static_kv and reset_attn_weight is None:
            if self.qkv_same_dim:
                return F.multi_head_attention_forward(query, key, value,
                                                      self.embed_dim, self.num_heads,
                                                      self.in_proj_weight,
                                                      self.in_proj_bias, self.bias_k, self.bias_v,
                                                      self.add_zero_attn, self.dropout,
                                                      self.out_proj.weight, self.out_proj.bias,
                                                      self.training, key_padding_mask, need_weights,
                                                      attn_mask)
            else:
                return F.multi_head_attention_forward(query, key, value,
                                                      self.embed_dim, self.num_heads,
                                                      torch.empty([0]),
                                                      self.in_proj_bias, self.bias_k, self.bias_v,
                                                      self.add_zero_attn, self.dropout,
                                                      self.out_proj.weight, self.out_proj.bias,
                                                      self.training, key_padding_mask, need_weights,
                                                      attn_mask, use_separate_proj_weight=True,
                                                      q_proj_weight=self.q_proj_weight,
                                                      k_proj_weight=self.k_proj_weight,
                                                      v_proj_weight=self.v_proj_weight)

        if incremental_state is not None:
            print('Not implemented error.')
            exit()
        else:
            saved_state = None

        if self.self_attention:
            # self-attention
            q, k, v = self.in_proj_qkv(query)
        elif self.encoder_decoder_attention:
            # encoder-decoder attention
            q = self.in_proj_q(query)
            if key is None:
                assert value is None
                k = v = None
            else:
                k = self.in_proj_k(key)
                v = self.in_proj_v(key)
        else:
            q = self.in_proj_q(query)
            k = self.in_proj_k(key)
            v = self.in_proj_v(value)
        q *= self.scaling

        if self.bias_k is not None:
            assert self.bias_v is not None
            k = torch.cat([k, self.bias_k.repeat(1, bsz, 1)])
            v = torch.cat([v, self.bias_v.repeat(1, bsz, 1)])
            if attn_mask is not None:
                attn_mask = torch.cat([attn_mask, attn_mask.new_zeros(attn_mask.size(0), 1)], dim=1)
            if key_padding_mask is not None:
                key_padding_mask = torch.cat(
                    [key_padding_mask, key_padding_mask.new_zeros(key_padding_mask.size(0), 1)], dim=1)

        q = q.contiguous().view(tgt_len, bsz * self.num_heads, self.head_dim).transpose(0, 1)
        if k is not None:
            k = k.contiguous().view(-1, bsz * self.num_heads, self.head_dim).transpose(0, 1)
        if v is not None:
            v = v.contiguous().view(-1, bsz * self.num_heads, self.head_dim).transpose(0, 1)

        if saved_state is not None:
            print('Not implemented error.')
            exit()

        src_len = k.size(1)

        # This is part of a workaround to get around fork/join parallelism
        # not supporting Optional types.
        if key_padding_mask is not None and key_padding_mask.shape == torch.Size([]):
            key_padding_mask = None

        if key_padding_mask is not None:
            assert key_padding_mask.size(0) == bsz
            assert key_padding_mask.size(1) == src_len

        if self.add_zero_attn:
            src_len += 1
            k = torch.cat([k, k.new_zeros((k.size(0), 1) + k.size()[2:])], dim=1)
            v = torch.cat([v, v.new_zeros((v.size(0), 1) + v.size()[2:])], dim=1)
            if attn_mask is not None:
                attn_mask = torch.cat([attn_mask, attn_mask.new_zeros(attn_mask.size(0), 1)], dim=1)
            if key_padding_mask is not None:
                key_padding_mask = torch.cat(
                    [key_padding_mask, torch.zeros(key_padding_mask.size(0), 1).type_as(key_padding_mask)], dim=1)

        attn_weights = torch.bmm(q, k.transpose(1, 2))
        attn_weights = self.apply_sparse_mask(attn_weights, tgt_len, src_len, bsz)

        assert list(attn_weights.size()) == [bsz * self.num_heads, tgt_len, src_len]

        if attn_mask is not None:
            if len(attn_mask.shape) == 2:
                attn_mask = attn_mask.unsqueeze(0)
            elif len(attn_mask.shape) == 3:
                attn_mask = attn_mask[:, None].repeat([1, self.num_heads, 1, 1]).reshape(
                    bsz * self.num_heads, tgt_len, src_len)
            attn_weights = attn_weights + attn_mask

        if enc_dec_attn_constraint_mask is not None:  # bs x head x L_kv
            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
            attn_weights = attn_weights.masked_fill(
                enc_dec_attn_constraint_mask.unsqueeze(2).bool(),
                -1e9,
            )
            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)

        if key_padding_mask is not None:
            # don't attend to padding symbols
            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
            attn_weights = attn_weights.masked_fill(
                key_padding_mask.unsqueeze(1).unsqueeze(2),
                -1e9,
            )
            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)

        attn_logits = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)

        if before_softmax:
            return attn_weights, v

        attn_weights_float = utils.softmax(attn_weights, dim=-1)
        attn_weights = attn_weights_float.type_as(attn_weights)
        attn_probs = F.dropout(attn_weights_float.type_as(attn_weights), p=self.dropout, training=self.training)

        if reset_attn_weight is not None:
            if reset_attn_weight:
                self.last_attn_probs = attn_probs.detach()
            else:
                assert self.last_attn_probs is not None
                attn_probs = self.last_attn_probs
        attn = torch.bmm(attn_probs, v)
        assert list(attn.size()) == [bsz * self.num_heads, tgt_len, self.head_dim]
        attn = attn.transpose(0, 1).contiguous().view(tgt_len, bsz, embed_dim)
        attn = self.out_proj(attn)

        if need_weights:
            attn_weights = attn_weights_float.view(bsz, self.num_heads, tgt_len, src_len).transpose(1, 0)
            if not need_head_weights:
                # average attention weights over heads
                attn_weights = attn_weights.mean(dim=0)
        else:
            attn_weights = None

        return attn, (attn_weights, attn_logits)

    def in_proj_qkv(self, query):
        return self._in_proj(query).chunk(3, dim=-1)

    def in_proj_q(self, query):
        if self.qkv_same_dim:
            return self._in_proj(query, end=self.embed_dim)
        else:
            bias = self.in_proj_bias
            if bias is not None:
                bias = bias[:self.embed_dim]
            return F.linear(query, self.q_proj_weight, bias)

    def in_proj_k(self, key):
        if self.qkv_same_dim:
            return self._in_proj(key, start=self.embed_dim, end=2 * self.embed_dim)
        else:
            weight = self.k_proj_weight
            bias = self.in_proj_bias
            if bias is not None:
                bias = bias[self.embed_dim:2 * self.embed_dim]
            return F.linear(key, weight, bias)

    def in_proj_v(self, value):
        if self.qkv_same_dim:
            return self._in_proj(value, start=2 * self.embed_dim)
        else:
            weight = self.v_proj_weight
            bias = self.in_proj_bias
            if bias is not None:
                bias = bias[2 * self.embed_dim:]
            return F.linear(value, weight, bias)

    def _in_proj(self, input, start=0, end=None):
        weight = self.in_proj_weight
        bias = self.in_proj_bias
        weight = weight[start:end, :]
        if bias is not None:
            bias = bias[start:end]
        return F.linear(input, weight, bias)

    def apply_sparse_mask(self, attn_weights, tgt_len, src_len, bsz):
        return attn_weights


class Swish(torch.autograd.Function):
    @staticmethod
    def forward(ctx, i):
        result = i * torch.sigmoid(i)
        ctx.save_for_backward(i)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        i = ctx.saved_variables[0]
        sigmoid_i = torch.sigmoid(i)
        return grad_output * (sigmoid_i * (1 + i * (1 - sigmoid_i)))


class CustomSwish(nn.Module):
    def forward(self, input_tensor):
        return Swish.apply(input_tensor)


class TransformerFFNLayer(nn.Module):
    def __init__(self, hidden_size, filter_size, padding="SAME", kernel_size=1, dropout=0., act='gelu'):
        super().__init__()
        self.kernel_size = kernel_size
        self.dropout = dropout
        self.act = act
        if padding == 'SAME':
            self.ffn_1 = nn.Conv1d(hidden_size, filter_size, kernel_size, padding=kernel_size // 2)
        elif padding == 'LEFT':
            self.ffn_1 = nn.Sequential(
                nn.ConstantPad1d((kernel_size - 1, 0), 0.0),
                nn.Conv1d(hidden_size, filter_size, kernel_size)
            )
        self.ffn_2 = Linear(filter_size, hidden_size)
        if self.act == 'swish':
            self.swish_fn = CustomSwish()

    def forward(self, x, incremental_state=None):
        # x: T x B x C
        if incremental_state is not None:
            assert incremental_state is None, 'Nar-generation does not allow this.'
            exit(1)

        x = self.ffn_1(x.permute(1, 2, 0)).permute(2, 0, 1)
        x = x * self.kernel_size ** -0.5

        if incremental_state is not None:
            x = x[-1:]
        if self.act == 'gelu':
            x = F.gelu(x)
        if self.act == 'relu':
            x = F.relu(x)
        if self.act == 'swish':
            x = self.swish_fn(x)
        x = F.dropout(x, self.dropout, training=self.training)
        x = self.ffn_2(x)
        return x


class BatchNorm1dTBC(nn.Module):
    def __init__(self, c):
        super(BatchNorm1dTBC, self).__init__()
        self.bn = nn.BatchNorm1d(c)

    def forward(self, x):
        """

        :param x: [T, B, C]
        :return: [T, B, C]
        """
        x = x.permute(1, 2, 0)  # [B, C, T]
        x = self.bn(x)  # [B, C, T]
        x = x.permute(2, 0, 1)  # [T, B, C]
        return x


class EncSALayer(nn.Module):
    def __init__(self, c, num_heads, dropout, attention_dropout=0.1,
                 relu_dropout=0.1, kernel_size=9, padding='SAME', norm='ln', act='gelu'):
        super().__init__()
        self.c = c
        self.dropout = dropout
        self.num_heads = num_heads
        if num_heads > 0:
            if norm == 'ln':
                self.layer_norm1 = LayerNorm(c)
            elif norm == 'bn':
                self.layer_norm1 = BatchNorm1dTBC(c)
            self.self_attn = MultiheadAttention(
                self.c, num_heads, self_attention=True, dropout=attention_dropout, bias=False,
            )
        if norm == 'ln':
            self.layer_norm2 = LayerNorm(c)
        elif norm == 'bn':
            self.layer_norm2 = BatchNorm1dTBC(c)
        self.ffn = TransformerFFNLayer(
            c, 4 * c, kernel_size=kernel_size, dropout=relu_dropout, padding=padding, act=act)

    def forward(self, x, encoder_padding_mask=None, **kwargs):
        layer_norm_training = kwargs.get('layer_norm_training', None)
        if layer_norm_training is not None:
            self.layer_norm1.training = layer_norm_training
            self.layer_norm2.training = layer_norm_training
        if self.num_heads > 0:
            residual = x
            x = self.layer_norm1(x)
            x, _, = self.self_attn(
                query=x,
                key=x,
                value=x,
                key_padding_mask=encoder_padding_mask
            )
            x = F.dropout(x, self.dropout, training=self.training)
            x = residual + x
            x = x * (1 - encoder_padding_mask.float()).transpose(0, 1)[..., None]

        residual = x
        x = self.layer_norm2(x)
        x = self.ffn(x)
        x = F.dropout(x, self.dropout, training=self.training)
        x = residual + x
        x = x * (1 - encoder_padding_mask.float()).transpose(0, 1)[..., None]
        return x


class DecSALayer(nn.Module):
    def __init__(self, c, num_heads, dropout, attention_dropout=0.1, relu_dropout=0.1, kernel_size=9, act='gelu'):
        super().__init__()
        self.c = c
        self.dropout = dropout
        self.layer_norm1 = LayerNorm(c)
        self.self_attn = MultiheadAttention(
            c, num_heads, self_attention=True, dropout=attention_dropout, bias=False
        )
        self.layer_norm2 = LayerNorm(c)
        self.encoder_attn = MultiheadAttention(
            c, num_heads, encoder_decoder_attention=True, dropout=attention_dropout, bias=False,
        )
        self.layer_norm3 = LayerNorm(c)
        self.ffn = TransformerFFNLayer(
            c, 4 * c, padding='LEFT', kernel_size=kernel_size, dropout=relu_dropout, act=act)

    def forward(
            self,
            x,
            encoder_out=None,
            encoder_padding_mask=None,
            incremental_state=None,
            self_attn_mask=None,
            self_attn_padding_mask=None,
            attn_out=None,
            reset_attn_weight=None,
            **kwargs,
    ):
        layer_norm_training = kwargs.get('layer_norm_training', None)
        if layer_norm_training is not None:
            self.layer_norm1.training = layer_norm_training
            self.layer_norm2.training = layer_norm_training
            self.layer_norm3.training = layer_norm_training
        residual = x
        x = self.layer_norm1(x)
        x, _ = self.self_attn(
            query=x,
            key=x,
            value=x,
            key_padding_mask=self_attn_padding_mask,
            incremental_state=incremental_state,
            attn_mask=self_attn_mask
        )
        x = F.dropout(x, self.dropout, training=self.training)
        x = residual + x

        residual = x
        x = self.layer_norm2(x)
        if encoder_out is not None:
            x, attn = self.encoder_attn(
                query=x,
                key=encoder_out,
                value=encoder_out,
                key_padding_mask=encoder_padding_mask,
                incremental_state=incremental_state,
                static_kv=True,
                enc_dec_attn_constraint_mask=None,  # utils.get_incremental_state(self, incremental_state, 'enc_dec_attn_constraint_mask'),
                reset_attn_weight=reset_attn_weight
            )
            attn_logits = attn[1]
        else:
            assert attn_out is not None
            x = self.encoder_attn.in_proj_v(attn_out.transpose(0, 1))
            attn_logits = None
        x = F.dropout(x, self.dropout, training=self.training)
        x = residual + x

        residual = x
        x = self.layer_norm3(x)
        x = self.ffn(x, incremental_state=incremental_state)
        x = F.dropout(x, self.dropout, training=self.training)
        x = residual + x
        # if len(attn_logits.size()) > 3:
        #     indices = attn_logits.softmax(-1).max(-1).values.sum(-1).argmax(-1)
        #     attn_logits = attn_logits.gather(1,
        #         indices[:, None, None, None].repeat(1, 1, attn_logits.size(-2), attn_logits.size(-1))).squeeze(1)
        return x, attn_logits
modules/commons/espnet_positional_embedding.py
ADDED
@@ -0,0 +1,113 @@
import math
import torch


class PositionalEncoding(torch.nn.Module):
    """Positional encoding.
    Args:
        d_model (int): Embedding dimension.
        dropout_rate (float): Dropout rate.
        max_len (int): Maximum input length.
        reverse (bool): Whether to reverse the input position.
    """

    def __init__(self, d_model, dropout_rate, max_len=5000, reverse=False):
        """Construct an PositionalEncoding object."""
        super(PositionalEncoding, self).__init__()
        self.d_model = d_model
        self.reverse = reverse
        self.xscale = math.sqrt(self.d_model)
        self.dropout = torch.nn.Dropout(p=dropout_rate)
        self.pe = None
        self.extend_pe(torch.tensor(0.0).expand(1, max_len))

    def extend_pe(self, x):
        """Reset the positional encodings."""
        if self.pe is not None:
            if self.pe.size(1) >= x.size(1):
                if self.pe.dtype != x.dtype or self.pe.device != x.device:
                    self.pe = self.pe.to(dtype=x.dtype, device=x.device)
                return
        pe = torch.zeros(x.size(1), self.d_model)
        if self.reverse:
            position = torch.arange(
                x.size(1) - 1, -1, -1.0, dtype=torch.float32
            ).unsqueeze(1)
        else:
            position = torch.arange(0, x.size(1), dtype=torch.float32).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, self.d_model, 2, dtype=torch.float32)
            * -(math.log(10000.0) / self.d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.pe = pe.to(device=x.device, dtype=x.dtype)

    def forward(self, x: torch.Tensor):
        """Add positional encoding.
        Args:
            x (torch.Tensor): Input tensor (batch, time, `*`).
        Returns:
            torch.Tensor: Encoded tensor (batch, time, `*`).
        """
        self.extend_pe(x)
        x = x * self.xscale + self.pe[:, : x.size(1)]
        return self.dropout(x)


class ScaledPositionalEncoding(PositionalEncoding):
    """Scaled positional encoding module.
    See Sec. 3.2 https://arxiv.org/abs/1809.08895
    Args:
        d_model (int): Embedding dimension.
        dropout_rate (float): Dropout rate.
        max_len (int): Maximum input length.
    """

    def __init__(self, d_model, dropout_rate, max_len=5000):
        """Initialize class."""
        super().__init__(d_model=d_model, dropout_rate=dropout_rate, max_len=max_len)
        self.alpha = torch.nn.Parameter(torch.tensor(1.0))

    def reset_parameters(self):
        """Reset parameters."""
        self.alpha.data = torch.tensor(1.0)

    def forward(self, x):
        """Add positional encoding.
        Args:
            x (torch.Tensor): Input tensor (batch, time, `*`).
        Returns:
            torch.Tensor: Encoded tensor (batch, time, `*`).
        """
        self.extend_pe(x)
        x = x + self.alpha * self.pe[:, : x.size(1)]
        return self.dropout(x)


class RelPositionalEncoding(PositionalEncoding):
    """Relative positional encoding module.
    See : Appendix B in https://arxiv.org/abs/1901.02860
    Args:
        d_model (int): Embedding dimension.
        dropout_rate (float): Dropout rate.
        max_len (int): Maximum input length.
    """

    def __init__(self, d_model, dropout_rate, max_len=5000):
        """Initialize class."""
        super().__init__(d_model, dropout_rate, max_len, reverse=True)

    def forward(self, x):
        """Compute positional encoding.
        Args:
            x (torch.Tensor): Input tensor (batch, time, `*`).
        Returns:
            torch.Tensor: Encoded tensor (batch, time, `*`).
            torch.Tensor: Positional embedding tensor (1, time, `*`).
        """
        self.extend_pe(x)
        x = x * self.xscale
        pos_emb = self.pe[:, : x.size(1)]
        return self.dropout(x) + self.dropout(pos_emb)
modules/commons/ssim.py
ADDED
@@ -0,0 +1,391 @@
# '''
# https://github.com/One-sixth/ms_ssim_pytorch/blob/master/ssim.py
# '''
#
# import torch
# import torch.jit
# import torch.nn.functional as F
#
#
# @torch.jit.script
# def create_window(window_size: int, sigma: float, channel: int):
# '''
# Create 1-D gauss kernel
# :param window_size: the size of gauss kernel
# :param sigma: sigma of normal distribution
# :param channel: input channel
# :return: 1D kernel
# '''
# coords = torch.arange(window_size, dtype=torch.float)
# coords -= window_size // 2
#
# g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
# g /= g.sum()
#
# g = g.reshape(1, 1, 1, -1).repeat(channel, 1, 1, 1)
# return g
#
#
# @torch.jit.script
# def _gaussian_filter(x, window_1d, use_padding: bool):
# '''
# Blur input with 1-D kernel
# :param x: batch of tensors to be blured
# :param window_1d: 1-D gauss kernel
# :param use_padding: padding image before conv
# :return: blured tensors
# '''
# C = x.shape[1]
# padding = 0
# if use_padding:
# window_size = window_1d.shape[3]
# padding = window_size // 2
# out = F.conv2d(x, window_1d, stride=1, padding=(0, padding), groups=C)
# out = F.conv2d(out, window_1d.transpose(2, 3), stride=1, padding=(padding, 0), groups=C)
# return out
#
#
# @torch.jit.script
# def ssim(X, Y, window, data_range: float, use_padding: bool = False):
# '''
# Calculate ssim index for X and Y
# :param X: images [B, C, H, N_bins]
# :param Y: images [B, C, H, N_bins]
# :param window: 1-D gauss kernel
# :param data_range: value range of input images. (usually 1.0 or 255)
# :param use_padding: padding image before conv
# :return:
# '''
#
# K1 = 0.01
# K2 = 0.03
# compensation = 1.0
#
# C1 = (K1 * data_range) ** 2
# C2 = (K2 * data_range) ** 2
#
# mu1 = _gaussian_filter(X, window, use_padding)
# mu2 = _gaussian_filter(Y, window, use_padding)
# sigma1_sq = _gaussian_filter(X * X, window, use_padding)
# sigma2_sq = _gaussian_filter(Y * Y, window, use_padding)
# sigma12 = _gaussian_filter(X * Y, window, use_padding)
#
# mu1_sq = mu1.pow(2)
# mu2_sq = mu2.pow(2)
# mu1_mu2 = mu1 * mu2
#
# sigma1_sq = compensation * (sigma1_sq - mu1_sq)
# sigma2_sq = compensation * (sigma2_sq - mu2_sq)
# sigma12 = compensation * (sigma12 - mu1_mu2)
#
# cs_map = (2 * sigma12 + C2) / (sigma1_sq + sigma2_sq + C2)
# # Fixed the issue that the negative value of cs_map caused ms_ssim to output Nan.
# cs_map = cs_map.clamp_min(0.)
# ssim_map = ((2 * mu1_mu2 + C1) / (mu1_sq + mu2_sq + C1)) * cs_map
#
# ssim_val = ssim_map.mean(dim=(1, 2, 3))  # reduce along CHW
# cs = cs_map.mean(dim=(1, 2, 3))
#
# return ssim_val, cs
#
#
# @torch.jit.script
# def ms_ssim(X, Y, window, data_range: float, weights, use_padding: bool = False, eps: float = 1e-8):
# '''
# interface of ms-ssim
# :param X: a batch of images, (N,C,H,W)
# :param Y: a batch of images, (N,C,H,W)
# :param window: 1-D gauss kernel
# :param data_range: value range of input images. (usually 1.0 or 255)
# :param weights: weights for different levels
# :param use_padding: padding image before conv
# :param eps: use for avoid grad nan.
# :return:
# '''
# levels = weights.shape[0]
# cs_vals = []
# ssim_vals = []
# for _ in range(levels):
# ssim_val, cs = ssim(X, Y, window=window, data_range=data_range, use_padding=use_padding)
# # Use for fix a issue. When c = a ** b and a is 0, c.backward() will cause the a.grad become inf.
# ssim_val = ssim_val.clamp_min(eps)
# cs = cs.clamp_min(eps)
# cs_vals.append(cs)
#
# ssim_vals.append(ssim_val)
# padding = (X.shape[2] % 2, X.shape[3] % 2)
# X = F.avg_pool2d(X, kernel_size=2, stride=2, padding=padding)
# Y = F.avg_pool2d(Y, kernel_size=2, stride=2, padding=padding)
#
# cs_vals = torch.stack(cs_vals, dim=0)
# ms_ssim_val = torch.prod((cs_vals[:-1] ** weights[:-1].unsqueeze(1)) * (ssim_vals[-1] ** weights[-1]), dim=0)
# return ms_ssim_val
#
#
# class SSIM(torch.jit.ScriptModule):
# __constants__ = ['data_range', 'use_padding']
#
# def __init__(self, window_size=11, window_sigma=1.5, data_range=255., channel=3, use_padding=False):
# '''
# :param window_size: the size of gauss kernel
# :param window_sigma: sigma of normal distribution
# :param data_range: value range of input images. (usually 1.0 or 255)
# :param channel: input channels (default: 3)
# :param use_padding: padding image before conv
# '''
# super().__init__()
# assert window_size % 2 == 1, 'Window size must be odd.'
# window = create_window(window_size, window_sigma, channel)
# self.register_buffer('window', window)
# self.data_range = data_range
# self.use_padding = use_padding
#
# @torch.jit.script_method
# def forward(self, X, Y):
# r = ssim(X, Y, window=self.window, data_range=self.data_range, use_padding=self.use_padding)
# return r[0]
#
#
# class MS_SSIM(torch.jit.ScriptModule):
# __constants__ = ['data_range', 'use_padding', 'eps']
#
# def __init__(self, window_size=11, window_sigma=1.5, data_range=255., channel=3, use_padding=False, weights=None,
#              levels=None, eps=1e-8):
# '''
# class for ms-ssim
# :param window_size: the size of gauss kernel
# :param window_sigma: sigma of normal distribution
# :param data_range: value range of input images. (usually 1.0 or 255)
# :param channel: input channels
# :param use_padding: padding image before conv
# :param weights: weights for different levels. (default [0.0448, 0.2856, 0.3001, 0.2363, 0.1333])
# :param levels: number of downsampling
# :param eps: Use for fix a issue. When c = a ** b and a is 0, c.backward() will cause the a.grad become inf.
# '''
# super().__init__()
# assert window_size % 2 == 1, 'Window size must be odd.'
# self.data_range = data_range
# self.use_padding = use_padding
# self.eps = eps
#
# window = create_window(window_size, window_sigma, channel)
# self.register_buffer('window', window)
#
# if weights is None:
# weights = [0.0448, 0.2856, 0.3001, 0.2363, 0.1333]
# weights = torch.tensor(weights, dtype=torch.float)
#
# if levels is not None:
# weights = weights[:levels]
# weights = weights / weights.sum()
#
# self.register_buffer('weights', weights)
#
# @torch.jit.script_method
# def forward(self, X, Y):
# return ms_ssim(X, Y, window=self.window, data_range=self.data_range, weights=self.weights,
#                use_padding=self.use_padding, eps=self.eps)
#
#
# if __name__ == '__main__':
# print('Simple Test')
# im = torch.randint(0, 255, (5, 3, 256, 256), dtype=torch.float, device='cuda')
# img1 = im / 255
# img2 = img1 * 0.5
#
# losser = SSIM(data_range=1.).cuda()
# loss = losser(img1, img2).mean()
#
# losser2 = MS_SSIM(data_range=1.).cuda()
# loss2 = losser2(img1, img2).mean()
#
# print(loss.item())
# print(loss2.item())
#
# if __name__ == '__main__':
# print('Training Test')
# import cv2
# import torch.optim
# import numpy as np
# import imageio
# import time
#
# out_test_video = False
# # Better not to write the GIF directly (it gets very large); write an .mkv first and convert it to GIF with ffmpeg.
# video_use_gif = False
#
# im = cv2.imread('test_img1.jpg', 1)
# t_im = torch.from_numpy(im).cuda().permute(2, 0, 1).float()[None] / 255.
#
# if out_test_video:
# if video_use_gif:
# fps = 0.5
# out_wh = (im.shape[1] // 2, im.shape[0] // 2)
# suffix = '.gif'
# else:
# fps = 5
# out_wh = (im.shape[1], im.shape[0])
# suffix = '.mkv'
# video_last_time = time.perf_counter()
# video = imageio.get_writer('ssim_test' + suffix, fps=fps)
#
# # test ssim
# print('Training SSIM')
# rand_im = torch.randint_like(t_im, 0, 255, dtype=torch.float32) / 255.
# rand_im.requires_grad = True
# optim = torch.optim.Adam([rand_im], 0.003, eps=1e-8)
# losser = SSIM(data_range=1., channel=t_im.shape[1]).cuda()
# ssim_score = 0
# while ssim_score < 0.999:
# optim.zero_grad()
# loss = losser(rand_im, t_im)
# (-loss).sum().backward()
# ssim_score = loss.item()
# optim.step()
# r_im = np.transpose(rand_im.detach().cpu().numpy().clip(0, 1) * 255, [0, 2, 3, 1]).astype(np.uint8)[0]
# r_im = cv2.putText(r_im, 'ssim %f' % ssim_score, (10, 30), cv2.FONT_HERSHEY_PLAIN, 2, (255, 0, 0), 2)
#
# if out_test_video:
# if time.perf_counter() - video_last_time > 1. / fps:
# video_last_time = time.perf_counter()
# out_frame = cv2.cvtColor(r_im, cv2.COLOR_BGR2RGB)
# out_frame = cv2.resize(out_frame, out_wh, interpolation=cv2.INTER_AREA)
# if isinstance(out_frame, cv2.UMat):
# out_frame = out_frame.get()
# video.append_data(out_frame)
#
# cv2.imshow('ssim', r_im)
# cv2.setWindowTitle('ssim', 'ssim %f' % ssim_score)
# cv2.waitKey(1)
#
# if out_test_video:
# video.close()
#
# # test ms_ssim
# if out_test_video:
# if video_use_gif:
# fps = 0.5
# out_wh = (im.shape[1] // 2, im.shape[0] // 2)
# suffix = '.gif'
# else:
# fps = 5
# out_wh = (im.shape[1], im.shape[0])
# suffix = '.mkv'
# video_last_time = time.perf_counter()
# video = imageio.get_writer('ms_ssim_test' + suffix, fps=fps)
#
# print('Training MS_SSIM')
# rand_im = torch.randint_like(t_im, 0, 255, dtype=torch.float32) / 255.
# rand_im.requires_grad = True
# optim = torch.optim.Adam([rand_im], 0.003, eps=1e-8)
# losser = MS_SSIM(data_range=1., channel=t_im.shape[1]).cuda()
# ssim_score = 0
# while ssim_score < 0.999:
# optim.zero_grad()
# loss = losser(rand_im, t_im)
# (-loss).sum().backward()
# ssim_score = loss.item()
# optim.step()
# r_im = np.transpose(rand_im.detach().cpu().numpy().clip(0, 1) * 255, [0, 2, 3, 1]).astype(np.uint8)[0]
# r_im = cv2.putText(r_im, 'ms_ssim %f' % ssim_score, (10, 30), cv2.FONT_HERSHEY_PLAIN, 2, (255, 0, 0), 2)
#
# if out_test_video:
# if time.perf_counter() - video_last_time > 1. / fps:
# video_last_time = time.perf_counter()
# out_frame = cv2.cvtColor(r_im, cv2.COLOR_BGR2RGB)
# out_frame = cv2.resize(out_frame, out_wh, interpolation=cv2.INTER_AREA)
# if isinstance(out_frame, cv2.UMat):
# out_frame = out_frame.get()
# video.append_data(out_frame)
#
# cv2.imshow('ms_ssim', r_im)
# cv2.setWindowTitle('ms_ssim', 'ms_ssim %f' % ssim_score)
# cv2.waitKey(1)
#
# if out_test_video:
# video.close()

"""
Adapted from https://github.com/Po-Hsun-Su/pytorch-ssim
"""

import torch
import torch.nn.functional as F
from torch.autograd import Variable
import numpy as np
from math import exp


def gaussian(window_size, sigma):
    gauss = torch.Tensor([exp(-(x - window_size // 2) ** 2 / float(2 * sigma ** 2)) for x in range(window_size)])
    return gauss / gauss.sum()


def create_window(window_size, channel):
    _1D_window = gaussian(window_size, 1.5).unsqueeze(1)
    _2D_window = _1D_window.mm(_1D_window.t()).float().unsqueeze(0).unsqueeze(0)
    window = Variable(_2D_window.expand(channel, 1, window_size, window_size).contiguous())
    return window


def _ssim(img1, img2, window, window_size, channel, size_average=True):
    mu1 = F.conv2d(img1, window, padding=window_size // 2, groups=channel)
    mu2 = F.conv2d(img2, window, padding=window_size // 2, groups=channel)

    mu1_sq = mu1.pow(2)
    mu2_sq = mu2.pow(2)
    mu1_mu2 = mu1 * mu2

    sigma1_sq = F.conv2d(img1 * img1, window, padding=window_size // 2, groups=channel) - mu1_sq
    sigma2_sq = F.conv2d(img2 * img2, window, padding=window_size // 2, groups=channel) - mu2_sq
    sigma12 = F.conv2d(img1 * img2, window, padding=window_size // 2, groups=channel) - mu1_mu2

    C1 = 0.01 ** 2
    C2 = 0.03 ** 2

    ssim_map = ((2 * mu1_mu2 + C1) * (2 * sigma12 + C2)) / ((mu1_sq + mu2_sq + C1) * (sigma1_sq + sigma2_sq + C2))

    if size_average:
        return ssim_map.mean()
    else:
        return ssim_map.mean(1)


class SSIM(torch.nn.Module):
    def __init__(self, window_size=11, size_average=True):
        super(SSIM, self).__init__()
        self.window_size = window_size
        self.size_average = size_average
        self.channel = 1
        self.window = create_window(window_size, self.channel)

    def forward(self, img1, img2):
        (_, channel, _, _) = img1.size()

        if channel == self.channel and self.window.data.type() == img1.data.type():
            window = self.window
        else:
            window = create_window(self.window_size, channel)

            if img1.is_cuda:
                window = window.cuda(img1.get_device())
            window = window.type_as(img1)

            self.window = window
            self.channel = channel

        return _ssim(img1, img2, window, self.window_size, channel, self.size_average)


window = None


def ssim(img1, img2, window_size=11, size_average=True):
    (_, channel, _, _) = img1.size()
    global window
    if window is None:
        window = create_window(window_size, channel)
        if img1.is_cuda:
            window = window.cuda(img1.get_device())
        window = window.type_as(img1)
    return _ssim(img1, img2, window, window_size, channel, size_average)
modules/diffsinger_midi/fs2.py
ADDED
@@ -0,0 +1,118 @@
from modules.commons.common_layers import *
from modules.commons.common_layers import Embedding
from modules.fastspeech.tts_modules import FastspeechDecoder, DurationPredictor, LengthRegulator, PitchPredictor, \
    EnergyPredictor, FastspeechEncoder
from utils.cwt import cwt2f0
from utils.hparams import hparams
from utils.pitch_utils import f0_to_coarse, denorm_f0, norm_f0
from modules.fastspeech.fs2 import FastSpeech2


class FastspeechMIDIEncoder(FastspeechEncoder):
    def forward_embedding(self, txt_tokens, midi_embedding, midi_dur_embedding, slur_embedding):
        # embed tokens and positions
        x = self.embed_scale * self.embed_tokens(txt_tokens)
        x = x + midi_embedding + midi_dur_embedding + slur_embedding
        if hparams['use_pos_embed']:
            if hparams.get('rel_pos') is not None and hparams['rel_pos']:
                x = self.embed_positions(x)
            else:
                positions = self.embed_positions(txt_tokens)
                x = x + positions
        x = F.dropout(x, p=self.dropout, training=self.training)
        return x

    def forward(self, txt_tokens, midi_embedding, midi_dur_embedding, slur_embedding):
        """

        :param txt_tokens: [B, T]
        :return: {
            'encoder_out': [T x B x C]
        }
        """
        encoder_padding_mask = txt_tokens.eq(self.padding_idx).data
        x = self.forward_embedding(txt_tokens, midi_embedding, midi_dur_embedding, slur_embedding)  # [B, T, H]
        x = super(FastspeechEncoder, self).forward(x, encoder_padding_mask)
        return x


FS_ENCODERS = {
    'fft': lambda hp, embed_tokens, d: FastspeechMIDIEncoder(
        embed_tokens, hp['hidden_size'], hp['enc_layers'], hp['enc_ffn_kernel_size'],
        num_heads=hp['num_heads']),
}


class FastSpeech2MIDI(FastSpeech2):
    def __init__(self, dictionary, out_dims=None):
        super().__init__(dictionary, out_dims)
        del self.encoder
        self.encoder = FS_ENCODERS[hparams['encoder_type']](hparams, self.encoder_embed_tokens, self.dictionary)
        self.midi_embed = Embedding(300, self.hidden_size, self.padding_idx)
        self.midi_dur_layer = Linear(1, self.hidden_size)
        self.is_slur_embed = Embedding(2, self.hidden_size)

    def forward(self, txt_tokens, mel2ph=None, spk_embed=None,
                ref_mels=None, f0=None, uv=None, energy=None, skip_decoder=False,
                spk_embed_dur_id=None, spk_embed_f0_id=None, infer=False, **kwargs):
        ret = {}

        midi_embedding = self.midi_embed(kwargs['pitch_midi'])
        midi_dur_embedding, slur_embedding = 0, 0
        if kwargs.get('midi_dur') is not None:
            midi_dur_embedding = self.midi_dur_layer(kwargs['midi_dur'][:, :, None])  # [B, T, 1] -> [B, T, H]
        if kwargs.get('is_slur') is not None:
            slur_embedding = self.is_slur_embed(kwargs['is_slur'])
        encoder_out = self.encoder(txt_tokens, midi_embedding, midi_dur_embedding, slur_embedding)  # [B, T, C]
        src_nonpadding = (txt_tokens > 0).float()[:, :, None]

        # add ref style embed
        # Not implemented
        # variance encoder
        var_embed = 0

        # encoder_out_dur denotes encoder outputs for duration predictor
        # in speech adaptation, duration predictor use old speaker embedding
        if hparams['use_spk_embed']:
            spk_embed_dur = spk_embed_f0 = spk_embed = self.spk_embed_proj(spk_embed)[:, None, :]
        elif hparams['use_spk_id']:
            spk_embed_id = spk_embed
            if spk_embed_dur_id is None:
                spk_embed_dur_id = spk_embed_id
            if spk_embed_f0_id is None:
                spk_embed_f0_id = spk_embed_id
            spk_embed = self.spk_embed_proj(spk_embed_id)[:, None, :]
            spk_embed_dur = spk_embed_f0 = spk_embed
            if hparams['use_split_spk_id']:
                spk_embed_dur = self.spk_embed_dur(spk_embed_dur_id)[:, None, :]
                spk_embed_f0 = self.spk_embed_f0(spk_embed_f0_id)[:, None, :]
        else:
            spk_embed_dur = spk_embed_f0 = spk_embed = 0

        # add dur
        dur_inp = (encoder_out + var_embed + spk_embed_dur) * src_nonpadding

        mel2ph = self.add_dur(dur_inp, mel2ph, txt_tokens, ret)

        decoder_inp = F.pad(encoder_out, [0, 0, 1, 0])

        mel2ph_ = mel2ph[..., None].repeat([1, 1, encoder_out.shape[-1]])
        decoder_inp_origin = decoder_inp = torch.gather(decoder_inp, 1, mel2ph_)  # [B, T, H]

        tgt_nonpadding = (mel2ph > 0).float()[:, :, None]

        # add pitch and energy embed
        pitch_inp = (decoder_inp_origin + var_embed + spk_embed_f0) * tgt_nonpadding
        if hparams['use_pitch_embed']:
            pitch_inp_ph = (encoder_out + var_embed + spk_embed_f0) * src_nonpadding
            decoder_inp = decoder_inp + self.add_pitch(pitch_inp, f0, uv, mel2ph, ret, encoder_out=pitch_inp_ph)
        if hparams['use_energy_embed']:
            decoder_inp = decoder_inp + self.add_energy(pitch_inp, energy, ret)

        ret['decoder_inp'] = decoder_inp = (decoder_inp + spk_embed) * tgt_nonpadding

        if skip_decoder:
            return ret
        ret['mel_out'] = self.run_decoder(decoder_inp, tgt_nonpadding, ret, infer=infer, **kwargs)

        return ret
modules/fastspeech/fs2.py
ADDED
@@ -0,0 +1,255 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
from modules.commons.common_layers import *
|
2 |
+
from modules.commons.common_layers import Embedding
|
3 |
+
from modules.fastspeech.tts_modules import FastspeechDecoder, DurationPredictor, LengthRegulator, PitchPredictor, \
|
4 |
+
EnergyPredictor, FastspeechEncoder
|
5 |
+
from utils.cwt import cwt2f0
|
6 |
+
from utils.hparams import hparams
|
7 |
+
from utils.pitch_utils import f0_to_coarse, denorm_f0, norm_f0
|
8 |
+
|
9 |
+
FS_ENCODERS = {
|
10 |
+
'fft': lambda hp, embed_tokens, d: FastspeechEncoder(
|
11 |
+
embed_tokens, hp['hidden_size'], hp['enc_layers'], hp['enc_ffn_kernel_size'],
|
12 |
+
num_heads=hp['num_heads']),
|
13 |
+
}
|
14 |
+
|
15 |
+
FS_DECODERS = {
|
16 |
+
'fft': lambda hp: FastspeechDecoder(
|
17 |
+
hp['hidden_size'], hp['dec_layers'], hp['dec_ffn_kernel_size'], hp['num_heads']),
|
18 |
+
}
|
19 |
+
|
20 |
+
|
21 |
+
class FastSpeech2(nn.Module):
|
22 |
+
def __init__(self, dictionary, out_dims=None):
|
23 |
+
super().__init__()
|
24 |
+
self.dictionary = dictionary
|
25 |
+
self.padding_idx = dictionary.pad()
|
26 |
+
self.enc_layers = hparams['enc_layers']
|
27 |
+
self.dec_layers = hparams['dec_layers']
|
28 |
+
self.hidden_size = hparams['hidden_size']
|
29 |
+
self.encoder_embed_tokens = self.build_embedding(self.dictionary, self.hidden_size)
|
30 |
+
self.encoder = FS_ENCODERS[hparams['encoder_type']](hparams, self.encoder_embed_tokens, self.dictionary)
|
31 |
+
self.decoder = FS_DECODERS[hparams['decoder_type']](hparams)
|
32 |
+
self.out_dims = out_dims
|
33 |
+
if out_dims is None:
|
34 |
+
self.out_dims = hparams['audio_num_mel_bins']
|
35 |
+
self.mel_out = Linear(self.hidden_size, self.out_dims, bias=True)
|
36 |
+
|
37 |
+
if hparams['use_spk_id']:
|
38 |
+
self.spk_embed_proj = Embedding(hparams['num_spk'] + 1, self.hidden_size)
|
39 |
+
if hparams['use_split_spk_id']:
|
40 |
+
self.spk_embed_f0 = Embedding(hparams['num_spk'] + 1, self.hidden_size)
|
41 |
+
self.spk_embed_dur = Embedding(hparams['num_spk'] + 1, self.hidden_size)
|
42 |
+
elif hparams['use_spk_embed']:
|
43 |
+
self.spk_embed_proj = Linear(256, self.hidden_size, bias=True)
|
44 |
+
predictor_hidden = hparams['predictor_hidden'] if hparams['predictor_hidden'] > 0 else self.hidden_size
|
45 |
+
self.dur_predictor = DurationPredictor(
|
46 |
+
self.hidden_size,
|
47 |
+
n_chans=predictor_hidden,
|
48 |
+
n_layers=hparams['dur_predictor_layers'],
|
49 |
+
dropout_rate=hparams['predictor_dropout'], padding=hparams['ffn_padding'],
|
50 |
+
kernel_size=hparams['dur_predictor_kernel'])
|
51 |
+
self.length_regulator = LengthRegulator()
|
52 |
+
if hparams['use_pitch_embed']:
|
53 |
+
self.pitch_embed = Embedding(300, self.hidden_size, self.padding_idx)
|
54 |
+
if hparams['pitch_type'] == 'cwt':
|
55 |
+
h = hparams['cwt_hidden_size']
|
56 |
+
cwt_out_dims = 10
|
57 |
+
if hparams['use_uv']:
|
58 |
+
cwt_out_dims = cwt_out_dims + 1
|
59 |
+
self.cwt_predictor = nn.Sequential(
|
60 |
+
nn.Linear(self.hidden_size, h),
|
61 |
+
PitchPredictor(
|
62 |
+
h,
|
63 |
+
n_chans=predictor_hidden,
|
64 |
+
n_layers=hparams['predictor_layers'],
|
65 |
+
dropout_rate=hparams['predictor_dropout'], odim=cwt_out_dims,
|
66 |
+
padding=hparams['ffn_padding'], kernel_size=hparams['predictor_kernel']))
|
67 |
+
self.cwt_stats_layers = nn.Sequential(
|
68 |
+
nn.Linear(self.hidden_size, h), nn.ReLU(),
|
69 |
+
nn.Linear(h, h), nn.ReLU(), nn.Linear(h, 2)
|
70 |
+
)
|
71 |
+
else:
|
72 |
+
self.pitch_predictor = PitchPredictor(
|
73 |
+
self.hidden_size,
|
74 |
+
n_chans=predictor_hidden,
|
75 |
+
n_layers=hparams['predictor_layers'],
|
76 |
+
dropout_rate=hparams['predictor_dropout'],
|
77 |
+
odim=2 if hparams['pitch_type'] == 'frame' else 1,
|
78 |
+
                padding=hparams['ffn_padding'], kernel_size=hparams['predictor_kernel'])
        if hparams['use_energy_embed']:
            self.energy_embed = Embedding(256, self.hidden_size, self.padding_idx)
            self.energy_predictor = EnergyPredictor(
                self.hidden_size,
                n_chans=predictor_hidden,
                n_layers=hparams['predictor_layers'],
                dropout_rate=hparams['predictor_dropout'], odim=1,
                padding=hparams['ffn_padding'], kernel_size=hparams['predictor_kernel'])

    def build_embedding(self, dictionary, embed_dim):
        num_embeddings = len(dictionary)
        emb = Embedding(num_embeddings, embed_dim, self.padding_idx)
        return emb

    def forward(self, txt_tokens, mel2ph=None, spk_embed=None,
                ref_mels=None, f0=None, uv=None, energy=None, skip_decoder=False,
                spk_embed_dur_id=None, spk_embed_f0_id=None, infer=False, **kwargs):
        ret = {}
        encoder_out = self.encoder(txt_tokens)  # [B, T, C]
        src_nonpadding = (txt_tokens > 0).float()[:, :, None]

        # add ref style embed
        # Not implemented
        # variance encoder
        var_embed = 0

        # encoder_out_dur denotes encoder outputs for the duration predictor
        # in speech adaptation, the duration predictor uses the old speaker embedding
        if hparams['use_spk_embed']:
            spk_embed_dur = spk_embed_f0 = spk_embed = self.spk_embed_proj(spk_embed)[:, None, :]
        elif hparams['use_spk_id']:
            spk_embed_id = spk_embed
            if spk_embed_dur_id is None:
                spk_embed_dur_id = spk_embed_id
            if spk_embed_f0_id is None:
                spk_embed_f0_id = spk_embed_id
            spk_embed = self.spk_embed_proj(spk_embed_id)[:, None, :]
            spk_embed_dur = spk_embed_f0 = spk_embed
            if hparams['use_split_spk_id']:
                spk_embed_dur = self.spk_embed_dur(spk_embed_dur_id)[:, None, :]
                spk_embed_f0 = self.spk_embed_f0(spk_embed_f0_id)[:, None, :]
        else:
            spk_embed_dur = spk_embed_f0 = spk_embed = 0

        # add dur
        dur_inp = (encoder_out + var_embed + spk_embed_dur) * src_nonpadding

        mel2ph = self.add_dur(dur_inp, mel2ph, txt_tokens, ret)

        decoder_inp = F.pad(encoder_out, [0, 0, 1, 0])

        mel2ph_ = mel2ph[..., None].repeat([1, 1, encoder_out.shape[-1]])
        decoder_inp_origin = decoder_inp = torch.gather(decoder_inp, 1, mel2ph_)  # [B, T, H]

        tgt_nonpadding = (mel2ph > 0).float()[:, :, None]

        # add pitch and energy embed
        pitch_inp = (decoder_inp_origin + var_embed + spk_embed_f0) * tgt_nonpadding
        if hparams['use_pitch_embed']:
            pitch_inp_ph = (encoder_out + var_embed + spk_embed_f0) * src_nonpadding
            decoder_inp = decoder_inp + self.add_pitch(pitch_inp, f0, uv, mel2ph, ret, encoder_out=pitch_inp_ph)
        if hparams['use_energy_embed']:
            decoder_inp = decoder_inp + self.add_energy(pitch_inp, energy, ret)

        ret['decoder_inp'] = decoder_inp = (decoder_inp + spk_embed) * tgt_nonpadding

        if skip_decoder:
            return ret
        ret['mel_out'] = self.run_decoder(decoder_inp, tgt_nonpadding, ret, infer=infer, **kwargs)

        return ret

    def add_dur(self, dur_input, mel2ph, txt_tokens, ret):
        """

        :param dur_input: [B, T_txt, H]
        :param mel2ph: [B, T_mel]
        :param txt_tokens: [B, T_txt]
        :param ret:
        :return:
        """
        src_padding = txt_tokens == 0
        dur_input = dur_input.detach() + hparams['predictor_grad'] * (dur_input - dur_input.detach())
        if mel2ph is None:
            dur, xs = self.dur_predictor.inference(dur_input, src_padding)
            ret['dur'] = xs
            ret['dur_choice'] = dur
            mel2ph = self.length_regulator(dur, src_padding).detach()
            # from modules.fastspeech.fake_modules import FakeLengthRegulator
            # fake_lr = FakeLengthRegulator()
            # fake_mel2ph = fake_lr(dur, (1 - src_padding.long()).sum(-1))[..., 0].detach()
            # print(mel2ph == fake_mel2ph)
        else:
            ret['dur'] = self.dur_predictor(dur_input, src_padding)
        ret['mel2ph'] = mel2ph
        return mel2ph
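The detach-mix on dur_input above keeps the forward value unchanged while scaling the gradient that flows from the duration predictor back into the encoder by hparams['predictor_grad']. A minimal standalone sketch of that trick (not part of this commit), with a hypothetical g standing in for predictor_grad:

import torch

g = 0.1  # stands in for hparams['predictor_grad']
x = torch.ones(3, requires_grad=True)
y = x.detach() + g * (x - x.detach())
y.sum().backward()
print(y.detach())  # tensor([1., 1., 1.]) -- same forward value as x
print(x.grad)      # tensor([0.1000, 0.1000, 0.1000]) -- gradient scaled by g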

    def add_energy(self, decoder_inp, energy, ret):
        decoder_inp = decoder_inp.detach() + hparams['predictor_grad'] * (decoder_inp - decoder_inp.detach())
        ret['energy_pred'] = energy_pred = self.energy_predictor(decoder_inp)[:, :, 0]
        if energy is None:
            energy = energy_pred
        energy = torch.clamp(energy * 256 // 4, max=255).long()
        energy_embed = self.energy_embed(energy)
        return energy_embed

    def add_pitch(self, decoder_inp, f0, uv, mel2ph, ret, encoder_out=None):
        if hparams['pitch_type'] == 'ph':
            pitch_pred_inp = encoder_out.detach() + hparams['predictor_grad'] * (encoder_out - encoder_out.detach())
            pitch_padding = encoder_out.sum().abs() == 0
            ret['pitch_pred'] = pitch_pred = self.pitch_predictor(pitch_pred_inp)
            if f0 is None:
                f0 = pitch_pred[:, :, 0]
            ret['f0_denorm'] = f0_denorm = denorm_f0(f0, None, hparams, pitch_padding=pitch_padding)
            pitch = f0_to_coarse(f0_denorm)  # start from 0 [B, T_txt]
            pitch = F.pad(pitch, [1, 0])
            pitch = torch.gather(pitch, 1, mel2ph)  # [B, T_mel]
            pitch_embed = self.pitch_embed(pitch)
            return pitch_embed
        decoder_inp = decoder_inp.detach() + hparams['predictor_grad'] * (decoder_inp - decoder_inp.detach())

        pitch_padding = mel2ph == 0

        if hparams['pitch_type'] == 'cwt':
            pitch_padding = None
            ret['cwt'] = cwt_out = self.cwt_predictor(decoder_inp)
            stats_out = self.cwt_stats_layers(encoder_out[:, 0, :])  # [B, 2]
            mean = ret['f0_mean'] = stats_out[:, 0]
            std = ret['f0_std'] = stats_out[:, 1]
            cwt_spec = cwt_out[:, :, :10]
            if f0 is None:
                std = std * hparams['cwt_std_scale']
                f0 = self.cwt2f0_norm(cwt_spec, mean, std, mel2ph)
                if hparams['use_uv']:
                    assert cwt_out.shape[-1] == 11
                    uv = cwt_out[:, :, -1] > 0
        elif hparams['pitch_ar']:
            ret['pitch_pred'] = pitch_pred = self.pitch_predictor(decoder_inp, f0 if self.training else None)
            if f0 is None:
                f0 = pitch_pred[:, :, 0]
        else:
            ret['pitch_pred'] = pitch_pred = self.pitch_predictor(decoder_inp)
            if f0 is None:
                f0 = pitch_pred[:, :, 0]
            if hparams['use_uv'] and uv is None:
                uv = pitch_pred[:, :, 1] > 0
        ret['f0_denorm'] = f0_denorm = denorm_f0(f0, uv, hparams, pitch_padding=pitch_padding)
        if pitch_padding is not None:
            f0[pitch_padding] = 0

        pitch = f0_to_coarse(f0_denorm)  # start from 0
        pitch_embed = self.pitch_embed(pitch)
        return pitch_embed

    def run_decoder(self, decoder_inp, tgt_nonpadding, ret, infer, **kwargs):
        x = decoder_inp  # [B, T, H]
        x = self.decoder(x)
        x = self.mel_out(x)
        return x * tgt_nonpadding

    def cwt2f0_norm(self, cwt_spec, mean, std, mel2ph):
        f0 = cwt2f0(cwt_spec, mean, std, hparams['cwt_scales'])
        f0 = torch.cat(
            [f0] + [f0[:, -1:]] * (mel2ph.shape[1] - f0.shape[1]), 1)
        f0_norm = norm_f0(f0, None, hparams)
        return f0_norm

    def out2mel(self, out):
        return out

    @staticmethod
    def mel_norm(x):
        return (x + 5.5) / (6.3 / 2) - 1

    @staticmethod
    def mel_denorm(x):
        return (x + 1) * (6.3 / 2) - 5.5
modules/fastspeech/pe.py
ADDED
@@ -0,0 +1,149 @@
from modules.commons.common_layers import *
from utils.hparams import hparams
from modules.fastspeech.tts_modules import PitchPredictor
from utils.pitch_utils import denorm_f0


class Prenet(nn.Module):
    def __init__(self, in_dim=80, out_dim=256, kernel=5, n_layers=3, strides=None):
        super(Prenet, self).__init__()
        padding = kernel // 2
        self.layers = []
        self.strides = strides if strides is not None else [1] * n_layers
        for l in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.Conv1d(in_dim, out_dim, kernel_size=kernel, padding=padding, stride=self.strides[l]),
                nn.ReLU(),
                nn.BatchNorm1d(out_dim)
            ))
            in_dim = out_dim
        self.layers = nn.ModuleList(self.layers)
        self.out_proj = nn.Linear(out_dim, out_dim)

    def forward(self, x):
        """

        :param x: [B, T, 80]
        :return: [L, B, T, H], [B, T, H]
        """
        padding_mask = x.abs().sum(-1).eq(0).data  # [B, T]
        nonpadding_mask_TB = 1 - padding_mask.float()[:, None, :]  # [B, 1, T]
        x = x.transpose(1, 2)
        hiddens = []
        for i, l in enumerate(self.layers):
            nonpadding_mask_TB = nonpadding_mask_TB[:, :, ::self.strides[i]]
            x = l(x) * nonpadding_mask_TB
            hiddens.append(x)
        hiddens = torch.stack(hiddens, 0)  # [L, B, H, T]
        hiddens = hiddens.transpose(2, 3)  # [L, B, T, H]
        x = self.out_proj(x.transpose(1, 2))  # [B, T, H]
        x = x * nonpadding_mask_TB.transpose(1, 2)
        return hiddens, x


class ConvBlock(nn.Module):
    def __init__(self, idim=80, n_chans=256, kernel_size=3, stride=1, norm='gn', dropout=0):
        super().__init__()
        self.conv = ConvNorm(idim, n_chans, kernel_size, stride=stride)
        self.norm = norm
        if self.norm == 'bn':
            self.norm = nn.BatchNorm1d(n_chans)
        elif self.norm == 'in':
            self.norm = nn.InstanceNorm1d(n_chans, affine=True)
        elif self.norm == 'gn':
            self.norm = nn.GroupNorm(n_chans // 16, n_chans)
        elif self.norm == 'ln':
            self.norm = LayerNorm(n_chans // 16, n_chans)
        elif self.norm == 'wn':
            self.conv = torch.nn.utils.weight_norm(self.conv.conv)
        self.dropout = nn.Dropout(dropout)
        self.relu = nn.ReLU()

    def forward(self, x):
        """

        :param x: [B, C, T]
        :return: [B, C, T]
        """
        x = self.conv(x)
        if not isinstance(self.norm, str):
            if self.norm == 'none':
                pass
            elif self.norm == 'ln':
                x = self.norm(x.transpose(1, 2)).transpose(1, 2)
            else:
                x = self.norm(x)
        x = self.relu(x)
        x = self.dropout(x)
        return x


class ConvStacks(nn.Module):
    def __init__(self, idim=80, n_layers=5, n_chans=256, odim=32, kernel_size=5, norm='gn',
                 dropout=0, strides=None, res=True):
        super().__init__()
        self.conv = torch.nn.ModuleList()
        self.kernel_size = kernel_size
        self.res = res
        self.in_proj = Linear(idim, n_chans)
        if strides is None:
            strides = [1] * n_layers
        else:
            assert len(strides) == n_layers
        for idx in range(n_layers):
            self.conv.append(ConvBlock(
                n_chans, n_chans, kernel_size, stride=strides[idx], norm=norm, dropout=dropout))
        self.out_proj = Linear(n_chans, odim)

    def forward(self, x, return_hiddens=False):
        """

        :param x: [B, T, H]
        :return: [B, T, H]
        """
        x = self.in_proj(x)
        x = x.transpose(1, -1)  # (B, idim, Tmax)
        hiddens = []
        for f in self.conv:
            x_ = f(x)
            x = x + x_ if self.res else x_  # (B, C, Tmax)
            hiddens.append(x)
        x = x.transpose(1, -1)
        x = self.out_proj(x)  # (B, Tmax, H)
        if return_hiddens:
            hiddens = torch.stack(hiddens, 1)  # [B, L, C, T]
            return x, hiddens
        return x


class PitchExtractor(nn.Module):
    def __init__(self, n_mel_bins=80, conv_layers=2):
        super().__init__()
        self.hidden_size = hparams['hidden_size']
        self.predictor_hidden = hparams['predictor_hidden'] if hparams['predictor_hidden'] > 0 else self.hidden_size
        self.conv_layers = conv_layers

        self.mel_prenet = Prenet(n_mel_bins, self.hidden_size, strides=[1, 1, 1])
        if self.conv_layers > 0:
            self.mel_encoder = ConvStacks(
                idim=self.hidden_size, n_chans=self.hidden_size, odim=self.hidden_size, n_layers=self.conv_layers)
        self.pitch_predictor = PitchPredictor(
            self.hidden_size, n_chans=self.predictor_hidden,
            n_layers=5, dropout_rate=0.1, odim=2,
            padding=hparams['ffn_padding'], kernel_size=hparams['predictor_kernel'])

    def forward(self, mel_input=None):
        ret = {}
        mel_hidden = self.mel_prenet(mel_input)[1]
        if self.conv_layers > 0:
            mel_hidden = self.mel_encoder(mel_hidden)

        ret['pitch_pred'] = pitch_pred = self.pitch_predictor(mel_hidden)

        pitch_padding = mel_input.abs().sum(-1) == 0
        use_uv = hparams['pitch_type'] == 'frame' and hparams['use_uv']

        ret['f0_denorm_pred'] = denorm_f0(
            pitch_pred[:, :, 0], (pitch_pred[:, :, 1] > 0) if use_uv else None,
            hparams, pitch_padding=pitch_padding)
        return ret
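A minimal usage sketch for PitchExtractor (not part of this commit). It assumes hparams has already been populated, e.g. via utils.hparams.set_hparams with one of the configs under configs/, so that hidden_size, predictor_hidden, ffn_padding, predictor_kernel, pitch_type, use_uv and the f0 statistics consumed by denorm_f0 are all defined:

import torch
from modules.fastspeech.pe import PitchExtractor

pe = PitchExtractor(n_mel_bins=80, conv_layers=2).eval()
mels = torch.randn(2, 400, 80)      # [B, T_mel, 80]; all-zero frames count as padding
with torch.no_grad():
    out = pe(mels)
print(out['pitch_pred'].shape)      # [2, 400, 2]: normalized f0 plus a uv logit
print(out['f0_denorm_pred'].shape)  # [2, 400]: f0 in Hz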
modules/fastspeech/tts_modules.py
ADDED
@@ -0,0 +1,357 @@
import logging
import math

import torch
import torch.nn as nn
from torch.nn import functional as F

from modules.commons.espnet_positional_embedding import RelPositionalEncoding
from modules.commons.common_layers import SinusoidalPositionalEmbedding, Linear, EncSALayer, DecSALayer, BatchNorm1dTBC
from utils.hparams import hparams

DEFAULT_MAX_SOURCE_POSITIONS = 2000
DEFAULT_MAX_TARGET_POSITIONS = 2000


class TransformerEncoderLayer(nn.Module):
    def __init__(self, hidden_size, dropout, kernel_size=None, num_heads=2, norm='ln'):
        super().__init__()
        self.hidden_size = hidden_size
        self.dropout = dropout
        self.num_heads = num_heads
        self.op = EncSALayer(
            hidden_size, num_heads, dropout=dropout,
            attention_dropout=0.0, relu_dropout=dropout,
            kernel_size=kernel_size
            if kernel_size is not None else hparams['enc_ffn_kernel_size'],
            padding=hparams['ffn_padding'],
            norm=norm, act=hparams['ffn_act'])

    def forward(self, x, **kwargs):
        return self.op(x, **kwargs)


######################
# fastspeech modules
######################
class LayerNorm(torch.nn.LayerNorm):
    """Layer normalization module.
    :param int nout: output dim size
    :param int dim: dimension to be normalized
    """

    def __init__(self, nout, dim=-1):
        """Construct a LayerNorm object."""
        super(LayerNorm, self).__init__(nout, eps=1e-12)
        self.dim = dim

    def forward(self, x):
        """Apply layer normalization.
        :param torch.Tensor x: input tensor
        :return: layer normalized tensor
        :rtype torch.Tensor
        """
        if self.dim == -1:
            return super(LayerNorm, self).forward(x)
        return super(LayerNorm, self).forward(x.transpose(1, -1)).transpose(1, -1)


class DurationPredictor(torch.nn.Module):
    """Duration predictor module.
    This is the duration predictor described in `FastSpeech: Fast, Robust and Controllable Text to Speech`_.
    It predicts the duration of each token in log domain from the hidden embeddings of the encoder.
    .. _`FastSpeech: Fast, Robust and Controllable Text to Speech`:
        https://arxiv.org/pdf/1905.09263.pdf
    Note:
        The calculation domain of the outputs differs between `forward` and `inference`. In `forward`,
        the outputs are calculated in log domain, but in `inference` they are calculated in linear domain.
    """

    def __init__(self, idim, n_layers=2, n_chans=384, kernel_size=3, dropout_rate=0.1, offset=1.0, padding='SAME'):
        """Initialize duration predictor module.
        Args:
            idim (int): Input dimension.
            n_layers (int, optional): Number of convolutional layers.
            n_chans (int, optional): Number of channels of convolutional layers.
            kernel_size (int, optional): Kernel size of convolutional layers.
            dropout_rate (float, optional): Dropout rate.
            offset (float, optional): Offset value to avoid nan in log domain.
        """
        super(DurationPredictor, self).__init__()
        self.offset = offset
        self.conv = torch.nn.ModuleList()
        self.kernel_size = kernel_size
        self.padding = padding
        for idx in range(n_layers):
            in_chans = idim if idx == 0 else n_chans
            self.conv += [torch.nn.Sequential(
                torch.nn.ConstantPad1d(((kernel_size - 1) // 2, (kernel_size - 1) // 2)
                                       if padding == 'SAME'
                                       else (kernel_size - 1, 0), 0),
                torch.nn.Conv1d(in_chans, n_chans, kernel_size, stride=1, padding=0),
                torch.nn.ReLU(),
                LayerNorm(n_chans, dim=1),
                torch.nn.Dropout(dropout_rate)
            )]
        if hparams['dur_loss'] in ['mse', 'huber']:
            odims = 1
        elif hparams['dur_loss'] == 'mog':
            odims = 15
        elif hparams['dur_loss'] == 'crf':
            odims = 32
            from torchcrf import CRF
            self.crf = CRF(odims, batch_first=True)
        self.linear = torch.nn.Linear(n_chans, odims)

    def _forward(self, xs, x_masks=None, is_inference=False):
        xs = xs.transpose(1, -1)  # (B, idim, Tmax)
        for f in self.conv:
            xs = f(xs)  # (B, C, Tmax)
            if x_masks is not None:
                xs = xs * (1 - x_masks.float())[:, None, :]

        xs = self.linear(xs.transpose(1, -1))  # [B, T, C]
        xs = xs * (1 - x_masks.float())[:, :, None]  # (B, T, C)
        if is_inference:
            return self.out2dur(xs), xs
        else:
            if hparams['dur_loss'] in ['mse']:
                xs = xs.squeeze(-1)  # (B, Tmax)
        return xs

    def out2dur(self, xs):
        if hparams['dur_loss'] in ['mse']:
            # NOTE: calculate in log domain
            xs = xs.squeeze(-1)  # (B, Tmax)
            dur = torch.clamp(torch.round(xs.exp() - self.offset), min=0).long()  # avoid negative value
        elif hparams['dur_loss'] == 'mog':
            return NotImplementedError
        elif hparams['dur_loss'] == 'crf':
            dur = torch.LongTensor(self.crf.decode(xs)).cuda()
        return dur

    def forward(self, xs, x_masks=None):
        """Calculate forward propagation.
        Args:
            xs (Tensor): Batch of input sequences (B, Tmax, idim).
            x_masks (ByteTensor, optional): Batch of masks indicating padded part (B, Tmax).
        Returns:
            Tensor: Batch of predicted durations in log domain (B, Tmax).
        """
        return self._forward(xs, x_masks, False)

    def inference(self, xs, x_masks=None):
        """Inference duration.
        Args:
            xs (Tensor): Batch of input sequences (B, Tmax, idim).
            x_masks (ByteTensor, optional): Batch of masks indicating padded part (B, Tmax).
        Returns:
            LongTensor: Batch of predicted durations in linear domain (B, Tmax).
        """
        return self._forward(xs, x_masks, True)
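For the 'mse' duration loss, out2dur decodes log-domain predictions by exponentiating, subtracting offset, rounding, and clamping at zero so durations can never go negative. A standalone sketch of that arithmetic (not part of this commit):

import torch

offset = 1.0
log_dur_pred = torch.tensor([[0.1, 1.2, -3.0, 2.0]])  # (B, Tmax), log domain
dur = torch.clamp(torch.round(log_dur_pred.exp() - offset), min=0).long()
print(dur)  # tensor([[0, 2, 0, 6]])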


class LengthRegulator(torch.nn.Module):
    def __init__(self, pad_value=0.0):
        super(LengthRegulator, self).__init__()
        self.pad_value = pad_value

    def forward(self, dur, dur_padding=None, alpha=1.0):
        """
        Example (no batch dim version):
            1. dur = [2,2,3]
            2. token_idx = [[1],[2],[3]], dur_cumsum = [2,4,7], dur_cumsum_prev = [0,2,4]
            3. token_mask = [[1,1,0,0,0,0,0],
                             [0,0,1,1,0,0,0],
                             [0,0,0,0,1,1,1]]
            4. token_idx * token_mask = [[1,1,0,0,0,0,0],
                                         [0,0,2,2,0,0,0],
                                         [0,0,0,0,3,3,3]]
            5. (token_idx * token_mask).sum(0) = [1,1,2,2,3,3,3]

        :param dur: Batch of durations of each frame (B, T_txt)
        :param dur_padding: Batch of padding of each frame (B, T_txt)
        :param alpha: duration rescale coefficient
        :return:
            mel2ph (B, T_speech)
        """
        assert alpha > 0
        dur = torch.round(dur.float() * alpha).long()
        if dur_padding is not None:
            dur = dur * (1 - dur_padding.long())
        token_idx = torch.arange(1, dur.shape[1] + 1)[None, :, None].to(dur.device)
        dur_cumsum = torch.cumsum(dur, 1)
        dur_cumsum_prev = F.pad(dur_cumsum, [1, -1], mode='constant', value=0)

        pos_idx = torch.arange(dur.sum(-1).max())[None, None].to(dur.device)
        token_mask = (pos_idx >= dur_cumsum_prev[:, :, None]) & (pos_idx < dur_cumsum[:, :, None])
        mel2ph = (token_idx * token_mask.long()).sum(1)
        return mel2ph
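A standalone run of the docstring example above (not part of this commit), assuming the repository root is on PYTHONPATH:

import torch
from modules.fastspeech.tts_modules import LengthRegulator

lr = LengthRegulator()
dur = torch.tensor([[2, 2, 3]])  # (B=1, T_txt=3)
print(lr(dur))                   # tensor([[1, 1, 2, 2, 3, 3, 3]]), i.e. mel2ph over 7 frames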


class PitchPredictor(torch.nn.Module):
    def __init__(self, idim, n_layers=5, n_chans=384, odim=2, kernel_size=5,
                 dropout_rate=0.1, padding='SAME'):
        """Initialize pitch predictor module.
        Args:
            idim (int): Input dimension.
            n_layers (int, optional): Number of convolutional layers.
            n_chans (int, optional): Number of channels of convolutional layers.
            kernel_size (int, optional): Kernel size of convolutional layers.
            dropout_rate (float, optional): Dropout rate.
        """
        super(PitchPredictor, self).__init__()
        self.conv = torch.nn.ModuleList()
        self.kernel_size = kernel_size
        self.padding = padding
        for idx in range(n_layers):
            in_chans = idim if idx == 0 else n_chans
            self.conv += [torch.nn.Sequential(
                torch.nn.ConstantPad1d(((kernel_size - 1) // 2, (kernel_size - 1) // 2)
                                       if padding == 'SAME'
                                       else (kernel_size - 1, 0), 0),
                torch.nn.Conv1d(in_chans, n_chans, kernel_size, stride=1, padding=0),
                torch.nn.ReLU(),
                LayerNorm(n_chans, dim=1),
                torch.nn.Dropout(dropout_rate)
            )]
        self.linear = torch.nn.Linear(n_chans, odim)
        self.embed_positions = SinusoidalPositionalEmbedding(idim, 0, init_size=4096)
        self.pos_embed_alpha = nn.Parameter(torch.Tensor([1]))

    def forward(self, xs):
        """

        :param xs: [B, T, H]
        :return: [B, T, H]
        """
        positions = self.pos_embed_alpha * self.embed_positions(xs[..., 0])
        xs = xs + positions
        xs = xs.transpose(1, -1)  # (B, idim, Tmax)
        for f in self.conv:
            xs = f(xs)  # (B, C, Tmax)
        # NOTE: calculate in log domain
        xs = self.linear(xs.transpose(1, -1))  # (B, Tmax, H)
        return xs


class EnergyPredictor(PitchPredictor):
    pass


def mel2ph_to_dur(mel2ph, T_txt, max_dur=None):
    B, _ = mel2ph.shape
    dur = mel2ph.new_zeros(B, T_txt + 1).scatter_add(1, mel2ph, torch.ones_like(mel2ph))
    dur = dur[:, 1:]
    if max_dur is not None:
        dur = dur.clamp(max=max_dur)
    return dur
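mel2ph_to_dur inverts the LengthRegulator expansion by counting how many mel frames map to each (1-based) token index. A standalone round-trip check (not part of this commit):

import torch
from modules.fastspeech.tts_modules import mel2ph_to_dur

mel2ph = torch.tensor([[1, 1, 2, 2, 3, 3, 3]])  # one token index per mel frame
print(mel2ph_to_dur(mel2ph, T_txt=3))           # tensor([[2, 2, 3]])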


class FFTBlocks(nn.Module):
    def __init__(self, hidden_size, num_layers, ffn_kernel_size=9, dropout=None, num_heads=2,
                 use_pos_embed=True, use_last_norm=True, norm='ln', use_pos_embed_alpha=True):
        super().__init__()
        self.num_layers = num_layers
        embed_dim = self.hidden_size = hidden_size
        self.dropout = dropout if dropout is not None else hparams['dropout']
        self.use_pos_embed = use_pos_embed
        self.use_last_norm = use_last_norm
        if use_pos_embed:
            self.max_source_positions = DEFAULT_MAX_TARGET_POSITIONS
            self.padding_idx = 0
            self.pos_embed_alpha = nn.Parameter(torch.Tensor([1])) if use_pos_embed_alpha else 1
            self.embed_positions = SinusoidalPositionalEmbedding(
                embed_dim, self.padding_idx, init_size=DEFAULT_MAX_TARGET_POSITIONS,
            )

        self.layers = nn.ModuleList([])
        self.layers.extend([
            TransformerEncoderLayer(self.hidden_size, self.dropout,
                                    kernel_size=ffn_kernel_size, num_heads=num_heads)
            for _ in range(self.num_layers)
        ])
        if self.use_last_norm:
            if norm == 'ln':
                self.layer_norm = nn.LayerNorm(embed_dim)
            elif norm == 'bn':
                self.layer_norm = BatchNorm1dTBC(embed_dim)
        else:
            self.layer_norm = None

    def forward(self, x, padding_mask=None, attn_mask=None, return_hiddens=False):
        """
        :param x: [B, T, C]
        :param padding_mask: [B, T]
        :return: [B, T, C] or [L, B, T, C]
        """
        padding_mask = x.abs().sum(-1).eq(0).data if padding_mask is None else padding_mask
        nonpadding_mask_TB = 1 - padding_mask.transpose(0, 1).float()[:, :, None]  # [T, B, 1]
        if self.use_pos_embed:
            positions = self.pos_embed_alpha * self.embed_positions(x[..., 0])
            x = x + positions
            x = F.dropout(x, p=self.dropout, training=self.training)
        # B x T x C -> T x B x C
        x = x.transpose(0, 1) * nonpadding_mask_TB
        hiddens = []
        for layer in self.layers:
            x = layer(x, encoder_padding_mask=padding_mask, attn_mask=attn_mask) * nonpadding_mask_TB
            hiddens.append(x)
        if self.use_last_norm:
            x = self.layer_norm(x) * nonpadding_mask_TB
        if return_hiddens:
            x = torch.stack(hiddens, 0)  # [L, T, B, C]
            x = x.transpose(1, 2)  # [L, B, T, C]
        else:
            x = x.transpose(0, 1)  # [B, T, C]
        return x
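A minimal usage sketch for FFTBlocks (not part of this commit). It assumes hparams already defines ffn_padding and ffn_act (used inside EncSALayer), plus dropout if dropout=None is passed; frames that are all zeros are treated as padding:

import torch
from modules.fastspeech.tts_modules import FFTBlocks

blocks = FFTBlocks(hidden_size=256, num_layers=4, ffn_kernel_size=9, dropout=0.1)
x = torch.randn(2, 100, 256)        # [B, T, C]
x[:, 80:] = 0                       # trailing frames become padding
y = blocks(x)                       # [B, T, C]
h = blocks(x, return_hiddens=True)  # [L, B, T, C], one entry per layer
print(y.shape, h.shape)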


class FastspeechEncoder(FFTBlocks):
    def __init__(self, embed_tokens, hidden_size=None, num_layers=None, kernel_size=None, num_heads=2):
        hidden_size = hparams['hidden_size'] if hidden_size is None else hidden_size
        kernel_size = hparams['enc_ffn_kernel_size'] if kernel_size is None else kernel_size
        num_layers = hparams['dec_layers'] if num_layers is None else num_layers
        super().__init__(hidden_size, num_layers, kernel_size, num_heads=num_heads,
                         use_pos_embed=False)  # use_pos_embed_alpha for compatibility
        self.embed_tokens = embed_tokens
        self.embed_scale = math.sqrt(hidden_size)
        self.padding_idx = 0
        if hparams.get('rel_pos') is not None and hparams['rel_pos']:
            self.embed_positions = RelPositionalEncoding(hidden_size, dropout_rate=0.0)
        else:
            self.embed_positions = SinusoidalPositionalEmbedding(
                hidden_size, self.padding_idx, init_size=DEFAULT_MAX_TARGET_POSITIONS,
            )

    def forward(self, txt_tokens):
        """

        :param txt_tokens: [B, T]
        :return: {
            'encoder_out': [T x B x C]
        }
        """
        encoder_padding_mask = txt_tokens.eq(self.padding_idx).data
        x = self.forward_embedding(txt_tokens)  # [B, T, H]
        x = super(FastspeechEncoder, self).forward(x, encoder_padding_mask)
        return x

    def forward_embedding(self, txt_tokens):
        # embed tokens and positions
        x = self.embed_scale * self.embed_tokens(txt_tokens)
        if hparams['use_pos_embed']:
            positions = self.embed_positions(txt_tokens)
            x = x + positions
        x = F.dropout(x, p=self.dropout, training=self.training)
        return x


class FastspeechDecoder(FFTBlocks):
    def __init__(self, hidden_size=None, num_layers=None, kernel_size=None, num_heads=None):
        num_heads = hparams['num_heads'] if num_heads is None else num_heads
        hidden_size = hparams['hidden_size'] if hidden_size is None else hidden_size
        kernel_size = hparams['dec_ffn_kernel_size'] if kernel_size is None else kernel_size
        num_layers = hparams['dec_layers'] if num_layers is None else num_layers
        super().__init__(hidden_size, num_layers, kernel_size, num_heads=num_heads)
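A minimal end-to-end sketch of the encoder/decoder pair (not part of this commit), assuming hparams has been loaded from one of the configs under configs/ (it must define hidden_size, dropout, num_heads, enc_ffn_kernel_size, dec_ffn_kernel_size, dec_layers, use_pos_embed, ffn_padding and ffn_act); the phoneme-set size below is hypothetical:

import torch
import torch.nn as nn
from modules.fastspeech.tts_modules import FastspeechEncoder, FastspeechDecoder
from utils.hparams import hparams

n_phonemes = 80  # hypothetical; in practice taken from the binarized phone_set.json
embed = nn.Embedding(n_phonemes, hparams['hidden_size'], padding_idx=0)
encoder = FastspeechEncoder(embed)
decoder = FastspeechDecoder()

txt_tokens = torch.randint(1, n_phonemes, (2, 30))  # [B, T_txt]; index 0 is padding
enc_out = encoder(txt_tokens)                       # [B, T_txt, H]
mel_hidden = decoder(enc_out)                       # same shape here; the real model expands
                                                    # enc_out with mel2ph before decoding
print(enc_out.shape, mel_hidden.shape)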