jerryhai committed on
Commit
90f7c1e
1 Parent(s): caf7184

Track binary files with Git LFS

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. .ipynb_checkpoints/requirements-checkpoint.txt +17 -0
  2. .ipynb_checkpoints/score_based_apc-checkpoint.py +159 -0
  3. .ipynb_checkpoints/template_based_apc-checkpoint.py +89 -0
  4. README.md +86 -3
  5. examples/off-key.wav +0 -0
  6. examples/reference.wav +0 -0
  7. examples/score_midi.midi +0 -0
  8. examples/score_midi.npy +3 -0
  9. examples/score_vocal.wav +0 -0
  10. output_score.wav +0 -0
  11. output_template.wav +0 -0
  12. pitch_controller/README.md +1 -0
  13. pitch_controller/__pycache__/utils.cpython-310.pyc +0 -0
  14. pitch_controller/config/DiffWorld_24k.yaml +24 -0
  15. pitch_controller/data/example/f0/p225_001.wav.npy +3 -0
  16. pitch_controller/data/example/mel/p225_001.wav.npy +3 -0
  17. pitch_controller/data/example/wav/p225_001.wav +0 -0
  18. pitch_controller/data/example/world/p225_001.wav.npy +3 -0
  19. pitch_controller/data/prepare_f0.py +66 -0
  20. pitch_controller/data/prepare_mel.py +72 -0
  21. pitch_controller/data/prepare_world.py +85 -0
  22. pitch_controller/dataset/__init__.py +1 -0
  23. pitch_controller/dataset/__pycache__/__init__.cpython-310.pyc +0 -0
  24. pitch_controller/dataset/__pycache__/__init__.cpython-39.pyc +0 -0
  25. pitch_controller/dataset/__pycache__/content_enc.cpython-310.pyc +0 -0
  26. pitch_controller/dataset/__pycache__/content_enc.cpython-39.pyc +0 -0
  27. pitch_controller/dataset/__pycache__/diff.cpython-310.pyc +0 -0
  28. pitch_controller/dataset/__pycache__/diff.cpython-39.pyc +0 -0
  29. pitch_controller/dataset/__pycache__/diff_lpc.cpython-310.pyc +0 -0
  30. pitch_controller/dataset/diff_lpc.py +271 -0
  31. pitch_controller/dataset/diff_lpc_content.py +231 -0
  32. pitch_controller/load_vocoder.py +51 -0
  33. pitch_controller/models/__pycache__/base.cpython-310.pyc +0 -0
  34. pitch_controller/models/__pycache__/base.cpython-39.pyc +0 -0
  35. pitch_controller/models/__pycache__/modules.cpython-310.pyc +0 -0
  36. pitch_controller/models/__pycache__/modules.cpython-39.pyc +0 -0
  37. pitch_controller/models/__pycache__/pitch.cpython-39.pyc +0 -0
  38. pitch_controller/models/__pycache__/unet.cpython-310.pyc +0 -0
  39. pitch_controller/models/__pycache__/unet.cpython-39.pyc +0 -0
  40. pitch_controller/models/__pycache__/update_unet.cpython-310.pyc +0 -0
  41. pitch_controller/models/__pycache__/utils.cpython-310.pyc +0 -0
  42. pitch_controller/models/__pycache__/utils.cpython-39.pyc +0 -0
  43. pitch_controller/models/base.py +30 -0
  44. pitch_controller/models/modules.py +237 -0
  45. pitch_controller/models/unet.py +153 -0
  46. pitch_controller/models/utils.py +110 -0
  47. pitch_controller/modules/BigVGAN/LICENSE +21 -0
  48. pitch_controller/modules/BigVGAN/README.md +95 -0
  49. pitch_controller/modules/BigVGAN/__pycache__/env.cpython-310.pyc +0 -0
  50. pitch_controller/modules/BigVGAN/__pycache__/inference.cpython-310.pyc +0 -0
.ipynb_checkpoints/requirements-checkpoint.txt ADDED
@@ -0,0 +1,17 @@
+ diffusers
+ einops
+ fastdtw
+ librosa
+ matplotlib
+ music21
+ numpy
+ pandas
+ pretty_midi
+ pysptk
+ pyworld
+ scipy
+ soundfile
+ tgt
+ torch
+ torchaudio
+ tqdm
.ipynb_checkpoints/score_based_apc-checkpoint.py ADDED
@@ -0,0 +1,159 @@
+ import os.path
+
+ import numpy as np
+ import pandas as pd
+ import torch
+ import yaml
+ import librosa
+ import soundfile as sf
+ from tqdm import tqdm
+
+ from diffusers import DDIMScheduler
+ from pitch_controller.models.unet import UNetPitcher
+ from pitch_controller.utils import minmax_norm_diff, reverse_minmax_norm_diff
+ from pitch_controller.modules.BigVGAN.inference import load_model
+ from utils import get_mel, get_world_mel, get_f0, f0_to_coarse, show_plot, get_matched_f0, log_f0
+ from pitch_predictor.models.transformer import PitchFormer
+ import pretty_midi
+
+
+ def prepare_midi_wav(wav_id, midi_id, sr=24000):
+     midi = pretty_midi.PrettyMIDI(midi_id)
+     roll = midi.get_piano_roll()
+     roll = np.pad(roll, ((0, 0), (0, 1000)), constant_values=0)
+     roll[roll > 0] = 100
+
+     onset = midi.get_onsets()
+     before_onset = list(np.round(onset * 100 - 1).astype(int))
+     roll[:, before_onset] = 0
+
+     wav, sr = librosa.load(wav_id, sr=sr)
+
+     start = 0
+     end = round(100 * len(wav) / sr) / 100
+     # save audio
+     wav_seg = wav[round(start * sr):round(end * sr)]
+     cur_roll = roll[:, round(100 * start):round(100 * end)]
+     return wav_seg, cur_roll
+
+
+ def algin_mapping(content, target_len):
+     # align content with mel
+     src_len = content.shape[-1]
+     target = torch.zeros([content.shape[0], target_len], dtype=torch.float).to(content.device)
+     temp = torch.arange(src_len+1) * target_len / src_len
+
+     for i in range(target_len):
+         cur_idx = torch.argmin(torch.abs(temp-i))
+         target[:, i] = content[:, cur_idx]
+     return target
+
+
+ def midi_to_hz(midi):
+     idx = torch.zeros(midi.shape[-1])
+     for frame in range(midi.shape[-1]):
+         midi_frame = midi[:, frame]
+         non_zero = midi_frame.nonzero()
+         if len(non_zero) != 0:
+             hz = librosa.midi_to_hz(non_zero[0])
+             idx[frame] = torch.tensor(hz)
+     return idx
+
+
+ @torch.no_grad()
+ def score_pitcher(source, pitch_ref, model, hifigan, pitcher, steps=50, shift_semi=0, mask_with_source=False):
+     wav, midi = prepare_midi_wav(source, pitch_ref, sr=sr)
+
+     source_mel = get_world_mel(None, sr=sr, wav=wav)
+
+     midi = torch.tensor(midi, dtype=torch.float32)
+     midi = algin_mapping(midi, source_mel.shape[-1])
+     midi = midi_to_hz(midi)
+
+     f0_ori = np.nan_to_num(get_f0(source))
+
+     source_mel = torch.from_numpy(source_mel).float().unsqueeze(0).to(device)
+     f0_ori = torch.from_numpy(f0_ori).float().unsqueeze(0).to(device)
+     midi = midi.unsqueeze(0).to(device)
+
+     f0_pred = pitcher(sp=source_mel, midi=midi)
+     if mask_with_source:
+         # mask unvoiced frames based on original pitch estimation
+         f0_pred[f0_ori == 0] = 0
+     f0_pred = f0_pred.cpu().numpy()[0]
+     # limit range
+     f0_pred[f0_pred < librosa.note_to_hz('C2')] = 0
+     f0_pred[f0_pred > librosa.note_to_hz('C6')] = librosa.note_to_hz('C6')
+
+     f0_pred = f0_pred * (2 ** (shift_semi / 12))
+
+     f0_pred = log_f0(f0_pred, {'f0_bin': 345,
+                                'f0_min': librosa.note_to_hz('C2'),
+                                'f0_max': librosa.note_to_hz('C#6')})
+     f0_pred = torch.from_numpy(f0_pred).float().unsqueeze(0).to(device)
+
+     noise_scheduler = DDIMScheduler(num_train_timesteps=1000)
+     generator = torch.Generator(device=device).manual_seed(2024)
+
+     noise_scheduler.set_timesteps(steps)
+     noise = torch.randn(source_mel.shape, generator=generator, device=device)
+     pred = noise
+     source_x = minmax_norm_diff(source_mel, vmax=max_mel, vmin=min_mel)
+
+     for t in tqdm(noise_scheduler.timesteps):
+         pred = noise_scheduler.scale_model_input(pred, t)
+         model_output = model(x=pred, mean=source_x, f0=f0_pred, t=t, ref=None, embed=None)
+         pred = noise_scheduler.step(model_output=model_output,
+                                     timestep=t,
+                                     sample=pred,
+                                     eta=1, generator=generator).prev_sample
+
+     pred = reverse_minmax_norm_diff(pred, vmax=max_mel, vmin=min_mel)
+
+     pred_audio = hifigan(pred)
+     pred_audio = pred_audio.cpu().squeeze().clamp(-1, 1)
+
+     return pred_audio
+
+
+ if __name__ == '__main__':
+     min_mel = np.log(1e-5)
+     max_mel = 2.5
+     sr = 24000
+
+     use_gpu = torch.cuda.is_available()
+     device = 'cuda' if use_gpu else 'cpu'
+
+     # load diffusion model
+     config = yaml.load(open('pitch_controller/config/DiffWorld_24k.yaml'), Loader=yaml.FullLoader)
+     mel_cfg = config['logmel']
+     ddpm_cfg = config['ddpm']
+     unet_cfg = config['unet']
+     model = UNetPitcher(**unet_cfg)
+     unet_path = 'ckpts/world_fixed_40.pt'
+
+     state_dict = torch.load(unet_path)
+     for key in list(state_dict.keys()):
+         state_dict[key.replace('_orig_mod.', '')] = state_dict.pop(key)
+     model.load_state_dict(state_dict)
+     if use_gpu:
+         model.cuda()
+     model.eval()
+
+     # load vocoder
+     hifi_path = 'ckpts/bigvgan_24khz_100band/g_05000000.pt'
+     hifigan, cfg = load_model(hifi_path, device=device)
+     hifigan.eval()
+
+     # load pitch predictor
+     pitcher = PitchFormer(100, 512).to(device)
+     ckpt = torch.load('ckpts/ckpt_transformer_pitch/transformer_pitch_360.pt')
+     pitcher.load_state_dict(ckpt)
+     pitcher.eval()
+
+     pred_audio = score_pitcher('examples/score_vocal.wav', 'examples/score_midi.midi', model, hifigan, pitcher, steps=50)
+     sf.write('output_score.wav', pred_audio, samplerate=sr)
.ipynb_checkpoints/template_based_apc-checkpoint.py ADDED
@@ -0,0 +1,89 @@
+ import os.path
+
+ import numpy as np
+ import pandas as pd
+ import torch
+ import yaml
+ import librosa
+ import soundfile as sf
+ from tqdm import tqdm
+
+ from diffusers import DDIMScheduler
+ from pitch_controller.models.unet import UNetPitcher
+ from pitch_controller.utils import minmax_norm_diff, reverse_minmax_norm_diff
+ from pitch_controller.modules.BigVGAN.inference import load_model
+ from utils import get_mel, get_world_mel, get_f0, f0_to_coarse, show_plot, get_matched_f0, log_f0
+
+
+ @torch.no_grad()
+ def template_pitcher(source, pitch_ref, model, hifigan, steps=50, shift_semi=0):
+
+     source_mel = get_world_mel(source, sr=sr)
+
+     f0_ref = get_matched_f0(source, pitch_ref, 'world')
+     f0_ref = f0_ref * 2 ** (shift_semi / 12)
+
+     f0_ref = log_f0(f0_ref, {'f0_bin': 345,
+                              'f0_min': librosa.note_to_hz('C2'),
+                              'f0_max': librosa.note_to_hz('C#6')})
+
+     source_mel = torch.from_numpy(source_mel).float().unsqueeze(0).to(device)
+     f0_ref = torch.from_numpy(f0_ref).float().unsqueeze(0).to(device)
+
+     noise_scheduler = DDIMScheduler(num_train_timesteps=1000)
+     generator = torch.Generator(device=device).manual_seed(2024)
+
+     noise_scheduler.set_timesteps(steps)
+     noise = torch.randn(source_mel.shape, generator=generator, device=device)
+     pred = noise
+     source_x = minmax_norm_diff(source_mel, vmax=max_mel, vmin=min_mel)
+
+     for t in tqdm(noise_scheduler.timesteps):
+         pred = noise_scheduler.scale_model_input(pred, t)
+         model_output = model(x=pred, mean=source_x, f0=f0_ref, t=t, ref=None, embed=None)
+         pred = noise_scheduler.step(model_output=model_output,
+                                     timestep=t,
+                                     sample=pred,
+                                     eta=1, generator=generator).prev_sample
+
+     pred = reverse_minmax_norm_diff(pred, vmax=max_mel, vmin=min_mel)
+
+     pred_audio = hifigan(pred)
+     pred_audio = pred_audio.cpu().squeeze().clamp(-1, 1)
+
+     return pred_audio
+
+
+ if __name__ == '__main__':
+     min_mel = np.log(1e-5)
+     max_mel = 2.5
+     sr = 24000
+
+     use_gpu = torch.cuda.is_available()
+     device = 'cuda' if use_gpu else 'cpu'
+
+     # load diffusion model
+     config = yaml.load(open('pitch_controller/config/DiffWorld_24k.yaml'), Loader=yaml.FullLoader)
+     mel_cfg = config['logmel']
+     ddpm_cfg = config['ddpm']
+     unet_cfg = config['unet']
+     model = UNetPitcher(**unet_cfg)
+     unet_path = 'ckpts/world_fixed_40.pt'
+
+     state_dict = torch.load(unet_path)
+     for key in list(state_dict.keys()):
+         state_dict[key.replace('_orig_mod.', '')] = state_dict.pop(key)
+     model.load_state_dict(state_dict)
+     if use_gpu:
+         model.cuda()
+     model.eval()
+
+     # load vocoder
+     hifi_path = 'ckpts/bigvgan_24khz_100band/g_05000000.pt'
+     hifigan, cfg = load_model(hifi_path, device=device)
+     hifigan.eval()
+
+     pred_audio = template_pitcher('examples/off-key.wav', 'examples/reference.wav', model, hifigan, steps=50, shift_semi=0)
+     sf.write('output_template.wav', pred_audio, samplerate=sr)
README.md CHANGED
@@ -1,3 +1,86 @@
- ---
- license: mit
- ---
+ <img src="img/cover.png">
+
+ # Diff-Pitcher (PyTorch)
+
+ Official PyTorch implementation of [Diff-Pitcher: Diffusion-based Singing Voice Pitch Correction](https://engineering.jhu.edu/lcap/data/uploads/pdfs/waspaa2023_hai.pdf)
+
+ --------------------
+
+ Thank you all for your interest in this research project. I am currently optimizing the model's performance and computational efficiency. I plan to release a user-friendly version, either a GUI or a VST plugin, in the first half of this year, and will update the open-source license.
+
+ If you are familiar with PyTorch, you can follow the [Code Examples](#examples) to use Diff-Pitcher.
+
+ --------------------
+
+ Diff-Pitcher
+
+ - [Demo Page](#demo)
+ - [Todo List](#todo)
+ - [Code Examples](#examples)
+ - [References](#references)
+ - [Acknowledgement](#acknowledgement)
+
+ ## Demo
+
+ 🎵 Listen to [examples](https://jhu-lcap.github.io/Diff-Pitcher/)
+
+ ## Todo
+ - [x] Update code and demo
+ - [x] Support 🤗 [Diffusers](https://github.com/huggingface/diffusers)
+ - [x] Upload checkpoints
+ - [x] Pipeline tutorial
+ - [ ] Merge into [Your-Stable-Audio](https://github.com/haidog-yaqub/Your-Stable-Audio)
+ - [ ] Audio plugin support
+
+ ## Examples
+ - Download checkpoints: 🎒[ckpts](https://github.com/haidog-yaqub/DiffPitcher/tree/main/ckpts)
+ - Prepare the environment: [requirements.txt](requirements.txt)
+ - Feel free to try:
+   - template-based automatic pitch correction: [template_based_apc.py](template_based_apc.py)
+   - score-based automatic pitch correction: [score_based_apc.py](score_based_apc.py)
+
+ ## References
+
+ If you find the code useful for your research, please consider citing:
+
+ ```bibtex
+ @inproceedings{hai2023diff,
+   title={Diff-Pitcher: Diffusion-Based Singing Voice Pitch Correction},
+   author={Hai, Jiarui and Elhilali, Mounya},
+   booktitle={2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
+   pages={1--5},
+   year={2023},
+   organization={IEEE}
+ }
+ ```
+
+ This repo is inspired by:
+
+ ```bibtex
+ @article{popov2021diffusion,
+   title={Diffusion-based voice conversion with fast maximum likelihood sampling scheme},
+   author={Popov, Vadim and Vovk, Ivan and Gogoryan, Vladimir and Sadekova, Tasnima and Kudinov, Mikhail and Wei, Jiansheng},
+   journal={arXiv preprint arXiv:2109.13821},
+   year={2021}
+ }
+ ```
+
+ ```bibtex
+ @inproceedings{liu2022diffsinger,
+   title={Diffsinger: Singing voice synthesis via shallow diffusion mechanism},
+   author={Liu, Jinglin and Li, Chengxi and Ren, Yi and Chen, Feiyang and Zhao, Zhou},
+   booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
+   volume={36},
+   number={10},
+   pages={11020--11028},
+   year={2022}
+ }
+ ```
+
+ ## Acknowledgement
+
+ [Welcome to LCAP! (jhu.edu)](https://engineering.jhu.edu/lcap/)
+
+ We borrow code from the following repos:
+
+ - `Diffusion Schedulers` are based on 🤗 [Diffusers](https://github.com/huggingface/diffusers)
+ - `2D UNet` is based on [DiffVC](https://github.com/huawei-noah/Speech-Backbones/tree/main/DiffVC)
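Note: the two APC scripts referenced in the README's Examples section are self-contained. A minimal sketch of reproducing the example outputs shipped in this commit (assuming the linked checkpoints are unpacked under `ckpts/` and the commands are run from the repository root):

```python
# Minimal sketch: reproduce the example outputs added in this commit.
# Assumes ./ckpts contains the downloaded checkpoints and the working
# directory is the repository root.
import subprocess

# template-based APC: corrects examples/off-key.wav against examples/reference.wav
subprocess.run(['python', 'template_based_apc.py'], check=True)   # writes output_template.wav

# score-based APC: corrects examples/score_vocal.wav against examples/score_midi.midi
subprocess.run(['python', 'score_based_apc.py'], check=True)      # writes output_score.wav
```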
examples/off-key.wav ADDED
Binary file (816 kB). View file
 
examples/reference.wav ADDED
Binary file (816 kB). View file
 
examples/score_midi.midi ADDED
Binary file (121 Bytes). View file
 
examples/score_midi.npy ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7baacba4afb8813d057e420cd63853657401403b6c798f6cb7f06673e7dcea5a
+ size 559232
examples/score_vocal.wav ADDED
Binary file (262 kB). View file
 
output_score.wav ADDED
Binary file (262 kB). View file
 
output_template.wav ADDED
Binary file (225 kB). View file
 
pitch_controller/README.md ADDED
@@ -0,0 +1 @@
+ # Diffusion-based Pitch Controller
pitch_controller/__pycache__/utils.cpython-310.pyc ADDED
Binary file (1.94 kB). View file
 
pitch_controller/config/DiffWorld_24k.yaml ADDED
@@ -0,0 +1,24 @@
+ version: 1.0
+
+ logmel:
+   n_mels: 100
+   sampling_rate: 24000
+   n_fft: 1024
+   hop_size: 256
+   max: 2.5
+   min: -12
+
+ unet:
+   dim_base: 256
+   use_embed: False
+   dim_embed: None
+   use_ref_t: False
+   dim_cond: 128
+   dim_mults: [1, 2, 4]
+
+ ddpm:
+   num_train_steps: 1000
+   inference_steps: 100
+   eta: 0.8
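For context, this is how the inference scripts earlier in this commit consume the file: the `unet` block is unpacked directly into `UNetPitcher`, and the `logmel` bounds correspond to the mel normalization range. A minimal sketch (paths as used in this commit):

```python
# Sketch: load DiffWorld_24k.yaml and build the U-Net the same way the
# template/score APC scripts in this commit do.
import yaml
from pitch_controller.models.unet import UNetPitcher

with open('pitch_controller/config/DiffWorld_24k.yaml') as f:
    config = yaml.load(f, Loader=yaml.FullLoader)

model = UNetPitcher(**config['unet'])   # dim_base=256, dim_cond=128, dim_mults=[1, 2, 4], ...
max_mel = config['logmel']['max']       # 2.5, matches max_mel in the scripts
min_mel = config['logmel']['min']       # -12; the scripts hard-code np.log(1e-5) ≈ -11.5 for the same bound
```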
pitch_controller/data/example/f0/p225_001.wav.npy ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8df28ae08ef686e7c7e523fdde25b62fbd05725cdacc043cde407a898182272f
+ size 1672
pitch_controller/data/example/mel/p225_001.wav.npy ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8bf3c0e6956f57acdd82f5d91f6390ce148d89066faedbdd6f6ac8c48d1d2c76
+ size 77328
pitch_controller/data/example/wav/p225_001.wav ADDED
Binary file (197 kB). View file
 
pitch_controller/data/example/world/p225_001.wav.npy ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e00d5eb7fa9df26321df3f3df06e2ff44c3b3732cc5179ef135e41ffeb3a3b82
+ size 77328
pitch_controller/data/prepare_f0.py ADDED
@@ -0,0 +1,66 @@
+ # import amfm_decompy.basic_tools as basic
+ # import amfm_decompy.pYAAPT as pYAAPT
+ from multiprocessing import Process
+ import os
+ import numpy as np
+ import pandas as pd
+ import librosa
+ from librosa.core import load
+ from tqdm import tqdm
+
+
+ def get_f0(wav_path):
+     wav, _ = load(wav_path, sr=24000)
+     wav = wav[:(wav.shape[0] // 256) * 256]
+     wav = np.pad(wav, 384, mode='reflect')
+     f0, _, _ = librosa.pyin(wav, frame_length=1024, hop_length=256, center=False,
+                             fmin=librosa.note_to_hz('C2'),
+                             fmax=librosa.note_to_hz('C6'))
+     return np.nan_to_num(f0)
+
+
+ def chunks(arr, m):
+     result = [[] for i in range(m)]
+     for i in range(len(arr)):
+         result[i%m].append(arr[i])
+     return result
+
+
+ def extract_f0(subset):
+     meta = pd.read_csv('../raw_data/meta_fix.csv')
+     meta = meta[meta['subset'] == 'train']
+     # meta = meta[meta['folder'] == 'VCTK-Corpus/vocal/']
+
+     for i in tqdm(subset):
+         line = meta.iloc[i]
+         audio_dir = '../raw_data/' + line['folder'] + line['subfolder']
+         f = line['file_name']
+
+         f0_dir = audio_dir.replace('vocal', 'f0').replace('raw_data/', '24k_data_f0/')
+
+         try:
+             np.load(os.path.join(f0_dir, f + '.npy'))
+         except:
+             print(line)
+             f0 = get_f0(os.path.join(audio_dir, f))
+             if os.path.exists(f0_dir) is False:
+                 os.makedirs(f0_dir, exist_ok=True)
+             np.save(os.path.join(f0_dir, f + '.npy'), f0)
+
+         # if os.path.exists(os.path.join(f0_dir, f+'.npy')) is False:
+         #     f0 = get_yaapt_f0(os.path.join(audio_dir, f))
+
+
+ if __name__ == '__main__':
+     cores = 8
+     meta = pd.read_csv('../raw_data/meta_fix.csv')
+     meta = meta[meta['subset'] == 'train']
+     # meta = meta[meta['folder'] == 'VCTK-Corpus/vocal/']
+
+     idx_list = [i for i in range(len(meta))]
+
+     subsets = chunks(idx_list, cores)
+
+     for subset in subsets:
+         t = Process(target=extract_f0, args=(subset,))
+         t.start()
pitch_controller/data/prepare_mel.py ADDED
@@ -0,0 +1,72 @@
+ import os
+ import numpy as np
+
+ import librosa
+ from librosa.core import load
+ from librosa.filters import mel as librosa_mel_fn
+ mel_basis = librosa_mel_fn(sr=24000, n_fft=1024, n_mels=100, fmin=0, fmax=12000)
+
+ from tqdm import tqdm
+ import pandas as pd
+
+ from multiprocessing import Process
+
+
+ # def get_f0(wav_path):
+ #     wav, _ = load(wav_path, sr=22050)
+ #     wav = wav[:(wav.shape[0] // 256) * 256]
+ #     wav = np.pad(wav, 384, mode='reflect')
+ #     f0, _, _ = librosa.pyin(wav, frame_length=1024, hop_length=256, center=False,
+ #                             fmin=librosa.note_to_hz('C2'),
+ #                             fmax=librosa.note_to_hz('C6'))
+ #     return np.nan_to_num(f0)
+
+ def get_mel(wav_path):
+     wav, _ = load(wav_path, sr=24000)
+     wav = wav[:(wav.shape[0] // 256)*256]
+     wav = np.pad(wav, 384, mode='reflect')
+     stft = librosa.core.stft(wav, n_fft=1024, hop_length=256, win_length=1024, window='hann', center=False)
+     stftm = np.sqrt(np.real(stft) ** 2 + np.imag(stft) ** 2 + (1e-9))
+     mel_spectrogram = np.matmul(mel_basis, stftm)
+     log_mel_spectrogram = np.log(np.clip(mel_spectrogram, a_min=1e-5, a_max=None))
+     return log_mel_spectrogram
+
+
+ def chunks(arr, m):
+     result = [[] for i in range(m)]
+     for i in range(len(arr)):
+         result[i%m].append(arr[i])
+     return result
+
+
+ def extract_mel(subset):
+     meta = pd.read_csv('../raw_data/meta_fix.csv')
+     meta = meta[meta['folder'] == 'eval/vocal/']
+
+     for i in tqdm(subset):
+         line = meta.iloc[i]
+         audio_dir = '../raw_data/' + line['folder'] + line['subfolder']
+         f = line['file_name']
+
+         mel_dir = audio_dir.replace('vocal', 'mel').replace('raw_data/', '24k_data/')
+
+         if os.path.exists(os.path.join(mel_dir, f+'.npy')) is False:
+             mel = get_mel(os.path.join(audio_dir, f))
+             if os.path.exists(mel_dir) is False:
+                 os.makedirs(mel_dir)
+             np.save(os.path.join(mel_dir, f+'.npy'), mel)
+
+
+ if __name__ == '__main__':
+     cores = 8
+
+     meta = pd.read_csv('../raw_data/meta_fix.csv')
+     meta = meta[meta['folder'] == 'eval/vocal/']
+
+     idx_list = [i for i in range(len(meta))]
+
+     subsets = chunks(idx_list, cores)
+
+     for subset in subsets:
+         t = Process(target=extract_mel, args=(subset,))
+         t.start()
pitch_controller/data/prepare_world.py ADDED
@@ -0,0 +1,85 @@
+ from multiprocessing import Process
+ import os
+ import numpy as np
+
+ import librosa
+ from librosa.core import load
+ from librosa.filters import mel as librosa_mel_fn
+ mel_basis = librosa_mel_fn(sr=24000, n_fft=1024, n_mels=100, fmin=0, fmax=12000)
+
+ from tqdm import tqdm
+ import pandas as pd
+ import pyworld as pw
+
+
+ def get_world_mel(wav_path, sr=24000):
+     wav, _ = librosa.load(wav_path, sr=sr)
+     wav = (wav * 32767).astype(np.int16)
+     wav = (wav / 32767).astype(np.float64)
+     # wav = wav.astype(np.float64)
+     wav = wav[:(wav.shape[0] // 256) * 256]
+
+     _f0, t = pw.dio(wav, sr, frame_period=256/sr*1000)
+     f0 = pw.stonemask(wav, _f0, t, sr)
+     sp = pw.cheaptrick(wav, f0, t, sr)
+     ap = pw.d4c(wav, f0, t, sr)
+     wav_hat = pw.synthesize(f0 * 0, sp, ap, sr, frame_period=256/sr*1000)
+
+     # pyworld output does not pad left
+     wav_hat = wav_hat[:len(wav)]
+     # wav_hat = wav_hat[256//2: len(wav)+256//2]
+     assert len(wav_hat) == len(wav)
+     wav = wav_hat.astype(np.float32)
+     wav = np.pad(wav, 384, mode='reflect')
+     stft = librosa.core.stft(wav, n_fft=1024, hop_length=256, win_length=1024, window='hann', center=False)
+     stftm = np.sqrt(np.real(stft) ** 2 + np.imag(stft) ** 2 + (1e-9))
+     mel_spectrogram = np.matmul(mel_basis, stftm)
+     log_mel_spectrogram = np.log(np.clip(mel_spectrogram, a_min=1e-5, a_max=None))
+
+     return log_mel_spectrogram, f0
+
+
+ def chunks(arr, m):
+     result = [[] for i in range(m)]
+     for i in range(len(arr)):
+         result[i%m].append(arr[i])
+     return result
+
+
+ def extract_pw(subset, save_f0=False):
+     meta = pd.read_csv('../raw_data/meta_fix.csv')
+     meta = meta[meta['subset'] == 'train']
+
+     for i in tqdm(subset):
+         line = meta.iloc[i]
+         audio_dir = '../raw_data/' + line['folder'] + line['subfolder']
+         f = line['file_name']
+
+         mel_dir = audio_dir.replace('vocal', 'world').replace('raw_data/', '24k_data/')
+         f0_dir = audio_dir.replace('vocal', 'f0').replace('raw_data/', '24k_f0/')
+
+         if os.path.exists(os.path.join(mel_dir, f+'.npy')) is False:
+             # get_world_mel returns both the pitch-free mel and the WORLD f0
+             mel, f0 = get_world_mel(os.path.join(audio_dir, f))
+
+             if os.path.exists(mel_dir) is False:
+                 os.makedirs(mel_dir)
+             np.save(os.path.join(mel_dir, f+'.npy'), mel)
+
+             if save_f0 is True:
+                 if os.path.exists(f0_dir) is False:
+                     os.makedirs(f0_dir)
+                 np.save(os.path.join(f0_dir, f + '.npy'), f0)
+
+
+ if __name__ == '__main__':
+     cores = 8
+     meta = pd.read_csv('../raw_data/meta_fix.csv')
+     meta = meta[meta['subset'] == 'train']
+
+     idx_list = [i for i in range(len(meta))]
+
+     subsets = chunks(idx_list, cores)
+
+     for subset in subsets:
+         t = Process(target=extract_pw, args=(subset,))
+         t.start()
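The point of `get_world_mel` is that WORLD re-synthesis with `f0 * 0` strips the original pitch, so the saved `world/*.npy` features act as pitch-free conditioning mels. A minimal single-file sketch (the paths are illustrative and mirror the `data/example` folders added in this commit):

```python
# Sketch: compute the pitch-free WORLD mel and the WORLD f0 for one file,
# mirroring what extract_pw() writes to the world/ and f0/ folders.
import numpy as np

mel, f0 = get_world_mel('example/wav/p225_001.wav', sr=24000)
np.save('example/world/p225_001.wav.npy', mel)   # (100, n_frames) log-mel of the f0-flattened signal
np.save('example/f0/p225_001.wav.npy', f0)       # frame-level f0 from pw.stonemask
```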
pitch_controller/dataset/__init__.py ADDED
@@ -0,0 +1 @@
+ from .diff_lpc import VCDecLPCDataset, VCDecLPCBatchCollate, VCDecLPCTest
pitch_controller/dataset/__pycache__/__init__.cpython-310.pyc ADDED
Binary file (237 Bytes). View file
 
pitch_controller/dataset/__pycache__/__init__.cpython-39.pyc ADDED
Binary file (311 Bytes). View file
 
pitch_controller/dataset/__pycache__/content_enc.cpython-310.pyc ADDED
Binary file (2.85 kB). View file
 
pitch_controller/dataset/__pycache__/content_enc.cpython-39.pyc ADDED
Binary file (2.84 kB). View file
 
pitch_controller/dataset/__pycache__/diff.cpython-310.pyc ADDED
Binary file (5.79 kB). View file
 
pitch_controller/dataset/__pycache__/diff.cpython-39.pyc ADDED
Binary file (5.83 kB). View file
 
pitch_controller/dataset/__pycache__/diff_lpc.cpython-310.pyc ADDED
Binary file (7.03 kB). View file
 
pitch_controller/dataset/diff_lpc.py ADDED
@@ -0,0 +1,271 @@
+ import os
+ import random
+ import numpy as np
+ import torch
+ import tgt
+ import pandas as pd
+
+ from torch.utils.data import Dataset
+ import librosa
+
+
+ def f0_to_coarse(f0, hparams):
+     f0_bin = hparams['f0_bin']
+     f0_max = hparams['f0_max']
+     f0_min = hparams['f0_min']
+     is_torch = isinstance(f0, torch.Tensor)
+     # to mel scale
+     f0_mel_min = 1127 * np.log(1 + f0_min / 700)
+     f0_mel_max = 1127 * np.log(1 + f0_max / 700)
+     f0_mel = 1127 * (1 + f0 / 700).log() if is_torch else 1127 * np.log(1 + f0 / 700)
+
+     unvoiced = (f0_mel == 0)
+
+     f0_mel[f0_mel > 0] = (f0_mel[f0_mel > 0] - f0_mel_min) * (f0_bin - 2) / (f0_mel_max - f0_mel_min) + 1
+
+     f0_mel[f0_mel <= 1] = 1
+     f0_mel[f0_mel > f0_bin - 1] = f0_bin - 1
+
+     f0_mel[unvoiced] = 0
+
+     f0_coarse = (f0_mel + 0.5).long() if is_torch else np.rint(f0_mel).astype(int)
+     assert f0_coarse.max() <= 255 and f0_coarse.min() >= 0, (f0_coarse.max(), f0_coarse.min())
+     return f0_coarse
+
+
+ def log_f0(f0, hparams):
+     f0_bin = hparams['f0_bin']
+     f0_max = hparams['f0_max']
+     f0_min = hparams['f0_min']
+
+     f0_mel = np.zeros_like(f0)
+     f0_mel[f0 != 0] = 12*np.log2(f0[f0 != 0]/f0_min) + 1
+     f0_mel_min = 12*np.log2(f0_min/f0_min) + 1
+     f0_mel_max = 12*np.log2(f0_max/f0_min) + 1
+
+     unvoiced = (f0_mel == 0)
+
+     f0_mel[f0_mel > 0] = (f0_mel[f0_mel > 0] - f0_mel_min) * (f0_bin - 2) / (f0_mel_max - f0_mel_min) + 1
+
+     f0_mel[f0_mel <= 1] = 1
+     f0_mel[f0_mel > f0_bin - 1] = f0_bin - 1
+
+     f0_mel[unvoiced] = 0
+
+     f0_coarse = np.rint(f0_mel).astype(int)
+     assert f0_coarse.max() <= (f0_bin-1) and f0_coarse.min() >= 0, (f0_coarse.max(), f0_coarse.min())
+     return f0_coarse
+
+
+ # training "average voice" encoder
+ class VCDecLPCDataset(Dataset):
+     def __init__(self, data_dir, subset, content_dir='lpc_mel_512', extract_emb=False,
+                  f0_type='bins'):
+         self.path = data_dir
+         meta = pd.read_csv(data_dir + 'meta_fix.csv')
+         self.meta = meta[meta['subset'] == subset]
+         self.content_dir = content_dir
+         self.extract_emb = extract_emb
+         self.f0_type = f0_type
+
+     def get_vc_data(self, audio_path, mel_id):
+         mel_dir = audio_path.replace('vocal', 'mel')
+         embed_dir = audio_path.replace('vocal', 'embed')
+         pitch_dir = audio_path.replace('vocal', 'f0')
+         content_dir = audio_path.replace('vocal', self.content_dir)
+
+         mel = os.path.join(mel_dir, mel_id + '.npy')
+         embed = os.path.join(embed_dir, mel_id + '.npy')
+         pitch = os.path.join(pitch_dir, mel_id + '.npy')
+         content = os.path.join(content_dir, mel_id + '.npy')
+
+         mel = np.load(mel)
+         if self.extract_emb:
+             embed = np.load(embed)
+         else:
+             embed = np.zeros(1)
+
+         pitch = np.load(pitch)
+         content = np.load(content)
+
+         pitch = np.nan_to_num(pitch)
+         if self.f0_type == 'bins':
+             pitch = f0_to_coarse(pitch, {'f0_bin': 256,
+                                          'f0_min': librosa.note_to_hz('C2'),
+                                          'f0_max': librosa.note_to_hz('C6')})
+         elif self.f0_type == 'log':
+             pitch = log_f0(pitch, {'f0_bin': 345,
+                                    'f0_min': librosa.note_to_hz('C2'),
+                                    'f0_max': librosa.note_to_hz('C#6')})
+
+         mel = torch.from_numpy(mel).float()
+         embed = torch.from_numpy(embed).float()
+         pitch = torch.from_numpy(pitch).float()
+         content = torch.from_numpy(content).float()
+
+         return (mel, embed, pitch, content)
+
+     def __getitem__(self, index):
+         row = self.meta.iloc[index]
+         mel_id = row['file_name']
+         audio_path = self.path + row['folder'] + row['subfolder']
+         mel, embed, pitch, content = self.get_vc_data(audio_path, mel_id)
+         item = {'mel': mel, 'embed': embed, 'f0': pitch, 'content': content}
+         return item
+
+     def __len__(self):
+         return len(self.meta)
+
+
+ class VCDecLPCBatchCollate(object):
+     def __init__(self, train_frames, eps=1e-5):
+         self.train_frames = train_frames
+         self.eps = eps
+
+     def __call__(self, batch):
+         train_frames = self.train_frames
+         eps = self.eps
+
+         B = len(batch)
+         embed = torch.stack([item['embed'] for item in batch], 0)
+
+         n_mels = batch[0]['mel'].shape[0]
+         content_dim = batch[0]['content'].shape[0]
+
+         # min value of log-mel spectrogram is np.log(eps) == padding zero in time domain
+         mels1 = torch.ones((B, n_mels, train_frames), dtype=torch.float32) * np.log(eps)
+         mels2 = torch.ones((B, n_mels, train_frames), dtype=torch.float32) * np.log(eps)
+
+         # ! need to deal with empty frames here
+         contents1 = torch.ones((B, content_dim, train_frames), dtype=torch.float32) * np.log(eps)
+
+         f0s1 = torch.zeros((B, train_frames), dtype=torch.float32)
+         max_starts = [max(item['mel'].shape[-1] - train_frames, 0)
+                       for item in batch]
+
+         starts1 = [random.choice(range(m)) if m > 0 else 0 for m in max_starts]
+         starts2 = [random.choice(range(m)) if m > 0 else 0 for m in max_starts]
+         mel_lengths = []
+         for i, item in enumerate(batch):
+             mel = item['mel']
+             f0 = item['f0']
+             content = item['content']
+
+             if mel.shape[-1] < train_frames:
+                 mel_length = mel.shape[-1]
+             else:
+                 mel_length = train_frames
+
+             mels1[i, :, :mel_length] = mel[:, starts1[i]:starts1[i] + mel_length]
+             f0s1[i, :mel_length] = f0[starts1[i]:starts1[i] + mel_length]
+             contents1[i, :, :mel_length] = content[:, starts1[i]:starts1[i] + mel_length]
+
+             mels2[i, :, :mel_length] = mel[:, starts2[i]:starts2[i] + mel_length]
+             mel_lengths.append(mel_length)
+
+         mel_lengths = torch.LongTensor(mel_lengths)
+
+         return {'mel1': mels1, 'mel2': mels2, 'mel_lengths': mel_lengths,
+                 'embed': embed,
+                 'f0_1': f0s1,
+                 'content1': contents1}
+
+
+ class VCDecLPCTest(Dataset):
+     def __init__(self, data_dir, subset='test', eps=1e-5, test_frames=256, content_dir='lpc_mel_512', extract_emb=False,
+                  f0_type='bins'):
+         self.path = data_dir
+         meta = pd.read_csv(data_dir + 'meta_test.csv')
+         self.meta = meta[meta['subset'] == subset]
+         self.content_dir = content_dir
+         self.extract_emb = extract_emb
+         self.eps = eps
+         self.test_frames = test_frames
+         self.f0_type = f0_type
+
+     def get_vc_data(self, audio_path, mel_id, pitch_shift):
+         mel_dir = audio_path.replace('vocal', 'mel')
+         embed_dir = audio_path.replace('vocal', 'embed')
+         pitch_dir = audio_path.replace('vocal', 'f0')
+         content_dir = audio_path.replace('vocal', self.content_dir)
+
+         mel = os.path.join(mel_dir, mel_id + '.npy')
+         embed = os.path.join(embed_dir, mel_id + '.npy')
+         pitch = os.path.join(pitch_dir, mel_id + '.npy')
+         content = os.path.join(content_dir, mel_id + '.npy')
+
+         mel = np.load(mel)
+         if self.extract_emb:
+             embed = np.load(embed)
+         else:
+             embed = np.zeros(1)
+
+         pitch = np.load(pitch)
+         content = np.load(content)
+
+         pitch = np.nan_to_num(pitch)
+         pitch = pitch*pitch_shift
+
+         if self.f0_type == 'bins':
+             pitch = f0_to_coarse(pitch, {'f0_bin': 256,
+                                          'f0_min': librosa.note_to_hz('C2'),
+                                          'f0_max': librosa.note_to_hz('C6')})
+         elif self.f0_type == 'log':
+             pitch = log_f0(pitch, {'f0_bin': 345,
+                                    'f0_min': librosa.note_to_hz('C2'),
+                                    'f0_max': librosa.note_to_hz('C#6')})
+
+         mel = torch.from_numpy(mel).float()
+         embed = torch.from_numpy(embed).float()
+         pitch = torch.from_numpy(pitch).float()
+         content = torch.from_numpy(content).float()
+
+         return (mel, embed, pitch, content)
+
+     def __getitem__(self, index):
+         row = self.meta.iloc[index]
+
+         mel_id = row['content_file_name']
+         audio_path = self.path + row['content_folder'] + row['content_subfolder']
+         pitch_shift = row['pitch_shift']
+         mel1, _, f0, content = self.get_vc_data(audio_path, mel_id, pitch_shift)
+
+         mel_id = row['timbre_file_name']
+         audio_path = self.path + row['timbre_folder'] + row['timbre_subfolder']
+         mel2, embed, _, _ = self.get_vc_data(audio_path, mel_id, pitch_shift)
+
+         n_mels = mel1.shape[0]
+         content_dim = content.shape[0]
+
+         mels1 = torch.ones((n_mels, self.test_frames), dtype=torch.float32) * np.log(self.eps)
+         mels2 = torch.ones((n_mels, self.test_frames), dtype=torch.float32) * np.log(self.eps)
+         lpcs1 = torch.ones((content_dim, self.test_frames), dtype=torch.float32) * np.log(self.eps)
+
+         f0s1 = torch.zeros(self.test_frames, dtype=torch.float32)
+
+         if mel1.shape[-1] < self.test_frames:
+             mel_length = mel1.shape[-1]
+         else:
+             mel_length = self.test_frames
+         mels1[:, :mel_length] = mel1[:, :mel_length]
+         f0s1[:mel_length] = f0[:mel_length]
+         lpcs1[:, :mel_length] = content[:, :mel_length]
+
+         if mel2.shape[-1] < self.test_frames:
+             mel_length = mel2.shape[-1]
+         else:
+             mel_length = self.test_frames
+         mels2[:, :mel_length] = mel2[:, :mel_length]
+
+         return {'mel1': mels1, 'mel2': mels2, 'embed': embed, 'f0_1': f0s1, 'content1': lpcs1}
+
+     def __len__(self):
+         return len(self.meta)
+
+
+ if __name__ == '__main__':
+     f0 = np.array([110.0, 220.0, librosa.note_to_hz('C2'), 0, librosa.note_to_hz('E3'), librosa.note_to_hz('C6')])
+     # 50 midi notes = (50-1)
+     pitch = log_f0(f0, {'f0_bin': 345,
+                         'f0_min': librosa.note_to_hz('C2'),
+                         'f0_max': librosa.note_to_hz('C#6')})
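As a quick sanity check of the `log_f0` quantization used throughout this commit (f0_bin=345, f0_min=C2, f0_max=C#6): the pitch axis spans 49 semitones, so each semitone maps to 343/49 = 7 bins, unvoiced frames stay at bin 0, and A2 (110 Hz, 9 semitones above C2) lands on bin 9*7 + 1 = 64. A worked example, assuming `log_f0` from this module is in scope:

```python
# Worked example for log_f0 with the hparams used in this repo.
import numpy as np
import librosa

hparams = {'f0_bin': 345,
           'f0_min': librosa.note_to_hz('C2'),
           'f0_max': librosa.note_to_hz('C#6')}
f0 = np.array([0.0, librosa.note_to_hz('C2'), 110.0, librosa.note_to_hz('C#6')])
print(log_f0(f0, hparams))   # -> [  0   1  64 344]
```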
pitch_controller/dataset/diff_lpc_content.py ADDED
@@ -0,0 +1,231 @@
+ import os
+ import random
+ import numpy as np
+ import torch
+ import tgt
+ import pandas as pd
+
+ from torch.utils.data import Dataset
+ import librosa
+
+
+ def f0_to_coarse(f0, hparams):
+     f0_bin = hparams['f0_bin']
+     f0_max = hparams['f0_max']
+     f0_min = hparams['f0_min']
+     is_torch = isinstance(f0, torch.Tensor)
+     # to mel scale
+     f0_mel_min = 1127 * np.log(1 + f0_min / 700)
+     f0_mel_max = 1127 * np.log(1 + f0_max / 700)
+     f0_mel = 1127 * (1 + f0 / 700).log() if is_torch else 1127 * np.log(1 + f0 / 700)
+
+     unvoiced = (f0_mel == 0)
+
+     f0_mel[f0_mel > 0] = (f0_mel[f0_mel > 0] - f0_mel_min) * (f0_bin - 2) / (f0_mel_max - f0_mel_min) + 1
+
+     f0_mel[f0_mel <= 1] = 1
+     f0_mel[f0_mel > f0_bin - 1] = f0_bin - 1
+
+     f0_mel[unvoiced] = 0
+
+     f0_coarse = (f0_mel + 0.5).long() if is_torch else np.rint(f0_mel).astype(int)
+     assert f0_coarse.max() <= 255 and f0_coarse.min() >= 0, (f0_coarse.max(), f0_coarse.min())
+     return f0_coarse
+
+
+ # training "average voice" encoder
+ class VCDecLPCDataset(Dataset):
+     def __init__(self, data_dir, subset, content_dir='lpc_mel_512', extract_emb=False):
+         self.path = data_dir
+         meta = pd.read_csv(data_dir + 'meta_fix.csv')
+         self.meta = meta[meta['subset'] == subset]
+         self.content_dir = content_dir
+         self.extract_emb = extract_emb
+
+     def get_vc_data(self, audio_path, mel_id):
+         mel_dir = audio_path.replace('vocal', 'mel')
+         embed_dir = audio_path.replace('vocal', 'embed')
+         pitch_dir = audio_path.replace('vocal', 'f0')
+         content_dir = audio_path.replace('vocal', self.content_dir)
+
+         mel = os.path.join(mel_dir, mel_id + '.npy')
+         embed = os.path.join(embed_dir, mel_id + '.npy')
+         pitch = os.path.join(pitch_dir, mel_id + '.npy')
+         content = os.path.join(content_dir, mel_id + '.npy')
+
+         mel = np.load(mel)
+         if self.extract_emb:
+             embed = np.load(embed)
+         else:
+             embed = np.zeros(1)
+
+         pitch = np.load(pitch)
+         content = np.load(content)
+
+         pitch = np.nan_to_num(pitch)
+         pitch = f0_to_coarse(pitch, {'f0_bin': 256,
+                                      'f0_min': librosa.note_to_hz('C2'),
+                                      'f0_max': librosa.note_to_hz('C6')})
+
+         mel = torch.from_numpy(mel).float()
+         embed = torch.from_numpy(embed).float()
+         pitch = torch.from_numpy(pitch).float()
+         content = torch.from_numpy(content).float()
+
+         return (mel, embed, pitch, content)
+
+     def __getitem__(self, index):
+         row = self.meta.iloc[index]
+         mel_id = row['file_name']
+         audio_path = self.path + row['folder'] + row['subfolder']
+         mel, embed, pitch, content = self.get_vc_data(audio_path, mel_id)
+         item = {'mel': mel, 'embed': embed, 'f0': pitch, 'content': content}
+         return item
+
+     def __len__(self):
+         return len(self.meta)
+
+
+ class VCDecLPCBatchCollate(object):
+     def __init__(self, train_frames, eps=np.log(1e-5), content_eps=np.log(1e-12)):
+         self.train_frames = train_frames
+         self.eps = eps
+         self.content_eps = content_eps
+
+     def __call__(self, batch):
+         train_frames = self.train_frames
+         eps = self.eps
+         content_eps = self.content_eps
+
+         B = len(batch)
+         embed = torch.stack([item['embed'] for item in batch], 0)
+
+         n_mels = batch[0]['mel'].shape[0]
+         content_dim = batch[0]['content'].shape[0]
+
+         # min value of log-mel spectrogram is np.log(eps) == padding zero in time domain
+         mels1 = torch.ones((B, n_mels, train_frames), dtype=torch.float32) * eps
+         mels2 = torch.ones((B, n_mels, train_frames), dtype=torch.float32) * eps
+
+         # using a different eps
+         contents1 = torch.ones((B, content_dim, train_frames), dtype=torch.float32) * content_eps
+
+         f0s1 = torch.zeros((B, train_frames), dtype=torch.float32)
+         max_starts = [max(item['mel'].shape[-1] - train_frames, 0)
+                       for item in batch]
+
+         starts1 = [random.choice(range(m)) if m > 0 else 0 for m in max_starts]
+         starts2 = [random.choice(range(m)) if m > 0 else 0 for m in max_starts]
+         mel_lengths = []
+         for i, item in enumerate(batch):
+             mel = item['mel']
+             f0 = item['f0']
+             content = item['content']
+
+             if mel.shape[-1] < train_frames:
+                 mel_length = mel.shape[-1]
+             else:
+                 mel_length = train_frames
+
+             mels1[i, :, :mel_length] = mel[:, starts1[i]:starts1[i] + mel_length]
+             f0s1[i, :mel_length] = f0[starts1[i]:starts1[i] + mel_length]
+             contents1[i, :, :mel_length] = content[:, starts1[i]:starts1[i] + mel_length]
+
+             mels2[i, :, :mel_length] = mel[:, starts2[i]:starts2[i] + mel_length]
+             mel_lengths.append(mel_length)
+
+         mel_lengths = torch.LongTensor(mel_lengths)
+
+         return {'mel1': mels1, 'mel2': mels2, 'mel_lengths': mel_lengths,
+                 'embed': embed,
+                 'f0_1': f0s1,
+                 'content1': contents1}
+
+
+ class VCDecLPCTest(Dataset):
+     def __init__(self, data_dir, subset='test', eps=np.log(1e-5), content_eps=np.log(1e-12), test_frames=256, content_dir='lpc_mel_512', extract_emb=False):
+         self.path = data_dir
+         meta = pd.read_csv(data_dir + 'meta_test.csv')
+         self.meta = meta[meta['subset'] == subset]
+         self.content_dir = content_dir
+         self.extract_emb = extract_emb
+         self.eps = eps
+         self.content_eps = content_eps
+         self.test_frames = test_frames
+
+     def get_vc_data(self, audio_path, mel_id, pitch_shift):
+         mel_dir = audio_path.replace('vocal', 'mel')
+         embed_dir = audio_path.replace('vocal', 'embed')
+         pitch_dir = audio_path.replace('vocal', 'f0')
+         content_dir = audio_path.replace('vocal', self.content_dir)
+
+         mel = os.path.join(mel_dir, mel_id + '.npy')
+         embed = os.path.join(embed_dir, mel_id + '.npy')
+         pitch = os.path.join(pitch_dir, mel_id + '.npy')
+         content = os.path.join(content_dir, mel_id + '.npy')
+
+         mel = np.load(mel)
+         if self.extract_emb:
+             embed = np.load(embed)
+         else:
+             embed = np.zeros(1)
+
+         pitch = np.load(pitch)
+         content = np.load(content)
+
+         pitch = np.nan_to_num(pitch)
+         pitch = pitch*pitch_shift
+         pitch = f0_to_coarse(pitch, {'f0_bin': 256,
+                                      'f0_min': librosa.note_to_hz('C2'),
+                                      'f0_max': librosa.note_to_hz('C6')})
+
+         mel = torch.from_numpy(mel).float()
+         embed = torch.from_numpy(embed).float()
+         pitch = torch.from_numpy(pitch).float()
+         content = torch.from_numpy(content).float()
+
+         return (mel, embed, pitch, content)
+
+     def __getitem__(self, index):
+         row = self.meta.iloc[index]
+
+         mel_id = row['content_file_name']
+         audio_path = self.path + row['content_folder'] + row['content_subfolder']
+         pitch_shift = row['pitch_shift']
+         mel1, _, f0, content = self.get_vc_data(audio_path, mel_id, pitch_shift)
+
+         mel_id = row['timbre_file_name']
+         audio_path = self.path + row['timbre_folder'] + row['timbre_subfolder']
+         mel2, embed, _, _ = self.get_vc_data(audio_path, mel_id, pitch_shift)
+
+         n_mels = mel1.shape[0]
+         content_dim = content.shape[0]
+
+         mels1 = torch.ones((n_mels, self.test_frames), dtype=torch.float32) * self.eps
+         mels2 = torch.ones((n_mels, self.test_frames), dtype=torch.float32) * self.eps
+         # content
+         lpcs1 = torch.ones((content_dim, self.test_frames), dtype=torch.float32) * self.content_eps
+
+         f0s1 = torch.zeros(self.test_frames, dtype=torch.float32)
+
+         if mel1.shape[-1] < self.test_frames:
+             mel_length = mel1.shape[-1]
+         else:
+             mel_length = self.test_frames
+         mels1[:, :mel_length] = mel1[:, :mel_length]
+         f0s1[:mel_length] = f0[:mel_length]
+         lpcs1[:, :mel_length] = content[:, :mel_length]
+
+         if mel2.shape[-1] < self.test_frames:
+             mel_length = mel2.shape[-1]
+         else:
+             mel_length = self.test_frames
+         mels2[:, :mel_length] = mel2[:, :mel_length]
+
+         return {'mel1': mels1, 'mel2': mels2, 'embed': embed, 'f0_1': f0s1, 'content1': lpcs1}
+
+     def __len__(self):
+         return len(self.meta)
pitch_controller/load_vocoder.py ADDED
@@ -0,0 +1,51 @@
+ # from nsf_hifigan.models import load_model
+ from modules.BigVGAN.inference import load_model
+ import librosa
+
+ import torch
+ import torch.nn.functional as F
+ import torchaudio
+ import torchaudio.transforms as transforms
+
+ import numpy as np
+ import soundfile as sf
+
+
+ class LogMelSpectrogram(torch.nn.Module):
+     def __init__(self):
+         super().__init__()
+         self.melspctrogram = transforms.MelSpectrogram(
+             sample_rate=22050,
+             n_fft=1024,
+             win_length=1024,
+             hop_length=256,
+             center=False,
+             power=1.0,
+             norm="slaney",
+             n_mels=80,
+             mel_scale="slaney",
+             f_max=8000,
+             f_min=0,
+         )
+
+     def forward(self, wav):
+         wav = F.pad(wav, ((1024 - 256) // 2, (1024 - 256) // 2), "reflect")
+         mel = self.melspctrogram(wav)
+         logmel = torch.log(torch.clamp(mel, min=1e-5))
+         return logmel
+
+
+ hifigan, cfg = load_model('modules/BigVGAN/ckpt/bigvgan_22khz_80band/g_05000000', device='cuda')
+ M = LogMelSpectrogram()
+
+ source, sr = torchaudio.load("music.mp3")
+ source = torchaudio.functional.resample(source, sr, 22050)
+ source = source.unsqueeze(0)
+ mel = M(source).squeeze(0)
+
+ # f0, f0_bin = get_pitch("116_1_pred.wav")
+ # f0 = torch.tensor(f0).unsqueeze(0)
+ with torch.no_grad():
+     y_hat = hifigan(mel.cuda()).cpu().numpy().squeeze(1)
+
+ sf.write('test.wav', y_hat[0], samplerate=22050)
pitch_controller/models/__pycache__/base.cpython-310.pyc ADDED
Binary file (1.17 kB). View file
 
pitch_controller/models/__pycache__/base.cpython-39.pyc ADDED
Binary file (1.14 kB). View file
 
pitch_controller/models/__pycache__/modules.cpython-310.pyc ADDED
Binary file (8.26 kB). View file
 
pitch_controller/models/__pycache__/modules.cpython-39.pyc ADDED
Binary file (8.45 kB). View file
 
pitch_controller/models/__pycache__/pitch.cpython-39.pyc ADDED
Binary file (1.1 kB). View file
 
pitch_controller/models/__pycache__/unet.cpython-310.pyc ADDED
Binary file (3.56 kB). View file
 
pitch_controller/models/__pycache__/unet.cpython-39.pyc ADDED
Binary file (3.48 kB). View file
 
pitch_controller/models/__pycache__/update_unet.cpython-310.pyc ADDED
Binary file (3.69 kB). View file
 
pitch_controller/models/__pycache__/utils.cpython-310.pyc ADDED
Binary file (3.99 kB). View file
 
pitch_controller/models/__pycache__/utils.cpython-39.pyc ADDED
Binary file (3.98 kB). View file
 
pitch_controller/models/base.py ADDED
@@ -0,0 +1,30 @@
+ # Copyright (C) 2022. Huawei Technologies Co., Ltd. All rights reserved.
+ # This program is free software; you can redistribute it and/or modify
+ # it under the terms of the MIT License.
+ # This program is distributed in the hope that it will be useful,
+ # but WITHOUT ANY WARRANTY; without even the implied warranty of
+ # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ # MIT License for more details.
+
+ import numpy as np
+ import torch
+
+
+ class BaseModule(torch.nn.Module):
+     def __init__(self):
+         super(BaseModule, self).__init__()
+
+     @property
+     def nparams(self):
+         num_params = 0
+         for name, param in self.named_parameters():
+             if param.requires_grad:
+                 num_params += np.prod(param.detach().cpu().numpy().shape)
+         return num_params
+
+     def relocate_input(self, x: list):
+         device = next(self.parameters()).device
+         for i in range(len(x)):
+             if isinstance(x[i], torch.Tensor) and x[i].device != device:
+                 x[i] = x[i].to(device)
+         return x
pitch_controller/models/modules.py ADDED
@@ -0,0 +1,237 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (C) 2022. Huawei Technologies Co., Ltd. All rights reserved.
2
+ # This program is free software; you can redistribute it and/or modify
3
+ # it under the terms of the MIT License.
4
+ # This program is distributed in the hope that it will be useful,
5
+ # but WITHOUT ANY WARRANTY; without even the implied warranty of
6
+ # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
7
+ # MIT License for more details.
8
+
9
+ import math
10
+ import torch
11
+ from einops import rearrange
12
+
13
+ from .base import BaseModule
14
+
15
+
16
+ class Mish(BaseModule):
17
+ def forward(self, x):
18
+ return x * torch.tanh(torch.nn.functional.softplus(x))
19
+
20
+
21
+ class Upsample(BaseModule):
22
+ def __init__(self, dim):
23
+ super(Upsample, self).__init__()
24
+ self.conv = torch.nn.ConvTranspose2d(dim, dim, 4, 2, 1)
25
+
26
+ def forward(self, x):
27
+ return self.conv(x)
28
+
29
+
30
+ class Downsample(BaseModule):
31
+ def __init__(self, dim):
32
+ super(Downsample, self).__init__()
33
+ self.conv = torch.nn.Conv2d(dim, dim, 3, 2, 1)
34
+
35
+ def forward(self, x):
36
+ return self.conv(x)
37
+
38
+
39
+ class Rezero(BaseModule):
40
+ def __init__(self, fn):
41
+ super(Rezero, self).__init__()
42
+ self.fn = fn
43
+ self.g = torch.nn.Parameter(torch.zeros(1))
44
+
45
+ def forward(self, x):
46
+ return self.fn(x) * self.g
47
+
48
+
49
+ class Block(BaseModule):
50
+ def __init__(self, dim, dim_out, groups=8):
51
+ super(Block, self).__init__()
52
+ self.block = torch.nn.Sequential(torch.nn.Conv2d(dim, dim_out, 3,
53
+ padding=1), torch.nn.GroupNorm(
54
+ groups, dim_out), Mish())
55
+
56
+ def forward(self, x):
57
+ output = self.block(x)
58
+ return output
59
+
60
+
61
+ class ResnetBlock(BaseModule):
62
+ def __init__(self, dim, dim_out, time_emb_dim, groups=8):
63
+ super(ResnetBlock, self).__init__()
64
+ self.mlp = torch.nn.Sequential(Mish(), torch.nn.Linear(time_emb_dim,
65
+ dim_out))
66
+
67
+ self.block1 = Block(dim, dim_out, groups=groups)
68
+ self.block2 = Block(dim_out, dim_out, groups=groups)
69
+ if dim != dim_out:
70
+ self.res_conv = torch.nn.Conv2d(dim, dim_out, 1)
71
+ else:
72
+ self.res_conv = torch.nn.Identity()
73
+
74
+ def forward(self, x, time_emb):
75
+ h = self.block1(x)
76
+ h += self.mlp(time_emb).unsqueeze(-1).unsqueeze(-1)
77
+ h = self.block2(h)
78
+ output = h + self.res_conv(x)
79
+ return output
80
+
81
+
82
+ class LinearAttention(BaseModule):
83
+ def __init__(self, dim, heads=4, dim_head=32, q_norm=True):
84
+ super(LinearAttention, self).__init__()
85
+ self.heads = heads
86
+ hidden_dim = dim_head * heads
87
+ self.to_qkv = torch.nn.Conv2d(dim, hidden_dim * 3, 1, bias=False)
88
+ self.to_out = torch.nn.Conv2d(hidden_dim, dim, 1)
89
+ self.q_norm = q_norm
90
+
91
+ def forward(self, x):
92
+ b, c, h, w = x.shape
93
+ qkv = self.to_qkv(x)
94
+ q, k, v = rearrange(qkv, 'b (qkv heads c) h w -> qkv b heads c (h w)',
95
+ heads=self.heads, qkv=3)
96
+ k = k.softmax(dim=-1)
97
+ if self.q_norm:
98
+ q = q.softmax(dim=-2)
99
+
100
+ context = torch.einsum('bhdn,bhen->bhde', k, v)
101
+ out = torch.einsum('bhde,bhdn->bhen', context, q)
102
+         out = rearrange(out, 'b heads c (h w) -> b (heads c) h w',
+                         heads=self.heads, h=h, w=w)
+         return self.to_out(out)
+
+
+ class Residual(BaseModule):
+     def __init__(self, fn):
+         super(Residual, self).__init__()
+         self.fn = fn
+
+     def forward(self, x, *args, **kwargs):
+         output = self.fn(x, *args, **kwargs) + x
+         return output
+
+
+ def get_timestep_embedding(
+     timesteps: torch.Tensor,
+     embedding_dim: int,
+     flip_sin_to_cos: bool = False,
+     downscale_freq_shift: float = 1,
+     scale: float = 1,
+     max_period: int = 10000,
+ ):
+     """
+     This matches the implementation in Denoising Diffusion Probabilistic Models: Create sinusoidal timestep embeddings.
+
+     :param timesteps: a 1-D Tensor of N indices, one per batch element. These may be fractional.
+     :param embedding_dim: the dimension of the output.
+     :param max_period: controls the minimum frequency of the embeddings.
+     :return: an [N x dim] Tensor of positional embeddings.
+     """
+     assert len(timesteps.shape) == 1, "Timesteps should be a 1d-array"
+
+     half_dim = embedding_dim // 2
+     exponent = -math.log(max_period) * torch.arange(
+         start=0, end=half_dim, dtype=torch.float32, device=timesteps.device
+     )
+     exponent = exponent / (half_dim - downscale_freq_shift)
+
+     emb = torch.exp(exponent)
+     emb = timesteps[:, None].float() * emb[None, :]
+
+     # scale embeddings
+     emb = scale * emb
+
+     # concat sine and cosine embeddings
+     emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=-1)
+
+     # flip sine and cosine embeddings
+     if flip_sin_to_cos:
+         emb = torch.cat([emb[:, half_dim:], emb[:, :half_dim]], dim=-1)
+
+     # zero pad
+     if embedding_dim % 2 == 1:
+         emb = torch.nn.functional.pad(emb, (0, 1, 0, 0))
+     return emb
+
+
+ class Timesteps(BaseModule):
+     def __init__(self, num_channels: int, flip_sin_to_cos: bool, downscale_freq_shift: float):
+         super().__init__()
+         self.num_channels = num_channels
+         self.flip_sin_to_cos = flip_sin_to_cos
+         self.downscale_freq_shift = downscale_freq_shift
+
+     def forward(self, timesteps):
+         t_emb = get_timestep_embedding(
+             timesteps,
+             self.num_channels,
+             flip_sin_to_cos=self.flip_sin_to_cos,
+             downscale_freq_shift=self.downscale_freq_shift,
+         )
+         return t_emb
+
+
+ class PitchPosEmb(BaseModule):
+     def __init__(self, dim, flip_sin_to_cos=False, downscale_freq_shift=0):
+         super(PitchPosEmb, self).__init__()
+         self.dim = dim
+         self.flip_sin_to_cos = flip_sin_to_cos
+         self.downscale_freq_shift = downscale_freq_shift
+
+     def forward(self, x):
+         # x: B * L
+         b, l = x.shape
+         x = rearrange(x, 'b l -> (b l)')
+         emb = get_timestep_embedding(
+             x,
+             self.dim,
+             flip_sin_to_cos=self.flip_sin_to_cos,
+             downscale_freq_shift=self.downscale_freq_shift,
+         )
+         emb = rearrange(emb, '(b l) d -> b d l', b=b, l=l)
+         return emb
+
+
+ class TimbreBlock(BaseModule):
+     def __init__(self, out_dim):
+         super(TimbreBlock, self).__init__()
+         base_dim = out_dim // 4
+
+         self.block11 = torch.nn.Sequential(torch.nn.Conv2d(1, 2 * base_dim,
+                                                            3, 1, 1),
+                                            torch.nn.InstanceNorm2d(2 * base_dim, affine=True),
+                                            torch.nn.GLU(dim=1))
+         self.block12 = torch.nn.Sequential(torch.nn.Conv2d(base_dim, 2 * base_dim,
+                                                            3, 1, 1),
+                                            torch.nn.InstanceNorm2d(2 * base_dim, affine=True),
+                                            torch.nn.GLU(dim=1))
+         self.block21 = torch.nn.Sequential(torch.nn.Conv2d(base_dim, 4 * base_dim,
+                                                            3, 1, 1),
+                                            torch.nn.InstanceNorm2d(4 * base_dim, affine=True),
+                                            torch.nn.GLU(dim=1))
+         self.block22 = torch.nn.Sequential(torch.nn.Conv2d(2 * base_dim, 4 * base_dim,
+                                                            3, 1, 1),
+                                            torch.nn.InstanceNorm2d(4 * base_dim, affine=True),
+                                            torch.nn.GLU(dim=1))
+         self.block31 = torch.nn.Sequential(torch.nn.Conv2d(2 * base_dim, 8 * base_dim,
+                                                            3, 1, 1),
+                                            torch.nn.InstanceNorm2d(8 * base_dim, affine=True),
+                                            torch.nn.GLU(dim=1))
+         self.block32 = torch.nn.Sequential(torch.nn.Conv2d(4 * base_dim, 8 * base_dim,
+                                                            3, 1, 1),
+                                            torch.nn.InstanceNorm2d(8 * base_dim, affine=True),
+                                            torch.nn.GLU(dim=1))
+         self.final_conv = torch.nn.Conv2d(4 * base_dim, out_dim, 1)
+
+     def forward(self, x):
+         y = self.block11(x)
+         y = self.block12(y)
+         y = self.block21(y)
+         y = self.block22(y)
+         y = self.block31(y)
+         y = self.block32(y)
+         y = self.final_conv(y)
+
+         return y.sum((2, 3)) / (y.shape[2] * y.shape[3])
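
For orientation, below is a minimal sketch of how the sinusoidal embedding utilities above behave. The dimension, time values, and import path are illustrative assumptions (they only presume the package layout shown in this diff is importable), not settings taken from the project's training code.

```python
import torch

from pitch_controller.models.modules import Timesteps

# Embed three (possibly fractional) diffusion-time values into 64-dim
# sinusoidal vectors, the same way the U-Net's time_pos_emb does.
emb_layer = Timesteps(num_channels=64, flip_sin_to_cos=True, downscale_freq_shift=0)
t = torch.tensor([0.0, 0.5, 1.0])
emb = emb_layer(t)
print(emb.shape)  # torch.Size([3, 64])
```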
pitch_controller/models/unet.py ADDED
@@ -0,0 +1,153 @@
+ import math
+ import torch
+
+ from .base import BaseModule
+ from .modules import Mish, Upsample, Downsample, Rezero, Block, ResnetBlock
+ from .modules import LinearAttention, Residual, Timesteps, TimbreBlock, PitchPosEmb
+
+ from einops import rearrange
+
+
+ class UNetPitcher(BaseModule):
+     def __init__(self,
+                  dim_base,
+                  dim_cond,
+                  use_ref_t,
+                  use_embed,
+                  dim_embed=256,
+                  dim_mults=(1, 2, 4),
+                  pitch_type='bins'):
+
+         super(UNetPitcher, self).__init__()
+         self.use_ref_t = use_ref_t
+         self.use_embed = use_embed
+         self.pitch_type = pitch_type
+
+         dim_in = 2
+
+         # time embedding
+         self.time_pos_emb = Timesteps(num_channels=dim_base,
+                                       flip_sin_to_cos=True,
+                                       downscale_freq_shift=0)
+
+         self.mlp = torch.nn.Sequential(torch.nn.Linear(dim_base, dim_base * 4),
+                                        Mish(), torch.nn.Linear(dim_base * 4, dim_base))
+
+         # speaker embedding
+         timbre_total = 0
+         if use_ref_t:
+             self.ref_block = TimbreBlock(out_dim=dim_cond)
+             timbre_total += dim_cond
+         if use_embed:
+             timbre_total += dim_embed
+
+         if timbre_total != 0:
+             self.timbre_block = torch.nn.Sequential(
+                 torch.nn.Linear(timbre_total, 4 * dim_cond),
+                 Mish(),
+                 torch.nn.Linear(4 * dim_cond, dim_cond))
+
+         if use_embed or use_ref_t:
+             dim_in += dim_cond
+
+         self.pitch_pos_emb = PitchPosEmb(dim_cond)
+         self.pitch_mlp = torch.nn.Sequential(
+             torch.nn.Conv1d(dim_cond, dim_cond * 4, 1, stride=1),
+             Mish(),
+             torch.nn.Conv1d(dim_cond * 4, dim_cond, 1, stride=1), )
+         dim_in += dim_cond
+
+         # pitch embedding
+         # if self.pitch_type == 'bins':
+         #     print('using mel bins for f0')
+         # elif self.pitch_type == 'log':
+         #     print('using log bins f0')
+
+         dims = [dim_in, *map(lambda m: dim_base * m, dim_mults)]
+         in_out = list(zip(dims[:-1], dims[1:]))
+         # blocks
+         self.downs = torch.nn.ModuleList([])
+         self.ups = torch.nn.ModuleList([])
+         num_resolutions = len(in_out)
+
+         for ind, (dim_in, dim_out) in enumerate(in_out):
+             is_last = ind >= (num_resolutions - 1)
+             self.downs.append(torch.nn.ModuleList([
+                 ResnetBlock(dim_in, dim_out, time_emb_dim=dim_base),
+                 ResnetBlock(dim_out, dim_out, time_emb_dim=dim_base),
+                 Residual(Rezero(LinearAttention(dim_out))),
+                 Downsample(dim_out) if not is_last else torch.nn.Identity()]))
+
+         mid_dim = dims[-1]
+         self.mid_block1 = ResnetBlock(mid_dim, mid_dim, time_emb_dim=dim_base)
+         self.mid_attn = Residual(Rezero(LinearAttention(mid_dim)))
+         self.mid_block2 = ResnetBlock(mid_dim, mid_dim, time_emb_dim=dim_base)
+
+         for ind, (dim_in, dim_out) in enumerate(reversed(in_out[1:])):
+             self.ups.append(torch.nn.ModuleList([
+                 ResnetBlock(dim_out * 2, dim_in, time_emb_dim=dim_base),
+                 ResnetBlock(dim_in, dim_in, time_emb_dim=dim_base),
+                 Residual(Rezero(LinearAttention(dim_in))),
+                 Upsample(dim_in)]))
+         self.final_block = Block(dim_base, dim_base)
+         self.final_conv = torch.nn.Conv2d(dim_base, 1, 1)
+
+     def forward(self, x, mean, f0, t, ref=None, embed=None):
+         if not torch.is_tensor(t):
+             t = torch.tensor([t], dtype=torch.long, device=x.device)
+         if len(t.shape) == 0:
+             t = t * torch.ones(x.shape[0], dtype=t.dtype, device=x.device)
+
+         t = self.time_pos_emb(t)
+         t = self.mlp(t)
+
+         x = torch.stack([x, mean], 1)
+
+         f0 = self.pitch_pos_emb(f0)
+         f0 = self.pitch_mlp(f0)
+         f0 = f0.unsqueeze(2)
+         f0 = torch.cat(x.shape[2] * [f0], 2)
+
+         timbre = None
+         if self.use_ref_t:
+             ref = torch.stack([ref], 1)
+             timbre = self.ref_block(ref)
+         if self.use_embed:
+             if timbre is not None:
+                 timbre = torch.cat([timbre, embed], 1)
+             else:
+                 timbre = embed
+         if timbre is None:
+             # raise Exception("at least use one timbre condition")
+             condition = f0
+         else:
+             timbre = self.timbre_block(timbre).unsqueeze(-1).unsqueeze(-1)
+             timbre = torch.cat(x.shape[2] * [timbre], 2)
+             timbre = torch.cat(x.shape[3] * [timbre], 3)
+             condition = torch.cat([f0, timbre], 1)
+
+         x = torch.cat([x, condition], 1)
+
+         hiddens = []
+         for resnet1, resnet2, attn, downsample in self.downs:
+             x = resnet1(x, t)
+             x = resnet2(x, t)
+             x = attn(x)
+             hiddens.append(x)
+             x = downsample(x)
+
+         x = self.mid_block1(x, t)
+         x = self.mid_attn(x)
+         x = self.mid_block2(x, t)
+
+         for resnet1, resnet2, attn, upsample in self.ups:
+             x = torch.cat((x, hiddens.pop()), dim=1)
+             x = resnet1(x, t)
+             x = resnet2(x, t)
+             x = attn(x)
+             x = upsample(x)
+
+         x = self.final_block(x)
+         output = self.final_conv(x)
+
+         return output.squeeze(1)
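
As a rough shape check, the sketch below instantiates `UNetPitcher` and runs a single forward pass on random tensors. All sizes, the pitch scale, and the time value are assumptions chosen only so the tensors broadcast (the mel-bin and frame counts must be divisible by 2**2 for the two down/upsampling stages); they are not the project's actual configuration.

```python
import torch

from pitch_controller.models.unet import UNetPitcher

# Hypothetical sizes: 100 mel bins, 128 frames (both divisible by 4).
model = UNetPitcher(dim_base=64, dim_cond=128, use_ref_t=False, use_embed=True, dim_embed=256)

x = torch.randn(1, 100, 128)     # noisy mel spectrogram
mean = torch.randn(1, 100, 128)  # conditioning spectrogram, stacked with x as a second channel
f0 = torch.rand(1, 128) * 80     # per-frame pitch curve (placeholder values)
embed = torch.randn(1, 256)      # speaker/timbre embedding
t = torch.tensor([0.5])          # diffusion time, one value per batch element

with torch.no_grad():
    out = model(x, mean, f0, t, embed=embed)
print(out.shape)  # torch.Size([1, 100, 128])
```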
pitch_controller/models/utils.py ADDED
@@ -0,0 +1,110 @@
+ # Copyright (C) 2022. Huawei Technologies Co., Ltd. All rights reserved.
+ # This program is free software; you can redistribute it and/or modify
+ # it under the terms of the MIT License.
+ # This program is distributed in the hope that it will be useful,
+ # but WITHOUT ANY WARRANTY; without even the implied warranty of
+ # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ # MIT License for more details.
+
+ import torch
+ import torchaudio
+ import numpy as np
+ from librosa.filters import mel as librosa_mel_fn
+
+ from .base import BaseModule
+
+
+ def mse_loss(x, y, mask, n_feats):
+     loss = torch.sum(((x - y)**2) * mask)
+     return loss / (torch.sum(mask) * n_feats)
+
+
+ def sequence_mask(length, max_length=None):
+     if max_length is None:
+         max_length = length.max()
+     x = torch.arange(int(max_length), dtype=length.dtype, device=length.device)
+     return x.unsqueeze(0) < length.unsqueeze(1)
+
+
+ def convert_pad_shape(pad_shape):
+     l = pad_shape[::-1]
+     pad_shape = [item for sublist in l for item in sublist]
+     return pad_shape
+
+
+ def fix_len_compatibility(length, num_downsamplings_in_unet=2):
+     while True:
+         if length % (2**num_downsamplings_in_unet) == 0:
+             return length
+         length += 1
+
+
+ class PseudoInversion(BaseModule):
+     def __init__(self, n_mels, sampling_rate, n_fft):
+         super(PseudoInversion, self).__init__()
+         self.n_mels = n_mels
+         self.sampling_rate = sampling_rate
+         self.n_fft = n_fft
+         mel_basis = librosa_mel_fn(sr=sampling_rate, n_fft=n_fft, n_mels=n_mels, fmin=0, fmax=8000)
+         mel_basis_inverse = np.linalg.pinv(mel_basis)
+         mel_basis_inverse = torch.from_numpy(mel_basis_inverse).float()
+         self.register_buffer("mel_basis_inverse", mel_basis_inverse)
+
+     def forward(self, log_mel_spectrogram):
+         mel_spectrogram = torch.exp(log_mel_spectrogram)
+         stftm = torch.matmul(self.mel_basis_inverse, mel_spectrogram)
+         return stftm
+
+
+ class InitialReconstruction(BaseModule):
+     def __init__(self, n_fft, hop_size):
+         super(InitialReconstruction, self).__init__()
+         self.n_fft = n_fft
+         self.hop_size = hop_size
+         window = torch.hann_window(n_fft).float()
+         self.register_buffer("window", window)
+
+     def forward(self, stftm):
+         real_part = torch.ones_like(stftm, device=stftm.device)
+         imag_part = torch.zeros_like(stftm, device=stftm.device)
+         stft = torch.stack([real_part, imag_part], -1)*stftm.unsqueeze(-1)
+         istft = torch.istft(stft, n_fft=self.n_fft,
+                             hop_length=self.hop_size, win_length=self.n_fft,
+                             window=self.window, center=True)
+         return istft.unsqueeze(1)
+
+
+ # Fast Griffin-Lim algorithm as a PyTorch module
+ class FastGL(BaseModule):
+     def __init__(self, n_mels, sampling_rate, n_fft, hop_size, momentum=0.99):
+         super(FastGL, self).__init__()
+         self.n_mels = n_mels
+         self.sampling_rate = sampling_rate
+         self.n_fft = n_fft
+         self.hop_size = hop_size
+         self.momentum = momentum
+         self.pi = PseudoInversion(n_mels, sampling_rate, n_fft)
+         self.ir = InitialReconstruction(n_fft, hop_size)
+         window = torch.hann_window(n_fft).float()
+         self.register_buffer("window", window)
+
+     @torch.no_grad()
+     def forward(self, s, n_iters=32):
+         c = self.pi(s)
+         x = self.ir(c)
+         x = x.squeeze(1)
+         c = c.unsqueeze(-1)
+         prev_angles = torch.zeros_like(c, device=c.device)
+         for _ in range(n_iters):
+             s = torch.stft(x, n_fft=self.n_fft, hop_length=self.hop_size,
+                            win_length=self.n_fft, window=self.window,
+                            center=True)
+             real_part, imag_part = s.unbind(-1)
+             stftm = torch.sqrt(torch.clamp(real_part**2 + imag_part**2, min=1e-8))
+             angles = s / stftm.unsqueeze(-1)
+             s = c * (angles + self.momentum * (angles - prev_angles))
+             x = torch.istft(s, n_fft=self.n_fft, hop_length=self.hop_size,
+                             win_length=self.n_fft, window=self.window,
+                             center=True)
+             prev_angles = angles
+         return x.unsqueeze(1)
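
A small sketch of the two helpers from this file that are easiest to reuse elsewhere in the repo; the numbers are arbitrary examples.

```python
import torch

from pitch_controller.models.utils import sequence_mask, fix_len_compatibility

lengths = torch.tensor([5, 3, 8])
mask = sequence_mask(lengths)     # bool tensor of shape [3, 8]; True inside each sequence
print(mask[1])                    # tensor([ True,  True,  True, False, False, False, False, False])

# Round a frame count up to the next multiple of 2**2 so the two U-Net
# downsampling stages divide the length evenly.
print(fix_len_compatibility(87))  # 88
```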
pitch_controller/modules/BigVGAN/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2022 NVIDIA CORPORATION.
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
pitch_controller/modules/BigVGAN/README.md ADDED
@@ -0,0 +1,95 @@
+ ## BigVGAN: A Universal Neural Vocoder with Large-Scale Training
+ #### Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, Sungroh Yoon
+
+ <center><img src="https://user-images.githubusercontent.com/15963413/218609148-881e39df-33af-4af9-ab95-1427c4ebf062.png" width="800"></center>
+
+
+ ### [Paper](https://arxiv.org/abs/2206.04658)
+ ### [Audio demo](https://bigvgan-demo.github.io/)
+
+ ## Installation
+ Clone the repository and install the dependencies.
+ ```shell
+ # the codebase has been tested on Python 3.8 / 3.10 with PyTorch 1.12.1 / 1.13 conda binaries
+ git clone https://github.com/NVIDIA/BigVGAN
+ pip install -r requirements.txt
+ ```
+
+ Create symbolic links to the root of the dataset. The codebase uses filelists with paths relative to the dataset root. Below are example commands for the LibriTTS dataset.
+ ```shell
+ cd LibriTTS && \
+ ln -s /path/to/your/LibriTTS/train-clean-100 train-clean-100 && \
+ ln -s /path/to/your/LibriTTS/train-clean-360 train-clean-360 && \
+ ln -s /path/to/your/LibriTTS/train-other-500 train-other-500 && \
+ ln -s /path/to/your/LibriTTS/dev-clean dev-clean && \
+ ln -s /path/to/your/LibriTTS/dev-other dev-other && \
+ ln -s /path/to/your/LibriTTS/test-clean test-clean && \
+ ln -s /path/to/your/LibriTTS/test-other test-other && \
+ cd ..
+ ```
+
+ ## Training
+ Train the BigVGAN model. Below is an example command for training BigVGAN on the LibriTTS dataset at 24 kHz with a full 100-band mel spectrogram as input.
+ ```shell
+ python train.py \
+ --config configs/bigvgan_24khz_100band.json \
+ --input_wavs_dir LibriTTS \
+ --input_training_file LibriTTS/train-full.txt \
+ --input_validation_file LibriTTS/val-full.txt \
+ --list_input_unseen_wavs_dir LibriTTS LibriTTS \
+ --list_input_unseen_validation_file LibriTTS/dev-clean.txt LibriTTS/dev-other.txt \
+ --checkpoint_path exp/bigvgan
+ ```
+
+ ## Synthesis
+ Synthesize audio from a trained BigVGAN model. Below is an example command for generating audio from the model.
+ It computes mel spectrograms from the wav files in `--input_wavs_dir` and saves the generated audio to `--output_dir`.
+ ```shell
+ python inference.py \
+ --checkpoint_file exp/bigvgan/g_05000000 \
+ --input_wavs_dir /path/to/your/input_wav \
+ --output_dir /path/to/your/output_wav
+ ```
+
+ `inference_e2e.py` supports synthesis directly from mel spectrograms saved in `.npy` format, with shape `[1, channel, frame]` or `[channel, frame]`.
+ It loads mel spectrograms from `--input_mels_dir` and saves the generated audio to `--output_dir`.
+
+ Make sure the STFT hyperparameters used to compute the mel spectrograms match those of the model, as defined in the `config.json` of the corresponding checkpoint.
+ ```shell
+ python inference_e2e.py \
+ --checkpoint_file exp/bigvgan/g_05000000 \
+ --input_mels_dir /path/to/your/input_mel \
+ --output_dir /path/to/your/output_wav
+ ```
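
For reference, a mel spectrogram can be dumped in the layout `inference_e2e.py` expects with a few lines of NumPy. The array below is a random stand-in and the directory name is only an illustration; real inputs must be computed with the STFT/mel settings from the checkpoint's `config.json`.

```python
import os
import numpy as np

os.makedirs('input_mel', exist_ok=True)

# Stand-in mel spectrogram with shape [channel, frame] = [num_mels, num_frames].
mel = np.random.randn(100, 512).astype(np.float32)
np.save(os.path.join('input_mel', 'example.npy'), mel)
# Then run: python inference_e2e.py --input_mels_dir input_mel --output_dir <out_dir> --checkpoint_file <ckpt>
```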
+
+ ## Pretrained Models
+ We provide [pretrained models](https://drive.google.com/drive/folders/1e9wdM29d-t3EHUpBb8T4dcHrkYGAXTgq).
+ You can download the generator (e.g., g_05000000) and discriminator (e.g., do_05000000) checkpoints from the listed folders.
+
+ |Folder Name|Sampling Rate|Mel band|fmax|Params.|Dataset|Fine-Tuned|
+ |------|---|---|---|---|------|---|
+ |bigvgan_24khz_100band|24 kHz|100|12000|112M|LibriTTS|No|
+ |bigvgan_base_24khz_100band|24 kHz|100|12000|14M|LibriTTS|No|
+ |bigvgan_22khz_80band|22 kHz|80|8000|112M|LibriTTS + VCTK + LJSpeech|No|
+ |bigvgan_base_22khz_80band|22 kHz|80|8000|14M|LibriTTS + VCTK + LJSpeech|No|
+
+ The results in the paper are based on the 24 kHz BigVGAN models trained on the LibriTTS dataset.
+ We also provide 22 kHz BigVGAN models with a band-limited setup (i.e., fmax=8000) for TTS applications.
+ Note that the latest checkpoints use the ``snakebeta`` activation with log-scale parameterization, which gives the best overall quality.
+
+
+ ## TODO
+
+ The current codebase only provides a plain PyTorch implementation of the filtered nonlinearity. We are working on a fast CUDA kernel implementation, which will be released in the future.
+
+
+ ## References
+ * [HiFi-GAN](https://github.com/jik876/hifi-gan) (for generator and multi-period discriminator)
+
+ * [Snake](https://github.com/EdwardDixon/snake) (for periodic activation)
+
+ * [Alias-free-torch](https://github.com/junjun3518/alias-free-torch) (for anti-aliasing)
+
+ * [Julius](https://github.com/adefossez/julius) (for low-pass filter)
+
+ * [UnivNet](https://github.com/mindslab-ai/univnet) (for multi-resolution discriminator)
pitch_controller/modules/BigVGAN/__pycache__/env.cpython-310.pyc ADDED
Binary file (845 Bytes).
pitch_controller/modules/BigVGAN/__pycache__/inference.cpython-310.pyc ADDED
Binary file (1.11 kB).