wetdog committed · Commit df061ec · verified · 1 Parent(s): fbdcbdc

Update README.md

Files changed (1): README.md (+161 -0)
README.md CHANGED

---
license: mit
datasets:
- projecte-aina/festcat_trimmed_denoised
- projecte-aina/openslr-slr69-ca-trimmed-denoised
---

# Vocos-mel-22khz

<!-- Provide a quick summary of what the model is/does. -->

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

**Vocos** is a fast neural vocoder designed to synthesize audio waveforms from acoustic features.
Unlike other typical GAN-based vocoders, Vocos does not model audio samples in the time domain.
Instead, it generates spectral coefficients, facilitating rapid audio reconstruction through
inverse Fourier transform.

This version of Vocos uses 80-bin mel spectrograms as acoustic features, which have been widespread
in the TTS domain since the introduction of [hifi-gan](https://github.com/jik876/hifi-gan/blob/master/meldataset.py).
The goal of this model is to provide an alternative to hifi-gan that is faster and compatible with the
acoustic output of several TTS models.
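
As a point of reference, an 80-bin mel spectrogram in this style can be computed with torchaudio roughly as sketched below. The parameters (n_fft=1024, hop_length=256, fmax=8000, log compression) and the input file name are assumptions based on the usual hifi-gan 22 kHz recipe, not values stated in this card; check the linked `meldataset.py` for the exact configuration the model expects.

```python
import torch
import torchaudio

# Illustrative feature extraction: these parameters follow the common
# hifi-gan 22 kHz recipe and may differ from the exact ones this model expects.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050,
    n_fft=1024,
    hop_length=256,
    n_mels=80,
    f_min=0,
    f_max=8000,
    power=1.0,
)

waveform, sr = torchaudio.load("speech.wav")  # "speech.wav" is an example path
waveform = torchaudio.functional.resample(waveform, sr, 22050)
mel = torch.log(torch.clamp(mel_transform(waveform), min=1e-5))  # (1, 80, frames)
```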

## Intended Uses and Limitations

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
The model is intended to serve as a vocoder that synthesizes audio waveforms from mel spectrograms. It is trained to generate speech; if it is used in other audio
domains, it may not produce high-quality samples.

## How to Get Started with the Model

Use the code below to get started with the model.

### Installation

To use Vocos only in inference mode, install it using:

```bash
pip install git+https://github.com/langtech-bsc/vocos.git@matcha
```

### Reconstruct audio from mel-spectrogram

```python
import torch

from vocos import Vocos

vocos = Vocos.from_pretrained("BSC-LT/vocos-mel-22khz-cat")

mel = torch.randn(1, 80, 256)  # B, C, T
audio = vocos.decode(mel)
```
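
The decoded tensor is a batch of waveforms, one row per item in the batch, so it can be written to disk directly with torchaudio; the output path below is only an example:

```python
import torchaudio

# "reconstruction.wav" is an arbitrary example path.
torchaudio.save("reconstruction.wav", audio, 22050)
```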

### Copy-synthesis from a file

```python
import torchaudio

from vocos import Vocos

vocos = Vocos.from_pretrained("BSC-LT/vocos-mel-22khz-cat")

y, sr = torchaudio.load(YOUR_AUDIO_FILE)
if y.size(0) > 1:  # mix to mono
    y = y.mean(dim=0, keepdim=True)
y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=22050)
y_hat = vocos(y)
```

### ONNX

We also release an ONNX version of the model; you can try it in this Colab notebook:

<a target="_blank" href="https://colab.research.google.com/github/langtech-bsc/vocos/blob/matcha/notebooks/vocos_22khz_onnx_inference.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
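
If you prefer to run the ONNX model locally, a minimal onnxruntime sketch looks roughly like this; the file name and the single mel input / single audio output are assumptions on our part, so check the notebook above for the actual export details:

```python
import numpy as np
import onnxruntime as ort

# Hypothetical file name; use the .onnx file shipped with this repository.
session = ort.InferenceSession("mel_spec_22khz_cat.onnx")

mel = np.random.randn(1, 80, 256).astype(np.float32)  # B, C, T, as in the PyTorch example
input_name = session.get_inputs()[0].name
audio = session.run(None, {input_name: mel})[0]
print(audio.shape)
```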

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

The model was trained on 3 Catalan speech datasets:

| Dataset   | Language | Hours |
|-----------|----------|-------|
| Festcat   | ca       | 22    |
| OpenSLR69 | ca       | 5     |
| lafresca  | ca       | 5     |

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
The model was trained for 1M steps and 1k epochs with a batch size of 16 for stability. We used a cosine scheduler with an initial learning rate of 5e-4.
We also modified the mel spectrogram loss to use 128 bins and an fmax of 11025 Hz, instead of matching the configuration of the input mel spectrogram.

#### Training Hyperparameters

* initial_learning_rate: 5e-4
* scheduler: cosine without warmup or restarts
* mel_loss_coeff: 45
* mrd_loss_coeff: 0.1
* batch_size: 16
* num_samples: 16384
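
For illustration only, the learning-rate schedule described above (cosine decay from 5e-4, no warmup or restarts) corresponds to a PyTorch setup along these lines; the AdamW optimizer and the placeholder module are assumptions, not a copy of the actual training code:

```python
import torch

# Placeholder module standing in for the Vocos generator.
generator = torch.nn.Linear(80, 80)

optimizer = torch.optim.AdamW(generator.parameters(), lr=5e-4)
# Cosine decay over the whole run (~1M steps), no warmup, no restarts.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1_000_000)

for step in range(3):  # the real run uses ~1M steps
    optimizer.zero_grad()
    loss = generator(torch.randn(16, 80)).pow(2).mean()  # dummy loss
    loss.backward()
    optimizer.step()
    scheduler.step()
```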

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

Evaluation was done using the metrics from the original repo; after ~1000 epochs we achieve:

* val_loss:
* f1_score:
* mel_loss:
* periodicity_loss:
* pesq_score:
* pitch_loss:
* utmos_score:

## Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

If this code contributes to your research, please cite the work:

```
@article{siuzdak2023vocos,
  title={Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis},
  author={Siuzdak, Hubert},
  journal={arXiv preprint arXiv:2306.00814},
  year={2023}
}
```

## Additional information

### Author
The Language Technologies Unit from Barcelona Supercomputing Center.

### Contact
For further information, please send an email to <[email protected]>.

### Copyright
Copyright (c) 2024 by Language Technologies Unit, Barcelona Supercomputing Center.

### License
[MIT](https://opensource.org/license/mit)

### Funding

This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).