patriotyk committed on
Commit
d726ead
1 Parent(s): a42713b

Initial description commit

Files changed (1)
  1. README.md +60 -3
README.md CHANGED
---
license: mit
---

### Model Description

**Vocos** is a fast neural vocoder designed to synthesize audio waveforms from acoustic features.
Unlike typical GAN-based vocoders, Vocos does not model audio samples in the time domain.
Instead, it generates spectral coefficients, enabling rapid audio reconstruction through the
inverse Fourier transform.
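
Vocos predicts spectral coefficients and turns them into samples with an inverse Fourier transform. The overlap-add idea behind that step can be sketched in NumPy (the frame size, hop, and window below are illustrative, not the model's actual configuration):

```python
import numpy as np

def istft_overlap_add(frames, hop):
    """Invert per-frame rFFT coefficients to a waveform via
    inverse FFT plus windowed overlap-add."""
    n_fft = 2 * (frames.shape[1] - 1)
    # Periodic Hann synthesis window: with hop = n_fft // 2 its shifted
    # copies sum to exactly 1, so interior samples reconstruct perfectly.
    window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n_fft) / n_fft)
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    for i, coeffs in enumerate(frames):
        out[i * hop : i * hop + n_fft] += window * np.fft.irfft(coeffs)
    return out

# Round trip: frame a test tone (no analysis window), then resynthesize.
hop, n_fft = 128, 256
t = np.arange(1024)
signal = np.sin(2 * np.pi * t / 64)
frames = np.stack([
    np.fft.rfft(signal[i * hop : i * hop + n_fft])
    for i in range((len(signal) - n_fft) // hop + 1)
])
recon = istft_overlap_add(frames, hop)  # matches signal away from the edges
```

In the real model the coefficients come from the network rather than from an analysis FFT, but the synthesis side is the same cheap operation, which is where the speed advantage over sample-by-sample time-domain generation comes from.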

This version of Vocos uses 80-bin mel spectrograms as acoustic features, which have been
widespread in the TTS domain since the introduction of [hifi-gan](https://github.com/jik876/hifi-gan/blob/master/meldataset.py).
The goal of this model is to provide an alternative to hifi-gan that is faster and compatible
with the acoustic output of several TTS models.
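
For context, an 80-bin mel spectrogram is obtained by projecting STFT magnitudes through a bank of triangular mel-scale filters. Below is a small NumPy sketch using the HTK mel formula; note that hifi-gan's meldataset builds its filterbank with librosa (whose default Slaney variant differs slightly), and the sample rate and FFT size here are assumptions for illustration:

```python
import numpy as np

def hz_to_mel(f):
    # HTK mel-scale formula (librosa defaults to the Slaney variant)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=44100, n_fft=1024, n_mels=80):
    """Triangular filters mapping n_fft // 2 + 1 FFT bins to n_mels mel bins."""
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    # n_mels + 2 points evenly spaced on the mel scale: each filter's
    # left edge, peak, and right edge
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

fb = mel_filterbank()  # shape (80, 513): mel bins x FFT bins
```

Multiplying this matrix by a magnitude spectrogram (then taking a log) yields the kind of 80-bin feature this vocoder consumes; in practice you would use the exact hifi-gan `mel_spectrogram` routine to guarantee compatibility.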

## Intended Uses and Limitations

The model is intended to serve as a vocoder that synthesizes audio waveforms from mel spectrograms.
It is trained to generate speech; if applied to other audio domains, it may not produce
high-quality samples.

### Installation

To use Vocos in inference mode only, install it with:

```bash
pip install git+https://github.com/langtech-bsc/vocos.git@matcha
```

### Reconstruct audio from a mel spectrogram

```python
import torch

from vocos import Vocos

vocos = Vocos.from_pretrained("patriotyk/vocos-mel-hifigan-compat-44100khz")

mel = torch.randn(1, 80, 256)  # B, C, T
audio = vocos.decode(mel)
```

### Training Data

The model was trained on a private dataset of 800+ hours of Ukrainian audiobooks, built with the [narizaka](https://github.com/patriotyk/narizaka) tool.

### Training Procedure

The model was trained for 2.0M steps (210 epochs) with a batch size of 20, using a cosine scheduler with an initial learning rate of 3e-4.

#### Training Hyperparameters

* initial_learning_rate: 3e-4
* scheduler: cosine without warmup or restarts
* mel_loss_coeff: 45
* mrd_loss_coeff: 1.0
* batch_size: 20
* num_samples: 32768
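
A cosine schedule without warmup or restarts, as listed above, decays the learning rate along half a cosine wave over training. A minimal sketch (the final learning rate of 0 and the exact scheduler implementation are assumptions; only the initial rate of 3e-4 and the 2.0M-step horizon come from this card):

```python
import math

def cosine_lr(step, total_steps, initial_lr=3e-4, final_lr=0.0):
    """Cosine decay from initial_lr to final_lr: no warmup, no restarts."""
    progress = min(step / total_steps, 1.0)
    return final_lr + 0.5 * (initial_lr - final_lr) * (1.0 + math.cos(math.pi * progress))

lr_start = cosine_lr(0, 2_000_000)          # initial_lr at step 0
lr_mid = cosine_lr(1_000_000, 2_000_000)    # half of initial_lr at the midpoint
lr_end = cosine_lr(2_000_000, 2_000_000)    # decays to final_lr by the last step
```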