Committed by wetdog · Commit 7724eae (verified) · 1 Parent(s): 5801792

Update README.md

Files changed (1)
  1. README.md +8 -6
README.md CHANGED
@@ -5,7 +5,7 @@ datasets:
  - projecte-aina/openslr-slr69-ca-trimmed-denoised
  ---

- # Vocos-mel-22khz
+ # Vocos-mel-22khz-cat

  <!-- Provide a quick summary of what the model is/does. -->

@@ -22,12 +22,13 @@ Unlike other typical GAN-based vocoders, Vocos does not model audio samples in t
  Instead, it generates spectral coefficients, facilitating rapid audio reconstruction through
  inverse Fourier transform.

- This version of vocos uses 80-bin mel spectrograms as acoustic features which are widespread
+ This version of **Vocos** uses 80-bin mel spectrograms as acoustic features which are widespread
  in the TTS domain since the introduction of [hifi-gan](https://github.com/jik876/hifi-gan/blob/master/meldataset.py)
  The goal of this model is to provide an alternative to hifi-gan that is faster and compatible with the
- acoustic output of several TTS models.
-
+ acoustic output of several TTS models. This version is tailored for the Catalan language,
+ as it was trained only on Catalan speech datasets.

+ We are grateful with the authors for open sourcing the code allowing us to modify and train this version.

  ## Intended Uses and limitations

@@ -79,6 +80,7 @@ We also release a onnx version of the model, you can check in colab:
  <a target="_blank" href="https://colab.research.google.com/github/langtech-bsc/vocos/blob/matcha/notebooks/vocos_22khz_onnx_inference.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
  </a>
+
  ## Training Details

  ### Training Data
@@ -98,7 +100,7 @@ The model was trained on 3 Catalan speech datasets
  ### Training Procedure

  <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
- The model was trained for 1M steps and 1k epochs with a batch size of 16 for stability. We used a Cosine scheduler with a initial learning rate of 5e-4.
+ The model was trained for 1.5M steps and 1.3k epochs with a batch size of 16 for stability. We used a Cosine scheduler with a initial learning rate of 5e-4.
  We also modified the mel spectrogram loss to use 128 bins and fmax of 11025 instead of the same input mel spectrogram.


@@ -116,7 +118,7 @@ We also modified the mel spectrogram loss to use 128 bins and fmax of 11025 inst

  <!-- This section describes the evaluation protocols and provides the results. -->

- Evaluation was done using the metrics on the original repo, after ~ 1000 epochs we achieve:
+ Evaluation was done using the metrics on the [original repo](https://github.com/gemelo-ai/vocos), after ~ 1000 epochs we achieve:

  * val_loss: 3.57
  * f1_score: 0.95
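
Reviewer note on the Training Procedure hunk: the updated README states 1.5M steps and a cosine scheduler starting at 5e-4, but not the floor learning rate or whether warmup was used. The sketch below only illustrates what a plain cosine decay over that step budget would look like. The decay to zero, the absence of warmup, and the function name `cosine_lr` are assumptions made for illustration, not details taken from the commit or from the Vocos training code.

```python
import math

INITIAL_LR = 5e-4        # stated in the README
TOTAL_STEPS = 1_500_000  # stated in the README (1.5M steps)
MIN_LR = 0.0             # assumption: the README does not give a floor value


def cosine_lr(step, total_steps=TOTAL_STEPS, initial_lr=INITIAL_LR, min_lr=MIN_LR):
    """Return the cosine-annealed learning rate at a given optimizer step."""
    # Clamp so that out-of-range steps map to the schedule endpoints.
    step = min(max(step, 0), total_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return min_lr + (initial_lr - min_lr) * cosine


if __name__ == "__main__":
    # Print the rate at the start, quarter points, and end of training.
    for step in (0, 375_000, 750_000, 1_125_000, 1_500_000):
        print(f"step {step:>9,}: lr = {cosine_lr(step):.2e}")
```

Under these assumptions the rate is half its initial value at the midpoint (750k steps) and reaches the floor at the final step. If the actual run used warmup or a non-zero floor, the curve would differ accordingly.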