projecte-aina
/

alvocat-vocos-22khz

+---
+license: mit
+datasets:
+- projecte-aina/festcat_trimmed_denoised
+- projecte-aina/openslr-slr69-ca-trimmed-denoised
+---
+# Vocos-mel-22khz
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+**Vocos** is a fast neural vocoder designed to synthesize audio waveforms from acoustic features.
+Unlike typical GAN-based vocoders, Vocos does not model audio samples in the time domain.
+Instead, it generates spectral coefficients, which allows rapid audio reconstruction through
+an inverse Fourier transform.
+This version of Vocos uses 80-bin mel spectrograms as acoustic features, a format that has been widespread
+in the TTS domain since the introduction of [HiFi-GAN](https://github.com/jik876/hifi-gan/blob/master/meldataset.py).
+The goal of this model is to provide an alternative to HiFi-GAN that is faster and compatible with the
+acoustic output of several TTS models.
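The inverse Fourier step mentioned above can be illustrated with a toy example. The sketch below is not the model's implementation: it uses a naive single-frame DFT in plain Python, whereas a real vocoder works on batched STFT frames with windowing and overlap-add. The function names `dft` and `frame_from_spectrum` are hypothetical and chosen only for illustration.

```python
import cmath
import math


def dft(frame):
    """Forward DFT of one real-valued frame (naive O(N^2) version)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def frame_from_spectrum(magnitude, phase):
    """Inverse DFT: rebuild time-domain samples from magnitude and phase."""
    n = len(magnitude)
    # Combine magnitude and phase into complex spectral coefficients.
    coeffs = [m * cmath.exp(1j * p) for m, p in zip(magnitude, phase)]
    return [
        (sum(coeffs[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


# Round trip on a toy frame: analyse it, then resynthesize it.
frame = [math.sin(2 * math.pi * 3 * t / 16) for t in range(16)]
spectrum = dft(frame)
magnitude = [abs(c) for c in spectrum]
phase = [cmath.phase(c) for c in spectrum]
reconstructed = frame_from_spectrum(magnitude, phase)
```

Because the samples are fully determined by the spectral coefficients, a network that predicts those coefficients only needs one inverse transform per frame to produce audio, rather than generating each sample in the time domain.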
+## Intended Uses and limitations
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+The model is intended to serve as a vocoder that synthesizes audio waveforms from mel spectrograms. It was trained to generate speech; if it is used in other audio
+domains, it may not produce high-quality samples.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+### Installation
+To use Vocos only in inference mode, install it using:
+```bash
+pip install git+https://github.com/langtech-bsc/vocos.git@matcha
+```
+### Reconstruct audio from mel-spectrogram
+```python
+import torch
+from vocos import Vocos
+vocos = Vocos.from_pretrained("BSC-LT/vocos-mel-22khz-cat")
+mel = torch.randn(1, 80, 256)  # B, C, T
+audio = vocos.decode(mel)
+```
+### Copy-synthesis from a file
+```python
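+import torchaudio
+# `vocos` is the model loaded in the previous example
+y, sr = torchaudio.load(YOUR_AUDIO_FILE)  # replace with the path to your audio file
+if y.size(0) > 1:  # mix to mono
+    y = y.mean(dim=0, keepdim=True)
+# Resample to the 22.05 kHz rate the model expects
+y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=22050)
+y_hat = vocos(y)  # extract mel features and resynthesize the waveform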
+```
+### ONNX
+We also release an ONNX version of the model. You can try it in Colab:
+<a target="_blank" href="https://colab.research.google.com/github/langtech-bsc/vocos/blob/matcha/notebooks/vocos_22khz_onnx_inference.ipynb">
+  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
+</a>
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+The model was trained on three Catalan speech datasets:
+| Dataset             | Language | Hours   |
+|---------------------|----------|---------|
+| Festcat             | ca       | 22      |
+| OpenSLR69           | ca       | 5       |
+| lafresca            | ca       | 5       |
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+The model was trained for 1M steps (about 1k epochs) with a batch size of 16 for stability. We used a cosine scheduler with an initial learning rate of 5e-4.
+We also modified the mel spectrogram loss to use 128 bins and an fmax of 11025 Hz, instead of computing it with the same settings as the input mel spectrogram.
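As a rough illustration of the learning-rate schedule described above, the following sketch computes a cosine-annealed rate without warmup or restarts. The initial rate and step count come from this card; the floor `MIN_LR = 0.0` is an assumption, since the card does not state a minimum learning rate, and `cosine_lr` is a hypothetical helper, not a function from the training code.

```python
import math

INITIAL_LR = 5e-4        # from the model card
TOTAL_STEPS = 1_000_000  # from the model card (1M steps)
MIN_LR = 0.0             # assumption: the card does not state a floor


def cosine_lr(step, total_steps=TOTAL_STEPS, initial_lr=INITIAL_LR, min_lr=MIN_LR):
    """Cosine-annealed learning rate without warmup or restarts."""
    step = min(max(step, 0), total_steps)  # clamp to the training range
    progress = step / total_steps
    return min_lr + 0.5 * (initial_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Print the rate at a few points along training.
for step in (0, 250_000, 500_000, 750_000, 1_000_000):
    print(f"step {step:>9,}: lr = {cosine_lr(step):.2e}")
```

Under these assumptions the rate starts at 5e-4, passes through half that value at the midpoint, and decays smoothly toward the floor at the final step.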
+#### Training Hyperparameters
+* initial_learning_rate: 5e-4
+* scheduler: cosine without warmup or restarts
+* mel_loss_coeff: 45
+* mrd_loss_coeff: 0.1
+* batch_size: 16
+* num_samples: 16384
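To put the hyperparameters above in more familiar units, the short calculation below converts the training segment length into seconds and frames. The sample rate, segment length, and batch size come from this card; `HOP_LENGTH = 256` is an assumption based on common HiFi-GAN-style 22 kHz settings and is not stated here.

```python
SAMPLE_RATE = 22_050  # Hz, from the model name and the resampling example
NUM_SAMPLES = 16_384  # training segment length, from the hyperparameters
BATCH_SIZE = 16       # from the hyperparameters
HOP_LENGTH = 256      # assumption: HiFi-GAN-style hop size, not stated in this card

segment_seconds = NUM_SAMPLES / SAMPLE_RATE
frames_per_segment = NUM_SAMPLES // HOP_LENGTH
audio_per_batch_seconds = BATCH_SIZE * segment_seconds
nyquist_hz = SAMPLE_RATE / 2  # equals the fmax of 11025 used for the mel loss

print(f"segment length: {segment_seconds:.3f} s")
print(f"frames per segment (assumed hop): {frames_per_segment}")
print(f"audio per batch: {audio_per_batch_seconds:.2f} s")
print(f"Nyquist frequency: {nyquist_hz:.0f} Hz")
```

Each training example is therefore a crop of roughly three quarters of a second, and the mel loss with an fmax of 11025 covers the full band up to the Nyquist frequency.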
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+Evaluation was done using the metrics from the original repository. After about 1000 epochs we achieve:
+* val_loss:
+* f1_score:
+* mel_loss:
+* periodicity_loss:
+* pesq_score:
+* pitch_loss:
+* utmos_score:
+## Citation
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+If this code contributes to your research, please cite the work:
+```
+@article{siuzdak2023vocos,
+  title={Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis},
+  author={Siuzdak, Hubert},
+  journal={arXiv preprint arXiv:2306.00814},
+  year={2023}
+}
+```
+## Additional information
+### Author
+The Language Technologies Unit from Barcelona Supercomputing Center.
+### Contact
+For further information, please send an email to <[email protected]>.
+### Copyright
+Copyright (c) 2024 by Language Technologies Unit, Barcelona Supercomputing Center.
+### License
+[MIT](https://opensource.org/license/mit)
+### Funding
+This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).