---
library_name: transformers
tags:
- speech
- text-to-speech
datasets:
- openslr/librispeech_asr
- MushanW/GLOBE
- MikhailT/hifi-tts
---
# Model Card for Voicera
Voicera is an autoregressive (AR) text-to-speech model trained on roughly 1,000 hours of speech data.
Speech is converted to discrete tokens using the Multi-Scale Neural Audio Codec (SNAC) model.
**NB: This is not a SOTA model, and it is not accurate enough for production use cases.**
## Model Details
### Model Description
"Voicera" is a text-to-speech (TTS) model that generates speech from written text.
It uses a GPT-2-style decoder-only architecture, which helps it produce natural and expressive speech.
The model converts audio into discrete tokens using the Multi-Scale Neural Audio Codec (SNAC) model, allowing it to model and produce speech sounds.
Voicera aims for clear, intelligible speech, focusing on natural pronunciation and intonation.
It is a project to explore TTS technology and improve audio output quality.
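The token conversion described above can be sketched as follows. SNAC encodes audio into hierarchical code streams at different time resolutions (for the 24 kHz model, in a 1:2:4 ratio), and to feed a single GPT-2-style sequence model these streams are typically flattened frame by frame into one token sequence. The exact flattening scheme below is an illustrative assumption, not Voicera's actual code:

```python
def flatten_snac_codes(coarse, mid, fine):
    """Interleave three SNAC code streams (lengths N, 2N, 4N) into one sequence.

    Each coarse frame is followed by its two mid-level and four fine-level
    codes, yielding 7 tokens per coarse frame.
    """
    assert len(mid) == 2 * len(coarse) and len(fine) == 4 * len(coarse)
    flat = []
    for i, c in enumerate(coarse):
        flat.append(c)                       # 1 coarse code
        flat.extend(mid[2 * i: 2 * i + 2])   # 2 mid-level codes
        flat.extend(fine[4 * i: 4 * i + 4])  # 4 fine-level codes
    return flat

# Toy example: 2 coarse frames -> 2 * (1 + 2 + 4) = 14 tokens
print(flatten_snac_codes([0, 1],
                         [10, 11, 12, 13],
                         [20, 21, 22, 23, 24, 25, 26, 27]))
```

The resulting flat sequence can then be appended to the text tokens and modeled autoregressively like any other token stream.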
- **Developed by:** Lwasinam Dilli
- **Funded by:** Lwasinam Dilli
- **Model type:** GPT-2-style Transformer (decoder-only)
- **License:** Free and open to use :)
### Model Sources
<!-- Provide the basic links for the model. -->
- **Repository:** [More Information Needed]
- **Paper:** [More Information Needed]
- **Demo:** [Demos](https://lwasinam.github.io/)
## How to Get Started with the Model
There are three models: the base model and two fine-tuned on the Jenny and Expresso datasets.
The best of the three is currently the Jenny fine-tune.
Here are Colab links to all three, respectively:
1. [Base Model](https://colab.research.google.com/drive/10nPKliRs1C3ofv2J16_HGDlmzfd-yBtj#scrollTo=r17orAuZ45Q2)
2. [Jenny-Finetune](https://colab.research.google.com/drive/1MSzGGqIhGYVCn76alsX9oBzwC4EtOQSR#scrollTo=Oz0DG-MtovBw)
3. [Expresso-Finetune](https://colab.research.google.com/drive/1wzwSOtpT1CpEMvbcjvvgEKQZoQa5bX2p#scrollTo=YrBUwCNYmmUW&uniqifier=1)
## Training Details
### Training Data
The training data consists of the clean subsets of the HiFi-TTS, LibriSpeech, LibriTTS, and GLOBE datasets.
### Training Procedure
During training, audio tokens are generated with the SNAC model and concatenated with the text tokens, and the combined sequence is trained autoregressively.
Since only the audio tokens are of interest at generation time, the loss on text tokens is down-weighted by a factor of 0.1.
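The loss weighting described above can be sketched as a simple per-position mask. This is an illustrative reconstruction, not the actual Voicera training code; the names and example values are made up:

```python
TEXT_LOSS_WEIGHT = 0.1  # text-token losses are scaled down by this factor

def weighted_loss(per_token_losses, is_text_mask):
    """Average per-token losses, scaling losses at text positions by 0.1.

    per_token_losses: cross-entropy loss for each position in the sequence.
    is_text_mask: True where the target token is a text token, False for audio.
    """
    weighted = [
        loss * (TEXT_LOSS_WEIGHT if is_text else 1.0)
        for loss, is_text in zip(per_token_losses, is_text_mask)
    ]
    return sum(weighted) / len(weighted)

# Two text tokens followed by two audio tokens:
losses = [2.0, 2.0, 1.0, 1.0]
mask = [True, True, False, False]
print(round(weighted_loss(losses, mask), 6))  # (0.2 + 0.2 + 1.0 + 1.0) / 4
```

In a real PyTorch loop this would be done on the per-token cross-entropy tensor (e.g. `F.cross_entropy(..., reduction="none")`) before averaging, but the arithmetic is the same.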
#### Preprocessing
Hugging Face had pretty much all the datasets I needed; I just had to filter out audio clips longer than 10 seconds due to compute constraints.
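A minimal sketch of that duration filter (illustrative; in practice this would be a `.filter(...)` call on a Hugging Face `datasets` dataset):

```python
MAX_SECONDS = 10  # clips longer than this are dropped

def keep_clip(num_samples, sampling_rate):
    """Keep only clips no longer than MAX_SECONDS of audio."""
    return num_samples / sampling_rate <= MAX_SECONDS

# 4-second and 12-second clips at 16 kHz: only the first survives
clips = [(16000 * 4, 16000), (16000 * 12, 16000)]
print([keep_clip(n, sr) for n, sr in clips])  # [True, False]
```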
#### Training Hyperparameters
- **Weight decay:** 0.1
- **Batch size:** 1, with gradient accumulation of 32 (effective batch size 32)
- **Scheduler:** CosineAnnealingWarmRestarts, with a minimum learning rate of 1e-7 and 500 steps per warm restart
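The scheduler named above corresponds to PyTorch's `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=500, eta_min=1e-7)`. Its shape can be sketched with plain math; the base learning rate below is an assumed placeholder, since the card does not state it:

```python
import math

BASE_LR = 3e-4   # assumption: the card does not give the base learning rate
ETA_MIN = 1e-7   # minimum learning rate from the card
T_0 = 500        # steps per warm restart

def lr_at(step):
    """Cosine-annealed learning rate that restarts to BASE_LR every T_0 steps."""
    t_cur = step % T_0  # position within the current cycle
    return ETA_MIN + (BASE_LR - ETA_MIN) * (1 + math.cos(math.pi * t_cur / T_0)) / 2

print(lr_at(0))    # start of a cycle: BASE_LR
print(lr_at(250))  # midway: roughly halfway between BASE_LR and ETA_MIN
print(lr_at(500))  # warm restart: back to BASE_LR
```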
## Evaluation
I should probably work on this properly; for now, the loss went down and the output got better :)
### Results
Check out the demo page here -> [Demo](https://lwasinam.github.io/)
#### Summary
- **Hardware Type:** Tesla P100
- **Hours used:** 300+
- **Cloud Provider:** Kaggle :)
## Citation
**BibTeX:**
```bibtex
@software{Betker_TorToiSe_text-to-speech_2022,
  author = {Betker, James},
  month = apr,
  title = {{TorToiSe text-to-speech}},
  url = {https://github.com/neonbjb/tortoise-tts},
  version = {2.0},
  year = {2022}
}
@software{Siuzdak_SNAC_Multi-Scale_Neural_2024,
  author = {Siuzdak, Hubert},
  month = feb,
  title = {{SNAC: Multi-Scale Neural Audio Codec}},
  url = {https://github.com/hubertsiuzdak/snac},
  year = {2024}
}
```
## Model Card Authors
Lwasinam Dilli