---
license: cc-by-nc-4.0
tags:
- tts
- gpt2
- vae
pipeline_tag: text-to-speech
---

# Malayalam Text-to-Speech

This repository contains the **Swaram (mal)** text-to-speech (TTS) model checkpoint.

## Model Details

**Swaram** (**S**tochastic **W**aveform **A**daptive **R**ecurrent **A**utoencoder for **M**alayalam) is a speech synthesis model that generates speech waveforms conditioned on input text sequences. It is based on a **conditional variational autoencoder** (VAE) architecture.

Swaram's text encoder is built on top of the **Wav2Vec2 decoder**, and a **VAE** is used as the decoder. A **flow-based module**, composed of the **Transformer-based Contextualizer** and cascaded dense layers, predicts **spectrogram-based acoustic features**. The spectrogram is then transformed into a speech waveform by a stack of **transposed convolutional layers**. To capture the one-to-many nature of TTS, where the same text can be spoken in multiple ways, the model also includes a stochastic duration predictor, which allows varied speech rhythms for the same text input.

## Architecture

![architecture](architecture.png)

## Usage

First install the latest versions of the required libraries:

```
pip install --upgrade transformers accelerate
```

Then run inference with the following code snippet:

```python
from transformers import VitsModel, AutoTokenizer
import torch

# Load the checkpoint and its tokenizer from the Hugging Face Hub
model = VitsModel.from_pretrained("aoxo/swaram")
tokenizer = AutoTokenizer.from_pretrained("aoxo/swaram")

text = "കള്ളാ കടയാടി മോനെ"
inputs = tokenizer(text, return_tensors="pt")

# Inference only, so gradient tracking is disabled
with torch.no_grad():
    output = model(**inputs).waveform
```

The resulting waveform can be saved as a `.wav` file:

```python
import scipy.io.wavfile

# The waveform has shape (batch, samples); drop the batch dimension and convert to NumPy
scipy.io.wavfile.write("kadayadi_mone.wav", rate=model.config.sampling_rate, data=output.squeeze().numpy())
```

Or played in a Jupyter Notebook / Google Colab:

```python
from IPython.display import Audio

Audio(output.squeeze().numpy(), rate=model.config.sampling_rate)
```

## License

The model is licensed under **CC-BY-NC 4.0**.
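
## Saving without SciPy

The Usage section saves the waveform with SciPy. Where SciPy is not installed, a similar result can be had with Python's standard `wave` module. The following is a minimal sketch, not part of the released code: the helper name `save_wav_int16` is hypothetical, and it assumes the waveform has already been flattened to a sequence of floats in the range [-1, 1] (for example via `output.squeeze().tolist()`).

```python
import math
import struct
import wave


def save_wav_int16(path, samples, sample_rate):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file.

    Hypothetical helper for illustration; returns the number of samples written.
    """
    frames = bytearray()
    for s in samples:
        # Clip to the valid range, then scale to signed 16-bit integers.
        clipped = max(-1.0, min(1.0, float(s)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
    return len(frames) // 2


if __name__ == "__main__":
    # Stand-in signal: one second of a 440 Hz tone at an assumed 16 kHz rate.
    # With the model loaded, one would instead pass
    #   samples = output.squeeze().tolist()
    #   sample_rate = model.config.sampling_rate
    rate = 16000
    tone = [0.3 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
    save_wav_int16("tone_example.wav", tone, rate)
```

Converting to 16-bit PCM keeps the file playable in most audio tools, at the cost of quantizing the float output; values outside [-1, 1] are clipped rather than wrapped.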