|
---
license: mit
language:
- en
new_version: Omarrran/quantized_english_speecht5_finetune-tts
pipeline_tag: text-to-speech
tags:
- quantized
library_name: transformers
datasets:
- erenfazlioglu/turkishvoicedataset
---
|
# Quantized Model

**Note:** *This report was prepared as a task for the IIT Roorkee PARIMAL intern program. It is intended for review purposes only and does not represent an actual research project or a production-ready model.*
|
|
|
|
|
|
|
| <span style="font-size: 20px; color: #2E86C1;">Resource Links</span> | <span style="font-size: 20px; color: #2ECC71;">**English Model**<br>[📚 Model Report Card](https://huggingface.co/Omarrran/english_speecht5_finetuned/blob/main/README.md)<br><br>[💻 GitHub Repo](https://github.com/HAQ-NAWAZ-MALIK/TTS-MODEL-Fine-tuned)</span> | <span style="font-size: 20px; color: #E74C3C;">**Turkish Model**<br>[📚 Turkish Model Report Card](https://huggingface.co/Omarrran/turkish_finetuned_speecht5_tts/blob/main/README.md)<br>[💻 GitHub Repo](https://github.com/HAQ-NAWAZ-MALIK/turkish_finetuned_speecht5_tts/tree/main)</span> | <span style="font-size: 20px; color: #9B59B6;">**Quantized Model**<br>[📚 Quantized Model Report Card](https://huggingface.co/Omarrran/quantized_english_speecht5_finetune-tts)</span> |
|--------------|--------------------------|-------------------------------------|-------------------------------------|
|
|
|
## Check Reduced Files and Sizes

Browse the quantized model files and their reduced sizes here:

https://huggingface.co/Omarrran/quantized_english_speecht5_finetune-tts/tree/main
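
The file sizes can also be inspected programmatically. Below is a minimal sketch using the `huggingface_hub` client (the repo id is taken from the link above; it assumes a `huggingface_hub` version that provides `list_repo_tree`):

```python
from huggingface_hub import HfApi

api = HfApi()

# Walk the repo tree and print each file's size in MB
for entry in api.list_repo_tree("Omarrran/quantized_english_speecht5_finetune-tts"):
    size = getattr(entry, "size", None)  # folder entries carry no size
    if size is not None:
        print(f"{entry.path}: {size / 1e6:.1f} MB")
```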
|
|
|
## Note: This Is a Quantized Version of "Omarrran/english_speecht5_finetuned"

The log below was produced while creating the quantized **Omarrran/quantized_english_speecht5_finetune-tts** model. It records the model loading, calibration, quantization, and deployment stages. The detailed metrics in the calibration section, together with the explicit success messages at each stage, make it useful for troubleshooting, monitoring, and understanding the model's behavior.
|
```
2024-10-22 09:40:39,200 - SpeechQuantizer - INFO - Loading model components on cuda...
2024-10-22 09:40:39,307 - SpeechQuantizer - INFO - Attempting to load tokenizer from Omarrran/english_speecht5_finetuned
2024-10-22 09:40:39,416 - SpeechQuantizer - INFO - Tokenizer loaded successfully from Omarrran/english_speecht5_finetuned
2024-10-22 09:40:40,372 - SpeechQuantizer - INFO - Model components loaded successfully
2024-10-22 09:40:40,386 - SpeechQuantizer - INFO - Memory usage: RSS=3731.4MB
2024-10-22 09:40:40,395 - SpeechQuantizer - INFO - GPU memory: 2342.8MB allocated
2024-10-22 09:40:40,404 - SpeechQuantizer - INFO - Starting model calibration...
2024-10-22 09:40:40,414 - SpeechQuantizer - INFO - Generating 10 calibration samples...
2024-10-22 09:40:45,565 - SpeechQuantizer - INFO - Successfully generated 10 calibration samples
2024-10-22 09:40:45,749 - SpeechQuantizer - INFO - Calibrating model with 10 samples...
2024-10-22 09:40:45,766 - SpeechQuantizer - INFO - Calibration completed successfully: 10/10 samples processed (100%)
2024-10-22 09:40:45,785 - SpeechQuantizer - INFO - Calibration statistics:
2024-10-22 09:40:45,801 - SpeechQuantizer - INFO - - Mean Absolute Error: 0.0432
2024-10-22 09:40:45,814 - SpeechQuantizer - INFO - - Mean Squared Error: 0.0019
2024-10-22 09:40:45,824 - SpeechQuantizer - INFO - - R-squared: 0.9876
2024-10-22 09:40:45,832 - SpeechQuantizer - INFO - Calibration completed successfully
2024-10-22 09:40:45,840 - SpeechQuantizer - INFO - Starting quantization process...
2024-10-22 09:40:46,529 - SpeechQuantizer - INFO - Applying dynamic quantization...
2024-10-22 09:40:48,931 - SpeechQuantizer - INFO - Quantization completed successfully
2024-10-22 09:40:48,950 - SpeechQuantizer - INFO - Saving and pushing quantized model...
2024-10-22 09:40:49,200 - SpeechQuantizer - INFO - Model saved and pushed successfully
```
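
The `SpeechQuantizer` pipeline from the log is not public, so the sketch below is an assumption, not the actual implementation: it shows the standard PyTorch call that the "Applying dynamic quantization..." step most likely wraps.

```python
import torch
from transformers import SpeechT5ForSpeechToText

# Load the fine-tuned source model (as in the log above)
model = SpeechT5ForSpeechToText.from_pretrained("Omarrran/english_speecht5_finetuned")
model.eval()

# Dynamic quantization: Linear weights are stored as int8, while
# activations are quantized on the fly at inference time (CPU path)
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},  # target the attention and feed-forward projections
    dtype=torch.qint8,
)
```

Dynamic quantization targets the `Linear` projections, which is consistent with the attention and feed-forward weights later reported as newly initialized when the quantized checkpoint is reloaded as a float model.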
|
|
|
|
|
|
|
|
|
# Quantized SpeechT5 Model Details |
|
|
|
This section describes the quantized `Omarrran/quantized_english_speecht5_finetune-tts` model, a quantized version of the fine-tuned SpeechT5 model.

## Model Overview

- The model is a `SpeechT5ForSpeechToText` model, a transformer-based model for speech-to-text tasks.
- The model has a total of 153.07 million parameters.
- The model was not fully initialized from the pre-trained `Omarrran/quantized_english_speecht5_finetune-tts` checkpoint; some weights were newly initialized, as the loading sketch and warning below show.
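
A minimal loading sketch (assuming the standard `transformers` SpeechT5 classes; the parameter count matches the dump further below):

```python
from transformers import SpeechT5ForSpeechToText

# Reloading the quantized checkpoint as a float model triggers the
# "newly initialized" warning reproduced further below
model = SpeechT5ForSpeechToText.from_pretrained(
    "Omarrran/quantized_english_speecht5_finetune-tts"
)

n_params = sum(p.numel() for p in model.parameters())
print(f"Model Size: {n_params / 1e6:.2f} million parameters")  # ~153.07
```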
|
|
|
## Model Architecture |
|
The model consists of two main components: |
|
|
|
1. **Encoder**: |
|
- The encoder is an instance of `SpeechT5EncoderWithSpeechPrenet`, which includes a speech feature encoder, a feature projection layer, and a transformer-based encoder. |
|
- The encoder has 12 transformer layers, each with a multi-head attention mechanism and a feed-forward network. |
|
- The encoder also includes positional encoding, using both convolutional and sinusoidal embeddings. |
|
|
|
2. **Decoder**: |
|
- The decoder is an instance of `SpeechT5DecoderWithTextPrenet`, which includes a text decoder prenet and a transformer-based decoder. |
|
- The decoder has 6 transformer layers, each with a self-attention mechanism, an encoder-decoder attention mechanism, and a feed-forward network. |
|
- The decoder also includes positional encoding using sinusoidal embeddings. (Both layer counts can be verified with the snippet below.)
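
A quick structural check, reusing `model` from the loading sketch above; the attribute paths follow the weight names in the warning below:

```python
# Encoder: 12 transformer layers behind the speech prenet
encoder_layers = model.speecht5.encoder.wrapped_encoder.layers
print(len(encoder_layers))  # 12

# Decoder: 6 transformer layers behind the text prenet
decoder_layers = model.speecht5.decoder.wrapped_decoder.layers
print(len(decoder_layers))  # 6
```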
|
```
Some weights of SpeechT5ForSpeechToText were not initialized from
the model checkpoint at Omarrran/quantized_english_speecht5_finetune-tts and are newly initialized: ['speecht5.decoder.wrapped_decoder.layers.0.encoder_attn.k_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.0.encoder_attn.k_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.0.encoder_attn.out_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.0.encoder_attn.out_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.0.encoder_attn.q_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.0.encoder_attn.q_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.0.encoder_attn.v_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.0.encoder_attn.v_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.0.feed_forward.intermediate_dense.bias', 'speecht5.decoder.wrapped_decoder.layers.0.feed_forward.intermediate_dense.weight', 'speecht5.decoder.wrapped_decoder.layers.0.feed_forward.output_dense.bias', 'speecht5.decoder.wrapped_decoder.layers.0.feed_forward.output_dense.weight', 'speecht5.decoder.wrapped_decoder.layers.0.self_attn.k_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.0.self_attn.k_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.0.self_attn.out_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.0.self_attn.out_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.0.self_attn.q_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.0.self_attn.q_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.0.self_attn.v_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.0.self_attn.v_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.1.encoder_attn.k_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.1.encoder_attn.k_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.1.encoder_attn.out_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.1.encoder_attn.out_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.1.encoder_attn.q_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.1.encoder_attn.q_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.1.encoder_attn.v_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.1.encoder_attn.v_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.1.feed_forward.intermediate_dense.bias', 'speecht5.decoder.wrapped_decoder.layers.1.feed_forward.intermediate_dense.weight', 'speecht5.decoder.wrapped_decoder.layers.1.feed_forward.output_dense.bias', 'speecht5.decoder.wrapped_decoder.layers.1.feed_forward.output_dense.weight', 'speecht5.decoder.wrapped_decoder.layers.1.self_attn.k_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.1.self_attn.k_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.1.self_attn.out_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.1.self_attn.out_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.1.self_attn.q_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.1.self_attn.q_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.1.self_attn.v_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.1.self_attn.v_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.2.encoder_attn.k_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.2.encoder_attn.k_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.2.encoder_attn.out_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.2.encoder_attn.out_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.2.encoder_attn.q_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.2.encoder_attn.q_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.2.encoder_attn.v_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.2.encoder_attn.v_proj.weight', 
'speecht5.decoder.wrapped_decoder.layers.2.feed_forward.intermediate_dense.bias', 'speecht5.decoder.wrapped_decoder.layers.2.feed_forward.intermediate_dense.weight', 'speecht5.decoder.wrapped_decoder.layers.2.feed_forward.output_dense.bias', 'speecht5.decoder.wrapped_decoder.layers.2.feed_forward.output_dense.weight', 'speecht5.decoder.wrapped_decoder.layers.2.self_attn.k_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.2.self_attn.k_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.2.self_attn.out_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.2.self_attn.out_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.2.self_attn.q_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.2.self_attn.q_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.2.self_attn.v_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.2.self_attn.v_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.3.encoder_attn.k_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.3.encoder_attn.k_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.3.encoder_attn.out_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.3.encoder_attn.out_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.3.encoder_attn.q_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.3.encoder_attn.q_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.3.encoder_attn.v_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.3.encoder_attn.v_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.3.feed_forward.intermediate_dense.bias', 'speecht5.decoder.wrapped_decoder.layers.3.feed_forward.intermediate_dense.weight', 'speecht5.decoder.wrapped_decoder.layers.3.feed_forward.output_dense.bias', 'speecht5.decoder.wrapped_decoder.layers.3.feed_forward.output_dense.weight', 'speecht5.decoder.wrapped_decoder.layers.3.self_attn.k_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.3.self_attn.k_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.3.self_attn.out_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.3.self_attn.out_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.3.self_attn.q_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.3.self_attn.q_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.3.self_attn.v_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.3.self_attn.v_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.4.encoder_attn.k_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.4.encoder_attn.k_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.4.encoder_attn.out_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.4.encoder_attn.out_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.4.encoder_attn.q_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.4.encoder_attn.q_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.4.encoder_attn.v_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.4.encoder_attn.v_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.4.feed_forward.intermediate_dense.bias', 'speecht5.decoder.wrapped_decoder.layers.4.feed_forward.intermediate_dense.weight', 'speecht5.decoder.wrapped_decoder.layers.4.feed_forward.output_dense.bias', 'speecht5.decoder.wrapped_decoder.layers.4.feed_forward.output_dense.weight', 'speecht5.decoder.wrapped_decoder.layers.4.self_attn.k_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.4.self_attn.k_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.4.self_attn.out_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.4.self_attn.out_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.4.self_attn.q_proj.bias', 
'speecht5.decoder.wrapped_decoder.layers.4.self_attn.q_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.4.self_attn.v_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.4.self_attn.v_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.5.encoder_attn.k_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.5.encoder_attn.k_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.5.encoder_attn.out_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.5.encoder_attn.out_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.5.encoder_attn.q_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.5.encoder_attn.q_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.5.encoder_attn.v_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.5.encoder_attn.v_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.5.feed_forward.intermediate_dense.bias', 'speecht5.decoder.wrapped_decoder.layers.5.feed_forward.intermediate_dense.weight', 'speecht5.decoder.wrapped_decoder.layers.5.feed_forward.output_dense.bias', 'speecht5.decoder.wrapped_decoder.layers.5.feed_forward.output_dense.weight', 'speecht5.decoder.wrapped_decoder.layers.5.self_attn.k_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.5.self_attn.k_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.5.self_attn.out_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.5.self_attn.out_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.5.self_attn.q_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.5.self_attn.q_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.5.self_attn.v_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.5.self_attn.v_proj.weight', 'speecht5.encoder.prenet.feature_projection.projection.bias', 'speecht5.encoder.prenet.feature_projection.projection.weight', 'speecht5.encoder.wrapped_encoder.layers.0.attention.k_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.0.attention.k_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.0.attention.out_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.0.attention.out_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.0.attention.q_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.0.attention.q_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.0.attention.v_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.0.attention.v_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.0.feed_forward.intermediate_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.0.feed_forward.intermediate_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.0.feed_forward.output_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.0.feed_forward.output_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.1.attention.k_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.1.attention.k_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.1.attention.out_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.1.attention.out_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.1.attention.q_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.1.attention.q_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.1.attention.v_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.1.attention.v_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.1.feed_forward.intermediate_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.1.feed_forward.intermediate_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.1.feed_forward.output_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.1.feed_forward.output_dense.weight', 
'speecht5.encoder.wrapped_encoder.layers.10.attention.k_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.10.attention.k_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.10.attention.out_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.10.attention.out_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.10.attention.q_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.10.attention.q_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.10.attention.v_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.10.attention.v_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.10.feed_forward.intermediate_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.10.feed_forward.intermediate_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.10.feed_forward.output_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.10.feed_forward.output_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.11.attention.k_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.11.attention.k_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.11.attention.out_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.11.attention.out_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.11.attention.q_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.11.attention.q_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.11.attention.v_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.11.attention.v_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.11.feed_forward.intermediate_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.11.feed_forward.intermediate_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.11.feed_forward.output_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.11.feed_forward.output_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.2.attention.k_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.2.attention.k_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.2.attention.out_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.2.attention.out_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.2.attention.q_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.2.attention.q_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.2.attention.v_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.2.attention.v_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.2.feed_forward.intermediate_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.2.feed_forward.intermediate_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.2.feed_forward.output_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.2.feed_forward.output_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.3.attention.k_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.3.attention.k_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.3.attention.out_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.3.attention.out_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.3.attention.q_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.3.attention.q_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.3.attention.v_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.3.attention.v_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.3.feed_forward.intermediate_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.3.feed_forward.intermediate_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.3.feed_forward.output_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.3.feed_forward.output_dense.weight', 
'speecht5.encoder.wrapped_encoder.layers.4.attention.k_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.4.attention.k_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.4.attention.out_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.4.attention.out_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.4.attention.q_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.4.attention.q_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.4.attention.v_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.4.attention.v_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.4.feed_forward.intermediate_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.4.feed_forward.intermediate_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.4.feed_forward.output_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.4.feed_forward.output_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.5.attention.k_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.5.attention.k_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.5.attention.out_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.5.attention.out_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.5.attention.q_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.5.attention.q_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.5.attention.v_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.5.attention.v_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.5.feed_forward.intermediate_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.5.feed_forward.intermediate_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.5.feed_forward.output_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.5.feed_forward.output_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.6.attention.k_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.6.attention.k_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.6.attention.out_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.6.attention.out_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.6.attention.q_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.6.attention.q_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.6.attention.v_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.6.attention.v_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.6.feed_forward.intermediate_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.6.feed_forward.intermediate_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.6.feed_forward.output_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.6.feed_forward.output_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.7.attention.k_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.7.attention.k_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.7.attention.out_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.7.attention.out_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.7.attention.q_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.7.attention.q_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.7.attention.v_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.7.attention.v_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.7.feed_forward.intermediate_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.7.feed_forward.intermediate_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.7.feed_forward.output_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.7.feed_forward.output_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.8.attention.k_proj.bias', 
'speecht5.encoder.wrapped_encoder.layers.8.attention.k_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.8.attention.out_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.8.attention.out_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.8.attention.q_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.8.attention.q_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.8.attention.v_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.8.attention.v_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.8.feed_forward.intermediate_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.8.feed_forward.intermediate_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.8.feed_forward.output_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.8.feed_forward.output_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.9.attention.k_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.9.attention.k_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.9.attention.out_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.9.attention.out_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.9.attention.q_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.9.attention.q_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.9.attention.v_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.9.attention.v_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.9.feed_forward.intermediate_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.9.feed_forward.intermediate_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.9.feed_forward.output_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.9.feed_forward.output_dense.weight'] |
|
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Model Size: 153.07 million parameters

Model Details:
SpeechT5ForSpeechToText(
  (speecht5): SpeechT5Model(
    (encoder): SpeechT5EncoderWithSpeechPrenet(
      (prenet): SpeechT5SpeechEncoderPrenet(
        (feature_encoder): SpeechT5FeatureEncoder(
          (conv_layers): ModuleList(
            (0): SpeechT5GroupNormConvLayer(
              (conv): Conv1d(1, 512, kernel_size=(10,), stride=(5,), bias=False)
              (activation): GELUActivation()
              (layer_norm): GroupNorm(512, 512, eps=1e-05, affine=True)
            )
            (1-4): 4 x SpeechT5NoLayerNormConvLayer(
              (conv): Conv1d(512, 512, kernel_size=(3,), stride=(2,), bias=False)
              (activation): GELUActivation()
            )
            (5-6): 2 x SpeechT5NoLayerNormConvLayer(
              (conv): Conv1d(512, 512, kernel_size=(2,), stride=(2,), bias=False)
              (activation): GELUActivation()
            )
          )
        )
        (feature_projection): SpeechT5FeatureProjection(
          (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (projection): Linear(in_features=512, out_features=768, bias=True)
          (dropout): Dropout(p=0.0, inplace=False)
        )
        (pos_conv_embed): SpeechT5PositionalConvEmbedding(
          (conv): ParametrizedConv1d(
            768, 768, kernel_size=(128,), stride=(1,), padding=(64,), groups=16
            (parametrizations): ModuleDict(
              (weight): ParametrizationList(
                (0): _WeightNorm()
              )
            )
          )
          (padding): SpeechT5SamePadLayer()
          (activation): GELUActivation()
        )
        (pos_sinusoidal_embed): SpeechT5SinusoidalPositionalEmbedding()
      )
      (wrapped_encoder): SpeechT5Encoder(
        (layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (layers): ModuleList(
          (0-11): 12 x SpeechT5EncoderLayer(
            (attention): SpeechT5Attention(
              (k_proj): Linear(in_features=768, out_features=768, bias=True)
              (v_proj): Linear(in_features=768, out_features=768, bias=True)
              (q_proj): Linear(in_features=768, out_features=768, bias=True)
              (out_proj): Linear(in_features=768, out_features=768, bias=True)
            )
            (dropout): Dropout(p=0.1, inplace=False)
            (layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (feed_forward): SpeechT5FeedForward(
              (intermediate_dropout): Dropout(p=0.1, inplace=False)
              (intermediate_dense): Linear(in_features=768, out_features=3072, bias=True)
              (intermediate_act_fn): GELUActivation()
              (output_dense): Linear(in_features=3072, out_features=768, bias=True)
              (output_dropout): Dropout(p=0.1, inplace=False)
            )
            (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          )
        )
        (embed_positions): SpeechT5RelativePositionalEncoding(
          (pe_k): Embedding(320, 64)
        )
      )
    )
    (decoder): SpeechT5DecoderWithTextPrenet(
      (prenet): SpeechT5TextDecoderPrenet(
        (dropout): Dropout(p=0.1, inplace=False)
        (embed_tokens): Embedding(81, 768, padding_idx=1)
        (embed_positions): SpeechT5SinusoidalPositionalEmbedding()
      )
      (wrapped_decoder): SpeechT5Decoder(
        (layers): ModuleList(
          (0-5): 6 x SpeechT5DecoderLayer(
            (self_attn): SpeechT5Attention(
              (k_proj): Linear(in_features=768, out_features=768, bias=True)
              (v_proj): Linear(in_features=768, out_features=768, bias=True)
              (q_proj): Linear(in_features=768, out_features=768, bias=True)
              (out_proj): Linear(in_features=768, out_features=768, bias=True)
            )
            (dropout): Dropout(p=0.1, inplace=False)
            (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (encoder_attn): SpeechT5Attention(
              (k_proj): Linear(in_features=768, out_features=768, bias=True)
              (v_proj): Linear(in_features=768, out_features=768, bias=True)
              (q_proj): Linear(in_features=768, out_features=768, bias=True)
              (out_proj): Linear(in_features=768, out_features=768, bias=True)
            )
            (encoder_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (feed_forward): SpeechT5FeedForward(
              (intermediate_dropout): Dropout(p=0.1, inplace=False)
              (intermediate_dense): Linear(in_features=768, out_features=3072, bias=True)
              (intermediate_act_fn): GELUActivation()
              (output_dense): Linear(in_features=3072, out_features=768, bias=True)
              (output_dropout): Dropout(p=0.1, inplace=False)
            )
            (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          )
        )
      )
    )
  )
  (text_decoder_postnet): SpeechT5TextDecoderPostnet(
    (lm_head): Linear(in_features=768, out_features=81, bias=False)
  )
)
``` |
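
To confirm the size reduction locally, one option is to serialize both state dicts and compare file sizes. This is a sketch, assuming `model` and `quantized_model` from the dynamic-quantization example earlier:

```python
import os

import torch

# Save the fp32 and int8 state dicts side by side
torch.save(model.state_dict(), "speecht5_fp32.pt")
torch.save(quantized_model.state_dict(), "speecht5_int8.pt")

# Compare on-disk sizes in MB
for path in ("speecht5_fp32.pt", "speecht5_int8.pt"):
    print(path, f"{os.path.getsize(path) / 1e6:.1f} MB")
```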