Omarrran committed on
Commit
4198e4c
·
verified ·
1 Parent(s): e86d32d

Update README.md

Files changed (1)
  1. README.md +143 -0
README.md CHANGED
@@ -27,4 +27,147 @@ This log is the output from Quantized "Omarrran/quantized_english_speecht5_finet
  2024-10-22 09:40:49,200 - SpeechQuantizer - INFO - Model saved and pushed successfully
  ```
 
+ ```
+
+
+
+
+ # Quantized SpeechT5 Model Details
+
+ This section describes `Omarrran/quantized_english_speecht5_finetune-tts`, a quantized version of the SpeechT5 model.
+ ## Model Overview
+ - The checkpoint is loaded as a `SpeechT5ForSpeechToText` model, a transformer-based model for speech-to-text tasks.
+ - The model has 153.07 million parameters in total.
+ - Not all weights were initialized from the `Omarrran/quantized_english_speecht5_finetune-tts` checkpoint. According to the load log below, the attention and feed-forward weights of every encoder and decoder layer, plus the encoder's feature projection, were newly initialized, so the model needs training on a downstream task before it is used for inference.
+
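As a rough sanity check on the 153.07 million figure, the sketch below counts only the `Linear` and `LayerNorm` parameters of the transformer layers, using the layer shapes from the model printout in this README (768-wide model, 3072-wide feed-forward, 12 encoder layers, 6 decoder layers). The helper names are hypothetical, and the convolutional prenet, embeddings, and `lm_head` are deliberately left out, so the result is a lower bound rather than a reproduction of the reported total.

```python
# Hypothetical helpers: count parameters from the shapes shown in the printout.
def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Weight matrix plus optional bias vector of a Linear layer."""
    return in_features * out_features + (out_features if bias else 0)


def layer_norm_params(width: int) -> int:
    """Scale and shift vectors of an affine LayerNorm."""
    return 2 * width


D_MODEL, D_FF = 768, 3072

# q, k, v and output projections, all 768 -> 768 with bias.
attention = 4 * linear_params(D_MODEL, D_MODEL)
# intermediate_dense (768 -> 3072) and output_dense (3072 -> 768).
feed_forward = linear_params(D_MODEL, D_FF) + linear_params(D_FF, D_MODEL)

# Encoder layer: one attention block, one feed-forward block, two LayerNorms.
encoder_layer = attention + feed_forward + 2 * layer_norm_params(D_MODEL)
# Decoder layer: self-attention and encoder-decoder attention, three LayerNorms.
decoder_layer = 2 * attention + feed_forward + 3 * layer_norm_params(D_MODEL)

transformer_total = 12 * encoder_layer + 6 * decoder_layer

print(f"encoder layer:     {encoder_layer:,}")
print(f"decoder layer:     {decoder_layer:,}")
print(f"transformer total: {transformer_total:,}")
```

Under these assumptions the transformer layers alone account for about 141.8 million parameters, which suggests the remaining parameters sit in the convolutional feature encoder, the positional embeddings, and the token embeddings.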
+ ## Model Architecture
+ The model consists of two main components:
+
+ 1. **Encoder**:
+ - The encoder is an instance of `SpeechT5EncoderWithSpeechPrenet`, which combines a convolutional speech feature encoder, a feature projection layer, and a transformer encoder.
+ - The transformer encoder has 12 layers, each with a multi-head attention block and a feed-forward network.
+ - Positional information comes from convolutional and sinusoidal embeddings in the prenet, plus a relative positional encoding in the transformer encoder.
+
+ 2. **Decoder**:
+ - The decoder is an instance of `SpeechT5DecoderWithTextPrenet`, which combines a text decoder prenet and a transformer decoder.
+ - The transformer decoder has 6 layers, each with self-attention, encoder-decoder attention, and a feed-forward network.
+ - The decoder prenet uses sinusoidal positional embeddings.
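To make the role of the speech feature encoder more concrete, the sketch below estimates how many frames it emits for a given number of audio samples. It uses the kernel sizes and strides of the seven `Conv1d` layers listed in the model printout in this README and assumes no padding and no dilation (the printout shows neither). The 16 kHz sample rate in the usage line and the function names are assumptions for illustration only.

```python
# (kernel_size, stride) for conv layers 0, 1-4 and 5-6, as listed in the printout.
CONV_LAYERS = [(10, 5)] + [(3, 2)] * 4 + [(2, 2)] * 2


def num_frames(num_samples: int) -> int:
    """Output length after all conv layers, assuming padding=0 and dilation=1."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        length = (length - kernel) // stride + 1
    return length


def total_stride() -> int:
    """Product of all strides: input samples consumed per output frame."""
    product = 1
    for _, stride in CONV_LAYERS:
        product *= stride
    return product


# One second of audio, assuming a 16 kHz sample rate.
print(num_frames(16000), total_stride())
```

With these assumptions, each output frame advances by 320 input samples, so the 12-layer transformer encoder operates on a sequence far shorter than the raw waveform.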
+ ```
+
+ Some weights of SpeechT5ForSpeechToText were not initialized from the model checkpoint at Omarrran/quantized_english_speecht5_finetune-tts and are newly initialized: ['speecht5.decoder.wrapped_decoder.layers.0.encoder_attn.k_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.0.encoder_attn.k_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.0.encoder_attn.out_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.0.encoder_attn.out_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.0.encoder_attn.q_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.0.encoder_attn.q_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.0.encoder_attn.v_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.0.encoder_attn.v_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.0.feed_forward.intermediate_dense.bias', 'speecht5.decoder.wrapped_decoder.layers.0.feed_forward.intermediate_dense.weight', 'speecht5.decoder.wrapped_decoder.layers.0.feed_forward.output_dense.bias', 'speecht5.decoder.wrapped_decoder.layers.0.feed_forward.output_dense.weight', 'speecht5.decoder.wrapped_decoder.layers.0.self_attn.k_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.0.self_attn.k_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.0.self_attn.out_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.0.self_attn.out_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.0.self_attn.q_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.0.self_attn.q_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.0.self_attn.v_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.0.self_attn.v_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.1.encoder_attn.k_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.1.encoder_attn.k_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.1.encoder_attn.out_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.1.encoder_attn.out_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.1.encoder_attn.q_proj.bias', 
'speecht5.decoder.wrapped_decoder.layers.1.encoder_attn.q_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.1.encoder_attn.v_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.1.encoder_attn.v_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.1.feed_forward.intermediate_dense.bias', 'speecht5.decoder.wrapped_decoder.layers.1.feed_forward.intermediate_dense.weight', 'speecht5.decoder.wrapped_decoder.layers.1.feed_forward.output_dense.bias', 'speecht5.decoder.wrapped_decoder.layers.1.feed_forward.output_dense.weight', 'speecht5.decoder.wrapped_decoder.layers.1.self_attn.k_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.1.self_attn.k_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.1.self_attn.out_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.1.self_attn.out_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.1.self_attn.q_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.1.self_attn.q_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.1.self_attn.v_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.1.self_attn.v_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.2.encoder_attn.k_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.2.encoder_attn.k_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.2.encoder_attn.out_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.2.encoder_attn.out_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.2.encoder_attn.q_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.2.encoder_attn.q_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.2.encoder_attn.v_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.2.encoder_attn.v_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.2.feed_forward.intermediate_dense.bias', 'speecht5.decoder.wrapped_decoder.layers.2.feed_forward.intermediate_dense.weight', 'speecht5.decoder.wrapped_decoder.layers.2.feed_forward.output_dense.bias', 'speecht5.decoder.wrapped_decoder.layers.2.feed_forward.output_dense.weight', 
'speecht5.decoder.wrapped_decoder.layers.2.self_attn.k_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.2.self_attn.k_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.2.self_attn.out_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.2.self_attn.out_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.2.self_attn.q_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.2.self_attn.q_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.2.self_attn.v_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.2.self_attn.v_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.3.encoder_attn.k_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.3.encoder_attn.k_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.3.encoder_attn.out_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.3.encoder_attn.out_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.3.encoder_attn.q_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.3.encoder_attn.q_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.3.encoder_attn.v_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.3.encoder_attn.v_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.3.feed_forward.intermediate_dense.bias', 'speecht5.decoder.wrapped_decoder.layers.3.feed_forward.intermediate_dense.weight', 'speecht5.decoder.wrapped_decoder.layers.3.feed_forward.output_dense.bias', 'speecht5.decoder.wrapped_decoder.layers.3.feed_forward.output_dense.weight', 'speecht5.decoder.wrapped_decoder.layers.3.self_attn.k_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.3.self_attn.k_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.3.self_attn.out_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.3.self_attn.out_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.3.self_attn.q_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.3.self_attn.q_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.3.self_attn.v_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.3.self_attn.v_proj.weight', 
'speecht5.decoder.wrapped_decoder.layers.4.encoder_attn.k_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.4.encoder_attn.k_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.4.encoder_attn.out_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.4.encoder_attn.out_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.4.encoder_attn.q_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.4.encoder_attn.q_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.4.encoder_attn.v_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.4.encoder_attn.v_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.4.feed_forward.intermediate_dense.bias', 'speecht5.decoder.wrapped_decoder.layers.4.feed_forward.intermediate_dense.weight', 'speecht5.decoder.wrapped_decoder.layers.4.feed_forward.output_dense.bias', 'speecht5.decoder.wrapped_decoder.layers.4.feed_forward.output_dense.weight', 'speecht5.decoder.wrapped_decoder.layers.4.self_attn.k_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.4.self_attn.k_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.4.self_attn.out_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.4.self_attn.out_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.4.self_attn.q_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.4.self_attn.q_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.4.self_attn.v_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.4.self_attn.v_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.5.encoder_attn.k_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.5.encoder_attn.k_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.5.encoder_attn.out_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.5.encoder_attn.out_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.5.encoder_attn.q_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.5.encoder_attn.q_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.5.encoder_attn.v_proj.bias', 
'speecht5.decoder.wrapped_decoder.layers.5.encoder_attn.v_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.5.feed_forward.intermediate_dense.bias', 'speecht5.decoder.wrapped_decoder.layers.5.feed_forward.intermediate_dense.weight', 'speecht5.decoder.wrapped_decoder.layers.5.feed_forward.output_dense.bias', 'speecht5.decoder.wrapped_decoder.layers.5.feed_forward.output_dense.weight', 'speecht5.decoder.wrapped_decoder.layers.5.self_attn.k_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.5.self_attn.k_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.5.self_attn.out_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.5.self_attn.out_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.5.self_attn.q_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.5.self_attn.q_proj.weight', 'speecht5.decoder.wrapped_decoder.layers.5.self_attn.v_proj.bias', 'speecht5.decoder.wrapped_decoder.layers.5.self_attn.v_proj.weight', 'speecht5.encoder.prenet.feature_projection.projection.bias', 'speecht5.encoder.prenet.feature_projection.projection.weight', 'speecht5.encoder.wrapped_encoder.layers.0.attention.k_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.0.attention.k_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.0.attention.out_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.0.attention.out_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.0.attention.q_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.0.attention.q_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.0.attention.v_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.0.attention.v_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.0.feed_forward.intermediate_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.0.feed_forward.intermediate_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.0.feed_forward.output_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.0.feed_forward.output_dense.weight', 
'speecht5.encoder.wrapped_encoder.layers.1.attention.k_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.1.attention.k_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.1.attention.out_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.1.attention.out_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.1.attention.q_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.1.attention.q_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.1.attention.v_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.1.attention.v_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.1.feed_forward.intermediate_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.1.feed_forward.intermediate_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.1.feed_forward.output_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.1.feed_forward.output_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.10.attention.k_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.10.attention.k_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.10.attention.out_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.10.attention.out_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.10.attention.q_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.10.attention.q_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.10.attention.v_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.10.attention.v_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.10.feed_forward.intermediate_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.10.feed_forward.intermediate_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.10.feed_forward.output_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.10.feed_forward.output_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.11.attention.k_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.11.attention.k_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.11.attention.out_proj.bias', 
'speecht5.encoder.wrapped_encoder.layers.11.attention.out_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.11.attention.q_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.11.attention.q_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.11.attention.v_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.11.attention.v_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.11.feed_forward.intermediate_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.11.feed_forward.intermediate_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.11.feed_forward.output_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.11.feed_forward.output_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.2.attention.k_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.2.attention.k_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.2.attention.out_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.2.attention.out_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.2.attention.q_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.2.attention.q_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.2.attention.v_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.2.attention.v_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.2.feed_forward.intermediate_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.2.feed_forward.intermediate_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.2.feed_forward.output_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.2.feed_forward.output_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.3.attention.k_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.3.attention.k_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.3.attention.out_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.3.attention.out_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.3.attention.q_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.3.attention.q_proj.weight', 
'speecht5.encoder.wrapped_encoder.layers.3.attention.v_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.3.attention.v_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.3.feed_forward.intermediate_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.3.feed_forward.intermediate_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.3.feed_forward.output_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.3.feed_forward.output_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.4.attention.k_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.4.attention.k_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.4.attention.out_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.4.attention.out_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.4.attention.q_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.4.attention.q_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.4.attention.v_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.4.attention.v_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.4.feed_forward.intermediate_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.4.feed_forward.intermediate_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.4.feed_forward.output_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.4.feed_forward.output_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.5.attention.k_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.5.attention.k_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.5.attention.out_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.5.attention.out_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.5.attention.q_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.5.attention.q_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.5.attention.v_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.5.attention.v_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.5.feed_forward.intermediate_dense.bias', 
'speecht5.encoder.wrapped_encoder.layers.5.feed_forward.intermediate_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.5.feed_forward.output_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.5.feed_forward.output_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.6.attention.k_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.6.attention.k_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.6.attention.out_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.6.attention.out_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.6.attention.q_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.6.attention.q_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.6.attention.v_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.6.attention.v_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.6.feed_forward.intermediate_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.6.feed_forward.intermediate_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.6.feed_forward.output_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.6.feed_forward.output_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.7.attention.k_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.7.attention.k_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.7.attention.out_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.7.attention.out_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.7.attention.q_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.7.attention.q_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.7.attention.v_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.7.attention.v_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.7.feed_forward.intermediate_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.7.feed_forward.intermediate_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.7.feed_forward.output_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.7.feed_forward.output_dense.weight', 
'speecht5.encoder.wrapped_encoder.layers.8.attention.k_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.8.attention.k_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.8.attention.out_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.8.attention.out_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.8.attention.q_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.8.attention.q_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.8.attention.v_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.8.attention.v_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.8.feed_forward.intermediate_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.8.feed_forward.intermediate_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.8.feed_forward.output_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.8.feed_forward.output_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.9.attention.k_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.9.attention.k_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.9.attention.out_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.9.attention.out_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.9.attention.q_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.9.attention.q_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.9.attention.v_proj.bias', 'speecht5.encoder.wrapped_encoder.layers.9.attention.v_proj.weight', 'speecht5.encoder.wrapped_encoder.layers.9.feed_forward.intermediate_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.9.feed_forward.intermediate_dense.weight', 'speecht5.encoder.wrapped_encoder.layers.9.feed_forward.output_dense.bias', 'speecht5.encoder.wrapped_encoder.layers.9.feed_forward.output_dense.weight']
+ You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
+ Model Size: 153.07 million parameters
+ Model Details:
+ SpeechT5ForSpeechToText(
+   (speecht5): SpeechT5Model(
+     (encoder): SpeechT5EncoderWithSpeechPrenet(
+       (prenet): SpeechT5SpeechEncoderPrenet(
+         (feature_encoder): SpeechT5FeatureEncoder(
+           (conv_layers): ModuleList(
+             (0): SpeechT5GroupNormConvLayer(
+               (conv): Conv1d(1, 512, kernel_size=(10,), stride=(5,), bias=False)
+               (activation): GELUActivation()
+               (layer_norm): GroupNorm(512, 512, eps=1e-05, affine=True)
+             )
+             (1-4): 4 x SpeechT5NoLayerNormConvLayer(
+               (conv): Conv1d(512, 512, kernel_size=(3,), stride=(2,), bias=False)
+               (activation): GELUActivation()
+             )
+             (5-6): 2 x SpeechT5NoLayerNormConvLayer(
+               (conv): Conv1d(512, 512, kernel_size=(2,), stride=(2,), bias=False)
+               (activation): GELUActivation()
+             )
+           )
+         )
+         (feature_projection): SpeechT5FeatureProjection(
+           (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
+           (projection): Linear(in_features=512, out_features=768, bias=True)
+           (dropout): Dropout(p=0.0, inplace=False)
+         )
+         (pos_conv_embed): SpeechT5PositionalConvEmbedding(
+           (conv): ParametrizedConv1d(
+             768, 768, kernel_size=(128,), stride=(1,), padding=(64,), groups=16
+             (parametrizations): ModuleDict(
+               (weight): ParametrizationList(
+                 (0): _WeightNorm()
+               )
+             )
+           )
+           (padding): SpeechT5SamePadLayer()
+           (activation): GELUActivation()
+         )
+         (pos_sinusoidal_embed): SpeechT5SinusoidalPositionalEmbedding()
+       )
+       (wrapped_encoder): SpeechT5Encoder(
+         (layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+         (dropout): Dropout(p=0.1, inplace=False)
+         (layers): ModuleList(
+           (0-11): 12 x SpeechT5EncoderLayer(
+             (attention): SpeechT5Attention(
+               (k_proj): Linear(in_features=768, out_features=768, bias=True)
+               (v_proj): Linear(in_features=768, out_features=768, bias=True)
+               (q_proj): Linear(in_features=768, out_features=768, bias=True)
+               (out_proj): Linear(in_features=768, out_features=768, bias=True)
+             )
+             (dropout): Dropout(p=0.1, inplace=False)
+             (layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+             (feed_forward): SpeechT5FeedForward(
+               (intermediate_dropout): Dropout(p=0.1, inplace=False)
+               (intermediate_dense): Linear(in_features=768, out_features=3072, bias=True)
+               (intermediate_act_fn): GELUActivation()
+               (output_dense): Linear(in_features=3072, out_features=768, bias=True)
+               (output_dropout): Dropout(p=0.1, inplace=False)
+             )
+             (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+           )
+         )
+         (embed_positions): SpeechT5RelativePositionalEncoding(
+           (pe_k): Embedding(320, 64)
+         )
+       )
+     )
+     (decoder): SpeechT5DecoderWithTextPrenet(
+       (prenet): SpeechT5TextDecoderPrenet(
+         (dropout): Dropout(p=0.1, inplace=False)
+         (embed_tokens): Embedding(81, 768, padding_idx=1)
+         (embed_positions): SpeechT5SinusoidalPositionalEmbedding()
+       )
+       (wrapped_decoder): SpeechT5Decoder(
+         (layers): ModuleList(
+           (0-5): 6 x SpeechT5DecoderLayer(
+             (self_attn): SpeechT5Attention(
+               (k_proj): Linear(in_features=768, out_features=768, bias=True)
+               (v_proj): Linear(in_features=768, out_features=768, bias=True)
+               (q_proj): Linear(in_features=768, out_features=768, bias=True)
+               (out_proj): Linear(in_features=768, out_features=768, bias=True)
+             )
+             (dropout): Dropout(p=0.1, inplace=False)
+             (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+             (encoder_attn): SpeechT5Attention(
+               (k_proj): Linear(in_features=768, out_features=768, bias=True)
+               (v_proj): Linear(in_features=768, out_features=768, bias=True)
+               (q_proj): Linear(in_features=768, out_features=768, bias=True)
+               (out_proj): Linear(in_features=768, out_features=768, bias=True)
+             )
+             (encoder_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+             (feed_forward): SpeechT5FeedForward(
+               (intermediate_dropout): Dropout(p=0.1, inplace=False)
+               (intermediate_dense): Linear(in_features=768, out_features=3072, bias=True)
+               (intermediate_act_fn): GELUActivation()
+               (output_dense): Linear(in_features=3072, out_features=768, bias=True)
+               (output_dropout): Dropout(p=0.1, inplace=False)
+             )
+             (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+           )
+         )
+       )
+     )
+   )
+   (text_decoder_postnet): SpeechT5TextDecoderPostnet(
+     (lm_head): Linear(in_features=768, out_features=81, bias=False)
+   )
+ )
+
+
+
  ```
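The list of newly initialized weights in the log above is long enough that it is hard to see which parts of the model it covers. The sketch below shows one way to summarize such a list by component and layer index. The function name is hypothetical, and the four sample names are copied from the log only to keep the example short; in practice the full list from the warning would be passed in.

```python
from collections import Counter


def summarize_missing(names):
    """Count parameter names per (component, layer index).

    The component is the second dotted field ("encoder" or "decoder").
    Names outside a transformer layer get a layer index of None.
    """
    counts = Counter()
    for name in names:
        parts = name.split(".")
        component = parts[1]
        if "layers" in parts:
            layer = int(parts[parts.index("layers") + 1])
        else:
            layer = None
        counts[(component, layer)] += 1
    return counts


# A few names copied from the log above (not the full list).
sample = [
    "speecht5.decoder.wrapped_decoder.layers.0.encoder_attn.k_proj.bias",
    "speecht5.encoder.prenet.feature_projection.projection.bias",
    "speecht5.encoder.wrapped_encoder.layers.10.attention.k_proj.bias",
    "speecht5.encoder.wrapped_encoder.layers.10.attention.k_proj.weight",
]

for key, count in summarize_missing(sample).items():
    print(key, count)
```

Run over the full list, a summary like this makes it easy to confirm whether every encoder and decoder layer is affected or only a subset.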