puccho committed on
Commit
a3146db
·
verified ·
1 Parent(s): cc4c889

Upload 24 files

.gitattributes CHANGED
@@ -33,3 +33,17 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ show_case/audio-1434542201-headset.wav filter=lfs diff=lfs merge=lfs -text
37
+ show_case/p225_002.wav filter=lfs diff=lfs merge=lfs -text
38
+ show_case/Real.wav filter=lfs diff=lfs merge=lfs -text
39
+ show_case/Scene_example.wav filter=lfs diff=lfs merge=lfs -text
40
+ show_case/SER(emotion)_example.wav filter=lfs diff=lfs merge=lfs -text
41
+ show_case/SFT_Fisher_example.wav filter=lfs diff=lfs merge=lfs -text
42
+ show_case/SG_audio_1.wav filter=lfs diff=lfs merge=lfs -text
43
+ show_case/SGR_018.wav filter=lfs diff=lfs merge=lfs -text
44
+ show_case/SLR_example.wav filter=lfs diff=lfs merge=lfs -text
45
+ show_case/SNV_example.wav filter=lfs diff=lfs merge=lfs -text
46
+ show_case/Sound_Vocal_example.wav filter=lfs diff=lfs merge=lfs -text
47
+ show_case/SVD_14154_file31512.mp3.wav_16k.wav_norm.wav_mono.wav_silence.wav filter=lfs diff=lfs merge=lfs -text
48
+ Soundwave/assets/audio/example_1.wav filter=lfs diff=lfs merge=lfs -text
49
+ Soundwave/assets/logo.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -10,5 +10,5 @@ pinned: false
10
  license: apache-2.0
11
  short_description: The Official Demo of Soundwave
12
  ---
13
+ This Space is an interactive demonstration of the paper "Soundwave" (https://arxiv.org/abs/2502.12900).
14
  Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
Soundwave/LICENSE ADDED
@@ -0,0 +1,201 @@
1
+ Apache License
2
+ Version 2.0, January 2004
3
+ http://www.apache.org/licenses/
4
+
5
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6
+
7
+ 1. Definitions.
8
+
9
+ "License" shall mean the terms and conditions for use, reproduction,
10
+ and distribution as defined by Sections 1 through 9 of this document.
11
+
12
+ "Licensor" shall mean the copyright owner or entity authorized by
13
+ the copyright owner that is granting the License.
14
+
15
+ "Legal Entity" shall mean the union of the acting entity and all
16
+ other entities that control, are controlled by, or are under common
17
+ control with that entity. For the purposes of this definition,
18
+ "control" means (i) the power, direct or indirect, to cause the
19
+ direction or management of such entity, whether by contract or
20
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
21
+ outstanding shares, or (iii) beneficial ownership of such entity.
22
+
23
+ "You" (or "Your") shall mean an individual or Legal Entity
24
+ exercising permissions granted by this License.
25
+
26
+ "Source" form shall mean the preferred form for making modifications,
27
+ including but not limited to software source code, documentation
28
+ source, and configuration files.
29
+
30
+ "Object" form shall mean any form resulting from mechanical
31
+ transformation or translation of a Source form, including but
32
+ not limited to compiled object code, generated documentation,
33
+ and conversions to other media types.
34
+
35
+ "Work" shall mean the work of authorship, whether in Source or
36
+ Object form, made available under the License, as indicated by a
37
+ copyright notice that is included in or attached to the work
38
+ (an example is provided in the Appendix below).
39
+
40
+ "Derivative Works" shall mean any work, whether in Source or Object
41
+ form, that is based on (or derived from) the Work and for which the
42
+ editorial revisions, annotations, elaborations, or other modifications
43
+ represent, as a whole, an original work of authorship. For the purposes
44
+ of this License, Derivative Works shall not include works that remain
45
+ separable from, or merely link (or bind by name) to the interfaces of,
46
+ the Work and Derivative Works thereof.
47
+
48
+ "Contribution" shall mean any work of authorship, including
49
+ the original version of the Work and any modifications or additions
50
+ to that Work or Derivative Works thereof, that is intentionally
51
+ submitted to Licensor for inclusion in the Work by the copyright owner
52
+ or by an individual or Legal Entity authorized to submit on behalf of
53
+ the copyright owner. For the purposes of this definition, "submitted"
54
+ means any form of electronic, verbal, or written communication sent
55
+ to the Licensor or its representatives, including but not limited to
56
+ communication on electronic mailing lists, source code control systems,
57
+ and issue tracking systems that are managed by, or on behalf of, the
58
+ Licensor for the purpose of discussing and improving the Work, but
59
+ excluding communication that is conspicuously marked or otherwise
60
+ designated in writing by the copyright owner as "Not a Contribution."
61
+
62
+ "Contributor" shall mean Licensor and any individual or Legal Entity
63
+ on behalf of whom a Contribution has been received by Licensor and
64
+ subsequently incorporated within the Work.
65
+
66
+ 2. Grant of Copyright License. Subject to the terms and conditions of
67
+ this License, each Contributor hereby grants to You a perpetual,
68
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69
+ copyright license to reproduce, prepare Derivative Works of,
70
+ publicly display, publicly perform, sublicense, and distribute the
71
+ Work and such Derivative Works in Source or Object form.
72
+
73
+ 3. Grant of Patent License. Subject to the terms and conditions of
74
+ this License, each Contributor hereby grants to You a perpetual,
75
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76
+ (except as stated in this section) patent license to make, have made,
77
+ use, offer to sell, sell, import, and otherwise transfer the Work,
78
+ where such license applies only to those patent claims licensable
79
+ by such Contributor that are necessarily infringed by their
80
+ Contribution(s) alone or by combination of their Contribution(s)
81
+ with the Work to which such Contribution(s) was submitted. If You
82
+ institute patent litigation against any entity (including a
83
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
84
+ or a Contribution incorporated within the Work constitutes direct
85
+ or contributory patent infringement, then any patent licenses
86
+ granted to You under this License for that Work shall terminate
87
+ as of the date such litigation is filed.
88
+
89
+ 4. Redistribution. You may reproduce and distribute copies of the
90
+ Work or Derivative Works thereof in any medium, with or without
91
+ modifications, and in Source or Object form, provided that You
92
+ meet the following conditions:
93
+
94
+ (a) You must give any other recipients of the Work or
95
+ Derivative Works a copy of this License; and
96
+
97
+ (b) You must cause any modified files to carry prominent notices
98
+ stating that You changed the files; and
99
+
100
+ (c) You must retain, in the Source form of any Derivative Works
101
+ that You distribute, all copyright, patent, trademark, and
102
+ attribution notices from the Source form of the Work,
103
+ excluding those notices that do not pertain to any part of
104
+ the Derivative Works; and
105
+
106
+ (d) If the Work includes a "NOTICE" text file as part of its
107
+ distribution, then any Derivative Works that You distribute must
108
+ include a readable copy of the attribution notices contained
109
+ within such NOTICE file, excluding those notices that do not
110
+ pertain to any part of the Derivative Works, in at least one
111
+ of the following places: within a NOTICE text file distributed
112
+ as part of the Derivative Works; within the Source form or
113
+ documentation, if provided along with the Derivative Works; or,
114
+ within a display generated by the Derivative Works, if and
115
+ wherever such third-party notices normally appear. The contents
116
+ of the NOTICE file are for informational purposes only and
117
+ do not modify the License. You may add Your own attribution
118
+ notices within Derivative Works that You distribute, alongside
119
+ or as an addendum to the NOTICE text from the Work, provided
120
+ that such additional attribution notices cannot be construed
121
+ as modifying the License.
122
+
123
+ You may add Your own copyright statement to Your modifications and
124
+ may provide additional or different license terms and conditions
125
+ for use, reproduction, or distribution of Your modifications, or
126
+ for any such Derivative Works as a whole, provided Your use,
127
+ reproduction, and distribution of the Work otherwise complies with
128
+ the conditions stated in this License.
129
+
130
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
131
+ any Contribution intentionally submitted for inclusion in the Work
132
+ by You to the Licensor shall be under the terms and conditions of
133
+ this License, without any additional terms or conditions.
134
+ Notwithstanding the above, nothing herein shall supersede or modify
135
+ the terms of any separate license agreement you may have executed
136
+ with Licensor regarding such Contributions.
137
+
138
+ 6. Trademarks. This License does not grant permission to use the trade
139
+ names, trademarks, service marks, or product names of the Licensor,
140
+ except as required for reasonable and customary use in describing the
141
+ origin of the Work and reproducing the content of the NOTICE file.
142
+
143
+ 7. Disclaimer of Warranty. Unless required by applicable law or
144
+ agreed to in writing, Licensor provides the Work (and each
145
+ Contributor provides its Contributions) on an "AS IS" BASIS,
146
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147
+ implied, including, without limitation, any warranties or conditions
148
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149
+ PARTICULAR PURPOSE. You are solely responsible for determining the
150
+ appropriateness of using or redistributing the Work and assume any
151
+ risks associated with Your exercise of permissions under this License.
152
+
153
+ 8. Limitation of Liability. In no event and under no legal theory,
154
+ whether in tort (including negligence), contract, or otherwise,
155
+ unless required by applicable law (such as deliberate and grossly
156
+ negligent acts) or agreed to in writing, shall any Contributor be
157
+ liable to You for damages, including any direct, indirect, special,
158
+ incidental, or consequential damages of any character arising as a
159
+ result of this License or out of the use or inability to use the
160
+ Work (including but not limited to damages for loss of goodwill,
161
+ work stoppage, computer failure or malfunction, or any and all
162
+ other commercial damages or losses), even if such Contributor
163
+ has been advised of the possibility of such damages.
164
+
165
+ 9. Accepting Warranty or Additional Liability. While redistributing
166
+ the Work or Derivative Works thereof, You may choose to offer,
167
+ and charge a fee for, acceptance of support, warranty, indemnity,
168
+ or other liability obligations and/or rights consistent with this
169
+ License. However, in accepting such obligations, You may act only
170
+ on Your own behalf and on Your sole responsibility, not on behalf
171
+ of any other Contributor, and only if You agree to indemnify,
172
+ defend, and hold each Contributor harmless for any liability
173
+ incurred by, or claims asserted against, such Contributor by reason
174
+ of your accepting any such warranty or additional liability.
175
+
176
+ END OF TERMS AND CONDITIONS
177
+
178
+ APPENDIX: How to apply the Apache License to your work.
179
+
180
+ To apply the Apache License to your work, attach the following
181
+ boilerplate notice, with the fields enclosed by brackets "[]"
182
+ replaced with your own identifying information. (Don't include
183
+ the brackets!) The text should be enclosed in the appropriate
184
+ comment syntax for the file format. We also recommend that a
185
+ file or class name and description of purpose be included on the
186
+ same "printed page" as the copyright notice for easier
187
+ identification within third-party archives.
188
+
189
+ Copyright [yyyy] [name of copyright owner]
190
+
191
+ Licensed under the Apache License, Version 2.0 (the "License");
192
+ you may not use this file except in compliance with the License.
193
+ You may obtain a copy of the License at
194
+
195
+ http://www.apache.org/licenses/LICENSE-2.0
196
+
197
+ Unless required by applicable law or agreed to in writing, software
198
+ distributed under the License is distributed on an "AS IS" BASIS,
199
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200
+ See the License for the specific language governing permissions and
201
+ limitations under the License.
Soundwave/README.md ADDED
@@ -0,0 +1,107 @@
1
+ # Soundwave: *Less is More* for Speech-Text Alignment in LLMs
2
+
3
+ <p align="center">
4
+ <img src="assets/logo.png" style="width:240px; height:240px; margin-bottom:10px;"/>
5
+ </p>
6
+
7
+ <p align="center">
8
+ <font size="3"><a href="https://huggingface.co/papers/2502.12900">🤗 Paper</a>&nbsp;|&nbsp;<a href="https://huggingface.co/FreedomIntelligence/Soundwave">🤗 Model</a>&nbsp;|&nbsp;<a href="https://arxiv.org/abs/2502.12900">📃 Paper</a>&nbsp;|&nbsp;<a href="https://huggingface.co/spaces/FreedomIntelligence/SoundwaveDemo">📼 Online Demo</a></font>
9
+ </p>
10
+
11
+ <div>
12
+ <h2>✨ Highlights of Our Soundwave Model!</h2>
13
+ <ul>
14
+ <font size="3"><li>A Speech-to-Text Model Bridging the Gap Between Speech and Text</li></font>
15
+ <font size="3"><li>Utilizes Data-Efficient Strategy and Unique Architecture, Trained on Only 10k Hours of Data</li></font>
16
+ <font size="3"><li>Exceptional Performance in Speech Translation and AIR-Bench Speech Tasks</li></font>
17
+ <font size="3"><li>Retains Intelligence During Conversations, Ideal for Interactive Tasks</li></font>
18
+ </ul>
19
+ </div>
20
+
21
+
22
+ ## 💌 News
23
+ > <ul>
24
+ > <font size="3"><li>[05/03/2025] 🔥 We released our Soundwave weights <a href="https://huggingface.co/FreedomIntelligence/Soundwave">🤗 Model</a>!</li></font>
25
+ > <font size="3"><li>[19/02/2025] Try our model now in the <a href="https://huggingface.co/spaces/FreedomIntelligence/SoundwaveDemo">📼 Online Demo</a>.</li></font>
26
+ > <font size="3"><li>[19/02/2025] The online demo and model weights are coming soon. </li></font>
27
+ > <font size="3"><li>[18/02/2025] Release the model architecture and inference code. </li></font>
28
+ > </ul>
29
+
30
+ ## Project Structure
31
+ ```
32
+ .
33
+ ├── assets/
34
+ │   └── audio/           # Directory for test audio files (e.g., .wav files)
35
+ ├── README.md
36
+ ├── run_inference.py     # Main inference script
37
+ └── Soundwave.py         # Model architecture
38
+ ```
39
+
40
+
41
+ ## Getting Started
42
+
43
+ ### Installation Requirements
44
+ <font size="3">Python version 3.10.11 is used in the Soundwave project.</font>
45
+ ```bash
46
+ conda create -n soundwave python=3.10.11
47
+ conda activate soundwave
48
+ pip install -r requirements.txt
49
+ ```
50
+
51
+ ## Inference
52
+ > <font size="3">Before starting, ensure you have at least 21GB of GPU memory to run our model inference.</font><br>
53
+
54
+ ### Usage Command
55
+ <font size="3">To run the inference script and process the audio, use the following command:</font>
56
+ ```bash
57
+ python run_inference.py --model_path <model_path>
58
+ # model_path: Path to the pre-trained Soundwave model.
59
+ ```
60
+
61
+ <font size="3">Below are some quick usage examples you can try:</font>
62
+ ```python
63
+ import torch
64
+ import librosa
65
+ from run_inference import load_model, gen_model_inputs, CONFIG
66
+
67
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
68
+
69
+ model, audio_processor, tokenizer = load_model("FreedomIntelligence/Soundwave", device)
70
+
71
+ # apply chat template
72
+ prompt = "What does the person say?"
73
+ model_inputs = gen_model_inputs(tokenizer, prompt, device)
74
+
75
+ # audio preprocess
76
+ audio_path = "assets/audio/example_1.wav"
77
+ audio, _ = librosa.load(audio_path, sr=CONFIG.sampling_rate, mono=True)
78
+ audio_feat = audio_processor(
79
+ audio, sampling_rate=CONFIG.sampling_rate, return_tensors="pt"
80
+ ).input_features.to(device, dtype=torch.float16)
81
+
82
+ # inference
83
+ output_ids = model.generate(
84
+ **model_inputs,
85
+ audios=audio_feat,
86
+ max_new_tokens=512,
87
+ eos_token_id=tokenizer.eos_token_id,
88
+ do_sample=True,
89
+ top_p=0.9,
90
+ temperature=0.2
91
+ )
92
+
93
+ input_token_len = model_inputs["input_ids"].shape[1]
94
+ response = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
95
+
96
+ print(response)
97
+ ```
98
+ ## Citation
99
+ <font size="3">If you find this repository useful, please consider citing this work:</font>
100
+ ```
101
+ @article{zhang2025soundwave,
102
+ title={Soundwave: Less is More for Speech-Text Alignment in LLMs},
103
+ author={Zhang, Yuhao and Liu, Zhiheng and Bu, Fan and Zhang, Ruiyu and Wang, Benyou and Li, Haizhou},
104
+ journal={arXiv preprint arXiv:2502.12900},
105
+ year={2025}
106
+ }
107
+ ```
Soundwave/Soundwave.py ADDED
@@ -0,0 +1,341 @@
1
+ from typing import List, Optional, Tuple, Union
2
+
3
+ import torch
4
+ import torch.nn as nn
5
+ import torch.nn.functional as F
6
+ from torch.nn import CrossEntropyLoss
7
+
8
+
9
+ from transformers import AutoConfig, AutoModelForCausalLM, \
10
+ LlamaConfig, LlamaModel, LlamaForCausalLM
11
+ from transformers.trainer_pt_utils import LabelSmoother
12
+ from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
13
+ from transformers.models.whisper.modeling_whisper import WhisperEncoder, WhisperConfig
14
+
15
+
16
+ IGNORE_TOKEN_ID = LabelSmoother.ignore_index
17
+
18
+
19
+ class SoundwaveConfig(LlamaConfig):
20
+ model_type = "Soundwave"
21
+
22
+ class LookBackModule(nn.Module):
23
+ def __init__(self, cfg: LlamaConfig):
24
+ super().__init__()
25
+ self.encoder_attn = nn.MultiheadAttention(
26
+ cfg.hidden_size,
27
+ cfg.num_attention_heads,
28
+ dropout=0.1,
29
+ batch_first=True
30
+ )
31
+ self.atten_layer_norm = nn.LayerNorm(cfg.hidden_size)
32
+
33
+
34
+ def forward(self, x, wav_feature, bf_shrink_padding_mask):
35
+
36
+ residual = x
37
+ x, _ = self.encoder_attn(
38
+ query=x,
39
+ key=wav_feature,
40
+ value=wav_feature,
41
+ key_padding_mask=bf_shrink_padding_mask,
42
+ )
43
+ x += residual
44
+ x = self.atten_layer_norm(x)
45
+ return x
46
+
47
+ class SoundwaveModel(LlamaModel):
48
+ config_class = SoundwaveConfig
49
+
50
+ def __init__(self, config: LlamaConfig):
51
+ super(SoundwaveModel, self).__init__(config)
52
+
53
+ if hasattr(config, "adapter_size"):
54
+ self.mm_projector1 = nn.Linear(config.adapter_size*2 , config.hidden_size)
55
+ self.lbm = LookBackModule(config)
56
+ self.out_norm = nn.LayerNorm(config.hidden_size)
57
+ self.audio_feature_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
58
+
59
+ asr_encoder_layer = nn.TransformerEncoderLayer(
60
+ d_model=config.hidden_size,
61
+ nhead=config.num_attention_heads,
62
+ dim_feedforward=config.hidden_size*2,
63
+ dropout=0.1,
64
+ norm_first=True
65
+ )
66
+ self.asr_transformer_encoder = nn.TransformerEncoder(asr_encoder_layer, num_layers=1)
67
+
68
+ if hasattr(config, "audio_tower"):
69
+ self.audio_tower = WhisperEncoder(WhisperConfig.from_pretrained(config.audio_tower))
70
+ self.mask_tensor=(torch.ones([1,1024])>0)
71
+ self.length=-1
72
+
73
+ def forward(
74
+ self,
75
+ input_ids: torch.LongTensor = None,
76
+ attention_mask: Optional[torch.Tensor] = None,
77
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
78
+ inputs_embeds: Optional[torch.FloatTensor] = None,
79
+ use_cache: Optional[bool] = None,
80
+ output_attentions: Optional[bool] = None,
81
+ output_hidden_states: Optional[bool] = None,
82
+ audios: Optional[torch.FloatTensor] = None,
83
+ return_dict: Optional[bool] = None,
84
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
85
+
86
+ if inputs_embeds is None:
87
+ inputs_embeds = self.embed_tokens(input_ids)
88
+
89
+ if (input_ids.shape[1] != 1 or self.training) and audios is not None:
90
+ audio_list=[]
91
+
92
+ for audio in audios:
93
+ with torch.no_grad():
94
+ audio=audio.unsqueeze(0)
95
+ audio_feature = self.audio_tower(audio).last_hidden_state
96
+
97
+ audio_feature = audio_feature.view(audio_feature.shape[0], audio_feature.shape[1]//2, 2 * audio_feature.shape[2])
98
+ audio_feature = self.mm_projector1(audio_feature)
99
+ audio_feature = self.asr_transformer_encoder(audio_feature)
100
+ audio_feature = self.out_norm(audio_feature)
101
+ audio_list.append(audio_feature[0])
102
+
103
+ audio_features = torch.stack(audio_list, dim=0)
104
+
105
+ predict_logits = self.audio_feature_head(audio_features)
106
+
107
+ new_input_embeds = []
108
+ label_shift = []
109
+ label_extend = -1
110
+ new_input_ids = []
111
+ tokens = predict_logits.argmax(dim=-1)
112
+ shrink_mask = tokens.roll(1) != tokens
113
+ shrink_mask[:,0] = True
114
+
115
+ lengths = shrink_mask.long().sum(-1)
116
+ shrink_2d = audio_features[shrink_mask]
117
+ num_patches = self.config.audio_patch_size
118
+ l_index=0
119
+ shrink_features = []
120
+ for v, audio_feature, mask in zip(lengths, audio_features, ~shrink_mask):
121
+ shrink_feature = shrink_2d[l_index:l_index+v]
122
+ shrink_feature = self.lbm(shrink_feature, audio_feature, bf_shrink_padding_mask=mask)
123
+ shrink_features.append(shrink_feature)
124
+ l_index += v
125
+
126
+ if self.training:
127
+ maxn_length = lengths.max()
128
+ label_extend = maxn_length - num_patches
129
+ for cur_input_ids, cur_input_embeds, shrink_feature in zip(input_ids, inputs_embeds, shrink_features):
130
+ pad_ids = torch.full(size=(maxn_length,), fill_value=self.config.llm_pad_token_id, dtype=torch.long).to(attention_mask.device)
131
+ pad_embeds = self.embed_tokens(pad_ids)
132
+ v = shrink_feature.shape[0]
133
+ audio_start_token_pos = torch.where(cur_input_ids == self.config.audio_patch_token)[0][:1]
134
+ cur_new_input_id = torch.cat((cur_input_ids[:audio_start_token_pos], cur_input_ids[audio_start_token_pos: audio_start_token_pos+1].repeat(v), cur_input_ids[audio_start_token_pos + num_patches:], pad_ids[:maxn_length - v]), dim=0)
135
+ cur_new_input_embeds = torch.cat((
136
+ cur_input_embeds[:audio_start_token_pos],
137
+ shrink_feature,
138
+ cur_input_embeds[audio_start_token_pos + num_patches:],pad_embeds[:maxn_length-v]), dim=0)
139
+ new_input_embeds.append(cur_new_input_embeds)
140
+ new_input_ids.append(cur_new_input_id)
141
+ label_shift.append(v - num_patches)
142
+
143
+ input_ids = torch.stack(new_input_ids, dim=0)
144
+ attention_mask=input_ids.ne(self.config.llm_pad_token_id)
145
+ inputs_embeds = torch.stack(new_input_embeds, dim=0)
146
+ else:
147
+ for cur_input_ids, cur_input_embeds, shrink_feature in zip(input_ids, inputs_embeds, shrink_features):
148
+ v = shrink_feature.shape[0]
149
+
150
+ audio_start_token_pos = torch.where(cur_input_ids == self.config.audio_patch_token)[0][:1]
151
+ cur_new_input_id = torch.cat((cur_input_ids[:audio_start_token_pos],cur_input_ids[audio_start_token_pos: audio_start_token_pos+1].repeat(v), cur_input_ids[audio_start_token_pos + num_patches:]),dim=0)
152
+ cur_new_input_embeds = torch.cat((
153
+ cur_input_embeds[:audio_start_token_pos],
154
+ shrink_feature,
155
+ cur_input_embeds[audio_start_token_pos + num_patches:]), dim=0)
156
+ new_input_embeds.append(cur_new_input_embeds)
157
+ new_input_ids.append(cur_new_input_id)
158
+ input_ids = torch.stack(new_input_ids, dim=0)
159
+ attention_mask=input_ids.ne(self.config.llm_pad_token_id)
160
+ inputs_embeds = torch.stack(new_input_embeds, dim=0)
161
+ self.mask_tensor.to(input_ids.device)[0][:attention_mask.shape[1]]=attention_mask[0]
162
+ self.length=attention_mask.shape[1]
163
+
164
+ if not self.training:
165
+ attention_mask=self.mask_tensor.to(input_ids.device)[:,:self.length]
166
+ self.length+=1
167
+
168
+ return_state=super(SoundwaveModel, self).forward(
169
+ input_ids=None, attention_mask=attention_mask, past_key_values=past_key_values,
170
+ inputs_embeds=inputs_embeds, use_cache=use_cache,
171
+ output_attentions=output_attentions, output_hidden_states=output_hidden_states,
172
+ return_dict=return_dict
173
+ )
174
+ if self.training:
175
+ return_state["audio_features"] = predict_logits
176
+ return_state["label_shift"] = label_shift
177
+ return_state["label_extend"] = label_extend
178
+
179
+ return return_state
180
+
181
+
182
+ class SoundwaveForCausalLM(LlamaForCausalLM):
183
+ config_class = SoundwaveConfig
184
+
185
+ def __init__(self, config):
186
+ super(LlamaForCausalLM, self).__init__(config)
187
+ self.model = SoundwaveModel(config)
188
+
189
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
190
+
191
+ # Initialize weights and apply final processing
192
+ self.post_init()
193
+
194
+ def get_model(self):
195
+ return self.model
196
+
197
+ def forward(
198
+ self,
199
+ input_ids: torch.LongTensor = None,
200
+ attention_mask: Optional[torch.Tensor] = None,
201
+ position_ids: Optional[torch.LongTensor] = None,
202
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
203
+ inputs_embeds: Optional[torch.FloatTensor] = None,
204
+ labels: Optional[torch.LongTensor] = None,
205
+ asr_targets: Optional[torch.LongTensor] = None,
206
+ use_cache: Optional[bool] = None,
207
+ output_attentions: Optional[bool] = None,
208
+ output_hidden_states: Optional[bool] = None,
209
+ audios: Optional[torch.FloatTensor] = None,
210
+ return_dict: Optional[bool] = None,
211
+ cache_position: Optional[torch.LongTensor] = None,
212
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
213
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
214
+ output_hidden_states = (
215
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
216
+ )
217
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
218
+
219
+ outputs = self.model(
220
+ input_ids=input_ids,
221
+ attention_mask=attention_mask,
222
+ past_key_values=past_key_values,
223
+ inputs_embeds=inputs_embeds,
224
+ use_cache=use_cache,
225
+ output_attentions=output_attentions,
226
+ output_hidden_states=output_hidden_states,
227
+ return_dict=return_dict,
228
+ audios=audios
229
+ )
230
+
231
+
232
+ hidden_states = outputs[0]
233
+ logits = self.lm_head(hidden_states)
234
+
235
+ loss = None
236
+ if labels is not None:
237
+ if asr_targets is not None:
238
+ mask_asr_targets = (asr_targets != IGNORE_TOKEN_ID)
239
+ target_lengths = mask_asr_targets.sum(1)
240
+ input_lengths = torch.full(size=(outputs["audio_features"].shape[0],), fill_value=outputs["audio_features"].shape[1], dtype=torch.long)
241
+ asr_logits = outputs["audio_features"]
242
+
243
+ log_probs = F.log_softmax(asr_logits, dim=-1).transpose(0, 1)
244
+
245
+ with torch.backends.cudnn.flags(enabled=False):
246
+ loss_asr = F.ctc_loss(
247
+ log_probs,
248
+ asr_targets,
249
+ input_lengths,
250
+ target_lengths,
251
+ blank=self.model.config.audio_patch_token,
252
+ reduction='mean',
253
+ zero_infinity=True,
254
+ )
255
+ else:
256
+ loss_asr=0
257
+
258
+ # Shift so that tokens < n predict n
259
+ shift_logits = logits[..., :-1, :].contiguous()
260
+ shift_labels = labels[..., 1:].contiguous()
261
+
262
+ if len(outputs["label_shift"]) >0:
263
+ if outputs["label_extend"] != -1:
264
+ new_shift_labels = torch.full(size=(shift_labels.shape[0], outputs["label_extend"]+shift_labels.shape[1]), fill_value=IGNORE_TOKEN_ID, dtype=torch.long).to(shift_labels.device)
265
+ for i in range(len(outputs["label_shift"])):
266
+ new_shift_labels[i][outputs["label_shift"][i]:outputs["label_shift"][i] + len(shift_labels[i])]= shift_labels[i]
267
+ shift_labels = new_shift_labels
268
+ else:
269
+ for i in range(len(outputs["label_shift"])):
270
+ shift_labels[i]= shift_labels[i].roll(-outputs["label_shift"][i])
271
+
272
+ loss_fct = CrossEntropyLoss()
273
+ # Flatten the tokens
274
+ shift_logits = shift_logits.view(-1, self.config.vocab_size)
275
+ shift_labels = shift_labels.view(-1)
276
+
277
+ # Enable model/pipeline parallelism
278
+ shift_labels = shift_labels.to(shift_logits.device)
279
+ loss = loss_fct(shift_logits, shift_labels)
280
+ loss = loss + 0.3*loss_asr
281
+
282
+ if not return_dict:
283
+ output = (logits,) + outputs[1:]
284
+ return (loss,) + output if loss is not None else output
285
+
286
+ return CausalLMOutputWithPast(
287
+ loss=loss,
288
+ logits=logits,
289
+ past_key_values=outputs.past_key_values,
290
+ hidden_states=outputs.hidden_states,
291
+ attentions=outputs.attentions,
292
+ )
293
+
294
+ def prepare_inputs_for_generation(
295
+ self,
296
+ input_ids,
297
+ past_key_values=None,
298
+ attention_mask=None,
299
+ inputs_embeds=None,
300
+ cache_position=None,
301
+ position_ids=None,
302
+ use_cache=True,
303
+ **kwargs,
304
+ ):
305
+ # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens
306
+ # Exception 1: when passing input_embeds, input_ids may be missing entries
307
+ # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here
308
+ if past_key_values is not None:
309
+ if inputs_embeds is not None: # Exception 1
310
+ input_ids = input_ids[:, -cache_position.shape[0] :]
311
+ elif input_ids.shape[1] != cache_position.shape[0]: # Default case (the "else", a no op, is Exception 2)
312
+ input_ids = input_ids[:, cache_position]
313
+
314
+ if attention_mask is not None and position_ids is None:
315
+ # create position_ids on the fly for batch generation
316
+ position_ids = attention_mask.long().cumsum(-1) - 1
317
+ position_ids.masked_fill_(attention_mask == 0, 1)
318
+ if past_key_values:
319
+ position_ids = position_ids[:, -input_ids.shape[1] :]
320
+
321
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
322
+ if inputs_embeds is not None and cache_position[0] == 0:
323
+ model_inputs = {"inputs_embeds": inputs_embeds}
324
+ else:
325
+ model_inputs = {"input_ids": input_ids.contiguous()} # `contiguous()` needed for compilation use cases
326
+
327
+ model_inputs.update(
328
+ {
329
+ "position_ids": position_ids,
330
+ "cache_position": cache_position,
331
+ "past_key_values": past_key_values,
332
+ "use_cache": use_cache,
333
+ "attention_mask": attention_mask,
334
+ }
335
+ )
336
+ model_inputs.update({"audios": kwargs["audios"]} if "audios" in kwargs.keys() else {})
337
+ return model_inputs
338
+
339
+
340
+ AutoConfig.register("Soundwave", SoundwaveConfig)
341
+ AutoModelForCausalLM.register(SoundwaveConfig, SoundwaveForCausalLM)
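As a reading aid, below is a minimal, self-contained sketch of the frame-shrinking step in `SoundwaveModel.forward` above: the CTC-style logits from `audio_feature_head` are argmax-decoded, and only frames where the predicted token changes are kept before the `LookBackModule` cross-attention. Tensor shapes and values are dummies chosen purely for illustration, and the per-row `roll` here is a simplified stand-in for the code's flattened roll.

```python
import torch

# Dummy stand-ins for the tensors computed inside SoundwaveModel.forward
batch, frames, hidden, vocab = 2, 10, 8, 32
audio_features = torch.randn(batch, frames, hidden)    # adapter + ASR-encoder output
predict_logits = torch.randn(batch, frames, vocab)     # audio_feature_head(audio_features)

# CTC-style greedy decode, then keep only frames where the predicted token changes
tokens = predict_logits.argmax(dim=-1)                 # (batch, frames)
shrink_mask = tokens.roll(shifts=1, dims=1) != tokens  # True where a new token starts
shrink_mask[:, 0] = True                               # always keep the first frame

lengths = shrink_mask.long().sum(-1)                   # kept frames per sample
shrunk = audio_features[shrink_mask]                    # (lengths.sum(), hidden), flattened

# Split back per sample, mirroring the l_index bookkeeping in the real forward pass
for feats in shrunk.split(lengths.tolist(), dim=0):
    print(feats.shape)  # each sample now carries a shorter, deduplicated feature sequence
```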
Soundwave/assets/audio/example_1.wav ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f017d446358e8f9a0bd5d2cf209a3fa2c0e40025e2831142d591945b65fcd809
3
+ size 137166
Soundwave/assets/logo.png ADDED

Git LFS Details

  • SHA256: 2d9b0303000886849c010eacb32555d5f7afd88e40bd5de6706b55e7d7647371
  • Pointer size: 131 Bytes
  • Size of remote file: 283 kB
Soundwave/requirement.txt ADDED
@@ -0,0 +1,5 @@
1
+ torch==2.3.0
2
+ gradio
3
+ librosa==0.10.2.post1
4
+ transformers==4.43.1
5
+ accelerate==0.34.2
Soundwave/run_inference.py ADDED
@@ -0,0 +1,123 @@
1
+ import torch
2
+ import argparse
3
+ import librosa
4
+ from transformers import AutoTokenizer, WhisperProcessor
5
+ from .Soundwave import SoundwaveForCausalLM
6
+ import spaces
7
+ import gradio as gr
8
+
9
+ class BasicSetting:
10
+ def __init__(self):
11
+ self.sampling_rate = 16000
12
+ self.audio_token_len = 1
13
+ self.stop = "</s>"
14
+ CONFIG = BasicSetting()
15
+
16
+
17
+ def load_model(model_path, device):
18
+ # load model
19
+ model = SoundwaveForCausalLM.from_pretrained(
20
+ model_path,
21
+ device_map={"": device},
22
+ torch_dtype=torch.float16,
23
+ quantization_config=None,
24
+ # attn_implementation="flash_attention_2"
25
+ ).eval().to(device)
26
+
27
+ # load tokenizer
28
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
29
+
30
+ model.config.audio_patch_token = tokenizer.get_vocab()["<audio_patch>"]
31
+ model.config.llm_pad_token_id = tokenizer.pad_token_id
32
+ model.generation_config.pad_token_id = tokenizer.eos_token_id
33
+
34
+ # load audio preprocessor
35
+ audio_processor = WhisperProcessor.from_pretrained(model.config.audio_tower, torch_dtype=torch.float16)
36
+ return model, audio_processor, tokenizer
37
+
38
+ @spaces.GPU(duration=40, progress=gr.Progress(track_tqdm=True))
39
+ def gen_model_inputs(tokenizer, prompt, device):
40
+ system = "You are a helpful language and speech assistant. You are able to understand the speech content that the user provides, and assist the user with a variety of tasks using natural language."
41
+ DEFAULT_AUDIO_PATCH_TOKEN = "<audio_patch>"
42
+ audio_placeholder = DEFAULT_AUDIO_PATCH_TOKEN * CONFIG.audio_token_len
43
+ audio_placeholder = "\n"+audio_placeholder
44
+ audio_placeholder_ids = tokenizer(audio_placeholder).input_ids
45
+
46
+ begin_of_text_id = tokenizer.get_vocab()["<|begin_of_text|>"]
47
+ start_header_id = tokenizer.get_vocab()["<|start_header_id|>"]
48
+ end_header_id = tokenizer.get_vocab()["<|end_header_id|>"]
49
+ eot_id = tokenizer.get_vocab()["<|eot_id|>"]
50
+ nl_tokens = tokenizer('\n').input_ids
51
+ _system = tokenizer('system').input_ids
52
+ _user = tokenizer('user').input_ids
53
+ _assistant = tokenizer('assistant').input_ids
54
+
55
+ input_ids = []
56
+ input_id = []
57
+
58
+ system = [begin_of_text_id] + [start_header_id] + _system + [end_header_id] + nl_tokens + tokenizer(system).input_ids + [eot_id]
59
+ input_id += system
60
+
61
+ user_input_id = [start_header_id] + _user + [end_header_id] + audio_placeholder_ids + tokenizer(prompt).input_ids + [eot_id]
62
+ assistant_input_id = [start_header_id] + _assistant + [end_header_id] + nl_tokens
63
+
64
+ input_id += user_input_id
65
+ input_id += assistant_input_id
66
+
67
+ input_ids.append(input_id)
68
+ input_ids = torch.tensor(input_ids, dtype=torch.int).to(device)
69
+ attention_mask=input_ids.ne(tokenizer.pad_token_id)
70
+
71
+ return dict(input_ids=input_ids, attention_mask=attention_mask)
72
+
73
+ @spaces.GPU(duration=40, progress=gr.Progress(track_tqdm=True))
74
+ def inference(model, audio_processor, tokenizer, prompt, audio_path, device):
75
+ # apply chat template
76
+ model_inputs = gen_model_inputs(tokenizer, prompt, device)
77
+ model.cuda()
78
+ # audio preprocess
79
+ audio, _ = librosa.load(audio_path, sr=CONFIG.sampling_rate, mono=True)
80
+ audio_feat = audio_processor(
81
+ audio, sampling_rate=CONFIG.sampling_rate, return_tensors="pt"
82
+ ).input_features.to(device, dtype=torch.float16)
83
+ print(audio_feat)
84
+ output_ids = model.generate(
85
+ **model_inputs,
86
+ audios=audio_feat,
87
+ max_new_tokens=512,
88
+ eos_token_id=tokenizer.eos_token_id,
89
+ do_sample=True,
90
+ top_p=0.9,
91
+ temperature=0.2,
92
+ )
93
+
94
+ input_ids = model_inputs["input_ids"]
95
+ input_token_len = input_ids.shape[1]
96
+ n_diff_input_output = (input_ids != output_ids[:, :input_token_len]).sum().item()
97
+ if n_diff_input_output > 0:
98
+ print(f'[Warning] {n_diff_input_output} output_ids are not the same as the input_ids')
99
+ outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
100
+
101
+ outputs = outputs.strip()
102
+ if outputs.endswith(CONFIG.stop):
103
+ outputs = outputs[:-len(CONFIG.stop)]
104
+ outputs = outputs.strip()
105
+
106
+ return outputs
107
+
108
+ if __name__ == "__main__":
109
+ parser = argparse.ArgumentParser()
110
+ parser.add_argument('--adapter_size', type=int, default=1280)
111
+ parser.add_argument('--model_path', type=str, default="FreedomIntelligence/Soundwave")
112
+ args = parser.parse_args()
113
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
114
+ model_path = args.model_path
115
+
116
+ model, audio_processor, tokenizer = load_model(model_path, device)
117
+
118
+ prompt = "Please transcribe the following audio and then answer based on the audio's transcription."
119
+ audio_path = "assets/audio/example_1.wav"
120
+
121
+ response = inference(model, audio_processor, tokenizer, prompt, audio_path, device)
122
+
123
+ print(f"{response}")
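For reference, a minimal sketch of how the helpers above might be called from another script, mirroring the import style used by app.py below. It assumes the `spaces` package imported by run_inference.py is available (as on Hugging Face Spaces) and that a CUDA device is present, since `inference` moves the model onto the GPU; the audio path is a placeholder.

```python
import torch

from Soundwave.run_inference import load_model, inference  # same import style as app.py

device = "cuda" if torch.cuda.is_available() else "cpu"
model, audio_processor, tokenizer = load_model("FreedomIntelligence/Soundwave", device)

prompt = "Please transcribe the following audio and then answer based on the audio's transcription."
audio_path = "Soundwave/assets/audio/example_1.wav"  # placeholder path; any speech wav should work

# inference() applies the chat template, extracts Whisper features, and decodes the answer
print(inference(model, audio_processor, tokenizer, prompt, audio_path, device))
```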
app.py CHANGED
@@ -1,7 +1,66 @@
1
  import gradio as gr
 
2
 
3
- def greet(name):
4
- return "Hello " + name + "!!"
5
 
6
- demo = gr.Interface(fn=greet, inputs="text", outputs="text")
7
- demo.launch()
1
  import gradio as gr
2
+ from Soundwave.run_inference import *
3
 
 
 
4
 
5
+ device = 'cuda'
6
+
7
+ model, audio_processor, tokenizer = load_model("FreedomIntelligence/Soundwave", device)
8
+ model.cuda()
9
+
10
+ @spaces.GPU(duration=40, progress=gr.Progress(track_tqdm=True))
11
+ def process_audio_text(text, audio):
12
+ # The audio path is the file path that was passed in
13
+ audio_path = audio
14
+ print(audio_path)
15
+ system = "You are a helpful language and speech assistant. You are able to understand the speech content that the user provides, and assist the user with a variety of tasks using natural language."
16
+ if text == "" or text == " ":
17
+ text = "Please transcribe the following audio and then answer based on the audio's transcription."
18
+ response = inference(model, audio_processor, tokenizer, text, audio_path, device)
19
+ result = f"{response}"
20
+ return result
21
+
22
+ examples = [
23
+ ["Can you turn my English into German?", "./show_case/common_voice_en_19664034.mp3"], # En-De
24
+ ["Can you identify the initial word that connects to 'currency_name' in this audio clip?", "./show_case/audio-1434542201-headset.wav"], # ER
25
+ ["What do you think the speaker's message is intended to be in this audio?", "./show_case/audio-1434542201-headset.wav"], # IC
26
+ ["What does the person say?", "./show_case/p225_002.wav"], # DFake
27
+ # ["Assess whether this speech's pronunciation is Real or Fake.", "./show_case/Real.wav"], # DFake
28
+ ["Assess whether this speech's pronunciation is Real or Fake.", "./show_case/Fake.wav"], # DFake
29
+ ["What emotional weight does the speaker's tone carry?\nPick one answer from A, B, C, and D.\nA: fear\nB: sadness\nC: joy\nD: neutral", "./show_case/SER(emotion)_example.wav"], #SER(emotion)
30
+ # ["Assess whether this speech's pronunciation is Real or Fake.", "./show_case/SVD_14154_file31512.mp3.wav_16k.wav_norm.wav_mono.wav_silence.wav"], # SVD
31
+ ["Choose the most suitable answer from options A, B, C, and D to respond the question in next line, you may only choose A or B or C or D.\nThe number of speakers delivering this speech is what?\nA. 4\nB. 2\nC.1\nD. 3", "./show_case/SNV_example.wav"], #SNV
32
+ ["Identify the language of the conversation you just heard.","./show_case/SLR_example.wav"], #SLR
33
+ ["tell the gender of the speaker in this audio.","./show_case/SGR_018.wav"], #SGR
34
+ ["What's the sound we're hearing in this audio from?","./show_case/Sound_Vocal_example.wav"], #Sound_vocal
35
+ ["What is your best guess at the setting of this sound clip?","./show_case/Scene_example.wav"], #Sound_cochl
36
+ ["Choose the most suitable answer from options A, B, C, and D to respond the question in next line, Please think step by step and you may only choose A or B or C or D.\nRecognize the segment where 'project' is spoken by the speaker.\nA. [5.28, 5.39]\nB. [0.92, 1.39]\nC. [4.75, 5.28]\nD. [3.86, 4.23]","./show_case/SG_audio_1.wav"], #SG
37
+ ["What type of business does the first person's son have?","./show_case/SFT_Fisher_example.wav"] #SFT_Fisher
38
+ ]
39
+
40
+ with gr.Blocks() as demo:
41
+ gr.Markdown("""
42
+ <h1 style='text-align: center; color: #014377;'>🔊 Soundwave Demo</h1>
43
+ <p style='text-align: center;'>Upload an audio file and provide an instruction for the AI to process.</p>
44
+ """)
45
+
46
+ with gr.Row():
47
+ with gr.Column(scale=1):
48
+ audio_input = gr.Audio(label="🎤 Upload Audio", type="filepath", value="./show_case/p225_002.wav")
49
+ with gr.Column(scale=1):
50
+ text_input = gr.Textbox(label="📝 Enter text instruction", value="What does the person say?", lines=2)
51
+
52
+ with gr.Row():
53
+ submit_button = gr.Button("🚀 Process Audio", size="lg")
54
+
55
+ with gr.Row():
56
+ output_text = gr.Textbox(label="📜 Model output", lines=5, interactive=False)
57
+
58
+ def handle_submit(text, audio):
59
+ return process_audio_text(text, audio)
60
+
61
+ submit_button.click(fn=handle_submit, inputs=[text_input, audio_input], outputs=output_text)
62
+
63
+ gr.Examples(examples, inputs=[text_input, audio_input])
64
+
65
+ if __name__ == "__main__":
66
+ demo.launch()
show_case/Fake.wav ADDED
Binary file (54.2 kB).
 
show_case/Real.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d41d182a1a02e1402b4dba23978e769713982dcf39272aa0fd7451d21e312576
3
+ size 214622
show_case/SAR_common_voice_en_18730791.mp3 ADDED
Binary file (37.3 kB).
 
show_case/SER(emotion)_example.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:76fd43b0094b18a3d40230db994f78207877560c6b9b0ba2be694ede48712363
3
+ size 884838
show_case/SFT_Fisher_example.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7afbc99ba77cb7d277154371c6a01a54a32a51b61097d5c0b7035061e4cf99ec
3
+ size 583486
show_case/SGR_018.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5fb25b832535382dd0bf82cfe337b3d1912ac0d0787b929ad1889b224246e16b
3
+ size 1708140
show_case/SG_audio_1.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:427ca8e0dc94adf602a01a3513b44efba8136c0b5536e34b50f006bcd4d1d366
3
+ size 197004
show_case/SLR_example.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6cb4b93187df340391151f6c9bf61401a866d1852d10ded839ce596a84a56a7a
3
+ size 204204
show_case/SNV_example.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:711a44be98d1fed2b38e9b6d2b4c8993a552f3c1f2cec9776f949bbb6aa8de9b
3
+ size 259888
show_case/SVD_14154_file31512.mp3.wav_16k.wav_norm.wav_mono.wav_silence.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e14fc85c11123eee823a4182113331fd307e41012ca441825b0a6165c6b2105d
3
+ size 108390
show_case/Scene_example.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:dee9f053c78f21958507e073cf7e7f184d37897407125dc69a71a940c6f99587
3
+ size 882044
show_case/Sound_Vocal_example.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b83a3db215922aaa6a3ee9adb30fb04f0261b1956e16c4c1a3888809009a2b31
3
+ size 112002
show_case/audio-1434542201-headset.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:42a946bbc783ab9411c52e9f2680919559fe31999ac9d3d65e24edadfd3a6d06
3
+ size 102444
show_case/common_voice_en_19664034.mp3 ADDED
Binary file (53.2 kB).
 
show_case/p225_002.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6716057eb1872e793eb5efe60f077cf19533fea634eea835f71a9ef180b0c2e2
3
+ size 378156