anteju committed on
Commit 8534c76 · verified · 1 Parent(s): 06677eb

Update README.md

  1. README.md +158 -3
README.md CHANGED
@@ -1,3 +1,158 @@
1
- ---
2
- license: cc-by-nc-sa-4.0
3
- ---
1
+ ---
2
+ license: cc-by-nc-sa-4.0
3
+ library_name: NeMo
4
+ tags:
5
+ - NeMo
6
+ - speech
7
+ - audio
8
+ ---
9
+ # SE Dereverberation SB 16kHz Small
10
+
11
+ ## Model Overview
12
+
13
+ ### Description
14
+
15
+ The model extracts speech for human or machine listeners. This is a generative speech dereverberation model based on the Schrödinger bridge. The model is trained on a publicly available research dataset.
16
+
17
+ This model is for research and development only.
18
+
19
+ ### License/Terms of Use
20
+ Use of this model is governed by the [CC-BY-NC-SA-4.0](https://creativecommons.org/licenses/by-nc-sa/4.0) license. By downloading the publicly released version of the model, you accept the terms and conditions of that license.
21
+
22
+ ## References
23
+
24
+ [1] [Schrödinger Bridge for Generative Speech Enhancement](https://arxiv.org/abs/2407.16074), Interspeech, 2024.
25
+
26
+ ## Model Architecture
27
+ **Architecture Type:** Schrödinger Bridge<br>
28
+ **Network Architecture:** U-Net with convolutional layers<br>
29
+
30
+ ## Input
31
+ **Input Type(s):** Audio <br>
32
+ **Input Format(s):** .wav files <br>
33
+ **Input Parameters:** One-Dimensional (1D) <br>
34
+ **Other Properties Related to Input:** 16000 Hz Mono-channel Audio <br>
35
+
36
+ ## Output
37
+ **Output Type(s):** Audio <br>
38
+ **Output Format:** .wav files <br>
39
+ **Output Parameters:** One-Dimensional (1D) <br>
40
+ **Other Properties Related to Output:** 16000 Hz Mono-channel Audio <br>
41
+
42
+ ## Software Integration
43
+ **Runtime Engine(s):**<br>
44
+ * NeMo-2.0.0 <br>
45
+
46
+ **Supported Hardware Microarchitecture Compatibility:** <br>
47
+ * NVIDIA Ampere<br>
48
+ * NVIDIA Blackwell<br>
49
+ * NVIDIA Jetson<br>
50
+ * NVIDIA Hopper<br>
51
+ * NVIDIA Lovelace<br>
52
+ * NVIDIA Turing<br>
53
+ * NVIDIA Volta<br>
54
+
55
+ **Preferred Operating System(s):** <br>
56
+ * Linux<br>
57
+ * Windows<br>
58
+
59
+ ## Model Version(s)
60
+ `se_der_sb_16k_small_v1.0`<br>
61
+
62
+ # Training, Testing, and Evaluation Datasets
63
+
64
+ ## Training Dataset
65
+ **Link:**
66
+ [WSJ0](https://catalog.ldc.upenn.edu/LDC93S6A)
67
+
68
+ **Data Collection Method by dataset:** Human <br>
69
+
70
+ **Labeling Method by dataset:** Human<br>
71
+
72
+ **Properties (Quantity, Dataset Descriptions, Sensor(s)):**
73
+ WSJ0 was used for clean speech signals. The observed signals are simulated with room impulse responses with reverberation times between 0.4 seconds and 1.0 seconds, and without any background noise. The total size of the training dataset was approximately 25 hours.<br>
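
To illustrate what "simulated with room impulse responses" means in practice, here is a minimal sketch that convolves a clean signal with a synthetic impulse response. It is not the procedure used to build the dataset: the exponentially decaying noise model and the names `synthetic_rir` and `reverberate` are assumptions made for illustration only.

```python
import numpy as np


def synthetic_rir(rt60, sample_rate=16000, seed=0):
    """Hypothetical RIR: Gaussian noise with an exponential decay envelope.

    The envelope falls by 60 dB after `rt60` seconds, which is the
    definition of the reverberation time.
    """
    rng = np.random.default_rng(seed)
    num_samples = int(round(rt60 * sample_rate))
    t = np.arange(num_samples) / sample_rate
    # Amplitude envelope: 20*log10(envelope) == -60 dB at t == rt60
    envelope = np.exp(-3.0 * np.log(10.0) * t / rt60)
    rir = rng.standard_normal(num_samples) * envelope
    return rir / np.max(np.abs(rir))


def reverberate(clean, rir):
    """Convolve clean speech with the RIR and keep the original length."""
    return np.convolve(clean, rir)[: len(clean)]
```

For example, `reverberate(clean, synthetic_rir(0.7))` would produce a noise-free reverberant signal with a reverberation time inside the 0.4 to 1.0 second range described above.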
74
+
75
+ ## Testing Dataset
76
+ **Link:**
77
+ [WSJ0](https://catalog.ldc.upenn.edu/LDC93S6A)
78
+
79
+ **Data Collection Method by dataset:** Human <br>
80
+
81
+ **Labeling Method by dataset:** Human<br>
82
+
83
+ **Properties (Quantity, Dataset Descriptions, Sensor(s)):**
84
+ WSJ0 was used for clean speech signals. The observed signals are simulated with room impulse responses with reverberation times between 0.4 seconds and 1.0 seconds, and without any background noise. The total size of the testing dataset was approximately 2 hours.<br>
85
+
86
+ ## Evaluation Dataset
87
+ **Link:**
88
+ [WSJ0](https://catalog.ldc.upenn.edu/LDC93S6A)
89
+
90
+ **Data Collection Method by dataset:** Human <br>
91
+
92
+ **Labeling Method by dataset:** Human<br>
93
+
94
+ **Properties (Quantity, Dataset Descriptions, Sensor(s)):**
95
+ WSJ0 was used for clean speech signals. The observed signals are simulated with room impulse responses with reverberation times between 0.4 seconds and 1.0 seconds, and without any background noise. The total size of the evaluation dataset was approximately 2 hours.<br>
96
+
97
+ ## Inference
98
+ **Engine:** NeMo 2.0 <br>
99
+
100
+ **Test Hardware:** NVIDIA V100<br>
101
+
102
+ # Performance
103
+
104
+ The model is trained on the training subset of the WSJ0-Reverb dataset using the auxiliary L1-norm loss [1].
105
+
106
+ The model is evaluated using several instrumental metrics: perceptual evaluation of speech quality (PESQ), extended short-term objective intelligibility (ESTOI) and scale-invariant signal-to-distortion ratio (SI-SDR). Word error rate (WER) is evaluated using the [FastConformer-Transducer-Large English ASR model](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_conformer_transducer_large).
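
As a reference for one of the metrics above, SI-SDR can be sketched in a few lines of NumPy. This follows the commonly used definition (project the estimate onto the reference, then compare the energy of the projection with the energy of the residual). The function name `si_sdr` and the zero-mean step are assumptions of this sketch, and the implementation used to produce the reported numbers may differ in such details.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB (illustrative sketch)."""
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    # Remove the mean so a DC offset does not affect the result
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scaling of the reference towards the estimate
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    distortion = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(distortion, distortion) + eps)
    return 10.0 * np.log10(ratio)
```

Because the reference is rescaled before the comparison, multiplying the estimate by a constant gain leaves the value unchanged, which is what "scale-invariant" refers to.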
107
+
108
+ Metrics are reported on the test set of the WSJ0-Reverb dataset using either the SDE or the ODE sampler.
109
+
110
+ | Signal | PESQ | ESTOI | SI-SDR / dB | WER / % |
111
+ |:-------------:|:----:|:-----:|:---------:|:-------:|
112
+ | Input | 1.29 | 0.44 | -9.5 | 8.29 |
113
+ | Processed SDE | 2.79 | 0.89 | 7.4 | 4.27 |
114
+ | Processed ODE | 2.59 | 0.86 | 6.2 | 5.79 |
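
The improvements implied by the table can be worked out with a short script. The numbers below are copied from the table. The `results` dictionary and the `summarize` helper are only illustrative names.

```python
# Values copied from the table above
results = {
    "Input": {"PESQ": 1.29, "ESTOI": 0.44, "SI-SDR": -9.5, "WER": 8.29},
    "Processed SDE": {"PESQ": 2.79, "ESTOI": 0.89, "SI-SDR": 7.4, "WER": 4.27},
    "Processed ODE": {"PESQ": 2.59, "ESTOI": 0.86, "SI-SDR": 6.2, "WER": 5.79},
}


def summarize(name):
    """Return the gains of a processed signal relative to the unprocessed input."""
    ref = results["Input"]
    out = results[name]
    return {
        "pesq_gain": round(out["PESQ"] - ref["PESQ"], 2),
        "estoi_gain": round(out["ESTOI"] - ref["ESTOI"], 2),
        "si_sdr_gain_db": round(out["SI-SDR"] - ref["SI-SDR"], 1),
        # Relative WER reduction in percent (lower WER is better)
        "relative_wer_reduction_pct": round(
            100.0 * (ref["WER"] - out["WER"]) / ref["WER"], 1
        ),
    }


for name in ("Processed SDE", "Processed ODE"):
    print(name, summarize(name))
```

With these figures, the SDE sampler improves SI-SDR by 16.9 dB and reduces WER by about 48.5% relative to the input, while the ODE sampler improves SI-SDR by 15.7 dB and reduces WER by about 30.2%.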
115
+
116
+ # How to use this model
117
+
118
+ The model is available for use in the NVIDIA NeMo toolkit, and can be used to process audio or for fine-tuning.
119
+
120
+ ## Load the model
121
+ ```
122
+ from nemo.collections.audio.models import AudioToAudioModel
123
+ model = AudioToAudioModel.from_pretrained('nvidia/se_der_sb_16k_small')
124
+ ```
125
+
126
+ ## Process audio
127
+ A single audio file can be processed as follows:
128
+
129
+ ```
130
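+ import librosa
+ import torch  # needed for the tensor conversions below
131
+ audio_in, _ = librosa.load(path_to_input_audio, sr=model.sample_rate)  # mono, resampled to the model rate
132
+ audio_in_signal = torch.from_numpy(audio_in).view(1, 1, -1).to(model.device)  # shape: (batch, channel, time)
133
+ audio_in_length = torch.tensor([audio_in_signal.size(-1)]).to(model.device)  # number of samples per example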
134
+
135
+ audio_out_signal, _ = model(input_signal=audio_in_signal, input_length=audio_in_length)
136
+ ```
137
+
138
+ For processing several audio files at once, check the [process_audio script](https://github.com/NVIDIA/NeMo/blob/main/examples/audio/process_audio.py) in NeMo.
139
+
140
+ ## Listen to audio
141
+ ```
142
+ import soundfile as sf
143
+ audio_out = audio_out_signal.cpu().numpy().squeeze()
144
+ sf.write(path_to_output_audio, audio_out, samplerate=model.sample_rate)
145
+ ```
146
+
147
+ ## Change sampler configuration
148
+ ```
149
+ model.sampler.process = 'ode' # default sampler is 'sde'
150
+ model.sampler.num_steps = 10 # default is 50 steps
151
+
152
+ audio_out_signal, _ = model(input_signal=audio_in_signal, input_length=audio_in_length)
153
+ ```
154
+
155
+ # Ethical Considerations
156
+ NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
157
+
158
+ Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).