Commit 517f24b by aaqibsaeed · 1 Parent(s): 4143153

Update README.md

Files changed (1)
  1. README.md +136 -0
README.md CHANGED
@@ -1,3 +1,139 @@
  ---
  license: apache-2.0
  ---
+
+ # [Plug-and-Play Multilingual Few-shot Spoken Words Recognition](https://arxiv.org/pdf/2305.03058.pdf)
+
+ ## Abstract
+ As technology advances and digital devices become prevalent, seamless human-machine communication is gaining significance. The growing adoption of mobile, wearable, and other Internet of Things (IoT) devices has changed how we interact with these smart devices, making accurate spoken words recognition a crucial component for effective interaction. However, building a robust spoken word detection system that can handle novel keywords remains challenging, especially for low-resource languages with limited training data. Here, we propose PLiX, a multilingual and plug-and-play keyword spotting system that leverages few-shot learning to harness massive real-world data and enable the recognition of unseen spoken words at test time. Our few-shot deep models are learned with millions of one-second audio clips across 20 languages, achieving state-of-the-art performance while being highly efficient. Extensive evaluations show that PLiX can generalize to novel spoken words given as few as one support example and performs well on unseen languages out of the box. We release models and inference code to serve as a foundation for future research and voice-enabled user interface development for emerging devices.
+
+ ## Key Contributions
+ * We develop PLiX, a general-purpose, multilingual, plug-and-play few-shot keyword spotting system trained and evaluated with more than 12 million one-second audio clips sampled at 16 kHz.
+ * We leverage state-of-the-art neural architectures to learn few-shot models that are highly performant yet efficient, with fewer learnable parameters.
+ * We conduct a wide-ranging set of evaluations to systematically quantify the efficacy of our system across 20 languages and thousands of classes (i.e., words or terms), showcasing generalization to unseen words at test time given as few as one support example per class.
+ * We demonstrate that our model generalizes exceptionally well in a one-shot setting on 5 unseen languages. Further, in a cross-task transfer evaluation on the challenging FLEURS benchmark, our model performs well for language identification without any retraining.
+ * To serve as a building block for future research on spoken word detection with meta-learning and to enable product development, we release model weights and inference code as a Python package.
+
+ ## Quick Start
+ Install the PLiX library with pip:
+ ```bash
+ pip install plixkws
+ ```
+
+ Then follow the usage example below or refer to [test_model.py](https://github.com/FewshotML/plix/blob/main/test_model.py).
+
+ ```python
+ import torch
+ from plixkws import model, util
+
+ # One support clip per keyword; each label indexes into `classes`.
+ support_examples = ["./test_clips/aandachtig.wav", "./test_clips/stroom.wav",
+                     "./test_clips/persbericht.wav", "./test_clips/klinkers.wav",
+                     "./test_clips/zinsbouw.wav"]
+ classes = ["aandachtig", "stroom", "persbericht", "klinkers", "zinsbouw"]
+ int_indices = [0, 1, 2, 3, 4]
+
+ # Load the pretrained few-shot model (here the "base" English encoder).
+ fws_model = model.load(encoder_name="base", language="en", device="cpu")
+
+ support = {
+     "paths": support_examples,
+     "classes": classes,
+     "labels": torch.tensor(int_indices),
+ }
+ support["audio"] = torch.stack([util.load_clip(path) for path in support["paths"]])
+ support = util.batch_device(support, device="cpu")
+
+ # Query clips to classify against the support set.
+ query = {
+     "paths": ["./test_clips/query_klinkers.wav", "./test_clips/query_stroom.wav"]
+ }
+ query["audio"] = torch.stack([util.load_clip(path) for path in query["paths"]])
+ query = util.batch_device(query, device="cpu")
+
+ with torch.no_grad():
+     predictions = fws_model(support, query)
+ ```
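
The real-time example in this README prints `classes[predictions.item()]`, which suggests the model returns predicted class indices. Assuming the same holds for a batch of query clips (an assumption, not something this README confirms), a small helper can map the output back to keyword names. `decode_predictions` is a hypothetical name and is not part of `plixkws`; the sketch uses a plain list as a stand-in for the model output so it runs without the library.

```python
def decode_predictions(predictions, classes):
    """Map predicted class indices to keyword strings.

    Assumption: `predictions` holds one integer index per query clip,
    either as a tensor/array (anything with `.tolist()`) or a plain sequence.
    """
    indices = predictions.tolist() if hasattr(predictions, "tolist") else list(predictions)
    # A 0-dim tensor's .tolist() returns a bare int; normalise it to a list.
    if isinstance(indices, int):
        indices = [indices]
    return [classes[int(i)] for i in indices]


classes = ["aandachtig", "stroom", "persbericht", "klinkers", "zinsbouw"]
# Stand-in for the model output on two query clips (illustrative values only).
example_predictions = [3, 1]
print(decode_predictions(example_predictions, classes))
```

With a real model output, the call would be `decode_predictions(predictions, classes)`, provided the output format matches the assumption above.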
+
+ ## Real-time Inference
+
+ ```python
+ # !pip install pyaudio
+ import numpy as np
+ import pyaudio
+ import torch
+ from plixkws import model, util
+
+ sample_rate = 16000
+ frames_per_buffer = 512
+ support_examples = ["./test_clips/aandachtig.wav", "./test_clips/stroom.wav",
+                     "./test_clips/persbericht.wav", "./test_clips/klinkers.wav",
+                     "./test_clips/zinsbouw.wav"]
+ classes = ["aandachtig", "stroom", "persbericht", "klinkers", "zinsbouw"]
+ int_indices = [0, 1, 2, 3, 4]
+
+ support = {
+     "paths": support_examples,
+     "classes": classes,
+     "labels": torch.tensor(int_indices),
+ }
+ support["audio"] = torch.stack([util.load_clip(path) for path in support["paths"]])
+ support = util.batch_device(support, device="cpu")
+
+ fws_model = model.load(encoder_name="small", language="nl", device="cpu")
+
+ # Open a mono, 16-bit microphone input stream.
+ p = pyaudio.PyAudio()
+ stream = p.open(format=pyaudio.paInt16, channels=1,
+                 rate=sample_rate, input=True, frames_per_buffer=frames_per_buffer)
+
+ frames = []
+ while True:
+     data = stream.read(frames_per_buffer)
+     buffer = np.frombuffer(data, dtype=np.int16)
+     frames.append(buffer)
+     # Classify once at least one second of audio has accumulated.
+     if len(frames) * frames_per_buffer / sample_rate >= 1:
+         audio = np.concatenate(frames)
+         # Scale int16 samples to floats in roughly [-1, 1].
+         audio = audio.astype(float) / np.iinfo(np.int16).max
+         # Add two leading axes: shape (1, 1, num_samples).
+         query = {"audio": torch.tensor(audio[np.newaxis, np.newaxis, :], dtype=torch.float32)}
+         query = util.batch_device(query, device="cpu")
+         with torch.no_grad():
+             predictions = fws_model(support, query)
+         print(classes[predictions.item()])
+         frames = []
+ ```
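
The loop above runs the model once the accumulated buffers cover at least one second of audio. Because 512 does not divide 16000 evenly, each query window ends up slightly longer than one second. The arithmetic can be sketched as follows; `buffers_per_window` is a hypothetical helper name used only for illustration.

```python
import math


def buffers_per_window(sample_rate, frames_per_buffer, seconds=1.0):
    """Return (number of buffers, total samples) needed to cover `seconds`."""
    n_buffers = math.ceil(seconds * sample_rate / frames_per_buffer)
    return n_buffers, n_buffers * frames_per_buffer


n_buffers, n_samples = buffers_per_window(16000, 512)
# Buffers read per query, samples per query, samples beyond one second.
print(n_buffers, n_samples, n_samples - 16000)  # prints: 32 16384 384
```

Each query therefore holds 16,384 samples (about 1.024 s), 384 more than a one-second clip at 16 kHz. Whether the model needs exactly 16,000 samples is not stated here; if it does, trimming with `audio[:sample_rate]` before building the query would be one option.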
+
+ ## Pretrained Model Weights
+ | Language | Encoder Name |
+ | --- | --- |
+ | Multilingual | base_multi |
+ | Multilingual | small_multi |
+ | English | base_en |
+ | Arabic | small_ar |
+ | Czech | small_cs |
+ | German | small_de |
+ | Greek | small_el |
+ | English | small_en |
+ | Estonian | small_et |
+ | Spanish | small_es |
+ | Persian | small_fa |
+ | French | small_fr |
+ | Indonesian | small_id |
+ | Italian | small_it |
+ | Kyrgyz | small_ky |
+ | Dutch | small_nl |
+ | Polish | small_pl |
+ | Portuguese | small_pt |
+ | Russian | small_ru |
+ | Kinyarwanda | small_rw |
+ | Swedish | small_sv-SE |
+ | Turkish | small_tr |
+ | Tatar | small_tt |
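
The usage examples in this README call `model.load(encoder_name="base", language="en", ...)` and `model.load(encoder_name="small", language="nl", ...)`, which line up with the table entries `base_en` and `small_nl`. Assuming every entry follows the same `<encoder>_<language>` pattern (an assumption; in particular, whether `language="multi"` is the accepted value for the multilingual weights is not confirmed here), a table entry can be split into the two arguments like this. `split_weight_name` is a hypothetical helper, not part of `plixkws`.

```python
def split_weight_name(name):
    """Split a table entry such as 'small_nl' into (encoder_name, language)."""
    # Split on the first underscore only, so 'small_sv-SE' keeps 'sv-SE' intact.
    encoder_name, language = name.split("_", 1)
    return encoder_name, language


for entry in ["base_en", "small_nl", "small_sv-SE"]:
    encoder_name, language = split_weight_name(entry)
    print(f'{entry}: encoder_name="{encoder_name}", language="{language}"')
```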
+
+ ## Citation
+ If you find this work useful, please cite our paper:
+ ```
+ @article{saeed2023plix,
+   title={Plug-and-Play Multilingual Few-shot Spoken Words Recognition},
+   author={Saeed, Aaqib and Tsouvalas, Vasileios},
+   journal={arXiv preprint arXiv:2305.03058},
+   year={2023}
+ }
+ ```