---
pipeline_tag: voice-activity-detection
tags:
- FunASR
- FSMN-VAD
---

## Introduction

Voice activity detection (VAD) plays an important role in speech recognition systems by detecting the beginning and end of effective speech. FunASR provides an efficient VAD model based on the [FSMN structure](https://arxiv.org/abs/1803.05030). To improve the model's discrimination, we use monophones as modeling units, since they carry relatively rich speech information. During inference, the VAD system applies post-processing for improved robustness, including threshold settings and sliding windows.

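As an illustration of this kind of post-processing (a toy sketch, not FunASR's actual implementation; the window size, threshold, and input probabilities are all made up), the snippet below smooths frame-level speech probabilities with a sliding window and applies a threshold:

```python
import numpy as np

def smooth_and_threshold(frame_probs, win=5, threshold=0.5):
    """Toy VAD post-processing: sliding-window average + threshold."""
    kernel = np.ones(win) / win
    # The sliding-window average suppresses isolated spikes and dropouts
    smoothed = np.convolve(frame_probs, kernel, mode="same")
    # Frames whose smoothed probability clears the threshold count as speech
    return smoothed > threshold

# Hypothetical frame probabilities: silence, then speech, then silence
probs = np.array([0.1, 0.2, 0.1, 0.8, 0.9, 0.95, 0.9, 0.2, 0.1])
print(smooth_and_threshold(probs))
```
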
This repository demonstrates how to use FSMN-VAD with the funasr_onnx runtime. The underlying model is derived from [FunASR](https://github.com/alibaba-damo-academy/FunASR) and was trained on a 60,000-hour Mandarin dataset. Notably, Paraformer secured the top spot on the [SpeechIO leaderboard](https://github.com/SpeechColab/Leaderboard), highlighting its strength in speech recognition.

We have released numerous industrial-grade models, covering speech recognition, voice activity detection, punctuation restoration, speaker verification, speaker diarization, and timestamp prediction (force alignment). To learn more about these models, please refer to the [documentation](https://alibaba-damo-academy.github.io/FunASR/en/index.html). If you are interested in leveraging advanced speech technology in your projects, we invite you to explore [FunASR](https://github.com/alibaba-damo-academy/FunASR).

## Install funasr_onnx

```shell
pip install -U funasr_onnx
# For users in China, you can install from the SJTU mirror:
# pip install -U funasr_onnx -i https://mirror.sjtu.edu.cn/pypi/web/simple
```

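A quick sanity check that the package imports (not part of the official instructions):

```shell
python -c "from funasr_onnx import Fsmn_vad; print('funasr_onnx OK')"
```
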
## Download the model

```shell
git clone https://huggingface.co/funasr/FSMN-VAD
```

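After cloning, the model directory should contain at least the files named in the `model_dir` description below; a quick check, assuming the standard layout:

```shell
ls FSMN-VAD
# expected, among others: model.onnx  config.yaml  am.mvn
```
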
## Inference with runtime

### Voice Activity Detection
#### FSMN-VAD

```python
from funasr_onnx import Fsmn_vad

# Load the (quantized) ONNX model from the local model directory
model_dir = "./FSMN-VAD"
model = Fsmn_vad(model_dir, quantize=True)

wav_path = "./FSMN-VAD/asr_example.wav"

# Run VAD on the wav file and print the detected segments
result = model(wav_path)
print(result)
```

- `model_dir`: the model path, which contains `model.onnx`, `config.yaml`, `am.mvn`
- `batch_size`: `1` (default), the batch size used during inference
- `device_id`: `-1` (default), infer on CPU. To infer on GPU, set it to the GPU id (make sure you have installed `onnxruntime-gpu`); see the sketch after this list
- `quantize`: `False` (default), load `model.onnx` from `model_dir`; if `True`, load `model_quant.onnx` from `model_dir`
- `intra_op_num_threads`: `4` (default), the number of threads used for intra-op parallelism on CPU

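For example, a minimal sketch of loading the non-quantized model on GPU 0 with the options above (assumes `onnxruntime-gpu` is installed):

```python
from funasr_onnx import Fsmn_vad

# quantize=False loads model.onnx; device_id=0 selects the first GPU
model = Fsmn_vad(
    "./FSMN-VAD",
    batch_size=1,
    device_id=0,
    quantize=False,
    intra_op_num_threads=4,
)
```
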
Input: audio in wav format; supported input types: `str`, `np.ndarray`, `List[str]`

Output: the detected speech segments, each a `[start, end]` pair in milliseconds

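As a sketch of consuming the output, the snippet below slices the original audio by the returned segments. It assumes the segment format described above, that the result is indexed per input file, and uses `soundfile`, which funasr_onnx does not itself require:

```python
import soundfile as sf
from funasr_onnx import Fsmn_vad

model = Fsmn_vad("./FSMN-VAD", quantize=True)
wav_path = "./FSMN-VAD/asr_example.wav"

# Assumption: one list of [start_ms, end_ms] segments per input file
segments = model(wav_path)[0]

audio, sr = sf.read(wav_path)
for start_ms, end_ms in segments:
    # Convert millisecond boundaries to sample indices
    clip = audio[int(start_ms * sr / 1000):int(end_ms * sr / 1000)]
    print(f"speech {start_ms}-{end_ms} ms: {len(clip)} samples")
```
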
## Citations

```bibtex
@inproceedings{gao2022paraformer,
  title={Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition},
  author={Gao, Zhifu and Zhang, Shiliang and McLoughlin, Ian and Yan, Zhijie},
  booktitle={INTERSPEECH},
  year={2022}
}
```