---
license: apache-2.0
---

<div align="center">
<h1>FireRedASR: Open-Source Industrial-Grade
<br>
Automatic Speech Recognition Models</h1>

Kai-Tuo Xu · Feng-Long Xie · Tang Xu · Yao Hu

</div>

[[Code]](https://github.com/FireRedTeam/FireRedASR)
[[Paper]](https://arxiv.org/pdf/2501.14350)
[[Model]](https://huggingface.co/fireredteam)
[[Blog]](https://fireredteam.github.io/demos/firered_asr/)

FireRedASR is a family of open-source, industrial-grade automatic speech recognition (ASR) models supporting Mandarin, Chinese dialects, and English. It achieves a new state-of-the-art (SOTA) on public Mandarin ASR benchmarks while also offering outstanding singing-lyrics recognition.

## 🔥 News
- [2025/01/24] We release the [technical report](https://arxiv.org/pdf/2501.14350), [blog](https://fireredteam.github.io/demos/firered_asr/), and [FireRedASR-AED-L](https://huggingface.co/fireredteam/FireRedASR-AED-L/tree/main) model weights.
- [WIP] We plan to release FireRedASR-LLM-L and other model sizes after the Spring Festival.

27 |
+
## Method
|
28 |
|
29 |
FireRedASR is designed to meet diverse requirements in superior performance and optimal efficiency across various applications. It comprises two variants:
|
30 |
- FireRedASR-LLM: Designed to achieve state-of-the-art (SOTA) performance and to enable seamless end-to-end speech interaction. It adopts an Encoder-Adapter-LLM framework leveraging large language model (LLM) capabilities.
|
31 |
+
- FireRedASR-AED: Designed to balance high performance and computational efficiency and to serve as an effective speech representation module in LLM-based speech models. It utilizes an Attention-based Encoder-Decoder (AED) architecture.
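
The Encoder-Adapter-LLM idea can be illustrated with a minimal sketch (illustrative only, not the actual implementation; the frame-stacking stride and linear projection are assumptions): the adapter maps encoder output frames into the LLM's embedding space so they can be consumed alongside text tokens.

```python
import numpy as np

def adapter(encoder_out: np.ndarray, proj: np.ndarray, stride: int = 2) -> np.ndarray:
    """Map encoder frames (T, d_enc) into the LLM embedding space (T', d_llm).

    Stacks every `stride` consecutive frames (temporal downsampling),
    then applies a linear projection `proj` of shape (d_enc * stride, d_llm).
    Illustrative sketch only; not the actual FireRedASR adapter.
    """
    T, d_enc = encoder_out.shape
    T2 = T // stride  # drop a trailing remainder frame, if any
    stacked = encoder_out[: T2 * stride].reshape(T2, d_enc * stride)
    return stacked @ proj  # pseudo-token embeddings fed to the LLM
```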

## Evaluation
Results are reported as Character Error Rate (CER%) for Chinese and Word Error Rate (WER%) for English.

### Evaluation on Public Mandarin ASR Benchmarks
| Model | #Params | aishell1 | aishell2 | ws\_net | ws\_meeting | Average-4 |
|:----------------:|:-------:|:--------:|:--------:|:-------:|:-----------:|:---------:|
| FireRedASR-LLM   | 8.3B    | 0.76     | 2.15     | 4.60    | 4.67        | 3.05      |
| FireRedASR-AED   | 1.1B    | 0.55     | 2.52     | 4.88    | 4.76        | 3.18      |
| Seed-ASR         | 12B+    | 0.68     | 2.27     | 4.66    | 5.69        | 3.33      |
| Qwen-Audio       | 8.4B    | 1.30     | 3.10     | 9.50    | 10.87       | 6.19      |
| SenseVoice-L     | 1.6B    | 2.09     | 3.04     | 6.01    | 6.73        | 4.47      |
| Whisper-Large-v3 | 1.6B    | 5.14     | 4.96     | 10.48   | 18.87       | 9.86      |
| Paraformer-Large | 0.2B    | 1.68     | 2.85     | 6.74    | 6.97        | 4.56      |

`ws` means WenetSpeech.

### Evaluation on Public Chinese Dialect and English ASR Benchmarks
| Test Set | KeSpeech | LibriSpeech test-clean | LibriSpeech test-other |
|:--------------------:|:--------:|:----------------------:|:----------------------:|
| FireRedASR-LLM        | 3.56 | 1.73 | 3.67 |
| FireRedASR-AED        | 4.48 | 1.93 | 4.44 |
| Previous SOTA Results | 6.70 | 1.82 | 3.50 |
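
For reference, CER and WER are just the length-normalized edit distance between hypothesis and reference (characters for CER, words for WER). A minimal sketch (hypothetical helper, not part of the FireRedASR codebase):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance via single-row dynamic programming."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                           # deletion
                        dp[j - 1] + 1,                       # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))   # substitution
            prev = cur
    return dp[n]

def cer(ref, hyp):
    """Character Error Rate in percent."""
    return 100.0 * edit_distance(list(ref), list(hyp)) / len(ref)

def wer(ref, hyp):
    """Word Error Rate in percent (whitespace tokenization)."""
    return 100.0 * edit_distance(ref.split(), hyp.split()) / len(ref.split())
```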


## Usage
Download the model files from [Hugging Face](https://huggingface.co/fireredteam) and place them in the folder `pretrained_models`.

### Setup
Create a Python environment and install the dependencies:
```bash
$ git clone https://github.com/FireRedTeam/FireRedASR.git
$ cd FireRedASR
$ conda create --name fireredasr python=3.10
$ conda activate fireredasr
$ pip install -r requirements.txt
```

Set up Linux `PATH` and `PYTHONPATH`:
```bash
$ export PATH=$PWD/fireredasr/:$PWD/fireredasr/utils/:$PATH
$ export PYTHONPATH=$PWD/:$PYTHONPATH
```

Convert audio to 16 kHz, 16-bit, mono PCM WAV format:
```bash
$ ffmpeg -i input_audio -ar 16000 -ac 1 -acodec pcm_s16le -f wav output.wav
```
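
To convert many files at once, a small wrapper can build the same command per file (a sketch, not part of the repo; it assumes `ffmpeg` is on `PATH` and that every file in the source directory is audio):

```python
import pathlib
import subprocess

def ffmpeg_cmd(src: str, dst: str) -> list:
    """Build the 16 kHz / mono / 16-bit PCM conversion command."""
    return ["ffmpeg", "-y", "-i", src,
            "-ar", "16000", "-ac", "1", "-acodec", "pcm_s16le",
            "-f", "wav", dst]

def convert_dir(src_dir: str, dst_dir: str) -> None:
    """Convert every file in src_dir to a 16 kHz mono wav in dst_dir."""
    dst = pathlib.Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for src in sorted(pathlib.Path(src_dir).iterdir()):
        subprocess.run(ffmpeg_cmd(str(src), str(dst / (src.stem + ".wav"))),
                       check=True)
```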

### Quick Start
```bash
$ cd examples/
$ bash inference_fireredasr_aed.sh
$ bash inference_fireredasr_llm.sh
```

### Command-line Usage
```bash
$ speech2text.py --help
$ speech2text.py --wav_path examples/wav/BAC009S0764W0121.wav --asr_type "aed" --model_dir pretrained_models/FireRedASR-AED-L
$ speech2text.py --wav_path examples/wav/BAC009S0764W0121.wav --asr_type "llm" --model_dir pretrained_models/FireRedASR-LLM-L
```

### Python Usage
```python
from fireredasr.models.fireredasr import FireRedAsr

batch_uttid = ["BAC009S0764W0121"]
batch_wav_path = ["examples/wav/BAC009S0764W0121.wav"]

# FireRedASR-AED
model = FireRedAsr.from_pretrained("aed", "pretrained_models/FireRedASR-AED-L")
results = model.transcribe(
    batch_uttid,
    batch_wav_path,
    {
        "use_gpu": 1,
        "beam_size": 3,
        "nbest": 1,
        "decode_max_len": 0,
        "softmax_smoothing": 1.0,
        "aed_length_penalty": 0.0,
        "eos_penalty": 1.0
    }
)
print(results)


# FireRedASR-LLM
model = FireRedAsr.from_pretrained("llm", "pretrained_models/FireRedASR-LLM-L")
results = model.transcribe(
    batch_uttid,
    batch_wav_path,
    {
        "use_gpu": 1,
        "beam_size": 3,
        "decode_max_len": 0,
        "decode_min_len": 0,
        "repetition_penalty": 1.0,
        "llm_length_penalty": 0.0,
        "temperature": 1.0
    }
)
print(results)
```
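
`transcribe` returns one entry per input utterance. Assuming each entry is a dict with `uttid` and `text` keys (an assumption; verify against your installed version), the results can be dumped to a Kaldi-style `text` file:

```python
def write_text_file(results, path):
    """Write one 'uttid transcription' line per result (assumed schema)."""
    with open(path, "w", encoding="utf-8") as f:
        for r in results:
            f.write(f'{r["uttid"]} {r["text"]}\n')
```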

### Input Length Limitations
- FireRedASR-AED supports audio input up to 60s. Input longer than 60s may cause hallucinations, and input exceeding 200s will trigger positional-encoding errors.
- FireRedASR-LLM supports audio input up to 30s. Behavior on longer input is currently untested.
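
For longer recordings, one workaround is to split the audio into chunks under the limit before transcription. A minimal standard-library sketch (hypothetical helper, not part of the repo; it assumes 16 kHz PCM wav input and cuts at fixed offsets, whereas a real pipeline would prefer cutting at silences):

```python
import wave

def split_wav(path, out_prefix, max_sec=30):
    """Write consecutive chunks of at most max_sec seconds; return their paths."""
    paths = []
    with wave.open(path, "rb") as w:
        frames_per_chunk = max_sec * w.getframerate()
        idx = 0
        while True:
            data = w.readframes(frames_per_chunk)
            if not data:
                break
            out = f"{out_prefix}_{idx:03d}.wav"
            with wave.open(out, "wb") as o:
                # Preserve the source format (channels, sample width, rate)
                o.setnchannels(w.getnchannels())
                o.setsampwidth(w.getsampwidth())
                o.setframerate(w.getframerate())
                o.writeframes(data)
            paths.append(out)
            idx += 1
    return paths
```

Each chunk can then be passed to `transcribe` and the texts joined in order.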


## Acknowledgements
Thanks to the following open-source works:
- [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct)
- [icefall/ASR_LLM](https://github.com/k2-fsa/icefall/tree/master/egs/speech_llm/ASR_LLM)
- [WeNet](https://github.com/wenet-e2e/wenet)
- [Speech-Transformer](https://github.com/kaituoxu/Speech-Transformer)