Commit 7fee934 (verified) by FireRedTeam · Parent: 85fed29

Update README.md

Files changed (1): README.md (+142 -2)
---
license: apache-2.0
---

<div align="center">
<h1>FireRedASR: Open-Source Industrial-Grade
<br>
Automatic Speech Recognition Models</h1>

Kai-Tuo Xu · Feng-Long Xie · Tang Xu · Yao Hu

</div>
13
+
14
+ [[Code]](https://github.com/FireRedTeam/FireRedASR)
15
+ [[Paper]](https://arxiv.org/pdf/2501.14350)
16
+ [[Model]](https://huggingface.co/fireredteam)
17
+ [[Blog]](https://fireredteam.github.io/demos/firered_asr/)
18
+

FireRedASR is a family of open-source, industrial-grade automatic speech recognition (ASR) models supporting Mandarin, Chinese dialects, and English. It achieves a new state-of-the-art (SOTA) on public Mandarin ASR benchmarks while also offering outstanding singing-lyrics recognition capability.


## 🔥 News
- [2025/01/24] We release the [technical report](https://arxiv.org/pdf/2501.14350), [blog](https://fireredteam.github.io/demos/firered_asr/), and [FireRedASR-AED-L](https://huggingface.co/fireredteam/FireRedASR-AED-L/tree/main) model weights.
- [WIP] We plan to release FireRedASR-LLM-L and other model sizes after the Spring Festival.


## Method

FireRedASR is designed to meet diverse requirements for superior performance and optimal efficiency across various applications. It comprises two variants:
- FireRedASR-LLM: Designed to achieve state-of-the-art (SOTA) performance and to enable seamless end-to-end speech interaction. It adopts an Encoder-Adapter-LLM framework that leverages large language model (LLM) capabilities.
- FireRedASR-AED: Designed to balance high performance and computational efficiency, and to serve as an effective speech representation module in LLM-based speech models. It utilizes an Attention-based Encoder-Decoder (AED) architecture.
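The Encoder-Adapter-LLM data flow can be sketched at the shape level as follows. This is an illustrative sketch only: the function names, downsampling factor, and dimensions (80-dim features, 256-dim encoder, 4096-dim LLM embeddings) are hypothetical stand-ins, not the actual FireRedASR implementation.

```python
# Shape-level sketch of an Encoder-Adapter-LLM flow (hypothetical dims/names,
# not the real FireRedASR model).
import numpy as np

rng = np.random.default_rng(0)

def speech_encoder(feats):
    # Stand-in for the speech encoder: downsample 4x in time and project
    # 80-dim acoustic features to a 256-dim encoder space.
    downsampled = feats[::4]                          # (T, 80) -> (T//4, 80)
    W = rng.standard_normal((feats.shape[1], 256)) * 0.01
    return downsampled @ W                            # (T//4, 256)

def adapter(enc_out):
    # Stand-in for the lightweight adapter that maps encoder outputs into
    # the LLM's input embedding space.
    W = rng.standard_normal((enc_out.shape[1], 4096)) * 0.01
    return enc_out @ W                                # (T//4, 4096)

feats = rng.standard_normal((100, 80))                # ~1 s of fbank frames
speech_embeds = adapter(speech_encoder(feats))
print(speech_embeds.shape)  # (25, 4096)
```

The resulting speech embeddings act as pseudo-tokens that are fed to the LLM alongside text embeddings, which is what enables end-to-end speech interaction.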



## Evaluation
Results are reported in Character Error Rate (CER%) for Chinese and Word Error Rate (WER%) for English.
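For reference, both metrics follow the standard definition: edit distance between hypothesis and reference, divided by reference length (characters for CER, words for WER). A minimal sketch of that definition (illustrative only, not FireRedASR's scoring code):

```python
# Standard CER definition: character-level edit distance / reference length.
# Illustrative sketch, not FireRedASR's own scoring code.
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution / match
    return dp[-1]

def cer(ref, hyp):
    return 100.0 * edit_distance(ref, hyp) / len(ref)

print(cer("今天天气好", "今天天汽好"))  # 1 error over 5 chars -> 20.0
```

WER is computed the same way after splitting both strings into words.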

### Evaluation on Public Mandarin ASR Benchmarks
| Model | #Params | aishell1 | aishell2 | ws\_net | ws\_meeting | Average-4 |
|:----------------:|:-------:|:--------:|:--------:|:-------:|:-----------:|:---------:|
| FireRedASR-LLM | 8.3B | 0.76 | 2.15 | 4.60 | 4.67 | 3.05 |
| FireRedASR-AED | 1.1B | 0.55 | 2.52 | 4.88 | 4.76 | 3.18 |
| Seed-ASR | 12B+ | 0.68 | 2.27 | 4.66 | 5.69 | 3.33 |
| Qwen-Audio | 8.4B | 1.30 | 3.10 | 9.50 | 10.87 | 6.19 |
| SenseVoice-L | 1.6B | 2.09 | 3.04 | 6.01 | 6.73 | 4.47 |
| Whisper-Large-v3 | 1.6B | 5.14 | 4.96 | 10.48 | 18.87 | 9.86 |
| Paraformer-Large | 0.2B | 1.68 | 2.85 | 6.74 | 6.97 | 4.56 |

`ws` means WenetSpeech: `ws_net` and `ws_meeting` are its test_net and test_meeting test sets. `Average-4` is the mean CER over the four test sets.
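The `Average-4` column can be reproduced directly from the four per-set CERs. A quick arithmetic check (not from the source; it assumes the usual half-up rounding to two decimals):

```python
# Verify "Average-4" = unweighted mean CER over the four Mandarin test sets,
# assuming half-up rounding to two decimals.
from decimal import Decimal, ROUND_HALF_UP

reported = {
    "FireRedASR-LLM": (["0.76", "2.15", "4.60", "4.67"], "3.05"),
    "FireRedASR-AED": (["0.55", "2.52", "4.88", "4.76"], "3.18"),
}
for model, (cers, avg4) in reported.items():
    mean = (sum(Decimal(c) for c in cers) / 4).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP)
    print(model, mean, mean == Decimal(avg4))
# FireRedASR-LLM 3.05 True
# FireRedASR-AED 3.18 True
```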

### Evaluation on Public Chinese Dialect and English ASR Benchmarks
| Model | KeSpeech | LibriSpeech test-clean | LibriSpeech test-other |
|:---------------------:|:--------:|:----------------------:|:----------------------:|
| FireRedASR-LLM | 3.56 | 1.73 | 3.67 |
| FireRedASR-AED | 4.48 | 1.93 | 4.44 |
| Previous SOTA Results | 6.70 | 1.82 | 3.50 |


## Usage
Download the model files from [Hugging Face](https://huggingface.co/fireredteam) and place them in the folder `pretrained_models`.


### Setup
Create a Python environment and install the dependencies:
```bash
$ git clone https://github.com/FireRedTeam/FireRedASR.git
$ cd FireRedASR
$ conda create --name fireredasr python=3.10
$ conda activate fireredasr
$ pip install -r requirements.txt
```

Set up the Linux PATH and PYTHONPATH:
```bash
$ export PATH=$PWD/fireredasr/:$PWD/fireredasr/utils/:$PATH
$ export PYTHONPATH=$PWD/:$PYTHONPATH
```

Convert audio to 16 kHz, 16-bit mono PCM WAV format:
```bash
$ ffmpeg -i input_audio -ar 16000 -ac 1 -acodec pcm_s16le -f wav output.wav
```

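To confirm a converted file actually matches this format before inference, the standard library is enough. This is an optional sanity check (illustrative, not part of FireRedASR); `output.wav` refers to the file produced by the ffmpeg command above:

```python
# Optional sanity check (not part of FireRedASR): confirm a WAV file is
# 16 kHz, 16-bit, mono PCM before passing it to the model.
import os
import wave

def check_wav(path):
    with wave.open(path, "rb") as w:
        ok = (w.getframerate() == 16000 and
              w.getsampwidth() == 2 and   # 16-bit samples = 2 bytes
              w.getnchannels() == 1)
        duration = w.getnframes() / w.getframerate()
    return ok, duration

if os.path.exists("output.wav"):
    ok, seconds = check_wav("output.wav")
    print(f"format ok: {ok}, duration: {seconds:.1f}s")
```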
### Quick Start
```bash
$ cd examples/
$ bash inference_fireredasr_aed.sh
$ bash inference_fireredasr_llm.sh
```

### Command-line Usage
```bash
$ speech2text.py --help
$ speech2text.py --wav_path examples/wav/BAC009S0764W0121.wav --asr_type "aed" --model_dir pretrained_models/FireRedASR-AED-L
$ speech2text.py --wav_path examples/wav/BAC009S0764W0121.wav --asr_type "llm" --model_dir pretrained_models/FireRedASR-LLM-L
```

### Python Usage
```python
from fireredasr.models.fireredasr import FireRedAsr

batch_uttid = ["BAC009S0764W0121"]
batch_wav_path = ["examples/wav/BAC009S0764W0121.wav"]

# FireRedASR-AED
model = FireRedAsr.from_pretrained("aed", "pretrained_models/FireRedASR-AED-L")
results = model.transcribe(
    batch_uttid,
    batch_wav_path,
    {
        "use_gpu": 1,
        "beam_size": 3,
        "nbest": 1,
        "decode_max_len": 0,
        "softmax_smoothing": 1.0,
        "aed_length_penalty": 0.0,
        "eos_penalty": 1.0
    }
)
print(results)


# FireRedASR-LLM
model = FireRedAsr.from_pretrained("llm", "pretrained_models/FireRedASR-LLM-L")
results = model.transcribe(
    batch_uttid,
    batch_wav_path,
    {
        "use_gpu": 1,
        "beam_size": 3,
        "decode_max_len": 0,
        "decode_min_len": 0,
        "repetition_penalty": 1.0,
        "llm_length_penalty": 0.0,
        "temperature": 1.0
    }
)
print(results)
```

### Input Length Limitations
- FireRedASR-AED supports audio input up to 60s. Input longer than 60s may cause hallucination issues, and input exceeding 200s will trigger positional-encoding errors.
- FireRedASR-LLM supports audio input up to 30s. The behavior for longer input is currently unknown.
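Given these limits, long recordings must be split before transcription. A naive fixed-length splitter using only the standard library is sketched below; this is a hedged workaround, not part of FireRedASR, and a real pipeline would cut at silence boundaries (e.g. with a VAD) rather than at arbitrary fixed offsets:

```python
# Hedged workaround sketch (not part of FireRedASR): split a long 16 kHz mono
# WAV into chunks of at most max_seconds so each chunk fits the model's limit.
# Fixed-length cuts can split words; prefer VAD-based segmentation in practice.
import wave

def split_wav(path, max_seconds=60, prefix="chunk"):
    paths = []
    with wave.open(path, "rb") as w:
        frames_per_chunk = int(max_seconds * w.getframerate())
        i = 0
        while True:
            data = w.readframes(frames_per_chunk)
            if not data:
                break
            out = f"{prefix}_{i:03d}.wav"
            with wave.open(out, "wb") as o:
                o.setparams(w.getparams())  # same rate/width/channels
                o.writeframes(data)
            paths.append(out)
            i += 1
    return paths  # pass these paths to model.transcribe in batches
```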


## Acknowledgements
Thanks to the following open-source works:
- [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct)
- [icefall/ASR_LLM](https://github.com/k2-fsa/icefall/tree/master/egs/speech_llm/ASR_LLM)
- [WeNet](https://github.com/wenet-e2e/wenet)
- [Speech-Transformer](https://github.com/kaituoxu/Speech-Transformer)