FireRedTeam committed on
Commit 6ff2daf (verified)
1 parent: 8e215b4

Update README.md

Files changed (1):
  1. README.md +14 -8
README.md CHANGED
@@ -7,11 +7,10 @@ license: apache-2.0
  <br>
  Automatic Speech Recognition Models</h1>
 
- Kai-Tuo Xu · Feng-Long Xie · Xu Tang · Yao Hu
 
  </div>
 
- [[Code]](https://github.com/FireRedTeam/FireRedASR)
  [[Paper]](https://arxiv.org/pdf/2501.14350)
  [[Model]](https://huggingface.co/fireredteam)
  [[Blog]](https://fireredteam.github.io/demos/firered_asr/)
@@ -30,6 +29,7 @@ FireRedASR is designed to meet diverse requirements in superior performance and
  - FireRedASR-LLM: Designed to achieve state-of-the-art (SOTA) performance and to enable seamless end-to-end speech interaction. It adopts an Encoder-Adapter-LLM framework leveraging large language model (LLM) capabilities.
  - FireRedASR-AED: Designed to balance high performance and computational efficiency and to serve as an effective speech representation module in LLM-based speech models. It utilizes an Attention-based Encoder-Decoder (AED) architecture.
 
 
 
  ## Evaluation
@@ -59,6 +59,8 @@ Results are reported in Character Error Rate (CER%) for Chinese and Word Error R
  ## Usage
  Download model files from [huggingface](https://huggingface.co/fireredteam) and place them in the folder `pretrained_models`.
 
 
  ### Setup
  Create a Python environment and install dependencies
@@ -81,7 +83,7 @@ ffmpeg -i input_audio -ar 16000 -ac 1 -acodec pcm_s16le -f wav output.wav
 
  ### Quick Start
  ```bash
- $ cd examples/
  $ bash inference_fireredasr_aed.sh
  $ bash inference_fireredasr_llm.sh
  ```
@@ -110,8 +112,8 @@ results = model.transcribe(
  "beam_size": 3,
  "nbest": 1,
  "decode_max_len": 0,
- "softmax_smoothing": 1.0,
- "aed_length_penalty": 0.0,
  "eos_penalty": 1.0
  }
  )
@@ -128,14 +130,18 @@ results = model.transcribe(
  "beam_size": 3,
  "decode_max_len": 0,
  "decode_min_len": 0,
- "repetition_penalty": 1.0,
- "llm_length_penalty": 0.0,
  "temperature": 1.0
  }
  )
  print(results)
  ```
 
  ### Input Length Limitations
  - FireRedASR-AED supports audio input up to 60s. Input longer than 60s may cause hallucination issues, and input exceeding 200s will trigger positional encoding errors.
  - FireRedASR-LLM supports audio input up to 30s. The behavior for longer input is currently unknown.
@@ -146,4 +152,4 @@ Thanks to the following open-source works:
  - [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct)
  - [icefall/ASR_LLM](https://github.com/k2-fsa/icefall/tree/master/egs/speech_llm/ASR_LLM)
  - [WeNet](https://github.com/wenet-e2e/wenet)
- - [Speech-Transformer](https://github.com/kaituoxu/Speech-Transformer)
 
@@ -7,11 +7,10 @@ license: apache-2.0
  <br>
  Automatic Speech Recognition Models</h1>
 
+ [Kai-Tuo Xu](https://github.com/kaituoxu)[Feng-Long Xie](https://scholar.google.com/citations?user=bi8ExI4AAAAJ&hl=zh-CN&oi=sra)[Xu Tang](https://scholar.google.com/citations?user=grP24aAAAAAJ&hl=zh-CN&oi=sra)[Yao Hu](https://scholar.google.com/citations?user=LIu7k7wAAAAJ&hl=zh-CN)
 
  </div>
 
  [[Paper]](https://arxiv.org/pdf/2501.14350)
  [[Model]](https://huggingface.co/fireredteam)
  [[Blog]](https://fireredteam.github.io/demos/firered_asr/)
@@ -30,6 +29,7 @@ FireRedASR is designed to meet diverse requirements in superior performance and
  - FireRedASR-LLM: Designed to achieve state-of-the-art (SOTA) performance and to enable seamless end-to-end speech interaction. It adopts an Encoder-Adapter-LLM framework leveraging large language model (LLM) capabilities.
  - FireRedASR-AED: Designed to balance high performance and computational efficiency and to serve as an effective speech representation module in LLM-based speech models. It utilizes an Attention-based Encoder-Decoder (AED) architecture.
 
+ ![Model](/assets/FireRedASR_model.png)
 
 
  ## Evaluation
 
@@ -59,6 +59,8 @@ Results are reported in Character Error Rate (CER%) for Chinese and Word Error R
  ## Usage
  Download model files from [huggingface](https://huggingface.co/fireredteam) and place them in the folder `pretrained_models`.
 
+ If you want to use `FireRedASR-LLM-L`, you also need to download [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) and place it in the folder `pretrained_models`. Then, go to folder `FireRedASR-LLM-L` and run `$ ln -s ../Qwen2-7B-Instruct`
+
 
  ### Setup
  Create a Python environment and install dependencies
 
@@ -81,7 +83,7 @@ ffmpeg -i input_audio -ar 16000 -ac 1 -acodec pcm_s16le -f wav output.wav
 
  ### Quick Start
  ```bash
+ $ cd examples
  $ bash inference_fireredasr_aed.sh
  $ bash inference_fireredasr_llm.sh
  ```
 
@@ -110,8 +112,8 @@ results = model.transcribe(
  "beam_size": 3,
  "nbest": 1,
  "decode_max_len": 0,
+ "softmax_smoothing": 1.25,
+ "aed_length_penalty": 0.6,
  "eos_penalty": 1.0
  }
  )
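The commit moves the AED decoding defaults off their neutral values (`softmax_smoothing` 1.0 → 1.25, `aed_length_penalty` 0.0 → 0.6). As a rough illustration of what such knobs typically do in attention-based beam search — the helper names and the GNMT-style length normalization below are assumptions for illustration, not FireRedASR's confirmed internals:

```python
import math

def smoothed_log_softmax(logits, smoothing=1.25):
    """Softmax smoothing as commonly implemented: scale logits by
    `smoothing` before normalizing (values > 1 sharpen the distribution)."""
    scaled = [x * smoothing for x in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - lse for x in scaled]

def length_normalized_score(sum_logprob, length, penalty=0.6):
    """GNMT-style length normalization: divide the summed log-prob by
    ((5 + len) / 6) ** penalty so beam search stops unduly favoring
    short hypotheses. penalty=0 recovers the raw sum."""
    lp = ((5.0 + length) / 6.0) ** penalty
    return sum_logprob / lp

# With normalization, a longer hypothesis with the same per-token
# log-prob competes more fairly against a shorter one:
short_hyp = length_normalized_score(-4.0, length=4)
long_hyp = length_normalized_score(-8.0, length=8)
```

With `penalty=0.6` the raw 4-nat gap between the two hypotheses shrinks to under 2 nats, which is the usual motivation for a nonzero length penalty.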
 
@@ -128,14 +130,18 @@ results = model.transcribe(
  "beam_size": 3,
  "decode_max_len": 0,
  "decode_min_len": 0,
+ "repetition_penalty": 3.0,
+ "llm_length_penalty": 1.0,
  "temperature": 1.0
  }
  )
  print(results)
  ```
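For the LLM decoder, `repetition_penalty` moves from the neutral 1.0 to 3.0. A common (CTRL-style) reading of a repetition penalty is sketched below; this is an assumption about the mechanism, since the diff does not show FireRedASR-LLM's decoder code:

```python
def apply_repetition_penalty(logits, generated_ids, penalty=3.0):
    """CTRL-style repetition penalty: make already-generated tokens less
    likely by shrinking their positive logits (divide by penalty) and
    pushing negative ones further down (multiply by penalty)."""
    out = list(logits)
    for tok in set(generated_ids):
        if out[tok] > 0:
            out[tok] /= penalty
        else:
            out[tok] *= penalty
    return out

logits = [2.0, -1.0, 0.5]
penalized = apply_repetition_penalty(logits, generated_ids=[0, 1], penalty=3.0)
# token 0: 2.0 -> 2.0/3.0, token 1: -1.0 -> -3.0, token 2 untouched
```

A value as high as 3.0 suppresses repeats aggressively, which fits the repetition issues called out for this model below.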
 
+ ## Usage Tips
+ ### Batch Beam Search
+ - When performing batch beam search with FireRedASR-LLM, please ensure that the input lengths of the utterances are similar. If there are significant differences in utterance lengths, shorter utterances may experience repetition issues. You can either sort your dataset by length or set `batch_size` to 1 to avoid the repetition issue.
+
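The sort-by-length tip added above can be implemented as a small pre-batching step; `batches_by_duration` is a hypothetical helper for illustration, not part of the FireRedASR API:

```python
def batches_by_duration(utts, batch_size=4):
    """utts: list of (uttid, duration_seconds) pairs. Sort by duration so
    each batch holds similar-length utterances, avoiding the repetition
    issue short utterances can hit in mixed-length batches."""
    ordered = sorted(utts, key=lambda u: u[1])
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

utts = [("a", 12.0), ("b", 3.1), ("c", 11.5), ("d", 2.9)]
batches = batches_by_duration(utts, batch_size=2)
# -> [[("d", 2.9), ("b", 3.1)], [("c", 11.5), ("a", 12.0)]]
```

Each resulting batch can then be passed to `model.transcribe` in turn; setting `batch_size=1` degenerates to the other workaround the tip mentions.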
  ### Input Length Limitations
  - FireRedASR-AED supports audio input up to 60s. Input longer than 60s may cause hallucination issues, and input exceeding 200s will trigger positional encoding errors.
  - FireRedASR-LLM supports audio input up to 30s. The behavior for longer input is currently unknown.
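For recordings beyond these limits, one workaround (not part of the documented API) is to split the audio into windows before transcription; `chunk_offsets` below is a hypothetical helper using naive fixed windows, whereas a production pipeline would prefer to cut at silences:

```python
def chunk_offsets(total_seconds, max_chunk=60.0):
    """Split a long recording into (start, end) windows of at most
    max_chunk seconds so each piece stays under the model's input limit."""
    offsets = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_chunk, total_seconds)
        offsets.append((start, end))
        start = end
    return offsets

chunk_offsets(150.0)  # -> [(0.0, 60.0), (60.0, 120.0), (120.0, 150.0)]
```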
 
@@ -146,4 +152,4 @@ Thanks to the following open-source works:
  - [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct)
  - [icefall/ASR_LLM](https://github.com/k2-fsa/icefall/tree/master/egs/speech_llm/ASR_LLM)
  - [WeNet](https://github.com/wenet-e2e/wenet)
+ - [Speech-Transformer](https://github.com/kaituoxu/Speech-Transformer)