Update README.md

README.md CHANGED

@@ -7,11 +7,10 @@ license: apache-2.0
 <br>
 Automatic Speech Recognition Models</h1>
 
-Kai-Tuo Xu · Feng-Long Xie · Xu Tang · Yao Hu
+[Kai-Tuo Xu](https://github.com/kaituoxu) · [Feng-Long Xie](https://scholar.google.com/citations?user=bi8ExI4AAAAJ&hl=zh-CN&oi=sra) · [Xu Tang](https://scholar.google.com/citations?user=grP24aAAAAAJ&hl=zh-CN&oi=sra) · [Yao Hu](https://scholar.google.com/citations?user=LIu7k7wAAAAJ&hl=zh-CN)
 
 </div>
 
-[[Code]](https://github.com/FireRedTeam/FireRedASR)
 [[Paper]](https://arxiv.org/pdf/2501.14350)
 [[Model]](https://huggingface.co/fireredteam)
 [[Blog]](https://fireredteam.github.io/demos/firered_asr/)

@@ -30,6 +29,7 @@ FireRedASR is designed to meet diverse requirements in superior performance and
 - FireRedASR-LLM: Designed to achieve state-of-the-art (SOTA) performance and to enable seamless end-to-end speech interaction. It adopts an Encoder-Adapter-LLM framework leveraging large language model (LLM) capabilities.
 - FireRedASR-AED: Designed to balance high performance and computational efficiency and to serve as an effective speech representation module in LLM-based speech models. It utilizes an Attention-based Encoder-Decoder (AED) architecture.
 
+
 
 
 ## Evaluation
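
The Encoder-Adapter-LLM framework described in the hunk above can be pictured as three stages: an acoustic encoder turns audio features into frame-level representations, a small adapter projects those frames into the LLM's embedding space, and the LLM decodes text conditioned on the projected frames plus a text prompt. A toy PyTorch sketch of that data flow; the module choices and dimensions are illustrative only and are not the repository's actual classes:

```python
import torch
import torch.nn as nn

class ToyEncoderAdapterLLM(nn.Module):
    """Illustration of the Encoder-Adapter-LLM data flow, not the real model."""

    def __init__(self, n_mels=80, d_enc=512, d_llm=3584):
        super().__init__()
        # Stand-in acoustic encoder; a real system uses a deep Conformer/Transformer
        # encoder with temporal subsampling.
        self.encoder = nn.GRU(n_mels, d_enc, batch_first=True)
        # Adapter: maps encoder features into the LLM embedding space.
        self.adapter = nn.Sequential(
            nn.Linear(d_enc, d_llm), nn.ReLU(), nn.Linear(d_llm, d_llm)
        )

    def forward(self, feats, prompt_embeds):
        # feats: (batch, frames, n_mels); prompt_embeds: (batch, tokens, d_llm)
        enc_out, _ = self.encoder(feats)        # (batch, frames, d_enc)
        speech_embeds = self.adapter(enc_out)   # (batch, frames, d_llm)
        # The LLM would then decode from [speech embeddings ; prompt embeddings].
        return torch.cat([speech_embeds, prompt_embeds], dim=1)

# Shape check only; a real pipeline passes the result through the LLM.
x = ToyEncoderAdapterLLM()(torch.randn(2, 100, 80), torch.randn(2, 8, 3584))
print(x.shape)  # torch.Size([2, 108, 3584])
```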

@@ -59,6 +59,8 @@ Results are reported in Character Error Rate (CER%) for Chinese and Word Error Rate (WER%) for English.
 ## Usage
 Download model files from [huggingface](https://huggingface.co/fireredteam) and place them in the folder `pretrained_models`.
 
+If you want to use `FireRedASR-LLM-L`, you also need to download [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) and place it in the folder `pretrained_models`. Then, go to the folder `FireRedASR-LLM-L` and run `$ ln -s ../Qwen2-7B-Instruct`.
+
 
 ### Setup
 Create a Python environment and install dependencies
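
The added note above can be scripted end to end. A sketch using `huggingface_hub.snapshot_download` plus a relative symlink equivalent to `ln -s ../Qwen2-7B-Instruct`; the FireRedASR repo id and local layout are assumptions based on the organization page linked above:

```python
from pathlib import Path
from huggingface_hub import snapshot_download  # pip install huggingface_hub

models_dir = Path("pretrained_models")
models_dir.mkdir(exist_ok=True)

# Repo ids are assumptions inferred from the huggingface organization link.
for repo_id in ["fireredteam/FireRedASR-LLM-L", "Qwen/Qwen2-7B-Instruct"]:
    snapshot_download(repo_id=repo_id, local_dir=models_dir / repo_id.split("/")[-1])

# Equivalent of `cd FireRedASR-LLM-L && ln -s ../Qwen2-7B-Instruct`.
link = models_dir / "FireRedASR-LLM-L" / "Qwen2-7B-Instruct"
if not link.exists():
    link.symlink_to(Path("..") / "Qwen2-7B-Instruct")
```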

@@ -81,7 +83,7 @@ ffmpeg -i input_audio -ar 16000 -ac 1 -acodec pcm_s16le -f wav output.wav
 
 ### Quick Start
 ```bash
-$ cd examples
+$ cd examples
 $ bash inference_fireredasr_aed.sh
 $ bash inference_fireredasr_llm.sh
 ```
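
The hunk header above carries the README's ffmpeg command for preparing 16 kHz mono 16-bit PCM WAV input. A small wrapper around that same command for converting files in bulk before inference; the helper and folder names are not part of the repository:

```python
import subprocess
from pathlib import Path

def to_16k_mono_wav(input_audio: str, output_wav: str) -> None:
    """Convert any ffmpeg-readable input to 16 kHz mono 16-bit PCM WAV,
    mirroring the ffmpeg command shown in the README."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", input_audio,
         "-ar", "16000", "-ac", "1", "-acodec", "pcm_s16le", "-f", "wav",
         output_wav],
        check=True,
    )

# Hypothetical folders: convert everything in raw_audio/ into wav/.
out_dir = Path("wav")
out_dir.mkdir(exist_ok=True)
for path in Path("raw_audio").glob("*"):
    to_16k_mono_wav(str(path), str(out_dir / (path.stem + ".wav")))
```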

@@ -110,8 +112,8 @@ results = model.transcribe(
 "beam_size": 3,
 "nbest": 1,
 "decode_max_len": 0,
-"softmax_smoothing": 1.0,
-"aed_length_penalty": 0.0,
+"softmax_smoothing": 1.25,
+"aed_length_penalty": 0.6,
 "eos_penalty": 1.0
 }
 )
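
The hunk above changes the decoding options passed to the AED example's `model.transcribe(...)` call (the call itself is truncated in the hunk header). Collected as a plain dict for reference, with the editor's gloss on each option; the comments are interpretive readings of standard beam-search knobs, not official documentation:

```python
# FireRedASR-AED decoding options as they read after this change.
aed_decode_args = {
    "beam_size": 3,             # beam width used during beam search
    "nbest": 1,                 # number of hypotheses returned per utterance
    "decode_max_len": 0,        # maximum output length; 0 appears to mean no explicit cap
    "softmax_smoothing": 1.25,  # temperature-like smoothing of output probabilities during search
    "aed_length_penalty": 0.6,  # length penalty / normalization weight in hypothesis scoring
    "eos_penalty": 1.0,         # scaling on the end-of-sentence score; 1.0 is neutral
}

# The README's surrounding call (truncated in the hunk header) passes such a
# dict to model.transcribe(...); the line below is indicative only.
# results = model.transcribe(batch_uttid, batch_wav_path, aed_decode_args)
```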

@@ -128,14 +130,18 @@ results = model.transcribe(
 "beam_size": 3,
 "decode_max_len": 0,
 "decode_min_len": 0,
-"repetition_penalty": 1.0,
-"llm_length_penalty": 0.0,
+"repetition_penalty": 3.0,
+"llm_length_penalty": 1.0,
 "temperature": 1.0
 }
 )
 print(results)
 ```
 
+## Usage Tips
+### Batch Beam Search
+- When performing batch beam search with FireRedASR-LLM, please ensure that the input lengths of the utterances are similar. If there are significant differences in utterance lengths, shorter utterances may experience repetition issues. You can either sort your dataset by length or set `batch_size` to 1 to avoid this issue.
+
 ### Input Length Limitations
 - FireRedASR-AED supports audio input up to 60s. Input longer than 60s may cause hallucination issues, and input exceeding 200s will trigger positional encoding errors.
 - FireRedASR-LLM supports audio input up to 30s. The behavior for longer input is currently unknown.
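
The new Batch Beam Search tip recommends keeping utterance lengths similar within a batch, or falling back to `batch_size` 1. One way to follow it is to sort inputs by audio duration before forming batches; a sketch for 16 kHz WAV files, with illustrative folder names and batch size:

```python
import wave
from pathlib import Path

def wav_duration_seconds(path: Path) -> float:
    # Works for the 16 kHz PCM WAVs produced by the ffmpeg command above.
    with wave.open(str(path), "rb") as w:
        return w.getnframes() / w.getframerate()

# Sort by duration so each batch contains utterances of similar length.
wavs = sorted(Path("wav").glob("*.wav"), key=wav_duration_seconds)

batch_size = 4  # illustrative; use 1 to sidestep the repetition issue entirely
batches = [wavs[i:i + batch_size] for i in range(0, len(wavs), batch_size)]

for batch in batches:
    # Feed each length-homogeneous batch to FireRedASR-LLM batch decoding.
    print([p.name for p in batch])
```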

@@ -146,4 +152,4 @@ Thanks to the following open-source works:
 - [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct)
 - [icefall/ASR_LLM](https://github.com/k2-fsa/icefall/tree/master/egs/speech_llm/ASR_LLM)
 - [WeNet](https://github.com/wenet-e2e/wenet)
-- [Speech-Transformer](https://github.com/kaituoxu/Speech-Transformer)
+- [Speech-Transformer](https://github.com/kaituoxu/Speech-Transformer)