  - chat
  - audio
---

<p align="center">
<img src="https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/static/images/seallm-audio-logo.png" alt="SeaLLMs-Audio" width="20%">
</p>

# SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia

<p align="center">
<a href="https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/" target="_blank" rel="noopener">Website</a>
&nbsp;&nbsp;
<a href="https://huggingface.co/spaces/SeaLLMs/SeaLLMs-Audio-Demo" target="_blank" rel="noopener">🤗 DEMO</a>
&nbsp;&nbsp;
<a href="https://github.com/DAMO-NLP-SG/SeaLLMs-Audio" target="_blank" rel="noopener">GitHub</a>
&nbsp;&nbsp;
<a href="https://huggingface.co/SeaLLMs/SeaLLMs-Audio-7B" target="_blank" rel="noopener">🤗 Model</a>
&nbsp;&nbsp;
<!-- <a href="https://arxiv.org/pdf/2407.19672" target="_blank" rel="noopener">[NEW] Technical Report</a> -->
</p>

We introduce **SeaLLMs-Audio**, the multimodal (audio) extension of the [SeaLLMs](https://damo-nlp-sg.github.io/DAMO-SeaLLMs/) (Large Language Models for Southeast Asian languages) family. It is the first large audio-language model (LALM) designed to support multiple Southeast Asian languages, including **Indonesian (id), Thai (th), and Vietnamese (vi), alongside English (en) and Chinese (zh)**.

Trained on a large-scale audio dataset, SeaLLMs-Audio demonstrates strong performance across a range of audio tasks, from audio analysis to voice-based interaction. As a significant step toward advancing audio LLMs in Southeast Asia, we hope SeaLLMs-Audio will benefit both the research community and industry in the region.

### Key Features of SeaLLMs-Audio

- **Multilingual**: The model mainly supports 5 languages: 🇮🇩 Indonesian, 🇹🇭 Thai, 🇻🇳 Vietnamese, 🇬🇧 English, and 🇨🇳 Chinese.
- **Multimodal**: The model supports flexible input formats: **audio only, text only, and audio with text** (see the sketch after this list).
- **Multi-task**: The model supports a variety of tasks, including audio analysis tasks such as audio captioning, automatic speech recognition, speech-to-text translation, speech emotion recognition, speech question answering, and speech summarization. It also handles voice chat tasks, including answering factual, mathematical, and other general questions.
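
For illustration, the three input formats correspond to conversations like the following. This is a sketch using the same message schema as the Quickstart code below; the `.wav` file names are placeholders, not files shipped with the model.

```python
# Audio only: the model responds directly to the spoken request.
audio_only = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "question_id.wav"},
    ]},
]

# Text only: the model behaves like a regular chat LLM.
text_only = [
    {"role": "user", "content": "What is the most abundant gas in Earth's atmosphere?"},
]

# Audio with text: the text acts as an instruction about the audio.
audio_with_text = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "speech_th.wav"},
        {"type": "text", "text": "Summarize this recording in one sentence."},
    ]},
]
```

Any of these can be passed as the `conversation` argument of the `response_to_audio` helpers in the Quickstart below.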

We release the weights of [SeaLLMs-Audio](https://huggingface.co/SeaLLMs/SeaLLMs-Audio-7B) on Hugging Face, and we have built a [demo](https://huggingface.co/spaces/SeaLLMs/SeaLLMs-Audio-Demo) for users to interact with.

# Training Information

SeaLLMs-Audio is built upon [Qwen2-Audio-7B](https://huggingface.co/Qwen/Qwen2-Audio-7B) and [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct): we replaced the LLM module of Qwen2-Audio-7B with Qwen2.5-7B-Instruct and then performed full-parameter fine-tuning on a large-scale audio dataset. The dataset contains 1.58M conversations spanning multiple tasks, of which 93% are single-turn. The tasks fall roughly into the following categories: automatic speech recognition (ASR), audio captioning (AC), speech-to-text translation (S2TT), question answering (QA), speech summarization (SS), speech question answering (SQA), chat, math, fact, and mixed tasks (mixed).

The distribution of the data across languages and tasks is shown below.

<p align="center">
<strong>Distribution of SeaLLMs-Audio training data across languages and tasks</strong>
</p>

<p align="center">
<img src="https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/static/data_distribution/dist_lang.png" alt="Distribution of SeaLLMs-Audio training data across languages" width="70%">
<img src="https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/static/data_distribution/dist_task.png" alt="Distribution of SeaLLMs-Audio training data across tasks" width="70%">
</p>

The training dataset was curated from multiple sources, including public datasets and in-house data. The public datasets include [GigaSpeech](https://huggingface.co/datasets/speechcolab/gigaspeech), [GigaSpeech 2](https://huggingface.co/datasets/speechcolab/gigaspeech2), [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0), [AudioCaps](https://huggingface.co/datasets/OpenSound/AudioCaps), [VoiceAssistant-400K](https://huggingface.co/datasets/gpt-omni/VoiceAssistant-400K), [YODAS2](https://huggingface.co/datasets/espnet/yodas2), and [Multitask-National-Speech-Corpus](https://huggingface.co/datasets/MERaLiON/Multitask-National-Speech-Corpus-v1). We would like to thank the authors of these datasets for their contributions to the community!

We trained the model on this dataset for one epoch, which took about 6 days on 32 A800 GPUs.

# Performance

Given the absence of standard benchmarks for evaluating audio LLMs in Southeast Asian languages, we manually created a benchmark called **SeaBench-Audio**. It comprises nine tasks:

- **Tasks with both audio and text inputs:** Audio Captioning (AC), Automatic Speech Recognition (ASR), Speech-to-Text Translation (S2TT), Speech Emotion Recognition (SER), Speech Question Answering (SQA), and Speech Summarization (SS).
- **Tasks with only audio inputs:** Factuality, Math, and General.

We manually annotated 15 questions per task per language. For evaluation, qualified native speakers rated each response on a scale of 1 to 5, with 5 representing the highest quality.
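
A minimal sketch of how such per-language average scores can be computed from the raw ratings. The flat record format here is hypothetical and only for illustration, not the actual SeaBench-Audio release format:

```python
from collections import defaultdict

# Hypothetical rating records: one per rated response.
ratings = [
    {"model": "SeaLLMs-Audio", "lang": "th", "task": "ASR", "score": 5},
    {"model": "SeaLLMs-Audio", "lang": "vi", "task": "SQA", "score": 4},
    # ... more records ...
]

# Accumulate (sum, count) per (model, language) pair.
totals = defaultdict(lambda: [0.0, 0])
for r in ratings:
    key = (r["model"], r["lang"])
    totals[key][0] += r["score"]
    totals[key][1] += 1

# Report the mean 1-5 rating per model and language.
for (model_name, lang), (total, n) in sorted(totals.items()):
    print(f"{model_name} / {lang}: {total / n:.2f}")
```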

Since no existing LALM supports all three of these Southeast Asian languages, we compare SeaLLMs-Audio with relevant LALMs of similar size: [Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) (Qwen2-Audio), [MERaLiON-AudioLLM-Whisper-SEA-LION](https://huggingface.co/MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION) (MERaLiON), [llama3.1-typhoon2-audio-8b-instruct](https://huggingface.co/scb10x/llama3.1-typhoon2-audio-8b-instruct) (typhoon2-audio), and [DiVA-llama-3-v0-8b](https://huggingface.co/WillHeld/DiVA-llama-3-v0-8b) (DiVA). All of these LALMs accept audio together with text as input. The results are shown in the figure below.

<center>

**Average scores of SeaLLMs-Audio vs. other LALMs on SeaBench-Audio**

![Performance of SeaLLMs-Audio vs. other audio LLMs](https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/static/images/scores_lang.png)

</center>

The results show that SeaLLMs-Audio achieves state-of-the-art performance in all five languages, demonstrating its effectiveness on audio-related tasks in Southeast Asia.

# Quickstart

Our model is available on Hugging Face, and you can use it with either the `transformers` or the `vllm` library. Below are some examples to get you started.

## Get started with `transformers`

```python
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor
import librosa
import os

# Load the model and its processor (tokenizer + audio feature extractor).
model = Qwen2AudioForConditionalGeneration.from_pretrained("SeaLLMs/SeaLLMs-Audio-7B", device_map="auto")
processor = AutoProcessor.from_pretrained("SeaLLMs/SeaLLMs-Audio-7B")

def response_to_audio(conversation, model=None, processor=None):
    # Render the conversation into the model's chat template.
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    # Collect all audio clips referenced in the conversation, resampled to
    # the feature extractor's sampling rate (16 kHz).
    audios = []
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio" and ele["audio_url"] is not None:
                    audios.append(librosa.load(
                        ele["audio_url"],
                        sr=processor.feature_extractor.sampling_rate)[0]
                    )
    if audios:
        inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True, sampling_rate=16000)
    else:
        inputs = processor(text=text, return_tensors="pt", padding=True)
    inputs = {k: v.to("cuda") for k, v in inputs.items() if v is not None}
    # Greedy decoding; strip the prompt tokens from the generated ids.
    generate_ids = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
    generate_ids = generate_ids[:, inputs["input_ids"].size(1):]
    response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    return response

# Voice chat: a multi-turn conversation with audio-only user turns.
os.system("wget -O fact_en.wav https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/static/audios/fact_en.wav")
os.system("wget -O general_en.wav https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/static/audios/general_en.wav")
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "fact_en.wav"},
    ]},
    {"role": "assistant", "content": "The most abundant gas in Earth's atmosphere is nitrogen. It makes up about 78 percent of the atmosphere by volume."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "general_en.wav"},
    ]},
]

response = response_to_audio(conversation, model=model, processor=processor)
print(response)

# Audio analysis: audio plus a text instruction (here, transcription).
os.system("wget -O ASR_en.wav https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/static/audios/ASR_en.wav")
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "ASR_en.wav"},
        {"type": "text", "text": "Please write down what is spoken in the audio file."},
    ]},
]

response = response_to_audio(conversation, model=model, processor=processor)
print(response)
```
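
If GPU memory is tight, a common variant is to load the checkpoint in half precision. This is a generic `transformers` option rather than something specific to this model card; bfloat16 is an assumption and requires hardware support (use float16 otherwise):

```python
import torch
from transformers import Qwen2AudioForConditionalGeneration

# Optional: half-precision loading roughly halves GPU memory for the 7B model.
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "SeaLLMs/SeaLLMs-Audio-7B",
    device_map="auto",
    torch_dtype=torch.bfloat16,  # assumption: the GPU supports bf16
)
```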

## Inference with `vllm`

```python
from vllm import LLM, SamplingParams
from transformers import AutoProcessor
import librosa
import os

# The processor is only used here to render the chat template.
processor = AutoProcessor.from_pretrained("SeaLLMs/SeaLLMs-Audio-7B")
llm = LLM(
    model="SeaLLMs/SeaLLMs-Audio-7B", trust_remote_code=True, gpu_memory_utilization=0.5,
    enforce_eager=True, device="cuda",
    limit_mm_per_prompt={"audio": 5},  # allow up to 5 audio clips per prompt
)

def response_to_audio(conversation, model=None, processor=None, temperature=0.1,
                      repetition_penalty=1.1, top_p=0.9, max_new_tokens=4096):
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    # Collect all audio clips referenced in the conversation, resampled to
    # the feature extractor's sampling rate (16 kHz).
    audios = []
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio" and ele["audio_url"] is not None:
                    audios.append(librosa.load(
                        ele["audio_url"],
                        sr=processor.feature_extractor.sampling_rate)[0]
                    )

    sampling_params = SamplingParams(
        temperature=temperature, max_tokens=max_new_tokens,
        repetition_penalty=repetition_penalty, top_p=top_p, top_k=20,
        stop_token_ids=[],
    )

    inputs = {
        "prompt": text,
        "multi_modal_data": {
            "audio": [(audio, 16000) for audio in audios]
        },
    }

    output = model.generate([inputs], sampling_params=sampling_params)[0]
    response = output.outputs[0].text
    return response

# Voice chat: a multi-turn conversation with audio-only user turns.
os.system("wget -O fact_en.wav https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/static/audios/fact_en.wav")
os.system("wget -O general_en.wav https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/static/audios/general_en.wav")
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "fact_en.wav"},
    ]},
    {"role": "assistant", "content": "The most abundant gas in Earth's atmosphere is nitrogen. It makes up about 78 percent of the atmosphere by volume."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "general_en.wav"},
    ]},
]

response = response_to_audio(conversation, model=llm, processor=processor)
print(response)

# Audio analysis: audio plus a text instruction (here, transcription).
os.system("wget -O ASR_en.wav https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/static/audios/ASR_en.wav")
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "ASR_en.wav"},
        {"type": "text", "text": "Please write down what is spoken in the audio file."},
    ]},
]

response = response_to_audio(conversation, model=llm, processor=processor)
print(response)
```
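
The same engine also serves text-only prompts. A minimal sketch, reusing `llm` and `processor` from the example above and omitting `multi_modal_data` when no audio is attached:

```python
# Text-only inference with the same vLLM engine.
conversation = [
    {"role": "user", "content": "Which Southeast Asian country has the largest population?"},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
sampling_params = SamplingParams(temperature=0.1, max_tokens=256, repetition_penalty=1.1, top_p=0.9)
output = llm.generate([{"prompt": text}], sampling_params=sampling_params)[0]
print(output.outputs[0].text)
```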

## Citation

If you find our project useful, we hope you will kindly star our [repo](https://github.com/DAMO-NLP-SG/SeaLLMs-Audio) and cite our work as follows.

Corresponding author: Wenxuan Zhang ([[email protected]](mailto:[email protected]))

```bibtex
@misc{SeaLLMs-Audio,
  author = {Chaoqun Liu and Mahani Aljunied and Guizhen Chen and Hou Pong Chan and Weiwen Xu and Yu Rong and Wenxuan Zhang},
  title = {SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/DAMO-NLP-SG/SeaLLMs-Audio}},
}
```