Update README.md
README.md

## Model Summary

**Gemma-3-MM** is an open multimodal instruction model that extends the capabilities of the original Gemma-3 models to **include speech processing**.

These models leverage the language and vision research used in the original Gemma-3 models and incorporate **additional speech processing capabilities** through a Speech Adapter.

The models can process text, image, and audio inputs and generate text outputs, and they come with a 128K token context length (32K for the 1B model).

Inspiration: [Phi-4-multimodal-instruct]

## Training Details

- The model was trained by adding a **596B parameter Speech LoRA adapter** to the base Gemma-3-4b-it model (a generic LoRA sketch follows this list).
- Due to limited computational resources, the model was **only trained for 1 epoch** on ASR (Automatic Speech Recognition) and AST (Automatic Speech Translation) tasks, on a single A100 GPU in 12 hours.
- The training data was limited to **English and Korean** from the [Covost2 Dataset](https://huggingface.co/datasets/junnei/covost2), restricted to clips of **2-15 seconds in duration** (a filtering sketch also follows this list).
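
The card does not include the training code itself, so the following is only a generic sketch of attaching a LoRA adapter to `google/gemma-3-4b-it` with the `peft` library; the rank, alpha, and target modules are assumptions, and the actual Speech Adapter differs from a plain text-attention LoRA.

```python
# Hedged sketch: attaching a generic LoRA adapter to the base checkpoint with
# the `peft` library. This is NOT the card's actual Speech Adapter code; the
# rank, alpha, and target modules below are illustrative assumptions.
import torch
from transformers import AutoModelForImageTextToText
from peft import LoraConfig, get_peft_model

base = AutoModelForImageTextToText.from_pretrained(
    "google/gemma-3-4b-it",          # base model named in this card
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                            # assumed rank, not stated in the card
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()   # only the adapter weights remain trainable
```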
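
As a hedged sketch of the duration constraint in the last bullet, a filter along the following lines could be applied when preparing the data; the config name and column layout assumed for the linked dataset are placeholders, since the preprocessing code is not shown here.

```python
# Hedged sketch of the 2-15 second duration constraint described above.
# The config name and audio column are assumptions about junnei/covost2.
from datasets import load_dataset, Audio

ds = load_dataset("junnei/covost2", "en_ko", split="train")  # "en_ko" is a placeholder config
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

def within_window(example, low=2.0, high=15.0):
    audio = example["audio"]
    seconds = len(audio["array"]) / audio["sampling_rate"]
    return low <= seconds <= high

ds = ds.filter(within_window)        # keep only 2-15 s clips
print(f"{len(ds)} clips between 2 and 15 seconds")
```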

## Limitations

Note that this model is **just a Proof of Concept (PoC) for experimental purposes** and is not intended for production use. To improve the model's performance and reliability, the following areas need further development:

- More computational resources are needed for extended training.
- For now, the model only works for Vision-Language tasks and **Audio-Language tasks (ASR/AST)**.
- Due to the lack of computing resources, this model **primarily recognizes audio files within 2-15 seconds** in duration. As a result, accuracy may drop significantly for longer audio inputs (a chunking workaround is sketched below).
- If possible, we will train the model on Speech-Vision tasks and more Audio-Language tasks.
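
Because accuracy drops on audio much longer than the 2-15 second window, one simple hedged workaround (not something the card prescribes) is to cut long recordings into short segments and run ASR on each segment in turn; fixed-length chunking and the mono assumption below are simplifications, and the `transcribe` call is hypothetical.

```python
# Hedged workaround for the short-audio limitation above: split long recordings
# into chunks no longer than the 2-15 s training window and transcribe each
# piece separately. Fixed-length chunking (no voice-activity detection) is a
# simplification, not part of the card.
import soundfile as sf

def chunk_audio(path, max_seconds=15.0):
    audio, sr = sf.read(path)
    if audio.ndim > 1:                   # down-mix stereo to mono
        audio = audio.mean(axis=1)
    step = int(max_seconds * sr)
    for start in range(0, len(audio), step):
        yield audio[start:start + step], sr

# Transcribe chunk by chunk and join the partial outputs in order.
# `transcribe` stands in for whatever ASR call is used with this model.
# pieces = [transcribe(chunk, sr) for chunk, sr in chunk_audio("long_recording.wav")]
# full_transcript = " ".join(pieces)
```
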
### Usage
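
Purely as a hedged illustration (not the official snippet for this checkpoint), a chat-style ASR call might look like the sketch below; the repository id, the `audio` message entry, and the `trust_remote_code` requirement are all assumptions.

```python
# Hedged illustration only: repo id, message schema, and trust_remote_code are
# assumptions, not the card's official usage snippet.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "<namespace>/gemma-3-mm-4b-it"        # placeholder repository id

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "sample_speech.wav"},   # placeholder file
            {"type": "text", "text": "Transcribe this audio clip."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=128)

new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))
```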