junnei committed
Commit 4237419 · verified · 1 Parent(s): 959264a

Update README.md

Files changed (1)
  1. README.md +13 -13
README.md CHANGED
@@ -11,12 +11,12 @@ base_model: google/gemma-3-4b-it
 
 ## Model Summary
 
- Gemma-3-MM is a open multimodal instruction models that extend the
- capabilities of the original Gemma-3 models to include speech processing.
+ **Gemma-3-MM** is an open multimodal instruction model that extends the
+ capabilities of the original Gemma-3 models to **include speech processing.**
 
 These models leverage the language and vision research used in the
- original Gemma-3 models and incorporate additional speech processing
- capabilities through a Speech Adapter.
+ original Gemma-3 models and incorporate **additional speech processing
+ capabilities** through a Speech Adapter.
 
 The models can process text, image, and audio inputs, generating text outputs, and come with a 128K token context length (32K for the 1B model).
 
@@ -41,26 +41,26 @@ Inspiration: [Phi-4-multimodal-instruct]
 
 ## Training Details
 
- - The model was trained by adding a 596B parameter Speech LoRA adapter to the base Gemma-3-4b-it model.
+ - The model was trained by adding a **596B parameter Speech LoRA adapter** to the base Gemma-3-4b-it model.
 
- - Due to limited computational resources, the model was only trained for 1 epoch on ASR (Automatic Speech Recognition) and AST (Automatic Speech Translation) tasks.
+ - Due to limited computational resources, the model was **only trained for 1 epoch** on ASR (Automatic Speech Recognition) and AST (Automatic Speech Translation) tasks, on a single A100 GPU for 12 hours.
 
- - The training data was limited to English and Korean languages from the Covost2 Dataset within 2-15 seconds in duration.
+ - The training data was limited to **English and Korean** speech from the [CoVoST 2 dataset](https://huggingface.co/datasets/junnei/covost2), with clips **2-15 seconds in duration.**
 
 ## Limitations
 
- Note that this model is just a Proof of Concept (PoC) for experimental purposes and is not intended for production use.
+ Note that this model is **just a Proof of Concept (PoC) for experimental purposes** and is not intended for production use.
 To improve the model's performance and reliability, the following areas need further development:
 
- More computational resources for extended training needed.
+ - More computational resources are needed for extended training.
 
- For now, the model only works for Vision-Language tasks and Audio-Language tasks (ASR/AST).
+ - For now, the model only works for Vision-Language tasks and **Audio-Language tasks (ASR/AST).**
 
- As mentioned above, due to the lack of computing resources,
- this model primarily recognizes audio files within 2-15 seconds in duration.
+ - Due to the lack of computing resources,
+ this model **primarily recognizes audio files within 2-15 seconds** in duration.
 As a result, there is a limitation where the accuracy may drop significantly for longer audio inputs.
 
- If possible, We will train the model for Speech-Vision Tasks and more Audio-Language tasks.
+ - If possible, we will train the model on Speech-Vision tasks and more Audio-Language tasks.
 
 ### Usage
 
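
The Usage section itself is not shown in this diff. As a rough, non-authoritative sketch of the ASR use case described above, the snippet below assumes the model follows a Phi-4-multimodal-style `AutoProcessor` / `trust_remote_code` interface; the repo id, prompt format, and `audio` keyword are assumptions, not confirmed by this commit.

```python
# Hypothetical ASR example for Gemma-3-MM (assumed interface, see note above).
import torch
import soundfile as sf
from transformers import AutoProcessor, AutoModelForCausalLM

model_id = "junnei/gemma-3-mm-4b-it"  # assumed repo id

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Keep clips within the 2-15 second range the card says the model was trained on.
audio, sr = sf.read("sample_en.wav")

# Assumed Gemma-style chat prompt; the exact turn/audio tokens may differ.
prompt = "<start_of_turn>user\nTranscribe the audio.<end_of_turn>\n<start_of_turn>model\n"
inputs = processor(text=prompt, audio=[(audio, sr)], return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens.
print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```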