Update README.md
README.md

## Model Summary

**Gemma-3-MM** is an open multimodal instruction model that extends the capabilities of the original Gemma-3 models to **include speech processing**.

These models leverage the language and vision research used in the original Gemma-3 models and incorporate **additional speech processing capabilities** through a Speech Adapter.

The models can process text, image, and audio inputs and generate text outputs, and they come with a 128K token context length (32K for the 1B model).

Inspiration: [Phi-4-multimodal-instruct]

## Training Details

- The model was trained by adding a **596B parameter Speech LoRA adapter** to the base Gemma-3-4b-it model (a generic LoRA sketch follows this list).
- Due to limited computational resources, the model was **only trained for 1 epoch** on ASR (Automatic Speech Recognition) and AST (Automatic Speech Translation) tasks, on a single A100 GPU in 12 hours.
- The training data was limited to **English and Korean** from the [Covost2 Dataset](https://huggingface.co/datasets/junnei/covost2), restricted to clips of **2-15 seconds in duration** (a filtering sketch also follows this list).
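
The card does not include the training code itself, so the following is only a generic sketch of attaching a LoRA adapter to `google/gemma-3-4b-it` with the `peft` library; the rank, alpha, and target modules are assumptions, and the actual Speech Adapter differs from a plain text-attention LoRA.

```python
# Hedged sketch: attaching a generic LoRA adapter to the base checkpoint with
# the `peft` library. This is NOT the card's actual Speech Adapter code; the
# rank, alpha, and target modules below are illustrative assumptions.
import torch
from transformers import AutoModelForImageTextToText
from peft import LoraConfig, get_peft_model

base = AutoModelForImageTextToText.from_pretrained(
    "google/gemma-3-4b-it",          # base model named in this card
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                            # assumed rank, not stated in the card
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()   # only the adapter weights remain trainable
```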
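
As a hedged sketch of the duration constraint in the last bullet, a filter along the following lines could be applied when preparing the data; the config name and column layout assumed for the linked dataset are placeholders, since the preprocessing code is not shown here.

```python
# Hedged sketch of the 2-15 second duration constraint described above.
# The config name and audio column are assumptions about junnei/covost2.
from datasets import load_dataset, Audio

ds = load_dataset("junnei/covost2", "en_ko", split="train")  # "en_ko" is a placeholder config
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

def within_window(example, low=2.0, high=15.0):
    audio = example["audio"]
    seconds = len(audio["array"]) / audio["sampling_rate"]
    return low <= seconds <= high

ds = ds.filter(within_window)        # keep only 2-15 s clips
print(f"{len(ds)} clips between 2 and 15 seconds")
```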

## Limitations

Note that this model is **just a Proof of Concept (PoC) for experimental purposes** and is not intended for production use. To improve the model's performance and reliability, the following areas need further development:

- More computational resources are needed for extended training.
- For now, the model only works for Vision-Language tasks and **Audio-Language tasks (ASR/AST)**.
- Due to the lack of computing resources, this model **primarily recognizes audio files within 2-15 seconds** in duration. As a result, accuracy may drop significantly for longer audio inputs (a chunking workaround is sketched below).
- If possible, we will train the model on Speech-Vision tasks and more Audio-Language tasks.
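
Because accuracy drops on audio much longer than the 2-15 second window, one simple hedged workaround (not something the card prescribes) is to cut long recordings into short segments and run ASR on each segment in turn; fixed-length chunking and the mono assumption below are simplifications, and the `transcribe` call is hypothetical.

```python
# Hedged workaround for the short-audio limitation above: split long recordings
# into chunks no longer than the 2-15 s training window and transcribe each
# piece separately. Fixed-length chunking (no voice-activity detection) is a
# simplification, not part of the card.
import soundfile as sf

def chunk_audio(path, max_seconds=15.0):
    audio, sr = sf.read(path)
    if audio.ndim > 1:                   # down-mix stereo to mono
        audio = audio.mean(axis=1)
    step = int(max_seconds * sr)
    for start in range(0, len(audio), step):
        yield audio[start:start + step], sr

# Transcribe chunk by chunk and join the partial outputs in order.
# `transcribe` stands in for whatever ASR call is used with this model.
# pieces = [transcribe(chunk, sr) for chunk, sr in chunk_audio("long_recording.wav")]
# full_transcript = " ".join(pieces)
```
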
### Usage
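
Purely as a hedged illustration (not the official snippet for this checkpoint), a chat-style ASR call might look like the sketch below; the repository id, the `audio` message entry, and the `trust_remote_code` requirement are all assumptions.

```python
# Hedged illustration only: repo id, message schema, and trust_remote_code are
# assumptions, not the card's official usage snippet.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "<namespace>/gemma-3-mm-4b-it"        # placeholder repository id

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "sample_speech.wav"},   # placeholder file
            {"type": "text", "text": "Transcribe this audio clip."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=128)

new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))
```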