---
license: other
datasets:
- Argobell/gek408
- Argobell/gek408-dpo
language:
- en
base_model: google/gemma-3n-E2B-it
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- gemma3n
- sft
- dpo
- unsloth
- instruction-tuning
- text-generation
- multimodal
- education
- reasoning
---

# 🧠 Model Card for `gemma-3n-gek408-dpo`

`gemma-3n-gek408-dpo` is a fine-tuned version of [`google/gemma-3n-E2B-it`](https://huggingface.co/google/gemma-3n-E2B-it), optimized for educational and scientific reasoning. It was trained with the **Unsloth** library, which speeds up training and reduces memory usage.

The training followed a two-stage process:
1.  **Supervised Fine-Tuning (SFT):** To teach the model the desired instruction-following behavior on scientific and mathematical tasks.
2.  **Direct Preference Optimization (DPO):** To align the model's responses with human preferences for clarity, accuracy, and helpfulness.

This model was developed for the **[Google - The Gemma 3n Impact Challenge](https://www.kaggle.com/competitions/google-gemma-3n-hackathon)** competition.

## 📌 Model Details

### 🧾 Model Description

- **Developed by:** Argobell
- **Shared by:** Argobell
- **Model type:** Multimodal model, capable of processing **text, image, and audio inputs**.
- **Finetuned from:** [`google/gemma-3n-E2B-it`](https://huggingface.co/google/gemma-3n-E2B-it)
- **License:** This model is subject to the **Gemma Terms of Use**. Users must agree to and comply with the [Gemma Terms of Use](https://ai.google.dev/gemma/terms) and the [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy).
- **Primary Domain:** Education, STEM, Visual Reasoning

### 📂 Model Sources

- **Repository:** [Argobell/gemma-3n-gek408-dpo](https://huggingface.co/Argobell/gemma-3n-gek408-dpo)
- **Competition:** [Google - The Gemma 3n Impact Challenge](https://www.kaggle.com/competitions/google-gemma-3n-hackathon)
- **Demo:** [GitHub Demo Link](https://github.com/Argobell/kaggle408)

## 🎯 Uses

### ✅ Direct Use

This model is ideal for:

- 🧮 **Math Tutoring Agents:** Guiding students through complex math problems.
- 🧑‍🏫 **Educational AI Assistants:** Answering questions based on educational materials.
- 📊 **Diagram-based Question Answering:** Interpreting charts, graphs, and scientific diagrams.
- 🔍 **Visual Reasoning & Explanation:** Explaining logical steps from a visual prompt.

### 🧩 Downstream Use

This model serves as a strong foundation for:

- **Interactive, offline-ready learning experiences for students in low-connectivity regions.**
- Advanced multimodal AI systems for educational platforms.
- Domain-specific reasoning tools for science and engineering.
- Interactive learning applications in STEM fields.

## โš ๏ธ Bias, Risks, and Limitations

This model inherits limitations common to most LLMs and has specific risks related to its application:

- **Hallucination:** The model can generate incorrect or fabricated information.
- **Prompt Sensitivity:** The phrasing of a prompt can significantly affect the output quality.
- **Inherited Biases:** It may reflect biases present in the `gemma-3n-E2B-it` base model and the `gek408` dataset.
- **Risk of "Fluent Nonsense":** In educational contexts, the model might generate explanations that sound logical and correct but contain subtle mathematical or scientific inaccuracies. **Human verification is crucial for factual and educational use cases.**

### 💡 Recommendations

Always critically evaluate the model's output before use in any real-world application. For educational purposes, outputs should be reviewed by a subject matter expert.

## 🚀 Getting Started

The model was trained with Unsloth, so loading it with Unsloth is the recommended way to run inference.

```python
from unsloth import FastModel
import torch
from transformers import TextStreamer
import gc

# Load the model and tokenizer with 4-bit quantization
model, tokenizer = FastModel.from_pretrained(
    model_name = "Argobell/gemma-3n-gek408-dpo",
    max_seq_length = 1024, # Maximum context length in tokens; raise it for longer inputs
    load_in_4bit = True,  # 4-bit quantization to reduce memory
    # token = "hf_...", # Hugging Face access token, only needed for gated models
)

# Helper function for inference
def do_gemma_3n_inference(model, messages, max_new_tokens = 128):
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt = True, # Must add for generation
        tokenize = True,
        return_dict = True,
        return_tensors = "pt",
    ).to("cuda")
    _ = model.generate(
        **inputs,
        max_new_tokens = max_new_tokens,
        temperature = 1.0, top_p = 0.95, top_k = 64,
        streamer = TextStreamer(tokenizer, skip_prompt = True),
    )
    # Cleanup to reduce VRAM usage
    del inputs
    torch.cuda.empty_cache()
    gc.collect()

sloth_link = "https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg"

messages = [{
    "role" : "user",
    "content": [
        { "type": "image", "image" : sloth_link },
        { "type": "text",  "text" : "Which films does this animal feature in?" }
    ]
}]
# You might have to wait 1 minute for Unsloth's auto compiler
do_gemma_3n_inference(model, messages, max_new_tokens = 256)
```
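
The `messages` list above follows a simple pattern: one user turn whose `content` is a list of typed entries. As a minimal sketch, the helper below builds that structure for either a text-only or an image-plus-text prompt. `build_messages` is a hypothetical name introduced here for illustration, and the image URL is a placeholder.

```python
from typing import Optional


def build_messages(question: str, image: Optional[str] = None) -> list:
    """Build a single-turn chat in the list-of-dicts format used above.

    `build_messages` is a hypothetical helper, not part of Unsloth or Transformers.
    """
    content = []
    if image is not None:
        # The image entry goes first, mirroring the sloth example above.
        content.append({"type": "image", "image": image})
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


# Text-only prompt: the content list holds a single text entry.
text_only = build_messages("Solve 3x + 5 = 20 and explain each step.")

# Image + text prompt; the URL is a placeholder, not a real file.
with_image = build_messages(
    "What does the slope of this graph represent?",
    image="https://example.com/velocity_time_graph.png",
)
```

Either list can then be passed as the `messages` argument of `do_gemma_3n_inference`.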

## ๐Ÿ› ๏ธ Training Details

The training was conducted in two distinct phases, using a LoRA-based approach accelerated by Unsloth.

### 📚 Phase 1: Supervised Fine-Tuning (SFT)

- **Goal:** To teach the model the fundamental structure of responding to mathematical prompts.
- **Dataset:** [`Argobell/gek408`](https://huggingface.co/datasets/Argobell/gek408)
- **Key Hyperparameters:** The following parameters were used to tune both the vision and language components of the model.

```bash
# SFT Stage Configuration
--max_seq_length 2048
--max_steps 320
--learning_rate 2e-4
--lr_scheduler_type "cosine"
--optim "adamw_torch_fused"

# LoRA Configuration
--tune_vision                
--tune_language_layers       
--tune_attention_modules     
--tune_mlp_modules           
--r 16                       
--alpha 16                   
--lora_dropout 0.05

# Batching & Memory
--per_device_train_batch_size 4
--per_device_eval_batch_size 4
--gradient_accumulation_steps 8 
--gradient_checkpointing

```
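
To make the batching and schedule settings concrete, the sketch below works out the effective batch size and the shape of a cosine learning-rate decay from the values listed above. It assumes a single GPU, no warmup, and a decay floor of zero; the card does not state the warmup or floor used in the SFT stage, so `cosine_lr` is an illustration rather than the exact schedule.

```python
import math

# Values copied from the SFT configuration above.
per_device_train_batch_size = 4
gradient_accumulation_steps = 8
max_steps = 320
peak_lr = 2e-4
num_gpus = 1  # assumption: single-GPU training

# Sequences contributing to each optimizer update.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Upper bound on training sequences processed over the whole run.
examples_seen = effective_batch_size * max_steps


def cosine_lr(step: int, total_steps: int = max_steps, peak: float = peak_lr) -> float:
    """Cosine decay from `peak` to 0 (assumes no warmup and a floor of 0)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))


print(effective_batch_size)  # 32
print(examples_seen)         # 10240
print(cosine_lr(160))        # roughly 1e-4, half the peak, at the midpoint
```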

### 📚 Phase 2: Direct Preference Optimization (DPO)

- **Goal:** To refine the SFT model by training it to prefer helpful, accurate responses over less desirable ones.
- **Dataset:** [`Argobell/gek408-dpo`](https://huggingface.co/datasets/Argobell/gek408-dpo)
- **Key Hyperparameters:** Starting from the SFT-tuned model, DPO training was performed with the following settings.

```bash
# DPO Stage Configuration
--max_seq_length 2048
--max_prompt_length 1024
--max_steps 100
--learning_rate 5e-6         
--optim "adamw_torch_fused"
--warmup_ratio 0.1
--weight_decay 0.01

# LoRA Configuration
--tune_vision                
--tune_language_layers       
--tune_attention_modules     
--tune_mlp_modules           
--r 4
--alpha 4
--lora_dropout 0.1

# Batching & Memory
--per_device_train_batch_size 2
--per_device_eval_batch_size 2
--gradient_accumulation_steps 4
--gradient_checkpointing

```
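
For readers unfamiliar with DPO, the following is a minimal sketch of the standard per-pair DPO loss, computed from summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model. It is not the training code used for this model (TRL handles that); the function name is hypothetical, and `beta = 0.1` is an assumed value (the TRL `DPOConfig` default), since the card does not report the value used.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the card does not report beta
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (policy margin - reference margin)))."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    logits = beta * (policy_margin - ref_margin)
    # -log(sigmoid(z)) == log(1 + exp(-z)), written in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# Policy identical to the reference: the loss equals log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy favors the chosen response more than the reference does: lower loss.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The loss falls below `log(2)` only when the policy separates the chosen response from the rejected one by a wider margin than the reference model does, which is the behavior this stage optimizes for.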

### 💻 Infrastructure & Software

- **Hardware:** 1× NVIDIA RTX 5880 Ada Generation
- **Key Software:**
    - **Unsloth:** Used for 2-3x faster training and ~60% less memory usage, enabling more extensive experimentation.
    - **Hugging Face TRL:** For implementing the SFT and DPO training loops.
    - **Hugging Face Transformers & Datasets.**

## 🧰 Technical Specifications

### Architecture

Gemma-3n utilizes a Matryoshka Transformer (MatFormer) architecture, which nests smaller, self-contained models within a larger one.
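
To make the nesting idea concrete, here is a toy sketch of a MatFormer-style feed-forward layer in plain Python. It assumes the nesting works by letting a smaller sub-model reuse the first `hidden_units` neurons of the full layer's weights. The function name, sizes, weights, and ReLU activation are all illustrative and are not Gemma 3n's actual implementation.

```python
def ffn_forward(x, w_in, w_out, hidden_units):
    """Feed-forward pass using only the first `hidden_units` hidden neurons.

    x:     input vector of length d_model
    w_in:  d_hidden rows of length d_model (input weights per hidden neuron)
    w_out: d_hidden rows of length d_model (output weights per hidden neuron)
    """
    out = [0.0] * len(w_out[0])
    for row_in, row_out in zip(w_in[:hidden_units], w_out[:hidden_units]):
        # ReLU activation of one hidden neuron (a simplification).
        h = max(0.0, sum(wi * xi for wi, xi in zip(row_in, x)))
        for j, wo in enumerate(row_out):
            out[j] += h * wo
    return out


# Toy weights with d_model = 2 and d_hidden = 4.
w_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
w_out = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 2.0]]
x = [2.0, 1.0]

small = ffn_forward(x, w_in, w_out, hidden_units=2)  # nested sub-model
full = ffn_forward(x, w_in, w_out, hidden_units=4)   # full model

print(small)  # [2.0, 1.0]
print(full)   # [3.5, 4.5]
```

Both calls share the same weight matrices; the smaller model simply stops after a prefix of the hidden neurons, which is what allows one set of weights to serve several model sizes.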

## ๐Ÿ™ Acknowledgements
This work would not have been possible without the foundational models and libraries developed by the open-source community. We would like to extend our gratitude to:
- Google: For developing and releasing the powerful gemma-3n-E2B-it base model.
- The Unsloth AI team: For creating the Unsloth library, which was instrumental in accelerating the training process and reducing computational costs.
- Hugging Face: For providing the transformers, datasets, and TRL libraries that formed the backbone of our training and experimentation pipeline.

## 📖 Citation

If you use this model in your work, please cite it as follows:

```bibtex
@misc{gemma3ngek408dpo,
  author = {Argobell},
  title = {gemma-3n-gek408-dpo},
  howpublished = {\url{https://huggingface.co/Argobell/gemma-3n-gek408-dpo}},
  year = {2025}
}
```

## 👥 Model Card Authors

- Argobell

## 📬 Contact

For questions, feedback, or collaboration, please reach out via email: [[email protected]](mailto:[email protected])