---
license: other
datasets:
- Argobell/gek408
- Argobell/gek408-dpo
language:
- en
base_model: google/gemma-3n-E2B-it
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- gemma3n
- sft
- dpo
- unsloth
- instruction-tuning
- text-generation
- multimodal
- education
- reasoning
---

# 🧠 Model Card for `gemma-3n-gek408-dpo`

`gemma-3n-gek408-dpo` is a fine-tuned version of [`google/gemma-3n-E2B-it`](https://huggingface.co/google/gemma-3n-E2B-it), optimized for educational and scientific reasoning. It was trained with the **Unsloth** library, which speeds up training and reduces memory usage.

The training followed a two-stage process:
1.  **Supervised Fine-Tuning (SFT):** To teach the model the desired instruction-following behavior on scientific and mathematical tasks.
2.  **Direct Preference Optimization (DPO):** To align the model's responses with human preferences for clarity, accuracy, and helpfulness.

This model was developed for the **[Google - The Gemma 3n Impact Challenge](https://www.kaggle.com/competitions/google-gemma-3n-hackathon)** competition.

## 📌 Model Details

### 🧾 Model Description

- **Developed by:** Argobell
- **Shared by:** Argobell
- **Model type:** Multimodal model, capable of processing **text, image, and audio inputs**.
- **Finetuned from:** [`google/gemma-3n-E2B-it`](https://huggingface.co/google/gemma-3n-E2B-it)
- **License:** This model is subject to the **Gemma Terms of Use**. Users must agree to and comply with the [Gemma Terms of Use](https://ai.google.dev/gemma/terms) and the [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy).
- **Primary Domain:** Education, STEM, Visual Reasoning

### 📂 Model Sources

- **Repository:** [Argobell/gemma-3n-gek408-dpo](https://huggingface.co/Argobell/gemma-3n-gek408-dpo)
- **Competition:** [Google - The Gemma 3n Impact Challenge](https://www.kaggle.com/competitions/google-gemma-3n-hackathon)
- **Demo:** [GitHub Demo Link](https://github.com/Argobell/kaggle408)

## 🎯 Uses

### ✅ Direct Use

This model is ideal for:

- 🧮 **Math Tutoring Agents:** Guiding students through complex math problems.
- 🧑‍🏫 **Educational AI Assistants:** Answering questions based on educational materials.
- 📊 **Diagram-based Question Answering:** Interpreting charts, graphs, and scientific diagrams.
- 🔍 **Visual Reasoning & Explanation:** Explaining logical steps from a visual prompt.

### 🧩 Downstream Use

This model serves as a strong foundation for:

- **Interactive, offline-ready learning experiences for students in low-connectivity regions.**
- Advanced multimodal AI systems for educational platforms.
- Domain-specific reasoning tools for science and engineering.
- Interactive learning applications in STEM fields.

## ⚠️ Bias, Risks, and Limitations

This model inherits limitations common to most LLMs and has specific risks related to its application:

- **Hallucination:** The model can generate incorrect or fabricated information.
- **Prompt Sensitivity:** The phrasing of a prompt can significantly affect the output quality.
- **Inherited Biases:** It may reflect biases present in the `gemma-3n-E2B-it` base model and the `gek408` dataset.
- **Risk of "Fluent Nonsense"**: In educational contexts, the model might generate explanations that sound logical and correct but contain subtle mathematical or scientific inaccuracies. **Human verification is crucial for factual and educational use cases.**

### 💡 Recommendations

Always critically evaluate the model's output before use in any real-world application. For educational purposes, outputs should be reviewed by a subject matter expert.

## 🚀 Getting Started

The model was trained with Unsloth, so running inference through Unsloth is recommended for best performance.

```python
from unsloth import FastModel
import torch
from transformers import TextStreamer
import gc

# Load the model and tokenizer with 4-bit quantization
model, tokenizer = FastModel.from_pretrained(
    model_name = "Argobell/gemma-3n-gek408-dpo", 
    max_seq_length = 1024, # Choose any for long context!
    load_in_4bit = True,  # 4 bit quantization to reduce memory
    # token = "hf_...", # use one if using gated models
)

# Helper function for inference
def do_gemma_3n_inference(model, messages, max_new_tokens = 128):
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt = True, # Must add for generation
        tokenize = True,
        return_dict = True,
        return_tensors = "pt",
    ).to("cuda")
    _ = model.generate(
        **inputs,
        max_new_tokens = max_new_tokens,
        temperature = 1.0, top_p = 0.95, top_k = 64,
        streamer = TextStreamer(tokenizer, skip_prompt = True),
    )
    # Cleanup to reduce VRAM usage
    del inputs
    torch.cuda.empty_cache()
    gc.collect()

sloth_link = "https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg"

messages = [{
    "role" : "user",
    "content": [
        { "type": "image", "image" : sloth_link },
        { "type": "text",  "text" : "Which films does this animal feature in?" }
    ]
}]
# You might have to wait 1 minute for Unsloth's auto compiler
do_gemma_3n_inference(model, messages, max_new_tokens = 256)
```
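
The `messages` structure above can also be built programmatically, which is convenient when mixing text-only and image-based questions. The following is a minimal sketch: `build_messages` is a hypothetical helper (not part of Unsloth or Transformers), and the example URL is a placeholder.

```python
from typing import Optional


def build_messages(text: str, image: Optional[str] = None) -> list:
    """Build a single-turn user message in the format used in the example above.

    `build_messages` is a hypothetical helper for illustration only.
    """
    content = []
    if image is not None:
        # Image entry first, then the text question, as in the example above.
        content.append({"type": "image", "image": image})
    content.append({"type": "text", "text": text})
    return [{"role": "user", "content": content}]


# Text-only prompt (no image entry is added).
text_only = build_messages("Solve 2x + 3 = 11 and explain each step.")

# Image + text prompt; the URL is a placeholder, not a real resource.
with_image = build_messages(
    "What does this diagram show?",
    image="https://example.com/diagram.png",
)
```

Either result can then be passed to `do_gemma_3n_inference(model, messages)` as in the example above.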

## 🛠️ Training Details

The training was conducted in two distinct phases, using a LoRA-based approach accelerated by Unsloth.

### 📚 Phase 1: Supervised Fine-Tuning (SFT)

- **Goal:** To teach the model the fundamental structure of responding to mathematical prompts.
- **Dataset:** [`Argobell/gek408`](https://huggingface.co/datasets/Argobell/gek408)
- **Key Hyperparameters:** The following parameters were used to tune both the vision and language components of the model.

```bash
# SFT Stage Configuration
--max_seq_length 2048
--max_steps 320
--learning_rate 2e-4
--lr_scheduler_type "cosine"
--optim "adamw_torch_fused"

# LoRA Configuration
--tune_vision                
--tune_language_layers       
--tune_attention_modules     
--tune_mlp_modules           
--r 16                       
--alpha 16                   
--lora_dropout 0.05

# Batching & Memory
--per_device_train_batch_size 4
--per_device_eval_batch_size 4
--gradient_accumulation_steps 8 
--gradient_checkpointing

```
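
To make the batching and LoRA settings concrete, the short calculation below derives the effective batch size and the LoRA scaling factor from the values listed above. It is a sketch under two assumptions: training ran on a single GPU (as listed under hardware in this card), and the standard LoRA scaling `alpha / r` was used rather than a variant such as rank-stabilized LoRA.

```python
# Values copied from the SFT configuration above.
per_device_train_batch_size = 4
gradient_accumulation_steps = 8
max_steps = 320
num_gpus = 1  # assumption: single-GPU training

# Number of examples contributing to each optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)

# Total examples processed over the run (examples may repeat across epochs).
examples_processed = effective_batch_size * max_steps

# Standard LoRA scales the low-rank update by alpha / r (assumption: no rsLoRA).
lora_r, lora_alpha = 16, 16
lora_scaling = lora_alpha / lora_r

print(effective_batch_size)  # 32
print(examples_processed)    # 10240
print(lora_scaling)          # 1.0
```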

### 📚 Phase 2: Direct Preference Optimization (DPO)

- **Goal:** To refine the SFT model by training it to prefer helpful, accurate responses over less desirable ones.
- **Dataset:** [`Argobell/gek408-dpo`](https://huggingface.co/datasets/Argobell/gek408-dpo)
- **Key Hyperparameters:** Starting from the SFT-tuned model, DPO training was performed with the following settings.

```bash
# DPO Stage Configuration
--max_seq_length 2048
--max_prompt_length 1024
--max_steps 100
--learning_rate 5e-6         
--optim "adamw_torch_fused"
--warmup_ratio 0.1
--weight_decay 0.01

# LoRA Configuration
--tune_vision                
--tune_language_layers       
--tune_attention_modules     
--tune_mlp_modules           
--r 4
--alpha 4
--lora_dropout 0.1

# Batching & Memory
--per_device_train_batch_size 2
--per_device_eval_batch_size 2
--gradient_accumulation_steps 4
--gradient_checkpointing

```
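
For readers unfamiliar with DPO, the objective can be sketched for a single preference pair as follows. This is an illustrative pure-Python version of the standard DPO loss, not the training code used for this model (training used Hugging Face TRL). The value `beta = 0.1` is an assumption for illustration; this card does not state the value that was used.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed log-probabilities of each response under the policy
    being trained and under the frozen reference (here, the SFT model).
    beta=0.1 is an illustrative assumption, not a value from this card.
    """
    # Implicit rewards: how much more likely each response is under the
    # policy than under the reference, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# When the policy equals the reference, the margin is zero.
loss_at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Raising the chosen response's log-probability lowers the loss.
loss_after_improvement = dpo_loss(-8.0, -12.0, -10.0, -12.0)
```

At the start of DPO training the policy and reference are identical, so the loss equals `log(2)`; it falls as the policy assigns relatively more probability to the preferred response.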

### 💻 Infrastructure & Software

- **Hardware:** 1× NVIDIA RTX 5880 Ada Generation
- **Key Software:**
    - **Unsloth:** Used for 2-3x faster training and ~60% less memory usage, enabling more extensive experimentation.
    - **Hugging Face TRL:** For implementing the SFT and DPO training loops.
    - **Hugging Face Transformers & Datasets.**

## 🧰 Technical Specifications

### Architecture

Gemma-3n utilizes a Matryoshka Transformer (MatFormer) architecture, which nests smaller, self-contained models within a larger one.
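
The nesting idea can be illustrated with a toy feed-forward block in which a smaller sub-model reuses a prefix of the larger model's hidden units. This is a simplified sketch of the MatFormer concept only: the weights, sizes, and ReLU activation are made up for illustration and do not reflect Gemma 3n's actual layers.

```python
def ffn(x, w_in, w_out, hidden_units):
    """Toy feed-forward block that uses only the first `hidden_units` neurons.

    w_in:  one row per hidden neuron, each of length len(x)
    w_out: one row per hidden neuron, each of length output_dim
    """
    # Project the input onto the selected prefix of hidden neurons, then ReLU.
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)))
        for row in w_in[:hidden_units]
    ]
    output_dim = len(w_out[0])
    # Project back using the matching prefix of output rows.
    return [
        sum(h * w_out[j][k] for j, h in enumerate(hidden))
        for k in range(output_dim)
    ]


# Made-up weights: 2 inputs, 4 hidden neurons, 2 outputs.
w_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
w_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
x = [1.0, 2.0]

full_output = ffn(x, w_in, w_out, hidden_units=4)   # larger model
small_output = ffn(x, w_in, w_out, hidden_units=2)  # nested sub-model

print(full_output)   # [4.0, 5.0]
print(small_output)  # [1.0, 2.0]
```

Both calls share the same weight matrices; the smaller model simply reads fewer rows, which is what allows a single set of weights to serve multiple model sizes.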

## 🙏 Acknowledgements

This work would not have been possible without the foundational models and libraries developed by the open-source community. We would like to extend our gratitude to:

- **Google:** For developing and releasing the `gemma-3n-E2B-it` base model.
- **The Unsloth AI team:** For creating the Unsloth library, which was instrumental in accelerating the training process and reducing computational costs.
- **Hugging Face:** For providing the `transformers`, `datasets`, and TRL libraries that formed the backbone of our training and experimentation pipeline.

## 📖 Citation

If you use this model in your work, please cite it as follows:

```bibtex
@misc{gemma3ngek408dpo,
  author = {Argobell},
  title = {gemma-3n-gek408-dpo},
  howpublished = {\url{https://huggingface.co/Argobell/gemma-3n-gek408-dpo}},
  year = {2025}
}
```

## 👥 Model Card Authors

- Argobell

## 📬 Contact

For questions, feedback, or collaboration, please reach out via email: [[email protected]](mailto:[email protected])