---
base_model: google/gemma-3-12b-it
license: gemma
tags:
- gemma3
- gemma
- google
pipeline_tag: image-text-to-text
library_name: transformers
---

<p align="left">
  <img width="65%" src="Fornax.jpg">
</p>

### Gemma 3 12B V3 Fornax

Gemma Fornax is a distillation of the updated DeepSeek R1 (05/28) onto Gemma 3 12B, with a particular focus on timely and generalizable reasoning beyond coding and math.
Most other open-source thinking models, especially smaller ones, fail to generalize their reasoning to tasks other than coding or math because of an overly large focus on GRPO zero for CoT, which only generalizes to coding and math.

Instead of using GRPO, this model uses SFT on a wide variety of high-quality, diverse reasoning traces from DeepSeek R1 05/28 to force Gemma 3 to learn to effectively generalize its reasoning capabilities across a large number of tasks, as an extension of the LiMO paper's approach to math/coding CoT.

Varying CoT length in conjunction with explicit noise regularization during training also prevents the characteristic length overfitting of GRPO, which tends to manifest as waffling: the model reasons to a set length even when it has already reached an answer.

Training off the QAT checkpoint also allows this model to be used at Q4_0 without a drop in quality, requiring only ~6 GiB of memory.
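
As a reference, here is a minimal text-only loading sketch with transformers (the repo id below is a placeholder, and the snippet loads bf16 weights; the ~6 GiB figure above refers to running a Q4_0 quantization, e.g. a GGUF export, in a runtime such as llama.cpp):

```python
# Minimal text-only usage sketch. Assumptions: placeholder repo id,
# transformers >= 4.50 (Gemma 3 support) and accelerate installed.
import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "your-namespace/Gemma-3-12B-V3-Fornax"  # placeholder repo id

model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {"role": "user", "content": [{"type": "text", "text": "Briefly explain why the sky is blue."}]}
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens.
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```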

## Recommended Settings

Temperature 0.7 + n-sigma 1
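
Transformers does not ship an n-sigma sampler, so the sketch below is a hypothetical top-n-sigma logits processor (it keeps only tokens whose logit is within n standard deviations of the maximum logit) wired into `generate` alongside temperature 0.7; llama.cpp-family runtimes expose an equivalent sampler natively. It reuses `model`, `processor`, and `inputs` from the loading sketch above.

```python
# Hypothetical top-n-sigma implementation for transformers' `generate`;
# "n-sigma 1" above is assumed to mean top-n-sigma sampling with n = 1.
import torch
from transformers import LogitsProcessor, LogitsProcessorList


class TopNSigmaLogitsProcessor(LogitsProcessor):
    """Mask tokens whose logit falls more than n standard deviations below the max logit."""

    def __init__(self, n: float = 1.0):
        self.n = n

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        max_logit = scores.max(dim=-1, keepdim=True).values
        sigma = scores.std(dim=-1, keepdim=True)
        threshold = max_logit - self.n * sigma
        return scores.masked_fill(scores < threshold, float("-inf"))


output = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.7,      # recommended temperature
    top_k=0, top_p=1.0,   # disable default top-k/top-p so n-sigma sees unmasked logits
    logits_processor=LogitsProcessorList([TopNSigmaLogitsProcessor(1.0)]),
)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```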

## Special Thanks:

Google, for open-sourcing the excellent Gemma 3 model line.