Update README.md (#2)
- Update README.md (9ae0062cf2dc7fe23b7caceacb7e112a8791d518)
Co-authored-by: Troy Schultz <[email protected]>
README.md CHANGED
@@ -77,6 +77,27 @@ down_proj: [5120, 25600] → [8192, 29568]
- Group Query Attention (GQA) maintained with 8 KV heads
- All interpolations preserve the mathematical properties of the original weights

## Evaluation Results

To answer the question "is it smarter or dumber than the original?", the model was evaluated on the **IFEval** (Instruction Following Evaluation) benchmark and compared directly against its base model, `Qwen/Qwen3-32B`.
### IFEval: Instruction Following Comparison
Evaluation was performed using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) in a 0-shot setting. The results show that while the raw interpolated model is not yet as capable as the highly polished base model, it has successfully retained a significant portion of its instruction-following ability.
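Assuming the harness's standard CLI, a run along these lines reproduces the 0-shot IFEval setting described above (the `dtype` and batch-size choices here are illustrative, not taken from this repository):

```shell
# 0-shot IFEval via lm-evaluation-harness; swap in this model's path
# for the second run to reproduce the comparison.
lm_eval --model hf \
  --model_args pretrained=Qwen/Qwen3-32B,dtype=bfloat16 \
  --tasks ifeval \
  --num_fewshot 0 \
  --batch_size auto
```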
| Metric (Higher is Better) | 🥇 **Base Model (Qwen3-32B)** | **Embiggened Model (This Model)** | Performance Change |
| :--- | :---: | :---: | :---: |
| **Prompt-level Strict Accuracy** | **81.25%** | 68.75% | **-12.5 pts** |
| **Instruction-level Strict Accuracy**| **87.50%** | 75.00% | **-12.5 pts** |
| Prompt-level Loose Accuracy | **87.50%** | 68.75% | **-18.75 pts** |
| Instruction-level Loose Accuracy | **91.67%** | 75.00% | **-16.67 pts** |
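
The deltas and the retention figure cited below can be sanity-checked directly from the table; this snippet only re-derives the arithmetic, it does not re-run the benchmark:

```python
# Scores copied from the IFEval comparison table above (percentages).
base = {"prompt_strict": 81.25, "inst_strict": 87.50,
        "prompt_loose": 87.50, "inst_loose": 91.67}
embiggened = {"prompt_strict": 68.75, "inst_strict": 75.00,
              "prompt_loose": 68.75, "inst_loose": 75.00}

# Per-metric change in percentage points.
for name in base:
    delta = embiggened[name] - base[name]
    print(f"{name}: {delta:+.2f} pts")

# Fraction of the base model's strict prompt-level accuracy retained (~85%).
retention = embiggened["prompt_strict"] / base["prompt_strict"]
print(f"retained: {retention:.1%}")
```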
### Analysis of Results
* **Expected Performance Drop:** The drop in performance is an expected and normal consequence of the architectural expansion. The interpolation process, while structure-aware, cannot perfectly preserve the intricate balance of a fine-tuned model's weights.
* **Success in Retaining Capability:** The key takeaway is not the performance drop, but how much capability the model **retained**. Achieving ~85% of the original's strict accuracy (68.75% vs 81.25%) without any post-expansion training is a strong indicator of a successful architectural merge. The model remained coherent and functional.
* **Strong Foundation for Fine-Tuning:** These results establish a solid baseline. The model is now a larger, coherent architecture that serves as an excellent starting point for further fine-tuning, which could recover much of, and potentially exceed, the performance of the original 32B model.
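
As one hypothetical route for the fine-tuning suggested above, a parameter-efficient LoRA setup could be attached to the expanded model. The rank, target modules, and checkpoint path below are illustrative assumptions, not part of this repository:

```python
# Illustrative LoRA configuration sketch (assumes `peft` and `transformers`
# are installed; the checkpoint path is a placeholder, not a real file).
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                      # low-rank dimension; a common starting point
    lora_alpha=32,             # LoRA scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
# model = AutoModelForCausalLM.from_pretrained("path/to/embiggened-model")
# model = get_peft_model(model, lora_config)
```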
## Usage