Safetensors
qwen3
ehartford and TroyDoesAI committed
Commit 0932868 · verified · 1 Parent(s): e958432

Update README.md (#2)

- Update README.md (9ae0062cf2dc7fe23b7caceacb7e112a8791d518)


Co-authored-by: Troy Schultz <[email protected]>

Files changed (1)
  1. README.md +21 -0
README.md CHANGED
@@ -77,6 +77,27 @@ down_proj: [5120, 25600] → [8192, 29568]
  - Group Query Attention (GQA) maintained with 8 KV heads
  - All interpolations preserve the mathematical properties of the original weights

+ ## Evaluation Results
+
+ To answer the question "is it smarter or dumber than the original?", the model was evaluated on the **IFEval** (Instruction-Following Evaluation) benchmark and compared directly against its base model, `Qwen/Qwen3-32B`.
+
+ ### IFEval: Instruction-Following Comparison
+
+ Evaluation was performed with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) in a 0-shot setting. The results show that while the raw interpolated model is not yet as capable as the highly polished base model, it has retained a significant portion of the base model's instruction-following ability.
+
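+ As a minimal, hedged sketch, the comparison can be reproduced with the harness's Python API (v0.4+); the `dtype` and `batch_size` values below are illustrative assumptions, not the recorded evaluation configuration:
+
+ ```python
+ # Sketch: 0-shot IFEval via lm-evaluation-harness (v0.4+ Python API).
+ # Run once with the base model and once with this model to compare.
+ import lm_eval
+
+ results = lm_eval.simple_evaluate(
+     model="hf",
+     model_args="pretrained=Qwen/Qwen3-32B,dtype=bfloat16",  # or this repo's ID
+     tasks=["ifeval"],
+     num_fewshot=0,
+     batch_size=8,  # assumption; tune to available GPU memory
+ )
+
+ # Prompt-/instruction-level strict and loose accuracies, as tabulated below
+ print(results["results"]["ifeval"])
+ ```
+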
+ | Metric (Higher is Better) | 🥇 **Base Model (Qwen3-32B)** | **Embiggened Model (This Model)** | Performance Change |
+ | :--- | :---: | :---: | :---: |
+ | **Prompt-level Strict Accuracy** | **81.25%** | 68.75% | **-12.50 pts** |
+ | **Instruction-level Strict Accuracy** | **87.50%** | 75.00% | **-12.50 pts** |
+ | Prompt-level Loose Accuracy | **87.50%** | 68.75% | **-18.75 pts** |
+ | Instruction-level Loose Accuracy | **91.67%** | 75.00% | **-16.67 pts** |
+
+ ### Analysis of Results
+
+ * **Expected Performance Drop:** The drop in performance is an expected and normal consequence of the architectural expansion: the interpolation process, while structure-aware, cannot perfectly preserve the intricate balance of a fine-tuned model's weights (see the sketch after this list).
+ * **Success in Retaining Capability:** The key takeaway is not the performance drop but how much capability the model **retained**. Achieving ~85% of the original's strict accuracy (68.75% vs. 81.25%) without any post-expansion training is a strong indicator of a successful architectural merge; the model remained coherent and functional.
+ * **Strong Foundation for Fine-Tuning:** These results establish a strong baseline. The model is now a larger, coherent architecture that serves as an excellent starting point for further fine-tuning, which would likely recover, and could ultimately exceed, the performance of the original 32B model.
+
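+ For intuition about why expansion costs some accuracy, here is a purely illustrative sketch of widening one weight matrix by bilinear interpolation, using the `down_proj` shapes from this model card; it is a toy under stated assumptions, not necessarily the procedure used to build this model:
+
+ ```python
+ # Illustrative only: widen a 2-D weight by treating it as a 1-channel
+ # "image" and resampling it. Coarse structure survives, but the exact
+ # fine-tuned values do not, one intuition for the accuracy drop above.
+ import torch
+ import torch.nn.functional as F
+
+ def widen(weight: torch.Tensor, rows: int, cols: int) -> torch.Tensor:
+     w = weight[None, None]  # [1, 1, R, C], the layout interpolate expects
+     w = F.interpolate(w, size=(rows, cols), mode="bilinear", align_corners=True)
+     return w[0, 0]
+
+ down_proj = torch.randn(5120, 25600)        # original down_proj shape
+ print(widen(down_proj, 8192, 29568).shape)  # torch.Size([8192, 29568])
+ ```
+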
  ## Usage
