Update README.md (#2)
- Update README.md (9ae0062cf2dc7fe23b7caceacb7e112a8791d518)
Co-authored-by: Troy Schultz <[email protected]>
README.md CHANGED
@@ -77,6 +77,27 @@ down_proj: [5120, 25600] → [8192, 29568]
- Group Query Attention (GQA) maintained with 8 KV heads
- All interpolations preserve the mathematical properties of the original weights

## Evaluation Results

To answer the question "is it smarter or dumber than the original?", the model was evaluated on the **IFEval** (Instruction Following Evaluation) benchmark and compared directly against its base model, `Qwen/Qwen3-32B`.
### IFEval: Instruction Following Comparison
Evaluation was performed using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) in a 0-shot setting. The results show that while the raw interpolated model is not yet as capable as the highly polished base model, it has successfully retained a significant portion of its instruction-following ability.
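Assuming the harness's standard CLI, a run along these lines reproduces the 0-shot IFEval setting described above (the `dtype` and batch-size choices here are illustrative, not taken from this repository):

```shell
# 0-shot IFEval via lm-evaluation-harness; swap in this model's path
# for the second run to reproduce the comparison.
lm_eval --model hf \
  --model_args pretrained=Qwen/Qwen3-32B,dtype=bfloat16 \
  --tasks ifeval \
  --num_fewshot 0 \
  --batch_size auto
```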
| Metric (Higher is Better) | 🥇 **Base Model (Qwen3-32B)** | **Embiggened Model (This Model)** | Performance Change |
| :--- | :---: | :---: | :---: |
| **Prompt-level Strict Accuracy** | **81.25%** | 68.75% | **-12.5 pts** |
| **Instruction-level Strict Accuracy**| **87.50%** | 75.00% | **-12.5 pts** |
| Prompt-level Loose Accuracy | **87.50%** | 68.75% | **-18.75 pts** |
| Instruction-level Loose Accuracy | **91.67%** | 75.00% | **-16.67 pts** |
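
The deltas and the retention figure cited below can be sanity-checked directly from the table; this snippet only re-derives the arithmetic, it does not re-run the benchmark:

```python
# Scores copied from the IFEval comparison table above (percentages).
base = {"prompt_strict": 81.25, "inst_strict": 87.50,
        "prompt_loose": 87.50, "inst_loose": 91.67}
embiggened = {"prompt_strict": 68.75, "inst_strict": 75.00,
              "prompt_loose": 68.75, "inst_loose": 75.00}

# Per-metric change in percentage points.
for name in base:
    delta = embiggened[name] - base[name]
    print(f"{name}: {delta:+.2f} pts")

# Fraction of the base model's strict prompt-level accuracy retained (~85%).
retention = embiggened["prompt_strict"] / base["prompt_strict"]
print(f"retained: {retention:.1%}")
```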
### Analysis of Results
* **Expected Performance Drop:** The drop in performance is an expected and normal consequence of the architectural expansion. The interpolation process, while structure-aware, cannot perfectly preserve the intricate balance of a fine-tuned model's weights.
* **Success in Retaining Capability:** The key takeaway is not the performance drop, but how much capability the model **retained**. Achieving ~85% of the original's strict accuracy (68.75% vs 81.25%) without any post-expansion training is a strong indicator of a successful architectural merge. The model remained coherent and functional.
* **Strong Foundation for Fine-Tuning:** These results establish a solid baseline. The model is now a larger, coherent architecture that serves as an excellent starting point for further fine-tuning, which could recover much of, and potentially exceed, the performance of the original 32B model.
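
As one hypothetical route for the fine-tuning suggested above, a parameter-efficient LoRA setup could be attached to the expanded model. The rank, target modules, and checkpoint path below are illustrative assumptions, not part of this repository:

```python
# Illustrative LoRA configuration sketch (assumes `peft` and `transformers`
# are installed; the checkpoint path is a placeholder, not a real file).
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                      # low-rank dimension; a common starting point
    lora_alpha=32,             # LoRA scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
# model = AutoModelForCausalLM.from_pretrained("path/to/embiggened-model")
# model = get_peft_model(model, lora_config)
```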
## Usage