---
license: apache-2.0
base_model:
- Qwen/Qwen3-8B
- Qwen/Qwen3-235B-A22B-Thinking-2507
---
# frankenqwen3-8B-235B-dense-conversion-interleaved-untuned

- Base architecture: Qwen3 (dense)
- Sources:
  - Qwen/Qwen3-8B (base)
  - Qwen/Qwen3-235B-A22B-Instruct-2507 or Qwen/Qwen3-235B-A22B-Thinking-2507 (MoE)
- Construction: Interleaved composite (passthrough-style), with the MoE model converted to dense first
- Intended use: Research and experimentation with layer-interleaved composites

## Summary
- This model interleaves transformer blocks from Qwen3-8B with blocks derived from a Qwen3-235B-A22* MoE model converted to a dense format. The goal is to build a larger-than-8B dense model while retaining the tokenizer and base configuration from Qwen3-8B.
- MoE→dense conversion is lossy; we use a safer approach than concatenating experts to reduce degradation.
- Final depth: 64 layers
- Hidden size: 4096
- FFN intermediate size: 12288
- Attention heads: 32 total, 8 KV heads (GQA)
- RoPE settings: inherited from the Qwen3-8B base

NOTE: This model has not been fine-tuned or otherwise modified after the merge. A recovery fine-tune would most likely improve performance significantly.

## Model details
- Model type: qwen3
- Architectures: Qwen3ForCausalLM
- Tokenizer: Qwen3 tokenizer (copied from Qwen/Qwen3-8B)
- Tying: embedding and lm_head tied (as in the base)
- Positional encoding: RoPE (θ and scaling from the base)
- Dtype of shards: bfloat16 (weights saved in bf16; can be loaded in fp16/bf16/fp32)

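As a quick cross-check of the values listed above, the composite's config can be inspected after download. A minimal sketch (field names follow the standard Qwen3 config in Transformers; values are the ones stated in this card):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("snwy/frankenqwen3-8B-235B-dense-conversion-interleaved-untuned")

# Values as listed in this card; adjust if a future revision changes them.
assert cfg.model_type == "qwen3"
assert cfg.num_hidden_layers == 64
assert cfg.hidden_size == 4096
assert cfg.intermediate_size == 12288
assert cfg.num_attention_heads == 32
assert cfg.num_key_value_heads == 8
```
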
## How it was built
1) MoE → dense conversion
- Source: Qwen3-235B-A22* MoE.
- Method: expert averaging (recommended: average or router_weighted; not concat); a sketch follows this list.
- FFN orientation fixes:
  - up_proj, gate_proj: [intermediate, hidden]
  - down_proj: [hidden, intermediate]
- Attention (GQA) head-safe remap:
  - q_proj out = num_attention_heads × head_dim = hidden_size
  - k_proj/v_proj out = num_key_value_heads × head_dim
  - o_proj in = hidden_size
- Heads are remapped from the source head counts to the target counts (averaged when reducing heads, repeated when increasing), with head_dim adjusted by truncation/padding if needed.
- Non-MoE and non-router tensors are preserved.

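For illustration, a minimal sketch of the expert-averaging step for one layer's MLP. The tensor names follow the usual Qwen3-MoE checkpoint layout and are an assumption here; the actual conversion script also resizes the FFN to the target intermediate size and performs the GQA head remap described above.

```python
import torch

def average_experts(mats: list[torch.Tensor]) -> torch.Tensor:
    # All per-expert matrices share a shape, so the mean preserves the
    # [intermediate, hidden] / [hidden, intermediate] orientations noted above.
    return torch.stack(mats, dim=0).mean(dim=0)

def dense_mlp_from_moe(state_dict: dict, layer: int, num_experts: int) -> dict:
    """Collapse the per-expert gate/up/down projections of one MoE layer into dense weights."""
    dense = {}
    for proj in ("gate_proj", "up_proj", "down_proj"):
        mats = [
            state_dict[f"model.layers.{layer}.mlp.experts.{e}.{proj}.weight"]
            for e in range(num_experts)
        ]
        dense[f"model.layers.{layer}.mlp.{proj}.weight"] = average_experts(mats)
    # Router weights are dropped; resizing to the target FFN width (12288) and the
    # attention head remap are handled separately in the real conversion.
    return dense
```
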
2) Composite interleaving
- Base: Qwen/Qwen3-8B (tokenizer, non-layer tensors, config baseline).
- Layers: interleaved from the base and the converted MoE-dense model into a new stack of 64 blocks.
- Strategy: even distribution across the depth (illustrated below).
- Config: copied from the base; only num_hidden_layers is updated to 64.

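A minimal sketch of one way to realize the even interleave. The particular split used in the example call (final minus base layers taken from the MoE-derived stack, and the source depth of 94) is an assumption for illustration; the build script's exact selection may differ.

```python
def interleave_plan(base_layers: int, moe_source_layers: int, final_layers: int):
    """Assign each target slot a (source, source_layer_index) pair, spreading MoE blocks evenly."""
    n_moe = final_layers - base_layers  # blocks taken from the MoE-derived stack (assumption)
    plan, base_i, moe_i = [], 0, 0
    for slot in range(final_layers):
        take_moe = n_moe > 0 and (moe_i + 1) / n_moe <= (slot + 1) / final_layers
        if take_moe and moe_i < n_moe:
            # Map the k-th chosen MoE slot to an evenly spaced layer of the converted source stack.
            src = round(moe_i * (moe_source_layers - 1) / max(n_moe - 1, 1))
            plan.append(("moe_dense", src))
            moe_i += 1
        else:
            plan.append(("base", base_i))
            base_i += 1
    return plan

# Example: use the converted model's num_hidden_layers for the second argument.
print(interleave_plan(32, 94, 64)[:6])
```
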
## Intended use and limitations
- Intended for research on interleaved composite models and MoE→dense techniques.
- Not intended for production use without evaluation and safety review.
- Known limitations:
  - MoE→dense conversion is inherently lossy; removing the routers changes the function learned by the MoE experts.
  - Head remapping and FFN averaging are approximations.
  - Interleaving layers from different training distributions may degrade instruction-following or safety alignment.
  - Quality can often be improved with a brief recovery finetune (e.g., bf16, LR ~1e-5, a few thousand steps). This has not been performed on this model.

## Evaluation
- No formal benchmarks are provided with this release.
- Suggested sanity checks:
  - Perplexity on small corpora (should stay in a reasonable range, not explode); see the sketch after this list.
  - Basic instruction-following and chat-safety probes.
  - Comparison of generations against the Qwen3-8B baseline.

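For the perplexity check, a minimal sketch that computes an approximate token-level perplexity over a handful of strings (corpus choice is up to the evaluator; this is a sanity check, not a benchmark):

```python
import torch

def quick_perplexity(model, tok, texts, max_length=2048):
    """Approximate mean perplexity of `model` over a small list of strings."""
    total_nll, total_tokens = 0.0, 0
    model.eval()
    for text in texts:
        enc = tok(text, return_tensors="pt", truncation=True, max_length=max_length).to(model.device)
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])
        n = enc["input_ids"].numel()
        total_nll += out.loss.item() * n  # loss is the mean NLL over the predicted tokens; close enough here
        total_tokens += n
    return torch.exp(torch.tensor(total_nll / total_tokens)).item()

# e.g. quick_perplexity(model, tok, ["The capital of France is Paris.", "def add(a, b):\n    return a + b"])
```
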
## Usage

Python (Transformers)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "snwy/frankenqwen3-8B-235B-dense-conversion-interleaved-untuned"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# ...
```

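Continuing from the loading snippet above, a minimal generation example using the chat template inherited from Qwen3-8B (prompt and sampling settings are arbitrary):

```python
messages = [{"role": "user", "content": "Explain mixture-of-experts routing in two sentences."}]
input_ids = tok.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9)

# Decode only the newly generated tokens.
print(tok.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
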
CLI (text-generation-inference or transformers-cli)
- Any standard Qwen3-8B serving setup should work, since the config and architecture match the base.

## Reproducibility (high-level)
- Convert the MoE model to dense (recommended method: average or router_weighted).
- Build the interleaved composite at the desired depth.
- Validate the result, e.g., by loading on the meta device (see the sketch after the example commands).

## Example commands
```bash
# 1) Convert MoE -> dense (average is safer than concat)
python moe_to_dense.py \
    --model_id Qwen/Qwen3-235B-A22B-Instruct-2507 \
    --target_model Qwen/Qwen3-8B \
    --output_path ./qwen3-235b-dense-avg \
    --method average \
    --low_memory

# 2) Build composite (example: 48 layers)
python moe_to_dense.py \
    --compose_interleaved \
    --base_model Qwen/Qwen3-8B \
    --moe_converted ./qwen3-235b-dense-avg \
    --composite_output_path ./qwen3-8b-plus-moe-48L \
    --final_layers 48 \
    --interleave_strategy even \
    --cast_dtype bfloat16 \
    --low_memory

# 3) Validate shapes/load
python moe_to_dense.py --validate_model ./qwen3-8b-plus-moe-48L
```

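The validation step can also be approximated with plain Transformers and Accelerate: instantiating the composite's architecture on the meta device (no weights materialized) is a quick smoke test that the saved config is self-consistent before attempting a full load. A minimal sketch, using the example output path above:

```python
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

cfg = AutoConfig.from_pretrained("./qwen3-8b-plus-moe-48L")
with init_empty_weights():
    # Parameters are created on the meta device, so this only checks that the
    # architecture can be built from the composite config; it does not read the shards.
    model = AutoModelForCausalLM.from_config(cfg)

print(f"{cfg.num_hidden_layers} layers, {sum(p.numel() for p in model.parameters()):,} parameters (meta)")
```
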
## Safety, bias, and limitations
- May produce inaccurate or biased content inherited from the source models.
- Not safe for deployment without additional alignment and filtering.
- Do not use for high-stakes or vulnerable-domain applications.

## Versioning
- Version: v0.1 (first public composite)
- Changes from base:
  - Increased depth: 64 layers vs. 32
  - Interleaved MoE-derived blocks after MoE→dense conversion
  - RoPE and tokenizer inherited from the base model

## Licenses
- This composite inherits the licenses of its source models. Please refer to:
  - the Qwen/Qwen3-8B license
  - the Qwen/Qwen3-235B-A22* license
- If redistributing, ensure compliance with both upstream licenses.

## Acknowledgments
- Qwen team for the Qwen3 models and tokenizer.
- Community tools: Hugging Face Transformers, safetensors, huggingface_hub.

## Citation
If you use this model, please cite the upstream Qwen3 models and this composite:

```bibtex
@misc{qwen3-8b-plus-moe-64L,
  title  = {Qwen3 8B + MoE Interleaved Composite (64 Layers)},
  author = {snwy},
  year   = {2025},
  url    = {https://huggingface.co/snwy/frankenqwen3-8B-235B-dense-conversion-interleaved-untuned}
}
```

## Notes for practitioners
- Smaller composites (e.g., 40–48 layers) tend to be more stable than very deep mixes without finetuning.
- If quality is marginal, try:
  - reducing the fraction of MoE-derived layers,
  - router-weighted averaging during conversion,
  - a short recovery finetune (bf16, LR ~1e-5).
- Ensure RoPE settings match the base across the entire composite (they do if you keep the base config).