---
license: apache-2.0
base_model:
- Qwen/Qwen3-8B
- Qwen/Qwen3-235B-A22B-Thinking-2507
---
# frankenqwen3-8B-235B-dense-conversion-interleaved-untuned

- Base architecture: Qwen3 (dense)
- Sources:
  - Qwen/Qwen3-8B (base)
  - Qwen/Qwen3-235B-A22B-Instruct-2507 or Qwen/Qwen3-235B-A22B-Thinking-2507 (MoE)
- Construction: Interleaved composite (passthrough-style), with the MoE model converted to dense first
- Intended use: Research and experimentation with layer-interleaved composites

## Summary

- This model interleaves transformer blocks from Qwen3-8B with blocks derived from a Qwen3-235B-A22* MoE model converted to a dense format. The goal is to build a larger-than-8B dense model while retaining the tokenizer and base configuration from Qwen3-8B.
- MoE→dense conversion is lossy; expert averaging is used instead of concatenating experts to reduce degradation.
- Final depth: 64 layers
- Hidden size: 4096
- FFN intermediate size: 12288
- Attention heads: 32 total, 8 KV heads (GQA)
- RoPE settings: inherited from Qwen3-8B base

NOTE: This model has not been fine-tuned or otherwise modified after the merge. A recovery fine-tune would most likely improve performance significantly.

## Model details

- Model type: qwen3
- Architectures: Qwen3ForCausalLM
- Tokenizer: Qwen3 tokenizer (copied from Qwen/Qwen3-8B)
- Tying: Embedding and lm_head tied (as in base)
- Positional encoding: RoPE (θ and scaling from base)
- Dtype of shards: bfloat16 (weights saved in bf16; can be loaded in fp16/bf16/fp32)
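
The architecture details above can be checked directly from the released config. A minimal sketch (attribute names follow the standard Qwen3 config in Transformers; the expected values are the ones listed in this card):

```python
from transformers import AutoConfig

# Downloads only config.json, not the weight shards.
cfg = AutoConfig.from_pretrained(
    "snwy/frankenqwen3-8B-235B-dense-conversion-interleaved-untuned"
)

assert cfg.model_type == "qwen3"
assert cfg.num_hidden_layers == 64      # composite depth
assert cfg.hidden_size == 4096
assert cfg.intermediate_size == 12288   # dense FFN width
assert cfg.num_attention_heads == 32
assert cfg.num_key_value_heads == 8     # GQA
print(cfg.rope_theta, cfg.tie_word_embeddings)
```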

## How it was built

1) MoE → dense conversion
- Source: Qwen3-235B-A22* MoE.
- Method: Expert averaging (recommended: average or router_weighted; not concat); see the sketch after this list.
- FFN orientation fixes:
  - up_proj, gate_proj: [intermediate, hidden]
  - down_proj: [hidden, intermediate]
- Attention (GQA) head-safe remap:
  - q_proj out = num_attention_heads × head_dim = hidden_size
  - k_proj/v_proj out = num_key_value_heads × head_dim
  - o_proj in = hidden_size
  - Heads are remapped from the source head counts to the target (averaged when reducing heads, repeated when increasing), with head_dim adjusted by truncation/padding if needed.
- Non-MoE and non-router tensors are preserved.
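
A minimal sketch of the per-layer conversion idea, assuming the per-expert gate/up/down weight stacks and optional per-expert routing frequencies have already been extracted (function names and shapes here are illustrative, and resizing to the target dense dimensions by truncation/padding is omitted):

```python
import torch

def experts_to_dense_ffn(gate_w, up_w, down_w, expert_freq=None):
    """Collapse per-expert FFN weights into one dense FFN.

    gate_w, up_w: [num_experts, intermediate, hidden]
    down_w:       [num_experts, hidden, intermediate]
    expert_freq:  optional [num_experts] routing frequencies for
                  router-weighted averaging; plain mean otherwise.
    """
    if expert_freq is None:
        w = torch.full((gate_w.shape[0],), 1.0 / gate_w.shape[0])
    else:
        w = expert_freq / expert_freq.sum()
    w = w.view(-1, 1, 1)
    # Weighted average over the expert axis keeps the target orientations:
    # [intermediate, hidden] for gate/up, [hidden, intermediate] for down.
    return (w * gate_w).sum(0), (w * up_w).sum(0), (w * down_w).sum(0)

def reduce_kv_heads(kv_w, src_heads, tgt_heads, head_dim):
    """Average groups of source KV heads down to the target head count.

    kv_w: [src_heads * head_dim, hidden]; assumes src_heads % tgt_heads == 0.
    """
    hidden = kv_w.shape[-1]
    grouped = kv_w.view(tgt_heads, src_heads // tgt_heads, head_dim, hidden)
    return grouped.mean(dim=1).reshape(tgt_heads * head_dim, hidden)
```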

2) Composite interleaving
- Base: Qwen/Qwen3-8B (tokenizer, non-layer tensors, config baseline).
- Layers: Interleaved from the base and the converted MoE-dense model into a new stack of 64 blocks.
- Strategy: Even distribution across the depth (an index-selection sketch follows this list).
- Config: Copied from base; only num_hidden_layers is updated to 64.
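
One way to realize "even distribution" is to alternate sources while spreading each source's picks uniformly over its own depth. The helper below is an illustrative sketch (the 50/50 split and the round-based spacing are assumptions, not the exact recipe used to build this checkpoint):

```python
def even_interleave_schedule(num_base_layers, num_donor_layers, final_layers):
    """Return a list of (source, source_layer_idx) pairs of length final_layers.

    Even positions take base blocks, odd positions take MoE-dense blocks; each
    source's picks are spread evenly across its own depth (indices may repeat
    if the requested count exceeds a source's depth).
    """
    n_base = (final_layers + 1) // 2            # assumed ~50/50 split
    n_donor = final_layers - n_base
    base_picks = [round(i * (num_base_layers - 1) / max(n_base - 1, 1)) for i in range(n_base)]
    donor_picks = [round(i * (num_donor_layers - 1) / max(n_donor - 1, 1)) for i in range(n_donor)]
    schedule, bi, di = [], 0, 0
    for pos in range(final_layers):
        if pos % 2 == 0:
            schedule.append(("base", base_picks[bi]))
            bi += 1
        else:
            schedule.append(("moe_dense", donor_picks[di]))
            di += 1
    return schedule
```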

## Intended use and limitations

- Intended for research on interleaved composite models and MoE→dense techniques.
- Not intended for production use without evaluation and safety review.
- Known limitations:
  - MoE→dense conversion is inherently lossy; removing the routers changes the function learned by the MoE experts.
  - Head remapping and FFN averaging are approximations.
  - Interleaving layers from different training distributions may degrade instruction-following or safety alignment.
  - Quality can often be improved with a brief recovery finetune (e.g., bf16, LR ~1e-5, a few thousand steps). This has not been performed on this model.

## Evaluation

- No formal benchmarks are provided with this release.
- Suggested sanity checks:
  - Perplexity on small corpora (should be in a reasonable range, not exploding); a minimal check is sketched after this list.
  - Basic instruction-following and chat safety probes.
  - Comparing generations against the Qwen3-8B baseline.
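
A minimal perplexity spot-check along those lines (the wikitext-2 slice and the context length are arbitrary choices, not an official benchmark):

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "snwy/frankenqwen3-8B-235B-dense-conversion-interleaved-untuned"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

# Any small plain-text corpus works for a smoke test.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test[:1%]")["text"])
ids = tok(text, return_tensors="pt").input_ids[:, :4096].to(model.device)

with torch.no_grad():
    loss = model(ids, labels=ids).loss        # mean causal-LM loss
print(f"perplexity ≈ {torch.exp(loss).item():.2f}")  # should not be exploding
```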

## Usage

### Python (Transformers)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "snwy/frankenqwen3-8B-235B-dense-conversion-interleaved-untuned"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Generate a short chat completion (standard Transformers chat API).
messages = [{"role": "user", "content": "Briefly explain rotary position embeddings."}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=256)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```

### CLI (text-generation-inference or transformers-cli)

- Any standard Qwen3-8B serving setup should work, since the config and architecture match the base.

## Reproducibility (high-level)

- Convert the MoE model to dense (recommended method: average or router_weighted).
- Build the interleaved composite at the desired depth.
- Validate the result by loading it on the meta device (sketched below).
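
For the meta-device check, one option is to instantiate the architecture from the composite's config without materializing any weights; a minimal sketch (the local path is illustrative, matching the example output directory used below):

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

path = "./qwen3-8b-plus-moe-48L"   # output dir from the composite step

cfg = AutoConfig.from_pretrained(path)
with torch.device("meta"):
    # Builds the full module tree with meta tensors only: a cheap check that
    # the config instantiates cleanly at the expected depth and width.
    model = AutoModelForCausalLM.from_config(cfg)

n_params = sum(p.numel() for p in model.parameters())
print(f"{cfg.num_hidden_layers} layers, ~{n_params / 1e9:.1f}B params (meta tensors, no real memory allocated)")
```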

## Example commands

```bash
# 1) Convert MoE -> dense (average is safer than concat)
python moe_to_dense.py \
  --model_id Qwen/Qwen3-235B-A22B-Instruct-2507 \
  --target_model Qwen/Qwen3-8B \
  --output_path ./qwen3-235b-dense-avg \
  --method average \
  --low_memory

# 2) Build composite (example: 48 layers)
python moe_to_dense.py \
  --compose_interleaved \
  --base_model Qwen/Qwen3-8B \
  --moe_converted ./qwen3-235b-dense-avg \
  --composite_output_path ./qwen3-8b-plus-moe-48L \
  --final_layers 48 \
  --interleave_strategy even \
  --cast_dtype bfloat16 \
  --low_memory

# 3) Validate shapes/load
python moe_to_dense.py --validate_model ./qwen3-8b-plus-moe-48L
```

## Safety, bias, and limitations

- May produce inaccurate or biased content inherited from source models.
- Not safe for deployment without additional alignment and filtering.
- Do not use for high-stakes or vulnerable-domain applications.

## Versioning

- Version: v0.1 (first public composite)
- Changes from base:
  - Increased depth: 64 vs 32
  - Interleaved MoE-derived blocks after MoE→dense conversion
  - RoPE and tokenizer inherited from the base model

## Licenses

- This composite inherits licenses from the source models. Please refer to:
  - Qwen/Qwen3-8B license
  - Qwen/Qwen3-235B-A22* license
- If redistributing, ensure compliance with both upstream licenses.

## Acknowledgments

- Qwen team for the Qwen3 models and tokenizer.
- Community tools: Hugging Face Transformers, safetensors, huggingface_hub.

## Citation

- If you use this model, please cite the upstream Qwen3 models and this composite:

```bibtex
@misc{qwen3-8b-plus-moe-64L,
  title  = {Qwen3 8B + MoE Interleaved Composite (64 Layers)},
  author = {snwy},
  year   = {2025},
  url    = {https://huggingface.co/snwy/frankenqwen3-8B-235B-dense-conversion-interleaved-untuned}
}
```

## Notes for practitioners

- Smaller composites (e.g., 40–48 layers) tend to be more stable than very deep mixes without finetuning.
- If quality is marginal, try:
  - reducing the fraction of MoE layers,
  - router-weighted averaging during conversion,
  - a short recovery finetune (bf16, LR ~1e-5); a minimal training sketch follows this list.
- Ensure RoPE settings match the base across the entire composite (they do if you keep the base config).
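
For that recovery finetune, a minimal sketch with the plain Transformers `Trainer` (the corpus file, sequence length, batch settings, and step count are placeholders; this card only suggests bf16 and LR ~1e-5 for a few thousand steps, and multi-GPU or parameter-efficient setups are left out):

```python
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "snwy/frankenqwen3-8B-235B-dense-conversion-interleaved-untuned"
tok = AutoTokenizer.from_pretrained(model_id)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Placeholder corpus: any small, general plain-text dataset.
ds = load_dataset("text", data_files={"train": "recovery_corpus.txt"})["train"]
ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=2048),
            batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="./recovery-finetune",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=1e-5,     # LR suggested in this card
    bf16=True,
    max_steps=2000,         # "a few thousand steps"
    logging_steps=50,
    save_steps=500,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```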