---
license: apache-2.0
base_model:
- Qwen/Qwen3-8B
- Qwen/Qwen3-235B-A22B-Thinking-2507
---
# frankenqwen3-8B-235B-dense-conversion-interleaved-untuned

- Base architecture: Qwen3 (dense)
- Sources:
  - Qwen/Qwen3-8B (base)
  - Qwen/Qwen3-235B-A22B-Instruct-2507 or Qwen/Qwen3-235B-A22B-Thinking-2507 (MoE)
- Construction: Interleaved composite (passthrough-style), with the MoE model converted to dense first
- Intended use: Research and experimentation with layer-interleaved composites

## Summary
- This model interleaves transformer blocks from Qwen3-8B with blocks derived from a Qwen3-235B-A22* MoE model converted to a dense format. The goal is to build a larger-than-8B dense model while retaining the tokenizer and base configuration from Qwen3-8B.
- MoE→dense conversion is lossy; we use a safer approach than concatenating experts to reduce degradation.
- Final depth: 64 layers
- Hidden size: 4096
- FFN intermediate size: 12288
- Attention heads: 32 total, 8 KV heads (GQA)
- RoPE settings: inherited from the Qwen3-8B base

NOTE: This model has not been fine-tuned or otherwise modified after the merge. A recovery fine-tune would most likely improve performance significantly.

## Model details
- Model type: qwen3
- Architectures: Qwen3ForCausalLM
- Tokenizer: Qwen3 tokenizer (copied from Qwen/Qwen3-8B)
- Tying: embedding and lm_head tied (as in the base)
- Positional encoding: RoPE (θ and scaling from the base)
- Dtype of shards: bfloat16 (weights saved in bf16; can be loaded in fp16/bf16/fp32)

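As a quick cross-check of the values listed above, the composite's config can be inspected after download. A minimal sketch (field names follow the standard Qwen3 config in Transformers; values are the ones stated in this card):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("snwy/frankenqwen3-8B-235B-dense-conversion-interleaved-untuned")

# Values as listed in this card; adjust if a future revision changes them.
assert cfg.model_type == "qwen3"
assert cfg.num_hidden_layers == 64
assert cfg.hidden_size == 4096
assert cfg.intermediate_size == 12288
assert cfg.num_attention_heads == 32
assert cfg.num_key_value_heads == 8
```
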
## How it was built
1) MoE → dense conversion
- Source: Qwen3-235B-A22* MoE.
- Method: expert averaging (recommended: average or router_weighted; not concat); a sketch follows this list.
- FFN orientation fixes:
  - up_proj, gate_proj: [intermediate, hidden]
  - down_proj: [hidden, intermediate]
- Attention (GQA) head-safe remap:
  - q_proj out = num_attention_heads × head_dim = hidden_size
  - k_proj/v_proj out = num_key_value_heads × head_dim
  - o_proj in = hidden_size
- Heads are remapped from the source head counts to the target counts (averaged when reducing heads, repeated when increasing), with head_dim adjusted by truncation/padding if needed.
- Non-MoE and non-router tensors are preserved.

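For illustration, a minimal sketch of the expert-averaging step for one layer's MLP. The tensor names follow the usual Qwen3-MoE checkpoint layout and are an assumption here; the actual conversion script also resizes the FFN to the target intermediate size and performs the GQA head remap described above.

```python
import torch

def average_experts(mats: list[torch.Tensor]) -> torch.Tensor:
    # All per-expert matrices share a shape, so the mean preserves the
    # [intermediate, hidden] / [hidden, intermediate] orientations noted above.
    return torch.stack(mats, dim=0).mean(dim=0)

def dense_mlp_from_moe(state_dict: dict, layer: int, num_experts: int) -> dict:
    """Collapse the per-expert gate/up/down projections of one MoE layer into dense weights."""
    dense = {}
    for proj in ("gate_proj", "up_proj", "down_proj"):
        mats = [
            state_dict[f"model.layers.{layer}.mlp.experts.{e}.{proj}.weight"]
            for e in range(num_experts)
        ]
        dense[f"model.layers.{layer}.mlp.{proj}.weight"] = average_experts(mats)
    # Router weights are dropped; resizing to the target FFN width (12288) and the
    # attention head remap are handled separately in the real conversion.
    return dense
```
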
2) Composite interleaving
- Base: Qwen/Qwen3-8B (tokenizer, non-layer tensors, config baseline).
- Layers: interleaved from the base and the converted MoE-dense model into a new stack of 64 blocks.
- Strategy: even distribution across the depth (illustrated below).
- Config: copied from the base; only num_hidden_layers is updated to 64.

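A minimal sketch of one way to realize the even interleave. The particular split used in the example call (final minus base layers taken from the MoE-derived stack, and the source depth of 94) is an assumption for illustration; the build script's exact selection may differ.

```python
def interleave_plan(base_layers: int, moe_source_layers: int, final_layers: int):
    """Assign each target slot a (source, source_layer_index) pair, spreading MoE blocks evenly."""
    n_moe = final_layers - base_layers  # blocks taken from the MoE-derived stack (assumption)
    plan, base_i, moe_i = [], 0, 0
    for slot in range(final_layers):
        take_moe = n_moe > 0 and (moe_i + 1) / n_moe <= (slot + 1) / final_layers
        if take_moe and moe_i < n_moe:
            # Map the k-th chosen MoE slot to an evenly spaced layer of the converted source stack.
            src = round(moe_i * (moe_source_layers - 1) / max(n_moe - 1, 1))
            plan.append(("moe_dense", src))
            moe_i += 1
        else:
            plan.append(("base", base_i))
            base_i += 1
    return plan

# Example: use the converted model's num_hidden_layers for the second argument.
print(interleave_plan(32, 94, 64)[:6])
```
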
## Intended use and limitations
- Intended for research on interleaved composite models and MoE→dense techniques.
- Not intended for production use without evaluation and safety review.
- Known limitations:
  - MoE→dense conversion is inherently lossy; removing the routers changes the function learned by the MoE experts.
  - Head remapping and FFN averaging are approximations.
  - Interleaving layers from different training distributions may degrade instruction-following or safety alignment.
  - Quality can often be improved with a brief recovery finetune (e.g., bf16, LR ~1e-5, a few thousand steps). This has not been performed on this model.

## Evaluation
- No formal benchmarks are provided with this release.
- Suggested sanity checks:
  - Perplexity on small corpora (should stay in a reasonable range, not explode); see the sketch after this list.
  - Basic instruction-following and chat-safety probes.
  - Comparison of generations against the Qwen3-8B baseline.

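For the perplexity check, a minimal sketch that computes an approximate token-level perplexity over a handful of strings (corpus choice is up to the evaluator; this is a sanity check, not a benchmark):

```python
import torch

def quick_perplexity(model, tok, texts, max_length=2048):
    """Approximate mean perplexity of `model` over a small list of strings."""
    total_nll, total_tokens = 0.0, 0
    model.eval()
    for text in texts:
        enc = tok(text, return_tensors="pt", truncation=True, max_length=max_length).to(model.device)
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])
        n = enc["input_ids"].numel()
        total_nll += out.loss.item() * n  # loss is the mean NLL over the predicted tokens; close enough here
        total_tokens += n
    return torch.exp(torch.tensor(total_nll / total_tokens)).item()

# e.g. quick_perplexity(model, tok, ["The capital of France is Paris.", "def add(a, b):\n    return a + b"])
```
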
## Usage

Python (Transformers)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "snwy/frankenqwen3-8B-235B-dense-conversion-interleaved-untuned"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# ...
```

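Continuing from the loading snippet above, a minimal generation example using the chat template inherited from Qwen3-8B (prompt and sampling settings are arbitrary):

```python
messages = [{"role": "user", "content": "Explain mixture-of-experts routing in two sentences."}]
input_ids = tok.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9)

# Decode only the newly generated tokens.
print(tok.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
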
CLI (text-generation-inference or transformers-cli)
- Any standard Qwen3-8B serving setup should work, since the config and architecture match the base.

## Reproducibility (high-level)
- Convert the MoE model to dense (recommended method: average or router_weighted).
- Build the interleaved composite at the desired depth.
- Validate the result, e.g., by loading on the meta device (see the sketch after the example commands).

## Example commands
```bash
# 1) Convert MoE -> dense (average is safer than concat)
python moe_to_dense.py \
    --model_id Qwen/Qwen3-235B-A22B-Instruct-2507 \
    --target_model Qwen/Qwen3-8B \
    --output_path ./qwen3-235b-dense-avg \
    --method average \
    --low_memory

# 2) Build composite (example: 48 layers)
python moe_to_dense.py \
    --compose_interleaved \
    --base_model Qwen/Qwen3-8B \
    --moe_converted ./qwen3-235b-dense-avg \
    --composite_output_path ./qwen3-8b-plus-moe-48L \
    --final_layers 48 \
    --interleave_strategy even \
    --cast_dtype bfloat16 \
    --low_memory

# 3) Validate shapes/load
python moe_to_dense.py --validate_model ./qwen3-8b-plus-moe-48L
```

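The validation step can also be approximated with plain Transformers and Accelerate: instantiating the composite's architecture on the meta device (no weights materialized) is a quick smoke test that the saved config is self-consistent before attempting a full load. A minimal sketch, using the example output path above:

```python
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

cfg = AutoConfig.from_pretrained("./qwen3-8b-plus-moe-48L")
with init_empty_weights():
    # Parameters are created on the meta device, so this only checks that the
    # architecture can be built from the composite config; it does not read the shards.
    model = AutoModelForCausalLM.from_config(cfg)

print(f"{cfg.num_hidden_layers} layers, {sum(p.numel() for p in model.parameters()):,} parameters (meta)")
```
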
## Safety, bias, and limitations
- May produce inaccurate or biased content inherited from the source models.
- Not safe for deployment without additional alignment and filtering.
- Do not use for high-stakes or vulnerable-domain applications.

## Versioning
- Version: v0.1 (first public composite)
- Changes from base:
  - Increased depth: 64 layers vs. 32
  - Interleaved MoE-derived blocks after MoE→dense conversion
  - RoPE and tokenizer inherited from the base model

## Licenses
- This composite inherits the licenses of its source models. Please refer to:
  - the Qwen/Qwen3-8B license
  - the Qwen/Qwen3-235B-A22* license
- If redistributing, ensure compliance with both upstream licenses.

## Acknowledgments
- Qwen team for the Qwen3 models and tokenizer.
- Community tools: Hugging Face Transformers, safetensors, huggingface_hub.

## Citation
If you use this model, please cite the upstream Qwen3 models and this composite:

```bibtex
@misc{qwen3-8b-plus-moe-64L,
  title  = {Qwen3 8B + MoE Interleaved Composite (64 Layers)},
  author = {snwy},
  year   = {2025},
  url    = {https://huggingface.co/snwy/frankenqwen3-8B-235B-dense-conversion-interleaved-untuned}
}
```

## Notes for practitioners
- Smaller composites (e.g., 40–48 layers) tend to be more stable than very deep mixes without finetuning.
- If quality is marginal, try:
  - reducing the fraction of MoE-derived layers,
  - router-weighted averaging during conversion,
  - a short recovery finetune (bf16, LR ~1e-5).
- Ensure RoPE settings match the base across the entire composite (they do if you keep the base config).