---
license: apache-2.0
base_model:
- Qwen/Qwen3-8B
- Qwen/Qwen3-235B-A22B-Thinking-2507
---

# frankenqwen3-8B-235B-dense-conversion-interleaved-untuned

- Base architecture: Qwen3 (dense)
- Sources:
  - Qwen/Qwen3-8B (base)
  - Qwen/Qwen3-235B-A22B-Instruct-2507 or Qwen/Qwen3-235B-A22B-Thinking-2507 (MoE)
- Construction: Interleaved composite (passthrough-style), with the MoE model converted to dense first
- Intended use: Research and experimentation with layer-interleaved composites

## Summary

- This model interleaves transformer blocks from Qwen3-8B with blocks derived from a Qwen3-235B-A22* MoE model converted to a dense format. The goal is to build a larger-than-8B dense model while retaining the tokenizer and base configuration from Qwen3-8B.
- MoE→dense conversion is lossy; we average experts rather than concatenating them, which reduces degradation.
- Final depth: 64 layers
- Hidden size: 4096
- FFN intermediate size: 12288
- Attention heads: 32 total, 8 KV heads (GQA)
- RoPE settings: inherited from the Qwen3-8B base

NOTE: This model has not been fine-tuned or otherwise modified after the merge. A recovery fine-tune would most likely improve performance significantly.

## Model details

- Model type: qwen3
- Architectures: Qwen3ForCausalLM
- Tokenizer: Qwen3 tokenizer (copied from Qwen/Qwen3-8B)
- Tying: Embedding and lm_head tied (as in the base)
- Positional encoding: RoPE (θ and scaling from the base)
- Dtype of shards: bfloat16 (weights saved in bf16; can be loaded in fp16/bf16/fp32)

## How it was built

1) MoE → dense conversion
   - Source: Qwen3-235B-A22* MoE.
   - Method: Expert averaging (recommended: average or router_weighted; not concat).
   - FFN orientation fixes:
     - up_proj, gate_proj: [intermediate, hidden]
     - down_proj: [hidden, intermediate]
   - Attention (GQA) head-safe remap:
     - q_proj out = num_attention_heads × head_dim = hidden_size
     - k_proj/v_proj out = num_key_value_heads × head_dim
     - o_proj in = hidden_size
     - Heads are remapped from the source head counts to the target's (averaged when reducing the head count, repeated when increasing it), with head_dim adjusted by truncation/padding if needed.
   - Non-MoE, non-router tensors are preserved.

2) Composite interleaving
   - Base: Qwen/Qwen3-8B (tokenizer, non-layer tensors, config baseline).
   - Layers: Interleaved from the base and the converted MoE-dense model into a new stack of 64 blocks.
   - Strategy: Even distribution across the depth.
   - Config: Copied from the base; only num_hidden_layers updated to 64.

## Intended use and limitations

- Intended for research on interleaved composite models and MoE→dense techniques.
- Not intended for production use without evaluation and safety review.
- Known limitations:
  - MoE→dense conversion is inherently lossy. Removing routers changes the function learned by the MoE experts.
  - Head remapping and FFN averaging are approximations.
  - Interleaving layers from different training distributions may degrade instruction-following or safety alignment.
  - Quality can often be improved with a brief recovery finetune (e.g., bf16, LR ~1e-5, a few thousand steps). This has not been performed on this model.

## Evaluation

- No formal benchmarks are provided with this release.
- Suggested sanity checks:
  - Perplexity on small corpora (should be in a reasonable range, not exploding); a minimal check is sketched below.
  - Basic instruction-following and chat safety probes.
  - Compare generations against the Qwen3-8B baseline.
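As a concrete starting point for the perplexity sanity check above, here is a minimal sketch that scores a few short snippets with Transformers. The two sample strings are placeholders (swap in any small held-out corpus), and the per-snippet token weighting is approximate; this only confirms the loss is finite and in a plausible range, it is not a benchmark.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "snwy/frankenqwen3-8B-235B-dense-conversion-interleaved-untuned"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
model.eval()

# Placeholder snippets; substitute any small held-out corpus.
texts = [
    "The capital of France is Paris, a city known for its museums.",
    "Mixture-of-experts models route each token to a small subset of experts.",
]

total_nll, total_tokens = 0.0, 0
for text in texts:
    enc = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # With labels == input_ids, the model returns the mean next-token loss.
        loss = model(**enc, labels=enc["input_ids"]).loss
    n_pred = enc["input_ids"].size(1) - 1   # number of predicted positions
    total_nll += loss.float().item() * n_pred
    total_tokens += n_pred

ppl = torch.exp(torch.tensor(total_nll / total_tokens))
print(f"perplexity over snippets: {ppl.item():.2f}")  # should be finite, not exploding
```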
## Usage

Python (Transformers):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "snwy/frankenqwen3-8B-235B-dense-conversion-interleaved-untuned"

tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
# ...
```

CLI (text-generation-inference or transformers-cli):

- Any standard Qwen3-8B serving setup should work, since the config and architecture match the base.

## Reproducibility (high-level)

- Convert the MoE model to dense (recommended method: average or router_weighted; a toy illustration of these methods appears at the end of this card).
- Build the composite by interleaving layers to the desired depth (an illustrative schedule sketch also appears at the end of this card).
- Validate the load on the meta device.

Example commands:

```bash
# 1) Convert MoE -> dense (average is safer than concat)
python moe_to_dense.py \
  --model_id Qwen/Qwen3-235B-A22B-Instruct-2507 \
  --target_model Qwen/Qwen3-8B \
  --output_path ./qwen3-235b-dense-avg \
  --method average \
  --low_memory

# 2) Build composite (example: 48 layers)
python moe_to_dense.py \
  --compose_interleaved \
  --base_model Qwen/Qwen3-8B \
  --moe_converted ./qwen3-235b-dense-avg \
  --composite_output_path ./qwen3-8b-plus-moe-48L \
  --final_layers 48 \
  --interleave_strategy even \
  --cast_dtype bfloat16 \
  --low_memory

# 3) Validate shapes/load
python moe_to_dense.py --validate_model ./qwen3-8b-plus-moe-48L
```

## Safety, bias, and limitations

- May produce inaccurate or biased content inherited from the source models.
- Not safe for deployment without additional alignment and filtering.
- Do not use for high-stakes or vulnerable-domain applications.

## Versioning

- Version: v0.1 (first public composite)
- Changes from base:
  - Increased depth: 64 layers vs. 32
  - Interleaved MoE-derived blocks after MoE→dense conversion
  - RoPE and tokenizer inherited from the base model

## Licenses

- This composite inherits the licenses of the source models. Please refer to:
  - the Qwen/Qwen3-8B license
  - the Qwen/Qwen3-235B-A22* license
- If redistributing, ensure compliance with both upstream licenses.

## Acknowledgments

- Qwen team for the Qwen3 models and tokenizer.
- Community tools: Hugging Face Transformers, safetensors, huggingface_hub.

## Citation

- If you use this model, please cite the upstream Qwen3 models and this composite:

```
@misc{qwen3-8b-plus-moe-L,
  title  = {Qwen3 8B + MoE Interleaved Composite (64 Layers)},
  author = {snwy},
  year   = {2025},
  url    = {https://huggingface.co/snwy/frankenqwen3-8B-235B-dense-conversion-interleaved-untuned}
}
```

## Notes for practitioners

- Smaller composites (e.g., 40–48 layers) tend to be more stable than very deep mixes without finetuning.
- If quality is marginal, try:
  - reducing the fraction of MoE layers,
  - router-weighted averaging during conversion,
  - a short recovery finetune (bf16, LR ~1e-5).
- Ensure RoPE settings match the base across the entire composite (they do if you keep the base config).
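## Illustrative sketch: expert averaging

For concreteness, the snippet below is a toy illustration of the "average" and "router_weighted" conversion methods referenced above: it collapses a stack of per-expert projection matrices into a single dense matrix, either by a plain mean or weighted by per-expert routing statistics. The function name (`average_experts`), the calibration-derived `router_probs`, and the tiny tensor shapes are all illustrative assumptions, and the width/orientation remapping described under "How it was built" is omitted, so this is not the logic of the packaged `moe_to_dense.py`.

```python
from typing import List, Optional
import torch

def average_experts(expert_weights: List[torch.Tensor],
                    router_probs: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Collapse per-expert projection matrices into one dense matrix.

    expert_weights: list of [out_features, in_features] tensors, one per expert
                    (e.g. every expert's gate_proj weight from a single MoE layer).
    router_probs:   optional per-expert weights (e.g. mean routing probability over
                    a calibration set) for the 'router_weighted' variant.
    """
    stacked = torch.stack(expert_weights)            # [n_experts, out, in]
    if router_probs is None:
        return stacked.mean(dim=0)                   # 'average' method
    w = router_probs / router_probs.sum()            # normalize to a distribution
    return (w.view(-1, 1, 1) * stacked).sum(dim=0)   # 'router_weighted' method

# Tiny stand-in shapes; in the real conversion the dense target's gate_proj/up_proj
# are [12288, 4096], and expert widths must first be remapped to match that shape.
experts = [torch.randn(8, 4) for _ in range(4)]
dense_w = average_experts(experts)                                         # plain mean
dense_w_rw = average_experts(experts, torch.tensor([0.4, 0.3, 0.2, 0.1]))  # weighted
print(dense_w.shape, dense_w_rw.shape)               # torch.Size([8, 4]) for both
```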
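## Illustrative sketch: even interleave schedule

The "even" interleave strategy referenced in the reproducibility section can be pictured with the standalone schedule calculator below. It assumes that the extra depth beyond the base model is supplied by MoE-derived blocks spread uniformly across the composite, with each such slot mapped proportionally into the deeper MoE stack; the helper names (`even_slots`, `build_schedule`) and this exact rule are assumptions for illustration, not the rule implemented in `moe_to_dense.py`.

```python
from typing import List, Tuple

def even_slots(final_layers: int, n_from_moe: int) -> List[int]:
    """Evenly spaced slot indices (in a stack of final_layers) that receive
    MoE-derived blocks; every other slot keeps a base-model block."""
    if n_from_moe <= 0:
        return []
    step = final_layers / n_from_moe
    return [int(i * step + step / 2) for i in range(n_from_moe)]

def build_schedule(final_layers: int, base_layers: int,
                   moe_layers: int) -> List[Tuple[str, int]]:
    """Return (source, source_layer_index) for each slot of the composite.
    Assumption: the depth added beyond the base comes from MoE-derived blocks,
    each mapped proportionally into the (deeper) converted MoE stack."""
    assert base_layers <= final_layers <= base_layers + moe_layers
    moe_slot_set = set(even_slots(final_layers, final_layers - base_layers))
    schedule, next_base = [], 0
    for slot in range(final_layers):
        if slot in moe_slot_set:
            src_idx = round(slot / (final_layers - 1) * (moe_layers - 1))
            schedule.append(("moe", src_idx))
        else:
            schedule.append(("base", next_base))
            next_base += 1
    return schedule

if __name__ == "__main__":
    # 64-layer composite from the 32-layer base, assuming the converted MoE stack
    # keeps the source's 94 layers: this rule alternates base and MoE-derived blocks.
    for slot, pick in enumerate(build_schedule(64, 32, 94)[:8]):
        print(slot, pick)
```

For the 48-layer example in the commands above, the same rule keeps all 32 base layers and spreads 16 MoE-derived blocks across the stack (slots 1, 4, 7, ...).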