Daemontatox/Zirel-3

Model Description

Zirel-3 is a specialized finetune of cerebras/GLM-4.5-Air-REAP-82B-A12B, a memory-efficient 82B-parameter Mixture-of-Experts (MoE) model (~12B parameters active per forward pass) compressed using the novel REAP (Router-weighted Expert Activation Pruning) technique.

Base Model: GLM-4.5-Air-REAP-82B-A12B

The base model is a compressed variant of GLM-4.5-Air that:

  • Maintains near-identical performance while being 25% lighter (compressed from 110B to 82B total parameters)
  • Uses 82B total parameters (~12B activated per forward pass)
  • Employs the REAP pruning method, which outperforms expert merging, especially on generative tasks
  • Retains full capabilities: code generation, agentic workflows, repository-scale understanding, and function calling
  • Achieves drop-in compatibility with vanilla vLLM (no custom patches required)

REAP Technology

REAP (Router-weighted Expert Activation Pruning) is a one-shot MoE compression method that:

  • Prunes low-impact experts based on router gate values and expert activation norms (a simplified scoring sketch is shown below)
  • Preserves the router's independent control over remaining experts
  • Significantly outperforms expert merging on generative benchmarks (code, creative writing, math)
  • Maintains 95-97% of baseline model quality even at high compression ratios

Paper: REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression (Lasby et al., 2025)
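
The core idea can be illustrated with a short sketch: score each expert by the average of its router gate weight times the norm of its output over a calibration batch, then prune the lowest-scoring experts in one shot. This is a simplified illustration rather than the official REAP implementation; gate_weights and expert_outputs are placeholders for statistics gathered during a calibration pass.

import torch

def reap_saliency_sketch(gate_weights, expert_outputs, keep_ratio=0.75):
    """Simplified router-weighted expert saliency (illustrative only).

    gate_weights:   (num_tokens, num_experts) router probabilities from calibration data
    expert_outputs: (num_tokens, num_experts, hidden) per-expert outputs for the same tokens
    """
    # Per-token contribution of each expert: gate value times activation norm
    contrib = gate_weights * expert_outputs.norm(dim=-1)   # (num_tokens, num_experts)
    saliency = contrib.mean(dim=0)                         # (num_experts,)

    # Keep the highest-scoring experts; the rest are pruned without retraining
    num_keep = max(1, int(keep_ratio * saliency.numel()))
    keep_idx = torch.topk(saliency, num_keep).indices
    return keep_idx, saliency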

Zirel-3 Finetune

This finetune was trained on a custom, curated dataset designed to enhance the model's capabilities across multiple domains, including instruction following, reasoning, and domain-specific knowledge. The training builds on the strong foundation of the REAP-compressed GLM-4.5-Air base model.

Model Specifications

  • Parameters: 82B total (~12B active per forward pass)
  • Architecture: Sparse Mixture-of-Experts (SMoE)
  • Context Length: 128K tokens
  • Precision: BF16/FP16 compatible
  • License: MIT
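
These numbers can be sanity-checked locally by inspecting the published config. The attribute names in the loop below are the ones commonly used by GLM-style MoE configs and may differ for this checkpoint, so the snippet only prints the fields that are actually present.

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Daemontatox/Zirel-3", trust_remote_code=True)

print("model_type:", config.model_type)
# Field names vary by architecture; fall back gracefully if a field is absent
for field in ("max_position_embeddings", "n_routed_experts", "num_experts_per_tok", "torch_dtype"):
    print(field, getattr(config, field, "not present in this config"))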

Usage

Installation

pip install transformers torch vllm
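
Before pulling the weights (roughly 2 bytes per parameter in BF16, i.e. on the order of 160 GB), it is worth confirming that the GPU stack is visible to PyTorch:

import torch
import transformers

print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())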

Inference with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "Daemontatox/Zirel-3"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Prepare conversation
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain the REAP pruning method in simple terms."}
]

# Apply chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Tokenize
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Generate
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    repetition_penalty=1.1
)

# Decode
response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)

Inference with vLLM (Recommended for Production)

vLLM provides significantly faster inference with built-in optimizations for MoE models:

# Serve the model
vllm serve Daemontatox/Zirel-3 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --max-num-seqs 64 \
    --dtype bfloat16

Python Client:

from openai import OpenAI

# Connect to vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

# Create completion
response = client.chat.completions.create(
    model="Daemontatox/Zirel-3",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Write a Python function to implement binary search."}
    ],
    temperature=0.7,
    max_tokens=512
)

print(response.choices[0].message.content)
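
Streaming also works over the vLLM server through the standard OpenAI client (the next section shows the equivalent local Transformers setup); this assumes the server started above is listening on localhost:8000.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="Daemontatox/Zirel-3",
    messages=[{"role": "user", "content": "Summarize the REAP pruning method."}],
    temperature=0.7,
    max_tokens=512,
    stream=True
)

# Print tokens as they arrive
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)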

Streaming Response

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
import torch
from threading import Thread

model_name = "Daemontatox/Zirel-3"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

messages = [
    {"role": "user", "content": "Explain quantum computing"}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True, skip_prompt=True)

generation_kwargs = dict(
    **inputs,  # forwards input_ids and attention_mask
    streamer=streamer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)

thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

for new_text in streamer:
    print(new_text, end='', flush=True)

vLLM Advanced Configuration

# Multi-GPU setup with expert parallelism
vllm serve Daemontatox/Zirel-3 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --max-num-seqs 64 \
    --max-model-len 32768 \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.9 \
    --swap-space 16 \
    --disable-log-requests

# For low memory situations
vllm serve Daemontatox/Zirel-3 \
    --tensor-parallel-size 4 \
    --enable-expert-parallel \
    --max-num-seqs 32 \
    --max-model-len 16384 \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.85
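
Once either configuration is running, the served model name (which must match the model field in client requests) can be confirmed via the OpenAI-compatible models endpoint:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# vLLM lists the served model(s) under /v1/models
for served_model in client.models.list().data:
    print(served_model.id)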

Batch Processing Example

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Daemontatox/Zirel-3"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Decoder-only models need left padding for batched generation
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Batch of prompts
prompts = [
    "Explain machine learning",
    "Write a sorting algorithm",
    "What is the capital of France?"
]

# Convert to chat format
conversations = [
    [{"role": "user", "content": prompt}] for prompt in prompts
]

# Apply chat template to all
texts = [
    tokenizer.apply_chat_template(conv, tokenize=False, add_generation_prompt=True)
    for conv in conversations
]

# Tokenize with padding
inputs = tokenizer(
    texts,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=2048
).to(model.device)

# Generate
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id
)

# Decode only the newly generated tokens (skip the padded prompt portion)
responses = tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)
for prompt, response in zip(prompts, responses):
    print(f"Q: {prompt}\nA: {response}\n{'-'*50}")

Limitations

  • This is a large MoE model requiring substantial compute resources
  • Performance may vary based on hardware and optimization settings
  • May inherit biases present in training data
  • Requires careful prompt engineering for optimal results

Citation

If you use this model, please cite both the base model and the REAP paper:

@article{lasby2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}

@misc{zirel3,
  title={Zirel-3: A Specialized Finetune of GLM-4.5-Air-REAP},
  author={Daemontatox},
  year={2025},
  howpublished={\url{https://huggingface.co/Daemontatox/Zirel-3}}
}

Acknowledgments

This model builds upon:

  • Cerebras Research for the REAP compression method and GLM-4.5-Air-REAP base model
  • Original GLM-4.5-Air by Zhipu AI
  • The open-source AI community for tooling and infrastructure

License

MIT License - Same as the base model.
