# Daemontatox/Zirel-3

## Model Description
Zirel-3 is a specialized finetune of cerebras/GLM-4.5-Air-REAP-82B-A12B, a memory-efficient 82B-parameter Mixture-of-Experts (MoE) model (~12B parameters active per token) compressed using the REAP (Router-weighted Expert Activation Pruning) technique.
### Base Model: GLM-4.5-Air-REAP-82B-A12B
The base model is a compressed variant of GLM-4.5-Air that:
- Maintains near-identical performance while being 25% lighter (compressed from 110B to 82B total parameters)
- Uses 82B parameters (~12B activated per forward pass)
- Employs the REAP pruning method, which outperforms expert merging, especially on generative tasks
- Retains full capabilities: code generation, agentic workflows, repository-scale understanding, and function calling (see the sketch after this list)
- Achieves drop-in compatibility with vanilla vLLM (no custom patches required)
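
Since function calling is among the retained capabilities, the snippet below is a minimal sketch of exercising it through the vLLM OpenAI-compatible endpoint described later under Usage. The `get_weather` tool is a made-up example, and the server must be launched with tool-call parsing enabled for this model family; both are assumptions, not part of the original card.

```python
from openai import OpenAI

# Assumes a vLLM server is already running as shown under "Inference with vLLM";
# the get_weather tool below is purely illustrative.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="Daemontatox/Zirel-3",
    messages=[{"role": "user", "content": "What's the weather in Paris right now?"}],
    tools=tools,
)

# If the model decides to call the tool, the structured call shows up here
print(response.choices[0].message.tool_calls)
```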
### REAP Technology
REAP (Router-weighted Expert Activation Pruning) is a one-shot MoE compression method that:
- Prunes low-impact experts based on router gate values and expert activation norms (a minimal sketch of this scoring idea follows below)
- Preserves the router's independent control over remaining experts
- Significantly outperforms expert merging on generative benchmarks (code, creative writing, math)
- Maintains 95-97% of baseline model quality even at high compression ratios
**Paper:** [REAP the Experts: Why Pruning Prevails for One-Shot MoE compression](https://arxiv.org/abs/2510.13999) (Lasby et al., 2025)
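
To make the scoring idea concrete, here is a minimal, hedged sketch of a router-weighted expert activation criterion, written against the description above rather than the authors' code: each expert is scored by the average of its router gate weight times the norm of its output over a calibration batch, and only the top-scoring experts are kept.

```python
import torch

def reap_saliency(gate_weights: torch.Tensor, expert_outputs: torch.Tensor) -> torch.Tensor:
    """Illustrative router-weighted expert activation score (not the reference implementation).

    gate_weights:   (tokens, num_experts) router weights, 0 where an expert was not routed
    expert_outputs: (tokens, num_experts, hidden) expert outputs on calibration data
    returns:        (num_experts,) saliency scores
    """
    activation_norms = expert_outputs.norm(dim=-1)         # (tokens, num_experts)
    return (gate_weights * activation_norms).mean(dim=0)   # average router-weighted norm

def select_experts(saliency: torch.Tensor, keep_ratio: float = 0.75) -> torch.Tensor:
    """Keep the highest-scoring experts; the rest are pruned in one shot."""
    k = max(1, int(saliency.numel() * keep_ratio))
    return torch.topk(saliency, k).indices
```

In a real pipeline, the kept indices would be used to slice the expert weights and the corresponding router logits, which is what preserves the router's independent control over the remaining experts.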
### Zirel-3 Finetune

Zirel-3 was finetuned on a custom curated dataset designed to strengthen instruction following, reasoning, and domain-specific knowledge, building on the strong foundation of the REAP-compressed GLM-4.5-Air base model.
## Model Specifications

- Total Parameters: 82B (~12B active per token; see the config check after this list)
- Architecture: Sparse Mixture-of-Experts (SMoE)
- Context Length: 128K tokens
- Precision: BF16/FP16 compatible
- License: MIT
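
A quick way to sanity-check these numbers against the released checkpoint is to read the model config. The attribute names below (`n_routed_experts`, `num_experts_per_tok`) are assumptions based on common GLM-style MoE configs and may differ in this release, which is why they are accessed defensively.

```python
from transformers import AutoConfig

# Attribute names are assumptions (GLM-style MoE configs); missing ones print None.
cfg = AutoConfig.from_pretrained("Daemontatox/Zirel-3", trust_remote_code=True)

print("context length   :", getattr(cfg, "max_position_embeddings", None))
print("routed experts   :", getattr(cfg, "n_routed_experts", None))
print("experts per token:", getattr(cfg, "num_experts_per_tok", None))
print("dtype            :", getattr(cfg, "torch_dtype", None))
```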
## Usage

### Installation

```bash
pip install transformers torch vllm
```

### Inference with Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "Daemontatox/Zirel-3"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Prepare conversation
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain the REAP pruning method in simple terms."}
]

# Apply chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Tokenize
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Generate
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    repetition_penalty=1.1
)

# Decode only the newly generated tokens
response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)
```
### Inference with vLLM (Recommended for Production)

vLLM provides significantly faster inference with built-in optimizations for MoE models:

```bash
# Serve the model
vllm serve Daemontatox/Zirel-3 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --max-num-seqs 64 \
  --dtype bfloat16
```
**Python Client:**

```python
from openai import OpenAI

# Connect to the vLLM OpenAI-compatible server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

# Create a chat completion
response = client.chat.completions.create(
    model="Daemontatox/Zirel-3",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Write a Python function to implement binary search."}
    ],
    temperature=0.7,
    max_tokens=512
)

print(response.choices[0].message.content)
```
### Streaming Response

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
import torch
from threading import Thread

model_name = "Daemontatox/Zirel-3"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

messages = [
    {"role": "user", "content": "Explain quantum computing"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Stream decoded tokens as they are generated
streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True, skip_prompt=True)
generation_kwargs = dict(
    inputs=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    streamer=streamer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)

# Run generation in a background thread and consume the stream in the main thread
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

for new_text in streamer:
    print(new_text, end='', flush=True)
```
### vLLM Advanced Configuration

```bash
# Multi-GPU setup with expert parallelism
vllm serve Daemontatox/Zirel-3 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --max-num-seqs 64 \
  --max-model-len 32768 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.9 \
  --swap-space 16 \
  --disable-log-requests

# For low-memory situations
vllm serve Daemontatox/Zirel-3 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --max-num-seqs 32 \
  --max-model-len 16384 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.85
```
### Batch Processing Example

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Daemontatox/Zirel-3"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Decoder-only models should be left-padded for batched generation
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Batch of prompts
prompts = [
    "Explain machine learning",
    "Write a sorting algorithm",
    "What is the capital of France?"
]

# Convert to chat format
conversations = [
    [{"role": "user", "content": prompt}] for prompt in prompts
]

# Apply the chat template to every conversation
texts = [
    tokenizer.apply_chat_template(conv, tokenize=False, add_generation_prompt=True)
    for conv in conversations
]

# Tokenize with padding
inputs = tokenizer(
    texts,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=2048
).to(model.device)

# Generate
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id
)

# Decode only the generated continuations (skip the prompt tokens)
responses = tokenizer.batch_decode(
    outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True
)
for prompt, response in zip(prompts, responses):
    print(f"Q: {prompt}\nA: {response}\n{'-'*50}")
```
## Limitations
- This is a large MoE model requiring substantial compute resources
- Performance may vary based on hardware and optimization settings
- May inherit biases present in training data
- Requires careful prompt engineering for optimal results
## Citation
If you use this model, please cite both the base model and the REAP paper:
```bibtex
@article{lasby2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}

@misc{zirel3,
  title={Zirel-3: A Specialized Finetune of GLM-4.5-Air-REAP},
  author={Daemontatox},
  year={2025},
  howpublished={\url{https://huggingface.co/Daemontatox/Zirel-3}}
}
```
## Acknowledgments
This model builds upon:
- Cerebras Research for the REAP compression method and GLM-4.5-Air-REAP base model
- Original GLM-4.5-Air by Zhipu AI
- The open-source AI community for tooling and infrastructure
## License
MIT License - Same as the base model.