# Daemontatox/Zirel-3

## Model Description
Zirel-3 is a specialized finetune of cerebras/GLM-4.5-Air-REAP-82B-A12B, a memory-efficient 82B-parameter Mixture-of-Experts (MoE) model (~12B parameters active per token) compressed using the REAP (Router-weighted Expert Activation Pruning) technique.
### Base Model: GLM-4.5-Air-REAP-82B-A12B
The base model is a compressed variant of GLM-4.5-Air that:
- Maintains near-identical performance while being 25% lighter (compressed from 110B to 82B total parameters)
- Uses 82B parameters (~12B activated per forward pass)
- Employs the REAP pruning method, which outperforms expert merging, especially on generative tasks
- Retains full capabilities: code generation, agentic workflows, repository-scale understanding, and function calling (see the sketch after this list)
- Achieves drop-in compatibility with vanilla vLLM (no custom patches required)
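
Since function calling is among the retained capabilities, the snippet below is a minimal sketch of exercising it through the vLLM OpenAI-compatible endpoint described later under Usage. The `get_weather` tool is a made-up example, and the server must be launched with tool-call parsing enabled for this model family; both are assumptions, not part of the original card.

```python
from openai import OpenAI

# Assumes a vLLM server is already running as shown under "Inference with vLLM";
# the get_weather tool below is purely illustrative.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="Daemontatox/Zirel-3",
    messages=[{"role": "user", "content": "What's the weather in Paris right now?"}],
    tools=tools,
)

# If the model decides to call the tool, the structured call shows up here
print(response.choices[0].message.tool_calls)
```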
### REAP Technology
REAP (Router-weighted Expert Activation Pruning) is a one-shot MoE compression method that:
- Prunes low-impact experts based on router gate values and expert activation norms (a minimal sketch of this scoring idea follows below)
- Preserves the router's independent control over remaining experts
- Significantly outperforms expert merging on generative benchmarks (code, creative writing, math)
- Maintains 95-97% of baseline model quality even at high compression ratios
**Paper:** [REAP the Experts: Why Pruning Prevails for One-Shot MoE compression](https://arxiv.org/abs/2510.13999) (Lasby et al., 2025)
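
To make the scoring idea concrete, here is a minimal, hedged sketch of a router-weighted expert activation criterion, written against the description above rather than the authors' code: each expert is scored by the average of its router gate weight times the norm of its output over a calibration batch, and only the top-scoring experts are kept.

```python
import torch

def reap_saliency(gate_weights: torch.Tensor, expert_outputs: torch.Tensor) -> torch.Tensor:
    """Illustrative router-weighted expert activation score (not the reference implementation).

    gate_weights:   (tokens, num_experts) router weights, 0 where an expert was not routed
    expert_outputs: (tokens, num_experts, hidden) expert outputs on calibration data
    returns:        (num_experts,) saliency scores
    """
    activation_norms = expert_outputs.norm(dim=-1)         # (tokens, num_experts)
    return (gate_weights * activation_norms).mean(dim=0)   # average router-weighted norm

def select_experts(saliency: torch.Tensor, keep_ratio: float = 0.75) -> torch.Tensor:
    """Keep the highest-scoring experts; the rest are pruned in one shot."""
    k = max(1, int(saliency.numel() * keep_ratio))
    return torch.topk(saliency, k).indices
```

In a real pipeline, the kept indices would be used to slice the expert weights and the corresponding router logits, which is what preserves the router's independent control over the remaining experts.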
### Zirel-3 Finetune

Zirel-3 was finetuned on a custom curated dataset designed to strengthen instruction following, reasoning, and domain-specific knowledge, building on the strong foundation of the REAP-compressed GLM-4.5-Air base model.
## Model Specifications

- Total Parameters: 82B (~12B active per token; see the config check after this list)
- Architecture: Sparse Mixture-of-Experts (SMoE)
- Context Length: 128K tokens
- Precision: BF16/FP16 compatible
- License: MIT
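
A quick way to sanity-check these numbers against the released checkpoint is to read the model config. The attribute names below (`n_routed_experts`, `num_experts_per_tok`) are assumptions based on common GLM-style MoE configs and may differ in this release, which is why they are accessed defensively.

```python
from transformers import AutoConfig

# Attribute names are assumptions (GLM-style MoE configs); missing ones print None.
cfg = AutoConfig.from_pretrained("Daemontatox/Zirel-3", trust_remote_code=True)

print("context length   :", getattr(cfg, "max_position_embeddings", None))
print("routed experts   :", getattr(cfg, "n_routed_experts", None))
print("experts per token:", getattr(cfg, "num_experts_per_tok", None))
print("dtype            :", getattr(cfg, "torch_dtype", None))
```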
## Usage

### Installation

```bash
pip install transformers torch vllm
```

### Inference with Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "Daemontatox/Zirel-3"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Prepare conversation
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain the REAP pruning method in simple terms."}
]

# Apply chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Tokenize
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Generate
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    repetition_penalty=1.1
)

# Decode only the newly generated tokens
response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)
```
### Inference with vLLM (Recommended for Production)

vLLM provides significantly faster inference with built-in optimizations for MoE models:

```bash
# Serve the model
vllm serve Daemontatox/Zirel-3 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --max-num-seqs 64 \
  --dtype bfloat16
```
**Python Client:**

```python
from openai import OpenAI

# Connect to the vLLM OpenAI-compatible server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

# Create a chat completion
response = client.chat.completions.create(
    model="Daemontatox/Zirel-3",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Write a Python function to implement binary search."}
    ],
    temperature=0.7,
    max_tokens=512
)

print(response.choices[0].message.content)
```
### Streaming Response

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
import torch
from threading import Thread

model_name = "Daemontatox/Zirel-3"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

messages = [
    {"role": "user", "content": "Explain quantum computing"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Stream decoded tokens as they are generated
streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True, skip_prompt=True)
generation_kwargs = dict(
    inputs=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    streamer=streamer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)

# Run generation in a background thread and consume the stream in the main thread
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

for new_text in streamer:
    print(new_text, end='', flush=True)
```
### vLLM Advanced Configuration

```bash
# Multi-GPU setup with expert parallelism
vllm serve Daemontatox/Zirel-3 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --max-num-seqs 64 \
  --max-model-len 32768 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.9 \
  --swap-space 16 \
  --disable-log-requests

# For low-memory situations
vllm serve Daemontatox/Zirel-3 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --max-num-seqs 32 \
  --max-model-len 16384 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.85
```
### Batch Processing Example

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Daemontatox/Zirel-3"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Decoder-only models should be left-padded for batched generation
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Batch of prompts
prompts = [
    "Explain machine learning",
    "Write a sorting algorithm",
    "What is the capital of France?"
]

# Convert to chat format
conversations = [
    [{"role": "user", "content": prompt}] for prompt in prompts
]

# Apply the chat template to every conversation
texts = [
    tokenizer.apply_chat_template(conv, tokenize=False, add_generation_prompt=True)
    for conv in conversations
]

# Tokenize with padding
inputs = tokenizer(
    texts,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=2048
).to(model.device)

# Generate
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id
)

# Decode only the generated continuations (skip the prompt tokens)
responses = tokenizer.batch_decode(
    outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True
)
for prompt, response in zip(prompts, responses):
    print(f"Q: {prompt}\nA: {response}\n{'-'*50}")
```
## Limitations
- This is a large MoE model requiring substantial compute resources
- Performance may vary based on hardware and optimization settings
- May inherit biases present in training data
- Requires careful prompt engineering for optimal results
## Citation
If you use this model, please cite both the base model and the REAP paper:
```bibtex
@article{lasby2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}

@misc{zirel3,
  title={Zirel-3: A Specialized Finetune of GLM-4.5-Air-REAP},
  author={Daemontatox},
  year={2025},
  howpublished={\url{https://huggingface.co/Daemontatox/Zirel-3}}
}
```
## Acknowledgments
This model builds upon:
- Cerebras Research for the REAP compression method and GLM-4.5-Air-REAP base model
- Original GLM-4.5-Air by Zhipu AI
- The open-source AI community for tooling and infrastructure
## License
MIT License - Same as the base model.