# Homunculus-12B-exl2
- Original model: Homunculus by Arcee AI
- Based on: Mistral-Nemo-Base-2407 by Mistral AI and Qwen3-235B-A22B by Qwen
## Quants
- 4bpw h6 (main)
- 4.5bpw h6
- 5bpw h6
- 6bpw h6
- 8bpw h8
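To fetch just one quant, something like the following sketch should work, assuming each quant sits on its own branch of this repo (with 4bpw on main, per the list above). The repo id and branch name here are placeholders, not verified values:

```python
from huggingface_hub import snapshot_download

# Placeholders: check this repo's branch list for the actual revision names.
snapshot_download(
    repo_id="<your-user>/Homunculus-12B-exl2",
    revision="6bpw-h6",  # assumed branch naming for the 6bpw h6 quant
    local_dir="models/Homunculus-12B-exl2-6bpw",
)
```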
## Quantization notes
Made with Exllamav2 0.3.1 using its default calibration dataset.

These quants can be used on RTX GPUs (Windows) or RTX/ROCm GPUs (Linux) with TabbyAPI or Text-Generation-WebUI.

Make sure you have enough VRAM for the quant you pick. For reference, I used to run 6bpw Mistral-Nemo quants on 12GB VRAM at 16k context with Q6 or Q4 cache.

If you have an older GPU (e.g. GTX series or P40) or low VRAM, try GGUF quants instead.
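For reference, loading one of these exl2 quants directly with the exllamav2 Python API looks roughly like this. It is a sketch modeled on exllamav2's own examples; the local path and the choice of cache class are assumptions, and TabbyAPI/Text-Generation-WebUI handle all of this for you:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q6, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

config = ExLlamaV2Config("models/Homunculus-12B-exl2")  # assumed local path

model = ExLlamaV2(config)
# Q6 KV cache at 16k context, mirroring the setup described above
cache = ExLlamaV2Cache_Q6(model, max_seq_len=16384, lazy=True)
model.load_autosplit(cache, progress=True)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)
print(generator.generate(prompt="Why is the sky blue?", max_new_tokens=128))
```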
## Original model card
# Arcee Homunculus-12B
Homunculus is a 12-billion-parameter instruction model distilled from Qwen3-235B onto the Mistral-Nemo backbone. It was purpose-built to preserve Qwen's two-mode interaction style, `/think` (deliberate chain-of-thought) and `/nothink` (concise answers), while running on a single consumer GPU.
## ✨ What's special?
| Feature | Detail |
|---|---|
| Reasoning-trace transfer | Instead of copying just final probabilities, we align full logit trajectories, yielding more faithful reasoning. |
| Total-Variation-Distance loss | Matches the teacher's confidence distribution more closely and smooths the loss landscape (sketched below). |
| Tokenizer replacement | The original Mistral tokenizer was swapped for Qwen3's tokenizer. |
| Dual interaction modes | Use `/think` when you want transparent step-by-step reasoning (good for analysis & debugging). Use `/nothink` for terse, production-ready answers. Most reliable when placed in the system role field. |
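Since the tokenizer swap gives the student the same vocabulary as the Qwen3 teacher, a TVD loss can be computed directly between their per-token distributions. A minimal PyTorch sketch of this kind of objective (an illustration, not Arcee's training code; the function name, shapes, and masking are assumptions):

```python
import torch.nn.functional as F

def tvd_distill_loss(student_logits, teacher_logits, mask=None):
    # Total variation distance per position: TVD(p, q) = 0.5 * sum_i |p_i - q_i|
    # Logits assumed to be [batch, seq_len, vocab] over a shared vocabulary.
    p = F.softmax(teacher_logits.float(), dim=-1)
    q = F.softmax(student_logits.float(), dim=-1)
    tvd = 0.5 * (p - q).abs().sum(dim=-1)  # [batch, seq_len]
    if mask is not None:  # optional float mask to ignore padding positions
        return (tvd * mask).sum() / mask.sum()
    return tvd.mean()
```

Unlike KL divergence, TVD is bounded in [0, 1], which is one intuition for the smoother loss landscape claimed above.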
## Benchmark results
| Benchmark | Score |
|---|---|
| GPQA Diamond (average of 3) | 57.1% |
| MMLU | 67.5% |
## 🔧 Quick Start
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "arcee-ai/Homunculus"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

# /think mode - chain-of-thought reasoning
messages = [
    {"role": "system", "content": "You are a helpful assistant. /think"},
    {"role": "user", "content": "Why is the sky blue?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,  # start the assistant turn
    return_tensors="pt",
).to(model.device)
output = model.generate(
    inputs,
    max_new_tokens=512,
    do_sample=True,  # temperature is ignored unless sampling is enabled
    temperature=0.7,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))

# /nothink mode - direct answers
messages = [
    {"role": "system", "content": "You are a helpful assistant. /nothink"},
    {"role": "user", "content": "Summarize the plot of Hamlet in two sentences."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
output = model.generate(
    inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## 💡 Intended Use & Limitations
Homunculus is designed for:
- Research on reasoning-trace distillation, logit imitation, and mode-switchable assistants.
- Lightweight production deployments that need strong reasoning in under 12 GB of VRAM.
### Known limitations
- May inherit biases from the Qwen3 teacher and internet-scale pretraining data.
- Long-context use (>32k tokens) is experimental; expect extra latency and memory overhead.