# Homunculus-12B-exl2
- Original model: Homunculus by Arcee AI
- Based on: Mistral-Nemo-Base-2407 by Mistral AI and Qwen3-235B-A22B by Qwen
## Quants
- 4bpw h6 (main)
- 4.5bpw h6
- 5bpw h6
- 6bpw h6
- 8bpw h8
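To fetch just one quant, something like the following sketch should work, assuming each quant sits on its own branch of this repo (with 4bpw on main, per the list above). The repo id and branch name here are placeholders, not verified values:

```python
from huggingface_hub import snapshot_download

# Placeholders: check this repo's branch list for the actual revision names.
snapshot_download(
    repo_id="<your-user>/Homunculus-12B-exl2",
    revision="6bpw-h6",  # assumed branch naming for the 6bpw h6 quant
    local_dir="models/Homunculus-12B-exl2-6bpw",
)
```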
## Quantization notes
Made with Exllamav2 0.3.1 using its default calibration dataset.

These quants can be used on RTX GPUs (Windows) or RTX/ROCm GPUs (Linux) with TabbyAPI or Text-Generation-WebUI.

Make sure you have enough VRAM for the quant you pick. For reference, I used to run 6bpw Mistral-Nemo quants on 12GB VRAM at 16k context with Q6 or Q4 cache.

If you have an older GPU (e.g. GTX series or P40) or low VRAM, try GGUF quants instead.
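For reference, loading one of these exl2 quants directly with the exllamav2 Python API looks roughly like this. It is a sketch modeled on exllamav2's own examples; the local path and the choice of cache class are assumptions, and TabbyAPI/Text-Generation-WebUI handle all of this for you:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q6, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

config = ExLlamaV2Config("models/Homunculus-12B-exl2")  # assumed local path

model = ExLlamaV2(config)
# Q6 KV cache at 16k context, mirroring the setup described above
cache = ExLlamaV2Cache_Q6(model, max_seq_len=16384, lazy=True)
model.load_autosplit(cache, progress=True)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)
print(generator.generate(prompt="Why is the sky blue?", max_new_tokens=128))
```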
## Original model card
# Arcee Homunculus-12B
Homunculus is a 12-billion-parameter instruction model distilled from Qwen3-235B onto the Mistral-Nemo backbone. It was purpose-built to preserve Qwen's two-mode interaction style, `/think` (deliberate chain-of-thought) and `/nothink` (concise answers), while running on a single consumer GPU.
## ✨ What's special?
| Feature | Detail |
|---|---|
| Reasoning-trace transfer | Instead of copying just final probabilities, we align full logit trajectories, yielding more faithful reasoning. |
| Total-Variation-Distance loss | Matches the teacher's confidence distribution more closely and smooths the loss landscape (sketched below). |
| Tokenizer replacement | The original Mistral tokenizer was swapped for Qwen3's tokenizer. |
| Dual interaction modes | Use `/think` when you want transparent step-by-step reasoning (good for analysis & debugging). Use `/nothink` for terse, production-ready answers. Most reliable when placed in the system role field. |
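Since the tokenizer swap gives the student the same vocabulary as the Qwen3 teacher, a TVD loss can be computed directly between their per-token distributions. A minimal PyTorch sketch of this kind of objective (an illustration, not Arcee's training code; the function name, shapes, and masking are assumptions):

```python
import torch.nn.functional as F

def tvd_distill_loss(student_logits, teacher_logits, mask=None):
    # Total variation distance per position: TVD(p, q) = 0.5 * sum_i |p_i - q_i|
    # Logits assumed to be [batch, seq_len, vocab] over a shared vocabulary.
    p = F.softmax(teacher_logits.float(), dim=-1)
    q = F.softmax(student_logits.float(), dim=-1)
    tvd = 0.5 * (p - q).abs().sum(dim=-1)  # [batch, seq_len]
    if mask is not None:  # optional float mask to ignore padding positions
        return (tvd * mask).sum() / mask.sum()
    return tvd.mean()
```

Unlike KL divergence, TVD is bounded in [0, 1], which is one intuition for the smoother loss landscape claimed above.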
## Benchmark results
| Benchmark | Score |
|---|---|
| GPQA Diamond (average of 3) | 57.1% |
| MMLU | 67.5% |
## 🔧 Quick Start
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "arcee-ai/Homunculus"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

# /think mode - chain-of-thought reasoning
messages = [
    {"role": "system", "content": "You are a helpful assistant. /think"},
    {"role": "user", "content": "Why is the sky blue?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,  # start the assistant turn
    return_tensors="pt",
).to(model.device)
output = model.generate(
    inputs,
    max_new_tokens=512,
    do_sample=True,  # temperature is ignored unless sampling is enabled
    temperature=0.7,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))

# /nothink mode - direct answers
messages = [
    {"role": "system", "content": "You are a helpful assistant. /nothink"},
    {"role": "user", "content": "Summarize the plot of Hamlet in two sentences."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
output = model.generate(
    inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## 💡 Intended Use & Limitations
Homunculus is designed for:
- Research on reasoning-trace distillation, logit imitation, and mode-switchable assistants.
- Lightweight production deployments that need strong reasoning in under 12 GB of VRAM.
### Known limitations
- May inherit biases from the Qwen3 teacher and internet-scale pretraining data.
- Long-context use (>32k tokens) is experimental; expect extra latency and memory overhead.