---
language:
- en
license: apache-2.0
library_name: exllamav2
base_model:
- arcee-ai/Homunculus
tags:
- distillation
- /think
- /nothink
- reasoning-transfer
- arcee-ai
---
# Homunculus-12B-exl2
Original model: [Homunculus](https://huggingface.co/arcee-ai/Homunculus) by [Arcee AI](https://huggingface.co/arcee-ai)
Based on: [Mistral-Nemo-Base-2407](https://huggingface.co/mistralai/Mistral-Nemo-Base-2407) by [Mistral AI](https://huggingface.co/mistralai)

## Quants
[4bpw h6 (main)](https://huggingface.co/cgus/Homunculus-exl2/tree/main)
[4.5bpw h6](https://huggingface.co/cgus/Homunculus-exl2/tree/4.5bpw-h6)
[5bpw h6](https://huggingface.co/cgus/Homunculus-exl2/tree/5bpw-h6)
[6bpw h6](https://huggingface.co/cgus/Homunculus-exl2/tree/6bpw-h6)
[8bpw h8](https://huggingface.co/cgus/Homunculus-exl2/tree/8bpw-h8)

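Each quant lives in its own branch of this repo. A minimal sketch for fetching one branch with the `huggingface_hub` Python library follows; the local directory name is just an example:

```python
# Minimal sketch: download a single quant branch locally with huggingface_hub.
# The revision must match one of the branch names listed above; the
# local_dir path is only an example.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="cgus/Homunculus-exl2",
    revision="6bpw-h6",               # branch name of the desired quant
    local_dir="Homunculus-exl2-6bpw-h6",
)
```
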
## Quantization notes
Made with ExLlamaV2 0.3.1 using the default calibration dataset.
These quants can be used on an RTX GPU (Windows) or RTX/ROCm GPUs (Linux) with TabbyAPI or Text-Generation-WebUI.
Make sure you have enough VRAM for the quant you pick: I used to run 6bpw Mistral-Nemo quants with 12GB VRAM at 16k context and Q6 or Q4 cache.
If you have older GPUs (e.g. GTX series or P40) or low VRAM, try GGUF quants instead.
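If you want to load a quant directly from Python rather than through TabbyAPI or Text-Generation-WebUI, here is a minimal sketch assuming the exllamav2 0.3.x dynamic generator API; the model directory, context length and sampler settings are illustrative, not requirements:

```python
# Minimal sketch, assuming the exllamav2 0.3.x Python API.
# The model directory, context length and sampler settings are illustrative.
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config("Homunculus-exl2-6bpw-h6")  # local folder with the quant
config.max_seq_len = 16384                           # shrink context to fit ~12GB VRAM

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)                          # spread layers across available VRAM
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_p = 0.9

# Raw completion prompt; for chat-style usage, build the prompt with the
# model's chat template (see the Transformers Quick Start further below).
print(generator.generate(
    prompt="Why is the sky blue?",
    gen_settings=settings,
    max_new_tokens=256,
    add_bos=True,
))
```
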
# Original model card
![Homunculus Logo](https://huggingface.co/arcee-ai/Homunculus/resolve/main/logo.jpg)

# Arcee **Homunculus-12B**

**Homunculus** is a 12 billion-parameter instruction model distilled from **Qwen3-235B** onto the **Mistral-Nemo** backbone.
It was purpose-built to preserve Qwen’s two-mode interaction style—`/think` (deliberate chain-of-thought) and `/nothink` (concise answers)—while running on a single consumer GPU.

---

## ✨ What’s special?

| Feature | Detail |
| --------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Reasoning-trace transfer** | Instead of copying just final probabilities, we align *full* logit trajectories, yielding more faithful reasoning. |
| **Total-Variation-Distance loss** | Matches the teacher’s confidence distribution more closely and smooths the loss landscape (see the sketch below the table). |
| **Tokenizer replacement** | The original Mistral tokenizer was swapped for Qwen3's tokenizer. |
| **Dual interaction modes** | Use `/think` when you want transparent step-by-step reasoning (good for analysis & debugging). Use `/nothink` for terse, production-ready answers. Most reliable when placed in the system role field. |

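To make the **Total-Variation-Distance loss** row concrete, here is a small illustrative sketch of a TVD objective over per-position next-token distributions, i.e. the full logit trajectories mentioned in the **Reasoning-trace transfer** row. This is just the textbook definition applied to student and teacher logits, not Arcee's actual training code:

```python
# Illustrative TVD distillation loss: TVD(p, q) = 0.5 * sum_i |p_i - q_i|,
# computed at every token position and averaged. Not Arcee's training code.
import torch
import torch.nn.functional as F

def tvd_distillation_loss(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor) -> torch.Tensor:
    """Both tensors have shape (batch, seq_len, vocab) over a shared vocabulary."""
    p_student = F.softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits, dim=-1)
    # 0.5 * L1 distance between the two next-token distributions at each position
    tvd_per_position = 0.5 * (p_student - p_teacher).abs().sum(dim=-1)
    return tvd_per_position.mean()
```
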
---

## Benchmark results

| Benchmark | Score |
| --------- | ----- |
| GPQA Diamond (average of 3) | 57.1% |
| MMLU | 67.5% |

## 🔧 Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "arcee-ai/Homunculus"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"
)

# /think mode - Chain-of-thought reasoning
messages = [
    {"role": "system", "content": "You are a helpful assistant. /think"},
    {"role": "user", "content": "Why is the sky blue?"},
]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(
    inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7
)
print(tokenizer.decode(output[0], skip_special_tokens=True))

# /nothink mode - Direct answers
messages = [
    {"role": "system", "content": "You are a helpful assistant. /nothink"},
    {"role": "user", "content": "Summarize the plot of Hamlet in two sentences."},
]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(
    inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## 💡 Intended Use & Limitations

Homunculus is designed for:

* **Research** on reasoning-trace distillation, logit imitation, and mode-switchable assistants.
* **Lightweight production** deployments that need strong reasoning at <12 GB VRAM.

### Known limitations

* May inherit biases from the Qwen3 teacher and internet-scale pretraining data.
* Long-context (>32k tokens) use is experimental—expect latency & memory overhead.

---