huihui-ai/Huihui-MoE-0.8B-2E
Model Overview
Huihui-MoE-0.8B-2E is a Mixture of Experts (MoE) language model developed by huihui.ai, built upon the Qwen/Qwen3-0.6B base model. It modifies the standard Transformer architecture by replacing each MLP layer with an MoE layer containing 2 experts, of which only one is activated per token, so inference cost stays close to that of the base model. The model is intended for natural language processing tasks, including text generation, question answering, and conversational applications.
Huihui-MoE-0.8B-2E is currently the smallest MoE model, and the same construction can be scaled to include more experts. It has not been fine-tuned; you can fine-tune it for your specific requirements.
Without fine-tuning, it can be used in the same way as the original Qwen/Qwen3-0.6B model.
In testing, a model built from Qwen3-0.6B with 64 experts has approximately 17B parameters, and one with 128 experts has approximately 34B parameters.
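The 17B and 34B figures are consistent with a back-of-the-envelope count in which every expert is the size of one Qwen3-0.6B MLP. The sketch below is an estimate, not the author's measurement method. Its default dimensions (hidden size 1024, MLP intermediate size 3072, 28 layers, 16 query heads and 8 key/value heads of dimension 128, vocabulary 151,936, tied input/output embeddings) are assumptions taken from the published Qwen3-0.6B configuration and should be checked against `config.json`. `estimate_params` is a hypothetical helper, and the small normalization weights are ignored.

```python
def estimate_params(num_experts, experts_per_tok=1, hidden=1024, intermediate=3072,
                    layers=28, vocab=151_936, q_heads=16, kv_heads=8, head_dim=128):
    """Return (total, activated) parameter estimates for an MoE built on a dense base.

    All default dimensions are ASSUMED Qwen3-0.6B values; verify them against config.json.
    """
    embed = vocab * hidden                       # tied embeddings, counted once
    attn = 2 * hidden * q_heads * head_dim       # q_proj and o_proj
    attn += 2 * hidden * kv_heads * head_dim     # k_proj and v_proj
    expert = 3 * hidden * intermediate           # gate_proj, up_proj, down_proj
    router = hidden * num_experts                # one linear gate per layer
    total = embed + layers * (attn + num_experts * expert + router)
    active = embed + layers * (attn + experts_per_tok * expert + router)
    return total, active


for n in (2, 64, 128):
    total, active = estimate_params(n)
    print(f"{n:>3} experts: total ~{total / 1e9:.2f}B, activated ~{active / 1e9:.2f}B")
```

Under these assumptions the estimate lands within a few percent of the figures quoted in this card, and the activated count stays near 0.6B regardless of the number of experts, because only one expert per layer runs for each token.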
- Architecture: Qwen3MoeForCausalLM model with 2 experts per layer (num_experts=2), activating 1 expert per token (num_experts_per_tok=1).
- Total Parameters: ~0.88 billion (0.8B)
- Activated Parameters: ~0.62 billion (0.6B) during inference, comparable to Qwen3-0.6B
- Developer: huihui.ai
- Release Date: June 2025
- License: Inherits the license of the Qwen3 base model (apache-2.0)
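To make `num_experts=2` and `num_experts_per_tok=1` concrete, here is a minimal pure-Python sketch of top-k routing for a single token: a softmax over the router logits, followed by selection of the k highest-scoring experts. It illustrates the general technique and is not the library's implementation. `route_top_k` and its `normalize` flag are hypothetical names, and whether this model renormalizes the selected weights is an assumption to check in its configuration.

```python
import math


def route_top_k(router_logits, k=1, normalize=True):
    """Pick the k experts with the highest softmax probability for one token.

    Returns a list of (expert_index, weight) pairs. With normalize=True
    (an ASSUMED setting) the selected weights are rescaled to sum to 1.
    """
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]   # numerically stable softmax
    denom = sum(exps)
    probs = [e / denom for e in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    if normalize:
        mass = sum(probs[i] for i in chosen)
        return [(i, probs[i] / mass) for i in chosen]
    return [(i, probs[i]) for i in chosen]


# Two experts, one active per token: only the winning expert's MLP is evaluated.
print(route_top_k([0.2, 1.5], k=1))
```

With a randomly initialized router, which expert wins for a given token is essentially arbitrary until the gate is trained.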
Training
- Base Model: Qwen3-0.6B, pre-trained by the Qwen team.
- Conversion: The model copies embeddings, self-attention, and normalization weights from Qwen3-0.6B, replacing MLP layers with MoE layers (2 experts). Gating weights are randomly initialized.
- Fine-Tuning: Not fine-tuned; fine-tuning on the target task is recommended so that the randomly initialized gate learns useful expert routing.
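The conversion step can be pictured as a renaming plan over checkpoint keys. The sketch below is illustrative only and rests on two assumptions that this card does not confirm: that each expert starts as a copy of the dense MLP (which would be consistent with the statement that the un-fine-tuned model can be used like the original), and that parameter names follow the `mlp.experts.{e}.*` and `mlp.gate` pattern used by Qwen3-MoE checkpoints in transformers. `plan_conversion` is a hypothetical helper and works on key names only, without loading any tensors.

```python
import re

# Dense MLP projection weights, e.g. "model.layers.3.mlp.up_proj.weight".
_DENSE_MLP = re.compile(r"^(model\.layers\.\d+\.mlp)\.(gate_proj|up_proj|down_proj)\.weight$")


def plan_conversion(dense_keys, num_experts=2):
    """Map each MoE parameter name to the dense parameter it would be copied from.

    Returns (copy_plan, router_keys): copy_plan maps target key -> source key,
    and router_keys lists the gate weights that have no dense counterpart and
    therefore need a fresh (random) initialization.
    """
    copy_plan = {}
    router_keys = set()
    for key in dense_keys:
        match = _DENSE_MLP.match(key)
        if match is None:
            copy_plan[key] = key                 # embeddings, attention, norms: copied as-is
            continue
        mlp_prefix, proj = match.groups()
        for e in range(num_experts):             # ASSUMPTION: every expert copies the dense MLP
            copy_plan[f"{mlp_prefix}.experts.{e}.{proj}.weight"] = key
        router_keys.add(f"{mlp_prefix}.gate.weight")
    return copy_plan, sorted(router_keys)


plan, routers = plan_conversion([
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
])
for target, source in plan.items():
    print(f"{target} <- {source}")
print("randomly initialized:", routers)
```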
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TextStreamer
import torch
import os
import signal

# Use half of the available CPU cores (at least one) for CPU-side math libraries.
cpu_count = os.cpu_count() or 1
print(f"Number of CPU cores in the system: {cpu_count}")
half_cpu_count = max(1, cpu_count // 2)
os.environ["MKL_NUM_THREADS"] = str(half_cpu_count)
os.environ["OMP_NUM_THREADS"] = str(half_cpu_count)
torch.set_num_threads(half_cpu_count)

print(f"PyTorch threads: {torch.get_num_threads()}")
print(f"MKL threads: {os.getenv('MKL_NUM_THREADS')}")
print(f"OMP threads: {os.getenv('OMP_NUM_THREADS')}")

# Load the model and tokenizer
NEW_MODEL_ID = "huihui-ai/Huihui-MoE-0.8B-2E"
print(f"Load Model {NEW_MODEL_ID} ... ")

# Optional 4-bit quantization config (assumes the bitsandbytes package is installed).
# To use it, uncomment quantization_config in from_pretrained below.
quant_config_4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

model = AutoModelForCausalLM.from_pretrained(
    NEW_MODEL_ID,
    device_map="auto",
    trust_remote_code=True,
    #quantization_config=quant_config_4,
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(NEW_MODEL_ID, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

# Conversation state and settings that can be toggled from the chat prompt.
initial_messages = [{"role": "system", "content": "You are a helpful assistant."}]
messages = initial_messages.copy()
enable_thinking = True
skip_prompt = True
skip_special_tokens = True


class CustomTextStreamer(TextStreamer):
    """Streams text to stdout, keeps a copy, and can abort generation on request."""

    def __init__(self, tokenizer, skip_prompt=True, skip_special_tokens=True):
        super().__init__(tokenizer, skip_prompt=skip_prompt, skip_special_tokens=skip_special_tokens)
        self.generated_text = ""
        self.stop_flag = False

    def on_finalized_text(self, text: str, stream_end: bool = False):
        self.generated_text += text
        print(text, end="", flush=True)
        if self.stop_flag:
            # Raised inside model.generate() and caught in generate_stream().
            raise StopIteration

    def stop_generation(self):
        self.stop_flag = True


def generate_stream(model, tokenizer, messages, enable_thinking, skip_prompt, skip_special_tokens, max_new_tokens):
    input_ids = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        enable_thinking=enable_thinking,
        add_generation_prompt=True,
        return_tensors="pt"
    )
    attention_mask = torch.ones_like(input_ids, dtype=torch.long)
    tokens = input_ids.to(model.device)
    attention_mask = attention_mask.to(model.device)

    streamer = CustomTextStreamer(tokenizer, skip_prompt=skip_prompt, skip_special_tokens=skip_special_tokens)

    # Let Ctrl+C stop the current response instead of killing the whole chat.
    def signal_handler(sig, frame):
        streamer.stop_generation()
        print("\n[Generation stopped by user with Ctrl+C]")

    signal.signal(signal.SIGINT, signal_handler)

    print("Response: ", end="", flush=True)
    try:
        generated_ids = model.generate(
            tokens,
            attention_mask=attention_mask,
            use_cache=False,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
            streamer=streamer
        )
        del generated_ids
    except StopIteration:
        print("\n[Stopped by user]")

    del input_ids, attention_mask
    torch.cuda.empty_cache()
    # Restore the default Ctrl+C behavior once generation has finished.
    signal.signal(signal.SIGINT, signal.SIG_DFL)

    return streamer.generated_text, streamer.stop_flag


# Interactive chat loop. Commands: /exit, /clear, /nothink, /skip_prompt, /skip_special_tokens
while True:
    user_input = input("User: ").strip()
    if user_input.lower() == "/exit":
        print("Exiting chat.")
        break
    if user_input.lower() == "/clear":
        messages = initial_messages.copy()
        print("Chat history cleared. Starting a new conversation.")
        continue
    if user_input.lower() == "/nothink":
        # Toggles thinking mode on or off.
        enable_thinking = not enable_thinking
        print(f"Thinking = {enable_thinking}.")
        continue
    if user_input.lower() == "/skip_prompt":
        skip_prompt = not skip_prompt
        print(f"skip_prompt = {skip_prompt}.")
        continue
    if user_input.lower() == "/skip_special_tokens":
        skip_special_tokens = not skip_special_tokens
        print(f"skip_special_tokens = {skip_special_tokens}.")
        continue
    if not user_input:
        print("Input cannot be empty. Please enter something.")
        continue

    messages.append({"role": "user", "content": user_input})
    response, stop_flag = generate_stream(model, tokenizer, messages, enable_thinking, skip_prompt, skip_special_tokens, 14192)
    print("", flush=True)
    if stop_flag:
        # An interrupted response is not added to the chat history.
        continue
    messages.append({"role": "assistant", "content": response})
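When thinking is enabled, Qwen3-style chat models typically emit their reasoning between `<think>` and `</think>` before the final answer, and the chat loop above stores the whole streamed string in `messages`. If you prefer to keep only the final answer in the history, a small helper along these lines could be applied to `response` first. This is a sketch under the assumption that the tags appear literally in the decoded text; `split_thinking` is a hypothetical name and not part of transformers.

```python
def split_thinking(text, open_tag="<think>", close_tag="</think>"):
    """Split a streamed response into (reasoning, answer).

    If no closing tag is present, the whole text is treated as the answer.
    The tag names are ASSUMED to match what the model emits.
    """
    head, sep, tail = text.partition(close_tag)
    if not sep:
        return "", text.strip()
    reasoning = head.replace(open_tag, "", 1).strip()
    return reasoning, tail.strip()


reasoning, answer = split_thinking("<think>\n2 + 2 = 4\n</think>\n\nThe answer is 4.")
print("reasoning:", reasoning)
print("answer:", answer)
```

In the loop, this would mean appending `split_thinking(response)[1]` as the assistant message instead of `response`.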
Applications
- Text Generation: Articles, dialogues, and creative writing.
- Question Answering: Information retrieval and query resolution.
- Conversational AI: Multi-turn dialogues for chatbots.
- Research: Exploration of MoE architectures and efficient model scaling.
Limitations
- Fine-Tuning Required: Randomly initialized gating weights may lead to suboptimal expert utilization without fine-tuning.
- Compatibility: Developed with transformers 4.52.4; ensure matching versions to avoid loading issues.
- Inference Speed: While efficient for an MoE model, performance depends on hardware (GPU recommended).
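Since the compatibility note pins transformers 4.52.4, a quick check before loading the model can save some debugging. The following is a minimal sketch using only the standard library. `release_tuple` and `describe` are hypothetical helpers, and treating any other version as a warning rather than an error is a design choice for illustration, not a statement about which versions actually work.

```python
from importlib import metadata

PINNED = "4.52.4"  # version this card says the model was developed with


def release_tuple(version):
    """Turn '4.53.0.dev0' into (4, 53, 0), ignoring any non-numeric suffix."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def describe(installed, pinned=PINNED):
    """Return a short human-readable verdict for the installed version string."""
    if release_tuple(installed) == release_tuple(pinned):
        return f"OK: transformers {installed} matches the pinned {pinned}"
    return f"WARNING: transformers {installed} differs from the pinned {pinned}"


try:
    print(describe(metadata.version("transformers")))
except metadata.PackageNotFoundError:
    print("transformers is not installed")
```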
Ethical Considerations
- Bias: Inherits potential biases from the Qwen3-0.6B base model; users should evaluate outputs for fairness.
- Usage: Intended for research and responsible applications; avoid generating harmful or misleading content.
Contact
- Developer: huihui.ai
- Repository: huihui-ai/Huihui-MoE-0.8B-2E (available locally or on Hugging Face)
- Issues: Report bugs or request features via the repository, or by email to [email protected]
Acknowledgments
- Built upon the Qwen3-0.6B model by the Qwen team.
- Powered by the Hugging Face transformers library.