metadata

license: apache-2.0
language:
  - en
pipeline_tag: text-generation
tags:
  - MoE

LLaMA-MoE-v1-3.5B (4/16)

[💻 Code] | [📜 Technical Report]

👋 Very nice to meet you here~

❤️ This repo contains the model LLaMA-MoE-v1-3.5B (4/16), which activates 4 of its 16 experts (3.5B activated parameters). The model has NOT been fine-tuned on instruction pairs, so it may not perform well as a chatbot.

📢 LLaMA-MoE is a series of Mixture-of-Experts (MoE) models based on LLaMA-2. The code for training this model is available at this repo.

💎 This series of models is obtained by partitioning the FFNs of the original LLaMA into experts and then continuing pre-training. The total model size is only 6.7B parameters, which makes it convenient for deployment and research use. More details can be found in our technical report.
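
To make the idea of partitioning an FFN into experts concrete, here is a minimal pure-Python sketch. It is an illustration only, not the LLaMA-MoE implementation: the random neuron split, the linear `router`, the toy sizes, and all function names are assumptions, and the sketch sums the selected experts without any gate weighting or rescaling. It shows that splitting the intermediate neurons of a SwiGLU FFN into disjoint groups loses nothing when every expert is active, and that top-k routing simply evaluates a subset of those groups.

```python
import math
import random


def silu(z):
    """SiLU activation used by LLaMA-style FFNs."""
    return z / (1.0 + math.exp(-z))


def ffn_slice(x, w_gate, w_up, w_down, neurons):
    """SwiGLU FFN output computed from the given intermediate neurons only."""
    out = [0.0] * len(x)
    for j in neurons:
        g = sum(w * xi for w, xi in zip(w_gate[j], x))
        u = sum(w * xi for w, xi in zip(w_up[j], x))
        h = silu(g) * u
        for i in range(len(x)):
            out[i] += w_down[j][i] * h
    return out


def partition(num_neurons, num_experts, seed=0):
    """Split neuron indices into equal, disjoint groups (random split, assumed)."""
    idx = list(range(num_neurons))
    random.Random(seed).shuffle(idx)
    size = num_neurons // num_experts
    return [idx[e * size:(e + 1) * size] for e in range(num_experts)]


def moe_forward(x, weights, experts, router, top_k):
    """Run only the top_k experts with the highest router scores and sum them."""
    scores = [sum(r * xi for r, xi in zip(row, x)) for row in router]
    chosen = sorted(range(len(experts)), key=lambda e: scores[e], reverse=True)[:top_k]
    out = [0.0] * len(x)
    for e in chosen:
        part = ffn_slice(x, *weights, experts[e])
        out = [o + p for o, p in zip(out, part)]
    return out, chosen


# Toy sizes (hypothetical); the expert counts mirror the 4/16 configuration.
rng = random.Random(0)
d_model, d_ff, num_experts, top_k = 8, 32, 16, 4


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


weights = (rand_matrix(d_ff, d_model), rand_matrix(d_ff, d_model), rand_matrix(d_ff, d_model))
router = rand_matrix(num_experts, d_model)
x = [rng.uniform(-1, 1) for _ in range(d_model)]

experts = partition(d_ff, num_experts)
dense = ffn_slice(x, *weights, range(d_ff))
all_active, _ = moe_forward(x, weights, experts, router, top_k=num_experts)
sparse, chosen = moe_forward(x, weights, experts, router, top_k=top_k)

print("max |dense - all experts| =", max(abs(a - b) for a, b in zip(dense, all_active)))
print("experts used for this token:", sorted(chosen))
```

With all 16 experts active the output matches the dense FFN up to floating-point error; with `top_k=4` only a quarter of the FFN neurons are evaluated for the token.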

🚀 QuickStart

# python>=3.10

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_dir = "llama-moe/LLaMA-MoE-v1-3_5B-4_16"
# trust_remote_code=True lets transformers load the custom model code shipped with the checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16, trust_remote_code=True)
model.eval()
model.to("cuda:0")

input_text = "Suzhou is famous of"
inputs = tokenizer(input_text, return_tensors="pt")
inputs = inputs.to("cuda:0")

# Greedy decoding (do_sample=False); max_length=50 counts the prompt tokens too.
pred = model.generate(**inputs, max_length=50, do_sample=False)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
# Suzhou is famous of its beautiful gardens. The most famous one is the Humble Administrator's Garden. It is a classical Chinese garden with a history of more than 600 years. The garden is divided into three

📊 Performance

Model	#Activated Experts	#Experts	#Activated Params	Links
LLaMA-MoE-3.0B	2	16	3.0B	[🤗 HF Weights]
LLaMA-MoE-3.5B (4/16)	4	16	3.5B	[🤗 HF Weights]
LLaMA-MoE-3.5B (2/8)	2	8	3.5B	[🤗 HF Weights]
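
The activated-parameter figures in this table can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below assumes the base model has the LLaMA-2-7B dimensions (hidden size 4096, FFN intermediate size 11008, 32 layers, vocabulary 32000), that the FFN neurons are split evenly across experts, and that the small router and normalization weights can be ignored. The function name `param_counts` is illustrative.

```python
# Assumed LLaMA-2-7B dimensions for the base model.
HIDDEN = 4096
INTERMEDIATE = 11008
LAYERS = 32
VOCAB = 32000


def param_counts(num_experts, num_activated):
    """Return (total, activated) parameter counts, ignoring router and norm weights."""
    attn = 4 * HIDDEN * HIDDEN           # q, k, v, o projections per layer
    ffn = 3 * HIDDEN * INTERMEDIATE      # gate, up, down projections per layer
    embed = 2 * VOCAB * HIDDEN           # input embeddings + LM head
    total = LAYERS * (attn + ffn) + embed
    active_ffn = ffn * num_activated // num_experts  # even split across experts
    activated = LAYERS * (attn + active_ffn) + embed
    return total, activated


configs = {
    "LLaMA-MoE-3.0B": (16, 2),
    "LLaMA-MoE-3.5B (4/16)": (16, 4),
    "LLaMA-MoE-3.5B (2/8)": (8, 2),
}

for name, (num_experts, num_activated) in configs.items():
    total, activated = param_counts(num_experts, num_activated)
    print(f"{name}: total {total / 1e9:.1f}B, activated {activated / 1e9:.1f}B")
```

Under these assumptions each configuration comes out at about 6.7B total parameters, with about 3.0B, 3.5B, and 3.5B activated respectively, in line with the table.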

Model	SciQ	PIQA	WinoGrande	ARC-e	ARC-c (25)	HellaSwag (10)	LogiQA	BoolQ (32)	LAMBADA	NQ (32)	MMLU (5)	Average
OPT-2.7B	78.9	74.8	60.8	54.4	34.0	61.4	25.8	63.3	63.6	10.7	25.8	50.3
Pythia-2.8B	83.2	73.6	59.6	58.8	36.7	60.7	28.1	65.9	64.6	8.7	26.8	51.5
INCITE-BASE-3B	85.6	73.9	63.5	61.7	40.3	64.7	27.5	65.8	65.4	15.2	27.2	53.7
Open-LLaMA-3B-v2	88.0	77.9	63.1	63.3	40.1	71.4	28.1	69.2	67.4	16.0	26.8	55.6
Sheared-LLaMA-2.7B	87.5	76.9	65.0	63.3	41.6	71.0	28.3	73.6	68.3	17.6	27.3	56.4
LLaMA-MoE-3.0B	84.2	77.5	63.6	60.2	40.9	70.8	30.6	71.9	66.6	17.0	26.8	55.5
LLaMA-MoE-3.5B (4/16)	87.6	77.9	65.5	65.6	44.2	73.3	29.7	75.0	69.5	20.3	26.8	57.7
LLaMA-MoE-3.5B (2/8)	88.4	77.6	66.7	65.3	43.1	73.3	29.6	73.9	69.4	19.8	27.0	57.6

Numbers in parentheses after a benchmark name are the number of few-shot examples used for that benchmark.

📖 Details

Training Data: 200B tokens from SlimPajama with the same data sampling weights as Sheared LLaMA.

📃 Citation

@article{llama-moe,
  title={LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training},
  author={Tong Zhu and Xiaoye Qu and Daize Dong and Jiacheng Ruan and Jingqi Tong and Conghui He and Yu Cheng},
  journal={arXiv preprint arXiv:2406.16554},
  year={2024},
  url={https://arxiv.org/abs/2406.16554},
}