LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training

📢 A SMALLER AFFORDABLE MoE MODEL FOR EVERYONE!!

LLaMA-MoE is a series of open-sourced Mixture-of-Expert (MoE) models based on LLaMA and SlimPajama. We build LLaMA-MoE with the following two steps:

Partition LLaMA's FFNs into sparse experts and insert top-K gate for each layer of experts.
Continually pre-train the initialized MoE model with an optimized data sampling weights from Sheared LLaMA and filtered datasets from SlimPajama.

The number of activated model parameters is only 3.0~3.5B, which is friendly for deployment and research usage. Please refer to our technical report for more details.

Model	#Activated Experts	#Experts	#Activated Params	Links
LLaMA-MoE-3.0B	2	16	3.0B	[🤗 HF Weights]
LLaMA-MoE-3.5B (4/16)	4	16	3.5B	[🤗 HF Weights]
LLaMA-MoE-3.5B (2/8)	2	8	3.5B	[🤗 HF Weights]

Model	Average	SciQ	PIQA	WinoGrande	ARC-e	ARC-c (25)	HellaSwag (10)	LogiQA	BoolQ (32)	LAMBADA	NQ (32)	MMLU (5)
OPT-2.7B	50.3	78.9	74.8	60.8	54.4	34.0	61.4	25.8	63.3	63.6	10.7	25.8
Pythia-2.8B	51.5	83.2	73.6	59.6	58.8	36.7	60.7	28.1	65.9	64.6	8.7	26.8
INCITE-BASE-3B	53.7	85.6	73.9	63.5	61.7	40.3	64.7	27.5	65.8	65.4	15.2	27.2
Open-LLaMA-3B-v2	55.6	88.0	77.9	63.1	63.3	40.1	71.4	28.1	69.2	67.4	16.0	26.8
Sheared-LLaMA-2.7B	56.4	87.5	76.9	65.0	63.3	41.6	71.0	28.3	73.6	68.3	17.6	27.3
LLaMA-MoE-3.0B	55.5	84.2	77.5	63.6	60.2	40.9	70.8	30.6	71.9	66.6	17.0	26.8
LLaMA-MoE-3.5B (4/16)	57.7	87.6	77.9	65.5	65.6	44.2	73.3	29.7	75.0	69.5	20.3	26.8
LLaMA-MoE-3.5B (2/8)	57.6	88.4	77.6	66.7	65.3	43.1	73.3	29.6	73.9	69.4	19.8	27.0

models 8

datasets 0

None public yet

AI & ML interests

Recent Activity

Team members 6

LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training

models 8 Sort: Recently updated

datasets 0

models 8