chuanguo/TamedLlama-70B-Instruct

TamedLlama-70B-Instruct

This repository hosts TamedLlama-70B-Instruct, a variant of Llama-3.3-70B-Instruct fine-tuned to be robust against prompt injection attacks. See our TamedLlama paper for more information.

We also release a smaller model, TamedLlama-8B-Instruct, fine-tuned from Llama-3-8B-Instruct, for use in resource-constrained settings.

Utility Evaluation (higher is better)

| Category | Benchmark | Metric | Llama 3.3 70B Instruct | TamedLlama 70B Instruct | GPT-4o-mini | GPT-4o (2024-11-20) |
|---|---|---|---|---|---|---|
| General Knowledge | MMLU (0-shot, CoT) | macro_avg/acc | 86.2 | 85.0 | 82.0^[1] | 85.7^[2] |
| | MMLU Pro (5-shot, CoT) | macro_avg/acc | 67.8 | 67.1 | 63.1^[3] | 77.9^[3] |
| | IFEval | | 91.1 | 86.4 | - | - |
| | BBH (3-shot, CoT) | acc | 86.2 | 85.1 | - | - |
| | GPQA (0-shot, CoT) | acc | 62.3 | 58.5 | 40.2^[1] | 46.0^[2] |
| Instruction Following | AlpacaEval2 | win_rate | 44.8 | 43.3 | 44.7 | 56.2 |
| | SEP | win_rate | 64.9 | 62.5 | 65.9 | 64.9 |
| Agentic Workflows | AgentDojo (w/o attack) | success_rate | 56.7 | 72.2 | 67.0 | 79.4 |
| | AgentDojo (w/ attack) | success_rate | 39.0 | 64.3 | 51.6 | 67.4 |
| | WASP | success_rate | 48.6 | 51.4 | 27.0 | 32.4 |

Security Evaluation (lower is better)

| Category | Benchmark | Metric | Llama 3.3 70B Instruct | TamedLlama 70B Instruct | GPT-4o-mini | GPT-4o (2024-11-20) |
|---|---|---|---|---|---|---|
| Instruction Following | AlpacaFarm | ASR | 94.2 | 0.0 | 0.5 | 0.0 |
| | SEP (start) | ASR | 68.3 | 5.0 | 14.6 | 14.8 |
| | SEP (end) | ASR | 87.1 | 2.5 | 9.1 | 14.4 |
| | TaskTracker | ASR | 21.9 | 0.2 | 0.3 | 0.6 |
| | CyberSecEval2 | ASR | 52.7 | 7.2 | 25.5 | 20.0 |
| Agentic Workflows | InjecAgent (base) | ASR-total | 21.7 | 1.3 | 0.9 | 18.2 |
| | InjecAgent (enhanced) | ASR-total | 50.6 | 2.8 | 3.3 | 22.7 |
| | AgentDojo | ASR | 14.1 | 1.3 | 11.9 | 20.4 |
| | WASP (intermediate) | ASR | 25.0 | 2.4 | 53.6 | 17.9 |
| | WASP (end2end) | ASR | 4.8 | 1.2 | 0.0 | 2.4 |
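
To make the security numbers easier to compare, the following minimal sketch computes the per-benchmark drop in ASR between Llama 3.3 70B Instruct and TamedLlama 70B Instruct. It assumes that ASR denotes attack success rate in percent, as is conventional for these benchmarks. The values are copied from the table above; the names `ASR` and `asr_reduction` are hypothetical and are not part of any released code.

```python
# ASR values (percent) copied from the Security Evaluation table.
# Assumption: ASR means attack success rate, so lower is better.
# Each entry maps benchmark -> (Llama 3.3 70B Instruct, TamedLlama 70B Instruct).
ASR = {
    "AlpacaFarm": (94.2, 0.0),
    "SEP (start)": (68.3, 5.0),
    "SEP (end)": (87.1, 2.5),
    "TaskTracker": (21.9, 0.2),
    "CyberSecEval2": (52.7, 7.2),
    "InjecAgent (base)": (21.7, 1.3),
    "InjecAgent (enhanced)": (50.6, 2.8),
    "AgentDojo": (14.1, 1.3),
    "WASP (intermediate)": (25.0, 2.4),
    "WASP (end2end)": (4.8, 1.2),
}


def asr_reduction(table):
    """Return the absolute ASR drop, in percentage points, for each benchmark."""
    return {
        name: round(base - tamed, 1)
        for name, (base, tamed) in table.items()
    }


if __name__ == "__main__":
    # Print one line per benchmark, e.g. the drop for AlpacaFarm.
    for name, drop in asr_reduction(ASR).items():
        print(f"{name:<22} -{drop:.1f} points")
```

The drop is reported in absolute percentage points rather than as a ratio, because a relative reduction is not meaningful when the fine-tuned ASR is 0.0.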
