thisnick
/

Llama-3.1-8B-Instruct-abliterated

Model card Files Files and versions

Llama-3.1-8B-Instruct-abliterated / README.md

thisnick's picture

Upload LlamaForCausalLM

6269a02 verified 9 months ago

|

history blame contribute delete

2.14 kB

	---
	{}
	---
	# Llama-3.1-8B-Instruct-abliterated

	This is an abliterated version of Meta's Llama-3.1-8B-Instruct model, modified to reduce harmful outputs while maintaining general performance.

	## Model Description

	This model uses activation-based ablation techniques to modify the model's behavior regarding potentially harmful content. The technique involves:

	1. Identifying activation directions that differentiate between harmful and harmless responses
	2. Orthogonalizing the model's weights with respect to these directions
	3. Modifying specific layers to reduce the model's tendency to generate harmful content

	### Model Details
	- Base Model: meta-llama/Llama-3.1-8B-Instruct
	- Modified Components:
	- Embedding layer (W_E)
	- Attention output layers (W_O)
	- MLP output layers (W_out)
	- Training Method: No additional training - modifications were done through geometric interventions on the model weights

	## Intended Uses

	This model is intended for:
	- General text generation and conversation
	- Question answering
	- Task completion
	- Instruction following

	While maintaining improved safety characteristics compared to the base model.

	## Limitations

	- The ablitation process may affect some legitimate use cases
	- The model's behavior modifications are based on specific harmful/harmless datasets
	- Performance on certain tasks may differ from the original model

	## Training Data

	The model modifications were guided using:
	- Harmful instructions dataset: mlabonne/harmful_behaviors
	- Harmless instructions dataset: mlabonne/harmless_alpaca

	## Ethical Considerations

	This model aims to reduce potentially harmful outputs while maintaining functionality. However, users should:
	- Still implement appropriate content filtering
	- Monitor outputs for unexpected behavior
	- Use the model responsibly and in accordance with applicable laws and ethical guidelines

	## Citation

	If you use this model, please cite:
	```
	@misc{llama-3.1-8b-instruct-abliterated,
	author = {[Your Name]},
	title = {Llama-3.1-8B-Instruct-abliterated},
	year = {2024},
	publisher = {Hugging Face},
	journal = {Hugging Face Model Hub},
	}
	```