Llama-3.1-8B-Instruct-abliterated
This is an abliterated version of Meta's Llama-3.1-8B-Instruct model, modified to reduce harmful outputs while maintaining general performance.
Model Description
This model was produced by applying activation-based ablation to the base model's weights, changing how it responds to potentially harmful content. The technique involves:
- Identifying activation directions that differentiate between harmful and harmless responses
- Orthogonalizing the model's weights with respect to these directions
- Modifying specific layers to reduce the model's tendency to generate harmful content
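The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the procedure actually used for this model: it assumes the direction is the normalized difference of mean activations between the two prompt sets (the card does not say how the direction was computed), it uses synthetic activations in place of real ones, and the function names are hypothetical.

```python
import numpy as np

def difference_of_means_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean.

    Both inputs have shape (n_prompts, d_model). Using the difference of
    means is an assumption made for this sketch.
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def orthogonalize_output(weight, direction):
    """Remove `direction` from everything `weight` can write.

    Assumes `weight` has shape (d_model, d_in) and is applied as
    y = weight @ x, so the projection acts on the output side:
    W' = W - d (d^T W).
    """
    return weight - np.outer(direction, direction @ weight)

rng = np.random.default_rng(0)
d_model = 16

# Synthetic stand-ins for activations collected at one layer.
harmless = rng.normal(size=(32, d_model))
harmful = rng.normal(size=(32, d_model)) + 2.0 * np.eye(d_model)[0]

direction = difference_of_means_direction(harmful, harmless)

W = rng.normal(size=(d_model, 8))
W_ablated = orthogonalize_output(W, direction)

# Largest remaining component of any output along the direction.
max_leak = np.abs(direction @ W_ablated).max()
print(f"largest remaining component along the direction: {max_leak:.2e}")
```

After orthogonalization, no input to `W_ablated` can produce an output with a component along `direction`, which is why no retraining is needed to change the behavior.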
Model Details
- Base Model: meta-llama/Llama-3.1-8B-Instruct
- Modified Components:
  - Embedding layer (W_E)
  - Attention output layers (W_O)
  - MLP output layers (W_out)
- Training Method: No additional training - modifications were done through geometric interventions on the model weights
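The listed components do not all store the residual-stream dimension on the same axis, so the projection has to be applied on the matching side. The sketch below illustrates this with toy matrices. It assumes the layout used by PyTorch-style checkpoints, where the embedding table has shape (vocab, d_model) and linear weights have shape (out_features, in_features). The dictionary keys reuse the names from the list above, and the helper names and sizes are hypothetical.

```python
import numpy as np

def remove_from_rows(matrix, direction):
    """For matrices whose rows are residual-stream vectors,
    e.g. an embedding table of shape (vocab, d_model)."""
    return matrix - np.outer(matrix @ direction, direction)

def remove_from_outputs(matrix, direction):
    """For matrices of shape (d_model, d_in) applied as y = matrix @ x,
    e.g. attention or MLP output projections."""
    return matrix - np.outer(direction, direction @ matrix)

rng = np.random.default_rng(0)
vocab, d_model, d_mlp = 50, 8, 32

# A unit direction in the residual stream (random here, for illustration).
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)

# Toy stand-ins for the three kinds of modified weights.
toy_weights = {
    "W_E": rng.normal(size=(vocab, d_model)),
    "W_O": rng.normal(size=(d_model, d_model)),
    "W_out": rng.normal(size=(d_model, d_mlp)),
}

ablated = {
    "W_E": remove_from_rows(toy_weights["W_E"], direction),
    "W_O": remove_from_outputs(toy_weights["W_O"], direction),
    "W_out": remove_from_outputs(toy_weights["W_out"], direction),
}

# Component along the direction that each matrix can still write.
leak = {
    "W_E": np.abs(ablated["W_E"] @ direction).max(),
    "W_O": np.abs(direction @ ablated["W_O"]).max(),
    "W_out": np.abs(direction @ ablated["W_out"]).max(),
}
```

In a real checkpoint the same two helpers would be applied to every layer's attention and MLP output weights. The shapes are unchanged, so the modified weights load into the original architecture.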
Intended Uses
This model is intended for:
- General text generation and conversation
- Question answering
- Task completion
- Instruction following
In all of these uses, the model is meant to retain improved safety characteristics compared to the base model.
Limitations
- The abliteration process may affect some legitimate use cases
- The behavior modifications are derived from specific harmful/harmless datasets, so they may not generalize to prompts unlike those in the datasets
- Performance on certain tasks may differ from the original model
Training Data
The model modifications were guided using:
- Harmful instructions dataset: mlabonne/harmful_behaviors
- Harmless instructions dataset: mlabonne/harmless_alpaca
Ethical Considerations
This model aims to reduce potentially harmful outputs while maintaining functionality. However, users should:
- Still implement appropriate content filtering
- Monitor outputs for unexpected behavior
- Use the model responsibly and in accordance with applicable laws and ethical guidelines
Citation
If you use this model, please cite:
@misc{llama-3.1-8b-instruct-abliterated,
author = {[Your Name]},
title = {Llama-3.1-8B-Instruct-abliterated},
year = {2024},
publisher = {Hugging Face},
journal = {Hugging Face Model Hub},
}