---
{}
---
# Llama-3.1-8B-Instruct-abliterated
This is an abliterated version of Meta's Llama-3.1-8B-Instruct model, modified to reduce harmful outputs while maintaining general performance.
## Model Description
This model uses activation-based ablation techniques to modify the model's behavior regarding potentially harmful content. The technique involves:
1. Identifying activation directions that differentiate between harmful and harmless responses
2. Orthogonalizing the model's weights with respect to these directions
3. Modifying specific layers to reduce the model's tendency to generate harmful content
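Step 1 above can be sketched with a difference of means, a common way to find such a direction. The model card does not say which estimator was actually used, so treat this as an assumption. The helper name `mean_difference_direction` and the toy activations are hypothetical; in practice the arrays would hold residual-stream activations collected from the model on the two prompt sets.

```python
import numpy as np


def mean_difference_direction(harmful_acts: np.ndarray,
                              harmless_acts: np.ndarray) -> np.ndarray:
    """Unit vector from the mean harmless activation to the mean harmful one.

    Both inputs have shape [n_prompts, d_model].
    ASSUMPTION: difference of means is an illustrative choice, not
    necessarily the estimator used for this model.
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


# Toy stand-ins for real activations (hypothetical data): the "harmful"
# set is shifted along axis 0 of the residual stream.
rng = np.random.default_rng(0)
d_model = 16
shift = 3.0 * np.eye(d_model)[0]
harmful_acts = rng.normal(size=(64, d_model)) + shift
harmless_acts = rng.normal(size=(64, d_model))

direction = mean_difference_direction(harmful_acts, harmless_acts)

# Projections onto the direction separate the two sets on average.
harmful_score = (harmful_acts @ direction).mean()
harmless_score = (harmless_acts @ direction).mean()
```

The resulting unit vector is what steps 2 and 3 remove from the weights that write into the residual stream.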
### Model Details
- **Base Model**: meta-llama/Llama-3.1-8B-Instruct
- **Modified Components**:
  - Embedding layer (W_E)
  - Attention output layers (W_O)
  - MLP output layers (W_out)
- **Training Method**: No additional training; the modifications were made through geometric interventions on the model weights
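As a minimal sketch of what such a geometric intervention can look like, the snippet below projects a direction out of three toy matrices named after the components listed above. It assumes a layout in which the last axis of each matrix is the residual-stream dimension (the convention the names W_E, W_O, and W_out suggest). The helper `orthogonalize`, the shapes, and the random placeholder direction are all hypothetical.

```python
import numpy as np


def orthogonalize(weight: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component along `direction` from the last axis of `weight`.

    ASSUMPTION: the last axis of `weight` is the residual-stream dimension.
    """
    unit = direction / np.linalg.norm(direction)
    # (weight @ unit) is each row's coefficient along the direction;
    # subtracting coefficient * unit leaves every row orthogonal to it.
    return weight - (weight @ unit)[..., None] * unit


rng = np.random.default_rng(0)
d_vocab, d_model, n_heads, d_head, d_mlp = 50, 16, 4, 4, 32

# Toy weights (hypothetical shapes), one per modified component.
weights = {
    "W_E": rng.normal(size=(d_vocab, d_model)),
    "W_O": rng.normal(size=(n_heads, d_head, d_model)),
    "W_out": rng.normal(size=(d_mlp, d_model)),
}

# Placeholder direction; a real one would come from model activations.
direction = rng.normal(size=d_model)

ablated = {name: orthogonalize(w, direction) for name, w in weights.items()}
```

After this step, none of the three matrices can write anything along the chosen direction, and no gradient updates are involved. Note that PyTorch `nn.Linear` modules store weights as `[out_features, in_features]`, so with that layout the residual-stream axis of an output projection is the first axis and the projection would need to be applied on that side instead.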
## Intended Uses
This model is intended for:
- General text generation and conversation
- Question answering
- Task completion
- Instruction following
In all of these uses, the model is intended to offer improved safety characteristics compared to the base model.
## Limitations
- The abliteration process may affect some legitimate use cases
- The model's behavior modifications are based on specific harmful/harmless datasets
- Performance on certain tasks may differ from the original model
## Training Data
No additional training was performed. The weight modifications were guided by two instruction datasets:
- Harmful instructions dataset: mlabonne/harmful_behaviors
- Harmless instructions dataset: mlabonne/harmless_alpaca
## Ethical Considerations
This model aims to reduce potentially harmful outputs while maintaining functionality. However, users should:
- Still implement appropriate content filtering
- Monitor outputs for unexpected behavior
- Use the model responsibly and in accordance with applicable laws and ethical guidelines
## Citation
If you use this model, please cite:
```
@misc{llama-3.1-8b-instruct-abliterated,
author = {[Your Name]},
title = {Llama-3.1-8B-Instruct-abliterated},
year = {2024},
publisher = {Hugging Face},
journal = {Hugging Face Model Hub},
}
```