---
{}
---

# Llama-3.1-8B-Instruct-abliterated

This is an abliterated version of Meta's Llama-3.1-8B-Instruct model, modified to reduce harmful outputs while maintaining general performance.

## Model Description

This model uses activation-based ablation techniques to modify the model's behavior regarding potentially harmful content. The technique involves:

1. Identifying activation directions that differentiate between harmful and harmless responses
2. Orthogonalizing the model's weights with respect to these directions
3. Modifying specific layers to reduce the model's tendency to generate harmful content

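
Step 1 can be illustrated on toy data. The sketch below assumes the direction is the normalized difference between the mean activations collected on harmful and on harmless instructions, which is a common choice for this kind of ablation but is not confirmed by this card. The function names and the 3-dimensional vectors are hypothetical stand-ins for real hidden-state activations.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def unit_direction(harmful_acts, harmless_acts):
    """Normalized difference of mean activations (assumed direction choice)."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("mean activations are identical; no direction to extract")
    return [x / norm for x in diff]


def remove_component(vector, direction):
    """Subtract the projection of `vector` onto the unit-length `direction`."""
    dot = sum(v * d for v, d in zip(vector, direction))
    return [v - dot * d for v, d in zip(vector, direction)]


# Toy activations (hypothetical): the two groups differ only along the first axis.
harmful_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

direction = unit_direction(harmful_acts, harmless_acts)
cleaned = remove_component([5.0, 2.0, 1.0], direction)
```

After `remove_component`, the vector has no component left along `direction`, while its other coordinates are unchanged.
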

### Model Details

- **Base Model**: meta-llama/Llama-3.1-8B-Instruct
- **Modified Components**:
  - Embedding layer (W_E)
  - Attention output layers (W_O)
  - MLP output layers (W_out)
- **Training Method**: No additional training; modifications were made through geometric interventions on the model weights

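
One way to read "geometric intervention" is as a projection folded directly into a weight matrix. The sketch below assumes a matrix `W` that maps its input to an output vector as `y = W x`, and a unit-length direction `r` in the output space; it computes `W' = W - r (r^T W)`, so that every output of `W'` is orthogonal to `r`. The helper names and the tiny 3x2 matrix are hypothetical, and the storage layout in an actual checkpoint may differ and need transposing accordingly.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def orthogonalize(W, r):
    """Return W' = W - r (r^T W) for a unit-length direction r.

    W has one row per output dimension, so len(r) must equal len(W).
    """
    n_out, n_in = len(W), len(W[0])
    # r^T W: how strongly each input column writes along r.
    r_t_w = [sum(r[i] * W[i][j] for i in range(n_out)) for j in range(n_in)]
    return [[W[i][j] - r[i] * r_t_w[j] for j in range(n_in)] for i in range(n_out)]


# Toy example (hypothetical values): r is unit length because 0.6^2 + 0.8^2 = 1.
r = [0.6, 0.8, 0.0]
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

W_ablated = orthogonalize(W, r)
y = matvec(W_ablated, [1.0, -2.0])
overlap = sum(ri * yi for ri, yi in zip(r, y))  # component of the output along r
```

Because the projection is applied to the weights themselves, a model edited this way would not need any extra hook at inference time.
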

## Intended Uses

This model is intended for:

- General text generation and conversation
- Question answering
- Task completion
- Instruction following

The goal is to support these uses while maintaining improved safety characteristics compared to the base model.

## Limitations

- The abliteration process may affect some legitimate use cases
- The model's behavior modifications are based on specific harmful/harmless datasets and may not generalize beyond them
- Performance on certain tasks may differ from the original model

## Training Data

The model modifications were guided using:

- Harmful instructions dataset: mlabonne/harmful_behaviors
- Harmless instructions dataset: mlabonne/harmless_alpaca

## Ethical Considerations

This model aims to reduce potentially harmful outputs while maintaining functionality. However, users should still:

- Implement appropriate content filtering
- Monitor outputs for unexpected behavior
- Use the model responsibly and in accordance with applicable laws and ethical guidelines

## Citation

If you use this model, please cite:

```
@misc{llama-3.1-8b-instruct-abliterated,
  author = {[Your Name]},
  title = {Llama-3.1-8B-Instruct-abliterated},
  year = {2024},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
}
```