Llama-3.1-8B-Instruct-abliterated

This is an abliterated version of Meta's Llama-3.1-8B-Instruct model, modified to reduce harmful outputs while maintaining general performance.

Model Description

This model uses activation-based ablation techniques to modify the model's behavior regarding potentially harmful content. The technique involves:

  1. Identifying activation directions that differentiate between harmful and harmless responses
  2. Orthogonalizing the model's weights with respect to these directions
  3. Modifying specific layers to reduce the model's tendency to generate harmful content
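
The first two steps can be illustrated with a small NumPy sketch. This is not the code used to produce this model: it assumes the direction is taken as the difference of mean activations between the two prompt sets (a common choice, but not confirmed by this card), and the function names, toy dimensions, and random stand-in activations are all hypothetical.

```python
import numpy as np

def mean_difference_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of the two mean activations.

    Both inputs have shape (n_samples, d_model). Using the difference of
    means is an assumption made for this sketch.
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def orthogonalize(weight, direction):
    """Remove the component along `direction` from everything `weight` writes.

    `weight` has shape (d_model, d_in) and maps an input x to weight @ x.
    """
    return weight - np.outer(direction, direction @ weight)

rng = np.random.default_rng(0)
d_model, d_in = 16, 32

# Toy stand-ins for activations collected on harmless and harmful prompts.
shift = np.zeros(d_model)
shift[3] = 2.0
harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + shift

direction = mean_difference_direction(harmful, harmless)

# Toy stand-in for one weight matrix that writes into the hidden state.
W = rng.normal(size=(d_model, d_in))
W_ablated = orthogonalize(W, direction)

# After orthogonalization, no input can produce output along `direction`.
x = rng.normal(size=d_in)
leak = direction @ (W_ablated @ x)
print(abs(leak))  # close to zero, up to floating-point error
```

Because the projection is applied to the weights rather than to activations at run time, the modified model needs no hooks or extra computation during inference.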

Model Details

  • Base Model: meta-llama/Llama-3.1-8B-Instruct
  • Modified Components:
    • Embedding layer (W_E)
    • Attention output layers (W_O)
    • MLP output layers (W_out)
  • Training Method: No additional training; the weights were modified directly through geometric interventions
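
The components listed above store their weights in different orientations, so the projection has to be applied along the correct axis for each. The sketch below shows one way to do this on a toy state dict. It assumes that W_E, W_O and W_out correspond to the `embed_tokens`, `o_proj` and `down_proj` parameters of the Hugging Face Llama implementation, which this card does not state. The helper names and tiny dimensions are hypothetical.

```python
import numpy as np

def ablate_embedding(weight, direction):
    """Embedding table of shape (vocab, d_model): each row is a hidden vector."""
    return weight - np.outer(weight @ direction, direction)

def ablate_output_projection(weight, direction):
    """Linear weight of shape (d_model, d_in): output is weight @ x."""
    return weight - np.outer(direction, direction @ weight)

def ablate_state(state, direction, n_layers):
    """Return a copy of `state` with the three component types orthogonalized."""
    out = dict(state)
    out["model.embed_tokens.weight"] = ablate_embedding(
        state["model.embed_tokens.weight"], direction
    )
    for i in range(n_layers):
        for name in (
            f"model.layers.{i}.self_attn.o_proj.weight",  # assumed W_O
            f"model.layers.{i}.mlp.down_proj.weight",     # assumed W_out
        ):
            out[name] = ablate_output_projection(state[name], direction)
    return out

rng = np.random.default_rng(0)
vocab, d_model, d_mlp, n_layers = 50, 8, 24, 2

# Toy state dict with random weights in place of a real checkpoint.
state = {"model.embed_tokens.weight": rng.normal(size=(vocab, d_model))}
for i in range(n_layers):
    state[f"model.layers.{i}.self_attn.q_proj.weight"] = rng.normal(size=(d_model, d_model))
    state[f"model.layers.{i}.self_attn.o_proj.weight"] = rng.normal(size=(d_model, d_model))
    state[f"model.layers.{i}.mlp.down_proj.weight"] = rng.normal(size=(d_model, d_mlp))

direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)

ablated = ablate_state(state, direction, n_layers)
```

Matrices that only read from the hidden state, such as `q_proj` in this toy example, are left unchanged, which matches the list of modified components above.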

Intended Uses

This model is intended for:

  • General text generation and conversation
  • Question answering
  • Task completion
  • Instruction following

In each of these uses, the model is intended to offer improved safety characteristics compared to the base model.

Limitations

  • The abliteration process may affect some legitimate use cases
  • The behavior modifications are derived from two specific harmful/harmless instruction datasets (see Training Data), so they may not generalize to content unlike those examples
  • Performance on certain tasks may differ from the original model

Training Data

The model modifications were guided using:

  • Harmful instructions dataset: mlabonne/harmful_behaviors
  • Harmless instructions dataset: mlabonne/harmless_alpaca

Ethical Considerations

This model aims to reduce potentially harmful outputs while maintaining functionality. However, users should:

  • Still implement appropriate content filtering
  • Monitor outputs for unexpected behavior
  • Use the model responsibly and in accordance with applicable laws and ethical guidelines

Citation

If you use this model, please cite:

@misc{llama-3.1-8b-instruct-abliterated,
author = {[Your Name]},
title = {Llama-3.1-8B-Instruct-abliterated},
year = {2024},
publisher = {Hugging Face},
journal = {Hugging Face Model Hub},
}