bikmish/llm-course-hw2-dpo
A 135M-parameter language model aligned with Direct Preference Optimization (DPO) on emotional/conversational preference data.
Model Details
- Architecture: GPT-style transformer
- Base Model: SmolLM-135M-Instruct
- Alignment Method: DPO (β=1.0; see the loss sketch after this list)
- Training Epochs: 1
- Batch Size: 2
- Learning Rate: 5e-5
- Context Window: 1024 tokens
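For reference, the sketch below shows the DPO objective these hyperparameters plug into. It is a minimal illustration of the loss with the listed β, not the actual training code; the function name and dummy values are placeholders.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=1.0):
    """Direct Preference Optimization loss.

    Each argument is the summed log-probability of the chosen/rejected
    response under the policy or the frozen reference model.
    """
    # Implicit rewards: how far the policy has shifted from the reference model
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the log-odds that the chosen response beats the rejected one
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Dummy log-probabilities for a single preference pair
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.1]),
                torch.tensor([-13.0]), torch.tensor([-14.2]))
print(loss.item())
```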
Key Features
- 🎭 Emotionally Aware: Generates responses with greater emotional expressiveness (+38% emoji usage vs. the base model)
- 💬 Conversational Flow: Reduced reliance on template phrases like "As an AI..."
- 🛡️ Safety Preservation: Maintains the base model's harmlessness constraints through reference-model regularization
- ⚡ Efficient Alignment: Achieved with single-epoch training on 10k preference pairs (an illustrative record format is sketched below)
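The preference data follows the standard DPO layout: each record pairs a prompt with a preferred ("chosen") and a dispreferred ("rejected") response. The record below is a hypothetical example for illustration only, not taken from the actual training set.

```python
from datasets import Dataset

# Illustrative preference pair (hypothetical, not from the real dataset)
preference_pairs = [
    {
        "prompt": "Just saw an amazing movie!",
        "chosen": "Oh cool! What's it about? I love a good movie night!",
        "rejected": "As an AI, I do not watch movies, but I can provide information.",
    },
]

train_dataset = Dataset.from_list(preference_pairs)
print(train_dataset[0])
```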
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("bikmish/llm-course-hw2-dpo")
tokenizer = AutoTokenizer.from_pretrained("bikmish/llm-course-hw2-dpo")

messages = [
    {"role": "user", "content": "Just saw an amazing movie!"},
]

# Build the chat prompt and generate; the DPO-aligned model tends to answer in a
# conversational style, e.g. "Oh cool! What's it about?"
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
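To bring out the more expressive, emoji-rich style, sampling usually works better than greedy decoding. Continuing from the snippet above, the parameters below are illustrative defaults, not tuned recommendations.

```python
outputs = model.generate(
    inputs,
    max_new_tokens=100,
    do_sample=True,    # sample instead of greedy decoding
    temperature=0.7,   # illustrative value, not tuned for this model
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```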