bikmish/llm-course-hw2-dpo

A 135M parameter language model aligned using Direct Preference Optimization on emotional/conversational preference data.

Model Details

Architecture: GPT-style transformer
Base Model: SmoLLM-135M-Instruct
Alignment Method: DPO (β=1.0)
Training Epochs: 1
Batch Size: 2
Learning Rate: 5e-5
Context Window: 1024 tokens

Key Features

🎭 Emotionally Aware: Generates responses with increased emotional expressiveness (+38% emoji usage vs base model)
💬 Conversational Flow: Reduced reliance on template phrases like "As an AI..."
🛡️ Safety Preservation: Maintains base model's harmlessness constraints through reference model regularization
⚡ Efficient Alignment: Achieved with single-epoch training on 10k preference pairs

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("bikmish/llm-course-hw2-dpo")
tokenizer = AutoTokenizer.from_pretrained("bikmish/llm-course-hw2-dpo")

messages = [
    {"role": "user", "content": "Just saw an amazing movie!"},
    {"role": "assistant", "content": "Oh cool! What's it about? 😃"}  # DPO-style response
]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))