bikmish/llm-course-hw2-dpo

A 135M parameter language model aligned using Direct Preference Optimization on emotional/conversational preference data.

Model Details

  • Architecture: GPT-style transformer
  • Base Model: SmoLLM-135M-Instruct
  • Alignment Method: DPO (ฮฒ=1.0)
  • Training Epochs: 1
  • Batch Size: 2
  • Learning Rate: 5e-5
  • Context Window: 1024 tokens

Key Features

  • ๐ŸŽญ Emotionally Aware: Generates responses with increased emotional expressiveness (+38% emoji usage vs base model)
  • ๐Ÿ’ฌ Conversational Flow: Reduced reliance on template phrases like "As an AI..."
  • ๐Ÿ›ก๏ธ Safety Preservation: Maintains base model's harmlessness constraints through reference model regularization
  • โšก Efficient Alignment: Achieved with single-epoch training on 10k preference pairs

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("bikmish/llm-course-hw2-dpo")
tokenizer = AutoTokenizer.from_pretrained("bikmish/llm-course-hw2-dpo")

messages = [
    {"role": "user", "content": "Just saw an amazing movie!"},
    {"role": "assistant", "content": "Oh cool! What's it about? ๐Ÿ˜ƒ"}  # DPO-style response
]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Downloads last month
5
Safetensors
Model size
135M params
Tensor type
F32
ยท
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support