phi2-memory-deeptalks
A LoRA adapter for the Phi-2 language model, fine-tuned on short conversational snippets to provide short-term memory in dialogue. This adapter enables your assistant to recall and leverage the last few user/assistant turns without full fine-tuning of the 2.7B-parameter base model.
Live Demo on Hugging Face Spaces
Note: responses take a while to generate since the demo runs on the CPU free tier.
Overview
phi2-memory-deeptalks injects lightweight, low-rank corrections into the attention and MLP layers of microsoft/phi-2.
- Size: ~6M trainable parameters (≈0.2% of the base model)
- Base: Phi-2 (2.7B parameters)
- Adapter: Low-Rank Adaptation (LoRA) via the PEFT library
Model Details
Architecture & Adapter Configuration
- Base model: `microsoft/phi-2` (causal LM)
- LoRA rank (r): 4
- Modules wrapped:
  - Attention projections: `q_proj`, `k_proj`, `v_proj`, `dense`
  - MLP layers: `fc1`, `fc2`
- LoRA hyperparameters: `lora_alpha`: 32, `lora_dropout`: 0.05
- Trainable params: ~5.9M (see the config sketch below)
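These settings map onto a standard PEFT `LoraConfig`. A minimal sketch follows; `task_type` is an assumption, since the card doesn't state it. The comments also sanity-check the ~5.9M figure against Phi-2's published dimensions (32 layers, hidden size 2560, MLP width 10240):

```python
from peft import LoraConfig

# A sketch of the adapter configuration described above.
# task_type is an assumption; the card doesn't state it explicitly.
lora_config = LoraConfig(
    r=4,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "dense", "fc1", "fc2"],
    task_type="CAUSAL_LM",
)

# Sanity check of the ~5.9M figure: each wrapped Linear(in, out) adds
# r * (in + out) parameters. For Phi-2 (32 layers, hidden 2560, MLP 10240):
#   q/k/v/dense: 4 modules * 4 * (2560 + 2560) =  81,920 per layer
#   fc1 + fc2:   2 modules * 4 * (2560 + 10240) = 102,400 per layer
#   total: 32 * (81,920 + 102,400) = 5,898,240 ≈ 5.9M trainable parameters
```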
Training Data & Preprocessing
- Dataset: HyperThink-Mini 50K (7% used)
- Prompt format: `### Human: <user message> ### Assistant: <assistant response>`
- Tokenization: truncated/padded to 256 tokens; `labels = input_ids`
- Optimizer: AdamW (PyTorch), FP16 on GPU
- Batching: `per_device_train_batch_size=1` + `gradient_accumulation_steps=8` (effective batch size 8; see the sketch after this list)
- Epochs: 3
- Checkpointing: save every 500 steps; final adapter weights in `adapter_model.safetensors`
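A minimal sketch of how these choices translate to code, assuming a `transformers` `Trainer` setup; the dataset field names and `output_dir` are placeholders, not taken from the card:

```python
from transformers import AutoTokenizer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
tokenizer.pad_token = tokenizer.eos_token  # Phi-2 ships without a pad token

def preprocess(example):
    # Field names 'prompt'/'response' are assumptions; adapt to the dataset schema.
    text = f"### Human: {example['prompt']} ### Assistant: {example['response']}"
    enc = tokenizer(text, truncation=True, padding="max_length", max_length=256)
    enc["labels"] = enc["input_ids"].copy()  # labels = input_ids (causal LM loss)
    return enc

args = TrainingArguments(
    output_dir="phi2-memory-deeptalks",  # placeholder
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,       # effective batch size of 8
    num_train_epochs=3,
    fp16=True,                           # FP16 on GPU
    save_steps=500,                      # checkpoint every 500 steps
    optim="adamw_torch",                 # AdamW (PyTorch)
)
```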
Evaluation
- Training loss (step 500): ~1.08
- Validation loss: ~1.10
- Qualitative:
  - Improved recall of the last 2–4 turns in dialogue
  - Maintains base Phi-2 fluency on general language
Usage
Load the adapter into your Phi-2 model with just a few lines:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# 1) Load the base model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # Phi-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

# 2) Apply the LoRA adapter (PeftModel.from_pretrained takes the adapter repo id)
model = PeftModel.from_pretrained(model, "sourize/phi2-memory-deeptalks")

# 3) (Optional) Resize embeddings if tokens were added
model.base_model.resize_token_embeddings(len(tokenizer))

# 4) Generate
prompt = "### Human:\nHello, how are you?\n\n### Assistant:"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
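Because the adapter targets short-term recall, multi-turn use is just a matter of concatenating the most recent turns in the same prompt format. A sketch continuing from the snippet above (the history contents are illustrative):

```python
# Keep only the last few turns; the adapter is tuned for 2-4 turn recall.
history = [
    ("My name is Ada and I love astronomy.",
     "Nice to meet you, Ada! What do you enjoy observing?"),
]
prompt = ""
for user, assistant in history[-4:]:
    prompt += f"### Human:\n{user}\n\n### Assistant:\n{assistant}\n\n"
prompt += "### Human:\nWhat did I say my name was?\n\n### Assistant:"

inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
# Strip the prompt so only the new assistant turn is printed
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```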
Inference & Deployment
- Preferred: GPU (NVIDIA CUDA) for sub-second latency
- CPU-only: ~7–10 min per response (large model!)
- Hugging Face Inference API:
```bash
curl -X POST \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "Content-Type: application/json" \
  https://api-inference.huggingface.co/pipeline/text-generation/sourize/phi2-memory-deeptalks \
  -d '{
        "inputs": "Hello, how are you?",
        "parameters": {
          "max_new_tokens": 64,
          "do_sample": true,
          "temperature": 0.7,
          "top_p": 0.9,
          "return_full_text": false
        }
      }'
```
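The same call from Python via the `huggingface_hub` client, as a sketch (it assumes the model is reachable through the Inference API; the token is a placeholder):

```python
from huggingface_hub import InferenceClient

client = InferenceClient(model="sourize/phi2-memory-deeptalks", token="hf_...")  # your HF token
text = client.text_generation(
    "Hello, how are you?",
    max_new_tokens=64,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    return_full_text=False,
)
print(text)
```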
Use Cases & Limitations
- Ideal for:
  - Short back-and-forth chats (2–4 turns)
  - Chatbots that need to "remember" very recent context
- Not suited for:
  - Long-term memory or document-level retrieval
  - High-volume production on CPU (too slow)
Further Reading
- Live Demo: DeepTalks Space
- Blog post (coming soon): Add link here
- PEFT & LoRA: PEFT GitHub | LoRA Paper
Citation
```bibtex
@misc{sourize_phi2_memory_deeptalks,
  title        = {phi2-memory-deeptalks: LoRA adapter for Phi-2 with short-term conversational memory},
  author       = {Sourish},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/sourize/phi2-memory-deeptalks}},
  license      = {MIT}
}
```

Questions or feedback? Please open an issue on the repository.