phi2-memory-deeptalks
A LoRA adapter for the Phi-2 language model, fine-tuned on short conversational snippets to provide short-term memory in dialogue. This adapter enables your assistant to recall and leverage the last few user/assistant turns without full fine-tuning of the 2.7B-parameter base model.
Live Demo on Hugging Face Spaces
Note: responses take a while to generate since the demo runs on the CPU free tier.
Overview
phi2-memory-deeptalks injects lightweight, low-rank corrections into the attention and MLP layers of microsoft/phi-2.
- Size: ~6M trainable parameters (≈0.2% of the base model)
- Base: Phi-2 (2.7B parameters)
- Adapter: Low-Rank Adaptation (LoRA) via the PEFT library
Model Details
Architecture & Adapter Configuration
- Base model: `microsoft/phi-2` (causal LM)
- LoRA rank (r): 4
- Modules wrapped:
  - Attention projections: `q_proj`, `k_proj`, `v_proj`, `dense`
  - MLP layers: `fc1`, `fc2`
- LoRA hyperparameters: `lora_alpha`: 32, `lora_dropout`: 0.05
- Trainable params: ~5.9M (see the config sketch below)
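These settings map onto a standard PEFT `LoraConfig`. A minimal sketch follows; `task_type` is an assumption, since the card doesn't state it. The comments also sanity-check the ~5.9M figure against Phi-2's published dimensions (32 layers, hidden size 2560, MLP width 10240):

```python
from peft import LoraConfig

# A sketch of the adapter configuration described above.
# task_type is an assumption; the card doesn't state it explicitly.
lora_config = LoraConfig(
    r=4,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "dense", "fc1", "fc2"],
    task_type="CAUSAL_LM",
)

# Sanity check of the ~5.9M figure: each wrapped Linear(in, out) adds
# r * (in + out) parameters. For Phi-2 (32 layers, hidden 2560, MLP 10240):
#   q/k/v/dense: 4 modules * 4 * (2560 + 2560) =  81,920 per layer
#   fc1 + fc2:   2 modules * 4 * (2560 + 10240) = 102,400 per layer
#   total: 32 * (81,920 + 102,400) = 5,898,240 ≈ 5.9M trainable parameters
```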
Training Data & Preprocessing
- Dataset: HyperThink-Mini 50K (7% used)
- Prompt format: `### Human: <user message> ### Assistant: <assistant response>`
- Tokenization: truncated/padded to 256 tokens; `labels = input_ids`
- Optimizer: AdamW (PyTorch), FP16 on GPU
- Batching: `per_device_train_batch_size=1` + `gradient_accumulation_steps=8` (effective batch size 8; see the sketch after this list)
- Epochs: 3
- Checkpointing: save every 500 steps; final adapter weights in `adapter_model.safetensors`
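A minimal sketch of how these choices translate to code, assuming a `transformers` `Trainer` setup; the dataset field names and `output_dir` are placeholders, not taken from the card:

```python
from transformers import AutoTokenizer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
tokenizer.pad_token = tokenizer.eos_token  # Phi-2 ships without a pad token

def preprocess(example):
    # Field names 'prompt'/'response' are assumptions; adapt to the dataset schema.
    text = f"### Human: {example['prompt']} ### Assistant: {example['response']}"
    enc = tokenizer(text, truncation=True, padding="max_length", max_length=256)
    enc["labels"] = enc["input_ids"].copy()  # labels = input_ids (causal LM loss)
    return enc

args = TrainingArguments(
    output_dir="phi2-memory-deeptalks",  # placeholder
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,       # effective batch size of 8
    num_train_epochs=3,
    fp16=True,                           # FP16 on GPU
    save_steps=500,                      # checkpoint every 500 steps
    optim="adamw_torch",                 # AdamW (PyTorch)
)
```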
Evaluation
- Training loss (step 500): ~1.08
- Validation loss: ~1.10
- Qualitative:
  - Improved recall of the last 2–4 turns in dialogue
  - Maintains base Phi-2 fluency on general language
Usage
Load the adapter into your Phi-2 model with just a few lines:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# 1) Load the base model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # Phi-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

# 2) Apply the LoRA adapter (PeftModel.from_pretrained takes the adapter repo id)
model = PeftModel.from_pretrained(model, "sourize/phi2-memory-deeptalks")

# 3) (Optional) Resize embeddings if tokens were added
model.base_model.resize_token_embeddings(len(tokenizer))

# 4) Generate
prompt = "### Human:\nHello, how are you?\n\n### Assistant:"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
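Because the adapter targets short-term recall, multi-turn use is just a matter of concatenating the most recent turns in the same prompt format. A sketch continuing from the snippet above (the history contents are illustrative):

```python
# Keep only the last few turns; the adapter is tuned for 2-4 turn recall.
history = [
    ("My name is Ada and I love astronomy.",
     "Nice to meet you, Ada! What do you enjoy observing?"),
]
prompt = ""
for user, assistant in history[-4:]:
    prompt += f"### Human:\n{user}\n\n### Assistant:\n{assistant}\n\n"
prompt += "### Human:\nWhat did I say my name was?\n\n### Assistant:"

inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
# Strip the prompt so only the new assistant turn is printed
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```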
Inference & Deployment
- Preferred: GPU (NVIDIA CUDA) for sub-second latency
- CPU-only: ~7–10 min per response (large model!)
- Hugging Face Inference API:
```bash
curl -X POST \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "Content-Type: application/json" \
  https://api-inference.huggingface.co/pipeline/text-generation/sourize/phi2-memory-deeptalks \
  -d '{
        "inputs": "Hello, how are you?",
        "parameters": {
          "max_new_tokens": 64,
          "do_sample": true,
          "temperature": 0.7,
          "top_p": 0.9,
          "return_full_text": false
        }
      }'
```
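The same call from Python via the `huggingface_hub` client, as a sketch (it assumes the model is reachable through the Inference API; the token is a placeholder):

```python
from huggingface_hub import InferenceClient

client = InferenceClient(model="sourize/phi2-memory-deeptalks", token="hf_...")  # your HF token
text = client.text_generation(
    "Hello, how are you?",
    max_new_tokens=64,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    return_full_text=False,
)
print(text)
```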
Use Cases & Limitations
- Ideal for:
  - Short back-and-forth chats (2–4 turns)
  - Chatbots that need to "remember" very recent context
- Not suited for:
  - Long-term memory or document-level retrieval
  - High-volume production on CPU (too slow)
Further Reading
- Live Demo: DeepTalks Space
- Blog post (coming soon): Add link here
- PEFT & LoRA: PEFT GitHub | LoRA Paper
Citation
```bibtex
@misc{sourize_phi2_memory_deeptalks,
  title        = {phi2-memory-deeptalks: LoRA adapter for Phi-2 with short-term conversational memory},
  author       = {Sourish},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/sourize/phi2-memory-deeptalks}},
  license      = {MIT}
}
```

Questions or feedback? Please open an issue on the repository.