AventIQ-AI/Sentence-Similarity-Model

DeepakKumarMSL committed on Jun 27

Commit 4ff2686 · verified · 1 Parent(s): 14695de

Create README.md

Files changed (1)

README.md +63 -0

README.md ADDED

	@@ -0,0 +1,63 @@

+# 🧠 Text Similarity Model using Sentence-BERT
+This project fine-tunes a Sentence-BERT model (`paraphrase-MiniLM-L6-v2`) on the English split of the **STS Benchmark** dataset (`stsb_multi_mt`) to perform **semantic similarity scoring** between two text inputs.
+
+---
+## 🚀 Features
+- 🔁 Fine-tunes `sentence-transformers/paraphrase-MiniLM-L6-v2`
+- 🔧 Trained on the `stsb_multi_mt` dataset (English split)
+- 🧪 Predicts cosine similarity between sentence pairs, trained against labels scaled to 0–1 (raw cosine similarity can range from -1 to 1)
+- ⚙️ Uses a custom PyTorch model and manual training loop
+- 💾 Model is saved as `similarity_model.pt`
+- 🧠 Supports inference on custom sentence pairs
+---
+## 📦 Dependencies
+Install required libraries:
+```bash
+pip install -q transformers datasets sentence-transformers evaluate --upgrade
+```
+## 📊 Dataset
+- Dataset: `stsb_multi_mt`
+- Configuration: `en` (English), using the `train` split
+- Purpose: Provides sentence pairs with similarity scores ranging from 0 to 5, which are normalized to 0–1 for training.
+```python
+from datasets import load_dataset
+dataset = load_dataset("stsb_multi_mt", name="en", split="train")
+# Sample a subset for faster training, capped so select() stays within the split
+subset_size = min(10000, len(dataset))
+dataset = dataset.shuffle(seed=42).select(range(subset_size))
+```
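The 0–5 to 0–1 normalization mentioned in the dataset notes amounts to dividing each label by 5. Here is a minimal sketch, assuming the raw label is stored in a column named `similarity_score` (worth checking against the dataset card). The helper names `normalize_score` and `normalize_example` and the toy rows are illustrative, not taken from the training script.

```python
def normalize_score(score: float, max_score: float = 5.0) -> float:
    """Scale an STS label from [0, max_score] down to [0, 1]."""
    return float(score) / max_score


def normalize_example(example: dict) -> dict:
    # Assumption: the raw label lives in a "similarity_score" column.
    return {"label": normalize_score(example["similarity_score"])}


# Toy rows standing in for dataset records (made up for illustration).
rows = [
    {"sentence1": "A man is playing a guitar.",
     "sentence2": "A person plays a guitar.",
     "similarity_score": 5.0},
    {"sentence1": "A dog runs in a field.",
     "sentence2": "The stock market fell today.",
     "similarity_score": 0.0},
    {"sentence1": "A woman is cooking.",
     "sentence2": "Someone is in a kitchen.",
     "similarity_score": 2.5},
]
labels = [normalize_example(row)["label"] for row in rows]
print(labels)  # [1.0, 0.0, 0.5]
```

With the Hugging Face `datasets` library, the same function could be applied to every row through `dataset.map(normalize_example)`.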
+## 🏗️ Model Architecture
+### ✅ Base Model
+- `sentence-transformers/paraphrase-MiniLM-L6-v2` (from Hugging Face)
+### ✅ Fine-Tuning
+- Each input is encoded separately, and the predicted score is the cosine similarity between the two CLS token embeddings
+- Loss: Mean Squared Error (MSE) between the predicted similarity and the normalized (0–1) ground-truth score
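The fine-tuning setup described above (shared encoder, CLS embeddings, cosine similarity, MSE loss) could look roughly like the sketch below. It is not the project's actual `training_script.py`: the class name `SimilarityModel` is an assumption, and `TinyEncoder` is a hypothetical stand-in that keeps the example runnable offline. In practice the encoder would presumably be something like `AutoModel.from_pretrained("sentence-transformers/paraphrase-MiniLM-L6-v2")`, whose output also exposes `last_hidden_state`.

```python
from types import SimpleNamespace

import torch
import torch.nn as nn
import torch.nn.functional as F


class SimilarityModel(nn.Module):
    """Shared encoder; the score is the cosine similarity of the two CLS vectors."""

    def __init__(self, encoder: nn.Module):
        super().__init__()
        self.encoder = encoder

    def embed(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return out.last_hidden_state[:, 0]  # CLS token sits at position 0

    def forward(self, ids_a, mask_a, ids_b, mask_b):
        emb_a = self.embed(ids_a, mask_a)
        emb_b = self.embed(ids_b, mask_b)
        return F.cosine_similarity(emb_a, emb_b, dim=-1)


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for the pretrained transformer (keeps the sketch offline)."""

    def __init__(self, vocab_size: int = 100, hidden: int = 16):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden)

    def forward(self, input_ids, attention_mask):
        # Mimic the `.last_hidden_state` attribute of a Hugging Face model output.
        return SimpleNamespace(last_hidden_state=self.embedding(input_ids))


torch.manual_seed(0)
model = SimilarityModel(TinyEncoder())
ids_a = torch.tensor([[1, 5, 7], [2, 8, 9]])  # made-up token ids
ids_b = torch.tensor([[1, 6, 4], [3, 8, 9]])
mask = torch.ones_like(ids_a)

preds = model(ids_a, mask, ids_b, mask)   # one score per pair, shape (batch,)
targets = torch.tensor([1.0, 0.2])        # labels already scaled to 0-1
loss = nn.MSELoss()(preds, targets)
print(preds.shape, loss.item())
```

Because cosine similarity can be negative while the labels lie in 0–1, the MSE objective is what pulls predictions into the label range. Nothing in the architecture itself enforces it.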
+## 🧠 Training
+- Epochs: 3
+- Optimizer: Adam
+- Loss: `MSELoss`
+- Manual training loop using PyTorch
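A manual loop with these settings (3 epochs, Adam, `MSELoss`) might be sketched as follows. The README does not state the learning rate, batch size, or batch layout, so the `lr` default, the `(inputs, labels)` batch format, and the `ToyPairModel` used to exercise the loop are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def train(model: nn.Module, batches, epochs: int = 3, lr: float = 2e-5):
    """Manual loop: forward pass, MSE against the 0-1 label, backward, Adam step."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # lr is an assumed value
    loss_fn = nn.MSELoss()
    history = []
    model.train()
    for _ in range(epochs):
        total = 0.0
        for inputs, labels in batches:
            optimizer.zero_grad()
            preds = model(*inputs)        # predicted similarity per pair
            loss = loss_fn(preds, labels)
            loss.backward()
            optimizer.step()
            total += loss.item()
        history.append(total / len(batches))  # mean loss for this epoch
    return history


class ToyPairModel(nn.Module):
    """Hypothetical stand-in for the real model: projects two vectors and compares them."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(8, 4)

    def forward(self, a, b):
        return F.cosine_similarity(self.proj(a), self.proj(b), dim=-1)


torch.manual_seed(0)
# Random tensors in place of tokenized sentence pairs and normalized labels.
batches = [((torch.randn(16, 8), torch.randn(16, 8)), torch.rand(16)) for _ in range(4)]
history = train(ToyPairModel(), batches, epochs=3, lr=1e-2)
print(history)
```

For the real model, `batches` would typically be a `torch.utils.data.DataLoader` yielding tokenized pairs, and `inputs` would hold the ids and attention masks for both sentences.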
+## Files and Structure
+📦text-similarity-project
+ ┣ 📜similarity_model.pt          # Trained PyTorch model
+ ┣ 📜training_script.py           # Full training and inference script
+ ┣ 📜README.md                    # Documentation