
🧠 Text Similarity Model using Sentence-BERT

This project fine-tunes a Sentence-BERT model (paraphrase-MiniLM-L6-v2) on the STS Benchmark English dataset (stsb_multi_mt) to perform semantic similarity scoring between two text inputs.


🚀 Features

  • πŸ” Fine-tunes sentence-transformers/paraphrase-MiniLM-L6-v2
  • πŸ”§ Trained on the stsb_multi_mt dataset (English split)
  • πŸ§ͺ Predicts cosine similarity between sentence pairs (0 to 1)
  • βš™οΈ Uses a custom PyTorch model and manual training loop
  • πŸ’Ύ Model is saved as similarity_model.pt
  • 🧠 Supports inference on custom sentence pairs

📦 Dependencies

Install required libraries:

pip install -q transformers datasets sentence-transformers evaluate --upgrade

📊 Dataset

  • Dataset: stsb_multi_mt
  • Split: "en"
  • Purpose: Provides sentence pairs with similarity scores ranging from 0 to 5, which are normalized to 0–1 for training.

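from datasets import load_dataset

dataset = load_dataset("stsb_multi_mt", name="en", split="train")
# Cap the subset at the split size: the English train split has fewer than
# 10,000 rows, so an unconditional select(range(10000)) raises an IndexError.
dataset = dataset.shuffle(seed=42).select(range(min(10000, len(dataset))))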
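
The 0 to 5 scores need to be rescaled before they can serve as cosine-similarity targets. Below is a minimal sketch of that step, assuming the label column is named similarity_score (as in stsb_multi_mt); the helper names and the new "label" column are illustrative, not taken from the training script.

```python
def normalize_score(score: float, max_score: float = 5.0) -> float:
    """Map an STS-style score in [0, max_score] onto [0, 1]."""
    if not 0.0 <= score <= max_score:
        raise ValueError(f"score {score} is outside [0, {max_score}]")
    return score / max_score


def add_normalized_label(example: dict) -> dict:
    # "label" is a hypothetical column name chosen for this sketch.
    example["label"] = normalize_score(example["similarity_score"])
    return example
```

With a loaded dataset, this could be applied row by row with dataset.map(add_normalized_label).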

πŸ—οΈ Model Architecture

✅ Base Model

  • sentence-transformers/paraphrase-MiniLM-L6-v2 (from Hugging Face)

✅ Fine-Tuning

  • Cosine similarity computed between the CLS token embeddings of two inputs

  • Loss: Mean Squared Error (MSE) between predicted similarity and true score
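
The two points above can be sketched as a small PyTorch module. This is an illustration under assumptions, not the project's actual class: SimilarityModel and its argument names are hypothetical, and the encoder is assumed to return an object with a last_hidden_state attribute, as Hugging Face AutoModel encoders do.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimilarityModel(nn.Module):
    """Scores a sentence pair by the cosine similarity of its two CLS embeddings."""

    def __init__(self, encoder: nn.Module):
        super().__init__()
        # Assumption: encoder(...) returns an object exposing .last_hidden_state
        # with shape (batch, sequence_length, hidden_size).
        self.encoder = encoder

    def embed(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return out.last_hidden_state[:, 0]  # position 0 holds the [CLS] token

    def forward(self, ids_a, mask_a, ids_b, mask_b) -> torch.Tensor:
        emb_a = self.embed(ids_a, mask_a)
        emb_b = self.embed(ids_b, mask_b)
        # One similarity value per pair in the batch.
        return F.cosine_similarity(emb_a, emb_b, dim=1)


# MSE between predicted similarities and the normalized 0-1 targets.
loss_fn = nn.MSELoss()
```

The encoder could be loaded with AutoModel.from_pretrained("sentence-transformers/paraphrase-MiniLM-L6-v2") from transformers. Passing it in as an argument keeps the scoring logic independent of how the base model is obtained.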

🧠 Training

  • Epochs: 3

  • Optimizer: Adam

  • Loss: MSELoss

  • Manual training loop using PyTorch
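
A manual loop with these settings might look like the sketch below. The function name train, the batch layout (a list of (inputs, targets) pairs), and the learning rate are assumptions made for illustration; only the epoch count, the Adam optimizer, and MSELoss come from the list above.

```python
import torch
import torch.nn as nn


def train(model: nn.Module, batches: list, epochs: int = 3, lr: float = 2e-5) -> list:
    """Run a plain PyTorch loop and return the mean loss of each epoch.

    batches is a list of (inputs, targets) pairs, where inputs is a tuple of
    tensors passed positionally to the model.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # lr is an assumed value
    loss_fn = nn.MSELoss()
    history = []
    model.train()
    for _ in range(epochs):
        total = 0.0
        for inputs, targets in batches:
            optimizer.zero_grad()           # clear gradients from the previous step
            preds = model(*inputs)          # predicted similarity per pair
            loss = loss_fn(preds, targets)  # compare with the normalized scores
            loss.backward()
            optimizer.step()
            total += loss.item()
        history.append(total / len(batches))
    return history
```

Returning the per-epoch mean loss gives a quick check that training is progressing without adding a logging dependency.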

Files and Structure

📦text-similarity-project
 ┣ 📜similarity_model.pt   # Trained PyTorch model
 ┣ 📜training_script.py    # Full training and inference script
 ┗ 📜README.md             # Documentation
