# DeBERTa-v3 for Quora Question Pairs Duplicate Detection
A fine-tuned DeBERTa-v3-base model for identifying duplicate question pairs, achieving 97.59% ROC AUC on the Quora Question Pairs dataset.
## Model Description
This model is a fine-tuned version of microsoft/deberta-v3-base on the Quora Question Pairs dataset. It uses a cross-encoder architecture to determine whether two questions are semantically equivalent.
**Key Features:**
- Cross-encoder architecture for superior accuracy
- Probability calibration for reliable confidence estimates
- Robust handling of missing/empty questions
- Production-ready inference pipeline
## Performance
| Metric | Value |
|---|---|
| ROC AUC | 97.59% |
| Training Loss | 0.116 |
| Validation Loss | 0.214 |
## Intended Use

**Primary Use Cases:**
- Question deduplication systems
- Semantic similarity detection
- Content moderation for duplicate questions
- Search and retrieval systems
**Out-of-Scope Use:**
- General text similarity (the model is optimized for question pairs)
- Languages other than English
- Long texts (trained with a 128-token limit; longer inputs are truncated)
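For the deduplication use case, the model's pairwise scores can drive a simple greedy filter. The sketch below is illustrative: `deduplicate` and `exact_match_score` are hypothetical helper names, and in practice the scorer would wrap the model inference shown in the Usage section below.

```python
def deduplicate(questions, score_pair, threshold=0.5):
    """Greedily keep each question only if it is not a duplicate
    of any already-kept question according to score_pair."""
    kept = []
    for q in questions:
        if all(score_pair(q, k) < threshold for k in kept):
            kept.append(q)
    return kept

# Toy scorer for illustration only; in practice, use the model's
# duplicate probability (see the Usage section below).
def exact_match_score(a, b):
    return 1.0 if a.lower() == b.lower() else 0.0

unique = deduplicate(
    ["How do I learn Python?", "how do i learn python?", "What is Rust?"],
    exact_match_score,
)
print(unique)  # ['How do I learn Python?', 'What is Rust?']
```

Note that this greedy pass scores each candidate against every kept question, so cost grows quadratically with the number of unique questions; for large corpora, a bi-encoder retrieval step is typically used to pre-filter candidate pairs before the cross-encoder scores them.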
## Usage

### Basic Inference
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "fatihburakkaragoz/quora-cross-encoder"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example usage
question1 = "How do I learn Python programming?"
question2 = "What's the best way to learn Python?"

# Tokenize and predict
inputs = tokenizer(question1, question2,
                   truncation=True, padding=True,
                   max_length=128, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

probability = torch.softmax(logits, dim=-1)[0, 1].item()
print(f"Duplicate probability: {probability:.3f}")
```
### With Probability Calibration (Recommended)
For the most accurate confidence estimates, use the included calibrator:
```python
import joblib
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load model, tokenizer, and calibrator
model_name = "fatihburakkaragoz/quora-cross-encoder"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Note: download deberta_cal.pkl from the model repository
calibrator = joblib.load("deberta_cal.pkl")

def predict_duplicate(question1, question2):
    # Get the model's raw duplicate probability
    inputs = tokenizer(question1, question2, truncation=True,
                       padding=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Softmax over the two classes, matching the basic-inference example
    raw_prob = torch.softmax(logits, dim=-1)[0, 1].item()
    # Apply calibration for better confidence estimates
    calibrated_prob = calibrator.predict_proba([[raw_prob]])[0, 1]
    return calibrated_prob

# Example
prob = predict_duplicate("How to cook pasta?", "What's the best pasta recipe?")
print(f"Calibrated duplicate probability: {prob:.3f}")
```
## Training Details

### Training Data
- Dataset: Quora Question Pairs (~400K question pairs)
- Split: 90% training, 10% validation (stratified)
- Preprocessing: Missing values filled with empty strings
### Training Configuration
- Base Model: microsoft/deberta-v3-base
- Architecture: Cross-encoder with sequence classification head
- Max Length: 128 tokens
- Batch Size: 8 per device (with gradient accumulation)
- Learning Rate: 2e-5
- Epochs: 3
- Optimizer: AdamW
- Precision: FP16
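The hyperparameters above could be expressed with Hugging Face `TrainingArguments` roughly as follows. This is a sketch, not the exact training script: `gradient_accumulation_steps` and `output_dir` are assumptions, since the card only says "with gradient accumulation".

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="quora-cross-encoder",     # assumed path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,        # assumed; exact value not stated
    learning_rate=2e-5,
    fp16=True,                            # mixed-precision training
)
```

AdamW is the `Trainer` default optimizer, so it needs no explicit argument here.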
### Training Results
| Epoch | Training Loss | Validation Loss | ROC AUC |
|---|---|---|---|
| 1 | 0.219 | 0.211 | 0.972 |
| 2 | 0.171 | 0.198 | 0.976 |
| 3 | 0.116 | 0.214 | 0.976 |
## Technical Details

### Model Architecture
- Type: Cross-encoder (both questions processed together)
- Advantage: Higher accuracy than bi-encoder approaches
- Trade-off: Slower inference than bi-encoders
### Probability Calibration
This model includes a calibration component that improves probability estimates:
- Method: Logistic Regression on validation predictions
- Benefit: More reliable confidence scores for production use
- File: `deberta_cal.pkl` (available in the model repository)
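A calibrator of this kind can be reproduced roughly as follows, assuming access to the model's raw duplicate probabilities and the true labels on a held-out validation split. The variable names and the synthetic data are illustrative, not the actual training artifacts.

```python
import numpy as np
import joblib
from sklearn.linear_model import LogisticRegression

def fit_calibrator(raw_probs, labels):
    """Fit a logistic-regression calibrator (Platt-style scaling)
    mapping raw duplicate probabilities to calibrated ones."""
    X = np.asarray(raw_probs, dtype=float).reshape(-1, 1)
    cal = LogisticRegression()
    cal.fit(X, labels)
    return cal

# Synthetic stand-in for validation predictions, for illustration:
# duplicates cluster near 0.8, non-duplicates near 0.2.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
raw = np.clip(labels * 0.6 + rng.normal(0.2, 0.1, size=200), 0.0, 1.0)

cal = fit_calibrator(raw, labels)
joblib.dump(cal, "deberta_cal.pkl")  # same filename the card references
```

At inference time, `cal.predict_proba([[raw_prob]])[0, 1]` yields the calibrated probability, as in the usage example above.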
## Limitations and Bias

**Limitations:**
- Optimized for English question pairs only
- Performance may degrade on very long questions (>128 tokens)
- Training data reflects Quora user demographics and question patterns
**Bias Considerations:**
- Model inherits biases from DeBERTa base model and Quora dataset
- May perform differently across question domains/topics
- Evaluation primarily on question similarity, not general text
## Citation
If you use this model, please cite:
```bibtex
@misc{deberta-v3-quora-question-pairs,
  title={DeBERTa-v3 for Quora Question Pairs Duplicate Detection},
  author={Fatih Burak Karagöz},
  year={2025},
  url={https://huggingface.co/fatihburakkaragoz/quora-cross-encoder}
}
```
## Acknowledgments
- Microsoft Research for DeBERTa-v3-base
- Quora for the Question Pairs dataset
- Hugging Face for the transformers library