DeBERTa-v3 for Quora Question Pairs Duplicate Detection

A fine-tuned DeBERTa-v3-base model for identifying duplicate question pairs, achieving 97.59% ROC AUC on the Quora Question Pairs dataset.

Model Description

This model is a fine-tuned version of microsoft/deberta-v3-base on the Quora Question Pairs dataset. It uses a cross-encoder architecture to determine whether two questions are semantically equivalent.

Key Features:

  • Cross-encoder architecture for superior accuracy
  • Probability calibration for reliable confidence estimates
  • Robust handling of missing/empty questions
  • Production-ready inference pipeline

Performance

| Metric          | Value  |
|-----------------|--------|
| ROC AUC         | 97.59% |
| Training Loss   | 0.116  |
| Validation Loss | 0.214  |
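The ROC AUC figure is the standard scikit-learn metric computed over validation-set duplicate probabilities. A minimal sketch (the arrays below are illustrative stand-ins, not the actual validation data):

```python
# Compute ROC AUC from gold labels and predicted duplicate probabilities.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]               # gold duplicate labels (example)
y_prob = [0.1, 0.4, 0.8, 0.9, 0.2, 0.7]   # model's duplicate probabilities (example)

auc = roc_auc_score(y_true, y_prob)
print(f"ROC AUC: {auc:.4f}")  # 1.0 here, since every positive outranks every negative
```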

Intended Use

Primary Use Cases:

  • Question deduplication systems
  • Semantic similarity detection
  • Content moderation for duplicate questions
  • Search and retrieval systems

Out-of-Scope Use:

  • General text similarity (the model is optimized for question pairs)
  • Languages other than English
  • Long texts (trained with a 128-token maximum; longer inputs are truncated)

Usage

Basic Inference

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "fatihburakkaragoz/quora-cross-encoder"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example usage
question1 = "How do I learn Python programming?"
question2 = "What's the best way to learn Python?"

# Tokenize and predict
inputs = tokenizer(question1, question2, 
                  truncation=True, padding=True, 
                  max_length=128, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probability = torch.softmax(logits, dim=-1)[0, 1].item()

print(f"Duplicate probability: {probability:.3f}")

With Probability Calibration (Recommended)

For the most accurate confidence estimates, use the included calibrator:

import joblib
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model, tokenizer, and calibrator
model_name = "fatihburakkaragoz/quora-cross-encoder"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Note: Download the calibrator separately from the model repository
calibrator = joblib.load("deberta_cal.pkl")

def predict_duplicate(question1, question2):
    # Get raw prediction
    inputs = tokenizer(question1, question2, truncation=True, 
                      padding=True, max_length=128, return_tensors="pt")
    
    with torch.no_grad():
        logits = model(**inputs).logits
        # Softmax over the two-class head, matching the basic inference example
        raw_prob = torch.softmax(logits, dim=-1)[0, 1].item()
    
    # Apply calibration for better confidence estimates
    calibrated_prob = calibrator.predict_proba([[raw_prob]])[0, 1]
    return calibrated_prob

# Example
prob = predict_duplicate("How to cook pasta?", "What's the best pasta recipe?")
print(f"Calibrated duplicate probability: {prob:.3f}")

Training Details

Training Data

  • Dataset: Quora Question Pairs (~400K question pairs)
  • Split: 90% training, 10% validation (stratified)
  • Preprocessing: Missing values filled with empty strings
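The preprocessing and split described above can be sketched as follows, assuming the standard Quora CSV columns (question1, question2, is_duplicate); function and variable names are illustrative:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def preprocess_and_split(df: pd.DataFrame, seed: int = 42):
    """Fill missing questions with empty strings, then make a
    stratified 90/10 train/validation split on the duplicate label."""
    df = df.copy()
    df["question1"] = df["question1"].fillna("")
    df["question2"] = df["question2"].fillna("")
    return train_test_split(
        df, test_size=0.10, stratify=df["is_duplicate"], random_state=seed
    )
```

With the raw Quora file, this would be called as `train_df, val_df = preprocess_and_split(pd.read_csv("train.csv"))`.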

Training Configuration

  • Base Model: microsoft/deberta-v3-base
  • Architecture: Cross-encoder with sequence classification head
  • Max Length: 128 tokens
  • Batch Size: 8 per device (with gradient accumulation)
  • Learning Rate: 2e-5
  • Epochs: 3
  • Optimizer: AdamW
  • Precision: FP16

Training Results

| Epoch | Training Loss | Validation Loss | ROC AUC |
|-------|---------------|-----------------|---------|
| 1     | 0.219         | 0.211           | 0.972   |
| 2     | 0.171         | 0.198           | 0.976   |
| 3     | 0.116         | 0.214           | 0.976   |

Technical Details

Model Architecture

  • Type: Cross-encoder (both questions processed together)
  • Advantage: Higher accuracy than bi-encoder approaches
  • Trade-off: Slower inference than bi-encoders

Probability Calibration

This model includes a calibration component that improves probability estimates:

  • Method: Logistic Regression on validation predictions
  • Benefit: More reliable confidence scores for production use
  • File: deberta_cal.pkl (included in repository)
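A calibrator of this kind can be produced by fitting a logistic regression on the model's raw validation probabilities against the gold labels and saving it with joblib. A minimal sketch (the arrays are illustrative; in practice raw_probs would come from the fine-tuned model on the validation split):

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

raw_probs = np.array([0.05, 0.30, 0.55, 0.80, 0.95])  # model outputs (example)
labels = np.array([0, 0, 1, 1, 1])                    # gold duplicate labels (example)

# Fit the calibrator on (raw probability -> label) pairs
calibrator = LogisticRegression()
calibrator.fit(raw_probs.reshape(-1, 1), labels)

joblib.dump(calibrator, "deberta_cal.pkl")

# Later, at inference time: map a raw score to a calibrated probability
p = calibrator.predict_proba([[0.70]])[0, 1]
```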

Limitations and Bias

Limitations:

  • Optimized for English question pairs only
  • Performance may degrade on very long questions (>128 tokens)
  • Training data reflects Quora user demographics and question patterns

Bias Considerations:

  • Model inherits biases from DeBERTa base model and Quora dataset
  • May perform differently across question domains/topics
  • Evaluation primarily on question similarity, not general text

Citation

If you use this model, please cite:

@misc{deberta-v3-quora-question-pairs,
  title={DeBERTa-v3 for Quora Question Pairs Duplicate Detection},
  author={Fatih Burak Karagöz},
  year={2025},
  url={https://huggingface.co/fatihburakkaragoz/quora-cross-encoder}
}

Acknowledgments

  • Microsoft Research for DeBERTa-v3-base
  • Quora for the Question Pairs dataset
  • Hugging Face for the transformers library