Smoothie: A Diffusion Model for Paraphrase Generation

This repository contains a diffusion-based model for text generation, trained on the Quora Question Pairs (QQP) dataset for the task of paraphrasing. The architecture and training methodology are based on the paper Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation.

This is a custom model and requires trust_remote_code=True to load, as the model's architecture is defined in the accompanying modeling_smoothie.py file.

Model Description

The "Smoothie" model is a non-autoregressive text generation model that uses a diffusion process. Unlike traditional models that generate text token-by-token, this model starts with pure random noise and iteratively refines it over hundreds of steps to produce a full sentence.

The key features of the architecture are:

  • Diffusion Process: Operates in a continuous space based on the negative squared Euclidean distances between token embeddings. This allows the model to smoothly add and remove "semantic noise"; a minimal sketch of this representation follows the list.
  • Backbone: A Transformer Decoder with UNet-style skip connections, which is effective for denoising tasks.
  • Conditional Generation: The model is conditioned on an input sentence (a question) to generate a semantically similar output sentence (a paraphrase).
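
To make the distance representation concrete, here is a minimal, self-contained sketch (illustrative only; the real implementation lives in modeling_smoothie.py and the pipeline below) of how a sentence's token embeddings map to negative squared Euclidean distances over the vocabulary, and how a softmax turns that matrix into a smoothed distribution:

import torch

# Toy vocabulary of 10 embeddings and a 4-token "sentence"
vocab_emb = torch.randn(10, 8)        # E: (vocab_size, emb_dim)
token_emb = vocab_emb[[1, 4, 2, 7]]   # target token embeddings: (seq_len, emb_dim)

# D0[i, j] = -||token_emb[i] - vocab_emb[j]||^2,
# expanded via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
D0 = -(token_emb.pow(2).sum(-1, keepdim=True)
       + vocab_emb.pow(2).sum(-1)
       - 2 * token_emb @ vocab_emb.T)

# Dividing by sigma^2 and taking a softmax yields a smoothed distribution over
# the vocabulary: large sigma -> near-uniform (very noisy), small sigma -> peaked.
probs = torch.softmax(D0 / 1.5 ** 2, dim=-1)
print(probs.argmax(-1))  # tensor([1, 4, 2, 7]): each row peaks at its own token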

This specific checkpoint was trained on the paraphrase pairs from the GLUE QQP dataset, using bert-base-cased as the base for its token embeddings.


How to Use

The following is a complete, self-contained example of how to load the model and use it for inference. The SmoothieDiffusion class, which orchestrates the multi-step generation process, is included for convenience.

First, make sure you have the necessary libraries installed:

pip install torch transformers accelerate huggingface_hub -q

Then, you can run the following Python script:

import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel, BertModel
from tqdm.auto import tqdm
import math

# =============================================================================
# PART 1: THE DIFFUSION PIPELINE (INFERENCE LOGIC)
# This class is required to use the Smoothie model for generation.
# =============================================================================

def get_noise_schedule(T, s_min=1.5, s_max=200.0, d=9.0, epsilon=1e-5):
    """Generates the noise schedule used during training."""
    t = torch.arange(0, T + 1, dtype=torch.float32)
    ratio = t / (T - t + epsilon)
    arg = (1/d) * ratio
    schedule = (s_max - s_min) * (2 / math.pi) * torch.atan(arg) + s_min
    schedule[0] = s_min  # pin the endpoints of the schedule exactly
    schedule[T] = s_max
    return schedule
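
# Quick sanity check (illustrative): the schedule rises monotonically from
# s_min at t=0 to s_max at t=T:
#   >>> sched = get_noise_schedule(T=200)
#   >>> sched[0].item(), sched[-1].item()
#   (1.5, 200.0)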

class SmoothieDiffusion:
    """The inference pipeline for the Smoothie model."""
    def __init__(self, E, schedule):
        self.E = E  # embedding matrix, expected to already be on the target device
        self.V, self.D = E.shape
        self.sigmas = schedule.to(E.device)  # noise schedule, moved to the same device
        self.T = len(schedule) - 1

    @torch.no_grad()
    def get_D0(self, target_embeddings):
        """Memory-efficient calculation of the distance matrix D0."""
        term1 = torch.sum(target_embeddings.pow(2), dim=-1, keepdim=True)   # ||x||^2: (B, L, 1)
        term2 = torch.sum(self.E.pow(2), dim=-1).unsqueeze(0).unsqueeze(0)  # ||e||^2: (1, 1, V)
        term3 = -2 * torch.matmul(target_embeddings, self.E.T)              # -2 x.e: (B, L, V)
        return -(term1 + term2 + term3)

    @torch.no_grad()
    def p_sample(self, model, D_t, t, delta_gen, src_tokens=None, src_mask=None):
        """A single reverse diffusion (denoising) step."""
        # Relax the noisy distance matrix into a distribution over the vocabulary,
        # then take the probability-weighted average of the vocabulary embeddings.
        p_t = torch.softmax(D_t, dim=-1)
        weighted_avg_emb = torch.matmul(p_t, self.E)
        t_tensor = torch.full((D_t.shape[0],), t, device=D_t.device, dtype=torch.long)
        
        pred_E0 = model(
            weighted_avg_emb=weighted_avg_emb,
            t=t_tensor,
            src_tokens=src_tokens,
            src_mask=src_mask
        )
        
        pred_D0 = self.get_D0(pred_E0)
        if t == 0:
            return pred_D0
            
        # Re-smooth the predicted clean distances to the next (lower) noise level
        sigma_t_minus_1 = self.sigmas[t - 1]
        D_t_minus_1 = pred_D0 / (sigma_t_minus_1 ** 2)
        if delta_gen > 0:
            D_t_minus_1 += delta_gen * torch.randn_like(D_t)  # inject fresh sampling noise
        return D_t_minus_1

    @torch.no_grad()
    def p_sample_loop(self, model, shape, delta_gen, src_tokens=None, src_mask=None):
        """The full denoising loop from T to 0."""
        device = self.E.device
        D_t = torch.randn(shape, device=device) * delta_gen
        for t in tqdm(reversed(range(0, self.T + 1)), desc="Sampling", total=self.T + 1):
            D_t = self.p_sample(model, D_t, t, delta_gen, src_tokens=src_tokens, src_mask=src_mask)
        return D_t

# =============================================================================
# PART 2: LOADING THE MODEL AND RUNNING INFERENCE
# =============================================================================

# --- Configuration ---
# Replace with your own username and repo name if you forked this
repo_id = "your-hf-username/smoothie-diffusion-qqp"
device = "cuda" if torch.cuda.is_available() else "cpu"

# --- Load Model and Tokenizer from the Hub ---
print(f"Loading tokenizer and model from: {repo_id}")
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# `trust_remote_code=True` is essential to load the custom SmoothieModel architecture
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).to(device)
model.eval()
print("\nModel loaded successfully from the Hub!")

# --- Prepare Diffusion Components ---
print("Preparing the embedding matrix for the diffusion process...")
bert_for_embeddings = BertModel.from_pretrained("bert-base-cased")
embedding_matrix = bert_for_embeddings.embeddings.word_embeddings.weight.detach().clone().to(device)
# Standardize each embedding dimension to zero mean and unit variance across the vocabulary
mean = embedding_matrix.mean(0, keepdim=True)
std = embedding_matrix.std(0, keepdim=True)
embedding_matrix = (embedding_matrix - mean) / std

# Recreate the exact noise schedule and initialize the diffusion pipeline
DIFFUSION_STEPS = 200
DELTA_GEN = 0.25
noise_schedule = get_noise_schedule(T=DIFFUSION_STEPS)
diffusion_pipeline = SmoothieDiffusion(E=embedding_matrix, schedule=noise_schedule)
print("Diffusion components are ready.")

# --- Run Inference ---
source_question = "How can I become a better writer?"
print(f"\nSource Question: {source_question}")

inputs = tokenizer(
    source_question,
    max_length=model.config.max_seq_len,
    padding="max_length",
    truncation=True,
    return_tensors="pt"
)
src_tokens = inputs['input_ids'].to(device)
src_mask = (src_tokens == tokenizer.pad_token_id).to(device)  # True at padding positions

generated_D0 = diffusion_pipeline.p_sample_loop(
    model,
    shape=(1, model.config.max_seq_len, model.config.vocab_size),
    delta_gen=DELTA_GEN,
    src_tokens=src_tokens,
    src_mask=src_mask
)

# --- Decode and Display the Result ---
output_tokens = torch.argmax(generated_D0, dim=-1)  # (1, seq_len): most likely token per position
decoded_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

print("-" * 30)
print(f"Generated Paraphrase: {decoded_text}")
print("-" * 30)

Training Details

This model was trained from scratch.

  • Dataset: glue/qqp, filtered for positive pairs (is_duplicate = 1).
  • Training Steps: 25,000
  • Batch Size: 16
  • Optimizer: AdamW
  • Learning Rate: 2e-4
  • Hardware: Trained on a single NVIDIA T4 GPU via Google Colab.
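
For reference, here is a minimal sketch of one training step that is consistent with the inference pipeline above. It is an illustration, not the repository's training script: the forward noising D_t = D0 / sigma_t^2 + delta * noise and the MSE loss on predicted embeddings are assumptions inferred from the sampling code and the Smoothie paper.

import torch
import torch.nn.functional as F

def training_step_sketch(model, diffusion, target_emb, src_tokens, src_mask,
                         optimizer, delta=0.25):
    """One illustrative training step (assumed; not the original script)."""
    B = target_emb.shape[0]
    # Sample a random noise level per example
    t = torch.randint(1, diffusion.T + 1, (B,), device=target_emb.device)

    # Forward process: smooth the clean distance matrix to noise level sigma_t
    D0 = diffusion.get_D0(target_emb)            # (B, L, V)
    sigma_t = diffusion.sigmas[t].view(B, 1, 1)
    D_t = D0 / sigma_t ** 2 + delta * torch.randn_like(D0)

    # The model denoises from the probability-weighted embedding of the noisy state
    p_t = torch.softmax(D_t, dim=-1)
    weighted_avg_emb = p_t @ diffusion.E

    pred_E0 = model(weighted_avg_emb=weighted_avg_emb, t=t,
                    src_tokens=src_tokens, src_mask=src_mask)

    # Regress the clean target embeddings
    loss = F.mse_loss(pred_E0, target_emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()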

Limitations and Bias

  • The model's knowledge is limited to the topics present in the Quora Question Pairs (QQP) dataset. It may perform poorly on highly specialized or out-of-domain topics.
  • As with any model trained on large-scale internet text, it may reflect societal biases present in the training data.
  • The model is currently undertrained and may not always produce semantically faithful paraphrases; continued training would likely improve output quality.

Model Details

  • Model size: 156M parameters
  • Tensor type: F32 (Safetensors)