Smoothie: A Diffusion Model for Paraphrase Generation
This repository contains a diffusion-based model for text generation, trained on the Quora Question Pairs (QQP) dataset for the task of paraphrasing. The architecture and training methodology are based on the paper Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation.
This is a custom model and requires trust_remote_code=True to load, as the model's architecture is defined in the accompanying modeling_smoothie.py file.
Model Description
The "Smoothie" model is a non-autoregressive text generation model that uses a diffusion process. Unlike traditional models that generate text token-by-token, this model starts with pure random noise and iteratively refines it over hundreds of steps to produce a full sentence.
The key features of the architecture are:
- Diffusion Process: Operates in a continuous space of negative squared Euclidean distances between token embeddings, which lets the model smoothly add and remove "semantic noise" (a toy sketch follows this list).
- Backbone: A Transformer Decoder with UNet-style skip connections, which is effective for denoising tasks.
- Conditional Generation: The model is conditioned on an input sentence (a question) to generate a semantically similar output sentence (a paraphrase).
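To make this concrete, here is a toy sketch (shapes and values are purely illustrative, not the model's actual tensors) of the distance space: dividing the negative squared distances by a noise level sigma^2 before the softmax makes the implied distribution over the vocabulary sharper (low noise) or flatter (high noise).

import torch

E = torch.randn(100, 16)          # toy vocabulary: 100 embeddings of dimension 16
x = E[3] + 0.1 * torch.randn(16)  # a lightly perturbed copy of token 3's embedding
D0 = -(x - E).pow(2).sum(-1)      # negative squared distance to every vocabulary entry, shape (100,)
for sigma in (1.0, 5.0, 50.0):
    p = torch.softmax(D0 / sigma**2, dim=-1)
    print(f"sigma={sigma:5.1f}  p[token 3]={p[3]:.3f}")  # larger sigma -> flatter distribution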
This specific checkpoint was trained on the paraphrase pairs from the GLUE QQP dataset, using bert-base-cased as the base for its token embeddings.
How to Use
The following is a complete, self-contained example of how to load the model and use it for inference. The SmoothieDiffusion class, which orchestrates the multi-step generation process, is included for convenience.
First, make sure you have the necessary libraries installed:
pip install torch transformers accelerate huggingface_hub -q
Then, you can run the following Python script:
import math

import torch
from transformers import AutoTokenizer, AutoModel, BertModel
from tqdm.auto import tqdm
# =============================================================================
# PART 1: THE DIFFUSION PIPELINE (INFERENCE LOGIC)
# This class is required to use the Smoothie model for generation.
# =============================================================================
def get_noise_schedule(T, s_min=1.5, s_max=200.0, d=9.0, epsilon=1e-5):
"""Generates the noise schedule used during training."""
    t = torch.arange(0, T + 1, dtype=torch.float32)
    ratio = t / (T - t + epsilon)
    arg = (1 / d) * ratio
    schedule = (s_max - s_min) * (2 / math.pi) * torch.atan(arg) + s_min
    schedule[0] = s_min  # pin the endpoints of the schedule exactly
    schedule[T] = s_max
return schedule
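# Illustrative check (optional): the schedule rises monotonically from s_min at t=0 to
# s_max at t=T, so the reverse process starts under heavy smoothing and ends with almost none:
#   sched = get_noise_schedule(T=200)
#   print(sched[0].item(), sched[100].item(), sched[200].item())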
class SmoothieDiffusion:
"""The inference pipeline for the Smoothie model."""
def __init__(self, E, schedule):
        self.E = E                           # the semantic map (embedding matrix), kept on its current device
        self.V, self.D = E.shape
        self.sigmas = schedule.to(E.device)  # the noise schedule ("blueprint"), moved to the same device as E
self.T = len(schedule) - 1
@torch.no_grad()
def get_D0(self, target_embeddings):
"""Memory-efficient calculation of the distance matrix D0."""
term1 = torch.sum(target_embeddings.pow(2), dim=-1, keepdim=True)
term2 = torch.sum(self.E.pow(2), dim=-1).unsqueeze(0).unsqueeze(0)
term3 = -2 * torch.matmul(target_embeddings, self.E.T)
return -(term1 + term2 + term3)
@torch.no_grad()
def p_sample(self, model, D_t, t, delta_gen, src_tokens=None, src_mask=None):
"""A single reverse diffusion (denoising) step."""
        p_t = torch.softmax(D_t, dim=-1)              # distribution over the vocabulary at step t
        weighted_avg_emb = torch.matmul(p_t, self.E)  # expected embedding under p_t
        t_tensor = torch.full((D_t.shape[0],), t, device=D_t.device, dtype=torch.long)  # one timestep index per batch element
pred_E0 = model(
weighted_avg_emb=weighted_avg_emb,
t=t_tensor,
src_tokens=src_tokens,
src_mask=src_mask
)
pred_D0 = self.get_D0(pred_E0)
if t == 0:
return pred_D0
sigma_t_minus_1 = self.sigmas[t-1]
D_t_minus_1 = pred_D0 / (sigma_t_minus_1 ** 2)
if delta_gen > 0:
D_t_minus_1 += delta_gen * torch.randn_like(D_t)
return D_t_minus_1
@torch.no_grad()
def p_sample_loop(self, model, shape, delta_gen, src_tokens=None, src_mask=None):
"""The full denoising loop from T to 0."""
device = self.E.device
D_t = torch.randn(shape, device=device) * delta_gen
for t in tqdm(reversed(range(0, self.T + 1)), desc="Sampling", total=self.T + 1):
D_t = self.p_sample(model, D_t, t, delta_gen, src_tokens=src_tokens, src_mask=src_mask)
return D_t
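# Note on the loop above: sampling starts from pure noise (D_T ~ delta_gen * N(0, I)), and each
# p_sample call (i) softmaxes the current distances into a distribution over the vocabulary,
# (ii) forms the expected embedding under that distribution, (iii) asks the denoiser for clean
# embeddings, and (iv) rescales the implied distances to noise level t-1, re-injecting a small
# amount of Gaussian noise when delta_gen > 0.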
# =============================================================================
# PART 2: LOADING THE MODEL AND RUNNING INFERENCE
# =============================================================================
# --- Configuration ---
# Replace with your own username and repo name if you forked this
repo_id = "your-hf-username/smoothie-diffusion-qqp"
device = "cuda" if torch.cuda.is_available() else "cpu"
# --- Load Model and Tokenizer from the Hub ---
print(f"Loading tokenizer and model from: {repo_id}")
tokenizer = AutoTokenizer.from_pretrained(repo_id)
# `trust_remote_code=True` is essential to load the custom SmoothieModel architecture
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).to(device)
model.eval()
print("\nModel loaded successfully from the Hub!")
# --- Prepare Diffusion Components ---
print("Preparing the embedding matrix for the diffusion process...")
bert_for_embeddings = BertModel.from_pretrained("bert-base-cased")
embedding_matrix = bert_for_embeddings.embeddings.word_embeddings.weight.detach().clone().to(device)
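# Standardize each embedding dimension (zero mean, unit variance across the vocabulary)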
mean = embedding_matrix.mean(0, keepdim=True)
std = embedding_matrix.std(0, keepdim=True)
embedding_matrix = (embedding_matrix - mean) / std
# Recreate the exact noise schedule and initialize the diffusion pipeline
DIFFUSION_STEPS = 200
DELTA_GEN = 0.25
noise_schedule = get_noise_schedule(T=DIFFUSION_STEPS)
diffusion_pipeline = SmoothieDiffusion(E=embedding_matrix, schedule=noise_schedule)
print("Diffusion components are ready.")
# --- Run Inference ---
source_question = "How can I become a better writer?"
print(f"\nSource Question: {source_question}")
inputs = tokenizer(
source_question,
max_length=model.config.max_seq_len,
padding="max_length",
truncation=True,
return_tensors="pt"
)
src_tokens = inputs['input_ids'].to(device)
src_mask = (src_tokens == tokenizer.pad_token_id).to(device)  # True at padding positions
generated_D0 = diffusion_pipeline.p_sample_loop(
model,
shape=(1, model.config.max_seq_len, model.config.vocab_size),
delta_gen=DELTA_GEN,
src_tokens=src_tokens,
src_mask=src_mask
)
# --- Decode and Display the Result ---
output_tokens = torch.argmax(generated_D0, dim=-1)  # nearest vocabulary entry per position, shape (1, max_seq_len)
decoded_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
print("-" * 30)
print(f"Generated Paraphrase: {decoded_text}")
print("-" * 30)
Training Details
This model was trained from scratch.
- Dataset: glue/qqp, filtered for positive pairs (is_duplicate = 1); see the filtering sketch after this list
- Training Steps: 25,000
- Batch Size: 16
- Optimizer: AdamW
- Learning Rate: 2e-4
- Hardware: Trained on a single NVIDIA T4 GPU via Google Colab.
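For reference, the positive-pair filtering can be reproduced along these lines (a sketch using the datasets library; the exact training-time preprocessing is an assumption):

from datasets import load_dataset

# In the GLUE version of QQP, label == 1 marks duplicate (paraphrase) pairs
qqp = load_dataset("glue", "qqp", split="train")
positives = qqp.filter(lambda ex: ex["label"] == 1)
print(f"{len(positives)} paraphrase pairs")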
Limitations and Bias
- The model's knowledge is limited to the topics present in the Quora Question Pairs dataset. It may perform poorly on highly specialized or out-of-domain topics.
- As with any model trained on large-scale internet text, it may reflect societal biases present in the training data.
- The model is currently undertrained and may not always produce semantically faithful paraphrases. Continued training would likely improve its output quality.