PHOENIX: Hierarchical Contrastive Learning for Patent Image Retrieval

PHOENIX is a domain-adapted CLIP/ViT-based model designed to improve patent image retrieval. It addresses the unique challenges of retrieving relevant technical drawings in patent documents, especially when searching for semantically or hierarchically related images, not just exact matches.

This model is based on openai/clip-vit-base-patch16 and fine-tuned using a hierarchical multi-positive contrastive loss that leverages the Locarno classification, an international system used to categorize industrial designs.


🧠 Motivation

Patent images are often complex technical illustrations that encode detailed structural or functional aspects of an invention. Current systems typically retrieve images from the same patent but fail when asked to retrieve semantically similar inventions across different patents or subclasses.

For instance, a retrieval system should understand that a "foldable camping chair" and a "stackable office chair" both fall under the broader "seating" category, even if their visual structure differs.


πŸ” What This Model Does

  • Leverages CLIP ViT for visual understanding of technical drawings
  • Trains using hierarchical multi-positive contrastive learning to encode the Locarno structure (see the loss sketch after this list):
    Furniture → Seating → Chairs → Specific Patent

  • Encodes images such that semantically similar inventions are close in the embedding space, even if they come from different patents or subclasses
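
Below is a minimal sketch of what such a hierarchical multi-positive objective can look like. The function name, level-weighting scheme, and batch layout are assumptions for illustration, not the exact training code.

import torch
import torch.nn.functional as F

def hierarchical_multi_positive_loss(embeddings, level_labels, level_weights, temperature=0.07):
    """
    embeddings:    (B, D) image embeddings for one batch
    level_labels:  (B, L) integer hierarchy codes, coarsest level first
                   (e.g. Locarno class, subclass, patent id)
    level_weights: L floats, larger weights for finer (more specific) levels
    """
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.t() / temperature                          # (B, B) scaled cosine similarities
    B = z.size(0)
    self_mask = torch.eye(B, dtype=torch.bool, device=z.device)

    # Each pair gets the weight of the finest hierarchy level it shares.
    pos_w = torch.zeros(B, B, device=z.device)
    for lvl, w in enumerate(level_weights):                # coarse -> fine, finer levels overwrite
        same = level_labels[:, lvl].unsqueeze(0) == level_labels[:, lvl].unsqueeze(1)
        pos_w = torch.where(same, torch.full_like(pos_w, w), pos_w)
    pos_w = pos_w.masked_fill(self_mask, 0.0)

    # InfoNCE-style log-probabilities over the other samples in the batch.
    sim = sim.masked_fill(self_mask, float("-inf"))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)        # avoid 0 * (-inf) on the diagonal

    # Weighted average of positive log-probabilities per anchor.
    denom = pos_w.sum(dim=1).clamp(min=1e-8)
    loss = -(pos_w * log_prob).sum(dim=1) / denom
    return loss.mean()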

📦 How to Use

Load Model

from transformers import CLIPModel, CLIPProcessor
from PIL import Image
import torch

# Load fine-tuned model and processor
model = CLIPModel.from_pretrained("kshitij3188/PHOENIX-patent-retrieval")
processor = CLIPProcessor.from_pretrained("kshitij3188/PHOENIX-patent-retrieval")

model.eval()

Extract Embeddings

def extract_image_embedding(image_path):
    """Return the CLIP image embedding for a single patent drawing."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        embedding = model.get_image_features(**inputs).squeeze()
    return embedding

# Example
embedding = extract_image_embedding("some_patent_image.png")
print("πŸ” Image embedding shape:", embedding.shape)

You can now compare cosine similarity between embeddings to retrieve similar patent drawings.
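
For instance, a small retrieval loop over a gallery of drawings might look like this (the gallery file names are placeholders):

import torch
import torch.nn.functional as F

# Hypothetical gallery of patent drawings to search over
gallery_paths = ["drawing_a.png", "drawing_b.png", "drawing_c.png"]
gallery = torch.stack([extract_image_embedding(p) for p in gallery_paths])
gallery = F.normalize(gallery, dim=-1)

query = F.normalize(extract_image_embedding("some_patent_image.png"), dim=-1)
scores = gallery @ query                        # cosine similarities, shape (len(gallery),)
top = scores.topk(k=min(3, len(gallery_paths)))
for score, idx in zip(top.values, top.indices):
    print(f"{gallery_paths[idx]}: {score.item():.3f}")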


πŸ† Results

Evaluated on the DeepPatent2 dataset, PHOENIX shows significant gains in:

  • Intra-category retrieval (same Locarno subclass; see the scoring sketch below)
  • Cross-category generalization (related but distinct inventions)
  • Robustness at a relatively small parameter count, making it suitable for real-time deployment
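
As a rough illustration (not the exact evaluation protocol), intra-category retrieval can be scored with a recall@k over precomputed embeddings and Locarno subclass labels:

import torch
import torch.nn.functional as F

def recall_at_k(embeddings, subclass_labels, k=10):
    """embeddings: (N, D) tensor; subclass_labels: (N,) tensor of Locarno subclasses."""
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.t()
    sim.fill_diagonal_(float("-inf"))            # exclude the query itself
    topk = sim.topk(k, dim=1).indices            # (N, k) nearest neighbours
    hits = (subclass_labels[topk] == subclass_labels.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()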

💡 Use Cases

  • 🔍 Prior Art Search – Find related inventions even if visually different
  • 🧠 Design Inspiration – Explore similar patent structures from other domains
  • 📑 Semantic Tagging – Automatically cluster patents into meaningful groups
  • 🛡️ IP Protection – Detect potential overlaps or infringements more robustly

πŸ› οΈ Model Architecture

This model wraps ViTModel in a custom class PatentEmbeddingModel, which:

  • Accepts a checkpoint fine-tuned on hierarchical labels
  • Uses the CLS token embedding for image representation
  • Integrates seamlessly with transformers’ ViT feature extractors
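
A minimal sketch of what this wrapper might look like, assuming a standard transformers ViT backbone (the default checkpoint name below is a placeholder, not the released weights):

import torch.nn as nn
from transformers import ViTModel

class PatentEmbeddingModel(nn.Module):
    """Wraps a ViT backbone and exposes the CLS token as the image embedding."""

    def __init__(self, checkpoint="google/vit-base-patch16-224-in21k"):
        super().__init__()
        self.backbone = ViTModel.from_pretrained(checkpoint)

    def forward(self, pixel_values):
        outputs = self.backbone(pixel_values=pixel_values)
        # CLS token (position 0) of the last hidden state is the image representation
        return outputs.last_hidden_state[:, 0]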

📜 License

This model is released under the MIT License.


✨ Credits

Developed as part of a Master's thesis on improving patent retrieval through hierarchical representation learning.
