---
tags:
- patent-retrieval
- image-search
- hierarchical-learning
- contrastive-learning
license: mit
base_model:
- openai/clip-vit-base-patch16
pipeline_tag: image-to-image
---

# PHOENIX: Hierarchical Contrastive Learning for Patent Image Retrieval

**PHOENIX** is a domain-adapted CLIP/ViT-based model designed to improve **patent image retrieval**. It addresses the unique challenges of retrieving relevant technical drawings in patent documents, especially when searching for **semantically or hierarchically related images**, not just exact matches.

This model is based on `openai/clip-vit-base-patch16` and fine-tuned with a **hierarchical multi-positive contrastive loss** that leverages the **Locarno classification**, an international system used to categorize industrial designs.

---

## 🧠 Motivation

Patent images are often **complex technical illustrations** that encode detailed structural or functional aspects of an invention. Current systems typically retrieve images from the same patent but fail when asked to retrieve **semantically similar inventions** across different patents or subclasses.

For instance, a retrieval system should understand that a *"foldable camping chair"* and a *"stackable office chair"* both fall under the broader "seating" category, even if their visual structure differs.

---

## 🔍 What This Model Does

- Leverages **CLIP ViT** for visual understanding of technical drawings
- Trains using **hierarchical multi-positive contrastive learning** to encode the Locarno structure:

  ```
  Furniture → Seating → Chairs → Specific Patent
  ```

- Encodes images such that **semantically similar inventions are close** in the embedding space, even if they come from different patents or subclasses

---

## 📦 How to Use

### Load Model

```python
from transformers import CLIPModel, CLIPProcessor
from PIL import Image
import torch

# Load fine-tuned model and processor
model = CLIPModel.from_pretrained("kshitij3188/PHOENIX-patent-retrieval")
processor = CLIPProcessor.from_pretrained("kshitij3188/PHOENIX-patent-retrieval")
model.eval()
```

### Extract Embeddings

```python
def extract_image_embedding(image_path):
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        embedding = model.get_image_features(**inputs).squeeze()
    return embedding

# Example
embedding = extract_image_embedding("some_patent_image.png")
print("🔍 Image embedding shape:", embedding.shape)
```

You can now compare cosine similarities between embeddings to retrieve similar patent drawings.
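For example, here is one way to rank a small gallery of drawings against a query image by cosine similarity. This is a minimal sketch: the file names are placeholders, and `extract_image_embedding` is the helper defined above.

```python
import torch
import torch.nn.functional as F

# Placeholder file names, used only for illustration
gallery_paths = ["patent_a.png", "patent_b.png", "patent_c.png"]
gallery = torch.stack([extract_image_embedding(p) for p in gallery_paths])
query = extract_image_embedding("some_patent_image.png")

# L2-normalize so that the dot product equals cosine similarity
gallery = F.normalize(gallery, dim=-1)
query = F.normalize(query, dim=-1)

# Rank gallery images by similarity to the query
scores = gallery @ query  # shape: (len(gallery_paths),)
for rank, idx in enumerate(scores.argsort(descending=True).tolist(), start=1):
    print(f"{rank}. {gallery_paths[idx]} (cosine similarity: {scores[idx].item():.4f})")
```

For larger collections, the gallery embeddings can be precomputed once and stored, so that only the query image needs to be encoded at search time.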
---

## 🏆 Results

Evaluated on the **DeepPatent2** dataset, PHOENIX shows significant gains in:

- **Intra-category retrieval** (same subclass)
- **Cross-category generalization** (related but distinct inventions)
- **Low-parameter robustness**, making it suitable for real-time deployment

---

## 💡 Use Cases

- 🔍 **Prior Art Search** – Find related inventions even if they are visually different
- 🧠 **Design Inspiration** – Explore similar patent structures from other domains
- 📑 **Semantic Tagging** – Automatically cluster patents into meaningful groups
- 🛡️ **IP Protection** – Detect potential overlaps or infringements more robustly

---

## 🛠️ Model Architecture

This model wraps `ViTModel` in a custom class, `PatentEmbeddingModel`, which:

- Accepts a checkpoint fine-tuned on hierarchical labels
- Uses the CLS token embedding as the image representation
- Integrates seamlessly with transformers' ViT feature extractors

(A minimal sketch of such a wrapper is given at the end of this card.)

---

## 📜 License

This model is released under the MIT License.

---

## ✨ Credits

Developed as part of a Master's thesis on improving patent retrieval through hierarchical representation learning.
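---

## Appendix: Wrapper Sketch

For reference, the following is a minimal sketch of what a wrapper like the `PatentEmbeddingModel` described in the architecture section might look like. It is not the exact training code behind this card: the backbone name is an assumption for illustration, and in practice the fine-tuned checkpoint weights would be loaded into it.

```python
import torch.nn as nn
from transformers import ViTModel

class PatentEmbeddingModel(nn.Module):
    """Illustrative sketch: a ViT wrapper that exposes the CLS token as the image embedding."""

    def __init__(self, backbone_name="google/vit-base-patch16-224-in21k"):  # assumed backbone name
        super().__init__()
        self.backbone = ViTModel.from_pretrained(backbone_name)

    def forward(self, pixel_values):
        outputs = self.backbone(pixel_values=pixel_values)
        # Use the CLS token (first position of the last hidden state) as the image representation
        return outputs.last_hidden_state[:, 0]
```

The CLS embedding it returns can then be compared with cosine similarity, exactly as in the usage example earlier in this card.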