zer0int/CLIP-KO-ViT-B-32-TypoAttack

zer0int committed on Jul 15

Commit dfad2ea (verified) · 1 Parent(s): 8afdb75

Unleash KO-CLIP

Files changed (1)

README.md +139 -3

@@ -1,3 +1,139 @@
----
-license: mit
----

+---
+license: mit
+datasets:
+- SPRIGHT-T2I/spright_coco
+- zer0int/CLIP-KO-Adversarial-Train-Typo-Attack
+base_model:
+- openai/clip-vit-base-patch32
+---
+# CLIP-KO: Knocking Out Typographic Attacks in CLIP 💪🤖
+### Less vulnerability, much better performance! 🤗
+❤️ this CLIP? [Donate](https://ko-fi.com/zer0int) if you can / want. TY!
+# 🔥 CLIP-KO ViT-B/32 (vit-base-patch32)
+- 📝 Read the [paper](https://github.com/zer0int/CLIP-fine-tune/blob/CLIP-vision/KO-CLIP-teaser/KO-CLIP-paper-final.pdf) (PDF) here.
+- 🤓 Wanna fine-tune yourself? Get the [code](https://github.com/zer0int/CLIP-fine-tune) on my GitHub.
+----
+<details>
+<summary>👉 CLICK ME to expand example benchmark code ⚡💻</summary>
+```python
+from datasets import load_dataset
+from transformers import CLIPModel, CLIPProcessor
+import torch
+from PIL import Image
+from tqdm import tqdm
+import pandas as pd
+device = "cuda" if torch.cuda.is_available() else "cpu"
+# BLISS / SCAM Typographic Attack Dataset
+# https://huggingface.co/datasets/BLISS-e-V/SCAM
+ds = load_dataset("BLISS-e-V/SCAM", split="train")
+# Benchmark pre-trained model against my fine-tune
+model_variants = [
+    ("OpenAI ", "openai/clip-vit-base-patch32", "openai/clip-vit-base-patch32"),
+    ("KO-CLIP", "zer0int/CLIP-KO-ViT-B-32-TypoAttack", "zer0int/CLIP-KO-ViT-B-32-TypoAttack"),
+]
+models = {}
+for name, model_path, processor_path in model_variants:
+    model = CLIPModel.from_pretrained(model_path).to(device).float()
+    processor = CLIPProcessor.from_pretrained(processor_path)
+    models[name] = (model, processor)
+for variant in ["NoSCAM", "SCAM", "SynthSCAM"]:
+    print(f"\n=== Evaluating var.: {variant} ===")
+    idxs = [i for i, v in enumerate(ds['id']) if v.startswith(variant)]
+    if not idxs:
+        print(f"  No samples for {variant}")
+        continue
+    subset = [ds[i] for i in idxs]
+    for model_name, (model, processor) in models.items():
+        results = []
+        for entry in tqdm(subset, desc=f"{model_name}", ncols=30, bar_format="{l_bar}{bar}| {n_fmt}/{total_fmt} |"):
+            img = entry['image']
+            object_label = entry['object_label']
+            attack_word = entry['attack_word']
+            texts = [f"a photo of a {object_label}", f"a photo of a {attack_word}"]
+            inputs = processor(
+                text=texts,
+                images=img,
+                return_tensors="pt",
+                padding=True
+            )
+            for k in inputs:
+                if isinstance(inputs[k], torch.Tensor):
+                    inputs[k] = inputs[k].to(device)
+            with torch.no_grad():
+                outputs = model(**inputs)
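+                # image_embeds / text_embeds are already L2-normalized by CLIPModel,
+                # so the matrix product below is a cosine similarity.
+                image_features = outputs.image_embeds
+                text_features = outputs.text_embeds
+                logits = image_features @ text_features.T
+                # No logit scale is applied, so the softmax is nearly flat; the argmax is unaffected.
+                probs = logits.softmax(dim=-1).cpu().numpy().flatten()
+                # Index 0 is the true object, index 1 the attack word (same order as `texts`).
+                pred_idx = probs.argmax()
+                pred_label = [object_label, attack_word][pred_idx]
+                is_correct = (pred_label == object_label)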
+            results.append({
+                "id": entry['id'],
+                "object_label": object_label,
+                "attack_word": attack_word,
+                "pred_label": pred_label,
+                "is_correct": is_correct,
+                "type": entry['type'],
+                "model": model_name
+            })
+        n_total = len(results)
+        n_correct = sum(r['is_correct'] for r in results)
+        acc = n_correct / n_total if n_total else float('nan')
+        print(f"| > > > > Zero-shot accuracy for {variant}, {model_name}: {n_correct}/{n_total} = {acc:.4f}")
+```
+</details>
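The benchmark script above stores one dict per sample in `results` but only prints overall accuracy per variant. Below is a minimal sketch of how those rows could be broken down per model and per `type` with pandas. It assumes the per-sample dicts are accumulated across models into one list; `all_results` and `accuracy_table` are hypothetical names, and the rows are invented placeholders, not benchmark output.

```python
import pandas as pd

# Placeholder rows using the same keys the benchmark loop appends to `results`.
# All values are invented for illustration; they are NOT real benchmark output.
all_results = [
    {"id": "SCAM_0", "object_label": "cup", "attack_word": "dog",
     "pred_label": "dog", "is_correct": False, "type": "SCAM", "model": "OpenAI"},
    {"id": "SCAM_1", "object_label": "pen", "attack_word": "car",
     "pred_label": "pen", "is_correct": True, "type": "SCAM", "model": "OpenAI"},
    {"id": "SCAM_0", "object_label": "cup", "attack_word": "dog",
     "pred_label": "cup", "is_correct": True, "type": "SCAM", "model": "KO-CLIP"},
    {"id": "SCAM_1", "object_label": "pen", "attack_word": "car",
     "pred_label": "pen", "is_correct": True, "type": "SCAM", "model": "KO-CLIP"},
    {"id": "NoSCAM_0", "object_label": "cup", "attack_word": "dog",
     "pred_label": "cup", "is_correct": True, "type": "NoSCAM", "model": "KO-CLIP"},
]


def accuracy_table(rows):
    """Return mean accuracy and sample count for each (model, type) pair."""
    df = pd.DataFrame(rows)
    return (
        df.groupby(["model", "type"])["is_correct"]
        .agg(accuracy="mean", n="count")
        .reset_index()
    )


table = accuracy_table(all_results)
print(table.to_string(index=False))
```

Keeping the raw rows around also makes it easy to list the individual samples where the attack word won, for example `pd.DataFrame(all_results).query("not is_correct")`.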
+----
+Better attention heatmaps!
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/6490359a877fc29cb1b09451/VW0siiegXZeb_Ox5dQTxY.png)
+----
+## 📊 Benchmark Results 🚀
+| Benchmark / Metric                 | Pre-Trained | Fine-Tuned |
+|------------------------------------|-------------|------------|
+| **Typographic Attack**             |             |            |
+| RTA-100 zero-shot acc.             | 0.5560      | **0.7740**🎖️ |
+| BLISS / SCAM NoSCAM acc.           | 0.9682      | **0.9759** |
+| BLISS / SCAM SCAM acc.             | 0.6627      | **0.7926**🎖️ |
+| BLISS / SCAM SynthSCAM acc.        | 0.4320      | **0.6386**🎖️ |
+| **LAION/CLIP_Benchmark**           |             |            |
+| VoC-2007-multilabel mAP            | 0.7231      | **0.8335**🎖️ |
+| MSCOCO retrieval image recall@5    | 0.1724      | **0.2523** |
+| MSCOCO retrieval text recall@      | 0.2440      | **0.3569** |
+| xm3600 retrieval image recall@5    | 0.2867      | **0.3874** |
+| xm3600 retrieval text recall@      | 0.2523      | **0.3783** |
+| **ImageNet-1k**                    |             |            |
+| zero-shot acc1                     | 0.2234      | **0.3193** |
+| zero-shot acc5                     | 0.4169      | **0.5555** |
+| mAP                                | 0.2230      | **0.3185** |
+| **MISC**                           |             |            |
+| ImageNet-1k linear probe Top-1     | **53.14%**  | 52.65%     |
+| ImageNet-1k linear probe Top-5     | 83.41%      | **83.48%** |
+| MVT ImageNet/ObjectNet acc.        | 0.6492      | **0.7506**🎖️ |
+| Flickr8k Modality Gap: ↓           | 0.8301      | **0.7902** |
+| Flickr8k JSD: ↓                    | 0.5225      | **0.2983** |
+| Flickr8k Wasserstein Dist.: ↓      | 0.4573      | **0.4039** |
+| Flickr8k Img-Text Cos Sim (mean): ↑| 0.3164      | **0.3522** |
+| Flickr8k Img-Text Cos Sim (std)    | 0.0325      | 0.0537     |
+| Flickr8k Text-Text Cos Sim (mean)  | 0.7737      | 0.7561     |
+| Flickr8k Text-Text Cos Sim (std)   | 0.1036      | 0.1300     |
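The Flickr8k rows above describe how image and text embeddings sit relative to each other, but how they were computed is not stated here. As a rough sketch under common definitions, the modality gap can be taken as the Euclidean distance between the centroids of L2-normalized image and text embeddings, and the image-text cosine similarity as the per-pair dot product of matched, normalized embeddings. The function names are hypothetical, and random arrays stand in for real embeddings.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit Euclidean length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def modality_gap(img, txt):
    """Distance between the centroids of normalized image and text embeddings."""
    img, txt = l2_normalize(img), l2_normalize(txt)
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))


def paired_cosine_stats(img, txt):
    """Mean and std of cosine similarity over matched (image, text) rows."""
    img, txt = l2_normalize(img), l2_normalize(txt)
    sims = np.sum(img * txt, axis=-1)
    return float(sims.mean()), float(sims.std())


# Random stand-ins for embeddings; assumed shape is (num_pairs, embed_dim).
rng = np.random.default_rng(0)
img_emb = rng.normal(size=(8, 512))
txt_emb = rng.normal(size=(8, 512))

print("modality gap:", modality_gap(img_emb, txt_emb))
print("img-text cos sim (mean, std):", paired_cosine_stats(img_emb, txt_emb))
```

With real data, `img_emb` and `txt_emb` would be the `image_embeds` and `text_embeds` collected from the model over matched image-caption pairs.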