---
license: mit
datasets:
- SPRIGHT-T2I/spright_coco
- zer0int/CLIP-KO-Adversarial-Train-Typo-Attack
base_model:
- openai/clip-vit-base-patch32
---

# CLIP-KO: Knocking Out Typographic Attacks in CLIP 💪🤖
### Less vulnerability, much better performance!

🤗 ❤️ this CLIP? [Donate](https://ko-fi.com/zer0int) if you can / want. TY!

# 🔥 CLIP-KO ViT-B/32 (vit-base-patch32)

- 📝 Read the [paper](https://github.com/zer0int/CLIP-fine-tune/blob/CLIP-vision/KO-CLIP-teaser/KO-CLIP-paper-final.pdf) (PDF) here.
- 🤓 Wanna fine-tune yourself? Get the [code](https://github.com/zer0int/CLIP-fine-tune) on my GitHub.

----
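Quick start: a minimal zero-shot usage sketch. The Hub id below is the one used in the benchmark script further down; `your_image.jpg` is a placeholder for any local image.

```python
# Minimal zero-shot usage sketch.
# Assumptions: the Hub id matches this repo; "your_image.jpg" is a placeholder.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "zer0int/CLIP-KO-ViT-B-32-TypoAttack"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = CLIPModel.from_pretrained(model_id).to(device).eval()
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("your_image.jpg")  # placeholder path
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True).to(device)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image = temperature-scaled cosine similarities between the image and each text
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for text, p in zip(texts, probs):
    print(f"{text}: {p.item():.4f}")
```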
<details>
<summary>👉 CLICK ME to expand example benchmark code ⚡💻</summary>

```python
from datasets import load_dataset
from transformers import CLIPModel, CLIPProcessor
import torch
from tqdm import tqdm

device = "cuda" if torch.cuda.is_available() else "cpu"

# BLISS / SCAM Typographic Attack Dataset
# https://huggingface.co/datasets/BLISS-e-V/SCAM
ds = load_dataset("BLISS-e-V/SCAM", split="train")

# Benchmark the pre-trained model against my fine-tune
# (trailing space in "OpenAI " aligns the printed output)
model_variants = [
    ("OpenAI ", "openai/clip-vit-base-patch32", "openai/clip-vit-base-patch32"),
    ("KO-CLIP", "zer0int/CLIP-KO-ViT-B-32-TypoAttack", "zer0int/CLIP-KO-ViT-B-32-TypoAttack"),
]

models = {}
for name, model_path, processor_path in model_variants:
    model = CLIPModel.from_pretrained(model_path).to(device).float()
    processor = CLIPProcessor.from_pretrained(processor_path)
    models[name] = (model, processor)

for variant in ["NoSCAM", "SCAM", "SynthSCAM"]:
    print(f"\n=== Evaluating variant: {variant} ===")
    idxs = [i for i, v in enumerate(ds['id']) if v.startswith(variant)]
    if not idxs:
        print(f"  No samples for {variant}")
        continue
    subset = [ds[i] for i in idxs]

    for model_name, (model, processor) in models.items():
        results = []
        for entry in tqdm(subset, desc=f"{model_name}", ncols=30,
                          bar_format="{l_bar}{bar}| {n_fmt}/{total_fmt} |"):
            img = entry['image']
            object_label = entry['object_label']
            attack_word = entry['attack_word']

            # Two-way zero-shot classification: true object vs. typographic attack word
            texts = [f"a photo of a {object_label}", f"a photo of a {attack_word}"]
            inputs = processor(text=texts, images=img, return_tensors="pt", padding=True)
            for k in inputs:
                if isinstance(inputs[k], torch.Tensor):
                    inputs[k] = inputs[k].to(device)

            with torch.no_grad():
                outputs = model(**inputs)

            # image_embeds / text_embeds are already L2-normalized by CLIPModel,
            # so this matrix product yields cosine similarities
            image_features = outputs.image_embeds
            text_features = outputs.text_embeds
            logits = image_features @ text_features.T
            probs = logits.softmax(dim=-1).cpu().numpy().flatten()

            pred_idx = probs.argmax()
            pred_label = [object_label, attack_word][pred_idx]
            is_correct = (pred_label == object_label)

            results.append({
                "id": entry['id'],
                "object_label": object_label,
                "attack_word": attack_word,
                "pred_label": pred_label,
                "is_correct": is_correct,
                "type": entry['type'],
                "model": model_name,
            })

        n_total = len(results)
        n_correct = sum(r['is_correct'] for r in results)
        acc = n_correct / n_total if n_total else float('nan')
        print(f"| > > > > Zero-shot accuracy for {variant}, {model_name}: {n_correct}/{n_total} = {acc:.4f}")
```

</details>
----

Better attention heatmaps!

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6490359a877fc29cb1b09451/VW0siiegXZeb_Ox5dQTxY.png)

----

## 📊 Benchmark Results 🚀

| Benchmark / Metric                   | Pre-Trained | Fine-Tuned |
|--------------------------------------|-------------|------------|
| **Typographic Attack**               |             |            |
| RTA-100 zero-shot acc.               | 0.5560      | **0.7740** 🎖️ |
| BLISS / SCAM NoSCAM acc.             | 0.9682      | **0.9759** |
| BLISS / SCAM SCAM acc.               | 0.6627      | **0.7926** 🎖️ |
| BLISS / SCAM SynthSCAM acc.          | 0.4320      | **0.6386** 🎖️ |
| **LAION CLIP_benchmark**             |             |            |
| VOC-2007 multilabel mAP              | 0.7231      | **0.8335** 🎖️ |
| MSCOCO retrieval image recall@5      | 0.1724      | **0.2523** |
| MSCOCO retrieval text recall@5       | 0.2440      | **0.3569** |
| xm3600 retrieval image recall@5      | 0.2867      | **0.3874** |
| xm3600 retrieval text recall@5       | 0.2523      | **0.3783** |
| **ImageNet-1k**                      |             |            |
| Zero-shot acc@1                      | 0.2234      | **0.3193** |
| Zero-shot acc@5                      | 0.4169      | **0.5555** |
| mAP                                  | 0.2230      | **0.3185** |
| **Misc**                             |             |            |
| ImageNet-1k linear probe Top-1       | **53.14%**  | 52.65% |
| ImageNet-1k linear probe Top-5       | 83.41%      | **83.48%** |
| MVT ImageNet/ObjectNet acc.          | 0.6492      | **0.7506** 🎖️ |
| Flickr8k Modality Gap ↓              | 0.8301      | **0.7902** |
| Flickr8k JSD ↓                       | 0.5225      | **0.2983** |
| Flickr8k Wasserstein Dist. ↓         | 0.4573      | **0.4039** |
| Flickr8k Img-Text Cos Sim (mean) ↑   | 0.3164      | **0.3522** |
| Flickr8k Img-Text Cos Sim (std)      | 0.0325      | 0.0537 |
| Flickr8k Text-Text Cos Sim (mean)    | 0.7737      | 0.7561 |
| Flickr8k Text-Text Cos Sim (std)     | 0.1036      | 0.1300 |
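For reference, one rough way to peek at the vision transformer's attention, in the spirit of the heatmap figure above. This is a generic sketch, not the exact visualization method used for that figure (see the paper / GitHub code for that): it simply averages the final vision layer's CLS-to-patch attention across heads; the Hub id and `example.jpg` are placeholders.

```python
# Generic attention-heatmap sketch (NOT the exact method behind the figure above).
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "zer0int/CLIP-KO-ViT-B-32-TypoAttack"  # assumed repo id
# "eager" attention ensures attention weights can be returned
model = CLIPModel.from_pretrained(model_id, attn_implementation="eager").eval()
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("example.jpg").convert("RGB")  # placeholder input image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    out = model.vision_model(**inputs, output_attentions=True)

# Last layer attention: (batch, heads, tokens, tokens); token 0 is CLS.
attn = out.attentions[-1][0].mean(0)         # average over heads -> (tokens, tokens)
cls_to_patches = attn[0, 1:]                 # CLS attention to the 7x7 = 49 patches
heat = cls_to_patches.reshape(7, 7).numpy()  # ViT-B/32 at 224 px -> 7x7 patch grid
heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)

# Upsample to the image size and save as a grayscale heatmap.
Image.fromarray((heat * 255).astype(np.uint8)).resize(image.size).save("heatmap.png")
```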