Preface

A first experiment to test whether clip-vit-base-patch32 can be converted into a geometric model using only a classification head.

Below is GPT-5's auto-generated write-up of the notebook. I'll include the full notebook here shortly.

The question was simple: can linear layers learn geometry?

The answer is... maybe. More research required.

Reasoning

I used the 32-dim geometric vocab, since it seemed to be the weakest with flow-match Euler discrete, to test the hypothesis that a low-dimensional geometry could in fact substitute for a higher-dimensional geometric variant.

The output head is incredibly small, contrary to my first impression. This checkpoint happens to have the pair packed together, but the updated notebook will show that the separated head is in fact less than 100 KB.

Why clip-vit instead of just vit?

I believe the clip-vit variants have more utility overall, so I wanted to ensure a fair target was assessed.

Notebook-6 · Crystal-CLIP CIFAR-100

One-vector image embeddings (HF CLIP) + pentachora vocabulary anchors → cosine-similarity classifier for CIFAR-100. This repo hosts the trained crystal classification head (+ run configs/metrics) built in Notebook 6.


OVERVIEW

  • Vision encoder: openai/clip-vit-base-patch32 (Hugging Face transformers), frozen by default. Produces exactly one embedding per image (image_embeds, dim=512), which is L2-normalized before classification.
  • Vocabulary: AbstractPhil/geometric-vocab-32d (pentachora crystals). For CIFAR-100 class names, any missing tokens are deterministically synthesized via the unicode path to guarantee 100/100 coverage and preserve class ordering.
  • Head: projects both image embeddings (De=512) and role-selected class anchors (Dv=32) into a shared symbol space (crystal_dims=64), L2-normalizes, and computes cosine logits divided by T (temperature); see the sketch after this list.
  • Training: Cross-Entropy on CIFAR-100, AdamW, optional AMP, cosine LR with warmup. Best checkpoint is saved and (optionally) pushed to Hugging Face.
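
Concretely, the head is two linear projections followed by temperature-scaled cosine similarity in the shared 64-D symbol space. A minimal sketch using the shapes from the bullets above (the full head class, as used in training, appears under QUICKSTART):

    import torch
    import torch.nn.functional as F

    # Illustrative shapes: 512-D image embeds, 32-D class anchors, 64-D symbol space.
    proj_img = torch.nn.Linear(512, 64, bias=True)
    proj_anc = torch.nn.Linear(32, 64, bias=False)
    T = 0.07

    def cosine_logits(image_embeds, anchors):             # [B, 512], [C, 32]
        z = F.normalize(proj_img(image_embeds), dim=-1)   # [B, 64]
        a = F.normalize(proj_anc(anchors), dim=-1)        # [C, 64]
        return (z @ a.T) / T                              # [B, C]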

MODEL CARD

  • Task: Image Classification (CIFAR-100)
  • Backbone: openai/clip-vit-base-patch32 (vision-only)
  • Head: Crystal projection head (image 512→64, anchor 32→64) + cosine logits (temperature)
  • Vocabulary: AbstractPhil/geometric-vocab-32d (wordnet_eng split + deterministic unicode synth for gaps)
  • Metrics: Top-1 ≈ 60, Top-3 > 80 (placeholders; see the note under REPRODUCE)
  • License: MIT

FILES IN THIS REPO

  • _best.safetensors — weights for:
    • head::* (crystal classifier head)
    • encoder::* (optional, if you chose to unfreeze/fine-tune)
  • _best.config.json — full CONFIG used for the run
  • _best.metrics.json — summary metrics for the best epoch
  • Optionally: _latest.* variants, if you pushed latest per-epoch artifacts.

Note: If you only want to ship the head, you can also include a stripped crystal_head.safetensors (head-only state_dict). The snippets below handle either format.
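
If you want to produce that head-only file yourself, one possible sketch (filenames follow the repo layout above):

    import safetensors.torch

    # Keep only the head:: entries from the packed checkpoint and drop the prefix.
    state = safetensors.torch.load_file("_best.safetensors")
    head_only = {k.split("head::", 1)[1]: v for k, v in state.items() if k.startswith("head::")}
    safetensors.torch.save_file(head_only, "crystal_head.safetensors")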


QUICKSTART (Inference)

  1. Load the CLIP vision encoder (frozen) and its processor:

    import torch
    from transformers import AutoImageProcessor, CLIPVisionModelWithProjection

    HF_CLIP_ID = "openai/clip-vit-base-patch32"
    Processor = AutoImageProcessor.from_pretrained(HF_CLIP_ID)
    Encoder = CLIPVisionModelWithProjection.from_pretrained(HF_CLIP_ID).eval().to("cuda")

  2. Build the crystal head (same shapes as training):

    image_dim = Encoder.config.projection_dim  # 512
    crystal_dim = 32    # vocab repo uses 32-D anchors (geometric-vocab-32d)
    sym_dim = 64        # crystal_dims from CONFIG
    temperature = 0.07  # from CONFIG

    class CrystalHead(torch.nn.Module):
        def __init__(self, De, Dv, Dsym, T):
            super().__init__()
            self.proj_img = torch.nn.Linear(De, Dsym, bias=True)
            self.proj_anc = torch.nn.Linear(Dv, Dsym, bias=False)
            self.T = T
            self.register_buffer("anchors_vocab", torch.empty(0, Dv), persistent=False)

        def set_anchors(self, anchors):  # [C, Dv]
            self.anchors_vocab = anchors.contiguous()

        def forward(self, image_embeds):  # [B, De] (L2-normalized is fine)
            z = torch.nn.functional.normalize(self.proj_img(image_embeds), dim=-1)
            a = torch.nn.functional.normalize(self.proj_anc(self.anchors_vocab), dim=-1)
            return (z @ a.T) / max(1e-8, self.T)  # [B, C]

    head = CrystalHead(De=image_dim, Dv=crystal_dim, Dsym=sym_dim, T=temperature).to("cuda")

  3. Load weights (handles the prefixed multi-module .safetensors):

    import safetensors.torch

    state = safetensors.torch.load_file("_best.safetensors")
    head_state = {k.split("head::", 1)[1]: v for k, v in state.items() if k.startswith("head::")}
    head.load_state_dict(head_state, strict=True)

  4. Prepare anchors from your vocabulary (same order as training). You likely already exported the anchors, or can rebuild them exactly as in Notebook 6.

    # anchors: torch.Tensor of shape [100, 32]
    head.set_anchors(anchors.to("cuda"))

  5. Run inference on a batch of images (PIL or ndarray):

    import PIL.Image

    imgs = [PIL.Image.open("example_0.png").convert("RGB"),
            PIL.Image.open("example_1.png").convert("RGB")]
    batch = Processor(images=imgs, return_tensors="pt").to("cuda")
    with torch.no_grad():
        out = Encoder(pixel_values=batch["pixel_values"], return_dict=True)
        z = torch.nn.functional.normalize(out.image_embeds, dim=-1)  # [B, 512]
        logits = head(z)                                             # [B, 100]
    pred = logits.argmax(dim=-1).tolist()
    print("pred:", pred)

Note: The head expects the same class order used at training time. Save and ship class_names.json (CIFAR-100 labels) and the exact anchors_vocab.pt you used (or rebuild deterministically with the vocab + synth step).
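
For example, continuing from the pred list in step 5, a minimal sketch assuming you shipped exactly those two artifacts (class_names.json and anchors_vocab.pt are the suggested names, not files guaranteed to be in this repo):

    import json
    import torch

    with open("class_names.json") as f:
        class_names = json.load(f)               # CIFAR-100 labels, training order

    anchors = torch.load("anchors_vocab.pt")     # [100, 32] role-selected crystals
    head.set_anchors(anchors.to("cuda"))

    names = [class_names[i] for i in pred]       # map predicted indices back to labels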


REPRODUCE (Notebook 6)

  1. Config only (single source of truth): image size, CLIP stats, dataset, temperature, crystal dims, etc.
  2. Cell 5 – HF CLIP vision loader (one embedding per image).
  3. Cell 6 – Vocabulary interface; synth any missing CIFAR tokens, cache crystals, select role anchors.
  4. Cell 8 – Crystal head (image+anchor projections → cosine logits / T).
  5. Cell 9 – Trainer (AdamW + AMP + cosine LR). Saves latest/best, pushes to HF if enabled.
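
For orientation, a minimal sketch of the per-batch training step under the setup above (frozen encoder, head-only optimization; hyperparameters and the AMP wiring here are illustrative, not the exact Cell 9 code):

    import torch
    from torch.cuda.amp import autocast, GradScaler

    optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3, weight_decay=1e-2)  # illustrative values
    scaler = GradScaler()
    criterion = torch.nn.CrossEntropyLoss()

    def train_step(pixel_values, labels):
        optimizer.zero_grad(set_to_none=True)
        with torch.no_grad():                        # encoder stays frozen
            out = Encoder(pixel_values=pixel_values, return_dict=True)
            z = torch.nn.functional.normalize(out.image_embeds, dim=-1)
        with autocast():                             # optional AMP
            loss = criterion(head(z), labels)        # cross-entropy over cosine logits
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()                              # a cosine LR schedule with warmup would step here
        return loss.item()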

Replace the placeholder metrics above with your final numbers after the run completes.


ACKNOWLEDGEMENTS

  • CLIP ViT-B/32: OpenAI (openai/clip-vit-base-patch32) via Hugging Face transformers.
  • Pentachora Vocabulary: AbstractPhil/geometric-vocab-32d.
  • Built in Notebook 6 (CONFIG-first, deterministic synth for gaps, head-only training).