Kiln CLIP (ViT-B/16, OpenCLIP, kiln-anchor)
This is a PyTorch checkpoint for a vision-only CLIP variant fine-tuned to detect brick kilns in satellite imagery.
The model uses a kiln anchor embedding (`t_kiln`) and a lightweight SimilarityHead to classify images as kiln vs. non-kiln.
It also includes a custom loss function with feature distillation and an attention-centroid mechanism for interpretability.
Model Details
- Architecture: CLIP-style dual encoder (vision encoder used; text encoder unused during training)
- Vision encoder: ViT-B/16 (OpenAI initialization via OpenCLIP)
- Head: SimilarityHead mapping kiln-similarity to binary logit
- Objective: Binary kiln classification (kiln vs. non-kiln)
- Additional objective: Feature distillation from a frozen reference model
- Framework: PyTorch + OpenCLIP + timm
- Checkpoint contents: vision encoder weights, head weights, kiln anchor vector, training arguments
Loss Function and Distillation
The training criterion combines binary classification and optional distillation:
- Classification term: Binary cross-entropy with logits (`BCEWithLogitsLoss`) on kiln vs. non-kiln.
- Distillation term: Encourages the student image features to stay close to those of a frozen teacher reference model.
  - Types:
    - MSE: mean squared error between features.
    - Cosine: penalizes deviation from a cosine similarity of 1.
  - Weight: controlled by `distill_weight` (default 0.05).
- Final loss (see the sketch after this list):
  `Loss = BCE(logits, labels) + distill_weight * DistillLoss(features, ref_features)`
- The distillation is only active if both image features and reference features are available.
- This design stabilizes training and regularizes the vision encoder.
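A minimal sketch of this criterion, assuming a functional interface; the function name `kiln_loss` and its argument layout are illustrative, not the repository's exact API:

```python
import torch
import torch.nn.functional as F

def kiln_loss(logits, labels, feats=None, ref_feats=None,
              distill_weight=0.05, distill_type="mse"):
    """Illustrative combined objective: BCE + optional feature distillation."""
    # Classification term: binary cross-entropy with logits on kiln vs. non-kiln.
    loss = F.binary_cross_entropy_with_logits(logits, labels.float())

    # Distillation term is only active when both feature sets are available.
    if feats is not None and ref_feats is not None:
        if distill_type == "mse":
            distill = F.mse_loss(feats, ref_feats)
        else:  # "cos": penalize deviation from a cosine similarity of 1
            distill = (1.0 - F.cosine_similarity(feats, ref_feats, dim=-1)).mean()
        loss = loss + distill_weight * distill
    return loss
```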
Training Details
Data
- Satellite imagery patches covering regions in Punjab and Sindh.
- Positive samples: fixed-chimney bull’s trench kilns (FCBK) and zigzag kilns.
- Negative samples: non-kiln patches (rural and urban).
- Images resized to 224×224 and normalized to CLIP mean/std.
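The resize and normalization can be reproduced with a standard torchvision transform; the mean/std values below are the usual CLIP constants, and the repository may instead use the preprocessing transform returned by OpenCLIP:

```python
from torchvision import transforms

# Standard CLIP normalization constants (OpenAI-pretrained weights).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

preprocess = transforms.Compose([
    transforms.Resize((224, 224), interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize(CLIP_MEAN, CLIP_STD),
])
```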
Training Procedure
- Encode images with the ViT-B/16 encoder → normalize features.
- Compute cosine similarity to the kiln anchor embedding `t_kiln`.
- Pass the similarity through the SimilarityHead to produce logits.
- Apply loss function described above.
- Fine-tune only the last K vision blocks, with optional logit scale freezing.
- Gradient norm clipped at 1.0.
- Optimizer: AdamW with weight decay.
- Mixed precision supported.
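Putting these steps together, a single training step might look like the following sketch; `model`, `head`, `t_kiln`, and `criterion` are placeholders for the repository's own objects, and the optimizer step is omitted:

```python
import torch
import torch.nn.functional as F

def train_step(model, head, t_kiln, images, labels, criterion, ref_model=None):
    # 1. Encode images with the ViT-B/16 encoder and L2-normalize the features.
    feats = F.normalize(model.encode_image(images), dim=-1)

    # 2. Cosine similarity to the unit-norm kiln anchor.
    sim = feats @ t_kiln  # shape: (batch,)

    # 3. Map similarity to a binary logit with the SimilarityHead.
    logits = head(sim)

    # 4. Optional frozen-teacher features for distillation.
    ref_feats = None
    if ref_model is not None:
        with torch.no_grad():
            ref_feats = F.normalize(ref_model.encode_image(images), dim=-1)

    # 5. Combined loss (e.g., the kiln_loss sketch above), then clip gradients at 1.0.
    loss = criterion(logits, labels, feats, ref_feats)
    loss.backward()
    trainable = [p for p in list(model.parameters()) + list(head.parameters())
                 if p.requires_grad]
    torch.nn.utils.clip_grad_norm_(trainable, max_norm=1.0)
    return loss
```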
Hyperparameters
- Batch size: 64
- Epochs: 6
- Learning rate: 5e-5
- Weight decay: 0.05
- Trainable blocks: last 2 by default
- Distillation weight: 0.05
- Distillation type: mse or cos
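For convenience, the same defaults as a plain configuration dictionary (key names are illustrative, not the actual argument names):

```python
config = {
    "batch_size": 64,
    "epochs": 6,
    "lr": 5e-5,
    "weight_decay": 0.05,
    "trainable_blocks": 2,   # last K vision transformer blocks
    "distill_weight": 0.05,
    "distill_type": "mse",   # or "cos"
}
```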
Evaluation
Validation
- Validation loop mirrors training but uses plain BCE for loss reporting.
- Accuracy computed at probability threshold 0.5.
Metrics
- Binary classification accuracy
- Recommended: ROC-AUC, F1-score for imbalanced settings
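A quick way to compute these metrics from collected probabilities; note that scikit-learn is assumed here and is not among the listed dependencies:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

def report_metrics(probs, labels, threshold=0.5):
    """probs: kiln probabilities in [0, 1]; labels: 0/1 ground truth."""
    probs = np.asarray(probs)
    labels = np.asarray(labels)
    preds = (probs >= threshold).astype(int)
    return {
        "accuracy": float((preds == labels).mean()),
        "roc_auc": roc_auc_score(labels, probs),
        "f1": f1_score(labels, preds),
    }
```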
Outputs
- Kiln probability (0–1)
- Optional saliency heatmap and centroid visualization based on attention rollout
Technical Specifications
- Backbone: OpenCLIP ViT-B/16 vision tower (OpenAI pretrained)
- Embedding dimension: CLIP ViT-B/16 image feature dimension (512 in the joint embedding space)
- Kiln anchor: unit-norm reference vector `t_kiln`, saved in the checkpoint
- Head: affine logits over kiln similarity (trainable alpha, beta)
- Compute: single GPU training with mixed precision
- Dependencies: PyTorch, timm, open_clip_torch, torchvision, numpy, Pillow, matplotlib
Usage
The checkpoint is intended for:
- Kiln detection: classify satellite patches as kiln vs. non-kiln
- Kiln localization: estimate kiln centroid from ViT attention rollout
- Research extension: distillation and anchor-based methods for other geospatial assets
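A minimal end-to-end inference sketch, assuming the checkpoint stores the fine-tuned vision weights, head parameters, and anchor under keys like `visual`, `head`, and `t_kiln`; the actual layout and helpers are in `model.py`, `head.py`, and `prompt.py`:

```python
import torch
import torch.nn.functional as F
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# OpenAI-pretrained ViT-B/16 with its matching preprocessing transform.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-16", pretrained="openai")
model = model.to(device).eval()

# Hypothetical checkpoint layout; see model.py / head.py / prompt.py for the real one.
ckpt = torch.load("checkpoints/best.pt", map_location=device)
model.visual.load_state_dict(ckpt["visual"])               # fine-tuned vision encoder
t_kiln = F.normalize(ckpt["t_kiln"].to(device), dim=-1)    # unit-norm kiln anchor
alpha, beta = ckpt["head"]["alpha"], ckpt["head"]["beta"]  # SimilarityHead parameters

image = preprocess(Image.open("patch.png").convert("RGB")).unsqueeze(0).to(device)
with torch.no_grad():
    feats = F.normalize(model.encode_image(image), dim=-1)
    sim = feats @ t_kiln                       # cosine similarity to the kiln anchor
    prob = torch.sigmoid(alpha * sim + beta)   # kiln probability in (0, 1)
print(f"kiln probability: {prob.item():.3f}")
```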
Files Explained
This repository includes the pretrained checkpoint and minimal helpers for inference and inspection.
checkpoints/best.pt
- Best-performing checkpoint.
- Contains: vision encoder, SimilarityHead, kiln anchor (`t_kiln`), and training args.
head.py
- Defines SimilarityHead, mapping kiln similarity to logits.
- Required to reproduce classification outputs.
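A sketch consistent with that description (the initial values of alpha and beta are illustrative; `head.py` is authoritative):

```python
import torch
import torch.nn as nn

class SimilarityHead(nn.Module):
    """Affine map from kiln similarity to a binary logit: logit = alpha * sim + beta."""

    def __init__(self, alpha_init: float = 10.0, beta_init: float = 0.0):
        super().__init__()
        # Trainable scale and bias over the cosine similarity.
        self.alpha = nn.Parameter(torch.tensor(alpha_init))
        self.beta = nn.Parameter(torch.tensor(beta_init))

    def forward(self, sim: torch.Tensor) -> torch.Tensor:
        return self.alpha * sim + self.beta
```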
model.py
- Rebuilds the OpenCLIP ViT-B/16 vision encoder with correct trainable/frozen layers.
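A sketch of how such an encoder can be rebuilt and partially frozen with OpenCLIP; the attribute path `model.visual.transformer.resblocks` follows OpenCLIP's ViT implementation, and the exact logic lives in `model.py`:

```python
import open_clip

def build_vision_encoder(trainable_blocks: int = 2, freeze_logit_scale: bool = True):
    model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-16", pretrained="openai")

    # Freeze everything, then unfreeze the last K transformer blocks of the vision tower.
    for p in model.parameters():
        p.requires_grad = False
    for block in model.visual.transformer.resblocks[-trainable_blocks:]:
        for p in block.parameters():
            p.requires_grad = True

    # Optionally keep the logit scale frozen as well.
    model.logit_scale.requires_grad = not freeze_logit_scale
    return model, preprocess
```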
prompt.py
- Loads and normalizes the kiln anchor embedding (`t_kiln`).
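In practice this amounts to something like the following (the storage format is an assumption; `prompt.py` is authoritative):

```python
import torch
import torch.nn.functional as F

ckpt = torch.load("checkpoints/best.pt", map_location="cpu")
t_kiln = F.normalize(ckpt["t_kiln"].float(), dim=-1)  # unit-norm kiln anchor vector
```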
infer_with_centroid.py
- Inference utility: computes kiln probability, attention-based centroid, and saliency heatmaps.
- Outputs annotated previews under `outputs/`.
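A sketch of the attention-rollout centroid idea, assuming per-layer attention maps (collected, e.g., via forward hooks) that include the CLS token and a 14×14 patch grid for ViT-B/16 at 224×224 input; `infer_with_centroid.py` implements the real version:

```python
import torch

def attention_rollout_centroid(attentions, grid=14):
    """attentions: list of (heads, tokens, tokens) attention maps, tokens = 1 + grid*grid."""
    tokens = attentions[0].shape[-1]
    rollout = torch.eye(tokens)
    for attn in attentions:
        a = attn.mean(dim=0)                   # average over heads
        a = 0.5 * a + 0.5 * torch.eye(tokens)  # add the residual connection
        a = a / a.sum(dim=-1, keepdim=True)    # re-normalize rows
        rollout = a @ rollout                  # accumulate across layers

    # CLS-to-patch attention reshaped onto the patch grid serves as a saliency map.
    saliency = rollout[0, 1:].reshape(grid, grid)
    saliency = saliency / saliency.sum()

    # Centroid = attention-weighted mean of patch coordinates (row, col).
    ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
    cy = (saliency * ys).sum().item()
    cx = (saliency * xs).sum().item()
    return saliency, (cy, cx)
```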
Example Result
Kiln predicted with p=0.999 and centroid estimated from attention rollout.
Citation
If you use this model, please cite:
@misc{kilnclip2025,
  title  = {Kiln CLIP: Detecting and Typing Brick Kilns in South Asia with Vision–Language Models},
  author = {Hamdani, Suleman},
  year   = {2025},
  note   = {Hugging Face repository: sulemanhamdani/kiln-clip-vit-b-16}
}