Kiln CLIP (ViT-B/16, OpenCLIP, kiln-anchor)
This is a PyTorch checkpoint for a vision-only CLIP variant fine-tuned to detect brick kilns in satellite imagery.
The model uses a kiln anchor embedding (`t_kiln`) and a lightweight SimilarityHead to classify images as kiln vs. non-kiln.
It also includes a custom loss function with feature distillation and an attention-centroid mechanism for interpretability.
Model Details
- Architecture: CLIP-style dual encoder (vision encoder used; text encoder unused during training)
- Vision encoder: ViT-B/16 (OpenAI initialization via OpenCLIP)
- Head: SimilarityHead mapping kiln-similarity to binary logit
- Objective: Binary kiln classification (kiln vs. non-kiln)
- Additional objective: Feature distillation from a frozen reference model
- Framework: PyTorch + OpenCLIP + timm
- Checkpoint contents: vision encoder weights, head weights, kiln anchor vector, training arguments
Loss Function and Distillation
The training criterion combines binary classification and optional distillation:
- Classification term: Binary cross-entropy with logits (`BCEWithLogitsLoss`) on kiln vs. non-kiln.
- Distillation term: Encourages the student image features to stay close to those of a frozen teacher reference model.
  - Types:
    - MSE: mean squared error between features.
    - Cosine: penalizes deviation from a cosine similarity of 1.
  - Weight: controlled by `distill_weight` (default 0.05).
- Final loss (see the sketch after this list):
  `Loss = BCE(logits, labels) + distill_weight * DistillLoss(features, ref_features)`
- The distillation is only active if both image features and reference features are available.
- This design stabilizes training and regularizes the vision encoder.
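A minimal sketch of this criterion, assuming a functional interface; the function name `kiln_loss` and its argument layout are illustrative, not the repository's exact API:

```python
import torch
import torch.nn.functional as F

def kiln_loss(logits, labels, feats=None, ref_feats=None,
              distill_weight=0.05, distill_type="mse"):
    """Illustrative combined objective: BCE + optional feature distillation."""
    # Classification term: binary cross-entropy with logits on kiln vs. non-kiln.
    loss = F.binary_cross_entropy_with_logits(logits, labels.float())

    # Distillation term is only active when both feature sets are available.
    if feats is not None and ref_feats is not None:
        if distill_type == "mse":
            distill = F.mse_loss(feats, ref_feats)
        else:  # "cos": penalize deviation from a cosine similarity of 1
            distill = (1.0 - F.cosine_similarity(feats, ref_feats, dim=-1)).mean()
        loss = loss + distill_weight * distill
    return loss
```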
Training Details
Data
- Satellite imagery patches covering regions in Punjab and Sindh.
- Positive samples: fixed-chimney bull’s trench kilns (FCBK) and zigzag kilns.
- Negative samples: non-kiln patches (rural and urban).
- Images resized to 224×224 and normalized to CLIP mean/std.
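The resize and normalization can be reproduced with a standard torchvision transform; the mean/std values below are the usual CLIP constants, and the repository may instead use the preprocessing transform returned by OpenCLIP:

```python
from torchvision import transforms

# Standard CLIP normalization constants (OpenAI-pretrained weights).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

preprocess = transforms.Compose([
    transforms.Resize((224, 224), interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize(CLIP_MEAN, CLIP_STD),
])
```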
Training Procedure
- Encode images with the ViT-B/16 encoder → normalize features.
- Compute cosine similarity to the kiln anchor embedding `t_kiln`.
- Pass the similarity through the SimilarityHead to produce logits.
- Apply loss function described above.
- Fine-tune only the last K vision blocks, with optional logit scale freezing.
- Gradient norm clipped at 1.0.
- Optimizer: AdamW with weight decay.
- Mixed precision supported.
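Putting these steps together, a single training step might look like the following sketch; `model`, `head`, `t_kiln`, and `criterion` are placeholders for the repository's own objects, and the optimizer step is omitted:

```python
import torch
import torch.nn.functional as F

def train_step(model, head, t_kiln, images, labels, criterion, ref_model=None):
    # 1. Encode images with the ViT-B/16 encoder and L2-normalize the features.
    feats = F.normalize(model.encode_image(images), dim=-1)

    # 2. Cosine similarity to the unit-norm kiln anchor.
    sim = feats @ t_kiln  # shape: (batch,)

    # 3. Map similarity to a binary logit with the SimilarityHead.
    logits = head(sim)

    # 4. Optional frozen-teacher features for distillation.
    ref_feats = None
    if ref_model is not None:
        with torch.no_grad():
            ref_feats = F.normalize(ref_model.encode_image(images), dim=-1)

    # 5. Combined loss (e.g., the kiln_loss sketch above), then clip gradients at 1.0.
    loss = criterion(logits, labels, feats, ref_feats)
    loss.backward()
    trainable = [p for p in list(model.parameters()) + list(head.parameters())
                 if p.requires_grad]
    torch.nn.utils.clip_grad_norm_(trainable, max_norm=1.0)
    return loss
```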
Hyperparameters
- Batch size: 64
- Epochs: 6
- Learning rate: 5e-5
- Weight decay: 0.05
- Trainable blocks: last 2 by default
- Distillation weight: 0.05
- Distillation type: mse or cos
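For convenience, the same defaults as a plain configuration dictionary (key names are illustrative, not the actual argument names):

```python
config = {
    "batch_size": 64,
    "epochs": 6,
    "lr": 5e-5,
    "weight_decay": 0.05,
    "trainable_blocks": 2,   # last K vision transformer blocks
    "distill_weight": 0.05,
    "distill_type": "mse",   # or "cos"
}
```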
Evaluation
Validation
- Validation loop mirrors training but uses plain BCE for loss reporting.
- Accuracy computed at probability threshold 0.5.
Metrics
- Binary classification accuracy
- Recommended: ROC-AUC, F1-score for imbalanced settings
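A quick way to compute these metrics from collected probabilities; note that scikit-learn is assumed here and is not among the listed dependencies:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

def report_metrics(probs, labels, threshold=0.5):
    """probs: kiln probabilities in [0, 1]; labels: 0/1 ground truth."""
    probs = np.asarray(probs)
    labels = np.asarray(labels)
    preds = (probs >= threshold).astype(int)
    return {
        "accuracy": float((preds == labels).mean()),
        "roc_auc": roc_auc_score(labels, probs),
        "f1": f1_score(labels, preds),
    }
```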
Outputs
- Kiln probability (0–1)
- Optional saliency heatmap and centroid visualization based on attention rollout
Technical Specifications
- Backbone: OpenCLIP ViT-B/16 vision tower (OpenAI pretrained)
- Embedding dimension: CLIP ViT-B/16 image feature dimension (512 in the joint embedding space)
- Kiln anchor: unit-norm reference vector `t_kiln`, saved in the checkpoint
- Head: affine logits over kiln similarity (trainable alpha, beta)
- Compute: single GPU training with mixed precision
- Dependencies: PyTorch, timm, open_clip_torch, torchvision, numpy, Pillow, matplotlib
Usage
The checkpoint is intended for:
- Kiln detection: classify satellite patches as kiln vs. non-kiln
- Kiln localization: estimate kiln centroid from ViT attention rollout
- Research extension: distillation and anchor-based methods for other geospatial assets
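A minimal end-to-end inference sketch, assuming the checkpoint stores the fine-tuned vision weights, head parameters, and anchor under keys like `visual`, `head`, and `t_kiln`; the actual layout and helpers are in `model.py`, `head.py`, and `prompt.py`:

```python
import torch
import torch.nn.functional as F
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# OpenAI-pretrained ViT-B/16 with its matching preprocessing transform.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-16", pretrained="openai")
model = model.to(device).eval()

# Hypothetical checkpoint layout; see model.py / head.py / prompt.py for the real one.
ckpt = torch.load("checkpoints/best.pt", map_location=device)
model.visual.load_state_dict(ckpt["visual"])               # fine-tuned vision encoder
t_kiln = F.normalize(ckpt["t_kiln"].to(device), dim=-1)    # unit-norm kiln anchor
alpha, beta = ckpt["head"]["alpha"], ckpt["head"]["beta"]  # SimilarityHead parameters

image = preprocess(Image.open("patch.png").convert("RGB")).unsqueeze(0).to(device)
with torch.no_grad():
    feats = F.normalize(model.encode_image(image), dim=-1)
    sim = feats @ t_kiln                       # cosine similarity to the kiln anchor
    prob = torch.sigmoid(alpha * sim + beta)   # kiln probability in (0, 1)
print(f"kiln probability: {prob.item():.3f}")
```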
Files Explained
This repository includes the pretrained checkpoint and minimal helpers for inference and inspection.
checkpoints/best.pt
- Best-performing checkpoint.
- Contains: vision encoder, SimilarityHead, kiln anchor (`t_kiln`), and training args.
head.py
- Defines SimilarityHead, mapping kiln similarity to logits.
- Required to reproduce classification outputs.
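A sketch consistent with that description (the initial values of alpha and beta are illustrative; `head.py` is authoritative):

```python
import torch
import torch.nn as nn

class SimilarityHead(nn.Module):
    """Affine map from kiln similarity to a binary logit: logit = alpha * sim + beta."""

    def __init__(self, alpha_init: float = 10.0, beta_init: float = 0.0):
        super().__init__()
        # Trainable scale and bias over the cosine similarity.
        self.alpha = nn.Parameter(torch.tensor(alpha_init))
        self.beta = nn.Parameter(torch.tensor(beta_init))

    def forward(self, sim: torch.Tensor) -> torch.Tensor:
        return self.alpha * sim + self.beta
```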
model.py
- Rebuilds the OpenCLIP ViT-B/16 vision encoder with correct trainable/frozen layers.
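A sketch of how such an encoder can be rebuilt and partially frozen with OpenCLIP; the attribute path `model.visual.transformer.resblocks` follows OpenCLIP's ViT implementation, and the exact logic lives in `model.py`:

```python
import open_clip

def build_vision_encoder(trainable_blocks: int = 2, freeze_logit_scale: bool = True):
    model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-16", pretrained="openai")

    # Freeze everything, then unfreeze the last K transformer blocks of the vision tower.
    for p in model.parameters():
        p.requires_grad = False
    for block in model.visual.transformer.resblocks[-trainable_blocks:]:
        for p in block.parameters():
            p.requires_grad = True

    # Optionally keep the logit scale frozen as well.
    model.logit_scale.requires_grad = not freeze_logit_scale
    return model, preprocess
```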
prompt.py
- Loads and normalizes the kiln anchor embedding (`t_kiln`).
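In practice this amounts to something like the following (the storage format is an assumption; `prompt.py` is authoritative):

```python
import torch
import torch.nn.functional as F

ckpt = torch.load("checkpoints/best.pt", map_location="cpu")
t_kiln = F.normalize(ckpt["t_kiln"].float(), dim=-1)  # unit-norm kiln anchor vector
```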
infer_with_centroid.py
- Inference utility: computes kiln probability, attention-based centroid, and saliency heatmaps.
- Outputs annotated previews under `outputs/`.
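A sketch of the attention-rollout centroid idea, assuming per-layer attention maps (collected, e.g., via forward hooks) that include the CLS token and a 14×14 patch grid for ViT-B/16 at 224×224 input; `infer_with_centroid.py` implements the real version:

```python
import torch

def attention_rollout_centroid(attentions, grid=14):
    """attentions: list of (heads, tokens, tokens) attention maps, tokens = 1 + grid*grid."""
    tokens = attentions[0].shape[-1]
    rollout = torch.eye(tokens)
    for attn in attentions:
        a = attn.mean(dim=0)                   # average over heads
        a = 0.5 * a + 0.5 * torch.eye(tokens)  # add the residual connection
        a = a / a.sum(dim=-1, keepdim=True)    # re-normalize rows
        rollout = a @ rollout                  # accumulate across layers

    # CLS-to-patch attention reshaped onto the patch grid serves as a saliency map.
    saliency = rollout[0, 1:].reshape(grid, grid)
    saliency = saliency / saliency.sum()

    # Centroid = attention-weighted mean of patch coordinates (row, col).
    ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
    cy = (saliency * ys).sum().item()
    cx = (saliency * xs).sum().item()
    return saliency, (cy, cx)
```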
Example Result
Kiln predicted with p=0.999 and centroid estimated from attention rollout.
Citation
If you use this model, please cite:
@misc{kilnclip2025,
  title  = {Kiln CLIP: Detecting and Typing Brick Kilns in South Asia with Vision–Language Models},
  author = {Hamdani, Suleman},
  year   = {2025},
  note   = {Hugging Face repository: sulemanhamdani/kiln-clip-vit-b-16}
}