
Kiln CLIP (ViT-B/16, OpenCLIP, kiln-anchor)

This is a PyTorch checkpoint for a vision-only CLIP variant fine-tuned to detect brick kilns in satellite imagery.
The model uses a kiln anchor embedding (t_kiln) and a lightweight SimilarityHead to classify images as kiln vs. non-kiln.
Training uses a custom loss with optional feature distillation, and an attention-rollout centroid mechanism provides interpretability at inference time.


Model Details

  • Architecture: CLIP-style dual encoder (vision encoder used; text encoder unused during training)
  • Vision encoder: ViT-B/16 (OpenAI initialization via OpenCLIP)
  • Head: SimilarityHead mapping kiln-similarity to binary logit
  • Objective: Binary kiln classification (kiln vs. non-kiln)
  • Additional objective: Feature distillation from a frozen reference model
  • Framework: PyTorch + OpenCLIP + timm
  • Checkpoint contents: vision encoder weights, head weights, kiln anchor vector, training arguments

Loss Function and Distillation

The training criterion combines binary classification and optional distillation:

  • Classification term: Binary cross-entropy with logits (BCEWithLogitsLoss) on kiln vs. non-kiln.
  • Distillation term: Encourages student image features to stay close to a frozen teacher reference model.
    • Types:
      • MSE: Mean squared error between features.
      • Cosine: Penalizes deviation from cosine similarity of 1.
    • Weight: Controlled by distill_weight (default 0.05).
  • Final loss:

Loss = BCE(logits, labels) + distill_weight * DistillLoss(features, ref_features)

  • Distillation is applied only when both student image features and frozen-teacher reference features are available.
  • This design stabilizes training and regularizes the vision encoder.
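
A minimal sketch of this combined criterion, using the argument names listed above (distill_weight, distill_type); the actual training code may structure it differently:

```python
import torch
import torch.nn.functional as F

def kiln_loss(logits, labels, feats=None, ref_feats=None,
              distill_weight=0.05, distill_type="mse"):
    """Binary kiln loss with optional feature distillation (sketch)."""
    loss = F.binary_cross_entropy_with_logits(logits, labels.float())

    # Distillation is applied only when both student and teacher features exist.
    if feats is not None and ref_feats is not None and distill_weight > 0:
        if distill_type == "mse":
            d = F.mse_loss(feats, ref_feats)
        else:  # "cos": penalize deviation from cosine similarity of 1
            d = (1.0 - F.cosine_similarity(feats, ref_feats, dim=-1)).mean()
        loss = loss + distill_weight * d
    return loss
```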

Training Details

Data

  • Satellite imagery patches covering regions in Punjab and Sindh.
  • Positive samples: fixed-chimney bull’s trench kilns (FCBK) and zigzag kilns.
  • Negative samples: non-kiln patches (rural and urban).
  • Images resized to 224×224 and normalized with the CLIP mean/std.
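
For reference, a torchvision transform matching this preprocessing; the repository may instead rely on open_clip's built-in preprocess, and the constants below are the standard OpenAI CLIP normalization values:

```python
from torchvision import transforms

# Standard OpenAI CLIP normalization constants.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(CLIP_MEAN, CLIP_STD),
])
```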

Training Procedure

  • Encode images with the ViT-B/16 encoder → normalize features.
  • Compute cosine similarity to kiln anchor embedding t_kiln.
  • Pass similarity through SimilarityHead to produce logits.
  • Apply loss function described above.
  • Fine-tune only the last K vision blocks, with optional logit scale freezing.
  • Gradient norm clipped at 1.0.
  • Optimizer: AdamW with weight decay.
  • Mixed precision supported.
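
A hedged sketch of this forward pass; the SimilarityHead below is a stand-in affine head with learnable alpha/beta and assumed initial values, and head.py in the repository is the authoritative definition:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityHead(nn.Module):
    """Affine map from kiln similarity to a binary logit (sketch)."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(10.0))  # scale (assumed init)
        self.beta = nn.Parameter(torch.tensor(0.0))    # bias (assumed init)

    def forward(self, sim):
        return self.alpha * sim + self.beta

def forward_pass(vision_encoder, head, images, t_kiln):
    feats = vision_encoder(images)              # (B, D) image features
    feats = F.normalize(feats, dim=-1)          # unit-norm features
    sim = feats @ F.normalize(t_kiln, dim=-1)   # cosine similarity to kiln anchor
    logits = head(sim)                          # (B,) binary logits
    return logits, feats
```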

Hyperparameters

  • Batch size: 64
  • Epochs: 6
  • Learning rate: 5e-5
  • Weight decay: 0.05
  • Trainable blocks: last 2 by default
  • Distillation weight: 0.05
  • Distillation type: mse or cos
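
As an illustration of the partial fine-tuning and optimizer settings above, a sketch using open_clip (attribute names follow open_clip's ViT implementation; model.py is authoritative):

```python
import torch
import open_clip

# Build the ViT-B/16 vision tower and unfreeze only the last K transformer blocks.
model, _, _ = open_clip.create_model_and_transforms("ViT-B-16", pretrained="openai")
visual = model.visual

for p in visual.parameters():
    p.requires_grad = False
K = 2  # trainable blocks: last 2 by default
for block in visual.transformer.resblocks[-K:]:
    for p in block.parameters():
        p.requires_grad = True

trainable = [p for p in visual.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=5e-5, weight_decay=0.05)
# Per step, gradients are clipped to a max norm of 1.0:
# torch.nn.utils.clip_grad_norm_(trainable, max_norm=1.0)
```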

Evaluation

Validation

  • Validation loop mirrors training but uses plain BCE for loss reporting.
  • Accuracy computed at probability threshold 0.5.

Metrics

  • Binary classification accuracy
  • Recommended: ROC-AUC, F1-score for imbalanced settings
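
A small evaluation sketch implementing these metrics (scikit-learn is an extra dependency, not part of the listed requirements):

```python
import torch
from sklearn.metrics import roc_auc_score, f1_score  # optional extra dependency

@torch.no_grad()
def evaluate(probs, labels, threshold=0.5):
    """Accuracy at a 0.5 threshold plus ROC-AUC and F1 for imbalanced settings."""
    preds = (probs >= threshold).long()
    acc = (preds == labels).float().mean().item()
    auc = roc_auc_score(labels.cpu().numpy(), probs.cpu().numpy())
    f1 = f1_score(labels.cpu().numpy(), preds.cpu().numpy())
    return {"accuracy": acc, "roc_auc": auc, "f1": f1}
```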

Outputs

  • Kiln probability (0–1)
  • Optional saliency heatmap and centroid visualization based on attention rollout
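
A minimal sketch of the attention-rollout centroid estimate; infer_with_centroid.py implements the full pipeline, and the details below (head averaging, residual correction, CLS-row saliency) follow the standard rollout recipe rather than the repository's exact code:

```python
import torch

def attention_rollout_centroid(attn_maps, image_size=224, patch_size=16):
    """Estimate a kiln centroid from ViT attention rollout (sketch).

    attn_maps: list of per-layer attention tensors of shape (heads, tokens, tokens),
    where token 0 is CLS and the rest are the 14x14 patch tokens for ViT-B/16.
    """
    tokens = attn_maps[0].shape[-1]
    rollout = torch.eye(tokens)
    for attn in attn_maps:
        a = attn.mean(dim=0)                       # average over heads
        a = a + torch.eye(tokens)                  # add residual connection
        a = a / a.sum(dim=-1, keepdim=True)        # renormalize rows
        rollout = a @ rollout                      # accumulate across layers

    grid = image_size // patch_size                # 14 for ViT-B/16
    saliency = rollout[0, 1:].reshape(grid, grid)  # CLS attention over patches
    saliency = saliency / saliency.sum()

    ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
    cy = (saliency * ys).sum() * patch_size + patch_size / 2  # pixel-space centroid
    cx = (saliency * xs).sum() * patch_size + patch_size / 2
    return (cx.item(), cy.item()), saliency
```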

Technical Specifications

  • Backbone: OpenCLIP ViT-B/16 vision tower (OpenAI pretrained)
  • Embedding dimension: CLIP vision feature dim (ViT-B/16)
  • Kiln anchor: unit-norm reference vector t_kiln, saved in checkpoint
  • Head: affine logits over kiln similarity (trainable alpha, beta)
  • Compute: single GPU training with mixed precision
  • Dependencies: PyTorch, timm, open_clip_torch, torchvision, numpy, Pillow, matplotlib

Usage

The checkpoint is intended for:

  • Kiln detection: classify satellite patches as kiln vs. non-kiln
  • Kiln localization: estimate kiln centroid from ViT attention rollout
  • Research extension: distillation and anchor-based methods for other geospatial assets
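
A hypothetical end-to-end inference sketch; the checkpoint key names ("visual", "head", "t_kiln"), the head parameter layout, and the input filename are assumptions, so defer to model.py and infer_with_centroid.py for the actual loading logic:

```python
import torch
import torch.nn.functional as F
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-16", pretrained="openai")
visual = model.visual.to(device).eval()

ckpt = torch.load("checkpoints/best.pt", map_location=device)
visual.load_state_dict(ckpt["visual"])                   # assumed key name
t_kiln = F.normalize(ckpt["t_kiln"].to(device), dim=-1)  # kiln anchor (assumed key)
alpha = ckpt["head"]["alpha"]                            # assumed head layout
beta = ckpt["head"]["beta"]

image = preprocess(Image.open("patch.png")).unsqueeze(0).to(device)  # hypothetical file
with torch.no_grad():
    feats = F.normalize(visual(image), dim=-1)
    sim = feats @ t_kiln                                 # cosine similarity to anchor
    prob = torch.sigmoid(alpha * sim + beta)             # kiln probability in [0, 1]
print(f"kiln probability: {prob.item():.3f}")
```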

Files Explained

This repository includes the pretrained checkpoint and minimal helpers for inference and inspection.

checkpoints/best.pt

  • Best-performing checkpoint.
  • Contains: vision encoder, SimilarityHead, kiln anchor (t_kiln), and training args.

head.py

  • Defines SimilarityHead, mapping kiln similarity to logits.
  • Required to reproduce classification outputs.

model.py

  • Rebuilds the OpenCLIP ViT-B/16 vision encoder with correct trainable/frozen layers.

prompt.py

  • Loads and normalizes the kiln anchor embedding (t_kiln).

infer_with_centroid.py

  • Inference utility: computes kiln probability, attention-based centroid, and saliency heatmaps.
  • Outputs annotated previews under outputs/.

Example Result

Kiln predicted with p=0.999 and centroid estimated from attention rollout:

[Image: annotated kiln prediction with estimated centroid overlay]

Citation

If you use this model, please cite:

@misc{kilnclip2025,
  title  = {Kiln CLIP: Detecting and Typing Brick Kilns in South Asia with Vision–Language Models},
  author = {Hamdani, Suleman},
  year   = {2025},
  note   = {Hugging Face repository: sulemanhamdani/kiln-clip-vit-b-16}
}
