# core-dino | Resolution-Agnostic Self-Supervised Learning on Satellite Imagery
## Overview

`core-dino` is a resolution-agnostic self-supervised model designed for satellite imagery, trained on the Core-Five dataset using a DiNO-inspired setup. It handles imagery between 30 cm and 2 m resolution, learning strong spatial features without any labels.
## Try It in Colab

Open the interactive demo in Colab to run real multi-resolution inference with the pre-trained weights and visualize spatial embeddings directly.
## Architecture: DiNO × YOLO × I-JEPA

We combine three ideas to build a high-performance backbone for spatial representation learning:
### 1. Multi-Resolution DINO Setup (instead of local-global)

In standard DINO / DINOv2, the student sees cropped or distorted (local) views while the teacher sees global views. In `core-dino`, we replace this with a clean-vs-degraded resolution contrast:

- Teacher gets clean 30 cm satellite imagery.
- Student sees augmented versions of the same scene at varying resolutions (30 cm to 2 m) with photometric and spatial distortions.

This setup encourages the model to learn scale-invariant and semantic-aware features across real-world EO resolution gaps.
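As a rough illustration, the clean/degraded view pair could be built as below. This is a minimal PyTorch sketch, not the repository's API; the function name, the parameter names, and the factor of ~6.67 (2 m / 30 cm) are assumptions based on the description above.

```python
import torch
import torch.nn.functional as F

def make_views(clean_chip: torch.Tensor, max_factor: float = 6.67):
    """Illustrative sketch: build one teacher/student pair from a clean 30 cm chip (B, C, H, W)."""
    teacher_view = clean_chip                                  # teacher: clean, full resolution
    factor = 1.0 + torch.rand(1).item() * (max_factor - 1.0)  # random GSD between 30 cm and ~2 m
    h, w = clean_chip.shape[-2:]
    student_view = F.interpolate(
        clean_chip,
        size=(max(1, round(h / factor)), max(1, round(w / factor))),
        mode="bilinear", align_corners=False,                  # coarser grid -> smaller H x W
    )
    return teacher_view, student_view
```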
### 2. I-JEPA-Style Patch Dropping

We integrate ideas from I-JEPA:

- Random patch regions are dropped from the student input.
- The objective is to align the visible patch embeddings with the teacher's corresponding high-resolution ones.
- This enforces local-global and partial-whole consistency in the latent space.
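A minimal sketch of this kind of block masking is shown below, assuming a (B, C, H, W) student input whose height and width are divisible by the patch size; the patch size and drop ratio are illustrative, not values from the repository.

```python
import torch

def drop_patches(student_input: torch.Tensor, patch: int = 16, drop_ratio: float = 0.4):
    """Zero out random patch-sized blocks of the student input and return the keep-mask.
    The mask can later be resized to feature resolution to select the visible tokens."""
    b, _, h, w = student_input.shape  # assumes h and w are multiples of `patch`
    keep = (torch.rand(b, 1, h // patch, w // patch, device=student_input.device)
            > drop_ratio).float()
    keep = keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return student_input * keep, keep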
### 3. YOLOv11-X as Encoder Backbone
- We use YOLOv11-X, one of the most powerful and recent YOLO variants, as the spatial encoder.
- The backbone is truncated after 23 layers, retaining rich spatial semantics while maintaining efficiency.
- This provides strong priors from supervised detection tasks, now adapted for self-supervised learning.
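Conceptually, "truncated after 23 layers" can be pictured as below, replaying ultralytics' layer-index bookkeeping so concat/skip connections still resolve. This is a hedged sketch that assumes the `ultralytics` package and its `yolo11x.pt` weights; it is not the repository's `backbone.YOLOBackBone` implementation.

```python
import torch.nn as nn
from ultralytics import YOLO  # assumption: YOLOv11-X weights are loaded via ultralytics

class TruncatedYOLO(nn.Module):
    """Illustrative truncation: keep the first 23 layers of the detection model as a spatial encoder."""

    def __init__(self, n_layers: int = 23):
        super().__init__()
        det = YOLO("yolo11x.pt").model       # underlying DetectionModel
        self.layers = det.model[:n_layers]   # truncated nn.Sequential of backbone layers
        self.save = det.save                 # layer indices reused by later concat/skip layers

    def forward(self, x):
        cache = []                           # per-layer outputs needed by skip connections
        for m in self.layers:
            if m.f != -1:                    # input comes from earlier layer(s), not the previous one
                x = cache[m.f] if isinstance(m.f, int) else \
                    [x if j == -1 else cache[j] for j in m.f]
            x = m(x)
            cache.append(x if m.i in self.save else None)
        return x                             # spatial feature map after layer 23
```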
## Training Flow: Resolution-Agnostic DiNO

The training pipeline in `core-dino` follows a student-teacher design inspired by DINO, but adapted for real-world satellite imagery:
### 1. Teacher View (Clean & High-Res)

- Receives a clean 30 cm image without any augmentation.
- Used as the stable reference to guide the student.
### 2. Student View (Augmented Multi-Resolution)

- Receives randomly augmented versions of the same image:
  - Downsampled to resolutions between 30 cm and 2 m
  - Augmented with noise, blur, color jitter, spatial dropout, etc.
- Emulates the resolution variability common in EO imagery.
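For instance, the photometric part of the student pipeline could look like the torchvision sketch below; the exact augmentations and magnitudes used for `core-dino` are not specified here, so these values are assumptions.

```python
import torch
from torchvision import transforms as T

# Illustrative photometric augmentations for the student view (magnitudes are assumptions)
photometric = T.Compose([
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.2),  # illumination / sensor shifts
    T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),              # optical blur
])

def augment_student(view: torch.Tensor, noise_std: float = 0.02) -> torch.Tensor:
    """Apply jitter, blur, and additive noise to a (C, H, W) chip scaled to [0, 1]."""
    out = photometric(view)
    out = out + noise_std * torch.randn_like(out)                 # sensor noise
    return out.clamp(0.0, 1.0)
```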
### 3. Spatial Misalignment & Solution

- Since different student resolutions produce feature maps with different spatial dimensions (H × W), we use bilinear interpolation to resize the student's feature map to match the teacher's spatial shape before computing the contrastive loss.
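In PyTorch terms, this alignment step is essentially a single `F.interpolate` call (shapes below are illustrative):

```python
import torch.nn.functional as F

def align_to_teacher(student_feat, teacher_feat):
    """Resize the student feature map (B, C, Hs, Ws) onto the teacher's (Ht, Wt) grid."""
    return F.interpolate(student_feat, size=teacher_feat.shape[-2:],
                         mode="bilinear", align_corners=False)
```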
### 4. Objective

- Align the spatial token embeddings of the student with the teacher's, pixel-to-pixel and semantically, despite resolution gaps and augmentations.
- This encourages scale-invariant, robust feature learning across real-world variations.
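A hedged sketch of such a spatial, DINO-style objective is shown below: sharpened teacher token distributions (no gradients) supervise the student at every aligned position, optionally restricted to the patches kept after masking. The temperatures, the scalar centering, and the absence of a projection head are simplifications, not the repository's exact loss.

```python
import torch
import torch.nn.functional as F

def spatial_dino_loss(student_feat, teacher_feat, keep_mask=None,
                      t_student=0.1, t_teacher=0.04, center=0.0):
    """Cross-entropy between teacher and student token distributions at each spatial position.
    Both feature maps are (B, C, H, W) and already spatially aligned; keep_mask is (B, 1, H, W)."""
    c = teacher_feat.shape[1]
    s = student_feat.permute(0, 2, 3, 1).reshape(-1, c)             # student tokens, (B*H*W, C)
    t = teacher_feat.permute(0, 2, 3, 1).reshape(-1, c)
    t_probs = F.softmax((t - center) / t_teacher, dim=-1).detach()  # teacher target, no gradients
    loss = -(t_probs * F.log_softmax(s / t_student, dim=-1)).sum(-1)
    if keep_mask is not None:                                       # only visible (unmasked) positions
        loss = loss[keep_mask.reshape(-1) > 0]
    return loss.mean()
```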
## Performance: Latent Quality & Downstream Evaluation

Despite being trained without any labels, `core-dino` demonstrates strong latent alignment and generalization capability, both in visual similarity and in downstream tasks.
### Downstream: Road Extraction (DeepGlobe Dataset)

We evaluated `core-dino` on the DeepGlobe Road Extraction dataset, using it as a frozen backbone in a simple segmentation pipeline.
Setup:

- Both `core-dino` and YOLOv11-X backbones were frozen
- Only a 2-layer convolutional head was trained
- Task: binary road segmentation using IoU loss

Result:

- `core-dino` consistently outperformed the supervised YOLOv11-X backbone across all epochs
- Shows superior latent representation quality, even without task-specific supervision
- Demonstrates better generalization and semantic robustness in downstream transfer tasks
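A minimal version of this evaluation head and loss might look as follows; the backbone's output channel count and the head width are assumptions, not values taken from the experiments above.

```python
import torch
import torch.nn as nn

class RoadHead(nn.Module):
    """Two-layer convolutional head trained on top of a frozen backbone (channel counts assumed)."""
    def __init__(self, in_ch: int = 512, hidden: int = 128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_ch, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),   # one logit map for the road class
        )

    def forward(self, feats):                      # feats: frozen backbone features (B, C, H, W)
        return self.head(feats)

def soft_iou_loss(logits, target, eps: float = 1e-6):
    """Differentiable IoU loss for binary segmentation masks."""
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum(dim=(1, 2, 3))
    union = (probs + target - probs * target).sum(dim=(1, 2, 3))
    return (1.0 - (inter + eps) / (union + eps)).mean()
```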
Reproduce this comparison in Colab:
## Model Details

| Field | Value |
|---|---|
| Parameters | 56.7M |
| Backbone Architecture | YOLOv11-X |
| Input Size | 320 × 320 to 4096 × 4096 |
| Patch Source | Core-Five |
| Resolutions | 30 cm (clean) to 2 m (augmented) |
| Patch Drop | I-JEPA-style masking |
| Loss | DINO contrastive loss |
| Training Time | ~48 h on 1× A100 |
## Quickstart

```bash
pip install torch torchvision huggingface_hub
```

```python
import torch
from huggingface_hub import hf_hub_download

from backbone import YOLOBackBone  # backbone definition shipped with this repository

# Download the pre-trained student checkpoint and load it into the backbone
ckpt = hf_hub_download("gajeshladhar/core-dino", "checkpoints/student.pt")
model = YOLOBackBone().eval()
model.load_state_dict(torch.load(ckpt, map_location="cpu"))
```
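Once loaded, the backbone can be probed with a dummy chip. The snippet below is a sketch that assumes `YOLOBackBone` returns a single spatial feature map; the exact output shape depends on the backbone's stride and channel width.

```python
import torch

with torch.no_grad():
    chip = torch.rand(1, 3, 640, 640)  # any size in the supported 320-4096 range
    feats = model(chip)                # spatial feature map, e.g. (1, C, H', W') if a single tensor is returned
print(feats.shape)
```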
## License

This project is released under the Creative Commons Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0) license.

- Free to use, share, and adapt for non-commercial research
- Commercial use is not permitted without explicit permission

Please provide appropriate credit when using this model in publications or projects.