🌐 core-dino | Resolution-Agnostic Self-Supervised Learning on Satellite Imagery



πŸ”­ Overview

core-dino is a resolution-agnostic self-supervised model designed for satellite imagery, trained on the Core-Five dataset using a DiNO-inspired setup. It handles imagery at ground sample distances from 30β€―cm to 2β€―m, learning strong spatial features without any labels.


πŸ““ Try It in Colab

πŸ‘‰ Open the interactive demo in Colab
Run multi-resolution inference with the pre-trained weights and visualize the spatial embeddings directly.


🧠 Architecture: DiNO Γ— YOLO Γ— I-JEPA

We combine three ideas to build a high-performance backbone for spatial representation learning:

1️⃣ Multi-Resolution DINO Setup (instead of local-global)

In standard DINO / DINOv2, the student sees cropped or distorted views (local), while the teacher sees global views.
In core-dino, we replace this with clean vs degraded resolution contrast:

  • πŸ‘¨β€πŸ« Teacher gets clean 30β€―cm satellite imagery.
  • πŸ‘¨β€πŸŽ“ Student sees augmented versions of the same scene at varying resolutions (30β€―cm β†’ 2β€―m) with photometric and spatial distortions.

This setup encourages the model to learn scale-invariant and semantic-aware features across real-world EO resolution gaps.
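
A minimal sketch of how such clean vs degraded pairs can be produced, assuming the student view is obtained by resampling the clean 30β€―cm chip to a random coarser ground sample distance (the function below is illustrative, not the repo's actual augmentation code):

```python
import random

import torch
import torch.nn.functional as F

def degrade_resolution(img, clean_gsd=0.3, max_gsd=2.0):
    """Simulate a coarser sensor: resample a clean 30 cm chip to a random GSD.

    img: (B, C, H, W) tensor at the clean resolution.
    Returns the degraded student view at its native (smaller) spatial size.
    """
    target_gsd = random.uniform(clean_gsd, max_gsd)
    scale = clean_gsd / target_gsd  # e.g. 0.3 / 2.0 = 0.15 for a 2 m view
    return F.interpolate(img, scale_factor=scale, mode="bilinear", align_corners=False)

# The teacher keeps the clean chip; the student gets a randomly degraded copy
clean = torch.rand(1, 3, 1024, 1024)
student_view = degrade_resolution(clean)
```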

2️⃣ I-JEPA-Style Patch Dropping

We integrate ideas from I-JEPA:

  • Random patch regions are dropped from the student input.
  • The objective is to align the visible patch embeddings with the teacher’s corresponding high-resolution ones.
  • This enforces local-global and partial-whole consistency in the latent space.
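
A minimal sketch of the masking idea above, assuming square patch regions are simply zeroed on the student side and the returned keep-mask is used to select which locations enter the loss (the repo's actual mask generator may differ):

```python
import torch

def apply_patch_drop(img, patch=16, drop_ratio=0.4):
    """Zero out random patch regions of the student input.

    img: (B, C, H, W) with H and W divisible by `patch`.
    Returns the masked image and the (H//patch, W//patch) boolean keep-mask,
    which later selects the visible patch embeddings for the alignment loss.
    """
    b, c, h, w = img.shape
    keep = torch.rand(h // patch, w // patch) > drop_ratio
    mask = keep.repeat_interleave(patch, 0).repeat_interleave(patch, 1)  # (H, W)
    return img * mask.to(img.dtype), keep
```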

3️⃣ YOLOv11-X as Encoder Backbone

  • We use YOLOv11-X, one of the most recent and capable YOLO variants, as the spatial encoder.
  • The backbone is truncated after 23 layers, retaining rich spatial semantics while maintaining efficiency.
  • This provides strong priors from supervised detection tasks, now adapted for self-supervised learning.
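
The card does not show how the truncation is implemented (the repo's `YOLOBackBone` class presumably wraps it), but one way to keep the first 23 modules of an Ultralytics YOLOv11-X model while preserving its internal skip connections looks like this:

```python
import torch
from ultralytics import YOLO

yolo = YOLO("yolo11x.pt").model.eval()  # pretrained YOLOv11-X detection model
layers = yolo.model[:23]                # keep the first 23 modules, dropping the Detect head

def backbone_forward(x):
    """Layer-by-layer forward that honors each module's `f` (input-source) attribute,
    so Concat/skip connections inside the truncated graph still resolve."""
    outputs = []
    for m in layers:
        if m.f != -1:  # input is not simply the previous module's output
            x = outputs[m.f] if isinstance(m.f, int) else \
                [x if j == -1 else outputs[j] for j in m.f]
        x = m(x)
        outputs.append(x)
    return x  # spatial feature map after module 22

with torch.no_grad():
    feats = backbone_forward(torch.rand(1, 3, 640, 640))
```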

πŸ§ͺ Training Flow: Resolution-Agnostic DiNO

The training pipeline in core-dino follows a student-teacher design inspired by DINO, but adapted for real-world satellite imagery:

πŸ‘¨β€πŸ« 1. Teacher View (Clean & High-Res)

  • Receives a clean 30β€―cm image without any augmentation.
  • Used as the stable reference to guide the student.
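
The card does not say how the teacher weights are maintained; in standard DINO the teacher is an exponential moving average (EMA) of the student, so a plausible sketch (under that assumption) is:

```python
import torch

@torch.no_grad()
def update_teacher(teacher, student, momentum=0.996):
    # EMA update: the teacher drifts slowly toward the student after each step,
    # keeping the clean-view branch a smoothed, stable reference.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s.detach(), alpha=1.0 - momentum)
```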

πŸ‘¨β€πŸŽ“ 2. Student View (Augmented Multi-Resolution)

  • Receives randomly augmented versions of the same image:
    • Downsampled to resolutions between 30β€―cm and 2β€―m
    • Augmented with noise, blur, color jitter, spatial dropout, etc.
  • Emulates resolution variability common in EO imagery.
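
A rough torchvision approximation of that augmentation stack (all parameters below are assumptions; see the resolution-degradation sketch earlier for the downsampling step):

```python
import torch
from torchvision import transforms as T

# Stand-ins for "noise, blur, color jitter, spatial dropout"
photometric = T.Compose([
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.2),
    T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
])
spatial_dropout = T.RandomErasing(p=0.5, scale=(0.02, 0.1), value=0)

def student_augment(chip):
    """chip: (C, H, W) float tensor in [0, 1], already resampled to a coarser GSD."""
    chip = photometric(chip)
    chip = (chip + 0.02 * torch.randn_like(chip)).clamp(0, 1)  # additive sensor-like noise
    return spatial_dropout(chip)
```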

⚠️ 3. Spatial Misalignment & Solution

  • Since different student resolutions produce different spatial dimensions (H Γ— W),
    we use bilinear interpolation to resize the student’s feature map to match the teacher's spatial shape before computing the contrastive loss.
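
Concretely, the shape fix is a single interpolation call on the student features (names here are illustrative):

```python
import torch.nn.functional as F

def align_spatial(student_feats, teacher_feats):
    """Resize the student feature map (B, C, Hs, Ws) to the teacher's (Ht, Wt)."""
    return F.interpolate(
        student_feats,
        size=teacher_feats.shape[-2:],
        mode="bilinear",
        align_corners=False,
    )
```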

🎯 4. Objective

  • Align the spatial token embeddings of the student with the teacher β€” pixel-to-pixel and semantically β€” despite resolution gaps and augmentations.
  • Encourages scale-invariant, robust feature learning across real-world variations.
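
The card names the objective a DINO contrastive loss applied per spatial location; a minimal sketch, assuming a DINO-style temperature-scaled cross-entropy between teacher and student token distributions (the temperatures, and the omission of centering, are assumptions):

```python
import torch
import torch.nn.functional as F

def dino_spatial_loss(student_feats, teacher_feats, t_student=0.1, t_teacher=0.04):
    """Both inputs are (B, C, H, W) and already spatially aligned.
    Every (h, w) location is treated as a token whose channel distribution
    must match the teacher's at the same location."""
    s = student_feats.flatten(2).transpose(1, 2)          # (B, H*W, C)
    t = teacher_feats.flatten(2).transpose(1, 2)
    t_probs = F.softmax(t.detach() / t_teacher, dim=-1)   # soft targets, no gradient
    s_logp = F.log_softmax(s / t_student, dim=-1)
    return -(t_probs * s_logp).sum(dim=-1).mean()
```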

πŸ“ˆ Performance: Latent Quality & Downstream Evaluation

Despite being trained without any labels, core-dino demonstrates strong latent alignment and generalization capability β€” both in visual similarity and downstream tasks.

πŸ›£οΈ Downstream: Road Extraction (DeepGlobe Dataset)

We evaluated core-dino on the DeepGlobe Road Extraction Dataset, using it as a frozen backbone in a simple segmentation pipeline.

  • Setup:

    • Both core-dino and YOLOv11-X backbones were frozen
    • Only a 2-layer convolutional head was trained
    • Task: Binary road segmentation using IoU loss
  • Result:

    • core-dino consistently outperformed the supervised YOLOv11-X backbone across all epochs
    • Shows superior latent representation quality, even without task-specific supervision
    • Demonstrates better generalization and semantic robustness in downstream transfer tasks

πŸ““ Reproduce this comparison in the Colab notebook linked above.
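
A minimal sketch of the evaluation head described above; only the frozen backbone, 2-layer convolutional head, and IoU loss come from the card, while the layer widths and the soft-IoU formulation are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RoadHead(nn.Module):
    """Two-layer convolutional head trained on top of frozen backbone features."""
    def __init__(self, in_channels, hidden=128):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(hidden, 1, kernel_size=1)

    def forward(self, feats, out_hw):
        x = F.relu(self.conv1(feats))
        logits = self.conv2(x)
        # Upsample coarse logits back to the label resolution
        return F.interpolate(logits, size=out_hw, mode="bilinear", align_corners=False)

def iou_loss(logits, target, eps=1e-6):
    """Soft (differentiable) IoU loss for binary road masks."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = (prob + target - prob * target).sum(dim=(1, 2, 3))
    return (1.0 - (inter + eps) / (union + eps)).mean()
```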


Figure: Downstream performance comparison on DeepGlobe road extraction (frozen core-dino vs. frozen YOLOv11-X).


πŸ—‚οΈ Model Details

| Field | Value |
|---|---|
| Parameters | 56.7M |
| Backbone Architecture | YOLOv11-X |
| Input Size | 320 Γ— 320 to 4096 Γ— 4096 |
| Patch Source | Core-Five |
| Resolutions | 30β€―cm (clean) β†’ 2β€―m (augmented) |
| Patch Drop | I-JEPA-style masking |
| Loss | DINO contrastive loss |
| Training Time | ~48β€―h on 1Γ—A100 |

πŸš€ Quickstart

```bash
pip install torch torchvision huggingface_hub
```

```python
from huggingface_hub import hf_hub_download
import torch

from backbone import YOLOBackBone  # backbone definition shipped in the core-dino repository

# Download the pre-trained student checkpoint and load it into the truncated YOLOv11-X backbone
ckpt = hf_hub_download("gajeshladhar/core-dino", "checkpoints/student.pt")
model = YOLOBackBone().eval()
model.load_state_dict(torch.load(ckpt, map_location="cpu"))
```
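
Once loaded, the frozen model acts as a pure feature extractor. A quick smoke test on a random chip, continuing from the snippet above (the exact output shape depends on the backbone stride and is not documented here):

```python
# Any input size within the supported 320 Γ— 320 to 4096 Γ— 4096 range should work
x = torch.rand(1, 3, 1024, 1024)
with torch.no_grad():
    feats = model(x)  # spatial feature map for downstream heads
print(feats.shape)
```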

πŸ’³ License

This project is released under the Creative Commons Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0) license.

βœ… Free to use, share, and adapt for non-commercial research
❌ Commercial use is not permitted without explicit permission
πŸ“Œ Please provide appropriate credit when using this model in publications or projects.
