SCOLD: A Vision-Language Foundation Model for Leaf Disease Identification
SCOLD (Leaf Diseases Vision-Language) is a multimodal model that maps images and text descriptions into a shared embedding space. It uses a Swin Transformer as the image encoder and RoBERTa as the text encoder, and projects the output of each encoder into a common 512-dimensional space.
The model is intended for cross-modal retrieval, few-shot classification, and explainable AI in agriculture, in particular plant disease diagnosis from images combined with domain-specific text prompts.
Model Details
Component | Architecture |
---|---|
Image Encoder | Swin Base (patch4, window7, 224 resolution) |
Text Encoder | RoBERTa-base |
Projection Head | Linear layer (to 512-D space) |
Normalization | L2 on both embeddings |
Training Task | Contrastive learning |
The final image and text embeddings are L2-normalized and compared by cosine similarity, so that contrastive training aligns matching image-text pairs in the shared space.
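To make the alignment step concrete, here is a minimal, dependency-free sketch of L2 normalization, the cosine similarity matrix, and a symmetric contrastive loss in the style of CLIP. The exact loss and the temperature value (`0.07`) are assumptions, since this card only states "contrastive learning"; the function names and toy 2-D vectors are hypothetical and stand in for the 512-D embeddings.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (the L2 normalization step)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def similarity_matrix(image_embs, text_embs):
    """Cosine similarity between every image and every text embedding."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row k is k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[k]
    return total / len(logits)

def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image losses (assumed form)."""
    sims = similarity_matrix(image_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sims]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy batch: pair k of images matches pair k of texts.
toy_images = [[1.0, 0.0], [0.0, 1.0]]
toy_texts = [[2.0, 0.1], [0.1, 3.0]]

loss_aligned = symmetric_contrastive_loss(toy_images, toy_texts)
loss_shuffled = symmetric_contrastive_loss(toy_images, toy_texts[::-1])
print(f"aligned: {loss_aligned:.4f}, shuffled: {loss_shuffled:.4f}")
```

The loss is low when each image is most similar to its own text and high when the pairs are shuffled. With PyTorch tensors the same steps would typically use `torch.nn.functional.normalize` and `torch.nn.functional.cross_entropy`.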
Intended Use
- Vision-language embedding for classification or retrieval tasks
- Few-shot learning in agricultural or medical datasets
- Multimodal interpretability or zero-shot transfer
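As an illustration of the few-shot use case, the sketch below classifies a query embedding by comparing it to class prototypes, each being the normalized mean of a few labeled support embeddings. This nearest-prototype scheme is one common approach and an assumption here, not necessarily the authors' evaluation protocol; the labels, 3-D vectors, and function names are hypothetical placeholders for embeddings produced by the image encoder.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def class_prototypes(support):
    """Map each label to the normalized mean of its normalized support embeddings."""
    prototypes = {}
    for label, embs in support.items():
        normed = [l2_normalize(e) for e in embs]
        mean = [sum(dim) / len(normed) for dim in zip(*normed)]
        prototypes[label] = l2_normalize(mean)
    return prototypes

def classify(query_emb, prototypes):
    """Return the label whose prototype has the highest cosine similarity."""
    q = l2_normalize(query_emb)
    scores = {
        label: sum(a * b for a, b in zip(q, proto))
        for label, proto in prototypes.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical support set: two labeled embeddings per class.
support = {
    "healthy": [[0.9, 0.1, 0.0], [1.0, 0.2, 0.1]],
    "rust": [[0.1, 0.9, 0.2], [0.0, 1.0, 0.1]],
}
prototypes = class_prototypes(support)
label, scores = classify([0.2, 0.8, 0.1], prototypes)
print(label, scores)
```

Because the embeddings are unit length, the dot product used for scoring equals cosine similarity, the same measure the model is trained with.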
How to Use
import torch
from transformers import RobertaTokenizer
from torchvision import transforms
from PIL import Image
from modeling_lvl import LVL # Replace with your module or package
# Load model
model = LVL()
model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))
model.eval()
# Text preprocessing
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
text = "A maize leaf with bacterial blight"
inputs = tokenizer(text, return_tensors="pt")
# Image preprocessing
image = Image.open("path_to_leaf.jpg").convert("RGB")
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
image_tensor = transform(image).unsqueeze(0)
# Inference
with torch.no_grad():
    image_emb, text_emb = model(image_tensor, inputs["input_ids"], inputs["attention_mask"])
    # Cosine similarity between the image embedding and the text embedding
    similarity = torch.nn.functional.cosine_similarity(image_emb, text_emb)
print(f"Similarity score: {similarity.item():.4f}")
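A single similarity score is easier to interpret when several candidate prompts are compared for the same image. The sketch below ranks prompts by cosine similarity and converts the scores to probabilities with a softmax. It is a dependency-free illustration: `rank_prompts`, the temperature of `0.07`, the second prompt, and the 3-D vectors are all assumptions, and in practice the embeddings would come from `model(...)` as in the snippet above.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_prompts(image_emb, prompt_embs, temperature=0.07):
    """Return (prompt, probability) pairs sorted from best to worst match."""
    prompts = list(prompt_embs)
    logits = [cosine(image_emb, prompt_embs[p]) / temperature for p in prompts]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - m) for value in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sorted(zip(prompts, probs), key=lambda pair: pair[1], reverse=True)

# Hypothetical embeddings standing in for model outputs.
image_emb = [0.1, 0.7, 0.2]
prompt_embs = {
    "A maize leaf with bacterial blight": [0.0, 0.9, 0.3],
    "A healthy maize leaf": [0.8, 0.1, 0.1],
}

ranking = rank_prompts(image_emb, prompt_embs)
for prompt, prob in ranking:
    print(f"{prompt}: {prob:.3f}")
```

The top-ranked prompt acts as a zero-shot prediction, and the full ranking gives a rough, text-based explanation of what the model sees in the image.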
Please cite this paper if this code is useful for you!
@misc{quoc2025visionlanguage,
  author = {Quoc, K. N. and Thu, L. L. T. and Quach, L. D.},
  title = {A Vision-Language Foundation Model for Leaf Disease Identification},
  year = {2025},
  publisher = {Authorea Preprints},
  doi = {10.36227/techrxiv.174062971.11176782/v1}
}