🌿 SCOLD: A Vision-Language Foundation Model for Leaf Disease Identification

SCOLD (Leaf Diseases Vision-Language) is a multimodal model that maps images and text descriptions into a shared embedding space. It combines a Swin Transformer image encoder with a RoBERTa text encoder, each projected into a common 512-dimensional space.

The model was developed for cross-modal retrieval, few-shot classification, and explainable AI in agriculture, in particular plant disease diagnosis from images and domain-specific text prompts.


🚀 Model Details

| Component | Architecture |
|---|---|
| Image Encoder | Swin Base (patch4, window7, 224 resolution) |
| Text Encoder | RoBERTa-base |
| Projection Head | Linear layer (to 512-D space) |
| Normalization | L2 on both embeddings |
| Training Task | Contrastive learning |

The final embeddings from image and text encoders are aligned using cosine similarity.
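The alignment step can be illustrated with a small, dependency-free sketch. It shows L2 normalization, the cosine similarity matrix between a batch of image and text embeddings, and a symmetric contrastive loss in the style of CLIP. The exact loss used to train SCOLD is not stated here, so the loss form, the `temperature` value of 0.07, and all function names below are assumptions made for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def similarity_matrix(image_embs, text_embs):
    """Cosine similarity between every image and every text embedding.

    After L2 normalization, cosine similarity reduces to a dot product.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in texts] for i in images]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        top = max(row)  # subtract the max for numerical stability
        log_sum_exp = top + math.log(sum(math.exp(v - top) for v in row))
        total += log_sum_exp - row[k]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy (assumed form)."""
    sims = similarity_matrix(image_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sims]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for text-to-image
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))
```

In this sketch, row `k` of `image_embs` and row `k` of `text_embs` are treated as a matching pair, so the loss is low when the diagonal of the similarity matrix dominates each row and column.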


✅ Intended Use

  • Vision-language embedding for classification or retrieval tasks
  • Few-shot learning in agricultural or medical datasets
  • Multimodal interpretability or zero-shot transfer
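As one possible way to use the embeddings for few-shot classification, the sketch below builds a prototype per class by averaging a handful of labelled image embeddings, then assigns a query to the nearest prototype by cosine similarity. This nearest-class-mean scheme is an illustrative assumption, not necessarily the protocol used to evaluate SCOLD; the function names and class labels are hypothetical.

```python
import math
from collections import defaultdict


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def build_prototypes(support_embs, support_labels):
    """Average the normalized support embeddings of each class."""
    grouped = defaultdict(list)
    for emb, label in zip(support_embs, support_labels):
        grouped[label].append(l2_normalize(emb))
    prototypes = {}
    for label, embs in grouped.items():
        mean = [sum(column) / len(embs) for column in zip(*embs)]
        prototypes[label] = l2_normalize(mean)  # re-normalize the class mean
    return prototypes


def classify(query_emb, prototypes):
    """Return the best label and the cosine score for every class."""
    query = l2_normalize(query_emb)
    scores = {
        label: sum(a * b for a, b in zip(query, proto))
        for label, proto in prototypes.items()
    }
    return max(scores, key=scores.get), scores
```

In practice `support_embs` and the query would be image embeddings produced by the model (converted to lists, or the same arithmetic done on tensors); the toy 2-D vectors used to exercise the sketch stand in for the 512-D outputs.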

🧪 How to Use

import torch
from transformers import RobertaTokenizer
from torchvision import transforms
from PIL import Image
from modeling_lvl import LVL  # Replace with your module or package

# Load model
model = LVL()
model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))
model.eval()

# Text preprocessing
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
text = "A maize leaf with bacterial blight"
inputs = tokenizer(text, return_tensors="pt")

# Image preprocessing
image = Image.open("path_to_leaf.jpg").convert("RGB")
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor()
])
image_tensor = transform(image).unsqueeze(0)

# Inference
with torch.no_grad():
    image_emb, text_emb = model(image_tensor, inputs["input_ids"], inputs["attention_mask"])
    similarity = torch.nn.functional.cosine_similarity(image_emb, text_emb)
    print(f"Similarity score: {similarity.item():.4f}")
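The single-pair similarity above extends naturally to zero-shot classification: embed several candidate prompts, compare each with the image embedding, and rank them. The sketch below assumes the embeddings have already been computed and converted to plain lists (for example with `.squeeze(0).tolist()`). The `rank_prompts` helper, the softmax `temperature` of 0.07, and any prompts other than the maize example are hypothetical and chosen for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def rank_prompts(image_emb, text_embs, prompts, temperature=0.07):
    """Rank candidate prompts by softmax over cosine similarity to one image."""
    if len(text_embs) != len(prompts):
        raise ValueError("need exactly one embedding per prompt")
    image = l2_normalize(image_emb)
    sims = [
        sum(a * b for a, b in zip(image, l2_normalize(text)))
        for text in text_embs
    ]
    logits = [s / temperature for s in sims]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Highest probability first.
    return sorted(zip(prompts, probs), key=lambda pair: pair[1], reverse=True)


# Toy 3-D vectors standing in for real 512-D embeddings.
candidates = [
    "A maize leaf with bacterial blight",
    "A healthy maize leaf",           # hypothetical prompt
    "A maize leaf with common rust",  # hypothetical prompt
]
ranking = rank_prompts(
    [0.9, 0.1, 0.0],
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    candidates,
)
for prompt, prob in ranking:
    print(f"{prob:.3f}  {prompt}")
```

A lower temperature sharpens the distribution toward the best-matching prompt, while a higher one flattens it; the ordering of the prompts does not depend on the temperature.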

If you find this code useful, please cite the paper:

@misc{quoc2025visionlanguage,
  author       = {Quoc, K. N. and Thu, L. L. T. and Quach, L. D.},
  title        = {A Vision-Language Foundation Model for Leaf Disease Identification},
  year         = {2025},
  publisher    = {Authorea Preprints},
  doi          = {10.36227/techrxiv.174062971.11176782/v1}
}