---
license: cc-by-4.0
language:
- en
metrics:
- accuracy
- recall
pipeline_tag: image-to-text
tags:
- agriculture
- leaf
- disease
datasets:
- enalis/LeafNet
library_name: transformers
---
# 🌿 SCOLD: A Vision-Language Foundation Model for Leaf Disease Identification
**SCOLD** (Leaf Diseases Vision-Language) is a multimodal model that maps **images** and **text descriptions** into a shared embedding space. It combines a [Swin Transformer](https://huggingface.co/timm/swin_tiny_patch4_window7_224) as the **image encoder** and [RoBERTa](https://huggingface.co/roberta-base) as the **text encoder**, with both outputs projected into a 512-dimensional common space.
This model is developed for **cross-modal retrieval**, **few-shot classification**, and **explainable AI in agriculture**, especially for plant disease diagnosis from both images and domain-specific text prompts.
---
## 🚀 Model Details
| Component | Architecture |
|------------------|-------------------------------------------|
| Image Encoder | Swin Base (patch4, window7, 224 resolution) |
| Text Encoder | RoBERTa-base |
| Projection Head | Linear layer (to 512-D space) |
| Normalization | L2 on both embeddings |
| Training Task | Contrastive learning |
The final embeddings from image and text encoders are aligned using cosine similarity.
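The repository's `model.py` is not reproduced here, but a dual encoder of this kind can be sketched as follows. Everything in this snippet is an assumption inferred from the table above and the usage example below, not the released code: the class name `LVL`, CLS-token pooling for RoBERTa, the exact Swin variant, and the temperature value.

```python
import timm
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import RobertaModel

class LVL(nn.Module):
    """Sketch of the dual encoder described above (not the official code)."""
    def __init__(self, embed_dim=512):
        super().__init__()
        # Image encoder: Swin with the classifier head removed (num_classes=0),
        # so the forward pass returns pooled features.
        self.image_encoder = timm.create_model(
            "swin_base_patch4_window7_224", pretrained=True, num_classes=0
        )
        self.text_encoder = RobertaModel.from_pretrained("roberta-base")
        # Linear projection heads into the shared 512-D space.
        self.image_proj = nn.Linear(self.image_encoder.num_features, embed_dim)
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, embed_dim)

    def forward(self, pixel_values, input_ids, attention_mask):
        img = self.image_proj(self.image_encoder(pixel_values))
        # Use the <s> (first) token as the sentence representation.
        txt = self.text_proj(
            self.text_encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state[:, 0]
        )
        # L2-normalize so that dot products equal cosine similarities.
        return F.normalize(img, dim=-1), F.normalize(txt, dim=-1)

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE (CLIP-style): matched image-text pairs in a batch
    # are positives, all other pairings are negatives.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```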
---
### ✅ Intended Use
- Vision-language embedding for classification or retrieval tasks
- Few-shot learning in agricultural or medical datasets
- Multimodal interpretability or zero-shot transfer (see the sketch below)
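
Zero-shot classification, for instance, reduces to retrieval: encode one text prompt per candidate disease and rank the prompts by cosine similarity to the image embedding. A minimal sketch, assuming the embeddings are already L2-normalized as described above (the function name, signature, and temperature are illustrative, not part of the released API):

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb, text_embs, class_names, temperature=0.07):
    # image_emb: (1, 512); text_embs: (num_classes, 512). Both are assumed
    # L2-normalized, so the matrix product below yields cosine similarities.
    probs = F.softmax(image_emb @ text_embs.t() / temperature, dim=-1)
    best = probs.argmax(dim=-1).item()
    return class_names[best], probs[0, best].item()
```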
---
## 🧪 How to Use
```python
import torch
from transformers import RobertaTokenizer
from torchvision import transforms
from PIL import Image
from model import LVL # Replace with your module or package
# Load model
model = LVL()
model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))
model.eval()
# Text preprocessing
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
text = "A maize leaf with bacterial blight"
inputs = tokenizer(text, return_tensors="pt")
# Image preprocessing
image = Image.open("path_to_leaf.jpg").convert("RGB")
transform = transforms.Compose([
transforms.Resize((224, 224)),
transforms.ToTensor()
])
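# NOTE (assumption): if SCOLD was trained on ImageNet-normalized inputs, also add
# transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
# to the Compose pipeline above.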
image_tensor = transform(image).unsqueeze(0)
# Inference
with torch.no_grad():
image_emb, text_emb = model(image_tensor, inputs["input_ids"], inputs["attention_mask"])
similarity = torch.nn.functional.cosine_similarity(image_emb, text_emb)
print(f"Similarity score: {similarity.item():.4f}")
```
Please cite the following paper if you find this model useful:
```bibtex
@article{quoc2025vision,
title={A Vision-Language Foundation Model for Leaf Disease Identification},
author={Quoc, Khang Nguyen and Thu, Lan Le Thi and Quach, Luyl-Da},
journal={arXiv preprint arXiv:2505.07019},
year={2025}
}
```
A live demo is available [here](https://leafclip.streamlit.app/).