---
license: cc-by-4.0
language:
- en
metrics:
- accuracy
- recall
pipeline_tag: image-to-text
tags:
- agriculture
- leaf
- disease
datasets:
- enalis/LeafNet
library_name: transformers
---

# 🌿 SCOLD: A Vision-Language Foundation Model for Leaf Disease Identification

**SCOLD** (Leaf Diseases Vision-Language) is a multimodal model that maps **images** and **text descriptions** into a shared embedding space. It combines a [Swin Transformer](https://huggingface.co/timm/swin_tiny_patch4_window7_224) as the **image encoder** with [RoBERTa](https://huggingface.co/roberta-base) as the **text encoder**; the outputs of both encoders are projected into a common 512-dimensional space.

The model is designed for **cross-modal retrieval**, **few-shot classification**, and **explainable AI in agriculture**, particularly plant disease diagnosis from images paired with domain-specific text prompts.

---

## Model Details

| Component       | Architecture                                |
|-----------------|---------------------------------------------|
| Image Encoder   | Swin Base (patch4, window7, 224 resolution) |
| Text Encoder    | RoBERTa-base                                |
| Projection Head | Linear layer (to 512-D space)               |
| Normalization   | L2 on both embeddings                       |
| Training Task   | Contrastive learning                        |

The final embeddings from the image and text encoders are aligned using cosine similarity.

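
The table lists L2 normalization and contrastive learning but does not spell out the loss. The following is a minimal, dependency-free sketch of one common choice, a symmetric CLIP-style contrastive loss over cosine similarities. The function names, the temperature of 0.07, and the symmetric form are assumptions made for illustration and are not taken from the SCOLD training code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Hypothetical CLIP-style loss; item i of each list is a matching pair."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # Cosine similarity of every image with every text, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Transpose to score each text against every image.
    text_to_image = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(text_to_image))


# Toy 2-D embeddings: aligned pairs should give a lower loss than swapped pairs.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 3.0]])
swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 3.0], [2.0, 0.0]])
print(f"aligned: {aligned:.4f}, swapped: {swapped:.4f}")
```

Because both embeddings are normalized first, the dot product inside the loss is exactly the cosine similarity mentioned above, so vector length has no effect on the result.
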

---

### ✅ Intended Use

- Vision-language embedding for classification or retrieval tasks
- Few-shot learning on agricultural or medical datasets
- Multimodal interpretability or zero-shot transfer

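
With a shared embedding space, zero-shot classification reduces to comparing one image embedding against the embeddings of several candidate text prompts and picking the closest. The sketch below shows only that ranking step on plain Python lists; `rank_prompts`, the prompt strings, and the 3-dimensional toy vectors are hypothetical stand-ins for the 512-dimensional embeddings the model would return.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_prompts(image_emb, prompt_embs):
    """Return (prompt, score) pairs sorted from most to least similar."""
    scores = [
        (prompt, cosine_similarity(image_emb, emb))
        for prompt, emb in prompt_embs.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (illustrative values only, not real model outputs).
image_emb = [0.9, 0.1, 0.2]
prompt_embs = {
    "A maize leaf with bacterial blight": [1.0, 0.0, 0.1],
    "A healthy maize leaf": [0.0, 1.0, 0.0],
    "A maize leaf with rust": [0.1, 0.2, 1.0],
}

ranking = rank_prompts(image_emb, prompt_embs)
best_prompt, best_score = ranking[0]
print(f"Predicted: {best_prompt} ({best_score:.4f})")
```

In practice each vector would come from the model itself, with one text embedding per class prompt; the same ranking applied to many image embeddings and one text query gives text-to-image retrieval.
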
---

## 🧪 How to Use

```python
import torch
from transformers import RobertaTokenizer
from torchvision import transforms
from PIL import Image
from model import LVL  # Replace with your module or package

# Load the model weights on CPU and switch to inference mode
model = LVL()
model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))
model.eval()

# Text preprocessing
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
text = "A maize leaf with bacterial blight"
inputs = tokenizer(text, return_tensors="pt")

# Image preprocessing: resize to 224x224 and add a batch dimension
image = Image.open("path_to_leaf.jpg").convert("RGB")
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
image_tensor = transform(image).unsqueeze(0)

# Inference: embed both inputs and compare them by cosine similarity
with torch.no_grad():
    image_emb, text_emb = model(image_tensor, inputs["input_ids"], inputs["attention_mask"])
    similarity = torch.nn.functional.cosine_similarity(image_emb, text_emb)

print(f"Similarity score: {similarity.item():.4f}")
```

If you find this code useful, please cite the paper:

```
@article{quoc2025vision,
  title={A Vision-Language Foundation Model for Leaf Disease Identification},
  author={Quoc, Khang Nguyen and Thu, Lan Le Thi and Quach, Luyl-Da},
  journal={arXiv preprint arXiv:2505.07019},
  year={2025}
}
```

A live demo is available [here](https://leafclip.streamlit.app/).