	---
	license: cc-by-4.0
	language:
	- en
	metrics:
	- accuracy
	- recall
	pipeline_tag: image-to-text
	tags:
	- agriculture
	- leaf
	- disease
	datasets:
	- enalis/LeafNet
	library_name: transformers
	---

	# 🌿 SCOLD: A Vision-Language Foundation Model for Leaf Disease Identification

	SCOLD (Leaf Diseases Vision-Language) is a multimodal model that maps images and text descriptions into a shared embedding space. It combines a [Swin Transformer](https://huggingface.co/timm/swin_tiny_patch4_window7_224) as the image encoder and [RoBERTa](https://huggingface.co/roberta-base) as the text encoder, with both outputs projected into a common 512-dimensional space.

	This model is developed for cross-modal retrieval, few-shot classification, and explainable AI in agriculture, especially for plant disease diagnosis from both images and domain-specific text prompts.

	---

	## 🚀 Model Details

	| Component       | Architecture                                |
	|-----------------|---------------------------------------------|
	| Image Encoder   | Swin Base (patch4, window7, 224 resolution) |
	| Text Encoder    | RoBERTa-base                                |
	| Projection Head | Linear layer (to 512-D space)               |
	| Normalization   | L2 on both embeddings                       |
	| Training Task   | Contrastive learning                        |

	The final embeddings from image and text encoders are aligned using cosine similarity.
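
To make the alignment step concrete, here is a minimal sketch of L2 normalization followed by a symmetric contrastive (InfoNCE-style) loss over a batch of paired image and text embeddings. The model card only states that training is contrastive, so the symmetric cross-entropy form, the `temperature` value of 0.07, and the function names `l2_normalize`, `cross_entropy_diagonal`, and `contrastive_loss` are assumptions made for illustration. The code uses plain Python lists so that it runs without dependencies; a real training loop would use tensors.

```python
import math

def l2_normalize(vec):
    """Scale a non-zero vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss (assumed form, not confirmed by the model card)."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # After L2 normalization the dot product equals the cosine similarity.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy batch: two image/text pairs in a 3-D space (the model card specifies 512-D).
images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
matched = contrastive_loss(images, [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]])
swapped = contrastive_loss(images, [[0.1, 0.9, 0.0], [0.9, 0.1, 0.0]])
print(matched < swapped)  # correctly paired embeddings give the lower loss
```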

	---

	### ✅ Intended Use
	- Vision-language embedding for classification or retrieval tasks
	- Few-shot learning in agricultural or medical datasets
	- Multimodal interpretability or zero-shot transfer

	---

	## 🧪 How to Use

	```python
	import torch
	from transformers import RobertaTokenizer
	from torchvision import transforms
	from PIL import Image
	from model import LVL # Replace with your module or package

	# Load model
	model = LVL()
	model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))
	model.eval()

	# Text preprocessing
	tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
	text = "A maize leaf with bacterial blight"
	inputs = tokenizer(text, return_tensors="pt")

	# Image preprocessing
	image = Image.open("path_to_leaf.jpg").convert("RGB")
	transform = transforms.Compose([
	    transforms.Resize((224, 224)),
	    transforms.ToTensor(),
	])
	image_tensor = transform(image).unsqueeze(0)

	# Inference
	with torch.no_grad():
	    image_emb, text_emb = model(image_tensor, inputs["input_ids"], inputs["attention_mask"])
	    similarity = torch.nn.functional.cosine_similarity(image_emb, text_emb)
	print(f"Similarity score: {similarity.item():.4f}")
	```
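
The single similarity score above extends naturally to zero-shot classification: embed one prompt per candidate class, compare each with the image embedding, and pick the best match. The sketch below shows only that ranking step, using hand-made 3-D vectors in plain Python. The helper `rank_prompts`, the prompt strings, and the vectors are hypothetical stand-ins for the 512-D embeddings the model would return.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_prompts(image_emb, prompt_embs):
    """Return (prompt, score) pairs sorted from most to least similar."""
    scores = {p: cosine_similarity(image_emb, e) for p, e in prompt_embs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical embeddings; in practice these come from the image and text encoders.
image_emb = [0.8, 0.1, 0.1]
prompt_embs = {
    "A maize leaf with bacterial blight": [0.9, 0.0, 0.1],
    "A healthy maize leaf": [0.0, 1.0, 0.0],
    "A maize leaf with rust": [0.1, 0.2, 0.9],
}

ranking = rank_prompts(image_emb, prompt_embs)
best_prompt, best_score = ranking[0]
print(f"Predicted: {best_prompt} ({best_score:.4f})")
```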

	If you find this code useful, please cite the paper:

	```bibtex
	@article{quoc2025vision,
	  title={A Vision-Language Foundation Model for Leaf Disease Identification},
	  author={Quoc, Khang Nguyen and Thu, Lan Le Thi and Quach, Luyl-Da},
	  journal={arXiv preprint arXiv:2505.07019},
	  year={2025}
	}
	```

	A live demo is available [here](https://leafclip.streamlit.app/).