Update README.md
README.md
CHANGED
@@ -13,4 +13,89 @@ tags:
|
|
13 |
datasets:
|
14 |
- enalis/LeafNet
|
15 |
library_name: transformers
|
16 |
-
---
|
13 |
datasets:
|
14 |
- enalis/LeafNet
|
15 |
library_name: transformers
|
16 |
+
---
|
17 |
+
---
|
18 |
+
license: mit
|
19 |
+
tags:
|
20 |
+
- vision-language
|
21 |
+
- image-encoder
|
22 |
+
- text-encoder
|
23 |
+
- multimodal
|
24 |
+
- contrastive-learning
|
25 |
+
- explainable-ai
|
26 |
+
- few-shot-learning
|
27 |
+
- agriculture
|
28 |
+
library_name: transformers
|
29 |
+
datasets:
|
30 |
+
- your-dataset-name
|
31 |
+
pipeline_tag: feature-extraction
|
32 |
+
---
|
33 |
+
|
34 |
+
# 🌿 SCOLD: A Vision-Language Foundation Model for Leaf Disease Identification
|
35 |
+
|
36 |
+
**SCOLD** (Leaf Diseases Vision-Language) is a multimodal model that maps **images** and **text descriptions** into a shared embedding space. It combines a [Swin Transformer](https://huggingface.co/timm/swin_tiny_patch4_window7_224) as the **image encoder** and [RoBERTa](https://huggingface.co/roberta-base) as the **text encoder**, with both outputs projected into a common 512-dimensional space.
|
37 |
+
|
38 |
+
This model is developed for **cross-modal retrieval**, **few-shot classification**, and **explainable AI in agriculture**, especially for plant disease diagnosis from both images and domain-specific text prompts.
|
39 |
+
|
40 |
+
---
|
41 |
+
|
42 |
+
## 🚀 Model Details
|
43 |
+
|
44 |
+
| Component | Architecture |
|
45 |
+
|------------------|-------------------------------------------|
|
46 |
+
| Image Encoder | Swin Base (patch4, window7, 224 resolution) |
|
47 |
+
| Text Encoder | RoBERTa-base |
|
48 |
+
| Projection Head | Linear layer (to 512-D space) |
|
49 |
+
| Normalization | L2 on both embeddings |
|
50 |
+
| Training Task | Contrastive learning |
|
51 |
+
|
52 |
+
The final embeddings from image and text encoders are aligned using cosine similarity.
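
As a rough illustration of the table above, and not the SCOLD training code, the following dependency-free sketch shows how L2 normalization turns a dot product into cosine similarity, and how a symmetric contrastive (InfoNCE-style) loss could be computed over a batch of matched image/text pairs. The function names, the temperature of 0.07, and the toy 3-D vectors are assumptions made for this example; the card itself states only that training is contrastive with L2-normalized embeddings.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_matrix(image_embs, text_embs):
    """Pairwise cosine similarities; rows are images, columns are texts."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct column for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[k]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss; the temperature is an assumed value."""
    sims = cosine_matrix(image_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sims]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy batch: pair k of images matches pair k of texts (made-up values).
image_embs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2]]
text_embs = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.0]]

aligned = contrastive_loss(image_embs, text_embs)
shuffled = contrastive_loss(image_embs, text_embs[::-1])  # mismatched pairs
print(f"aligned: {aligned:.4f}  shuffled: {shuffled:.4f}")
```

The loss is lower when each image sits next to its own description than when the pairs are shuffled, which is the property a contrastive objective optimizes for.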
|
53 |
+
|
54 |
+
---
|
55 |
+
|
56 |
+
## 🧩 Intended Uses & Limitations
|
57 |
+
|
58 |
+
### ✅ Intended Use
|
59 |
+
- Vision-language embedding for classification or retrieval tasks
|
60 |
+
- Few-shot learning in agricultural or medical datasets
|
61 |
+
- Multimodal interpretability or zero-shot transfer
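
The classification and zero-shot uses listed above typically reduce to ranking text prompts by their similarity to an image embedding. The sketch below assumes that pattern; it is not taken from the SCOLD code. The prompt strings, the 3-D placeholder embeddings, the temperature, and the `zero_shot_classify` helper are all hypothetical, and in practice the vectors would be the 512-D outputs of the model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_emb, prompt_embs, temperature=0.07):
    """Rank class prompts by similarity to one image embedding.

    Returns (label, probability) pairs sorted from most to least likely.
    The temperature is an assumed value, not one stated in the model card.
    """
    labels = list(prompt_embs)
    scores = [cosine(image_emb, prompt_embs[label]) / temperature for label in labels]
    probs = softmax(scores)
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)

# Hypothetical placeholder embeddings standing in for real model outputs.
prompt_embs = {
    "a healthy maize leaf": [0.9, 0.1, 0.0],
    "a maize leaf with bacterial blight": [0.1, 0.9, 0.2],
    "a maize leaf with rust": [0.0, 0.2, 0.9],
}
image_emb = [0.2, 0.8, 0.3]

ranked = zero_shot_classify(image_emb, prompt_embs)
for label, prob in ranked:
    print(f"{prob:.3f}  {label}")
```

For few-shot use, the same ranking could be applied to class prototypes (for example, the mean embedding of a handful of labeled images per class) in place of the prompt embeddings.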
|
62 |
+
|
63 |
+
### ❌ Limitations
|
64 |
+
- Not optimized for real-time inference
|
65 |
+
- Trained on the LeafNet dataset
|
66 |
+
- May not generalize well to non-agricultural tasks without fine-tuning
|
67 |
+
|
68 |
+
---
|
69 |
+
|
70 |
+
## 🧪 How to Use
|
71 |
+
|
72 |
+
```python
|
73 |
+
import torch
|
74 |
+
from transformers import RobertaTokenizer
|
75 |
+
from torchvision import transforms
|
76 |
+
from PIL import Image
|
77 |
+
from modeling_lvl import LVL # Replace with your module or package
|
78 |
+
|
79 |
+
# Load model
|
80 |
+
model = LVL()
|
81 |
+
model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))
|
82 |
+
model.eval()
|
83 |
+
|
84 |
+
# Text preprocessing
|
85 |
+
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
|
86 |
+
text = "A maize leaf with bacterial blight"
|
87 |
+
inputs = tokenizer(text, return_tensors="pt")
|
88 |
+
|
89 |
+
# Image preprocessing
|
90 |
+
image = Image.open("path_to_leaf.jpg").convert("RGB")
|
91 |
+
transform = transforms.Compose([
|
92 |
+
    transforms.Resize((224, 224)),
|
93 |
+
    transforms.ToTensor()
|
94 |
+
])
|
95 |
+
image_tensor = transform(image).unsqueeze(0)
|
96 |
+
|
97 |
+
# Inference
|
98 |
+
with torch.no_grad():
|
99 |
+
    image_emb, text_emb = model(image_tensor, inputs["input_ids"], inputs["attention_mask"])
|
100 |
+
similarity = torch.nn.functional.cosine_similarity(image_emb, text_emb)
|
101 |
+
print(f"Similarity score: {similarity.item():.4f}")
|