enalis committed on
Commit ada239a · verified · 1 Parent(s): 571ea2e

Update README.md

Files changed (1)
  1. README.md +86 -1
README.md CHANGED
@@ -13,4 +13,89 @@ tags:
  datasets:
  - enalis/LeafNet
  library_name: transformers
- ---
+ license: mit
+ tags:
+ - vision-language
+ - image-encoder
+ - text-encoder
+ - multimodal
+ - contrastive-learning
+ - explainable-ai
+ - few-shot-learning
+ - agriculture
+ pipeline_tag: feature-extraction
+ ---
+
+ # 🌿 SCOLD: A Vision-Language Foundation Model for Leaf Disease Identification
+
+ **SCOLD** (Leaf Diseases Vision-Language) is a multimodal model that maps **images** and **text descriptions** into a shared embedding space. It combines a [Swin Transformer](https://huggingface.co/timm/swin_tiny_patch4_window7_224) as the **image encoder** and [RoBERTa](https://huggingface.co/roberta-base) as the **text encoder**, with both outputs projected into a shared 512-dimensional space.
+
+ The model is developed for **cross-modal retrieval**, **few-shot classification**, and **explainable AI in agriculture**, especially plant disease diagnosis from images paired with domain-specific text prompts.
+
+ ---
+
+ ## 🚀 Model Details
+
+ | Component       | Architecture                                |
+ |-----------------|---------------------------------------------|
+ | Image Encoder   | Swin Base (patch4, window7, 224 resolution) |
+ | Text Encoder    | RoBERTa-base                                |
+ | Projection Head | Linear layer (to 512-D space)               |
+ | Normalization   | L2 on both embeddings                       |
+ | Training Task   | Contrastive learning                        |
+
+ The final embeddings from the image and text encoders are L2-normalized and compared using cosine similarity.
+
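+ As a rough sketch of that layout (hypothetical code, not the released implementation: the class name `DualEncoderSketch`, the `image_dim`/`text_dim` parameters, and the backbone arguments are placeholders standing in for the actual Swin and RoBERTa encoders), the forward pass and the standard symmetric contrastive objective look like this:
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+ from torch import nn
+
+ class DualEncoderSketch(nn.Module):
+     """Illustrative dual-encoder layout matching the table above."""
+
+     def __init__(self, image_backbone, text_backbone, image_dim, text_dim, proj_dim=512):
+         super().__init__()
+         self.image_backbone = image_backbone  # stands in for the Swin encoder
+         self.text_backbone = text_backbone    # stands in for RoBERTa-base
+         self.image_proj = nn.Linear(image_dim, proj_dim)  # linear head to 512-D
+         self.text_proj = nn.Linear(text_dim, proj_dim)
+
+     def forward(self, pixel_values, input_ids, attention_mask):
+         img = self.image_proj(self.image_backbone(pixel_values))
+         txt = self.text_proj(self.text_backbone(input_ids, attention_mask))
+         # L2 normalization puts both embeddings on the unit sphere,
+         # so their dot product equals cosine similarity.
+         return F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
+
+ def contrastive_loss(image_emb, text_emb, temperature=0.07):
+     """Symmetric InfoNCE: matched image-text pairs lie on the diagonal."""
+     logits = image_emb @ text_emb.t() / temperature
+     targets = torch.arange(logits.size(0), device=logits.device)
+     return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
+ ```
+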
+ ---
+
+ ## 🧩 Intended Uses & Limitations
+
+ ### ✅ Intended Use
+ - Vision-language embedding for classification or retrieval tasks
+ - Few-shot learning on agricultural or medical datasets
+ - Multimodal interpretability and zero-shot transfer
+
+ ### ❌ Limitations
+ - Not optimized for real-time inference
+ - Trained only on the LeafNet dataset
+ - May not generalize well to non-agricultural tasks without fine-tuning
+
+ ---
+
+ ## 🧪 How to Use
+
+ ```python
+ import torch
+ from transformers import RobertaTokenizer
+ from torchvision import transforms
+ from PIL import Image
+ from modeling_lvl import LVL  # replace with your module or package
+
+ # Load the model and checkpoint weights
+ model = LVL()
+ model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))
+ model.eval()
+
+ # Text preprocessing
+ tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
+ text = "A maize leaf with bacterial blight"
+ inputs = tokenizer(text, return_tensors="pt")
+
+ # Image preprocessing (224x224 matches the encoder's input resolution)
+ image = Image.open("path_to_leaf.jpg").convert("RGB")
+ transform = transforms.Compose([
+     transforms.Resize((224, 224)),
+     transforms.ToTensor()
+ ])
+ image_tensor = transform(image).unsqueeze(0)  # add a batch dimension
+
+ # Inference: encode both modalities and compare them
+ with torch.no_grad():
+     image_emb, text_emb = model(image_tensor, inputs["input_ids"], inputs["attention_mask"])
+     similarity = torch.nn.functional.cosine_similarity(image_emb, text_emb)
+     print(f"Similarity score: {similarity.item():.4f}")
+ ```
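+
+ Building on this, zero-shot classification is just ranking candidate disease prompts against one image embedding. A minimal sketch reusing `model`, `tokenizer`, and `image_tensor` from the snippet above (the prompt list is illustrative):
+
+ ```python
+ # Score one image against several candidate disease descriptions.
+ prompts = [
+     "A maize leaf with bacterial blight",
+     "A healthy maize leaf",
+     "A maize leaf with rust",
+ ]
+
+ with torch.no_grad():
+     scores = []
+     for prompt in prompts:
+         enc = tokenizer(prompt, return_tensors="pt")
+         img_emb, txt_emb = model(image_tensor, enc["input_ids"], enc["attention_mask"])
+         scores.append(torch.nn.functional.cosine_similarity(img_emb, txt_emb).item())
+
+ best = max(range(len(prompts)), key=lambda i: scores[i])
+ print(f"Predicted: {prompts[best]} (score: {scores[best]:.4f})")
+ ```
+
+ The same pattern extends to few-shot use: embed a few labeled support images per class, average them into class prototypes, and assign new images to the prototype with the highest cosine similarity.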