🧬 STELLA-VLM-JoVE-7B: Laboratory Protocol Vision-Language Model
🎯 Model Description
STELLA-VLM-JoVE-7B is a specialized vision-language model fine-tuned from NVIDIA's Cosmos-Reason1-7B on laboratory protocol videos from JoVE (Journal of Visualized Experiments). This model bridges the gap between visual laboratory demonstrations and written experimental protocols, enabling automated protocol extraction, safety assessment, and error detection from laboratory media.
Key Features
- 🔬 Protocol Extraction: Automatically generate step-by-step laboratory protocols from videos
- 📸 Image Analysis: Comprehensive analysis of laboratory images
- ⚠️ Error Detection: Identify experimental errors and safety violations
- 🛡️ Safety Assessment: Generate detailed safety reports
- 🧪 Equipment Identification: Catalog laboratory equipment and reagents
- 🔄 Batch Processing: Efficiently process multiple videos
🚀 Quick Start
Installation
```bash
# Install dependencies
pip install torch transformers opencv-python pillow numpy

# Clone this model
git clone https://huggingface.co/Zaixi/STELLA-VLM-JoVE-7B
cd STELLA-VLM-JoVE-7B
```
Basic Usage
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
import torch
from PIL import Image

# Load model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Zaixi/STELLA-VLM-JoVE-7B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    "Zaixi/STELLA-VLM-JoVE-7B",
    trust_remote_code=True
)

# Analyze laboratory image
image = Image.open("lab_image.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Extract the laboratory protocol from this image:"},
        {"type": "image", "image": image}
    ]
}]

text_input = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
# Move inputs to the same device as the model
inputs = processor(text=[text_input], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    # do_sample=True is required for temperature to take effect
    outputs = model.generate(**inputs, max_new_tokens=1024, temperature=0.7, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt
generated = outputs[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(response)
```
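The example above handles a single image. For video, a common approach, consistent with the `max_frames` parameter used by the tool below, is to sample a few evenly spaced frames and pass them to the model as multiple images. A minimal sketch using OpenCV (already in the dependency list); `sample_frames` is a hypothetical helper, not part of this repository:

```python
import cv2
from PIL import Image

def sample_frames(video_path, max_frames=8):
    """Uniformly sample up to max_frames frames from a video as PIL images."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total // max_frames, 1)
    frames = []
    for idx in range(0, total, step):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            break
        # OpenCV decodes to BGR; PIL expects RGB
        frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames[:max_frames]

frames = sample_frames("experiment.mp4", max_frames=8)
messages = [{
    "role": "user",
    "content": [{"type": "image", "image": f} for f in frames]
             + [{"type": "text", "text": "Extract the laboratory protocol shown in these frames:"}],
}]
text_input = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[text_input], images=frames, return_tensors="pt").to(model.device)
# Generation and decoding then proceed exactly as in the image example above.
```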
🔧 Using STELLA_VLM Tool
This repository includes STELLA_VLM (Scientific Tool for Experiment Lab Learning and Analysis), a comprehensive toolkit for laboratory media analysis.
Tool Imports
```python
from stella_vlm_tool import (
    extract_protocol_from_video,
    analyze_lab_image,
    detect_experimental_errors,
    generate_safety_assessment,
    identify_equipment_and_reagents
)
```
Extract Protocol from Video
```python
# Extract protocol from laboratory video
result = extract_protocol_from_video(
    video_path="experiment.mp4",
    max_frames=8,
    output_format="markdown"
)
print(result)
```
Analyze Laboratory Image
```python
# Comprehensive image analysis
analysis = analyze_lab_image(
    image_path="lab_setup.jpg",
    analysis_type="comprehensive"  # or "equipment", "procedure", "safety"
)
print(analysis)
```
Detect Experimental Errors
```python
# Detect errors and safety violations
errors = detect_experimental_errors(
    media_path="experiment.mp4",
    error_categories="all"  # or "technique", "safety", "contamination"
)
print(errors)
```
Command Line Interface
```bash
# Extract protocol
python stella_vlm_tool.py video.mp4 protocol

# Analyze image
python stella_vlm_tool.py image.jpg image

# Detect errors
python stella_vlm_tool.py video.mp4 errors

# Safety assessment
python stella_vlm_tool.py video.mp4 safety
```
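The 🔄 Batch Processing capability listed under Key Features can be driven from Python by looping `extract_protocol_from_video` over a directory. A minimal sketch under the function signature shown above; the `videos/` and `protocols/` paths are illustrative:

```python
from pathlib import Path
from stella_vlm_tool import extract_protocol_from_video

out_dir = Path("protocols")
out_dir.mkdir(exist_ok=True)

# Extract one markdown protocol per video in the input directory
for video in sorted(Path("videos").glob("*.mp4")):
    protocol = extract_protocol_from_video(
        video_path=str(video),
        max_frames=8,
        output_format="markdown",
    )
    out_file = out_dir / f"{video.stem}.md"
    out_file.write_text(protocol)
    print(f"Wrote {out_file}")
```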
📊 Model Performance
Capabilities by Domain
| Domain | Capability | Performance |
|---|---|---|
| Cell Biology | Protocol extraction, sterility assessment | Excellent |
| Chemistry | Safety hazard detection, equipment ID | Very Good |
| Molecular Biology | Technique validation, contamination detection | Excellent |
| General Lab | Equipment identification, PPE compliance | Very Good |
Recommended Settings
- Frames per video: 8-12 for optimal detail
- Max tokens: 1024-2048 for complete protocols
- Temperature: 0.7 for balanced creativity/accuracy
- GPU Memory: ~16GB VRAM recommended
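Applied to the Quick Start example, these settings translate directly into the `generate` call (a sketch; in transformers, `temperature` only takes effect when sampling is enabled):

```python
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=2048,  # 1024-2048 for complete protocols
        temperature=0.7,      # balanced creativity/accuracy
        do_sample=True,       # required for temperature to have any effect
    )
```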
🔬 Example Outputs
Protocol Extraction
```
Step 1: Prepare sterile PBS buffer at room temperature
Step 2: Add 5 mL of cell culture medium to 15 mL conical tube
Step 3: Centrifuge at 300 × g for 5 minutes at 4°C
Step 4: Carefully aspirate supernatant without disturbing pellet
...
```
Safety Assessment
```
PPE Status: ✅ Lab coat and gloves observed
Hazards Identified: Chemical (ethanol), Biological (cell culture)
Safety Violations: None detected
Recommendations: Ensure eye protection when handling chemicals
```
📚 Training Details
- Base Model: nvidia/Cosmos-Reason1-7B
- Training Data: JoVE laboratory protocol videos
- Fine-tuning Method: LoRA (merged into base model)
- Training Duration: ~50 hours on 8xA100 GPUs
- Dataset Size: 10,000+ laboratory videos
⚡ System Requirements
- GPU: NVIDIA GPU with 16GB+ VRAM (A100, A6000, RTX 4090)
- RAM: 32GB+ system memory
- Storage: 30GB for model weights
- Python: 3.8 or higher
- CUDA: 11.7 or higher (for GPU acceleration)
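A quick way to check these requirements before loading the model, sketched with standard PyTorch calls (the 16 GB threshold mirrors the recommendation above):

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"{props.name}: {vram_gb:.1f} GB VRAM")
    if vram_gb < 16:
        print("Warning: below the recommended 16 GB of VRAM")
else:
    print("No CUDA GPU detected; CPU inference will be very slow")
```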
📋 Limitations
- Optimized for laboratory/scientific content
- Best performance with clear, well-lit videos
- May require domain expertise to validate outputs
- Limited to English language protocols
🤝 Contributing
We welcome contributions! Please see our contributing guidelines for details.
📄 License
This model is released under the MIT License. See LICENSE for details.
🙏 Acknowledgments
- NVIDIA for the Cosmos-Reason base model
- JoVE (Journal of Visualized Experiments) for laboratory protocol data
- Open-source community for transformers and vision libraries
📖 Citation
If you use this model in your research, please cite:
```bibtex
@software{cosmos_reason_jove_2024,
  title     = {STELLA-VLM-JoVE-7B: Laboratory Protocol Vision-Language Model},
  author    = {Zaixi Zhang},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Zaixi/STELLA-VLM-JoVE-7B}
}
```
📧 Contact
For questions or support, please open an issue on the Hugging Face repository.
Built with ❤️ for the scientific community