🧬 STELLA-VLM-JoVE-7B: Laboratory Protocol Vision-Language Model
🎯 Model Description
STELLA-VLM-JoVE-7B is a specialized vision-language model fine-tuned from NVIDIA's Cosmos-Reason1-7B on laboratory protocol videos from JoVE (Journal of Visualized Experiments). This model bridges the gap between visual laboratory demonstrations and written experimental protocols, enabling automated protocol extraction, safety assessment, and error detection from laboratory media.
Key Features
- 🔬 Protocol Extraction: Automatically generate step-by-step laboratory protocols from videos
- 📸 Image Analysis: Comprehensive analysis of laboratory images
- ⚠️ Error Detection: Identify experimental errors and safety violations
- 🛡️ Safety Assessment: Generate detailed safety reports
- 🧪 Equipment Identification: Catalog laboratory equipment and reagents
- 🔄 Batch Processing: Efficiently process multiple videos
🚀 Quick Start
Installation
```bash
# Install dependencies
pip install torch transformers opencv-python pillow numpy

# Clone this model
git clone https://huggingface.co/Zaixi/STELLA-VLM-JoVE-7B
cd STELLA-VLM-JoVE-7B
```
Basic Usage
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
import torch
from PIL import Image

# Load model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Zaixi/STELLA-VLM-JoVE-7B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    "Zaixi/STELLA-VLM-JoVE-7B",
    trust_remote_code=True
)

# Analyze laboratory image
image = Image.open("lab_image.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Extract the laboratory protocol from this image:"},
        {"type": "image", "image": image}
    ]
}]

text_input = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
# Move inputs to the same device as the model
inputs = processor(text=[text_input], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    # do_sample=True is required for temperature to take effect
    outputs = model.generate(**inputs, max_new_tokens=1024, temperature=0.7, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt
generated = outputs[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(response)
```
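The example above handles a single image. For video, a common approach, consistent with the `max_frames` parameter used by the tool below, is to sample a few evenly spaced frames and pass them to the model as multiple images. A minimal sketch using OpenCV (already in the dependency list); `sample_frames` is a hypothetical helper, not part of this repository:

```python
import cv2
from PIL import Image

def sample_frames(video_path, max_frames=8):
    """Uniformly sample up to max_frames frames from a video as PIL images."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total // max_frames, 1)
    frames = []
    for idx in range(0, total, step):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            break
        # OpenCV decodes to BGR; PIL expects RGB
        frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames[:max_frames]

frames = sample_frames("experiment.mp4", max_frames=8)
messages = [{
    "role": "user",
    "content": [{"type": "image", "image": f} for f in frames]
             + [{"type": "text", "text": "Extract the laboratory protocol shown in these frames:"}],
}]
text_input = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[text_input], images=frames, return_tensors="pt").to(model.device)
# Generation and decoding then proceed exactly as in the image example above.
```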
🔧 Using STELLA_VLM Tool
This repository includes STELLA_VLM (Scientific Tool for Experiment Lab Learning and Analysis), a comprehensive toolkit for laboratory media analysis.
Tool Imports
```python
from stella_vlm_tool import (
    extract_protocol_from_video,
    analyze_lab_image,
    detect_experimental_errors,
    generate_safety_assessment,
    identify_equipment_and_reagents
)
```
Extract Protocol from Video
```python
# Extract protocol from laboratory video
result = extract_protocol_from_video(
    video_path="experiment.mp4",
    max_frames=8,
    output_format="markdown"
)
print(result)
```
Analyze Laboratory Image
```python
# Comprehensive image analysis
analysis = analyze_lab_image(
    image_path="lab_setup.jpg",
    analysis_type="comprehensive"  # or "equipment", "procedure", "safety"
)
print(analysis)
```
Detect Experimental Errors
```python
# Detect errors and safety violations
errors = detect_experimental_errors(
    media_path="experiment.mp4",
    error_categories="all"  # or "technique", "safety", "contamination"
)
print(errors)
```
Command Line Interface
```bash
# Extract protocol
python stella_vlm_tool.py video.mp4 protocol

# Analyze image
python stella_vlm_tool.py image.jpg image

# Detect errors
python stella_vlm_tool.py video.mp4 errors

# Safety assessment
python stella_vlm_tool.py video.mp4 safety
```
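The 🔄 Batch Processing capability listed under Key Features can be driven from Python by looping `extract_protocol_from_video` over a directory. A minimal sketch under the function signature shown above; the `videos/` and `protocols/` paths are illustrative:

```python
from pathlib import Path
from stella_vlm_tool import extract_protocol_from_video

out_dir = Path("protocols")
out_dir.mkdir(exist_ok=True)

# Extract one markdown protocol per video in the input directory
for video in sorted(Path("videos").glob("*.mp4")):
    protocol = extract_protocol_from_video(
        video_path=str(video),
        max_frames=8,
        output_format="markdown",
    )
    out_file = out_dir / f"{video.stem}.md"
    out_file.write_text(protocol)
    print(f"Wrote {out_file}")
```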
📊 Model Performance
Capabilities by Domain
| Domain | Capability | Performance |
|---|---|---|
| Cell Biology | Protocol extraction, sterility assessment | Excellent |
| Chemistry | Safety hazard detection, equipment ID | Very Good |
| Molecular Biology | Technique validation, contamination detection | Excellent |
| General Lab | Equipment identification, PPE compliance | Very Good |
Recommended Settings
- Frames per video: 8-12 for optimal detail
- Max tokens: 1024-2048 for complete protocols
- Temperature: 0.7 for balanced creativity/accuracy
- GPU Memory: ~16GB VRAM recommended
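Applied to the Quick Start example, these settings translate directly into the `generate` call (a sketch; in transformers, `temperature` only takes effect when sampling is enabled):

```python
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=2048,  # 1024-2048 for complete protocols
        temperature=0.7,      # balanced creativity/accuracy
        do_sample=True,       # required for temperature to have any effect
    )
```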
🔬 Example Outputs
Protocol Extraction
```
Step 1: Prepare sterile PBS buffer at room temperature
Step 2: Add 5 mL of cell culture medium to 15 mL conical tube
Step 3: Centrifuge at 300 × g for 5 minutes at 4°C
Step 4: Carefully aspirate supernatant without disturbing pellet
...
```
Safety Assessment
```
PPE Status: ✅ Lab coat and gloves observed
Hazards Identified: Chemical (ethanol), Biological (cell culture)
Safety Violations: None detected
Recommendations: Ensure eye protection when handling chemicals
```
📚 Training Details
- Base Model: nvidia/Cosmos-Reason1-7B
- Training Data: JoVE laboratory protocol videos
- Fine-tuning Method: LoRA (merged into base model)
- Training Duration: ~50 hours on 8xA100 GPUs
- Dataset Size: 10,000+ laboratory videos
⚡ System Requirements
- GPU: NVIDIA GPU with 16GB+ VRAM (A100, A6000, RTX 4090)
- RAM: 32GB+ system memory
- Storage: 30GB for model weights
- Python: 3.8 or higher
- CUDA: 11.7 or higher (for GPU acceleration)
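A quick way to check these requirements before loading the model, sketched with standard PyTorch calls (the 16 GB threshold mirrors the recommendation above):

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"{props.name}: {vram_gb:.1f} GB VRAM")
    if vram_gb < 16:
        print("Warning: below the recommended 16 GB of VRAM")
else:
    print("No CUDA GPU detected; CPU inference will be very slow")
```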
📋 Limitations
- Optimized for laboratory/scientific content
- Best performance with clear, well-lit videos
- May require domain expertise to validate outputs
- Limited to English language protocols
🤝 Contributing
We welcome contributions! Please see our contributing guidelines for details.
📄 License
This model is released under the MIT License. See LICENSE for details.
🙏 Acknowledgments
- NVIDIA for the Cosmos-Reason base model
- JoVE (Journal of Visualized Experiments) for laboratory protocol data
- Open-source community for transformers and vision libraries
📖 Citation
If you use this model in your research, please cite:
```bibtex
@software{cosmos_reason_jove_2024,
  title     = {STELLA-VLM-JoVE-7B: Laboratory Protocol Vision-Language Model},
  author    = {Zaixi Zhang},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Zaixi/STELLA-VLM-JoVE-7B}
}
```
📧 Contact
For questions or support, please open an issue on the Hugging Face repository.
Built with ❤️ for the scientific community