CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with this HuggingFace Spaces demo repository.

Repository Overview

This is the HuggingFace Spaces repository for HybridTransformer-MFIF, providing an interactive Gradio-based web demo for the multi-focus image fusion model. Users can upload near-focus and far-focus images to see the hybrid transformer model fuse them into a single all-in-focus image.

Repository Structure

Core Application Files

  • app.py: Main Gradio application with complete model definition and inference pipeline
  • README.md: HuggingFace Spaces configuration and demo documentation
  • requirements.txt: Python dependencies for the Gradio application
  • pyproject.toml: Additional project configuration
  • uv.lock: Dependency lock file

Assets

  • assets/: Directory containing sample images for the demo
    • lytro-01-A.jpg: Near-focus example image
    • lytro-01-B.jpg: Far-focus example image

Documentation

  • AGENTS.md: Agent interaction documentation
  • LICENSE: Project license

Application Architecture (app.py)

Model Components

The application includes the complete model definition:

  • FocalModulation: Adaptive spatial attention mechanism
  • CrossAttention: Cross-view attention between input images
  • CrossViTBlock: Cross-attention transformer blocks
  • FocalTransformerBlock: Focal modulation transformer blocks
  • PatchEmbed: Image patch embedding layer
  • FocalCrossViTHybrid: Main hybrid model architecture

Model Configuration

  • Image Size: 224×224 pixels
  • Patch Size: 16×16
  • Embedding Dimension: 768
  • CrossViT Depth: 4 blocks
  • Focal Transformer Depth: 6 blocks
  • Attention Heads: 12
  • Focal Window: 9×9
  • Focal Levels: 3
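
The hyperparameters above can be collected into a plain configuration dict. The key names here are illustrative; the actual keyword arguments accepted by FocalCrossViTHybrid in app.py may differ.

```python
# Illustrative key names; check app.py's FocalCrossViTHybrid signature
# for the real constructor arguments.
MODEL_CONFIG = {
    "img_size": 224,      # input resolution (224×224)
    "patch_size": 16,     # 16×16 patches
    "embed_dim": 768,     # transformer embedding dimension
    "cross_depth": 4,     # CrossViT blocks
    "focal_depth": 6,     # Focal Transformer blocks
    "num_heads": 12,      # attention heads
    "focal_window": 9,    # 9×9 focal window
    "focal_levels": 3,    # focal modulation levels
}

# Derived: number of patch tokens per input image (14×14 = 196)
num_patches = (MODEL_CONFIG["img_size"] // MODEL_CONFIG["patch_size"]) ** 2
```

With a 768-dimensional embedding and 12 heads, each head works in a 64-dimensional subspace, which is the standard ViT-Base layout.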

Key Functions

  • load_model(): Downloads the model checkpoint from the HuggingFace Hub and initializes the model, with error handling
  • get_transform(): Builds the image preprocessing pipeline
  • denormalize(): Converts model output back to a displayable format
  • fuse_images(): Main inference function for image fusion

Development Guidelines

Local Development Setup

# Clone the repository
git clone https://huggingface.co/spaces/divitmittal/hybridtransformer-mfif
cd hybridtransformer-mfif

# Install dependencies
pip install -r requirements.txt
# OR with uv
uv sync

# Run the application
python app.py
# OR with uv
uv run app.py

Model Loading Requirements

  • Downloads the model checkpoint best_model.pth from the HuggingFace Hub repository divitmittal/HybridTransformer-MFIF
  • Model weights are cached locally in the ./model_cache directory
  • Checkpoint weights must match the architecture defined in app.py
  • Supports checkpoints saved from both plain and DataParallel-wrapped models
  • Automatic device detection (CUDA/CPU)
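
A minimal sketch of this loading flow, using the repository and filename stated above. The helper name strip_dataparallel_prefix is hypothetical; checkpoints saved from nn.DataParallel prefix every key with "module.", which is what the helper normalizes away.

```python
def strip_dataparallel_prefix(state_dict):
    """Normalize a checkpoint saved from nn.DataParallel (keys prefixed
    with 'module.') into the plain single-device key format."""
    return {k.removeprefix("module."): v for k, v in state_dict.items()}

def load_model():
    # Lazy imports keep the pure helper above testable without torch.
    import torch
    from huggingface_hub import hf_hub_download

    ckpt_path = hf_hub_download(
        repo_id="divitmittal/HybridTransformer-MFIF",
        filename="best_model.pth",
        cache_dir="./model_cache",  # local cache, per the notes above
    )
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    state = strip_dataparallel_prefix(torch.load(ckpt_path, map_location=device))
    model = FocalCrossViTHybrid()   # architecture defined in app.py
    model.load_state_dict(state)
    return model.to(device).eval()
```

The real load_model() in app.py additionally wraps these steps in error handling for download and state-dict mismatches.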

Image Processing Pipeline

  1. Input: PIL images (any size)
  2. Preprocessing: Resize to 224×224, normalize with ImageNet stats
  3. Inference: Forward pass through hybrid transformer
  4. Postprocessing: Denormalize and convert to PIL image
  5. Output: Fused PIL image
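
Steps 2 and 4 can be sketched as follows. The ImageNet mean/std values are the standard ones; that app.py uses exactly these constants is an assumption, and the forward transform is shown in comments since it mirrors a conventional torchvision pipeline.

```python
import numpy as np

# Standard ImageNet normalization statistics (assumed, per step 2 above)
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

# get_transform() is, per the steps above, roughly:
#   transforms.Compose([
#       transforms.Resize((224, 224)),
#       transforms.ToTensor(),
#       transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
#   ])

def denormalize(img_chw):
    """Undo ImageNet normalization on a (3, H, W) float array and clip
    to the displayable [0, 1] range (step 4 above)."""
    img = img_chw * IMAGENET_STD[:, None, None] + IMAGENET_MEAN[:, None, None]
    return np.clip(img, 0.0, 1.0)
```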

Gradio Interface Components

Input Components

  • near_img: Image upload for near-focus input
  • far_img: Image upload for far-focus input
  • submit_btn: Button to trigger fusion process

Output Components

  • fused_img: Display for the resulting fused image

Examples

  • Predefined example pair using sample Lytro images
  • Demonstrates expected input format and quality
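
A minimal sketch of how these components might be wired together with Gradio Blocks, assuming the component names above (near_img, far_img, fused_img, submit_btn), the sample asset paths, and a fuse_images callable; the actual layout in app.py may differ.

```python
# Example pair from the assets/ directory described above
EXAMPLES = [["assets/lytro-01-A.jpg", "assets/lytro-01-B.jpg"]]

def build_demo(fuse_images):
    import gradio as gr  # lazy import keeps the module importable without gradio

    with gr.Blocks() as demo:
        with gr.Row():
            near_img = gr.Image(type="pil", label="Near-focus image")
            far_img = gr.Image(type="pil", label="Far-focus image")
        fused_img = gr.Image(type="pil", label="Fused (all-in-focus) result")
        submit_btn = gr.Button("Fuse Images")
        submit_btn.click(fuse_images, inputs=[near_img, far_img], outputs=fused_img)
        gr.Examples(examples=EXAMPLES, inputs=[near_img, far_img])
    return demo
```

Calling `build_demo(fuse_images).launch()` would serve the interface locally.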

Error Handling

Model Loading Errors

  • Graceful handling of HuggingFace Hub download failures
  • Device compatibility checking
  • State dictionary format validation
  • Network connectivity error handling

Input Validation

  • Checks for missing input images
  • Handles various image formats via PIL
  • Automatic error messages via Gradio interface
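
The missing-input check can be sketched as a small pure helper; in the Gradio app the returned message would typically surface via `raise gr.Error(...)`. The name validate_inputs is illustrative, not taken from app.py.

```python
def validate_inputs(near, far):
    """Return an error message if either input image is missing, else None."""
    if near is None and far is None:
        return "Please upload both a near-focus and a far-focus image."
    if near is None:
        return "Please upload a near-focus image."
    if far is None:
        return "Please upload a far-focus image."
    return None
```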

Runtime Errors

  • GPU memory management
  • Inference error handling
  • Graceful degradation to CPU if needed

Performance Considerations

Model Optimization

  • Model is set to evaluation mode for inference
  • No gradient computation during inference
  • Efficient tensor operations with proper device placement

Memory Management

  • Single model instance cached globally
  • Proper tensor cleanup after inference
  • Device-appropriate memory allocation
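
The eval-mode, no-gradient, and cleanup points above can be sketched as below; run_inference is a hypothetical name, and app.py's actual inference code may organize this differently.

```python
def run_inference(model, near_t, far_t, device):
    import torch  # lazy import; the function is a sketch of the flow above

    model.eval()                  # evaluation mode: no dropout/BatchNorm updates
    with torch.no_grad():         # no autograd graph is built during inference
        out = model(near_t.to(device), far_t.to(device))
    out = out.cpu()               # move result off the GPU promptly
    if device.type == "cuda":
        torch.cuda.empty_cache()  # return cached blocks to the CUDA allocator
    return out
```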

HuggingFace Spaces Configuration (README.md)

Spaces Metadata

  • Title: Hybrid Transformer for Multi-Focus Image Fusion
  • SDK: Gradio
  • App File: app.py
  • Emoji: 🖼️
  • Color Theme: Blue to green gradient

Demo Features

  • Interactive image upload interface
  • Real-time fusion processing
  • Example images for testing
  • Responsive web interface

Dependencies (requirements.txt)

Core Dependencies

  • torch: PyTorch framework for model inference
  • torchvision: Image transformations and utilities
  • gradio: Web interface framework
  • numpy: Numerical computations
  • Pillow: Image processing library
  • huggingface_hub: Download models from HuggingFace Hub

Version Management

  • Minimal version specifications for maximum compatibility
  • Focused on essential dependencies only
  • Compatible with HuggingFace Spaces environment

Usage Examples

Basic Usage

  1. Upload a near-focus image (foreground in focus)
  2. Upload a far-focus image (background in focus)
  3. Click "Fuse Images" to generate the all-in-focus result

Expected Input

  • Image pairs with complementary focus regions
  • RGB color images (any resolution, will be resized)
  • Similar scene content with different focal points

Output Quality

  • Fused images at the model's 224×224 working resolution, preserving in-focus detail from both inputs
  • Focus transferred from whichever input is sharper in each region
  • Smooth blending between focus regions with minimal visible artifacts

Development Tips

Model Modifications

  • Model architecture is defined directly in app.py
  • Changes require updating the model class definitions
  • Ensure compatibility with existing checkpoint format

Interface Updates

  • Gradio interface is highly customizable
  • Can add new input/output components easily
  • Supports additional preprocessing or postprocessing steps

Deployment

  • Optimized for HuggingFace Spaces deployment
  • Automatic dependency installation
  • Zero-configuration cloud deployment

This demo provides an accessible way for users to experience the multi-focus image fusion capabilities without requiring technical setup or model training.