# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with this HuggingFace Spaces demo repository.

## Repository Overview

This is the HuggingFace Spaces repository for HybridTransformer-MFIF, providing an interactive Gradio-based web demo for the multi-focus image fusion model. Users can upload near-focus and far-focus images to see the hybrid transformer model fuse them into a single all-in-focus image.

## Repository Structure

### Core Application Files
- `app.py`: Main Gradio application with complete model definition and inference pipeline
- `README.md`: HuggingFace Spaces configuration and demo documentation
- `requirements.txt`: Python dependencies for the Gradio application
- `pyproject.toml`: Additional project configuration
- `uv.lock`: Dependency lock file

### Assets
- `assets/`: Directory containing sample images for the demo
  - `lytro-01-A.jpg`: Near-focus example image
  - `lytro-01-B.jpg`: Far-focus example image

### Documentation
- `AGENTS.md`: Agent interaction documentation
- `LICENSE`: Project license

## Application Architecture (app.py)

### Model Components
The application includes the complete model definition:
- **FocalModulation**: Adaptive spatial attention mechanism
- **CrossAttention**: Cross-view attention between input images
- **CrossViTBlock**: Cross-attention transformer blocks
- **FocalTransformerBlock**: Focal modulation transformer blocks
- **PatchEmbed**: Image patch embedding layer
- **FocalCrossViTHybrid**: Main hybrid model architecture

### Model Configuration
- **Image Size**: 224×224 pixels
- **Patch Size**: 16×16
- **Embedding Dimension**: 768
- **CrossViT Depth**: 4 blocks
- **Focal Transformer Depth**: 6 blocks
- **Attention Heads**: 12
- **Focal Window**: 9×9
- **Focal Levels**: 3
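
For reference, a hypothetical instantiation matching these values might look like the sketch below. The exact constructor signature lives in `app.py`; the keyword names here are illustrative assumptions.

```python
# Hypothetical instantiation of the hybrid model using the configuration
# above; parameter names are illustrative and may differ from app.py.
model = FocalCrossViTHybrid(
    img_size=224,      # inputs resized to 224x224
    patch_size=16,     # 16x16 patches -> 14x14 = 196 tokens
    embed_dim=768,
    cross_depth=4,     # CrossViT blocks
    focal_depth=6,     # Focal Transformer blocks
    num_heads=12,
    focal_window=9,    # 9x9 focal window
    focal_levels=3,
)
```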

### Key Functions
- `load_model()`: Downloads the model checkpoint from the HuggingFace Hub and initializes the model, with error handling
- `get_transform()`: Builds the image preprocessing pipeline
- `denormalize()`: Converts model output back to a displayable format
- `fuse_images()`: Runs the main inference path for image fusion (see the sketches below)
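
As an illustration, `get_transform()` and `denormalize()` plausibly wrap the standard torchvision/ImageNet conventions described later in this file; treat this as a sketch rather than the exact code in `app.py`.

```python
import torch
from torchvision import transforms

# ImageNet normalization constants used during preprocessing.
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

def get_transform():
    # Resize to the model's 224x224 input and normalize with ImageNet stats.
    return transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
    ])

def denormalize(tensor: torch.Tensor) -> torch.Tensor:
    # Invert the normalization so the output can be displayed as an image.
    mean = torch.tensor(IMAGENET_MEAN).view(3, 1, 1).to(tensor.device)
    std = torch.tensor(IMAGENET_STD).view(3, 1, 1).to(tensor.device)
    return (tensor * std + mean).clamp(0, 1)
```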

## Development Guidelines

### Local Development Setup
```bash
# Clone the repository
git clone https://huggingface.co/spaces/divitmittal/hybridtransformer-mfif
cd hybridtransformer-mfif

# Install dependencies
pip install -r requirements.txt
# OR with uv
uv sync

# Run the application
python app.py
# OR with uv
uv run app.py
```

### Model Loading Requirements
- Downloads the model checkpoint `best_model.pth` from the HuggingFace Hub repository `divitmittal/HybridTransformer-MFIF`
- Model weights are cached locally in the `./model_cache` directory
- Model weights must be compatible with the architecture defined in `app.py`
- Supports both plain and `DataParallel`-wrapped state dictionaries (see the sketch after this list)
- Automatic device detection (CUDA/CPU)
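
A minimal sketch of this loading flow, assuming the checkpoint stores a plain `state_dict` (possibly saved from a `DataParallel` wrapper); the actual implementation in `app.py` adds fuller error handling:

```python
import torch
from huggingface_hub import hf_hub_download

def load_model() -> torch.nn.Module:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # Download the checkpoint, or reuse the locally cached copy.
    ckpt_path = hf_hub_download(
        repo_id="divitmittal/HybridTransformer-MFIF",
        filename="best_model.pth",
        cache_dir="./model_cache",
    )
    state_dict = torch.load(ckpt_path, map_location=device)
    # Strip the "module." prefix that DataParallel adds to key names.
    state_dict = {k.removeprefix("module."): v for k, v in state_dict.items()}
    model = FocalCrossViTHybrid()  # architecture defined in app.py
    model.load_state_dict(state_dict)
    return model.to(device).eval()
```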

### Image Processing Pipeline
1. **Input**: PIL images (any size)
2. **Preprocessing**: Resize to 224×224, normalize with ImageNet stats
3. **Inference**: Forward pass through hybrid transformer
4. **Postprocessing**: Denormalize and convert to PIL image
5. **Output**: Fused PIL image
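
Put together, the end-to-end `fuse_images()` path plausibly looks like the following sketch. It reuses the hypothetical helpers above and assumes the model takes the two views as separate forward arguments:

```python
import torch
from PIL import Image
from torchvision import transforms

def fuse_images(near: Image.Image, far: Image.Image) -> Image.Image:
    # `model` is the globally cached instance from load_model();
    # get_transform()/denormalize() are the helpers sketched earlier.
    transform = get_transform()
    device = next(model.parameters()).device
    # Batch dimension of 1; both views move to the model's device.
    near_t = transform(near.convert("RGB")).unsqueeze(0).to(device)
    far_t = transform(far.convert("RGB")).unsqueeze(0).to(device)
    with torch.no_grad():  # inference only, no gradients
        fused = model(near_t, far_t)
    # Denormalize and convert the single output image back to PIL.
    fused = denormalize(fused.squeeze(0).cpu())
    return transforms.ToPILImage()(fused)
```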

## Gradio Interface Components

### Input Components
- `near_img`: Image upload for near-focus input
- `far_img`: Image upload for far-focus input
- `submit_btn`: Button to trigger fusion process

### Output Components
- `fused_img`: Display for the resulting fused image

### Examples
- Predefined example pair using sample Lytro images
- Demonstrates expected input format and quality
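
A stripped-down version of the interface wiring might look like this; component labels and layout are assumptions, and `fuse_images` is the inference function sketched earlier:

```python
import gradio as gr

with gr.Blocks(title="HybridTransformer-MFIF") as demo:
    with gr.Row():
        near_img = gr.Image(type="pil", label="Near-focus image")
        far_img = gr.Image(type="pil", label="Far-focus image")
    fused_img = gr.Image(type="pil", label="Fused (all-in-focus) image")
    submit_btn = gr.Button("Fuse Images")
    submit_btn.click(fuse_images, inputs=[near_img, far_img], outputs=fused_img)
    # Predefined example pair using the sample Lytro images in assets/.
    gr.Examples(
        examples=[["assets/lytro-01-A.jpg", "assets/lytro-01-B.jpg"]],
        inputs=[near_img, far_img],
    )

demo.launch()
```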

## Error Handling

### Model Loading Errors
- Graceful handling of HuggingFace Hub download failures
- Device compatibility checking
- State dictionary format validation
- Network connectivity error handling

### Input Validation
- Checks for missing input images
- Handles various image formats via PIL
- Automatic error messages via Gradio interface

### Runtime Errors
- GPU memory management
- Inference error handling
- Graceful degradation to CPU if needed
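
One common way to implement such degradation, shown here as an assumed pattern rather than the repository's exact code:

```python
import torch

def safe_fuse(near, far):
    # Assumed fallback pattern: retry the whole pipeline on CPU if the
    # GPU runs out of memory. `model` and `fuse_images` are the globals
    # from the earlier sketches.
    try:
        return fuse_images(near, far)
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()  # release cached GPU allocations
        model.to("cpu")
        return fuse_images(near, far)
```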

## Performance Considerations

### Model Optimization
- Model is set to evaluation mode for inference
- No gradient computation during inference
- Efficient tensor operations with proper device placement

### Memory Management
- Single model instance cached globally
- Proper tensor cleanup after inference
- Device-appropriate memory allocation
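
The "single cached instance" pattern usually amounts to a lazily initialized module-level global, for example (names assumed):

```python
_model = None  # single global instance, loaded lazily on first use

def get_model():
    global _model
    if _model is None:
        _model = load_model()  # download + initialize once, then reuse
    return _model
```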

## HuggingFace Spaces Configuration (README.md)

### Spaces Metadata
- **Title**: Hybrid Transformer for Multi-Focus Image Fusion
- **SDK**: Gradio
- **App File**: app.py
- **Emoji**: 🖼️
- **Color Theme**: Blue to green gradient
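
In `README.md`, this metadata lives in the YAML front matter. A plausible block matching the settings above (exact values may differ from the repository):

```yaml
---
title: Hybrid Transformer for Multi-Focus Image Fusion
emoji: 🖼️
colorFrom: blue
colorTo: green
sdk: gradio
app_file: app.py
---
```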

### Demo Features
- Interactive image upload interface
- Real-time fusion processing
- Example images for testing
- Responsive web interface

## Dependencies (requirements.txt)

### Core Dependencies
- `torch`: PyTorch framework for model inference
- `torchvision`: Image transformations and utilities
- `gradio`: Web interface framework
- `numpy`: Numerical computations
- `Pillow`: Image processing library
- `huggingface_hub`: Client for downloading models from the HuggingFace Hub

### Version Management
- Minimal version specifications for maximum compatibility
- Focused on essential dependencies only
- Compatible with HuggingFace Spaces environment

## Usage Examples

### Basic Usage
1. Upload a near-focus image (foreground in focus)
2. Upload a far-focus image (background in focus)
3. Click "Fuse Images" to generate the all-in-focus result

### Expected Input
- Image pairs with complementary focus regions
- RGB color images (any resolution; resized to 224×224 internally)
- Similar scene content with different focal points

### Output Quality
- Fused images combining the in-focus detail from both inputs (at the model's 224×224 working resolution)
- Focus drawn from the sharper regions of each source image
- Smooth blending with minimal visible artifacts

## Development Tips

### Model Modifications
- Model architecture is defined directly in `app.py`
- Changes require updating the model class definitions
- Ensure compatibility with existing checkpoint format

### Interface Updates
- Gradio interface is highly customizable
- Can add new input/output components easily
- Supports additional preprocessing or postprocessing steps

### Deployment
- Optimized for HuggingFace Spaces deployment
- Automatic dependency installation
- Zero-configuration cloud deployment

This demo provides an accessible way for users to experience the multi-focus image fusion capabilities without requiring technical setup or model training.