EARL - SFT think (S) (8B)
Model Size: 8B parameters
Base Model: BAAI/Emu3-Stage1
Dataset: Simple Edit
Training Objective: Supervised Fine-Tuning (SFT) with Chain-of-Thought reasoning
This model is introduced in our paper: EARL: The Promise of RL for Autoregressive Image Editing.
Overview
EARL - SFT think (S) is a fine-tuned 8B vision-language model designed for autoregressive image editing. It extends the base Emu3 model with chain-of-thought supervision, enabling step-by-step reasoning to tackle complex editing tasks. Training leverages the Simple Edit dataset, focusing on editable instructions grounded in visual understanding.
Inference script and usage: GitHub Repository
Benchmark Results
Model | OmniEdit | EmuEdit | AURORA | MB | VisMin | I2EBench | AVG |
---|---|---|---|---|---|---|---|
SFT (S) | 5.73 | 3.66 | 3.58 | 3.19 | 3.57 | 3.59 | 3.88 |
SFT think (S) | 4.34 | 3.76 | 2.88 | 3.36 | 3.46 | 3.21 | 3.50 |
⚠️ Despite integrating reasoning capabilities, the SFT think variant underperforms the standard SFT model on average (3.50 vs. 3.88), although it scores higher on EmuEdit and MB.
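The comparison between the two rows can be checked directly from the per-benchmark scores. The sketch below assumes that AVG is the unweighted mean of the six benchmark columns, which this card does not state explicitly; the variable names are illustrative and not part of the EARL codebase.

```python
# Scores copied from the table above, in column order.
BENCHMARKS = ["OmniEdit", "EmuEdit", "AURORA", "MB", "VisMin", "I2EBench"]
SCORES = {
    "SFT (S)": [5.73, 3.66, 3.58, 3.19, 3.57, 3.59],
    "SFT think (S)": [4.34, 3.76, 2.88, 3.36, 3.46, 3.21],
}


def mean(values):
    """Unweighted arithmetic mean (assumed to be how AVG is computed)."""
    return sum(values) / len(values)


# Recompute the AVG column for each model.
averages = {name: mean(vals) for name, vals in SCORES.items()}

# Per-benchmark difference: positive means the think variant scores higher.
deltas = {
    bench: round(think - base, 2)
    for bench, base, think in zip(
        BENCHMARKS, SCORES["SFT (S)"], SCORES["SFT think (S)"]
    )
}

# Benchmarks where chain-of-thought supervision helps.
wins = [bench for bench, delta in deltas.items() if delta > 0]

for name, avg in averages.items():
    print(f"{name}: {avg:.3f}")
print("SFT think (S) scores higher on:", ", ".join(wins))
```

Under this assumption the recomputed means agree with the reported AVG column to within rounding, and the per-benchmark differences show that the gap is driven mostly by OmniEdit and AURORA.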
Intended Use
This model is suited for research and development in image editing tasks that benefit from interpretable reasoning, such as instructional or multi-step visual modifications.