EARL - SFT think (S) (8B)

Model Size: 8B parameters
Base Model: BAAI/Emu3-Stage1
Dataset: Simple Edit
Training Objective: Supervised Fine-Tuning (SFT) with Chain-of-Thought reasoning

This model is introduced in our paper, "EARL: The Promise of RL for Autoregressive Image Editing".

Overview

EARL - SFT think (S) is an 8B vision-language model fine-tuned for autoregressive image editing. It extends the base Emu3 model with chain-of-thought supervision, enabling step-by-step reasoning on complex editing tasks. Training uses the Simple Edit dataset, which focuses on editing instructions grounded in visual understanding.

🔗 Inference script and usage: GitHub Repository

Benchmark Results

Model	OmniEdit	EmuEdit	AURORA	MB	VisMin	I2EBench	AVG
SFT (S)	5.73	3.66	3.58	3.19	3.57	3.59	3.88
SFT think (S)	4.34	3.76	2.88	3.36	3.46	3.21	3.50

⚠️ Despite its added reasoning capabilities, the SFT think variant averages lower than the standard SFT model (3.50 vs. 3.88). It scores higher only on EmuEdit and MB.
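
The comparison can be checked directly from the table. The sketch below is a minimal example that assumes the AVG column is the unweighted mean of the six benchmark scores (the card does not state how it is computed); the variable and function names are illustrative and not part of the EARL codebase.

```python
# Scores copied from the benchmark table; list order matches the table columns.
BENCHMARKS = ["OmniEdit", "EmuEdit", "AURORA", "MB", "VisMin", "I2EBench"]
SCORES = {
    "SFT (S)": [5.73, 3.66, 3.58, 3.19, 3.57, 3.59],
    "SFT think (S)": [4.34, 3.76, 2.88, 3.36, 3.46, 3.21],
}


def mean(values):
    """Unweighted arithmetic mean (assumed to be how AVG is derived)."""
    return sum(values) / len(values)


# Recompute the per-model average over the six benchmarks.
averages = {name: mean(vals) for name, vals in SCORES.items()}

# Benchmarks on which the think variant scores higher than plain SFT.
think_wins = [
    bench
    for bench, base, think in zip(
        BENCHMARKS, SCORES["SFT (S)"], SCORES["SFT think (S)"]
    )
    if think > base
]

for name, avg in averages.items():
    print(f"{name}: {avg:.3f}")
print("SFT think (S) scores higher on:", ", ".join(think_wins))
```

Under that assumption, the recomputed means agree with the reported AVG values to within rounding.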

Intended Use

This model is suited for research and development in image editing tasks that benefit from interpretable reasoning, such as instructional or multi-step visual modifications.
