EARL - SFT think (S) (8B)

Model Size: 8B parameters
Base Model: BAAI/Emu3-Stage1
Dataset: Simple Edit
Training Objective: Supervised Fine-Tuning (SFT) with Chain-of-Thought reasoning

This model is introduced in our paper: EARL: The Promise of RL for Autoregressive Image Editing.

Overview

EARL - SFT think (S) is a fine-tuned 8B vision-language model for autoregressive image editing. It extends the base Emu3 model with chain-of-thought supervision, enabling step-by-step reasoning on complex editing tasks. It is trained on the Simple Edit dataset, which focuses on edit instructions grounded in visual understanding.

🔗 Inference script and usage: GitHub Repository
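Before running the inference script from the repository, the checkpoint has to be available locally. The following is a minimal sketch of one way to fetch it, assuming the `huggingface_hub` package is installed; the repository ID is the one shown on this model page, while the helper name `download_checkpoint` is hypothetical and not part of the EARL codebase. The editing pipeline itself is not reproduced here.

```python
# Sketch only: fetches the checkpoint files, it does not run any image editing.
REPO_ID = "mair-lab/thinking-sft-simple"  # repository ID as shown on this model page


def download_checkpoint(local_dir=None):
    """Download all files of the model repository and return the local path.

    `local_dir` is optional; when it is None the default Hugging Face cache is used.
    """
    # Imported lazily so that defining the helper does not require the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


# Example call (downloads roughly 8.49B BF16 parameters, so it is left commented out):
# checkpoint_path = download_checkpoint()
```

The returned path can then be passed to the inference script in the GitHub repository, following the usage instructions given there.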

Benchmark Results

Model           OmniEdit  EmuEdit  AURORA  MB    VisMin  I2EBench  AVG
SFT (S)         5.73      3.66     3.58    3.19  3.57    3.59      3.88
SFT think (S)   4.34      3.76     2.88    3.36  3.46    3.21      3.50

โš ๏ธ Despite integrating reasoning capabilities, the SFT think variant underperforms slightly compared to the standard SFT model in average benchmark scores.

Intended Use

This model is suited for research and development in image editing tasks that benefit from interpretable reasoning, such as instructional or multi-step visual modifications.

Weights: Safetensors, 8.49B params, BF16

Model tree for mair-lab/thinking-sft-simple: fine-tuned from BAAI/Emu3-Stage1.