---
base_model:
- ByteDance-Seed/BAGEL-7B-MoT
datasets:
- multimodal-reasoning-lab/Zebra-CoT
license: apache-2.0
pipeline_tag: any-to-any
library_name: transformers
---

# Bagel‑Zebra‑CoT

> A vision–language model fine‑tuned on the Zebra‑CoT dataset to generate high-quality interleaved visual chain‑of‑thought.

[![Paper on ArXiv](https://img.shields.io/badge/arxiv-2507.16746-red)](https://arxiv.org/abs/2507.16746)
[![Dataset on Hugging Face](https://img.shields.io/badge/huggingface-Zebra--CoT-lightblue)](https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT)
[![Model on Hugging Face](https://img.shields.io/badge/huggingface-Bagel--Zebra--CoT-orange)](https://huggingface.co/multimodal-reasoning-lab/Bagel-Zebra-CoT)
[![GitHub](https://img.shields.io/badge/GitHub-Repo-blue?logo=github)](https://github.com/multimodal-reasoning-lab/Bagel-Zebra-CoT)

![Bagel-Zebra-CoT Example Trace](bagel_zebra_cot_example.png)

---

## Table of Contents

* [Model Description](#model-description)
* [Usage](#usage)
* [Dataset](#dataset)
* [License](#license)
* [Citation](#citation)
* [Links](#links)

---

## Model Description

Bagel‑Zebra‑CoT is fine-tuned from [Bagel‑7B](https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT) on the Zebra‑CoT dataset. The model is trained to generate interleaved text and image traces as part of its own reasoning process.

---

## Usage

For interleaved text and image inference and training with our model, please refer to [our GitHub repository](https://github.com/multimodal-reasoning-lab/Bagel-Zebra-CoT).

For general information and other details, please refer to the [official Bagel GitHub repository](https://github.com/bytedance-seed/BAGEL).
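
Before running the inference scripts from the repository, the checkpoint files need to be available locally. A minimal sketch of fetching them with `huggingface_hub.snapshot_download` follows; the target directory `models/Bagel-Zebra-CoT` and the helper name `download_checkpoint` are assumptions chosen for illustration, so check the GitHub repository for the path its scripts actually expect.

```python
from pathlib import Path

# Repository id taken from the model card links.
REPO_ID = "multimodal-reasoning-lab/Bagel-Zebra-CoT"


def download_checkpoint(
    repo_id: str = REPO_ID,
    local_dir: str = "models/Bagel-Zebra-CoT",  # assumed location, not prescribed by the repo
) -> Path:
    """Download all files of the model repository into local_dir and return that path."""
    # Imported lazily so this module can be imported without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = Path(local_dir)
    target.mkdir(parents=True, exist_ok=True)
    downloaded = snapshot_download(repo_id=repo_id, local_dir=str(target))
    return Path(downloaded)


if __name__ == "__main__":
    # Network access and enough disk space for a 7B-class checkpoint are required.
    print(download_checkpoint())
```

The download is wrapped in a function and guarded by `__main__` so that importing the module never triggers a multi-gigabyte transfer by accident.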

---

## Dataset

*   **[Zebra‑CoT](https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT)**: 182,384 interleaved text‑image reasoning samples across 18 sub‑tasks in 4 categories (2D visual, 3D visual, scientific reasoning, visual logic & strategic games).
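
To get a feel for the data, one option is to stream a single sample with the `datasets` library instead of downloading everything. The sketch below is illustrative only: the helper name `peek_first_sample` is hypothetical, the `train` split is an assumption, and the configuration names are discovered at runtime rather than hard-coded because this card does not list them.

```python
# Dataset id taken from the model card links.
DATASET_ID = "multimodal-reasoning-lab/Zebra-CoT"


def peek_first_sample(dataset_id: str = DATASET_ID, split: str = "train"):
    """Return the first configuration name and the field names of its first sample."""
    # Imported lazily so this module can be imported without datasets installed.
    from datasets import get_dataset_config_names, load_dataset

    # Ask the Hub which configurations exist instead of guessing their names.
    config_names = get_dataset_config_names(dataset_id)
    first_config = config_names[0]

    # streaming=True iterates over the data without downloading the full dataset.
    # The split name is an assumption; adjust it if the dataset uses another one.
    stream = load_dataset(dataset_id, first_config, split=split, streaming=True)
    sample = next(iter(stream))
    return first_config, sorted(sample.keys())


if __name__ == "__main__":
    # Requires network access to the Hugging Face Hub.
    config, fields = peek_first_sample()
    print(f"config: {config}")
    print(f"fields: {fields}")
```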

---

## License

Bagel‑Zebra‑CoT is licensed under the Apache 2.0 license. It is fine-tuned from [ByteDance-Seed/BAGEL-7B-MoT](https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT), which was itself fine-tuned from [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) and [siglip-so400m-14-384-flash-attn2](https://huggingface.co/HuggingFaceM4/siglip-so400m-14-384-flash-attn2) and uses the [FLUX.1-schnell VAE](https://huggingface.co/black-forest-labs/FLUX.1-schnell); all of these are released under Apache 2.0.

---

## Citation

If you use this model, please cite:

```bibtex
@misc{li2025zebracot,
      title={Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning}, 
      author={Ang Li and Charles Wang and Kaiyu Yue and Zikui Cai and Ollie Liu and Deqing Fu and Peng Guo and Wang Bill Zhu and Vatsal Sharan and Robin Jia and Willie Neiswanger and Furong Huang and Tom Goldstein and Micah Goldblum},
      year={2025},
      eprint={2507.16746},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.16746}, 
}
```

---

## Links

*   **Project Page**: [https://multimodal-reasoning-lab.github.io/Zebra-CoT/](https://multimodal-reasoning-lab.github.io/Zebra-CoT/)
*   **Model on Hugging Face**: [https://huggingface.co/multimodal-reasoning-lab/Bagel-Zebra-CoT](https://huggingface.co/multimodal-reasoning-lab/Bagel-Zebra-CoT)
*   **Dataset on Hugging Face**: [https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT](https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT)
*   **Code on GitHub**: [https://github.com/multimodal-reasoning-lab/Bagel-Zebra-CoT](https://github.com/multimodal-reasoning-lab/Bagel-Zebra-CoT)

---