---
base_model:
- ByteDance-Seed/BAGEL-7B-MoT
datasets:
- multimodal-reasoning-lab/Zebra-CoT
license: apache-2.0
pipeline_tag: any-to-any
library_name: transformers
---
# Bagel‑Zebra‑CoT
> A vision–language model fine‑tuned on the Zebra‑CoT dataset to generate high-quality interleaved visual chain‑of‑thought reasoning.
[![Paper on ArXiv](https://img.shields.io/badge/arxiv-2507.16746-red)](https://arxiv.org/abs/2507.16746)
[![Dataset on Hugging Face](https://img.shields.io/badge/huggingface-Zebra--CoT-lightblue)](https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT)
[![Model on Hugging Face](https://img.shields.io/badge/huggingface-Bagel--Zebra--CoT-orange)](https://huggingface.co/multimodal-reasoning-lab/Bagel-Zebra-CoT)
[![GitHub](https://img.shields.io/badge/GitHub-Repo-blue?logo=github)](https://github.com/multimodal-reasoning-lab/Bagel-Zebra-CoT)
![Bagel-Zebra-CoT Example Trace](bagel_zebra_cot_example.png)
---
## Table of Contents
* [Model Description](#model-description)
* [Usage](#usage)
* [Dataset](#dataset)
* [License](#license)
* [Citation](#citation)
* [Links](#links)
---
## Model Description
Bagel‑Zebra‑CoT is fine-tuned from [Bagel‑7B](https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT) on the Zebra‑CoT dataset. The model is trained to generate interleaved text and image traces natively as part of its own reasoning process.
---
## Usage
For interleaved text and image inference and training with this model, please refer to [our GitHub repository](https://github.com/multimodal-reasoning-lab/Bagel-Zebra-CoT).
For general information and other details, please refer to the [official Bagel GitHub repository](https://github.com/bytedance-seed/BAGEL).
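Before running the inference or training code from the GitHub repository, the checkpoint can be downloaded locally. The sketch below is a minimal example using the standard `huggingface_hub` package; the interleaved inference and training entry points themselves are provided by the GitHub repository, not by this snippet, and the target directory name is only illustrative.

```python
# Minimal sketch: download the Bagel-Zebra-CoT checkpoint before using the
# inference/training scripts from the GitHub repository linked above.
from huggingface_hub import snapshot_download

# Downloads all files of the model repo into a local folder (name is arbitrary).
local_dir = snapshot_download(
    repo_id="multimodal-reasoning-lab/Bagel-Zebra-CoT",
    local_dir="Bagel-Zebra-CoT",
)
print(f"Checkpoint downloaded to: {local_dir}")
```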
---
## Dataset
* **[Zebra‑CoT](https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT)**: 182,384 interleaved text‑image reasoning samples across 18 sub‑tasks in 4 categories (2D visual, 3D visual, scientific reasoning, visual logic & strategic games).
---
## License
Bagel‑Zebra‑CoT is licensed under the Apache 2.0 license. It is fine-tuned from [ByteDance-Seed/BAGEL-7B-MoT](https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT), which was in turn fine-tuned from [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) and [siglip-so400m-14-384-flash-attn2](https://huggingface.co/HuggingFaceM4/siglip-so400m-14-384-flash-attn2), and uses the [FLUX.1-schnell VAE](https://huggingface.co/black-forest-labs/FLUX.1-schnell), all of which are released under Apache 2.0.
---
## Citation
If you use this model, please cite:
```bibtex
@misc{li2025zebracot,
  title={Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning},
  author={Ang Li and Charles Wang and Kaiyu Yue and Zikui Cai and Ollie Liu and Deqing Fu and Peng Guo and Wang Bill Zhu and Vatsal Sharan and Robin Jia and Willie Neiswanger and Furong Huang and Tom Goldstein and Micah Goldblum},
  year={2025},
  eprint={2507.16746},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.16746},
}
```
---
## Links
* **Project Page**: [https://multimodal-reasoning-lab.github.io/Zebra-CoT/](https://multimodal-reasoning-lab.github.io/Zebra-CoT/)
* **Model on Hugging Face**: [https://huggingface.co/multimodal-reasoning-lab/Bagel-Zebra-CoT](https://huggingface.co/multimodal-reasoning-lab/Bagel-Zebra-CoT)
* **Dataset on Hugging Face**: [https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT](https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT)
* **Code on GitHub**: [https://github.com/multimodal-reasoning-lab/Bagel-Zebra-CoT](https://github.com/multimodal-reasoning-lab/Bagel-Zebra-CoT)
---