---
base_model:
- ByteDance-Seed/BAGEL-7B-MoT
datasets:
- multimodal-reasoning-lab/Zebra-CoT
license: apache-2.0
pipeline_tag: any-to-any
library_name: transformers
---
# Bagel‑Zebra‑CoT
> A vision–language model fine‑tuned on the Zebra‑CoT dataset to generate high-quality interleaved visual chain‑of‑thought.
[Paper](https://arxiv.org/abs/2507.16746)
[Dataset](https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT)
[Model](https://huggingface.co/multimodal-reasoning-lab/Bagel-Zebra-CoT)
[Code](https://github.com/multimodal-reasoning-lab/Bagel-Zebra-CoT)

---
## Table of Contents
* [Model Description](#model-description)
* [Usage](#usage)
* [Dataset](#dataset)
* [License](#license)
* [Citation](#citation)
* [Links](#links)
---
## Model Description
Bagel‑Zebra‑CoT is fine-tuned from [Bagel‑7B](https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT) on the Zebra‑CoT dataset. The model is trained to generate interleaved text and image traces as part of its own reasoning process.
---
## Usage
For interleaved text and image inference and training with our model, please refer to [our GitHub repository](https://github.com/multimodal-reasoning-lab/Bagel-Zebra-CoT).
For general information and other details, please refer to the [official Bagel GitHub repository](https://github.com/bytedance-seed/BAGEL).
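
The repositories above remain the authoritative guide to inference and training. As a minimal sketch of the preliminary step only (fetching the checkpoint files locally), the snippet below assumes the `huggingface_hub` package is installed. The helper names `build_download_args` and `download_checkpoint` and the `models/` target directory are hypothetical choices for illustration, not part of the official code.

```python
from pathlib import Path

# Repo id taken from this model card.
DEFAULT_REPO_ID = "multimodal-reasoning-lab/Bagel-Zebra-CoT"


def build_download_args(repo_id=DEFAULT_REPO_ID, local_root="models"):
    """Build keyword arguments for huggingface_hub.snapshot_download.

    The local directory layout (<local_root>/<model name>) is an assumption
    made for this sketch; adjust it to whatever the inference scripts expect.
    """
    local_dir = Path(local_root) / repo_id.split("/")[-1]
    return {"repo_id": repo_id, "local_dir": str(local_dir)}


def download_checkpoint(**overrides):
    """Download the full model repository and return its local path."""
    # Imported lazily so build_download_args works without the package.
    from huggingface_hub import snapshot_download

    args = build_download_args()
    args.update(overrides)
    return snapshot_download(**args)


# Example call (downloads several GB, so it is left commented out):
# local_path = download_checkpoint()
```

The returned path can then be passed to the inference scripts in the GitHub repository wherever they expect a model directory.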
---
## Dataset
* **[Zebra‑CoT](https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT)**: 182,384 interleaved text‑image reasoning samples across 18 sub‑tasks in 4 categories (2D visual, 3D visual, scientific reasoning, visual logic & strategic games).
---
## License
Bagel‑Zebra‑CoT is licensed under the Apache 2.0 license. It is fine-tuned from [ByteDance-Seed/BAGEL-7B-MoT](https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT), which was itself fine-tuned from the [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) and [siglip-so400m-14-384-flash-attn2](https://huggingface.co/HuggingFaceM4/siglip-so400m-14-384-flash-attn2) models and uses the [FLUX.1-schnell VAE model](https://huggingface.co/black-forest-labs/FLUX.1-schnell), all under Apache 2.0.
---
## Citation
If you use this model, please cite:
```bibtex
@misc{li2025zebracot,
title={Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning},
author={Ang Li and Charles Wang and Kaiyu Yue and Zikui Cai and Ollie Liu and Deqing Fu and Peng Guo and Wang Bill Zhu and Vatsal Sharan and Robin Jia and Willie Neiswanger and Furong Huang and Tom Goldstein and Micah Goldblum},
year={2025},
eprint={2507.16746},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.16746},
}
```
---
## Links
* **Project Page**: [https://multimodal-reasoning-lab.github.io/Zebra-CoT/](https://multimodal-reasoning-lab.github.io/Zebra-CoT/)
* **Model on Hugging Face**: [https://huggingface.co/multimodal-reasoning-lab/Bagel-Zebra-CoT](https://huggingface.co/multimodal-reasoning-lab/Bagel-Zebra-CoT)
* **Dataset on Hugging Face**: [https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT](https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT)
* **Code on GitHub**: [https://github.com/multimodal-reasoning-lab/Bagel-Zebra-CoT](https://github.com/multimodal-reasoning-lab/Bagel-Zebra-CoT)
---