base_model:
- ByteDance-Seed/BAGEL-7B-MoT
datasets:
- multimodal-reasoning-lab/Zebra-CoT
license: apache-2.0
pipeline_tag: any-to-any
library_name: transformers
Bagel‑Zebra‑CoT
A vision–language model fine‑tuned on the Zebra‑CoT dataset to generate high-quality interleaved visual chain‑of‑thought.
Model Description
Bagel‑Zebra‑CoT is fine-tuned from Bagel‑7B (ByteDance-Seed/BAGEL-7B-MoT) on the Zebra‑CoT dataset. The model is trained to generate interleaved text and image traces as part of its own reasoning process.
Usage
To run interleaved text-and-image inference or to train with this model, see our GitHub repository (https://github.com/multimodal-reasoning-lab/Bagel-Zebra-CoT).
For general information and other details, please refer to the official Bagel GitHub repository.
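Before running the repository's scripts, the checkpoint files usually need to be available locally. The following is a minimal sketch, assuming the `huggingface_hub` package is installed; the helper names `local_dir_for` and `download_checkpoint` and the `models/` folder layout are hypothetical and not part of the official code. It only fetches the files; inference and training still go through the GitHub repository.

```python
from pathlib import Path

# Model id taken from the Links section of this card.
REPO_ID = "multimodal-reasoning-lab/Bagel-Zebra-CoT"


def local_dir_for(repo_id: str, root: str = "models") -> Path:
    """Map a Hub repo id to a local folder (hypothetical layout: <root>/<repo name>)."""
    return Path(root) / repo_id.split("/")[-1]


def download_checkpoint(repo_id: str = REPO_ID, root: str = "models") -> Path:
    """Download every file in the model repo and return the local folder."""
    # Imported lazily so this module loads even if huggingface_hub is missing.
    from huggingface_hub import snapshot_download

    target = local_dir_for(repo_id, root)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target


if __name__ == "__main__":
    # Requires network access and enough disk space for the full checkpoint.
    print(download_checkpoint())
```

The returned folder can then be passed to whatever model-path argument the repository's scripts expect; check the repository for the exact option name.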
Dataset
- Zebra‑CoT: 182,384 interleaved text‑image reasoning samples across 18 sub‑tasks in 4 categories (2D visual, 3D visual, scientific reasoning, visual logic & strategic games).
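To look at a sample without downloading the full dataset, one option is to stream it. This is a rough sketch, assuming the `datasets` library is installed and the dataset repo is readable with it. Because this card does not list configuration or split names, the code discovers them at runtime instead of guessing; the helper name `peek_first_sample` is hypothetical.

```python
# Dataset id taken from the metadata of this card.
DATASET_ID = "multimodal-reasoning-lab/Zebra-CoT"


def peek_first_sample(dataset_id: str = DATASET_ID):
    """Return (config, split, field names) for the first streamed sample."""
    # Imported lazily so this module loads even if datasets is missing.
    from datasets import (
        get_dataset_config_names,
        get_dataset_split_names,
        load_dataset,
    )

    # Discover configurations and splits rather than hard-coding names.
    config = get_dataset_config_names(dataset_id)[0]
    split = get_dataset_split_names(dataset_id, config)[0]

    # streaming=True yields samples lazily instead of downloading everything.
    stream = load_dataset(dataset_id, config, split=split, streaming=True)
    first = next(iter(stream))
    return config, split, sorted(first.keys())


if __name__ == "__main__":
    # Requires network access.
    print(peek_first_sample())
```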
License
Bagel‑Zebra‑CoT is licensed under the Apache 2.0 license. It is fine-tuned from ByteDance-Seed/BAGEL-7B-MoT, which was itself fine-tuned from the Qwen2.5-7B-Instruct and siglip-so400m-14-384-flash-attn2 models and uses the FLUX.1-schnell VAE; all of these are released under Apache 2.0.
Citation
If you use this model, please cite:
@misc{li2025zebracot,
title={Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning},
author={Ang Li and Charles Wang and Kaiyu Yue and Zikui Cai and Ollie Liu and Deqing Fu and Peng Guo and Wang Bill Zhu and Vatsal Sharan and Robin Jia and Willie Neiswanger and Furong Huang and Tom Goldstein and Micah Goldblum},
year={2025},
eprint={2507.16746},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.16746},
}
Links
- Project Page: https://multimodal-reasoning-lab.github.io/Zebra-CoT/
- Model on Hugging Face: https://huggingface.co/multimodal-reasoning-lab/Bagel-Zebra-CoT
- Dataset on Hugging Face: https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT
- Code on GitHub: https://github.com/multimodal-reasoning-lab/Bagel-Zebra-CoT