|
|
--- |
|
|
base_model: |
|
|
- ByteDance-Seed/BAGEL-7B-MoT |
|
|
datasets: |
|
|
- multimodal-reasoning-lab/Zebra-CoT |
|
|
license: apache-2.0 |
|
|
pipeline_tag: any-to-any |
|
|
library_name: transformers |
|
|
--- |
|
|
|
|
|
# Bagel‑Zebra‑CoT |
|
|
|
|
|
> A vision–language model fine‑tuned on the Zebra‑CoT dataset to generate high‑quality interleaved visual chain‑of‑thought reasoning traces.
|
|
|
|
|
[Paper](https://arxiv.org/abs/2507.16746)


[Dataset](https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT)


[Model](https://huggingface.co/multimodal-reasoning-lab/Bagel-Zebra-CoT)


[Code](https://github.com/multimodal-reasoning-lab/Bagel-Zebra-CoT)
|
|
|
|
|
 |
|
|
|
|
|
--- |
|
|
|
|
|
## Table of Contents |
|
|
|
|
|
* [Model Description](#model-description) |
|
|
* [Usage](#usage) |
|
|
* [Dataset](#dataset) |
|
|
* [License](#license) |
|
|
* [Citation](#citation) |
|
|
* [Links](#links) |
|
|
|
|
|
--- |
|
|
|
|
|
## Model Description |
|
|
|
|
|
Bagel‑Zebra‑CoT is fine‑tuned from [Bagel‑7B](https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT) on the [Zebra‑CoT](https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT) dataset. The model is trained to natively generate interleaved text and image traces as part of its own reasoning process, producing intermediate visual steps alongside textual thoughts.
|
|
|
|
|
--- |
|
|
|
|
|
## Usage |
|
|
|
|
|
For interleaved text and image inference and training with our model, please refer to [our GitHub repository](https://github.com/multimodal-reasoning-lab/Bagel-Zebra-CoT). |
|
|
|
|
|
For general information about the base model, please refer to the [official Bagel GitHub repository](https://github.com/bytedance-seed/BAGEL).
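Conceptually, an interleaved visual chain‑of‑thought trace alternates textual thoughts with generated intermediate images until a final textual answer is reached. The sketch below is purely illustrative (plain Python, not the model's actual inference API; the `thought`/`image` schema and placeholder image tokens are assumptions for illustration — see the GitHub repository for real usage):

```python
# Hypothetical sketch of an interleaved visual chain-of-thought trace.
# The schema here is illustrative only and is NOT the model's real API;
# refer to the Bagel-Zebra-CoT GitHub repository for actual inference code.

def build_trace(steps):
    """Flatten (thought, optional image) steps into an interleaved trace.

    Each step is a dict with a textual 'thought' and, optionally, a
    generated intermediate 'image' (represented by a placeholder token).
    """
    trace = []
    for step in steps:
        trace.append({"type": "text", "content": step["thought"]})
        if step.get("image") is not None:
            trace.append({"type": "image", "content": step["image"]})
    return trace

# Example: a toy visual-logic problem solved in three reasoning steps,
# the first two accompanied by generated intermediate images.
steps = [
    {"thought": "Rotate the shape 90 degrees clockwise.", "image": "<img_1>"},
    {"thought": "Compare the result with option B.", "image": "<img_2>"},
    {"thought": "Option B matches, so the answer is B.", "image": None},
]

trace = build_trace(steps)
print([item["type"] for item in trace])
# → ['text', 'image', 'text', 'image', 'text']
```

The key property is the alternation: each visual step is emitted inline between textual thoughts, and the trace ends with a final textual answer rather than an image.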
|
|
|
|
|
--- |
|
|
|
|
|
## Dataset |
|
|
|
|
|
* **[Zebra‑CoT](https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT)**: 182,384 interleaved text‑image reasoning samples across 18 sub‑tasks in 4 categories (2D visual, 3D visual, scientific reasoning, visual logic & strategic games). |
|
|
|
|
|
--- |
|
|
|
|
|
## License |
|
|
|
|
|
Bagel‑Zebra‑CoT is licensed under the Apache 2.0 license. It is fine‑tuned from [ByteDance-Seed/BAGEL-7B-MoT](https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT), which was itself fine‑tuned from [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) and [siglip-so400m-14-384-flash-attn2](https://huggingface.co/HuggingFaceM4/siglip-so400m-14-384-flash-attn2), and it uses the [FLUX.1-schnell VAE](https://huggingface.co/black-forest-labs/FLUX.1-schnell), all released under Apache 2.0.
|
|
|
|
|
--- |
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{li2025zebracot, |
|
|
title={Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning}, |
|
|
author={Ang Li and Charles Wang and Kaiyu Yue and Zikui Cai and Ollie Liu and Deqing Fu and Peng Guo and Wang Bill Zhu and Vatsal Sharan and Robin Jia and Willie Neiswanger and Furong Huang and Tom Goldstein and Micah Goldblum}, |
|
|
year={2025}, |
|
|
eprint={2507.16746}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.CV}, |
|
|
url={https://arxiv.org/abs/2507.16746}, |
|
|
} |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## Links |
|
|
|
|
|
* **Project Page**: [https://multimodal-reasoning-lab.github.io/Zebra-CoT/](https://multimodal-reasoning-lab.github.io/Zebra-CoT/) |
|
|
* **Model on Hugging Face**: [https://huggingface.co/multimodal-reasoning-lab/Bagel-Zebra-CoT](https://huggingface.co/multimodal-reasoning-lab/Bagel-Zebra-CoT) |
|
|
* **Dataset on Hugging Face**: [https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT](https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT) |
|
|
* **Code on GitHub**: [https://github.com/multimodal-reasoning-lab/Bagel-Zebra-CoT](https://github.com/multimodal-reasoning-lab/Bagel-Zebra-CoT) |
|
|
|
|
|
--- |