---
base_model:
- zhumj34/Mipha-3B
datasets:
- liuhaotian/LLaVA-Instruct-150K
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- transformers
- llm
- lmm
- conversational
- olympus
- llava
- image-text-to-text
- vision-language
---
<p align="center">
<img src="https://github.com/yuanze-lin/Olympus/blob/main/asset/olympus.png?raw=true" alt="icon" width="150" height="150" style="vertical-align:middle; margin-right:5px;" />
</p>
# Olympus: A Universal Task Router for Computer Vision Tasks (CVPR 2025)
[Paper](https://arxiv.org/pdf/2412.09612) | [Project Page](https://yuanze-lin.me/Olympus_page/) | [Model on Hugging Face](https://huggingface.co/Yuanze/Olympus)
Official implementation of "Olympus: A Universal Task Router for Computer Vision Tasks"
If you find our project helpful for your research, please kindly give us a star and cite our paper.
## News
- [ ] Release the code for integration with task-specific models.
- [x] Release the training & inference code.
- [x] Release Olympus datasets.
- [x] Release the model of Olympus.
## Overview
<p align="center">
<img src="https://github.com/yuanze-lin/Olympus/blob/main/asset/overview.png?raw=true" alt="Overview" width="1000"/>
</p>
## Getting Started
### Environment Installation
To set up the environment, run the following in your shell:
```
git clone https://github.com/yuanze-lin/Olympus.git
cd Olympus
conda create -n olympus python==3.10 -y
conda activate olympus
pip install -r requirements.txt
```
This creates the ```olympus``` environment used in our experiments.
### Download Models & Data
We share our collected Olympus dataset as follows:
| Instruction | Link |
|---------|------|
| Olympus Task-wise Data | [Olympus_20tasks_all](https://drive.google.com/drive/folders/1m3FYHarVG8eg7X7cMAC5N5NBG-p0ymw8?usp=drive_link) |
| Olympus Fine-tuning Data | [Olympus.json](https://drive.google.com/file/d/1CMLZLa6hkVN2K1ebCcJEOaFGc2cLeLQ7/view?usp=sharing) |
- ```Olympus_20tasks_all```: The ```20 individual tasks``` folder contains 20 JSON files, each corresponding to a specific task; refer to the routing token definitions in our paper to identify the task associated with each JSON file. Chain-of-action data is provided in ```coa.json```. Each of these 21 JSON files includes both training and test data.
- ```Olympus.json```: The final fine-tuning data.
(1) Download the Olympus model:
```
python download_olympus.py
```
It will save the ```Olympus``` model under the ```ckpts``` folder.
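The provided script is the recommended path. If you prefer to fetch the weights yourself, a minimal sketch using ```huggingface_hub``` is shown below; it assumes the checkpoint is the one hosted at [Yuanze/Olympus](https://huggingface.co/Yuanze/Olympus) and that ```download_olympus.py``` simply mirrors it into ```ckpts/Olympus```:
```
from huggingface_hub import snapshot_download

# Assumption: the checkpoint is hosted at Yuanze/Olympus (linked above) and should
# end up under ckpts/Olympus, matching what download_olympus.py produces.
snapshot_download(repo_id="Yuanze/Olympus", local_dir="ckpts/Olympus")
```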
(2) Download the Olympus data for fine-tuning:
```
python download_olympus_json.py
```
The JSON data will be saved as ```Olympus.json``` in the ```train_data``` folder. Note that ```Olympus.json``` combines ```llava_v1_5_mix665k.json``` with our collected data from the 20 tasks.
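To sanity-check the download, you can peek at a few records. The sketch below assumes the standard LLaVA-style conversation format (```id``` / ```image``` / ```conversations``` fields); the exact fields may differ slightly in practice:
```
import json

with open("train_data/Olympus.json", "r") as f:
    samples = json.load(f)  # expected: a list of conversation records

print(f"Total samples: {len(samples)}")

# Print the first record, assuming LLaVA-style fields (id / image / conversations).
sample = samples[0]
print("id:", sample.get("id"))
print("image:", sample.get("image"))
for turn in sample.get("conversations", []):
    print(f"[{turn['from']}] {turn['value'][:120]}")
```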
**If you want to merge the data manually, first create a ```jsons``` folder with ```mkdir jsons```, download all the JSON files from [Olympus_20tasks_all](https://drive.google.com/drive/folders/1m3FYHarVG8eg7X7cMAC5N5NBG-p0ymw8?usp=drive_link) and [llava_v1_5_mix665k.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_v1_5_mix665k.json) into it, and then run the merge script:**
```
python scripts/merge_data.py
```
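```scripts/merge_data.py``` is the supported way to build ```Olympus.json```. Conceptually, the merge concatenates the records of every file in ```jsons/``` into one list, roughly as in the sketch below; this glosses over any train/test split handling the official script may perform:
```
import glob
import json

merged = []
# Assumption: every file in jsons/ (the 20 task files, coa.json, and
# llava_v1_5_mix665k.json) is a flat JSON list of conversation records.
for path in sorted(glob.glob("jsons/*.json")):
    with open(path, "r") as f:
        records = json.load(f)
    print(f"{path}: {len(records)} records")
    merged.extend(records)

with open("train_data/Olympus.json", "w") as f:
    json.dump(merged, f)
```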
(3) Download the Mipha-3B model for fine-tuning:
```
python download_mipha_3b.py
```
It will save the ```Mipha-3B``` model under the ```ckpts``` folder.
### Inference
Run the following code for inference:
```
model_name=Olympus
MODELDIR=ckpts/$model_name
python predict.py \
--prompt "Generate an image of a fluffy orange cat lounging on a windowsill, \
with sunlight streaming through the glass and casting soft shadows to create a cozy atmosphere. \
Next, would it be possible to change the cat's color to white? This change will make it more eye-catching. \
In the following step, produce a high-resolution 3D model based on the modified image. \
At the next point, please show a video of a cat and a dog running on a playground." \
--model-path $MODELDIR \
--temperature 0 \
--conv-mode v0
```
Alternatively, you can run ```bash predict.sh``` as we did.
The prediction should look like this:
```
Input Prompt: Generate an image of a fluffy orange cat lounging on a windowsill,
with sunlight streaming through the glass and casting soft shadows to create a cozy atmosphere.
Next, would it be possible to change the cat's color to white? This change will make it more eye-catching.
In the following step, produce a high-resolution 3D model based on the modified image.
At the next point, please show a video of a cat and a dog running on a playground.
Output: <image_gen>a fluffy orange cat lounging on a windowsill, with sunlight streaming
through the glass and casting soft shadows to create a cozy atmosphere.</image_gen>
<image_edit>change the cat's color to white.</image_edit>
<3D_gen_image>produce a high-resolution 3D model based on the modified image.</3D_gen_image>
<video_gen>a cat and a dog running on a playground.</video_gen>
```
Change the ```--prompt``` to customize the input prompt as needed.
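Downstream, the routing tokens in the output (e.g. ```<image_gen>...</image_gen>```) can be extracted with a simple pattern match. The sketch below only covers the tokens shown in the example above, not the full set of 20:
```
import re

output = (
    "<image_gen>a fluffy orange cat lounging on a windowsill ...</image_gen>"
    "<image_edit>change the cat's color to white.</image_edit>"
    "<3D_gen_image>produce a high-resolution 3D model based on the modified image.</3D_gen_image>"
    "<video_gen>a cat and a dog running on a playground.</video_gen>"
)

# Capture (routing_token, sub-prompt) pairs in the order they appear.
for token, prompt in re.findall(r"<([A-Za-z0-9_]+)>(.*?)</\1>", output):
    print(f"{token}: {prompt}")
```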
### Visual Instruction Tuning
Please refer to [the LLaVA instructions](https://github.com/haotian-liu/LLaVA/blob/9a26bd1435b4ac42c282757f2c16d34226575e96/README.md#visual-instruction-tuning) to prepare the instruction tuning data. In particular, store the images from the different datasets under the ```train_data``` folder.
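Before launching fine-tuning, it can be worth checking that the expected folders are in place. The layout assumed below follows the linked LLaVA instructions and should be verified against them:
```
import os

# Assumed layout, based on the LLaVA instruction-tuning guide linked above,
# with everything placed under train_data/ as this README requires.
expected = [
    "train_data/Olympus.json",
    "train_data/coco/train2017",
    "train_data/gqa/images",
    "train_data/ocr_vqa/images",
    "train_data/textvqa/train_images",
    "train_data/vg/VG_100K",
    "train_data/vg/VG_100K_2",
]

for path in expected:
    print(("ok      " if os.path.exists(path) else "MISSING ") + path)
```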
Run the following code to fine-tune the model:
```
bash scripts/mipha/finetune.sh
```
### Evaluation
To evaluate the model's performance on different benchmarks, see [Evaluation.md](https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md).
Place the evaluation data under the ```eval``` folder; the evaluation scripts are located under ```scripts/mipha/eval/```.
For example, to test the model's performance on the VQAv2 dataset, simply run:
```
bash scripts/mipha/eval/vqav2.sh
```
## Supported Capacities (Covering 20 Tasks)
<p align="center">
<img src="https://github.com/yuanze-lin/Olympus/blob/main/asset/capacities.png?raw=true" alt="Capacity" width="1000" height="100"/>
</p>
## Diverse Applications
<p align="center">
<img src="https://github.com/yuanze-lin/Olympus/blob/main/asset/application.png?raw=true" alt="Capacity" width="1000" height="100"/>
</p>
You can find the code repository at: https://github.com/yuanze-lin/Olympus
## Citation
If you find Olympus useful for your research and applications, please cite using this BibTeX:
```
@article{lin2024olympus,
title={Olympus: A Universal Task Router for Computer Vision Tasks},
author={Lin, Yuanze and Li, Yunsheng and Chen, Dongdong and Xu, Weijian and Clark, Ronald and Torr, Philip HS},
journal={arXiv preprint arXiv:2412.09612},
year={2024}
}
```
## Acknowledgement
Our project is built upon the following foundations:
- [Mipha](https://github.com/xmoanvaf/llava-phi): An impressive open-source project for lightweight vision-language assistants
- [LLaVA](https://github.com/haotian-liu/LLaVA): A powerful open-source vision-language assistant project |