|
--- |
|
base_model: |
|
- zhumj34/Mipha-3B |
|
datasets: |
|
- liuhaotian/LLaVA-Instruct-150K |
|
language: |
|
- en |
|
library_name: transformers |
|
license: apache-2.0 |
|
pipeline_tag: image-text-to-text |
|
tags: |
|
- transformers |
|
- llm |
|
- lmm |
|
- conversational |
|
- olympus |
|
- llava |
|
- image-text-to-text |
|
- vision-language |
|
--- |
|
|
|
<p align="center"> |
|
<img src="https://github.com/yuanze-lin/Olympus/blob/main/asset/olympus.png?raw=true" alt="icon" width="150" height="150" style="vertical-align:middle; margin-right:5px;" /> |
|
</p> |
|
|
|
# Olympus: A Universal Task Router for Computer Vision Tasks (CVPR 2025)
|
|
|
[📄 Paper (arXiv)](https://arxiv.org/pdf/2412.09612) | [🌐 Project Page](https://yuanze-lin.me/Olympus_page/) | [🤗 Model](https://huggingface.co/Yuanze/Olympus)
|
|
|
Official implementation of "Olympus: A Universal Task Router for Computer Vision Tasks" |
|
|
|
♥️ If you find our project helpful for your research, please give us a 🌟 and cite our paper 📑
|
|
|
## 📣 News |
|
- [ ] Release the code for integration with task-specific models. |
|
- [x] Release the training & inference code. |
|
- [x] Release Olympus datasets. |
|
- [x] Release the Olympus model.
|
|
|
|
|
## 🔅 Overview |
|
|
|
<p align="center"> |
|
<img src="https://github.com/yuanze-lin/Olympus/blob/main/asset/overview.png?raw=true" alt="Overview" width="1000"/> |
|
</p> |
|
|
|
|
|
## Getting Started |
|
|
|
### 🛠️ Environment Installation |
|
To set up the environment, run the following commands in your shell:
|
``` |
|
git clone https://github.com/yuanze-lin/Olympus.git |
|
cd Olympus |
|
conda create -n olympus python==3.10 -y |
|
conda activate olympus |
|
pip install -r requirements.txt |
|
``` |
|
This creates the ```olympus``` environment used in our experiments.
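
As an optional sanity check (a minimal sketch, not part of the repository, and assuming ```requirements.txt``` installs PyTorch alongside Transformers), you can confirm the core libraries import correctly inside the ```olympus``` environment:

```
# check_env.py -- hypothetical helper, not shipped with the repo.
# Verifies that PyTorch and Transformers import correctly and reports CUDA availability.
import torch
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
```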
|
|
|
### Download Models & Data ### |
|
We share our collected Olympus dataset as follows: |
|
|
|
| Instruction Data | Link |
|
|---------|------| |
|
| Olympus Task-wise Data | [Olympus_20tasks_all](https://drive.google.com/drive/folders/1m3FYHarVG8eg7X7cMAC5N5NBG-p0ymw8?usp=drive_link) | |
|
| Olympus Fine-tuning Data | [Olympus.json](https://drive.google.com/file/d/1CMLZLa6hkVN2K1ebCcJEOaFGc2cLeLQ7/view?usp=sharing) | |
|
|
|
- ```Olympus_20tasks_all```: The ```20 individual tasks``` folder contains 20 JSON files, one per task; refer to the routing-token definitions in our paper to identify which task each file covers. The chain-of-action data is provided in ```coa.json```. Each of these 21 JSON files includes both training and test data (see the quick inspection sketch below).
|
- ```Olympus.json```: The final fine-tuning data. |
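
For a quick look at any of the downloaded files, a minimal inspection sketch (the filename path is only an example; adjust it to wherever you saved the file) is:

```
# inspect_data.py -- hypothetical snippet, not part of the repo.
# Loads one downloaded JSON file (here coa.json) and reports how many records
# it contains, without assuming a specific record schema.
import json

with open("coa.json", "r") as f:  # adjust to your download location
    data = json.load(f)

print(f"Loaded {len(data)} records from coa.json")
```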
|
|
|
|
|
(1) Download the Olympus model: |
|
``` |
|
python download_olympus.py |
|
``` |
|
It will save the ```Olympus``` model under the ```ckpts``` folder. |
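
If you prefer to fetch the checkpoint yourself, a minimal sketch using ```huggingface_hub``` (assuming ```download_olympus.py``` pulls the [Yuanze/Olympus](https://huggingface.co/Yuanze/Olympus) repository from the Hub; check the script for its exact behavior) is:

```
# Hypothetical manual alternative to download_olympus.py.
from huggingface_hub import snapshot_download

# Download the full model repository into ckpts/Olympus.
snapshot_download(repo_id="Yuanze/Olympus", local_dir="ckpts/Olympus")
```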
|
|
|
(2) Download the Olympus data for fine-tuning: |
|
``` |
|
python download_olympus_json.py |
|
``` |
|
The JSON data will be saved as ```Olympus.json``` in the ```train_data``` folder. Note that ```Olympus.json``` combines ```llava_v1_5_mix665k.json``` with our collected data from the 20 tasks.
|
|
|
**To merge the data manually instead, first create a ```jsons``` folder (```mkdir jsons```), download all the JSON files from [Olympus_20tasks_all](https://drive.google.com/drive/folders/1m3FYHarVG8eg7X7cMAC5N5NBG-p0ymw8?usp=drive_link) and [llava_v1_5_mix665k.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_v1_5_mix665k.json) into that folder, and then run the merge script:**
|
|
|
``` |
|
python scripts/merge_data.py |
|
``` |
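
If you would rather merge the files by hand, the operation amounts to concatenating the JSON lists. Below is a sketch under the assumption that each file in ```jsons``` is a flat list of LLaVA-style training records; ```scripts/merge_data.py``` remains the reference implementation.

```
# Hypothetical manual merge; scripts/merge_data.py is the reference implementation.
import json
from pathlib import Path

records = []
for path in sorted(Path("jsons").glob("*.json")):
    with open(path, "r") as f:
        part = json.load(f)  # assumed: each file is a list of samples
    records.extend(part)
    print(f"{path.name}: {len(part)} samples")

Path("train_data").mkdir(exist_ok=True)
with open("train_data/Olympus.json", "w") as f:
    json.dump(records, f)
print(f"Merged {len(records)} samples into train_data/Olympus.json")
```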
|
|
|
(3) Download the Mipha-3B model for fine-tuning: |
|
``` |
|
python download_mipha_3b.py |
|
``` |
|
It will save the ```Mipha-3B``` model under the ```ckpts``` folder. |
|
|
|
### Inference |
|
|
|
Run the following code for inference: |
|
``` |
|
model_name=Olympus |
|
MODELDIR=ckpts/$model_name |
|
|
|
python predict.py \ |
|
--prompt "Generate an image of a fluffy orange cat lounging on a windowsill, \ |
|
with sunlight streaming through the glass and casting soft shadows to create a cozy atmosphere. \ |
|
Next, would it be possible to change the cat's color to white? This change will make it more eye-catching. \ |
|
In the following step, produce a high-resolution 3D model based on the modified image. \ |
|
At the next point, please show a video of a cat and a dog running on a playground." \ |
|
--model-path $MODELDIR \ |
|
--temperature 0 \ |
|
--conv-mode v0 |
|
``` |
|
Alternatively, you can run ```bash predict.sh``` as we did. |
|
|
|
The prediction should look like this:
|
``` |
|
Input Prompt: Generate an image of a fluffy orange cat lounging on a windowsill, |
|
with sunlight streaming through the glass and casting soft shadows to create a cozy atmosphere. |
|
Next, would it be possible to change the cat's color to white? This change will make it more eye-catching. |
|
In the following step, produce a high-resolution 3D model based on the modified image. |
|
At the next point, please show a video of a cat and a dog running on a playground. |
|
|
|
Output: <image_gen>a fluffy orange cat lounging on a windowsill, with sunlight streaming |
|
through the glass and casting soft shadows to create a cozy atmosphere.</image_gen> |
|
<image_edit>change the cat's color to white.</image_edit> |
|
<3D_gen_image>produce a high-resolution 3D model based on the modified image.</3D_gen_image> |
|
<video_gen>a cat and a dog running on a playground.</video_gen> |
|
``` |
|
Change the ```--prompt``` argument to customize the input prompt as needed.
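
To hand each sub-task to a downstream model, the predicted output can be split into (routing token, sub-prompt) pairs. The following is a minimal parsing sketch of ours, not the official integration code (which, per the News section, is still to be released):

```
# Hypothetical parser for Olympus routing tokens such as <image_gen>...</image_gen>.
import re

output = (
    "<image_gen>a fluffy orange cat lounging on a windowsill, with sunlight streaming "
    "through the glass and casting soft shadows to create a cozy atmosphere.</image_gen>"
    "<image_edit>change the cat's color to white.</image_edit>"
    "<3D_gen_image>produce a high-resolution 3D model based on the modified image.</3D_gen_image>"
    "<video_gen>a cat and a dog running on a playground.</video_gen>"
)

# Match every <tag>...</tag> pair and report the routing token with its sub-prompt.
for token, prompt in re.findall(r"<([^<>/]+)>(.*?)</\1>", output):
    print(f"{token}: {prompt}")
```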
|
|
|
### Visual Instruction Tuning |
|
Please refer to the LLaVA instructions [here](https://github.com/haotian-liu/LLaVA/blob/9a26bd1435b4ac42c282757f2c16d34226575e96/README.md#visual-instruction-tuning) to prepare the instruction-tuning data. In particular, store the images from the different datasets under the ```train_data``` folder.
|
|
|
Run the following code to fine-tune the model: |
|
``` |
|
bash scripts/mipha/finetune.sh |
|
``` |
|
|
|
### Evaluation |
|
To evaluate the model's performance on different benchmarks: |
|
|
|
See [Evaluation.md](https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md). |
|
|
|
Please place the evaluation data under the ```eval``` folder. The evaluation scripts are located under ```scripts/mipha/eval/```.
|
For example, to test the model's performance on the VQAv2 dataset, simply run:
|
|
|
``` |
|
bash scripts/mipha/eval/vqav2.sh |
|
``` |
|
|
|
## 🔮 Supported Capacities (Covering 20 Tasks)
|
|
|
<p align="center"> |
|
<img src="https://github.com/yuanze-lin/Olympus/blob/main/asset/capacities.png?raw=true" alt="Capacity" width="1000" height="100"/> |
|
</p> |
|
|
|
## 🏂 Diverse Applications |
|
<p align="center"> |
|
<img src="https://github.com/yuanze-lin/Olympus/blob/main/asset/application.png?raw=true" alt="Capacity" width="1000" height="100"/> |
|
</p> |
|
|
|
You can find the code repository at: https://github.com/yuanze-lin/Olympus |
|
|
|
## Citation |
|
|
|
If you find Olympus useful for your research and applications, please cite using this BibTeX: |
|
|
|
``` |
|
@article{lin2024olympus, |
|
title={Olympus: A Universal Task Router for Computer Vision Tasks}, |
|
author={Lin, Yuanze and Li, Yunsheng and Chen, Dongdong and Xu, Weijian and Clark, Ronald and Torr, Philip HS}, |
|
journal={arXiv preprint arXiv:2412.09612}, |
|
year={2024} |
|
} |
|
``` |
|
|
|
## Acknowledgement |
|
Our project is built upon the following foundations: |
|
|
|
- [Mipha](https://github.com/xmoanvaf/llava-phi): An impressive open-source project for lightweight vision-language assistants |
|
- [LLaVA](https://github.com/haotian-liu/LLaVA): A powerful open-source vision-language assistant project |