---
license: apache-2.0
language:
- zh
- en
metrics:
- bleu
base_model:
- Qwen/Qwen2.5-7B-Instruct
- DeepGlint-AI/mlcd-vit-large-patch14-336
---


[[Paper]](https://arxiv.org/abs/2407.17331) [[GitHub]](https://github.com/deepglint/unicom)  


## Embodied Ability Evaluation: Performance in RoboVQA and OpenEQA


|                |                   | MLCD <br> Embodied-7B | LLaVA <br> OneVision-7B | GPT-4v | RoboMamba |
 :-- | :-- | :-: | :-: | :-: | :-: |
| RoboVQA        | BLEU1             | <span style="color:red">73.16</span>       | 38.12                   |            -              | 54.9      |
|                | BLEU2             | <span style="color:red">66.39</span>       | 33.56                   |            -              | 44.2      |
|                | BLEU3             | <span style="color:red">60.61</span>       | 31.76                   |            -              | 39.5      |
|                | BLEU4             | <span style="color:red">56.56</span>       | 30.97                   |            -              | 36.3      |
| OpenEQA        | Object State Recognition | <span style="color:red">71.83</span>   |          -               | 63.2   |            -              |
|                | Object Recognition        | <span style="color:red">49.46</span>  |          -               | 43.4   |            -              |
|                | Functional Reasoning      | 54.38                                 |          -               | <span style="color:red">57.4</span> |            -              |
|                | Spatial Understanding     | <span style="color:red">48.64</span>  |          -               | 33.6   |            -              |
|                | Attribute Recognition     | <span style="color:red">67.08</span>  |          -               | 57.2   |            -              |
|                | World Knowledge           | <span style="color:red">53.87</span>  |          -               | 50.7   |            -              |
|                | Object Localization       | <span style="color:red">43.06</span>  |          -               | 42.0   |            -              |


## General Ability Evaluation: Comparison with LLaVA OneVision-7B and GPT-4

| Dataset     | Split   | MLCD<br>Embodied-7B | LLaVA<br>OneVision-7B | GPT-4v   | GPT-4o |
| :-- | :-: | :-: | :-: | :-: | :-: |
| A12D        | test    | 79.9             | 81.4               | 78.2     | 94.2   |
| ChartQA     | test    | 83.0             | 80.0               | 78.5     | 85.7   |
| DocVQA      | test    | 91.6             | 87.5               | 88.4     | 92.8   |
| InfoVQA     | val     | 73.9             | 70.7               | -        | -      |
| InfoVQA     | test    | 70.0             | 68.8               | -        | -      |
| MMMU        | val     | 47.3             | 48.8               | 56.8     | 69.1   |
| MMStar      | test    | 58.5             | 61.7               | 57.1     | 63.9   |
| OCRBench    | -       | 749.0            | 697.0              | 656.0    | 805.0  |
| RealWorldQA | test    | 68.9             | 66.3               | 61.4     | 58.6   |
| SeedBench   | image   | 74.9             | 75.4               | 49.9     | 76.2   |
| MMbench     | en-dev  | 81.1             | 83.2               | 81.3     | 83.4   |
| MMbench     | en-test | 80.1             | 80.8               | 75.0     | -      |
| MME         | test    | 578/1603         | 418/1580           | 517/1409 | -      |

## Usage

### A. Installation

```bash
git clone https://github.com/deepglint/unicom
cd unicom

# Upgrade pip and install necessary dependencies
pip install --upgrade pip
pip install -e ".[train]"
```

### B. Inference

```bash
git clone https://github.com/deepglint/unicom
cd unicom
pip install --upgrade pip
pip install -e ".[train]"
pip install flash-attn --no-build-isolation

CUDA_VISIBLE_DEVICES=0 python infer.py --model_dir DeepGlint-AI/MLCD-Embodied-7B

# example:
# >> Enter 'exit' to end the conversation, 'reset' to clear the chat history.
# >> Enter image file paths (comma-separated): ./asserts/logo.png
# >> User: <image>What kind of animal is it in this picture?
# >> Assistant: The image features a stylized representation of a cat, characterized by its vibrant and abstract depiction.
# >> User: What color is this cat?
# >> Assistant: The cat in the image is primarily white with blue, orange and pink accents, creating a visually appealing and unique appearance.
# >> User: <image>请你介绍一下这个图片
# >> Assistant: 这是一幅充满创意的猫头艺术作品。它采用了多色渐变和抽象风格，将猫的头部描绘成一个充满活力和色彩的视觉冲击。猫的眼睛用金色渲染，显得非常有神采，
# 而粉色的鼻子则增添了一丝可爱感。整体设计融合了现代艺术与传统猫头图案，创造出一种既独特又引人入胜的视觉效果。。
```

### C. Evaluation for Embodied Ability

#### Step 1

Download raw data following [OpenEQA](https://github.com/facebookresearch/open-eqa/tree/main/data) and [RoboVQA](https://console.cloud.google.com/storage/browser/gdm-robovqa)(val part)

#### Step 2

Converting raw data into the format required for model evaluation.
```bash
# convert OpenEQA benchmark. Note: replace the paths with your own.
python llava/benchmark/make_openeqa_bmk.py

# convert RoboVQA benchmark. Note: replace the paths with your own.
python llava/benchmark/make_robovqa_bmk.py
```

#### Step 3

Make sure that your top-level directory structure should look like this:
```
|--/path/to/your/benchmarks
|  |--OpenEQA
|  |  |--openeqa_scannet.parquet
|  |  |--openeqa_hm3d.parquet
|  |--RoboVQA
|     |--robovqa.parquet
|--/path/to/your/images
   |--openeqa_val
   |  |--scannet-v0
   |  |  |--002-scannet-scene0709_00
   |  |  |--xxx-scannet-scenexxxx_xx
   |  |--hm3d-v0
   |     |--000-hm3d-BFRyYbPCCPE
   |     |--xxx-hm3d-xxxxxxxxxxx
   |--robovqa_val
      |--robovqa_221911
      |--robovqa_xxxxxx
```

#### Step 4

Run script for evaluation 
```bash
# Note: replace 'YOUR_API_KEY', 'YOUR_ENDPOINT', 'bmk_root', 'image_folder' with your own.
bash scripts/eval/eval_robo.sh /path/to/your/model
```

### D. Evaluation for General Ability

Install the evaluation tool and execute the evaluation script:
```bash
pip install lmms-eval==0.2.0
PYTHONPATH=./ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m accelerate.commands.launch \
    --main_process_port=12444 \
    --num_processes=8 \
    -m lmms_eval \
    --model llava \
    --model_args pretrained=DeepGlint-AI/MLCD-Embodied-7B,conv_template=qwen_1_5 \
    --tasks mme \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix mlcd \
    --output_path ./eval_log/
```

We would like to express our gratitude to [Huajie Tan](https://huggingface.co/tanhuajie2001), [Yumeng Wang](https://huggingface.co/devymex), [Yin Xie](https://huggingface.co/Yin-Xie) for his significant contributions to the experimental validation in MLLMs.