Create README.md
README.md
ADDED
@@ -0,0 +1,119 @@
---
license: llama3
base_model: meta-llama/Meta-Llama-3-8B-Instruct
library_name: transformers
tags:
- AIGC
- LLaVA
datasets:
- OpenFace-CQUPT/FaceCaption-15M
metrics:
- accuracy
pipeline_tag: visual-question-answering
---
# Human-LLaVA-8B

## DEMO

<video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/TpN2t19Poe5YbHHP8uN7_.mp4"></video>

## Introduction

Human-related vision and language tasks are widely applied across various social scenarios. Recent studies demonstrate that large vision-language models can enhance performance on various downstream visual-language understanding tasks. However, general-domain models often do not perform well in specialized fields. In this study, we train a domain-specific large language-vision model, Human-LLaVA, which aims to provide a unified multimodal language-vision model for human-related tasks.

Specifically, (1) we first construct **a large-scale, high-quality human-related image-text (caption) dataset** extracted from the Internet for domain-specific alignment in the first stage (Coming soon); (2) we also construct **multi-granularity captions for human-related images** (Coming soon), covering the human face, the human body, and the whole image, which are used to fine-tune a large language model. Lastly, we evaluate our model on a series of downstream tasks, where **Human-LLaVA** achieved the best overall performance among multimodal models of similar scale. In particular, it exhibits the best performance on a series of human-related tasks, significantly surpassing similar models and ChatGPT-4o. We believe that the Human-LLaVA model and the datasets presented in this work can promote research in related fields.

## Result
Human-LLaVA performs well in both general and specialized domains.

## News and Updates 🔥🔥🔥
* Oct. 23, 2024. **🤗[HumanCaption-HQ-311K](https://huggingface.co/datasets/OpenFace-CQUPT/HumanCaption-HQ-311K) is released!**
* Sep. 12, 2024. **🤗[HumanCaption-10M](https://huggingface.co/datasets/OpenFace-CQUPT/HumanCaption-10M) is released!**
* Sep. 8, 2024. **🤗[HumanVLM](https://huggingface.co/OpenFace-CQUPT/Human_LLaVA) is released!**

## 🤗 Transformers
To run inference with Human-LLaVA, a few lines of code are enough, as demonstrated below. Please make sure that you are using the latest code.
``` python
import requests
from PIL import Image

import torch
from transformers import AutoProcessor, AutoModelForPreTraining

model_id = "OpenFace-CQUPT/Human_LLaVA"
cuda = 0  # index of the CUDA device to run on

# Load the model in half precision and move it to the GPU.
model = AutoModelForPreTraining.from_pretrained(model_id, torch_dtype=torch.float16).to(cuda)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Build the prompt; <image> marks where the image features are inserted.
text = "Please describe this picture"
prompt = "USER: <image>\n" + text + "\nASSISTANT:"

# Load a local image; use the commented line instead when image_file is a URL.
image_file = "./test1.jpg"
raw_image = Image.open(image_file).convert("RGB")
# raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs = processor(images=raw_image, text=prompt, return_tensors="pt").to(cuda, torch.float16)

# Greedy decoding; the decoded string contains the prompt followed by the answer.
output = model.generate(**inputs, max_new_tokens=400, do_sample=False)
predict = processor.decode(output[0], skip_special_tokens=True)
print(predict)
```
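
If only the model's reply is needed, the prompt template and the decoded string from the snippet above can be handled with two small helpers. This is a minimal sketch, not part of the released code: `build_prompt` and `extract_answer` are hypothetical names, and it assumes the decoded text still contains the `ASSISTANT:` marker from the prompt, which is typical when the full output sequence is decoded. For example, `extract_answer(predict)` could replace the final `print(predict)`.

```python
MARKER = "ASSISTANT:"


def build_prompt(question):
    """Wrap a question in the USER/ASSISTANT template shown in this README."""
    return "USER: <image>\n" + question.strip() + "\n" + MARKER


def extract_answer(decoded):
    """Return the text after the first ASSISTANT: marker.

    Falls back to the whole (stripped) string when the marker is absent.
    """
    _, found, answer = decoded.partition(MARKER)
    return answer.strip() if found else decoded.strip()
```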

Our training code has been released publicly on GitHub: [ddw2AIGROUP2CQUPT/Human-LLaVA-8B](https://github.com/ddw2AIGROUP2CQUPT/Human-LLaVA-8B)
## Get the Dataset
#### Dataset Example

#### Domain Alignment Stage
[HumanCaption-10M](https://huggingface.co/datasets/OpenFace-CQUPT/HumanCaption-10M) (self-constructed) is released!

#### Instruction Tuning Stage
**All public datasets have been filtered, and we will consider publishing all processed text in the future.**

[HumanCaption-HQ](https://huggingface.co/datasets/OpenFace-CQUPT/HumanCaption-HQ-311K) (self-constructed) is released!

[FaceCaptionA](https://huggingface.co/datasets/OpenFace-CQUPT/FaceCaption-15M) (self-constructed) is released!

CelebA: https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html

ShareGPT4V: https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md

LLaVA-Instruct_zh: https://huggingface.co/datasets/openbmb/llava_zh

verified_ref3rec: https://huggingface.co/datasets/lucasjin/refcoco/blob/main/ref3rec.json

verified_ref3reg: https://huggingface.co/datasets/lucasjin/refcoco/blob/main/ref3rec.json

verified_shikra: https://github.com/shikras/shikra

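
The self-constructed datasets listed above are hosted on the Hugging Face Hub, so they can probably be inspected with the `datasets` library in streaming mode without downloading everything first. The sketch below is an illustration under stated assumptions, not an official loader: `DATASET_REPOS` and `peek` are hypothetical names, the split name `train` is a guess, and access may require accepting the dataset terms and logging in.

```python
from itertools import islice

# Repository ids taken from the links in this README.
DATASET_REPOS = {
    "domain_alignment": "OpenFace-CQUPT/HumanCaption-10M",
    "instruction_tuning": "OpenFace-CQUPT/HumanCaption-HQ-311K",
}


def peek(stage, n=2, split="train"):
    """Stream the first n records of the dataset used in the given stage.

    The split name is an assumption; adjust it to what the dataset provides.
    """
    # Imported lazily so the mapping above is usable without the library.
    from datasets import load_dataset

    stream = load_dataset(DATASET_REPOS[stage], split=split, streaming=True)
    return list(islice(stream, n))
```

Calling `peek("instruction_tuning")` and printing the keys of each returned record is a quick way to check which columns a dataset actually exposes before writing a training pipeline around it.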
## Citation

```
@misc{dai2024humanvlmfoundationhumanscenevisionlanguage,
  title={HumanVLM: Foundation for Human-Scene Vision-Language Model},
  author={Dawei Dai and Xu Long and Li Yutang and Zhang Yuanhui and Shuyin Xia},
  year={2024},
  eprint={2411.03034},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2411.03034},
}
```

## Contact

mailto: [[email protected]](mailto:[email protected]) or [[email protected]](mailto:[email protected])