ponytail committed on
Commit 62b657a · verified · 1 Parent(s): afdfd04

Create README.md

Files changed (1)
  1. README.md +119 -0
README.md ADDED
@@ -0,0 +1,119 @@
+ ---
+ license: llama3
+ base_model: meta-llama/Meta-Llama-3-8B-Instruct
+ library_name: transformers
+ tags:
+ - AIGC
+ - LLaVA
+ datasets:
+ - OpenFace-CQUPT/FaceCaption-15M
+ metrics:
+ - accuracy
+ pipeline_tag: visual-question-answering
+ ---
+ # Human-LLaVA-8B
+
+ ## DEMO
+
+
+ <video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/TpN2t19Poe5YbHHP8uN7_.mp4"></video>
+
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/1xS27bvECvGTKntvOa1SQ.png)
+
+ ### Introduction
+
+ Human-related vision and language tasks are widely applied across various social scenarios. Recent studies show that large vision-language models can improve performance on a range of downstream vision-language understanding tasks. However, general-domain models often perform poorly in specialized fields. In this study, we train a domain-specific large vision-language model, Human-LLaVA, which aims to provide a unified multimodal vision-language model for human-related tasks.
+
+ Specifically, (1) we first construct **a large-scale, high-quality human-related image-text (caption) dataset** extracted from the Internet for domain-specific alignment in the first stage (Coming soon); (2) we also construct **multi-granularity captions for human-related images** (Coming soon), covering the human face, the human body, and the whole image, which are used to fine-tune a large language model. Finally, we evaluate our model on a series of downstream tasks, where **Human-LLaVA** achieves the best overall performance among multimodal models of similar scale. In particular, it performs best on a series of human-related tasks, significantly surpassing similar models and ChatGPT-4o. We believe that the Human-LLaVA model and the datasets presented in this work can promote research in related fields.
+
+
+ ## Result
+ Human-LLaVA performs well in both general and specialized domains.
+
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/X-712oVUBPXbfLcAz83fb.png)
+
+ ## News and Updates 🔥🔥🔥
+ * Oct. 23, 2024. **🤗 [HumanCaption-HQ-311K](https://huggingface.co/datasets/OpenFace-CQUPT/HumanCaption-HQ-311K) is released! 👏👏👏**
+ * Sep. 12, 2024. **🤗 [HumanCaption-10M](https://huggingface.co/datasets/OpenFace-CQUPT/HumanCaption-10M) is released! 👏👏👏**
+ * Sep. 8, 2024. **🤗 [HumanVLM](https://huggingface.co/OpenFace-CQUPT/Human_LLaVA) is released! 👏👏👏**
+
+
+
+ ## 🤗 Transformers
+ To run inference with Human-LLaVA, a few lines of code suffice, as shown below. Make sure you are using the latest code.
+ ``` python
+ import requests
+ from PIL import Image
+
+ import torch
+ from transformers import AutoProcessor, AutoModelForPreTraining
+
+ # Load the model in half precision and move it to the GPU with index `cuda`.
+ model_id = "OpenFace-CQUPT/Human_LLaVA"
+ cuda = 0
+ model = AutoModelForPreTraining.from_pretrained(model_id, torch_dtype=torch.float16).to(cuda)
+ # The processor handles both image preprocessing and text tokenization.
+ processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
+
+ # Build the prompt; <image> marks where the image features are inserted.
+ text = "Please describe this picture"
+ prompt = "USER: <image>\n" + text + "\nASSISTANT:"
+ image_file = "./test1.jpg"
+ raw_image = Image.open(image_file)
+ # raw_image = Image.open(requests.get(image_file, stream=True).raw)  # use this line instead when image_file is a URL
+ inputs = processor(images=raw_image, text=prompt, return_tensors='pt').to(cuda, torch.float16)
+ # Greedy decoding (do_sample=False), generating at most 400 new tokens.
+ output = model.generate(**inputs, max_new_tokens=400, do_sample=False)
+ predict = processor.decode(output[0], skip_special_tokens=True)
+ print(predict)
+ ```
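Because `processor.decode` is applied to the full generated sequence, the printed text typically contains the prompt as well as the reply. The following is a minimal sketch of separating the two, assuming the `USER: ... ASSISTANT:` template from the snippet above. `build_prompt` and `extract_answer` are hypothetical helper names introduced here for illustration; they are not part of Transformers or of the Human-LLaVA repository.

```python
def build_prompt(text: str) -> str:
    """Wrap a user question in the USER/ASSISTANT template used above."""
    return "USER: <image>\n" + text + "\nASSISTANT:"


def extract_answer(decoded: str, marker: str = "ASSISTANT:") -> str:
    """Return the text after the first marker, or the whole string if absent."""
    _, found, tail = decoded.partition(marker)
    return tail.strip() if found else decoded.strip()


if __name__ == "__main__":
    prompt = build_prompt("Please describe this picture")
    # Stand-in for the processor.decode(...) result; a real run needs the model.
    decoded = "USER: \nPlease describe this picture\nASSISTANT: A woman smiling."
    print(extract_answer(decoded))
```

Splitting on the first marker is a simplification that suits single-turn prompts; a question that itself contains `ASSISTANT:` would need different handling.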
+
+ Our training code has been released publicly on GitHub: [ddw2AIGROUP2CQUPT/Human-LLaVA-8B (github.com)](https://github.com/ddw2AIGROUP2CQUPT/Human-LLaVA-8B)
+ ## Get the Dataset
+ #### Dataset Example
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/-gTV7ym_gmNmJqNRDzlCx.png)
+
+ #### Domain Alignment Stage
+ [HumanCaption-10M](https://huggingface.co/datasets/OpenFace-CQUPT/HumanCaption-10M) (self-constructed): released!
+
+ #### Instruction Tuning Stage
+ **All public datasets have been filtered, and we will consider publishing all processed text in the future.**
+
+ [HumanCaption-HQ](https://huggingface.co/datasets/OpenFace-CQUPT/HumanCaption-HQ-311K) (self-constructed): released!
+
+ [FaceCaptionA](https://huggingface.co/datasets/OpenFace-CQUPT/FaceCaption-15M) (self-constructed): released!
+
+ CelebA: https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
+
+ ShareGPT4V: https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md
+
+ LLaVA-Instruct_zh: https://huggingface.co/datasets/openbmb/llava_zh
+
+ verified_ref3rec: https://huggingface.co/datasets/lucasjin/refcoco/blob/main/ref3rec.json
+
+ verified_ref3reg: https://huggingface.co/datasets/lucasjin/refcoco/blob/main/ref3rec.json
+
+ verified_shikra: https://github.com/shikras/shikra
+
+
+
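The self-constructed datasets listed above are hosted on the Hugging Face Hub, so they can presumably be inspected with the `datasets` library. The following is a minimal sketch, assuming that `datasets` is installed, that each repository exposes a `train` split, and that streaming works for it; none of this is confirmed by this README, and some repositories may require accepting access terms first. `STAGE_DATASETS` and `peek` are hypothetical names introduced for illustration.

```python
from itertools import islice

# Hub repository IDs taken from the links above, grouped by training stage.
STAGE_DATASETS = {
    "domain_alignment": ["OpenFace-CQUPT/HumanCaption-10M"],
    "instruction_tuning": [
        "OpenFace-CQUPT/HumanCaption-HQ-311K",
        "OpenFace-CQUPT/FaceCaption-15M",
    ],
}


def peek(repo_id: str, n: int = 2, split: str = "train") -> list:
    """Stream the first n records of a Hub dataset without a full download."""
    # Imported lazily so the rest of the sketch runs without the package.
    from datasets import load_dataset

    stream = load_dataset(repo_id, split=split, streaming=True)
    return list(islice(stream, n))


if __name__ == "__main__":
    for stage, repos in STAGE_DATASETS.items():
        print(stage, "->", ", ".join(repos))
    # Uncomment to fetch records (needs network access; split name is assumed):
    # for record in peek("OpenFace-CQUPT/HumanCaption-HQ-311K"):
    #     print(sorted(record.keys()))
```

Printing only the record keys is a cautious first step, since the column names of these datasets are not described here.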
+ ## Citation
+
+ ```
+ @misc{dai2024humanvlmfoundationhumanscenevisionlanguage,
+       title={HumanVLM: Foundation for Human-Scene Vision-Language Model},
+       author={Dawei Dai and Xu Long and Li Yutang and Zhang Yuanhui and Shuyin Xia},
+       year={2024},
+       eprint={2411.03034},
+       archivePrefix={arXiv},
+       primaryClass={cs.AI},
+       url={https://arxiv.org/abs/2411.03034},
+ }
+ ```
+
+ ## Contact
+