tuandunghcmut committed
Commit b203e1e · verified · 1 Parent(s): 64a4334

Add files using upload-large-folder tool

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. LLaVA/docs/Data.md +29 -0
  2. LLaVA/llava/eval/eval_gpt_review_bench.py +121 -0
  3. LLaVA/llava/eval/eval_gpt_review_visual.py +118 -0
  4. LLaVA/llava/eval/eval_pope.py +81 -0
  5. LLaVA/llava/eval/eval_science_qa_gpt4.py +104 -0
  6. LLaVA/llava/eval/eval_textvqa.py +65 -0
  7. LLaVA/llava/eval/generate_webpage_data_from_table.py +111 -0
  8. LLaVA/llava/eval/m4c_evaluator.py +334 -0
  9. LLaVA/llava/eval/model_vqa.py +101 -0
  10. LLaVA/llava/eval/model_vqa_loader.py +144 -0
  11. LLaVA/llava/eval/model_vqa_mmbench.py +160 -0
  12. LLaVA/llava/eval/summarize_gpt_review.py +60 -0
  13. LLaVA/llava/eval/webpage/styles.css +105 -0
  14. LLaVA/llava/model/__init__.py +6 -0
  15. LLaVA/llava/model/apply_delta.py +48 -0
  16. LLaVA/llava/model/builder.py +167 -0
  17. LLaVA/llava/model/consolidate.py +29 -0
  18. LLaVA/llava/model/llava_arch.py +368 -0
  19. LLaVA/llava/model/make_delta.py +52 -0
  20. LLaVA/llava/model/utils.py +20 -0
  21. LLaVA/llava/serve/__init__.py +0 -0
  22. LLaVA/llava/serve/cli.py +126 -0
  23. LLaVA/llava/serve/controller.py +298 -0
  24. LLaVA/llava/serve/gradio_web_server.py +479 -0
  25. LLaVA/llava/serve/model_worker.py +288 -0
  26. LLaVA/llava/serve/register_worker.py +26 -0
  27. LLaVA/llava/serve/sglang_worker.py +244 -0
  28. LLaVA/llava/serve/test_message.py +62 -0
  29. LLaVA/llava/train/llama_flash_attn_monkey_patch.py +115 -0
  30. LLaVA/llava/train/llama_xformers_attn_monkey_patch.py +129 -0
  31. LLaVA/llava/train/llava_trainer.py +255 -0
  32. LLaVA/llava/train/train.py +991 -0
  33. LLaVA/llava/train/train_mem.py +4 -0
  34. LLaVA/llava/train/train_xformers.py +13 -0
  35. LLaVA/playground/data/prompts/complex_reasoning/000_caps.txt +18 -0
  36. LLaVA/playground/data/prompts/complex_reasoning/000_conv.txt +5 -0
  37. LLaVA/playground/data/prompts/complex_reasoning/001_caps.txt +18 -0
  38. LLaVA/playground/data/prompts/complex_reasoning/001_conv.txt +5 -0
  39. LLaVA/playground/data/prompts/complex_reasoning/002_caps.txt +7 -0
  40. LLaVA/playground/data/prompts/complex_reasoning/002_conv.txt +5 -0
  41. LLaVA/playground/data/prompts/complex_reasoning/system_message.txt +10 -0
  42. LLaVA/playground/data/prompts/conversation/001_caps.txt +5 -0
  43. LLaVA/playground/data/prompts/conversation/001_conv.txt +37 -0
  44. LLaVA/playground/data/prompts/detail_description/000_caps.txt +18 -0
  45. LLaVA/playground/data/prompts/detail_description/000_conv.txt +3 -0
  46. LLaVA/playground/data/prompts/detail_description/001_caps.txt +18 -0
  47. LLaVA/playground/data/prompts/detail_description/001_conv.txt +5 -0
  48. LLaVA/playground/data/prompts/detail_description/002_caps.txt +15 -0
  49. LLaVA/playground/data/prompts/detail_description/002_conv.txt +3 -0
  50. LLaVA/playground/data/prompts/detail_description/system_message.txt +7 -0
LLaVA/docs/Data.md ADDED
@@ -0,0 +1,29 @@
+ ## Data
+
+ | Data file name | Size |
+ | --- | ---: |
+ | [llava_instruct_150k.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_instruct_150k.json) | 229 MB |
+ | [llava_instruct_80k.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_instruct_80k.json) | 229 MB |
+ | [conversation_58k.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/conversation_58k.json) | 126 MB |
+ | [detail_23k.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/detail_23k.json) | 20.5 MB |
+ | [complex_reasoning_77k.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/complex_reasoning_77k.json) | 79.6 MB |
+
+ ### Pretraining Dataset
+ The pretraining dataset used in this release is a subset of the CC-3M dataset, filtered for a more balanced concept coverage distribution. Please see [here](https://huggingface.co/datasets/liuhaotian/LLaVA-CC3M-Pretrain-595K) for a detailed description of the dataset structure and how to download the images.
+
+ If you already have the CC-3M dataset on disk, the image names follow this format: `GCC_train_000000000.jpg`. You may edit the `image` field accordingly if necessary.
+
+ | Data | Chat File | Meta Data | Size |
+ | --- | --- | --- | ---: |
+ | CC-3M Concept-balanced 595K | [chat.json](https://huggingface.co/datasets/liuhaotian/LLaVA-CC3M-Pretrain-595K/blob/main/chat.json) | [metadata.json](https://huggingface.co/datasets/liuhaotian/LLaVA-CC3M-Pretrain-595K/blob/main/metadata.json) | 211 MB |
+ | LAION/CC/SBU BLIP-Caption Concept-balanced 558K | [blip_laion_cc_sbu_558k.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain/blob/main/blip_laion_cc_sbu_558k.json) | [metadata.json](#) | 181 MB |
+
+ **Important notice**: At the request of the community, and because ~15% of the images in the original CC-3M dataset are no longer accessible, we have uploaded [`images.zip`](https://huggingface.co/datasets/liuhaotian/LLaVA-CC3M-Pretrain-595K/blob/main/images.zip) to make our work easier to reproduce in the research community. It must not be used for any other purpose. The use of these images must comply with the CC-3M license. The archive may be taken down at any time at the request of the original CC-3M dataset owner or the owners of the referenced images.
+
+ ### GPT-4 Prompts
+
+ We provide our prompts and few-shot samples for GPT-4 queries to facilitate research in this domain. Please check out the [`prompts`](https://github.com/haotian-liu/LLaVA/tree/main/playground/data/prompts) folder for three kinds of questions: conversation, detail description, and complex reasoning.
+
+ They are organized as `system_message.txt` for the system message, plus pairs of `abc_caps.txt` for the few-shot sample user input and `abc_conv.txt` for the few-shot sample reference output.
+
+ Note that you may find them in different formats. For example, `conversation` is in `jsonl`, and detail description is answer-only. The format selected in our preliminary experiments works slightly better than the limited set of alternatives we tried: `jsonl`, a more natural format, and answer-only. If interested, you may try other variants or conduct a more careful study of this. Contributions are welcome!
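The prompt folder layout described in Data.md (one `system_message.txt` plus `*_caps.txt` / `*_conv.txt` pairs) maps naturally onto a chat-style message list. The following is a minimal sketch under that assumption; `build_fewshot_messages` and its `query` parameter are hypothetical names for illustration and are not part of the LLaVA repository.

```python
from pathlib import Path


def build_fewshot_messages(prompt_dir, query):
    """Assemble chat messages from a prompt folder (hypothetical helper).

    Assumes the folder holds `system_message.txt` and pairs of
    `<id>_caps.txt` (few-shot input) / `<id>_conv.txt` (reference output).
    """
    prompt_dir = Path(prompt_dir)
    messages = [{
        'role': 'system',
        'content': (prompt_dir / 'system_message.txt').read_text().strip(),
    }]
    for caps_path in sorted(prompt_dir.glob('*_caps.txt')):
        conv_name = caps_path.name.replace('_caps.txt', '_conv.txt')
        conv_path = caps_path.with_name(conv_name)
        if not conv_path.exists():
            continue  # skip few-shot inputs that have no reference output
        messages.append({'role': 'user', 'content': caps_path.read_text().strip()})
        messages.append({'role': 'assistant', 'content': conv_path.read_text().strip()})
    # The new query goes last, after all few-shot pairs.
    messages.append({'role': 'user', 'content': query})
    return messages
```

Sorting the `*_caps.txt` files keeps the few-shot order stable across runs, which matters when comparing prompt variants.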
LLaVA/llava/eval/eval_gpt_review_bench.py ADDED
@@ -0,0 +1,121 @@
+ import argparse
+ import json
+ import os
+
+ import openai
+ import time
+
+ NUM_SECONDS_TO_SLEEP = 0.5
+
+
+ def get_eval(content: str, max_tokens: int):
+     while True:
+         try:
+             response = openai.ChatCompletion.create(
+                 model='gpt-4-0314',
+                 messages=[{
+                     'role': 'system',
+                     'content': 'You are a helpful and precise assistant for checking the quality of the answer.'
+                 }, {
+                     'role': 'user',
+                     'content': content,
+                 }],
+                 temperature=0.2,  # TODO: figure out which temperature is best for evaluation
+                 max_tokens=max_tokens,
+             )
+             break
+         except openai.error.RateLimitError:
+             pass
+         except Exception as e:
+             print(e)
+         time.sleep(NUM_SECONDS_TO_SLEEP)
+
+     return response['choices'][0]['message']['content']
+
+
+ def parse_score(review):
+     try:
+         score_pair = review.split('\n')[0]
+         score_pair = score_pair.replace(',', ' ')
+         sp = score_pair.split(' ')
+         if len(sp) == 2:
+             return [float(sp[0]), float(sp[1])]
+         else:
+             print('error', review)
+             return [-1, -1]
+     except Exception as e:
+         print(e)
+         print('error', review)
+         return [-1, -1]
+
+
+ if __name__ == '__main__':
+     parser = argparse.ArgumentParser(description='ChatGPT-based QA evaluation.')
+     parser.add_argument('-q', '--question')
+     parser.add_argument('-c', '--context')
+     parser.add_argument('-a', '--answer-list', nargs='+', default=[])
+     parser.add_argument('-r', '--rule')
+     parser.add_argument('-o', '--output')
+     parser.add_argument('--max-tokens', type=int, default=1024, help='maximum number of tokens produced in the output')
+     args = parser.parse_args()
+
+     f_q = open(os.path.expanduser(args.question))
+     f_ans1 = open(os.path.expanduser(args.answer_list[0]))
+     f_ans2 = open(os.path.expanduser(args.answer_list[1]))
+     rule_dict = json.load(open(os.path.expanduser(args.rule), 'r'))
+
+     if os.path.isfile(os.path.expanduser(args.output)):
+         cur_reviews = [json.loads(line) for line in open(os.path.expanduser(args.output))]
+     else:
+         cur_reviews = []
+
+     review_file = open(os.path.expanduser(args.output), 'a')  # same path as the existence check above
+
+     context_list = [json.loads(line) for line in open(os.path.expanduser(args.context))]
+     image_to_context = {context['image']: context for context in context_list}
+
+     handles = []
+     idx = 0
+     for ques_js, ans1_js, ans2_js in zip(f_q, f_ans1, f_ans2):
+         ques = json.loads(ques_js)
+         ans1 = json.loads(ans1_js)
+         ans2 = json.loads(ans2_js)
+
+         inst = image_to_context[ques['image']]
+
+         if isinstance(inst['caption'], list):
+             cap_str = '\n'.join(inst['caption'])
+         else:
+             cap_str = inst['caption']
+
+         category = 'llava_bench_' + ques['category']
+         if category in rule_dict:
+             rule = rule_dict[category]
+         else:
+             assert False, f"Visual QA category not found in rule file: {category}."
+         prompt = rule['prompt']
+         role = rule['role']
+         content = (f'[Context]\n{cap_str}\n\n'
+                    f'[Question]\n{ques["text"]}\n\n'
+                    f'[{role} 1]\n{ans1["text"]}\n\n[End of {role} 1]\n\n'
+                    f'[{role} 2]\n{ans2["text"]}\n\n[End of {role} 2]\n\n'
+                    f'[System]\n{prompt}\n\n')
+         cur_js = {
+             'id': idx+1,
+             'question_id': ques['question_id'],
+             'answer1_id': ans1.get('answer_id', ans1['question_id']),
+             'answer2_id': ans2.get('answer_id', ans2['question_id']),  # fall back to question_id, as for answer1
+             'category': category
+         }
+         if idx >= len(cur_reviews):
+             review = get_eval(content, args.max_tokens)
+             scores = parse_score(review)
+             cur_js['content'] = review
+             cur_js['tuple'] = scores
+             review_file.write(json.dumps(cur_js) + '\n')
+             review_file.flush()
+         else:
+             print(f'Skipping {idx} as we already have it.')
+         idx += 1
+         print(idx)
+     review_file.close()
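`parse_score` in this file splits the first review line on single spaces, so a line such as `8, 9` becomes three tokens (`'8'`, `''`, `'9'`) and is reported as `[-1, -1]`. The sketch below shows one more tolerant alternative based on a regular expression. `parse_score_tolerant` is a hypothetical name, not something the repository defines, and it is an illustration rather than a drop-in replacement for the published evaluation protocol.

```python
import re

# Matches integers and simple decimals such as 8, 7.5 or -1.
_NUMBER = re.compile(r'-?\d+(?:\.\d+)?')


def parse_score_tolerant(review):
    """Return [score1, score2] read from the first line, or [-1, -1]."""
    first_line = review.strip().split('\n')[0]
    numbers = _NUMBER.findall(first_line)
    if len(numbers) == 2:
        return [float(numbers[0]), float(numbers[1])]
    # Anything other than exactly two numbers is treated as a parse failure.
    return [-1, -1]
```

Scores parsed this way are not directly comparable with numbers produced by the stricter parser, since fewer reviews end up as `[-1, -1]`.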
LLaVA/llava/eval/eval_gpt_review_visual.py ADDED
@@ -0,0 +1,118 @@
+ import argparse
+ import json
+ import os
+
+ import openai
+ import time
+
+ NUM_SECONDS_TO_SLEEP = 0.5
+
+
+ def get_eval(content: str, max_tokens: int):
+     while True:
+         try:
+             response = openai.ChatCompletion.create(
+                 model='gpt-4-0314',
+                 messages=[{
+                     'role': 'system',
+                     'content': 'You are a helpful and precise assistant for checking the quality of the answer.'
+                 }, {
+                     'role': 'user',
+                     'content': content,
+                 }],
+                 temperature=0.2,  # TODO: figure out which temperature is best for evaluation
+                 max_tokens=max_tokens,
+             )
+             break
+         except openai.error.RateLimitError:
+             pass
+         except Exception as e:
+             print(e)
+         time.sleep(NUM_SECONDS_TO_SLEEP)
+
+     return response['choices'][0]['message']['content']
+
+
+ def parse_score(review):
+     try:
+         score_pair = review.split('\n')[0]
+         score_pair = score_pair.replace(',', ' ')
+         sp = score_pair.split(' ')
+         if len(sp) == 2:
+             return [float(sp[0]), float(sp[1])]
+         else:
+             print('error', review)
+             return [-1, -1]
+     except Exception as e:
+         print(e)
+         print('error', review)
+         return [-1, -1]
+
+
+ if __name__ == '__main__':
+     parser = argparse.ArgumentParser(description='ChatGPT-based QA evaluation.')
+     parser.add_argument('-q', '--question')
+     parser.add_argument('-c', '--context')
+     parser.add_argument('-a', '--answer-list', nargs='+', default=[])
+     parser.add_argument('-r', '--rule')
+     parser.add_argument('-o', '--output')
+     parser.add_argument('--max-tokens', type=int, default=1024, help='maximum number of tokens produced in the output')
+     args = parser.parse_args()
+
+     f_q = open(os.path.expanduser(args.question))
+     f_ans1 = open(os.path.expanduser(args.answer_list[0]))
+     f_ans2 = open(os.path.expanduser(args.answer_list[1]))
+     rule_dict = json.load(open(os.path.expanduser(args.rule), 'r'))
+
+     if os.path.isfile(os.path.expanduser(args.output)):
+         cur_reviews = [json.loads(line) for line in open(os.path.expanduser(args.output))]
+     else:
+         cur_reviews = []
+
+     review_file = open(os.path.expanduser(args.output), 'a')  # same path as the existence check above
+
+     context_list = [json.loads(line) for line in open(os.path.expanduser(args.context))]
+     image_to_context = {context['image']: context for context in context_list}
+
+     handles = []
+     idx = 0
+     for ques_js, ans1_js, ans2_js in zip(f_q, f_ans1, f_ans2):
+         ques = json.loads(ques_js)
+         ans1 = json.loads(ans1_js)
+         ans2 = json.loads(ans2_js)
+
+         inst = image_to_context[ques['image']]
+         cap_str = '\n'.join(inst['captions'])
+         box_str = '\n'.join([f'{instance["category"]}: {instance["bbox"]}' for instance in inst['instances']])
+
+         category = ques['category']
+         if category in rule_dict:
+             rule = rule_dict[category]
+         else:
+             assert False, f"Visual QA category not found in rule file: {category}."
+         prompt = rule['prompt']
+         role = rule['role']
+         content = (f'[Context]\n{cap_str}\n\n{box_str}\n\n'
+                    f'[Question]\n{ques["text"]}\n\n'
+                    f'[{role} 1]\n{ans1["text"]}\n\n[End of {role} 1]\n\n'
+                    f'[{role} 2]\n{ans2["text"]}\n\n[End of {role} 2]\n\n'
+                    f'[System]\n{prompt}\n\n')
+         cur_js = {
+             'id': idx+1,
+             'question_id': ques['question_id'],
+             'answer1_id': ans1.get('answer_id', ans1['question_id']),
+             'answer2_id': ans2.get('answer_id', ans2['question_id']),  # fall back to question_id, as for answer1
+             'category': category
+         }
+         if idx >= len(cur_reviews):
+             review = get_eval(content, args.max_tokens)
+             scores = parse_score(review)
+             cur_js['content'] = review
+             cur_js['tuple'] = scores
+             review_file.write(json.dumps(cur_js) + '\n')
+             review_file.flush()
+         else:
+             print(f'Skipping {idx} as we already have it.')
+         idx += 1
+         print(idx)
+     review_file.close()
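`get_eval` in this file retries indefinitely with a fixed 0.5 s pause between attempts. A bounded retry with exponential backoff is one alternative. The sketch below is library-agnostic and makes no assumption about the OpenAI client: `call_with_backoff`, `max_retries`, `base_delay` and the injectable `sleep` are all hypothetical names introduced for illustration.

```python
import time


def call_with_backoff(fn, max_retries=5, base_delay=0.5, sleep=time.sleep):
    """Call fn() until it succeeds, doubling the pause after each failure.

    Re-raises the last exception once max_retries attempts have failed.
    """
    delay = base_delay
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the error to the caller
            print(f'attempt {attempt + 1} failed: {exc}')
            sleep(delay)
            delay *= 2
```

In the evaluation script this would wrap the API call, for example `call_with_backoff(lambda: get_review(content))`, where `get_review` stands in for a single non-retrying request function.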
LLaVA/llava/eval/eval_pope.py ADDED
@@ -0,0 +1,81 @@
+ import os
+ import json
+ import argparse
+
+ def eval_pope(answers, label_file):
+     label_list = [json.loads(q)['label'] for q in open(label_file, 'r')]
+
+     for answer in answers:
+         text = answer['text']
+
+         # Only keep the first sentence
+         if text.find('.') != -1:
+             text = text.split('.')[0]
+
+         text = text.replace(',', '')
+         words = text.split(' ')
+         if 'No' in words or 'not' in words or 'no' in words:
+             answer['text'] = 'no'
+         else:
+             answer['text'] = 'yes'
+
+     for i in range(len(label_list)):
+         if label_list[i] == 'no':
+             label_list[i] = 0
+         else:
+             label_list[i] = 1
+
+     pred_list = []
+     for answer in answers:
+         if answer['text'] == 'no':
+             pred_list.append(0)
+         else:
+             pred_list.append(1)
+
+     pos = 1
+     neg = 0
+     yes_ratio = pred_list.count(1) / max(len(pred_list), 1)  # guard against an empty category
+
+     TP, TN, FP, FN = 0, 0, 0, 0
+     for pred, label in zip(pred_list, label_list):
+         if pred == pos and label == pos:
+             TP += 1
+         elif pred == pos and label == neg:
+             FP += 1
+         elif pred == neg and label == neg:
+             TN += 1
+         elif pred == neg and label == pos:
+             FN += 1
+
+     print('TP\tFP\tTN\tFN\t')
+     print('{}\t{}\t{}\t{}'.format(TP, FP, TN, FN))
+
+     precision = float(TP) / float(TP + FP) if TP + FP else 0.0  # no "yes" predictions
+     recall = float(TP) / float(TP + FN) if TP + FN else 0.0  # no positive labels
+     f1 = 2*precision*recall / (precision + recall) if precision + recall else 0.0
+     acc = (TP + TN) / max(TP + TN + FP + FN, 1)
+     print('Accuracy: {}'.format(acc))
+     print('Precision: {}'.format(precision))
+     print('Recall: {}'.format(recall))
+     print('F1 score: {}'.format(f1))
+     print('Yes ratio: {}'.format(yes_ratio))
+     print('%.3f, %.3f, %.3f, %.3f, %.3f' % (f1, acc, precision, recall, yes_ratio) )
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--annotation-dir", type=str)
+     parser.add_argument("--question-file", type=str)
+     parser.add_argument("--result-file", type=str)
+     args = parser.parse_args()
+
+     questions = [json.loads(line) for line in open(args.question_file)]
+     questions = {question['question_id']: question for question in questions}
+     answers = [json.loads(q) for q in open(args.result_file)]
+     for file in os.listdir(args.annotation_dir):
+         assert file.startswith('coco_pope_')
+         assert file.endswith('.json')
+         category = file[10:-5]
+         cur_answers = [x for x in answers if questions[x['question_id']]['category'] == category]
+         print('Category: {}, # samples: {}'.format(category, len(cur_answers)))
+         eval_pope(cur_answers, os.path.join(args.annotation_dir, file))
+         print("====================================")
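The two steps in `eval_pope` (reducing a free-form answer to yes/no, then scoring against binary labels) can be separated into small functions that are easy to test in isolation. The following is a sketch under that reading of the script; `pope_answer_to_label` and `binary_metrics` are hypothetical names, and returning `0.0` for an undefined ratio is an assumption made here, not something the script specifies.

```python
def pope_answer_to_label(text):
    """Map a free-form answer to 'yes' or 'no' using only its first sentence."""
    first_sentence = text.split('.')[0]
    words = first_sentence.replace(',', '').split(' ')
    if 'No' in words or 'not' in words or 'no' in words:
        return 'no'
    return 'yes'


def binary_metrics(preds, labels):
    """Accuracy, precision, recall, F1 and yes-ratio for lists of 0/1 values."""
    tp = sum(1 for p, l in zip(preds, labels) if p == 1 and l == 1)
    fp = sum(1 for p, l in zip(preds, labels) if p == 1 and l == 0)
    tn = sum(1 for p, l in zip(preds, labels) if p == 0 and l == 0)
    fn = sum(1 for p, l in zip(preds, labels) if p == 0 and l == 1)

    def safe_div(num, den):
        # Undefined ratios are reported as 0.0 (an assumption of this sketch).
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        'accuracy': safe_div(tp + tn, tp + tn + fp + fn),
        'precision': precision,
        'recall': recall,
        'f1': safe_div(2 * precision * recall, precision + recall),
        'yes_ratio': safe_div(preds.count(1), len(preds)),
    }
```

Note that only the first sentence is inspected, so an answer such as "Yes, there is a dog. It is not big." counts as "yes".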
LLaVA/llava/eval/eval_science_qa_gpt4.py ADDED
@@ -0,0 +1,104 @@
+ import argparse
+ import json
+ import os
+ import re
+ import random
+ from collections import defaultdict
+
+
+ def get_args():
+     parser = argparse.ArgumentParser()
+     parser.add_argument('--base-dir', type=str)
+     parser.add_argument('--gpt4-result', type=str)
+     parser.add_argument('--our-result', type=str)
+     parser.add_argument('--split', type=str, default='test')
+     parser.add_argument('--options', type=list, default=["A", "B", "C", "D", "E"])
+     return parser.parse_args()
+
+
+ def convert_caps(results):
+     fakecaps = []
+     for result in results:
+         image_id = result['question_id']
+         caption = result['text']
+         fakecaps.append({"image_id": int(image_id), "caption": caption})
+     return fakecaps
+
+
+ def get_pred_idx(prediction, choices, options):
+     """
+     Get the index (e.g. 2) from the prediction (e.g. 'C')
+     """
+     if prediction in options[:len(choices)]:
+         return options.index(prediction)
+     else:
+         return random.choice(range(len(choices)))
+
+
+ if __name__ == "__main__":
+     args = get_args()
+
+     base_dir = args.base_dir
+     split_indices = json.load(open(os.path.join(base_dir, "pid_splits.json")))[args.split]
+     problems = json.load(open(os.path.join(base_dir, "problems.json")))
+     our_predictions = [json.loads(line) for line in open(args.our_result)]
+     our_predictions = {pred['question_id']: pred for pred in our_predictions}
+     split_problems = {idx: problems[idx] for idx in split_indices}
+
+     gpt4_predictions = json.load(open(args.gpt4_result))['outputs']
+
+     results = defaultdict(lambda: 0)
+
+     for prob_id, prob in split_problems.items():
+         if prob_id not in our_predictions:
+             continue
+         if prob_id not in gpt4_predictions:
+             continue
+         our_pred = our_predictions[prob_id]['text']
+         gpt4_pred = gpt4_predictions[prob_id]
+
+         pattern = re.compile(r'The answer is ([A-Z]).')
+         our_res = pattern.findall(our_pred)
+         if len(our_res) == 1:
+             our_answer = our_res[0]  # 'A', 'B', ...
+         else:
+             our_answer = "FAILED"
+         gpt4_res = pattern.findall(gpt4_pred)
+         if len(gpt4_res) == 1:
+             gpt4_answer = gpt4_res[0]  # 'A', 'B', ...
+         else:
+             gpt4_answer = "FAILED"
+
+         our_pred_idx = get_pred_idx(our_answer, prob['choices'], args.options)
+         gpt4_pred_idx = get_pred_idx(gpt4_answer, prob['choices'], args.options)
+
+         if gpt4_answer == 'FAILED':
+             results['gpt4_failed'] += 1
+             # continue
+             gpt4_pred_idx = our_pred_idx
+             # if our_pred_idx != prob['answer']:
+             #     print(our_predictions[prob_id]['prompt'])
+             #     print('-----------------')
+             #     print(f'LECTURE: {prob["lecture"]}')
+             #     print(f'SOLUTION: {prob["solution"]}')
+             #     print('=====================')
+         else:
+             # continue
+             pass
+         # gpt4_pred_idx = our_pred_idx
+
+         if gpt4_pred_idx == prob['answer']:
+             results['correct'] += 1
+         else:
+             results['incorrect'] += 1
+
+
+         if gpt4_pred_idx == prob['answer'] or our_pred_idx == prob['answer']:
+             results['correct_upperbound'] += 1
+
+     correct = results['correct']
+     total = results['correct'] + results['incorrect']
+     print(f'Total: {total}, Correct: {correct}, Accuracy: {correct / total * 100:.2f}%')
+     print(f'Total: {total}, Correct (upper): {results["correct_upperbound"]}, Accuracy: {results["correct_upperbound"] / total * 100:.2f}%')
+     print(f'Total: {total}, GPT-4 NO-ANS (RANDOM): {results["gpt4_failed"]}, Percentage: {results["gpt4_failed"] / total * 100:.2f}%')
+
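The script extracts a single answer letter with the pattern `The answer is ([A-Z]).` and falls back to a random choice when extraction fails. The sketch below shows a deterministic variant for illustration. `extract_choice` is a hypothetical helper that differs from the script in two labeled ways: the trailing period is escaped so only a literal `.` matches, and a failed extraction returns `None` instead of a random index.

```python
import re

# Assumption of this sketch: the period is escaped, unlike in the script.
ANSWER_PATTERN = re.compile(r'The answer is ([A-Z])\.')


def extract_choice(text, num_choices, options='ABCDE'):
    """Return the 0-based index of the chosen option, or None on failure."""
    letters = ANSWER_PATTERN.findall(text)
    if len(letters) != 1:
        return None  # no answer, or more than one candidate answer
    letter = letters[0]
    if letter not in options[:num_choices]:
        return None  # letter is outside the valid range for this problem
    return options.index(letter)
```

Returning `None` lets the caller count failures separately rather than folding a random guess into the accuracy figure.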
LLaVA/llava/eval/eval_textvqa.py ADDED
@@ -0,0 +1,65 @@
+ import os
+ import argparse
+ import json
+ import re
+
+ from llava.eval.m4c_evaluator import TextVQAAccuracyEvaluator
+
+
+ def get_args():
+     parser = argparse.ArgumentParser()
+     parser.add_argument('--annotation-file', type=str)
+     parser.add_argument('--result-file', type=str)
+     parser.add_argument('--result-dir', type=str)
+     return parser.parse_args()
+
+
+ def prompt_processor(prompt):
+     if prompt.startswith('OCR tokens: '):
+         pattern = r"Question: (.*?) Short answer:"
+         match = re.search(pattern, prompt, re.DOTALL)
+         question = match.group(1)
+     elif 'Reference OCR token: ' in prompt and len(prompt.split('\n')) == 3:
+         if prompt.startswith('Reference OCR token:'):
+             question = prompt.split('\n')[1]
+         else:
+             question = prompt.split('\n')[0]
+     elif len(prompt.split('\n')) == 2:
+         question = prompt.split('\n')[0]
+     else:
+         assert False
+
+     return question.lower()
+
+
+ def eval_single(annotation_file, result_file):
+     experiment_name = os.path.splitext(os.path.basename(result_file))[0]
+     print(experiment_name)
+     annotations = json.load(open(annotation_file))['data']
+     annotations = {(annotation['image_id'], annotation['question'].lower()): annotation for annotation in annotations}
+     results = [json.loads(line) for line in open(result_file)]
+
+     pred_list = []
+     for result in results:
+         annotation = annotations[(result['question_id'], prompt_processor(result['prompt']))]
+         pred_list.append({
+             "pred_answer": result['text'],
+             "gt_answers": annotation['answers'],
+         })
+
+     evaluator = TextVQAAccuracyEvaluator()
+     print('Samples: {}\nAccuracy: {:.2f}%\n'.format(len(pred_list), 100. * evaluator.eval_pred_list(pred_list)))
+
+
+ if __name__ == "__main__":
+     args = get_args()
+
+     if args.result_file is not None:
+         eval_single(args.annotation_file, args.result_file)
+
+     if args.result_dir is not None:
+         for result_file in sorted(os.listdir(args.result_dir)):
+             if not result_file.endswith('.jsonl'):
+                 print(f'Skipping {result_file}')
+                 continue
+             eval_single(args.annotation_file, os.path.join(args.result_dir, result_file))
LLaVA/llava/eval/generate_webpage_data_from_table.py ADDED
@@ -0,0 +1,111 @@
+ """Generate json file for webpage."""
+ import json
+ import os
+ import re
+
+ # models = ['llama', 'alpaca', 'gpt35', 'bard']
+ models = ['vicuna']
+
+
+ def read_jsonl(path: str, key: str=None):
+     data = []
+     with open(os.path.expanduser(path)) as f:
+         for line in f:
+             if not line.strip():  # skip blank lines (a bare newline is still truthy)
+                 continue
+             data.append(json.loads(line))
+     if key is not None:
+         data.sort(key=lambda x: x[key])
+         data = {item[key]: item for item in data}
+     return data
+
+
+ def trim_hanging_lines(s: str, n: int) -> str:
+     s = s.strip()
+     for _ in range(n):
+         s = s.split('\n', 1)[1].strip()
+     return s
+
+
+ if __name__ == '__main__':
+     questions = read_jsonl('table/question.jsonl', key='question_id')
+
+     # alpaca_answers = read_jsonl('table/answer/answer_alpaca-13b.jsonl', key='question_id')
+     # bard_answers = read_jsonl('table/answer/answer_bard.jsonl', key='question_id')
+     # gpt35_answers = read_jsonl('table/answer/answer_gpt35.jsonl', key='question_id')
+     # llama_answers = read_jsonl('table/answer/answer_llama-13b.jsonl', key='question_id')
+     vicuna_answers = read_jsonl('table/answer/answer_vicuna-13b.jsonl', key='question_id')
+     ours_answers = read_jsonl('table/results/llama-13b-hf-alpaca.jsonl', key='question_id')
+
+     review_vicuna = read_jsonl('table/review/review_vicuna-13b_llama-13b-hf-alpaca.jsonl', key='question_id')
+     # review_alpaca = read_jsonl('table/review/review_alpaca-13b_vicuna-13b.jsonl', key='question_id')
+     # review_bard = read_jsonl('table/review/review_bard_vicuna-13b.jsonl', key='question_id')
+     # review_gpt35 = read_jsonl('table/review/review_gpt35_vicuna-13b.jsonl', key='question_id')
+     # review_llama = read_jsonl('table/review/review_llama-13b_vicuna-13b.jsonl', key='question_id')
+
+     records = []
+     for qid in questions.keys():
+         r = {
+             'id': qid,
+             'category': questions[qid]['category'],
+             'question': questions[qid]['text'],
+             'answers': {
+                 # 'alpaca': alpaca_answers[qid]['text'],
+                 # 'llama': llama_answers[qid]['text'],
+                 # 'bard': bard_answers[qid]['text'],
+                 # 'gpt35': gpt35_answers[qid]['text'],
+                 'vicuna': vicuna_answers[qid]['text'],
+                 'ours': ours_answers[qid]['text'],
+             },
+             'evaluations': {
+                 # 'alpaca': review_alpaca[qid]['text'],
+                 # 'llama': review_llama[qid]['text'],
+                 # 'bard': review_bard[qid]['text'],
+                 'vicuna': review_vicuna[qid]['content'],
+                 # 'gpt35': review_gpt35[qid]['text'],
+             },
+             'scores': {
+                 'vicuna': review_vicuna[qid]['tuple'],
+                 # 'alpaca': review_alpaca[qid]['score'],
+                 # 'llama': review_llama[qid]['score'],
+                 # 'bard': review_bard[qid]['score'],
+                 # 'gpt35': review_gpt35[qid]['score'],
+             },
+         }
+
+         # cleanup data
+         cleaned_evals = {}
+         for k, v in r['evaluations'].items():
+             v = v.strip()
+             lines = v.split('\n')
+             # trim the first line if it's a pair of numbers
+             if re.match(r'\d+[, ]+\d+', lines[0]):
+                 lines = lines[1:]
+             v = '\n'.join(lines)
+             cleaned_evals[k] = v.replace('Assistant 1', "**Assistant 1**").replace('Assistant 2', '**Assistant 2**')
+
+         r['evaluations'] = cleaned_evals
+         records.append(r)
+
+     # Reorder the records, this is optional
+     for r in records:
+         if r['id'] <= 20:
+             r['id'] += 60
+         else:
+             r['id'] -= 20
+     for r in records:
+         if r['id'] <= 50:
+             r['id'] += 10
+         elif 50 < r['id'] <= 60:
+             r['id'] -= 50
+     for r in records:
+         if r['id'] == 7:
+             r['id'] = 1
+         elif r['id'] < 7:
+             r['id'] += 1
+
+     records.sort(key=lambda x: x['id'])
+
+     # Write to file
+     with open('webpage/data.json', 'w') as f:
+         json.dump({'questions': records, 'models': models}, f, indent=2)
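The three "Reorder the records" passes in this script compose into a single mapping on question ids. The sketch below applies the same three steps to one id so the net effect can be inspected directly. `reorder_id` is a hypothetical helper, and the id range 1 to 80 used in the usage note is an assumption about the question table, which is not shown in this diff.

```python
def reorder_id(question_id):
    """Apply the script's three reordering passes to a single id."""
    new_id = question_id
    # Pass 1: move the first 20 ids to the end, shift the rest down by 20.
    if new_id <= 20:
        new_id += 60
    else:
        new_id -= 20
    # Pass 2: shift ids up to 50 by 10, and move 51..60 to the front.
    if new_id <= 50:
        new_id += 10
    elif new_id <= 60:
        new_id -= 50
    # Pass 3: bring id 7 to the first position, shifting 1..6 up by one.
    if new_id == 7:
        new_id = 1
    elif new_id < 7:
        new_id += 1
    return new_id
```

Under the assumed range, `sorted(reorder_id(i) for i in range(1, 81))` gives back 1 to 80, so the passes only permute the display order and never produce duplicate ids.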
LLaVA/llava/eval/m4c_evaluator.py ADDED
@@ -0,0 +1,334 @@
+ # Copyright (c) Facebook, Inc. and its affiliates.
+ import re
+
+ from tqdm import tqdm
+
+
+ class EvalAIAnswerProcessor:
+     """
+     Processes an answer similar to Eval AI
+         copied from
+         https://github.com/facebookresearch/mmf/blob/c46b3b3391275b4181567db80943473a89ab98ab/pythia/tasks/processors.py#L897
+     """
+
+     CONTRACTIONS = {
+         "aint": "ain't",
+         "arent": "aren't",
+         "cant": "can't",
+         "couldve": "could've",
+         "couldnt": "couldn't",
+         "couldn'tve": "couldn't've",
+         "couldnt've": "couldn't've",
+         "didnt": "didn't",
+         "doesnt": "doesn't",
+         "dont": "don't",
+         "hadnt": "hadn't",
+         "hadnt've": "hadn't've",
+         "hadn'tve": "hadn't've",
+         "hasnt": "hasn't",
+         "havent": "haven't",
+         "hed": "he'd",
+         "hed've": "he'd've",
+         "he'dve": "he'd've",
+         "hes": "he's",
+         "howd": "how'd",
+         "howll": "how'll",
+         "hows": "how's",
+         "Id've": "I'd've",
+         "I'dve": "I'd've",
+         "Im": "I'm",
+         "Ive": "I've",
+         "isnt": "isn't",
+         "itd": "it'd",
+         "itd've": "it'd've",
+         "it'dve": "it'd've",
+         "itll": "it'll",
+         "let's": "let's",
+         "maam": "ma'am",
+         "mightnt": "mightn't",
+         "mightnt've": "mightn't've",
+         "mightn'tve": "mightn't've",
+         "mightve": "might've",
+         "mustnt": "mustn't",
+         "mustve": "must've",
+         "neednt": "needn't",
+         "notve": "not've",
+         "oclock": "o'clock",
+         "oughtnt": "oughtn't",
+         "ow's'at": "'ow's'at",
+         "'ows'at": "'ow's'at",
+         "'ow'sat": "'ow's'at",
+         "shant": "shan't",
+         "shed've": "she'd've",
+         "she'dve": "she'd've",
+         "she's": "she's",
+         "shouldve": "should've",
+         "shouldnt": "shouldn't",
+         "shouldnt've": "shouldn't've",
+         "shouldn'tve": "shouldn't've",
+         "somebody'd": "somebodyd",
+         "somebodyd've": "somebody'd've",
+         "somebody'dve": "somebody'd've",
+         "somebodyll": "somebody'll",
+         "somebodys": "somebody's",
+         "someoned": "someone'd",
+         "someoned've": "someone'd've",
+         "someone'dve": "someone'd've",
+         "someonell": "someone'll",
+         "someones": "someone's",
+         "somethingd": "something'd",
+         "somethingd've": "something'd've",
+         "something'dve": "something'd've",
+         "somethingll": "something'll",
+         "thats": "that's",
+         "thered": "there'd",
+         "thered've": "there'd've",
+         "there'dve": "there'd've",
+         "therere": "there're",
+         "theres": "there's",
+         "theyd": "they'd",
+         "theyd've": "they'd've",
+         "they'dve": "they'd've",
+         "theyll": "they'll",
+         "theyre": "they're",
+         "theyve": "they've",
+         "twas": "'twas",
+         "wasnt": "wasn't",
+         "wed've": "we'd've",
+         "we'dve": "we'd've",
+         "weve": "we've",
+         "werent": "weren't",
+         "whatll": "what'll",
+         "whatre": "what're",
+         "whats": "what's",
+         "whatve": "what've",
+         "whens": "when's",
+         "whered": "where'd",
+         "wheres": "where's",
+         "whereve": "where've",
+         "whod": "who'd",
+         "whod've": "who'd've",
+         "who'dve": "who'd've",
+         "wholl": "who'll",
+         "whos": "who's",
+         "whove": "who've",
+         "whyll": "why'll",
+         "whyre": "why're",
+         "whys": "why's",
+         "wont": "won't",
+         "wouldve": "would've",
+         "wouldnt": "wouldn't",
+         "wouldnt've": "wouldn't've",
+         "wouldn'tve": "wouldn't've",
+         "yall": "y'all",
+         "yall'll": "y'all'll",
+         "y'allll": "y'all'll",
+         "yall'd've": "y'all'd've",
+         "y'alld've": "y'all'd've",
+         "y'all'dve": "y'all'd've",
+         "youd": "you'd",
+         "youd've": "you'd've",
+         "you'dve": "you'd've",
+         "youll": "you'll",
+         "youre": "you're",
+         "youve": "you've",
+     }
+
+     NUMBER_MAP = {
+         "none": "0",
+         "zero": "0",
+         "one": "1",
+         "two": "2",
+         "three": "3",
+         "four": "4",
+         "five": "5",
+         "six": "6",
+         "seven": "7",
+         "eight": "8",
+         "nine": "9",
+         "ten": "10",
+     }
+     ARTICLES = ["a", "an", "the"]
+     PERIOD_STRIP = re.compile(r"(?!<=\d)(\.)(?!\d)")
+     COMMA_STRIP = re.compile(r"(?<=\d)(\,)+(?=\d)")
+     PUNCTUATIONS = [
+         ";",
+         r"/",
+         "[",
+         "]",
+         '"',
+         "{",
+         "}",
+         "(",
+         ")",
+         "=",
+         "+",
+         "\\",
+         "_",
+         "-",
+         ">",
+         "<",
+         "@",
+         "`",
+         ",",
+         "?",
+         "!",
+     ]
+
+     def __init__(self, *args, **kwargs):
+         pass
+
+     def word_tokenize(self, word):
+         word = word.lower()
+         word = word.replace(",", "").replace("?", "").replace("'s", " 's")
+         return word.strip()
+
+     def process_punctuation(self, in_text):
+         out_text = in_text
+         for p in self.PUNCTUATIONS:
+             if (p + " " in in_text or " " + p in in_text) or (
+                 re.search(self.COMMA_STRIP, in_text) is not None
+             ):
+                 out_text = out_text.replace(p, "")
+             else:
+                 out_text = out_text.replace(p, " ")
+         out_text = self.PERIOD_STRIP.sub("", out_text, re.UNICODE)  # note: re.UNICODE lands in the count argument here
+         return out_text
+
+     def process_digit_article(self, in_text):
+         out_text = []
+         temp_text = in_text.lower().split()
201
+ for word in temp_text:
202
+ word = self.NUMBER_MAP.setdefault(word, word)
203
+ if word not in self.ARTICLES:
204
+ out_text.append(word)
205
+ else:
206
+ pass
207
+ for word_id, word in enumerate(out_text):
208
+ if word in self.CONTRACTIONS:
209
+ out_text[word_id] = self.CONTRACTIONS[word]
210
+ out_text = " ".join(out_text)
211
+ return out_text
212
+
213
+ def __call__(self, item):
214
+ item = self.word_tokenize(item)
215
+ item = item.replace("\n", " ").replace("\t", " ").strip()
216
+ item = self.process_punctuation(item)
217
+ item = self.process_digit_article(item)
218
+ return item
219
+
220
+
221
+ class TextVQAAccuracyEvaluator:
222
+ def __init__(self):
223
+ self.answer_processor = EvalAIAnswerProcessor()
224
+
225
+ def _compute_answer_scores(self, raw_answers):
226
+ """
227
+ compute the accuracy (soft score) of human answers
228
+ """
229
+ answers = [self.answer_processor(a) for a in raw_answers]
230
+ assert len(answers) == 10
231
+ gt_answers = list(enumerate(answers))
232
+ unique_answers = set(answers)
233
+ unique_answer_scores = {}
234
+
235
+ for unique_answer in unique_answers:
236
+ accs = []
237
+ for gt_answer in gt_answers:
238
+ other_answers = [item for item in gt_answers if item != gt_answer]
239
+ matching_answers = [
240
+ item for item in other_answers if item[1] == unique_answer
241
+ ]
242
+ acc = min(1, float(len(matching_answers)) / 3)
243
+ accs.append(acc)
244
+ unique_answer_scores[unique_answer] = sum(accs) / len(accs)
245
+
246
+ return unique_answer_scores
247
+
248
+ def eval_pred_list(self, pred_list):
249
+ pred_scores = []
250
+ for entry in tqdm(pred_list):
251
+ pred_answer = self.answer_processor(entry["pred_answer"])
252
+ unique_answer_scores = self._compute_answer_scores(entry["gt_answers"])
253
+ score = unique_answer_scores.get(pred_answer, 0.0)
254
+ pred_scores.append(score)
255
+
256
+ accuracy = sum(pred_scores) / len(pred_scores)
257
+ return accuracy
258
+
259
+
260
+ class STVQAAccuracyEvaluator:
261
+ def __init__(self):
262
+ self.answer_processor = EvalAIAnswerProcessor()
263
+
264
+ def eval_pred_list(self, pred_list):
265
+ pred_scores = []
266
+ for entry in pred_list:
267
+ pred_answer = self.answer_processor(entry["pred_answer"])
268
+ gts = [self.answer_processor(a) for a in entry["gt_answers"]]
269
+ score = 1.0 if pred_answer in gts else 0.0
270
+ pred_scores.append(score)
271
+
272
+ accuracy = sum(pred_scores) / len(pred_scores)
273
+ return accuracy
274
+
275
+
276
+ class STVQAANLSEvaluator:
277
+ def __init__(self):
278
+ import editdistance # install with `pip install editdistance`
279
+
280
+ self.get_edit_distance = editdistance.eval
281
+
282
+ def get_anls(self, s1, s2):
283
+ s1 = s1.lower().strip()
284
+ s2 = s2.lower().strip()
285
+ iou = 1 - self.get_edit_distance(s1, s2) / max(len(s1), len(s2))
286
+ anls = iou if iou >= 0.5 else 0.0
287
+ return anls
288
+
289
+ def eval_pred_list(self, pred_list):
290
+ pred_scores = []
291
+ for entry in pred_list:
292
+ anls = max(
293
+ self.get_anls(entry["pred_answer"], gt) for gt in entry["gt_answers"]
294
+ )
295
+ pred_scores.append(anls)
296
+
297
+ accuracy = sum(pred_scores) / len(pred_scores)
298
+ return accuracy
299
+
300
+
301
+ class TextCapsBleu4Evaluator:
302
+ def __init__(self):
303
+ # The following script requires Java 1.8.0 and pycocotools installed.
304
+ # The pycocoevalcap can be installed with pip as
305
+ # pip install git+https://github.com/ronghanghu/coco-caption.git@python23
306
+ # Original pycocoevalcap code is at https://github.com/tylin/coco-caption
307
+ # but has no python3 support yet.
308
+ try:
309
+ from pycocoevalcap.bleu.bleu import Bleu
310
+ from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
311
+ except ModuleNotFoundError:
312
+ print(
313
+ "Please install pycocoevalcap module using "
314
+ "pip install git+https://github.com/ronghanghu/coco-caption.git@python23" # noqa
315
+ )
316
+ raise
317
+
318
+ self.tokenizer = PTBTokenizer()
319
+ self.scorer = Bleu(4)
320
+
321
+ def eval_pred_list(self, pred_list):
322
+ # Create reference and hypotheses captions.
323
+ gts = {}
324
+ res = {}
325
+ for idx, entry in enumerate(pred_list):
326
+ gts[idx] = [{"caption": a} for a in entry["gt_answers"]]
327
+ res[idx] = [{"caption": entry["pred_answer"]}]
328
+
329
+ gts = self.tokenizer.tokenize(gts)
330
+ res = self.tokenizer.tokenize(res)
331
+ score, _ = self.scorer.compute_score(gts, res)
332
+
333
+ bleu4 = score[3] # score is (Bleu-1, Bleu-2, Bleu-3, Bleu-4)
334
+ return bleu4
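Note on the TextVQA soft score: the leave-one-out averaging in `TextVQAAccuracyEvaluator._compute_answer_scores` can be hard to picture from the nested loops alone. The following is a minimal standalone sketch of the same arithmetic, assuming the ten answers have already been normalized; the function name `soft_accuracy_scores` and the sample answers are illustrative and not part of the commit.

```python
from collections import Counter


def soft_accuracy_scores(answers):
    """Leave-one-out soft score for each unique (already normalized) answer."""
    scores = {}
    for candidate in set(answers):
        per_holdout = []
        for held_out in range(len(answers)):
            # Count matches among the annotators other than the held-out one.
            matches = sum(
                1
                for i, answer in enumerate(answers)
                if i != held_out and answer == candidate
            )
            # Three or more agreeing annotators count as full credit.
            per_holdout.append(min(1.0, matches / 3))
        scores[candidate] = sum(per_holdout) / len(per_holdout)
    return scores


# Hypothetical ground truth: ten human answers for one question.
answers = ["stop"] * 3 + ["slow"] * 2 + ["go"] + ["yield"] * 4
scores = soft_accuracy_scores(answers)
for answer, count in Counter(answers).most_common():
    print(f"{answer!r}: given by {count} annotators, score {scores[answer]:.2f}")
```

With ten annotators, an answer given by one, two, three, or four of them scores roughly 0.3, 0.6, 0.9, and 1.0 respectively, and a prediction that matches no human answer scores 0.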
LLaVA/llava/eval/model_vqa.py ADDED
@@ -0,0 +1,101 @@
1
+ import argparse
2
+ import torch
3
+ import os
4
+ import json
5
+ from tqdm import tqdm
6
+ import shortuuid
7
+
8
+ from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
9
+ from llava.conversation import conv_templates, SeparatorStyle
10
+ from llava.model.builder import load_pretrained_model
11
+ from llava.utils import disable_torch_init
12
+ from llava.mm_utils import tokenizer_image_token, process_images, get_model_name_from_path
13
+
14
+ from PIL import Image
15
+ import math
16
+
17
+
18
+ def split_list(lst, n):
19
+ """Split a list into n (roughly) equal-sized chunks"""
20
+ chunk_size = math.ceil(len(lst) / n) # ceiling division
21
+ return [lst[i:i+chunk_size] for i in range(0, len(lst), chunk_size)]
22
+
23
+
24
+ def get_chunk(lst, n, k):
25
+ chunks = split_list(lst, n)
26
+ return chunks[k]
27
+
28
+
29
+ def eval_model(args):
30
+ # Model
31
+ disable_torch_init()
32
+ model_path = os.path.expanduser(args.model_path)
33
+ model_name = get_model_name_from_path(model_path)
34
+ tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name)
35
+
36
+ questions = [json.loads(q) for q in open(os.path.expanduser(args.question_file), "r")]
37
+ questions = get_chunk(questions, args.num_chunks, args.chunk_idx)
38
+ answers_file = os.path.expanduser(args.answers_file)
39
+ os.makedirs(os.path.dirname(answers_file), exist_ok=True)
40
+ ans_file = open(answers_file, "w")
41
+ for line in tqdm(questions):
42
+ idx = line["question_id"]
43
+ image_file = line["image"]
44
+ qs = line["text"]
45
+ cur_prompt = qs
46
+ if model.config.mm_use_im_start_end:
47
+ qs = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + '\n' + qs
48
+ else:
49
+ qs = DEFAULT_IMAGE_TOKEN + '\n' + qs
50
+
51
+ conv = conv_templates[args.conv_mode].copy()
52
+ conv.append_message(conv.roles[0], qs)
53
+ conv.append_message(conv.roles[1], None)
54
+ prompt = conv.get_prompt()
55
+
56
+ input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
57
+
58
+ image = Image.open(os.path.join(args.image_folder, image_file)).convert('RGB')
59
+ image_tensor = process_images([image], image_processor, model.config)[0]
60
+
61
+ with torch.inference_mode():
62
+ output_ids = model.generate(
63
+ input_ids,
64
+ images=image_tensor.unsqueeze(0).half().cuda(),
65
+ image_sizes=[image.size],
66
+ do_sample=args.temperature > 0,
67
+ temperature=args.temperature,
68
+ top_p=args.top_p,
69
+ num_beams=args.num_beams,
70
+ # no_repeat_ngram_size=3,
71
+ max_new_tokens=1024,
72
+ use_cache=True)
73
+
74
+ outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
75
+
76
+ ans_id = shortuuid.uuid()
77
+ ans_file.write(json.dumps({"question_id": idx,
78
+ "prompt": cur_prompt,
79
+ "text": outputs,
80
+ "answer_id": ans_id,
81
+ "model_id": model_name,
82
+ "metadata": {}}) + "\n")
83
+ ans_file.flush()
84
+ ans_file.close()
85
+
86
+ if __name__ == "__main__":
87
+ parser = argparse.ArgumentParser()
88
+ parser.add_argument("--model-path", type=str, default="facebook/opt-350m")
89
+ parser.add_argument("--model-base", type=str, default=None)
90
+ parser.add_argument("--image-folder", type=str, default="")
91
+ parser.add_argument("--question-file", type=str, default="tables/question.jsonl")
92
+ parser.add_argument("--answers-file", type=str, default="answer.jsonl")
93
+ parser.add_argument("--conv-mode", type=str, default="llava_v1")
94
+ parser.add_argument("--num-chunks", type=int, default=1)
95
+ parser.add_argument("--chunk-idx", type=int, default=0)
96
+ parser.add_argument("--temperature", type=float, default=0.2)
97
+ parser.add_argument("--top_p", type=float, default=None)
98
+ parser.add_argument("--num_beams", type=int, default=1)
99
+ args = parser.parse_args()
100
+
101
+ eval_model(args)
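Note on chunking: `split_list` sizes chunks with `math.ceil`, so it can return fewer than `n` chunks when the list is short. The sketch below repeats the helper on its own, with made-up inputs, to show the chunk sizes it produces; it assumes nothing beyond the standard library.

```python
import math


def split_list(lst, n):
    """Split lst into chunks of ceil(len(lst) / n) items each."""
    chunk_size = math.ceil(len(lst) / n)
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]


even = split_list(list(range(10)), 4)   # chunk_size is 3
short = split_list(list(range(5)), 4)   # chunk_size is 2

print([len(chunk) for chunk in even])
print([len(chunk) for chunk in short])
print(len(short))
```

In the second case only three chunks exist, so `get_chunk(lst, 4, 3)` would raise an `IndexError`. An empty list gives a chunk size of 0, and `range` then raises a `ValueError`. A launcher that shards work with `--num-chunks` and `--chunk-idx` may want to guard against both cases.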
LLaVA/llava/eval/model_vqa_loader.py ADDED
@@ -0,0 +1,144 @@
1
+ import argparse
2
+ import torch
3
+ import os
4
+ import json
5
+ from tqdm import tqdm
6
+ import shortuuid
7
+
8
+ from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
9
+ from llava.conversation import conv_templates, SeparatorStyle
10
+ from llava.model.builder import load_pretrained_model
11
+ from llava.utils import disable_torch_init
12
+ from llava.mm_utils import tokenizer_image_token, process_images, get_model_name_from_path
13
+ from torch.utils.data import Dataset, DataLoader
14
+
15
+ from PIL import Image
16
+ import math
17
+
18
+
19
+ def split_list(lst, n):
20
+ """Split a list into n (roughly) equal-sized chunks"""
21
+ chunk_size = math.ceil(len(lst) / n) # ceiling division
22
+ return [lst[i:i+chunk_size] for i in range(0, len(lst), chunk_size)]
23
+
24
+
25
+ def get_chunk(lst, n, k):
26
+ chunks = split_list(lst, n)
27
+ return chunks[k]
28
+
29
+
30
+ # Custom dataset class
31
+ class CustomDataset(Dataset):
32
+ def __init__(self, questions, image_folder, tokenizer, image_processor, model_config):
33
+ self.questions = questions
34
+ self.image_folder = image_folder
35
+ self.tokenizer = tokenizer
36
+ self.image_processor = image_processor
37
+ self.model_config = model_config
38
+
39
+ def __getitem__(self, index):
40
+ line = self.questions[index]
41
+ image_file = line["image"]
42
+ qs = line["text"]
43
+ if self.model_config.mm_use_im_start_end:
44
+ qs = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + '\n' + qs
45
+ else:
46
+ qs = DEFAULT_IMAGE_TOKEN + '\n' + qs
47
+
48
+ conv = conv_templates[args.conv_mode].copy()
49
+ conv.append_message(conv.roles[0], qs)
50
+ conv.append_message(conv.roles[1], None)
51
+ prompt = conv.get_prompt()
52
+
53
+ image = Image.open(os.path.join(self.image_folder, image_file)).convert('RGB')
54
+ image_tensor = process_images([image], self.image_processor, self.model_config)[0]
55
+
56
+ input_ids = tokenizer_image_token(prompt, self.tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt')
57
+
58
+ return input_ids, image_tensor, image.size
59
+
60
+ def __len__(self):
61
+ return len(self.questions)
62
+
63
+
64
+ def collate_fn(batch):
65
+ input_ids, image_tensors, image_sizes = zip(*batch)
66
+ input_ids = torch.stack(input_ids, dim=0)
67
+ image_tensors = torch.stack(image_tensors, dim=0)
68
+ return input_ids, image_tensors, image_sizes
69
+
70
+
71
+ # DataLoader
72
+ def create_data_loader(questions, image_folder, tokenizer, image_processor, model_config, batch_size=1, num_workers=4):
73
+ assert batch_size == 1, "batch_size must be 1"
74
+ dataset = CustomDataset(questions, image_folder, tokenizer, image_processor, model_config)
75
+ data_loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers, shuffle=False, collate_fn=collate_fn)
76
+ return data_loader
77
+
78
+
79
+ def eval_model(args):
80
+ # Model
81
+ disable_torch_init()
82
+ model_path = os.path.expanduser(args.model_path)
83
+ model_name = get_model_name_from_path(model_path)
84
+ tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name)
85
+
86
+ questions = [json.loads(q) for q in open(os.path.expanduser(args.question_file), "r")]
87
+ questions = get_chunk(questions, args.num_chunks, args.chunk_idx)
88
+ answers_file = os.path.expanduser(args.answers_file)
89
+ os.makedirs(os.path.dirname(answers_file), exist_ok=True)
90
+ ans_file = open(answers_file, "w")
91
+
92
+ if 'plain' in model_name and 'finetune' not in model_name.lower() and 'mmtag' not in args.conv_mode:
93
+ args.conv_mode = args.conv_mode + '_mmtag'
94
+ print(f'It looks like this is a plain model, but it is not using an mmtag prompt; auto-switching to {args.conv_mode}.')
95
+
96
+ data_loader = create_data_loader(questions, args.image_folder, tokenizer, image_processor, model.config)
97
+
98
+ for (input_ids, image_tensor, image_sizes), line in tqdm(zip(data_loader, questions), total=len(questions)):
99
+ idx = line["question_id"]
100
+ cur_prompt = line["text"]
101
+
102
+ input_ids = input_ids.to(device='cuda', non_blocking=True)
103
+
104
+ with torch.inference_mode():
105
+ output_ids = model.generate(
106
+ input_ids,
107
+ images=image_tensor.to(dtype=torch.float16, device='cuda', non_blocking=True),
108
+ image_sizes=image_sizes,
109
+ do_sample=args.temperature > 0,
110
+ temperature=args.temperature,
111
+ top_p=args.top_p,
112
+ num_beams=args.num_beams,
113
+ max_new_tokens=args.max_new_tokens,
114
+ use_cache=True)
115
+
116
+ outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
117
+
118
+ ans_id = shortuuid.uuid()
119
+ ans_file.write(json.dumps({"question_id": idx,
120
+ "prompt": cur_prompt,
121
+ "text": outputs,
122
+ "answer_id": ans_id,
123
+ "model_id": model_name,
124
+ "metadata": {}}) + "\n")
125
+ # ans_file.flush()
126
+ ans_file.close()
127
+
128
+ if __name__ == "__main__":
129
+ parser = argparse.ArgumentParser()
130
+ parser.add_argument("--model-path", type=str, default="facebook/opt-350m")
131
+ parser.add_argument("--model-base", type=str, default=None)
132
+ parser.add_argument("--image-folder", type=str, default="")
133
+ parser.add_argument("--question-file", type=str, default="tables/question.jsonl")
134
+ parser.add_argument("--answers-file", type=str, default="answer.jsonl")
135
+ parser.add_argument("--conv-mode", type=str, default="llava_v1")
136
+ parser.add_argument("--num-chunks", type=int, default=1)
137
+ parser.add_argument("--chunk-idx", type=int, default=0)
138
+ parser.add_argument("--temperature", type=float, default=0.2)
139
+ parser.add_argument("--top_p", type=float, default=None)
140
+ parser.add_argument("--num_beams", type=int, default=1)
141
+ parser.add_argument("--max_new_tokens", type=int, default=128)
142
+ args = parser.parse_args()
143
+
144
+ eval_model(args)
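Note on the conversation-mode switch: `eval_model` appends `_mmtag` to `--conv-mode` when the model name looks like a "plain" (projector-only) checkpoint. The rule can be isolated as a small pure function, sketched below. The function name `resolve_conv_mode` and the model names are hypothetical, and whether the resulting template key exists in `conv_templates` is not checked here.

```python
def resolve_conv_mode(model_name, conv_mode):
    """Return the conversation mode the script would end up using."""
    looks_plain = "plain" in model_name                # case-sensitive, as in the script
    is_finetuned = "finetune" in model_name.lower()    # case-insensitive, as in the script
    if looks_plain and not is_finetuned and "mmtag" not in conv_mode:
        return conv_mode + "_mmtag"
    return conv_mode


# Hypothetical model names, used only to exercise each branch.
print(resolve_conv_mode("llava-plain-7b", "llava_v1"))
print(resolve_conv_mode("llava-plain-7b-Finetune", "llava_v1"))
print(resolve_conv_mode("llava-Plain-7b", "llava_v1"))
print(resolve_conv_mode("llava-plain-7b", "v1_mmtag"))
```

The third call shows a consequence of the mixed case handling: a name containing `Plain` with a capital letter is not treated as a plain model.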
LLaVA/llava/eval/model_vqa_mmbench.py ADDED
@@ -0,0 +1,160 @@
1
+ import argparse
2
+ import torch
3
+ import os
4
+ import json
5
+ import pandas as pd
6
+ from tqdm import tqdm
7
+ import shortuuid
8
+
9
+ from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
10
+ from llava.conversation import conv_templates, SeparatorStyle
11
+ from llava.model.builder import load_pretrained_model
12
+ from llava.utils import disable_torch_init
13
+ from llava.mm_utils import tokenizer_image_token, process_images, load_image_from_base64, get_model_name_from_path
14
+
15
+ from PIL import Image
16
+ import math
17
+
18
+
19
+ all_options = ['A', 'B', 'C', 'D']
20
+
21
+
22
+ def split_list(lst, n):
23
+ """Split a list into n (roughly) equal-sized chunks"""
24
+ chunk_size = math.ceil(len(lst) / n) # ceiling division
25
+ return [lst[i:i+chunk_size] for i in range(0, len(lst), chunk_size)]
26
+
27
+
28
+ def get_chunk(lst, n, k):
29
+ chunks = split_list(lst, n)
30
+ return chunks[k]
31
+
32
+
33
+ def is_none(value):
34
+ if value is None:
35
+ return True
36
+ if type(value) is float and math.isnan(value):
37
+ return True
38
+ if type(value) is str and value.lower() == 'nan':
39
+ return True
40
+ if type(value) is str and value.lower() == 'none':
41
+ return True
42
+ return False
43
+
44
+ def get_options(row, options):
45
+ parsed_options = []
46
+ for option in options:
47
+ option_value = row[option]
48
+ if is_none(option_value):
49
+ break
50
+ parsed_options.append(option_value)
51
+ return parsed_options
52
+
53
+
54
+ def eval_model(args):
55
+ # Model
56
+ disable_torch_init()
57
+ model_path = os.path.expanduser(args.model_path)
58
+ model_name = get_model_name_from_path(model_path)
59
+ tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name)
60
+
61
+ questions = pd.read_table(os.path.expanduser(args.question_file))
62
+ questions = get_chunk(questions, args.num_chunks, args.chunk_idx)
63
+ answers_file = os.path.expanduser(args.answers_file)
64
+ os.makedirs(os.path.dirname(answers_file), exist_ok=True)
65
+ ans_file = open(answers_file, "w")
66
+
67
+ if 'plain' in model_name and 'finetune' not in model_name.lower() and 'mmtag' not in args.conv_mode:
68
+ args.conv_mode = args.conv_mode + '_mmtag'
69
+ print(f'It looks like this is a plain model, but it is not using an mmtag prompt; auto-switching to {args.conv_mode}.')
70
+
71
+ for index, row in tqdm(questions.iterrows(), total=len(questions)):
72
+ options = get_options(row, all_options)
73
+ cur_option_char = all_options[:len(options)]
74
+
75
+ if args.all_rounds:
76
+ num_rounds = len(options)
77
+ else:
78
+ num_rounds = 1
79
+
80
+ for round_idx in range(num_rounds):
81
+ idx = row['index']
82
+ question = row['question']
83
+ hint = row['hint']
84
+ image = load_image_from_base64(row['image'])
85
+ if not is_none(hint):
86
+ question = hint + '\n' + question
87
+ for option_char, option in zip(all_options[:len(options)], options):
88
+ question = question + '\n' + option_char + '. ' + option
89
+ qs = cur_prompt = question
90
+ if model.config.mm_use_im_start_end:
91
+ qs = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + '\n' + qs
92
+ else:
93
+ qs = DEFAULT_IMAGE_TOKEN + '\n' + qs
94
+
95
+ if args.single_pred_prompt:
96
+ if args.lang == 'cn':
97
+ qs = qs + '\n' + "请直接回答选项字母。"
98
+ else:
99
+ qs = qs + '\n' + "Answer with the option's letter from the given choices directly."
100
+
101
+ conv = conv_templates[args.conv_mode].copy()
102
+ conv.append_message(conv.roles[0], qs)
103
+ conv.append_message(conv.roles[1], None)
104
+ prompt = conv.get_prompt()
105
+
106
+ input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
107
+
108
+ image_tensor = process_images([image], image_processor, model.config)[0]
109
+
110
+ with torch.inference_mode():
111
+ output_ids = model.generate(
112
+ input_ids,
113
+ images=image_tensor.unsqueeze(0).half().cuda(),
114
+ image_sizes=[image.size],
115
+ do_sample=args.temperature > 0,
116
+ temperature=args.temperature,
117
+ top_p=args.top_p,
118
+ num_beams=args.num_beams,
119
+ # no_repeat_ngram_size=3,
120
+ max_new_tokens=1024,
121
+ use_cache=True)
122
+
123
+ outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
124
+
125
+ ans_id = shortuuid.uuid()
126
+ ans_file.write(json.dumps({"question_id": idx,
127
+ "round_id": round_idx,
128
+ "prompt": cur_prompt,
129
+ "text": outputs,
130
+ "options": options,
131
+ "option_char": cur_option_char,
132
+ "answer_id": ans_id,
133
+ "model_id": model_name,
134
+ "metadata": {}}) + "\n")
135
+ ans_file.flush()
136
+
137
+ # rotate options
138
+ options = options[1:] + options[:1]
139
+ cur_option_char = cur_option_char[1:] + cur_option_char[:1]
140
+ ans_file.close()
141
+
142
+ if __name__ == "__main__":
143
+ parser = argparse.ArgumentParser()
144
+ parser.add_argument("--model-path", type=str, default="facebook/opt-350m")
145
+ parser.add_argument("--model-base", type=str, default=None)
146
+ parser.add_argument("--image-folder", type=str, default="")
147
+ parser.add_argument("--question-file", type=str, default="tables/question.jsonl")
148
+ parser.add_argument("--answers-file", type=str, default="answer.jsonl")
149
+ parser.add_argument("--conv-mode", type=str, default="llava_v1")
150
+ parser.add_argument("--num-chunks", type=int, default=1)
151
+ parser.add_argument("--chunk-idx", type=int, default=0)
152
+ parser.add_argument("--temperature", type=float, default=0.2)
153
+ parser.add_argument("--top_p", type=float, default=None)
154
+ parser.add_argument("--num_beams", type=int, default=1)
155
+ parser.add_argument("--all-rounds", action="store_true")
156
+ parser.add_argument("--single-pred-prompt", action="store_true")
157
+ parser.add_argument("--lang", type=str, default="en")
158
+ args = parser.parse_args()
159
+
160
+ eval_model(args)
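Note on option rotation: with `--all-rounds`, the MMBench script asks each question once per option, shifting the options by one position each round while recording which original letter each displayed position corresponds to. The sketch below reproduces only that bookkeeping, without any model calls; `circular_rounds` and the sample options are illustrative names, not part of the commit.

```python
ALL_OPTIONS = ["A", "B", "C", "D"]


def circular_rounds(options):
    """Return (round_idx, prompt_lines, original_letters) for every rotation."""
    shown_letters = ALL_OPTIONS[:len(options)]
    original_letters = list(shown_letters)
    rounds = []
    for round_idx in range(len(options)):
        # The displayed letters stay fixed; only the option texts move.
        prompt_lines = [
            f"{letter}. {text}" for letter, text in zip(shown_letters, options)
        ]
        rounds.append((round_idx, prompt_lines, list(original_letters)))
        # Rotate left by one, mirroring `options[1:] + options[:1]` in the script.
        options = options[1:] + options[:1]
        original_letters = original_letters[1:] + original_letters[:1]
    return rounds


rounds = circular_rounds(["cat", "dog", "bird"])
for round_idx, prompt_lines, original_letters in rounds:
    print(round_idx, prompt_lines, original_letters)
```

In round 1 the model sees `A. dog`, and the recorded letter list starts with `B`, so a downstream scorer can map the answer "A" back to the option that was originally labelled B.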
LLaVA/llava/eval/summarize_gpt_review.py ADDED
@@ -0,0 +1,60 @@
1
+ import json
2
+ import os
3
+ from collections import defaultdict
4
+
5
+ import numpy as np
6
+
7
+ import argparse
8
+
9
+ def parse_args():
10
+ parser = argparse.ArgumentParser(description='ChatGPT-based QA evaluation.')
11
+ parser.add_argument('-d', '--dir', default=None)
12
+ parser.add_argument('-v', '--version', default=None)
13
+ parser.add_argument('-s', '--select', nargs='*', default=None)
14
+ parser.add_argument('-f', '--files', nargs='*', default=[])
15
+ parser.add_argument('-i', '--ignore', nargs='*', default=[])
16
+ return parser.parse_args()
17
+
18
+
19
+ if __name__ == '__main__':
20
+ args = parse_args()
21
+
22
+ if args.ignore is not None:
23
+ args.ignore = [int(x) for x in args.ignore]
24
+
25
+ if len(args.files) > 0:
26
+ review_files = args.files
27
+ else:
28
+ review_files = [x for x in os.listdir(args.dir) if x.endswith('.jsonl') and (x.startswith('gpt4_text') or x.startswith('reviews_') or x.startswith('review_') or 'review' in args.dir)]
29
+
30
+ for review_file in sorted(review_files):
31
+ config = os.path.basename(review_file).replace('gpt4_text_', '').replace('.jsonl', '')
32
+ if args.select is not None and any(x not in config for x in args.select):
33
+ continue
34
+ if '0613' in config:
35
+ version = '0613'
36
+ else:
37
+ version = '0314'
38
+ if args.version is not None and args.version != version:
39
+ continue
40
+ scores = defaultdict(list)
41
+ print(config)
42
+ with open(os.path.join(args.dir, review_file) if args.dir is not None else review_file) as f:
43
+ for review_str in f:
44
+ review = json.loads(review_str)
45
+ if review['question_id'] in args.ignore:
46
+ continue
47
+ if 'category' in review:
48
+ scores[review['category']].append(review['tuple'])
49
+ scores['all'].append(review['tuple'])
50
+ else:
51
+ if 'tuple' in review:
52
+ scores['all'].append(review['tuple'])
53
+ else:
54
+ scores['all'].append(review['score'])
55
+ for k, v in sorted(scores.items()):
56
+ stats = np.asarray(v).mean(0).tolist()
57
+ stats = [round(x, 3) for x in stats]
58
+ # print(k, stats, round(stats[1]/stats[0]*100, 1))
59
+ print(k, round(stats[1]/stats[0]*100, 1), round(stats[0] * 10, 1), round(stats[1] * 10, 1))
60
+ print('=================================')
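Note on the printed summary: for each category the script averages the score pairs, then prints a relative score (second mean divided by first, as a percentage) followed by both means scaled by 10. The sketch below redoes that arithmetic without NumPy. The function name `summarize` and the sample pairs are made up, and it assumes each review stores a two-element pair with the reference answer's score first.

```python
def summarize(pairs):
    """Return (relative %, first mean x10, second mean x10) for score pairs."""
    count = len(pairs)
    # Column-wise means, rounded to 3 decimals as in the script.
    means = [round(sum(pair[i] for pair in pairs) / count, 3) for i in range(2)]
    relative = round(means[1] / means[0] * 100, 1)
    return relative, round(means[0] * 10, 1), round(means[1] * 10, 1)


# Two hypothetical reviews: (reference score, evaluated-model score).
summary = summarize([(8, 7), (9, 8)])
print(summary)
```

Here the means are 8.5 and 7.5, so the relative score is about 88.2 and the scaled means are 85.0 and 75.0. A first mean of zero would raise `ZeroDivisionError`, in the original script as well as in this sketch.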
LLaVA/llava/eval/webpage/styles.css ADDED
@@ -0,0 +1,105 @@
1
+ body {
2
+ font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
3
+ background-color: #f8f9fa;
4
+ }
5
+
6
+ .navbar-dark .navbar-nav .nav-link {
7
+ color: #f1cf68;
8
+ font-size: 1.1rem;
9
+ padding: 0.5rem 0.6rem;
10
+ }
11
+
12
+ .card-header {
13
+ font-weight: bold;
14
+ }
15
+
16
+ .card {
17
+ box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);
18
+ transition: 0.3s;
19
+ }
20
+
21
+ .card:hover {
22
+ box-shadow: 0 8px 16px rgba(0, 0, 0, 0.2);
23
+ }
24
+
25
+ button {
26
+ transition: background-color 0.3s;
27
+ }
28
+
29
+ button:hover {
30
+ background-color: #007bff;
31
+ }
32
+
33
+ @media (max-width: 767px) {
34
+ .form-row .form-group {
35
+ margin-bottom: 10px;
36
+ }
37
+ }
38
+
39
+ /* Extra styles */
40
+
41
+ .expandable-card .card-text-container {
42
+ max-height: 200px;
43
+ overflow-y: hidden;
44
+ position: relative;
45
+ }
46
+
47
+ .expandable-card.expanded .card-text-container {
48
+ max-height: none;
49
+ }
50
+
51
+ .expand-btn {
52
+ position: relative;
53
+ display: none;
54
+ background-color: rgba(255, 255, 255, 0.8);
55
+ color: #510c75;
56
+ border-color: transparent;
57
+ }
58
+
59
+ .expand-btn:hover {
60
+ background-color: rgba(200, 200, 200, 0.8);
61
+ text-decoration: none;
62
+ border-color: transparent;
63
+ color: #510c75;
64
+ }
65
+
66
+ .expand-btn:focus {
67
+ outline: none;
68
+ text-decoration: none;
69
+ }
70
+
71
+ .expandable-card:not(.expanded) .card-text-container:after {
72
+ content: "";
73
+ position: absolute;
74
+ bottom: 0;
75
+ left: 0;
76
+ width: 100%;
77
+ height: 90px;
78
+ background: linear-gradient(rgba(255, 255, 255, 0.2), rgba(255, 255, 255, 1));
79
+ }
80
+
81
+ .expandable-card:not(.expanded) .expand-btn {
82
+ margin-top: -40px;
83
+ }
84
+
85
+ .card-body {
86
+ padding-bottom: 5px;
87
+ }
88
+
89
+ .vertical-flex-layout {
90
+ justify-content: center;
91
+ align-items: center;
92
+ height: 100%;
93
+ display: flex;
94
+ flex-direction: column;
95
+ gap: 5px;
96
+ }
97
+
98
+ .figure-img {
99
+ max-width: 100%;
100
+ height: auto;
101
+ }
102
+
103
+ .adjustable-font-size {
104
+ font-size: calc(0.5rem + 2vw);
105
+ }
LLaVA/llava/model/__init__.py ADDED
@@ -0,0 +1,6 @@
1
+ try:
2
+ from .language_model.llava_llama import LlavaLlamaForCausalLM, LlavaConfig
3
+ from .language_model.llava_mpt import LlavaMptForCausalLM, LlavaMptConfig
4
+ from .language_model.llava_mistral import LlavaMistralForCausalLM, LlavaMistralConfig
5
+ except Exception:
6
+ pass
LLaVA/llava/model/apply_delta.py ADDED
@@ -0,0 +1,48 @@
1
+ """
2
+ Usage:
3
+ python3 -m llava.model.apply_delta --base-model-path ~/model_weights/llama-7b --target-model-path ~/model_weights/llava-7b --delta-path /path/to/llava-7b-delta
4
+ """
5
+ import argparse
6
+
7
+ import torch
8
+ from tqdm import tqdm
9
+ from transformers import AutoTokenizer, AutoModelForCausalLM
10
+ from llava import LlavaLlamaForCausalLM
11
+
12
+
13
+ def apply_delta(base_model_path, target_model_path, delta_path):
14
+ print("Loading base model")
15
+ base = AutoModelForCausalLM.from_pretrained(
16
+ base_model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)
17
+
18
+ print("Loading delta")
19
+ delta = LlavaLlamaForCausalLM.from_pretrained(delta_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)
20
+ delta_tokenizer = AutoTokenizer.from_pretrained(delta_path)
21
+
22
+ print("Applying delta")
23
+ for name, param in tqdm(delta.state_dict().items(), desc="Applying delta"):
24
+ if name not in base.state_dict():
25
+ assert name in ['model.mm_projector.weight', 'model.mm_projector.bias'], f'{name} not in base model'
26
+ continue
27
+ if param.data.shape == base.state_dict()[name].shape:
28
+ param.data += base.state_dict()[name]
29
+ else:
30
+ assert name in ['model.embed_tokens.weight', 'lm_head.weight'], \
31
+ f'{name} dimension mismatch: {param.data.shape} vs {base.state_dict()[name].shape}'
32
+ bparam = base.state_dict()[name]
33
+ param.data[:bparam.shape[0], :bparam.shape[1]] += bparam
34
+
35
+ print("Saving target model")
36
+ delta.save_pretrained(target_model_path)
37
+ delta_tokenizer.save_pretrained(target_model_path)
38
+
39
+
40
+ if __name__ == "__main__":
41
+ parser = argparse.ArgumentParser()
42
+ parser.add_argument("--base-model-path", type=str, required=True)
43
+ parser.add_argument("--target-model-path", type=str, required=True)
44
+ parser.add_argument("--delta-path", type=str, required=True)
45
+
46
+ args = parser.parse_args()
47
+
48
+ apply_delta(args.base_model_path, args.target_model_path, args.delta_path)
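Note on mismatched shapes: when the delta checkpoint has a larger vocabulary than the base model, `apply_delta` adds the base weights only into the top-left block of the delta's embedding and output matrices, leaving the extra rows as stored in the delta. The sketch below shows that slice-add on plain nested lists instead of tensors. The helper name `apply_delta_2d` and the numbers are illustrative only.

```python
def apply_delta_2d(delta, base):
    """Add base into the top-left corner of delta, in place, and return delta."""
    for row_idx, base_row in enumerate(base):
        for col_idx, value in enumerate(base_row):
            delta[row_idx][col_idx] += value
    return delta


# Base has 2 "tokens"; the delta has 3 because one token was added later.
base = [[1.0, 2.0], [3.0, 4.0]]
delta = [[0.5, 0.5], [0.5, 0.5], [9.0, 9.0]]

target = apply_delta_2d(delta, base)
print(target)
```

The third row is untouched, which corresponds to the rows for newly added tokens: they exist only in the delta, so there is nothing in the base model to add to them.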
LLaVA/llava/model/builder.py ADDED
@@ -0,0 +1,167 @@
1
+ # Copyright 2023 Haotian Liu
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+
16
+ import os
17
+ import warnings
18
+ import shutil
19
+
20
+ from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, BitsAndBytesConfig
21
+ import torch
22
+ from llava.model import *
23
+ from llava.constants import DEFAULT_IMAGE_PATCH_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
24
+
25
+
26
+ def load_pretrained_model(model_path, model_base, model_name, load_8bit=False, load_4bit=False, device_map="auto", device="cuda", use_flash_attn=False, **kwargs):
27
+ kwargs = {"device_map": device_map, **kwargs}
28
+
29
+ if device != "cuda":
30
+ kwargs['device_map'] = {"": device}
31
+
32
+ if load_8bit:
33
+ kwargs['load_in_8bit'] = True
34
+ elif load_4bit:
35
+ kwargs['load_in_4bit'] = True
36
+ kwargs['quantization_config'] = BitsAndBytesConfig(
37
+ load_in_4bit=True,
38
+ bnb_4bit_compute_dtype=torch.float16,
39
+ bnb_4bit_use_double_quant=True,
40
+ bnb_4bit_quant_type='nf4'
41
+ )
42
+ else:
43
+ kwargs['torch_dtype'] = torch.float16
44
+
45
+ if use_flash_attn:
46
+ kwargs['attn_implementation'] = 'flash_attention_2'
47
+
48
+ if 'llava' in model_name.lower():
49
+ # Load LLaVA model
50
+ if 'lora' in model_name.lower() and model_base is None:
51
+ warnings.warn('The model name contains `lora` but no `model_base` was provided. If you are loading a LoRA model, please provide the `model_base` argument. Detailed instructions: https://github.com/haotian-liu/LLaVA#launch-a-model-worker-lora-weights-unmerged.')
52
+ if 'lora' in model_name.lower() and model_base is not None:
53
+ from llava.model.language_model.llava_llama import LlavaConfig
54
+ lora_cfg_pretrained = LlavaConfig.from_pretrained(model_path)
55
+ tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
56
+ print('Loading LLaVA from base model...')
57
+ model = LlavaLlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, **kwargs)
58
+ token_num, token_dim = model.lm_head.out_features, model.lm_head.in_features
59
+ if model.lm_head.weight.shape[0] != token_num:
60
+ model.lm_head.weight = torch.nn.Parameter(torch.empty(token_num, token_dim, device=model.device, dtype=model.dtype))
61
+ model.model.embed_tokens.weight = torch.nn.Parameter(torch.empty(token_num, token_dim, device=model.device, dtype=model.dtype))
62
+
63
+ print('Loading additional LLaVA weights...')
64
+ if os.path.exists(os.path.join(model_path, 'non_lora_trainables.bin')):
65
+ non_lora_trainables = torch.load(os.path.join(model_path, 'non_lora_trainables.bin'), map_location='cpu')
66
+ else:
67
+ # this is probably from HF Hub
68
+ from huggingface_hub import hf_hub_download
69
+ def load_from_hf(repo_id, filename, subfolder=None):
70
+ cache_file = hf_hub_download(
71
+ repo_id=repo_id,
72
+ filename=filename,
73
+ subfolder=subfolder)
74
+ return torch.load(cache_file, map_location='cpu')
75
+ non_lora_trainables = load_from_hf(model_path, 'non_lora_trainables.bin')
76
+ non_lora_trainables = {(k[11:] if k.startswith('base_model.') else k): v for k, v in non_lora_trainables.items()}
77
+ if any(k.startswith('model.model.') for k in non_lora_trainables):
78
+ non_lora_trainables = {(k[6:] if k.startswith('model.') else k): v for k, v in non_lora_trainables.items()}
79
+ model.load_state_dict(non_lora_trainables, strict=False)
80
+
81
+ from peft import PeftModel
82
+ print('Loading LoRA weights...')
83
+ model = PeftModel.from_pretrained(model, model_path)
84
+ print('Merging LoRA weights...')
85
+ model = model.merge_and_unload()
86
+ print('Model is loaded...')
87
+ elif model_base is not None:
88
+ # this may be mm projector only
89
+ print('Loading LLaVA from base model...')
90
+ if 'mpt' in model_name.lower():
91
+ if not os.path.isfile(os.path.join(model_path, 'configuration_mpt.py')):
92
+ shutil.copyfile(os.path.join(model_base, 'configuration_mpt.py'), os.path.join(model_path, 'configuration_mpt.py'))
93
+ tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=True)
94
+ cfg_pretrained = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
95
+ model = LlavaMptForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, **kwargs)
96
+ else:
97
+ tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
98
+ cfg_pretrained = AutoConfig.from_pretrained(model_path)
99
+ model = LlavaLlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, **kwargs)
100
+
101
+ mm_projector_weights = torch.load(os.path.join(model_path, 'mm_projector.bin'), map_location='cpu')
102
+ mm_projector_weights = {k: v.to(torch.float16) for k, v in mm_projector_weights.items()}
103
+ model.load_state_dict(mm_projector_weights, strict=False)
104
+ else:
105
+ if 'mpt' in model_name.lower():
106
+ tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
107
+ model = LlavaMptForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
108
+ elif 'mistral' in model_name.lower():
109
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
110
+ model = LlavaMistralForCausalLM.from_pretrained(
111
+ model_path,
112
+ low_cpu_mem_usage=True,
113
+ **kwargs
114
+ )
115
+ else:
116
+ tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
117
+ model = LlavaLlamaForCausalLM.from_pretrained(
118
+ model_path,
119
+ low_cpu_mem_usage=True,
120
+ **kwargs
121
+ )
122
+ else:
123
+ # Load language model
124
+ if model_base is not None:
125
+ # PEFT model
126
+ from peft import PeftModel
127
+ tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
128
+ model = AutoModelForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, **kwargs)
129
+ print(f"Loading LoRA weights from {model_path}")
130
+ model = PeftModel.from_pretrained(model, model_path)
131
+ print("Merging weights")
132
+ model = model.merge_and_unload()
133
+ print('Convert to FP16...')
134
+ model.to(torch.float16)
135
+ else:
136
+ use_fast = False
137
+ if 'mpt' in model_name.lower():
138
+ tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
139
+ model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, trust_remote_code=True, **kwargs)
140
+ else:
141
+ tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
142
+ model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
143
+
144
+ image_processor = None
145
+
146
+ if 'llava' in model_name.lower():
147
+ mm_use_im_start_end = getattr(model.config, "mm_use_im_start_end", False)
148
+ mm_use_im_patch_token = getattr(model.config, "mm_use_im_patch_token", True)
149
+ if mm_use_im_patch_token:
150
+ tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True)
151
+ if mm_use_im_start_end:
152
+ tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True)
153
+ model.resize_token_embeddings(len(tokenizer))
154
+
155
+ vision_tower = model.get_vision_tower()
156
+ if not vision_tower.is_loaded:
157
+ vision_tower.load_model(device_map=device_map)
158
+ if device_map != 'auto':
159
+ vision_tower.to(device=device_map, dtype=torch.float16)
160
+ image_processor = vision_tower.image_processor
161
+
162
+ if hasattr(model.config, "max_sequence_length"):
163
+ context_len = model.config.max_sequence_length
164
+ else:
165
+ context_len = 2048
166
+
167
+ return tokenizer, model, image_processor, context_len
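Note on the `non_lora_trainables` key handling in the LoRA branch above: the two dictionary comprehensions strip a leading `base_model.` from every key, then strip one leading `model.` only if some key starts with `model.model.`. The sketch below reproduces that renaming on an ordinary dict so the effect can be inspected without loading a checkpoint; the function name `normalize_keys` and the example keys are illustrative assumptions, not names taken from a real checkpoint.

```python
def normalize_keys(state):
    """Strip PEFT wrapper prefixes the same way the LoRA branch does."""
    prefix = 'base_model.'
    state = {(k[len(prefix):] if k.startswith(prefix) else k): v
             for k, v in state.items()}
    # Only unwrap one more level if a doubly nested `model.model.` key exists.
    if any(k.startswith('model.model.') for k in state):
        state = {(k[len('model.'):] if k.startswith('model.') else k): v
                 for k, v in state.items()}
    return state


wrapped = {
    'base_model.model.model.mm_projector.0.weight': 1,
    'base_model.model.lm_head.weight': 2,
}
print(normalize_keys(wrapped))
```

Keys that are already in their final form pass through unchanged, so the function is safe to apply to either layout.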
LLaVA/llava/model/consolidate.py ADDED
@@ -0,0 +1,29 @@
1
+ """
2
+ Usage:
3
+ python3 -m llava.model.consolidate --src ~/model_weights/llava-7b --dst ~/model_weights/llava-7b_consolidate
4
+ """
5
+ import argparse
6
+
7
+ import torch
8
+ from transformers import AutoTokenizer, AutoModelForCausalLM
9
+ from llava.model import *
10
+ from llava.model.utils import auto_upgrade
11
+
12
+
13
+ def consolidate_ckpt(src_path, dst_path):
14
+ print("Loading model")
15
+ auto_upgrade(src_path)
16
+ src_model = AutoModelForCausalLM.from_pretrained(src_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)
17
+ src_tokenizer = AutoTokenizer.from_pretrained(src_path, use_fast=False)
18
+ src_model.save_pretrained(dst_path)
19
+ src_tokenizer.save_pretrained(dst_path)
20
+
21
+
22
+ if __name__ == "__main__":
23
+ parser = argparse.ArgumentParser()
24
+ parser.add_argument("--src", type=str, required=True)
25
+ parser.add_argument("--dst", type=str, required=True)
26
+
27
+ args = parser.parse_args()
28
+
29
+ consolidate_ckpt(args.src, args.dst)
LLaVA/llava/model/llava_arch.py ADDED
@@ -0,0 +1,368 @@
1
+ # Copyright 2023 Haotian Liu
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+
16
+ from abc import ABC, abstractmethod
17
+
18
+ import torch
19
+ import torch.nn as nn
20
+
21
+ from .multimodal_encoder.builder import build_vision_tower
22
+ from .multimodal_projector.builder import build_vision_projector
23
+
24
+ from llava.constants import IGNORE_INDEX, IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_PATCH_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
25
+
26
+ from llava.mm_utils import get_anyres_image_grid_shape
27
+
28
+
29
+ class LlavaMetaModel:
30
+
31
+ def __init__(self, config):
32
+ super(LlavaMetaModel, self).__init__(config)
33
+
34
+ if hasattr(config, "mm_vision_tower"):
35
+ self.vision_tower = build_vision_tower(config, delay_load=True)
36
+ self.mm_projector = build_vision_projector(config)
37
+
38
+ if 'unpad' in getattr(config, 'mm_patch_merge_type', ''):
39
+ self.image_newline = nn.Parameter(
40
+ torch.empty(config.hidden_size, dtype=self.dtype)
41
+ )
42
+
43
+ def get_vision_tower(self):
44
+ vision_tower = getattr(self, 'vision_tower', None)
45
+ if type(vision_tower) is list:
46
+ vision_tower = vision_tower[0]
47
+ return vision_tower
48
+
49
+ def initialize_vision_modules(self, model_args, fsdp=None):
50
+ vision_tower = model_args.vision_tower
51
+ mm_vision_select_layer = model_args.mm_vision_select_layer
52
+ mm_vision_select_feature = model_args.mm_vision_select_feature
53
+ pretrain_mm_mlp_adapter = model_args.pretrain_mm_mlp_adapter
54
+ mm_patch_merge_type = model_args.mm_patch_merge_type
55
+
56
+ self.config.mm_vision_tower = vision_tower
57
+
58
+ if self.get_vision_tower() is None:
59
+ vision_tower = build_vision_tower(model_args)
60
+
61
+ if fsdp is not None and len(fsdp) > 0:
62
+ self.vision_tower = [vision_tower]
63
+ else:
64
+ self.vision_tower = vision_tower
65
+ else:
66
+ if fsdp is not None and len(fsdp) > 0:
67
+ vision_tower = self.vision_tower[0]
68
+ else:
69
+ vision_tower = self.vision_tower
70
+ vision_tower.load_model()
71
+
72
+ self.config.use_mm_proj = True
73
+ self.config.mm_projector_type = getattr(model_args, 'mm_projector_type', 'linear')
74
+ self.config.mm_hidden_size = vision_tower.hidden_size
75
+ self.config.mm_vision_select_layer = mm_vision_select_layer
76
+ self.config.mm_vision_select_feature = mm_vision_select_feature
77
+ self.config.mm_patch_merge_type = mm_patch_merge_type
78
+
79
+ if getattr(self, 'mm_projector', None) is None:
80
+ self.mm_projector = build_vision_projector(self.config)
81
+
82
+ if 'unpad' in mm_patch_merge_type:
83
+ embed_std = 1 / torch.sqrt(torch.tensor(self.config.hidden_size, dtype=self.dtype))
84
+ self.image_newline = nn.Parameter(
85
+ torch.randn(self.config.hidden_size, dtype=self.dtype) * embed_std
86
+ )
87
+ else:
88
+ # In case it is frozen by LoRA
89
+ for p in self.mm_projector.parameters():
90
+ p.requires_grad = True
91
+
92
+ if pretrain_mm_mlp_adapter is not None:
93
+ mm_projector_weights = torch.load(pretrain_mm_mlp_adapter, map_location='cpu')
94
+ def get_w(weights, keyword):
95
+ return {k.split(keyword + '.')[1]: v for k, v in weights.items() if keyword in k}
96
+
97
+ self.mm_projector.load_state_dict(get_w(mm_projector_weights, 'mm_projector'))
98
+
99
+
100
+ def unpad_image(tensor, original_size):
101
+ """
102
+ Unpads a PyTorch tensor of a padded and resized image.
103
+
104
+ Args:
105
+ tensor (torch.Tensor): The image tensor, assumed to be in CxHxW format.
106
+ original_size (tuple): The original size of PIL image (width, height).
107
+
108
+ Returns:
109
+ torch.Tensor: The unpadded image tensor.
110
+ """
111
+ original_width, original_height = original_size
112
+ current_height, current_width = tensor.shape[1:]
113
+
114
+ original_aspect_ratio = original_width / original_height
115
+ current_aspect_ratio = current_width / current_height
116
+
117
+ if original_aspect_ratio > current_aspect_ratio:
118
+ scale_factor = current_width / original_width
119
+ new_height = int(original_height * scale_factor)
120
+ padding = (current_height - new_height) // 2
121
+ unpadded_tensor = tensor[:, padding:current_height - padding, :]
122
+ else:
123
+ scale_factor = current_height / original_height
124
+ new_width = int(original_width * scale_factor)
125
+ padding = (current_width - new_width) // 2
126
+ unpadded_tensor = tensor[:, :, padding:current_width - padding]
127
+
128
+ return unpadded_tensor
129
+
130
+
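Note on `unpad_image`: the function compares the original aspect ratio with that of the padded feature grid and crops either rows or columns symmetrically. The sketch below computes only the crop bounds, with plain integers instead of a tensor, so the arithmetic can be checked by hand. The helper name `unpad_bounds` and its return layout `(row_start, row_stop, col_start, col_stop)` are assumptions made for illustration.

```python
def unpad_bounds(current_hw, original_wh):
    """Return (row_start, row_stop, col_start, col_stop) kept by unpadding."""
    current_height, current_width = current_hw
    original_width, original_height = original_wh

    if original_width / original_height > current_width / current_height:
        # Image is wider than the grid: padding was added above and below.
        scale = current_width / original_width
        new_height = int(original_height * scale)
        pad = (current_height - new_height) // 2
        return (pad, current_height - pad, 0, current_width)

    # Image is taller than (or matches) the grid: padding is left and right.
    scale = current_height / original_height
    new_width = int(original_width * scale)
    pad = (current_width - new_width) // 2
    return (0, current_height, pad, current_width - pad)


wide = unpad_bounds((32, 32), (1024, 512))
tall = unpad_bounds((32, 32), (512, 1024))
print(wide, tall)
```

For a 1024x512 image on a 32x32 grid, half the rows are padding, so rows 8 to 24 are kept and all columns survive.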
131
+ class LlavaMetaForCausalLM(ABC):
132
+
133
+ @abstractmethod
134
+ def get_model(self):
135
+ pass
136
+
137
+ def get_vision_tower(self):
138
+ return self.get_model().get_vision_tower()
139
+
140
+ def encode_images(self, images):
141
+ image_features = self.get_model().get_vision_tower()(images)
142
+ image_features = self.get_model().mm_projector(image_features)
143
+ return image_features
144
+
145
+ def prepare_inputs_labels_for_multimodal(
146
+ self, input_ids, position_ids, attention_mask, past_key_values, labels,
147
+ images, image_sizes=None
148
+ ):
149
+ vision_tower = self.get_vision_tower()
150
+ if vision_tower is None or images is None or input_ids.shape[1] == 1:
151
+ return input_ids, position_ids, attention_mask, past_key_values, None, labels
152
+
153
+ if type(images) is list or images.ndim == 5:
154
+ if type(images) is list:
155
+ images = [x.unsqueeze(0) if x.ndim == 3 else x for x in images]
156
+ concat_images = torch.cat([image for image in images], dim=0)
157
+ image_features = self.encode_images(concat_images)
158
+ split_sizes = [image.shape[0] for image in images]
159
+ image_features = torch.split(image_features, split_sizes, dim=0)
160
+ mm_patch_merge_type = getattr(self.config, 'mm_patch_merge_type', 'flat')
161
+ image_aspect_ratio = getattr(self.config, 'image_aspect_ratio', 'square')
162
+ if mm_patch_merge_type == 'flat':
163
+ image_features = [x.flatten(0, 1) for x in image_features]
164
+ elif mm_patch_merge_type.startswith('spatial'):
165
+ new_image_features = []
166
+ for image_idx, image_feature in enumerate(image_features):
167
+ if image_feature.shape[0] > 1:
168
+ base_image_feature = image_feature[0]
169
+ image_feature = image_feature[1:]
170
+ height = width = self.get_vision_tower().num_patches_per_side
171
+ assert height * width == base_image_feature.shape[0]
172
+ if image_aspect_ratio == 'anyres':
173
+ num_patch_width, num_patch_height = get_anyres_image_grid_shape(image_sizes[image_idx], self.config.image_grid_pinpoints, self.get_vision_tower().config.image_size)
174
+ image_feature = image_feature.view(num_patch_height, num_patch_width, height, width, -1)
175
+ else:
176
+ raise NotImplementedError
177
+ if 'unpad' in mm_patch_merge_type:
178
+ image_feature = image_feature.permute(4, 0, 2, 1, 3).contiguous()
179
+ image_feature = image_feature.flatten(1, 2).flatten(2, 3)
180
+ image_feature = unpad_image(image_feature, image_sizes[image_idx])
181
+ image_feature = torch.cat((
182
+ image_feature,
183
+ self.model.image_newline[:, None, None].expand(*image_feature.shape[:-1], 1).to(image_feature.device)
184
+ ), dim=-1)
185
+ image_feature = image_feature.flatten(1, 2).transpose(0, 1)
186
+ else:
187
+ image_feature = image_feature.permute(0, 2, 1, 3, 4).contiguous()
188
+ image_feature = image_feature.flatten(0, 3)
189
+ image_feature = torch.cat((base_image_feature, image_feature), dim=0)
190
+ else:
191
+ image_feature = image_feature[0]
192
+ if 'unpad' in mm_patch_merge_type:
193
+ image_feature = torch.cat((
194
+ image_feature,
195
+ self.model.image_newline[None].to(image_feature.device)
196
+ ), dim=0)
197
+ new_image_features.append(image_feature)
198
+ image_features = new_image_features
199
+ else:
200
+ raise ValueError(f"Unexpected mm_patch_merge_type: {self.config.mm_patch_merge_type}")
201
+ else:
202
+ image_features = self.encode_images(images)
203
+
204
+ # TODO: image start / end is not implemented here to support pretraining.
205
+ if getattr(self.config, 'tune_mm_mlp_adapter', False) and getattr(self.config, 'mm_use_im_start_end', False):
206
+ raise NotImplementedError
207
+
208
+ # Let's just add dummy tensors if they do not exist,
209
+ # it is a headache to deal with None all the time.
210
+ # But it is not ideal, and if you have a better idea,
211
+ # please open an issue / submit a PR, thanks.
212
+ _labels = labels
213
+ _position_ids = position_ids
214
+ _attention_mask = attention_mask
215
+ if attention_mask is None:
216
+ attention_mask = torch.ones_like(input_ids, dtype=torch.bool)
217
+ else:
218
+ attention_mask = attention_mask.bool()
219
+ if position_ids is None:
220
+ position_ids = torch.arange(0, input_ids.shape[1], dtype=torch.long, device=input_ids.device)
221
+ if labels is None:
222
+ labels = torch.full_like(input_ids, IGNORE_INDEX)
223
+
224
+ # remove the padding using attention_mask -- FIXME
225
+ _input_ids = input_ids
226
+ input_ids = [cur_input_ids[cur_attention_mask] for cur_input_ids, cur_attention_mask in zip(input_ids, attention_mask)]
227
+ labels = [cur_labels[cur_attention_mask] for cur_labels, cur_attention_mask in zip(labels, attention_mask)]
228
+
229
+ new_input_embeds = []
230
+ new_labels = []
231
+ cur_image_idx = 0
232
+ for batch_idx, cur_input_ids in enumerate(input_ids):
233
+ num_images = (cur_input_ids == IMAGE_TOKEN_INDEX).sum()
234
+ if num_images == 0:
235
+ cur_image_features = image_features[cur_image_idx]
236
+ cur_input_embeds_1 = self.get_model().embed_tokens(cur_input_ids)
237
+ cur_input_embeds = torch.cat([cur_input_embeds_1, cur_image_features[0:0]], dim=0)
238
+ new_input_embeds.append(cur_input_embeds)
239
+ new_labels.append(labels[batch_idx])
240
+ cur_image_idx += 1
241
+ continue
242
+
243
+ image_token_indices = [-1] + torch.where(cur_input_ids == IMAGE_TOKEN_INDEX)[0].tolist() + [cur_input_ids.shape[0]]
244
+ cur_input_ids_noim = []
245
+ cur_labels = labels[batch_idx]
246
+ cur_labels_noim = []
247
+ for i in range(len(image_token_indices) - 1):
248
+ cur_input_ids_noim.append(cur_input_ids[image_token_indices[i]+1:image_token_indices[i+1]])
249
+ cur_labels_noim.append(cur_labels[image_token_indices[i]+1:image_token_indices[i+1]])
250
+ split_sizes = [x.shape[0] for x in cur_labels_noim]
251
+ cur_input_embeds = self.get_model().embed_tokens(torch.cat(cur_input_ids_noim))
252
+ cur_input_embeds_no_im = torch.split(cur_input_embeds, split_sizes, dim=0)
253
+ cur_new_input_embeds = []
254
+ cur_new_labels = []
255
+
256
+ for i in range(num_images + 1):
257
+ cur_new_input_embeds.append(cur_input_embeds_no_im[i])
258
+ cur_new_labels.append(cur_labels_noim[i])
259
+ if i < num_images:
260
+ cur_image_features = image_features[cur_image_idx]
261
+ cur_image_idx += 1
262
+ cur_new_input_embeds.append(cur_image_features)
263
+ cur_new_labels.append(torch.full((cur_image_features.shape[0],), IGNORE_INDEX, device=cur_labels.device, dtype=cur_labels.dtype))
264
+
265
+ cur_new_input_embeds = [x.to(self.device) for x in cur_new_input_embeds]
266
+
267
+ cur_new_input_embeds = torch.cat(cur_new_input_embeds)
268
+ cur_new_labels = torch.cat(cur_new_labels)
269
+
270
+ new_input_embeds.append(cur_new_input_embeds)
271
+ new_labels.append(cur_new_labels)
272
+
273
+ # Truncate sequences to max length as image embeddings can make the sequence longer
274
+ tokenizer_model_max_length = getattr(self.config, 'tokenizer_model_max_length', None)
275
+ if tokenizer_model_max_length is not None:
276
+ new_input_embeds = [x[:tokenizer_model_max_length] for x in new_input_embeds]
277
+ new_labels = [x[:tokenizer_model_max_length] for x in new_labels]
278
+
279
+ # Combine them
280
+ max_len = max(x.shape[0] for x in new_input_embeds)
281
+ batch_size = len(new_input_embeds)
282
+
283
+ new_input_embeds_padded = []
284
+ new_labels_padded = torch.full((batch_size, max_len), IGNORE_INDEX, dtype=new_labels[0].dtype, device=new_labels[0].device)
285
+ attention_mask = torch.zeros((batch_size, max_len), dtype=attention_mask.dtype, device=attention_mask.device)
286
+ position_ids = torch.zeros((batch_size, max_len), dtype=position_ids.dtype, device=position_ids.device)
287
+
288
+ for i, (cur_new_embed, cur_new_labels) in enumerate(zip(new_input_embeds, new_labels)):
289
+ cur_len = cur_new_embed.shape[0]
290
+ if getattr(self.config, 'tokenizer_padding_side', 'right') == "left":
291
+ new_input_embeds_padded.append(torch.cat((
292
+ torch.zeros((max_len - cur_len, cur_new_embed.shape[1]), dtype=cur_new_embed.dtype, device=cur_new_embed.device),
293
+ cur_new_embed
294
+ ), dim=0))
295
+ if cur_len > 0:
296
+ new_labels_padded[i, -cur_len:] = cur_new_labels
297
+ attention_mask[i, -cur_len:] = True
298
+ position_ids[i, -cur_len:] = torch.arange(0, cur_len, dtype=position_ids.dtype, device=position_ids.device)
299
+ else:
300
+ new_input_embeds_padded.append(torch.cat((
301
+ cur_new_embed,
302
+ torch.zeros((max_len - cur_len, cur_new_embed.shape[1]), dtype=cur_new_embed.dtype, device=cur_new_embed.device)
303
+ ), dim=0))
304
+ if cur_len > 0:
305
+ new_labels_padded[i, :cur_len] = cur_new_labels
306
+ attention_mask[i, :cur_len] = True
307
+ position_ids[i, :cur_len] = torch.arange(0, cur_len, dtype=position_ids.dtype, device=position_ids.device)
308
+
309
+ new_input_embeds = torch.stack(new_input_embeds_padded, dim=0)
310
+
311
+ if _labels is None:
312
+ new_labels = None
313
+ else:
314
+ new_labels = new_labels_padded
315
+
316
+ if _attention_mask is None:
317
+ attention_mask = None
318
+ else:
319
+ attention_mask = attention_mask.to(dtype=_attention_mask.dtype)
320
+
321
+ if _position_ids is None:
322
+ position_ids = None
323
+
324
+ return None, position_ids, attention_mask, past_key_values, new_input_embeds, new_labels
325
+
326
+ def initialize_vision_tokenizer(self, model_args, tokenizer):
327
+ if model_args.mm_use_im_patch_token:
328
+ tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True)
329
+ self.resize_token_embeddings(len(tokenizer))
330
+
331
+ if model_args.mm_use_im_start_end:
332
+ num_new_tokens = tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True)
333
+ self.resize_token_embeddings(len(tokenizer))
334
+
335
+ if num_new_tokens > 0:
336
+ input_embeddings = self.get_input_embeddings().weight.data
337
+ output_embeddings = self.get_output_embeddings().weight.data
338
+
339
+ input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(
340
+ dim=0, keepdim=True)
341
+ output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(
342
+ dim=0, keepdim=True)
343
+
344
+ input_embeddings[-num_new_tokens:] = input_embeddings_avg
345
+ output_embeddings[-num_new_tokens:] = output_embeddings_avg
346
+
347
+ if model_args.tune_mm_mlp_adapter:
348
+ for p in self.get_input_embeddings().parameters():
349
+ p.requires_grad = True
350
+ for p in self.get_output_embeddings().parameters():
351
+ p.requires_grad = False
352
+
353
+ if model_args.pretrain_mm_mlp_adapter:
354
+ mm_projector_weights = torch.load(model_args.pretrain_mm_mlp_adapter, map_location='cpu')
355
+ embed_tokens_weight = mm_projector_weights['model.embed_tokens.weight']
356
+ assert num_new_tokens == 2
357
+ if input_embeddings.shape == embed_tokens_weight.shape:
358
+ input_embeddings[-num_new_tokens:] = embed_tokens_weight[-num_new_tokens:]
359
+ elif embed_tokens_weight.shape[0] == num_new_tokens:
360
+ input_embeddings[-num_new_tokens:] = embed_tokens_weight
361
+ else:
362
+ raise ValueError(f"Unexpected embed_tokens_weight shape. Pretrained: {embed_tokens_weight.shape}. Current: {input_embeddings.shape}. Number of new tokens: {num_new_tokens}.")
363
+ elif model_args.mm_use_im_patch_token:
364
+ if model_args.tune_mm_mlp_adapter:
365
+ for p in self.get_input_embeddings().parameters():
366
+ p.requires_grad = False
367
+ for p in self.get_output_embeddings().parameters():
368
+ p.requires_grad = False
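Note on `prepare_inputs_labels_for_multimodal`: each image placeholder in `input_ids` is cut out, the surrounding text segments are kept, and the image features are spliced in with their labels set to the ignore index. The sketch below reproduces only the label bookkeeping on Python lists. The constant values `-200` and `-100` are assumed to match `llava.constants`, and `splice_labels` is a hypothetical helper name.

```python
IMAGE_TOKEN_INDEX = -200  # assumed to match llava.constants
IGNORE_INDEX = -100       # assumed to match llava.constants


def splice_labels(input_ids, labels, image_lengths):
    """Build the label sequence after image features replace placeholders."""
    # Sentinel cuts at -1 and len(input_ids) make every segment a plain slice.
    cuts = ([-1]
            + [i for i, t in enumerate(input_ids) if t == IMAGE_TOKEN_INDEX]
            + [len(input_ids)])
    new_labels = []
    for seg in range(len(cuts) - 1):
        new_labels.extend(labels[cuts[seg] + 1:cuts[seg + 1]])
        if seg < len(cuts) - 2:
            # One ignored label per image feature vector inserted here.
            new_labels.extend([IGNORE_INDEX] * image_lengths[seg])
    return new_labels


ids = [1, 5, IMAGE_TOKEN_INDEX, 7, 8]
labels = [IGNORE_INDEX, IGNORE_INDEX, IGNORE_INDEX, 42, 43]
spliced = splice_labels(ids, labels, image_lengths=[3])
print(spliced)
```

The sequence grows from 5 to 7 positions here, which is why the original code truncates to `tokenizer_model_max_length` after splicing rather than before.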
LLaVA/llava/model/make_delta.py ADDED
@@ -0,0 +1,52 @@
1
+ """
2
+ Usage:
3
+ python3 -m llava.model.make_delta --base-model-path ~/model_weights/llama-7b --target-model-path ~/model_weights/llava-7b --delta-path ~/model_weights/llava-7b-delta --hub-repo-id liuhaotian/llava-7b-delta
4
+ """
5
+ import argparse
6
+
7
+ import torch
8
+ from tqdm import tqdm
9
+ from transformers import AutoTokenizer, AutoModelForCausalLM
10
+ from llava.model.utils import auto_upgrade
11
+
12
+
13
+ def make_delta(base_model_path, target_model_path, delta_path, hub_repo_id):
14
+ print("Loading base model")
15
+ base = AutoModelForCausalLM.from_pretrained(
16
+ base_model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)
17
+
18
+ print("Loading target model")
19
+ auto_upgrade(target_model_path)
20
+ target = AutoModelForCausalLM.from_pretrained(target_model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)
21
+
22
+ print("Calculating delta")
23
+ for name, param in tqdm(target.state_dict().items(), desc="Calculating delta"):
24
+ if name not in base.state_dict():
25
+ assert name in ['model.mm_projector.weight', 'model.mm_projector.bias'], f'{name} not in base model'
26
+ continue
27
+ if param.data.shape == base.state_dict()[name].shape:
28
+ param.data -= base.state_dict()[name]
29
+ else:
30
+ assert name in ['model.embed_tokens.weight', 'lm_head.weight'], f'{name} dimension mismatch: {param.data.shape} vs {base.state_dict()[name].shape}'
31
+ bparam = base.state_dict()[name]
32
+ param.data[:bparam.shape[0], :bparam.shape[1]] -= bparam
33
+
34
+ print("Saving delta")
35
+ if hub_repo_id:
36
+ kwargs = {"push_to_hub": True, "repo_id": hub_repo_id}
37
+ else:
38
+ kwargs = {}
39
+ target.save_pretrained(delta_path, **kwargs)
40
+ target_tokenizer = AutoTokenizer.from_pretrained(target_model_path)
41
+ target_tokenizer.save_pretrained(delta_path, **kwargs)
42
+
43
+
44
+ if __name__ == "__main__":
45
+ parser = argparse.ArgumentParser()
46
+ parser.add_argument("--base-model-path", type=str, required=True)
47
+ parser.add_argument("--target-model-path", type=str, required=True)
48
+ parser.add_argument("--delta-path", type=str, required=True)
49
+ parser.add_argument("--hub-repo-id", type=str, default=None)
50
+ args = parser.parse_args()
51
+
52
+ make_delta(args.base_model_path, args.target_model_path, args.delta_path, args.hub_repo_id)
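Note on the per-parameter rules in `make_delta`: parameters missing from the base are allowed only for the projector, and shape mismatches are allowed only for the token embedding and LM head. The sketch below walks a toy state dict of nested lists through the same rules; `make_delta_state` and `shape` are hypothetical helpers, and all parameters are modelled as 2-D for simplicity.

```python
EXTRA_ONLY = {'model.mm_projector.weight', 'model.mm_projector.bias'}
RESIZABLE = {'model.embed_tokens.weight', 'lm_head.weight'}


def shape(matrix):
    return (len(matrix), len(matrix[0]))


def make_delta_state(target, base):
    """Subtract base weights from target weights in place, name by name."""
    for name, param in target.items():
        if name not in base:
            # Projector weights have no base counterpart and are stored as-is.
            assert name in EXTRA_ONLY, f'{name} not in base model'
            continue
        bparam = base[name]
        if shape(param) != shape(bparam):
            assert name in RESIZABLE, f'{name} dimension mismatch'
        # Subtract over the base's extent; this is the whole matrix when
        # the shapes agree and the top-left block when they do not.
        for r in range(len(bparam)):
            for c in range(len(bparam[0])):
                param[r][c] -= bparam[r][c]
    return target


target = {
    'lm_head.weight': [[11, 21], [31, 41], [5, 5]],
    'model.mm_projector.weight': [[7]],
}
base = {'lm_head.weight': [[10, 20], [30, 40]]}
delta_state = make_delta_state(target, base)
print(delta_state)
```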
LLaVA/llava/model/utils.py ADDED
@@ -0,0 +1,20 @@
1
+ from transformers import AutoConfig
2
+
3
+
4
+ def auto_upgrade(config):
5
+ cfg = AutoConfig.from_pretrained(config)
6
+ if 'llava' in config and 'llava' not in cfg.model_type:
7
+ assert cfg.model_type == 'llama'
8
+ print("You are using a newer LLaVA code base, while the v0 checkpoint is from an older code base.")
9
+ print("You must upgrade the checkpoint to the new code base (this can be done automatically).")
10
+ confirm = input("Please confirm that you want to upgrade the checkpoint. [Y/N]")
11
+ if confirm.lower() in ["y", "yes"]:
12
+ print("Upgrading checkpoint...")
13
+ assert len(cfg.architectures) == 1
14
+ setattr(cfg.__class__, "model_type", "llava")
15
+ cfg.architectures[0] = 'LlavaLlamaForCausalLM'
16
+ cfg.save_pretrained(config)
17
+ print("Checkpoint upgraded.")
18
+ else:
19
+ print("Checkpoint upgrade aborted.")
20
+ exit(1)
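Note on the trigger condition in `auto_upgrade`: the check is a case-sensitive substring test on the checkpoint path, combined with the `model_type` read from the config. A small sketch of that predicate follows, assuming a hypothetical helper `needs_upgrade` that takes the path and model type as plain strings.

```python
def needs_upgrade(path, model_type):
    """True when the path looks like LLaVA but the config does not."""
    return 'llava' in path and 'llava' not in model_type


print(needs_upgrade('ckpt/llava-7b-v0', 'llama'))   # old-style checkpoint
print(needs_upgrade('ckpt/LLaVA-7b-v0', 'llama'))   # capitalised path
```

Because the substring test is case-sensitive, a directory named with capital letters would skip the upgrade prompt even if it holds an old-style checkpoint.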
LLaVA/llava/serve/__init__.py ADDED
File without changes
LLaVA/llava/serve/cli.py ADDED
@@ -0,0 +1,126 @@
1
+ import argparse
2
+ import torch
3
+
4
+ from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
5
+ from llava.conversation import conv_templates, SeparatorStyle
6
+ from llava.model.builder import load_pretrained_model
7
+ from llava.utils import disable_torch_init
8
+ from llava.mm_utils import process_images, tokenizer_image_token, get_model_name_from_path
9
+
10
+ from PIL import Image
11
+
12
+ import requests
14
+ from io import BytesIO
15
+ from transformers import TextStreamer
16
+
17
+
18
+ def load_image(image_file):
19
+ if image_file.startswith('http://') or image_file.startswith('https://'):
20
+ response = requests.get(image_file)
21
+ image = Image.open(BytesIO(response.content)).convert('RGB')
22
+ else:
23
+ image = Image.open(image_file).convert('RGB')
24
+ return image
25
+
26
+
27
+ def main(args):
28
+ # Model
29
+ disable_torch_init()
30
+
31
+ model_name = get_model_name_from_path(args.model_path)
32
+ tokenizer, model, image_processor, context_len = load_pretrained_model(args.model_path, args.model_base, model_name, args.load_8bit, args.load_4bit, device=args.device)
33
+
34
+ if "llama-2" in model_name.lower():
35
+ conv_mode = "llava_llama_2"
36
+ elif "mistral" in model_name.lower():
37
+ conv_mode = "mistral_instruct"
38
+ elif "v1.6-34b" in model_name.lower():
39
+ conv_mode = "chatml_direct"
40
+ elif "v1" in model_name.lower():
41
+ conv_mode = "llava_v1"
42
+ elif "mpt" in model_name.lower():
43
+ conv_mode = "mpt"
44
+ else:
45
+ conv_mode = "llava_v0"
46
+
47
+ if args.conv_mode is not None and conv_mode != args.conv_mode:
48
+ print('[WARNING] the auto inferred conversation mode is {}, while `--conv-mode` is {}, using {}'.format(conv_mode, args.conv_mode, args.conv_mode))
49
+ else:
50
+ args.conv_mode = conv_mode
51
+
52
+ conv = conv_templates[args.conv_mode].copy()
53
+ if "mpt" in model_name.lower():
54
+ roles = ('user', 'assistant')
55
+ else:
56
+ roles = conv.roles
57
+
58
+ image = load_image(args.image_file)
59
+ image_size = image.size
60
+ # Similar operation in model_worker.py
61
+ image_tensor = process_images([image], image_processor, model.config)
62
+ if type(image_tensor) is list:
63
+ image_tensor = [image.to(model.device, dtype=torch.float16) for image in image_tensor]
64
+ else:
65
+ image_tensor = image_tensor.to(model.device, dtype=torch.float16)
66
+
67
+ while True:
68
+ try:
69
+ inp = input(f"{roles[0]}: ")
70
+ except EOFError:
71
+ inp = ""
72
+ if not inp:
73
+ print("exit...")
74
+ break
75
+
76
+ print(f"{roles[1]}: ", end="")
77
+
78
+ if image is not None:
79
+ # first message
80
+ if model.config.mm_use_im_start_end:
81
+ inp = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + '\n' + inp
82
+ else:
83
+ inp = DEFAULT_IMAGE_TOKEN + '\n' + inp
84
+ image = None
85
+
86
+ conv.append_message(conv.roles[0], inp)
87
+ conv.append_message(conv.roles[1], None)
88
+ prompt = conv.get_prompt()
89
+
90
+ input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).to(model.device)
91
+ stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
92
+ keywords = [stop_str]
93
+ streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
94
+
95
+ with torch.inference_mode():
96
+ output_ids = model.generate(
97
+ input_ids,
98
+ images=image_tensor,
99
+ image_sizes=[image_size],
100
+ do_sample=args.temperature > 0,
101
+ temperature=args.temperature,
102
+ max_new_tokens=args.max_new_tokens,
103
+ streamer=streamer,
104
+ use_cache=True)
105
+
106
+ outputs = tokenizer.decode(output_ids[0]).strip()
107
+ conv.messages[-1][-1] = outputs
108
+
109
+ if args.debug:
110
+ print("\n", {"prompt": prompt, "outputs": outputs}, "\n")
111
+
112
+
113
+ if __name__ == "__main__":
114
+ parser = argparse.ArgumentParser()
115
+ parser.add_argument("--model-path", type=str, default="facebook/opt-350m")
116
+ parser.add_argument("--model-base", type=str, default=None)
117
+ parser.add_argument("--image-file", type=str, required=True)
118
+ parser.add_argument("--device", type=str, default="cuda")
119
+ parser.add_argument("--conv-mode", type=str, default=None)
120
+ parser.add_argument("--temperature", type=float, default=0.2)
121
+ parser.add_argument("--max-new-tokens", type=int, default=512)
122
+ parser.add_argument("--load-8bit", action="store_true")
123
+ parser.add_argument("--load-4bit", action="store_true")
124
+ parser.add_argument("--debug", action="store_true")
125
+ args = parser.parse_args()
126
+ main(args)
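
Reviewer note on `cli.py`: the conversation-template lookup in `main` depends on the order of its substring checks, so it is easier to reason about as a pure function. The following is a minimal sketch that mirrors only the branches shown above; the function name `infer_conv_mode` is hypothetical and is not part of the LLaVA code base.

```python
def infer_conv_mode(model_name: str) -> str:
    """Map a model name to a conversation template key.

    Hypothetical helper: mirrors the if/elif chain in cli.py's main().
    """
    name = model_name.lower()
    if "llama-2" in name:
        return "llava_llama_2"
    if "mistral" in name:
        return "mistral_instruct"
    # Must be tested before the generic "v1" check, because
    # "v1.6-34b" also contains the substring "v1".
    if "v1.6-34b" in name:
        return "chatml_direct"
    if "v1" in name:
        return "llava_v1"
    if "mpt" in name:
        return "mpt"
    return "llava_v0"
```

Because the checks are plain substring matches, any name containing `v1` (including `v1.5` and `v1.6`) falls into the `llava_v1` branch unless an earlier branch catches it first, which is why the 34B case has to come before it.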
LLaVA/llava/serve/controller.py ADDED
@@ -0,0 +1,298 @@
1
+ """
2
+ A controller manages distributed workers.
3
+ It sends worker addresses to clients.
4
+ """
5
+ import argparse
6
+ import asyncio
7
+ import dataclasses
8
+ from enum import Enum, auto
9
+ import json
10
+ import logging
11
+ import time
12
+ from typing import List, Union
13
+ import threading
14
+
15
+ from fastapi import FastAPI, Request
16
+ from fastapi.responses import StreamingResponse
17
+ import numpy as np
18
+ import requests
19
+ import uvicorn
20
+
21
+ from llava.constants import CONTROLLER_HEART_BEAT_EXPIRATION
22
+ from llava.utils import build_logger, server_error_msg
23
+
24
+
25
+ logger = build_logger("controller", "controller.log")
26
+
27
+
28
+ class DispatchMethod(Enum):
29
+ LOTTERY = auto()
30
+ SHORTEST_QUEUE = auto()
31
+
32
+ @classmethod
33
+ def from_str(cls, name):
34
+ if name == "lottery":
35
+ return cls.LOTTERY
36
+ elif name == "shortest_queue":
37
+ return cls.SHORTEST_QUEUE
38
+ else:
39
+ raise ValueError(f"Invalid dispatch method: {name}")
40
+
41
+
42
+ @dataclasses.dataclass
43
+ class WorkerInfo:
44
+ model_names: List[str]
45
+ speed: int
46
+ queue_length: int
47
+ check_heart_beat: bool
48
+ last_heart_beat: float
49
+
50
+
51
+ def heart_beat_controller(controller):
52
+ while True:
53
+ time.sleep(CONTROLLER_HEART_BEAT_EXPIRATION)
54
+ controller.remove_stable_workers_by_expiration()
55
+
56
+
57
+ class Controller:
58
+ def __init__(self, dispatch_method: str):
59
+ # Dict[str -> WorkerInfo]
60
+ self.worker_info = {}
61
+ self.dispatch_method = DispatchMethod.from_str(dispatch_method)
62
+
63
+ self.heart_beat_thread = threading.Thread(
64
+ target=heart_beat_controller, args=(self,), daemon=True)
65
+ self.heart_beat_thread.start()
66
+
67
+ logger.info("Init controller")
68
+
69
+ def register_worker(self, worker_name: str, check_heart_beat: bool,
70
+ worker_status: dict):
71
+ if worker_name not in self.worker_info:
72
+ logger.info(f"Register a new worker: {worker_name}")
73
+ else:
74
+ logger.info(f"Register an existing worker: {worker_name}")
75
+
76
+ if not worker_status:
77
+ worker_status = self.get_worker_status(worker_name)
78
+ if not worker_status:
79
+ return False
80
+
81
+ self.worker_info[worker_name] = WorkerInfo(
82
+ worker_status["model_names"], worker_status["speed"], worker_status["queue_length"],
83
+ check_heart_beat, time.time())
84
+
85
+ logger.info(f"Register done: {worker_name}, {worker_status}")
86
+ return True
87
+
88
+ def get_worker_status(self, worker_name: str):
89
+ try:
90
+ r = requests.post(worker_name + "/worker_get_status", timeout=5)
91
+ except requests.exceptions.RequestException as e:
92
+ logger.error(f"Get status fails: {worker_name}, {e}")
93
+ return None
94
+
95
+ if r.status_code != 200:
96
+ logger.error(f"Get status fails: {worker_name}, {r}")
97
+ return None
98
+
99
+ return r.json()
100
+
101
+ def remove_worker(self, worker_name: str):
102
+ del self.worker_info[worker_name]
103
+
104
+ def refresh_all_workers(self):
105
+ old_info = dict(self.worker_info)
106
+ self.worker_info = {}
107
+
108
+ for w_name, w_info in old_info.items():
109
+ if not self.register_worker(w_name, w_info.check_heart_beat, None):
110
+ logger.info(f"Remove stale worker: {w_name}")
111
+
112
+ def list_models(self):
113
+ model_names = set()
114
+
115
+ for w_name, w_info in self.worker_info.items():
116
+ model_names.update(w_info.model_names)
117
+
118
+ return list(model_names)
119
+
120
+ def get_worker_address(self, model_name: str):
121
+ if self.dispatch_method == DispatchMethod.LOTTERY:
122
+ worker_names = []
123
+ worker_speeds = []
124
+ for w_name, w_info in self.worker_info.items():
125
+ if model_name in w_info.model_names:
126
+ worker_names.append(w_name)
127
+ worker_speeds.append(w_info.speed)
128
+ worker_speeds = np.array(worker_speeds, dtype=np.float32)
129
+ norm = np.sum(worker_speeds)
130
+ if norm < 1e-4:
131
+ return ""
132
+ worker_speeds = worker_speeds / norm
133
+ if True: # Directly return address
134
+ pt = np.random.choice(np.arange(len(worker_names)),
135
+ p=worker_speeds)
136
+ worker_name = worker_names[pt]
137
+ return worker_name
138
+
139
+ # Check status before returning
140
+ while True:
141
+ pt = np.random.choice(np.arange(len(worker_names)),
142
+ p=worker_speeds)
143
+ worker_name = worker_names[pt]
144
+
145
+ if self.get_worker_status(worker_name):
146
+ break
147
+ else:
148
+ self.remove_worker(worker_name)
149
+ worker_speeds[pt] = 0
150
+ norm = np.sum(worker_speeds)
151
+ if norm < 1e-4:
152
+ return ""
153
+ worker_speeds = worker_speeds / norm
154
+ continue
155
+ return worker_name
156
+ elif self.dispatch_method == DispatchMethod.SHORTEST_QUEUE:
157
+ worker_names = []
158
+ worker_qlen = []
159
+ for w_name, w_info in self.worker_info.items():
160
+ if model_name in w_info.model_names:
161
+ worker_names.append(w_name)
162
+ worker_qlen.append(w_info.queue_length / w_info.speed)
163
+ if len(worker_names) == 0:
164
+ return ""
165
+ min_index = np.argmin(worker_qlen)
166
+ w_name = worker_names[min_index]
167
+ self.worker_info[w_name].queue_length += 1
168
+ logger.info(f"names: {worker_names}, queue_lens: {worker_qlen}, ret: {w_name}")
169
+ return w_name
170
+ else:
171
+ raise ValueError(f"Invalid dispatch method: {self.dispatch_method}")
172
+
173
+ def receive_heart_beat(self, worker_name: str, queue_length: int):
174
+ if worker_name not in self.worker_info:
175
+ logger.info(f"Receive unknown heart beat. {worker_name}")
176
+ return False
177
+
178
+ self.worker_info[worker_name].queue_length = queue_length
179
+ self.worker_info[worker_name].last_heart_beat = time.time()
180
+ logger.info(f"Receive heart beat. {worker_name}")
181
+ return True
182
+
183
+ def remove_stable_workers_by_expiration(self):
184
+ expire = time.time() - CONTROLLER_HEART_BEAT_EXPIRATION
185
+ to_delete = []
186
+ for worker_name, w_info in self.worker_info.items():
187
+ if w_info.check_heart_beat and w_info.last_heart_beat < expire:
188
+ to_delete.append(worker_name)
189
+
190
+ for worker_name in to_delete:
191
+ self.remove_worker(worker_name)
192
+
193
+ def worker_api_generate_stream(self, params):
194
+ worker_addr = self.get_worker_address(params["model"])
195
+ if not worker_addr:
196
+ logger.info(f"no worker: {params['model']}")
197
+ ret = {
198
+ "text": server_error_msg,
199
+ "error_code": 2,
200
+ }
201
+ yield json.dumps(ret).encode() + b"\0"
+ return
202
+
203
+ try:
204
+ response = requests.post(worker_addr + "/worker_generate_stream",
205
+ json=params, stream=True, timeout=5)
206
+ for chunk in response.iter_lines(decode_unicode=False, delimiter=b"\0"):
207
+ if chunk:
208
+ yield chunk + b"\0"
209
+ except requests.exceptions.RequestException as e:
210
+ logger.info(f"worker timeout: {worker_addr}")
211
+ ret = {
212
+ "text": server_error_msg,
213
+ "error_code": 3,
214
+ }
215
+ yield json.dumps(ret).encode() + b"\0"
216
+
217
+
218
+ # Let the controller act as a worker to achieve hierarchical
219
+ # management. This can be used to connect isolated sub networks.
220
+ def worker_api_get_status(self):
221
+ model_names = set()
222
+ speed = 0
223
+ queue_length = 0
224
+
225
+ for w_name in self.worker_info:
226
+ worker_status = self.get_worker_status(w_name)
227
+ if worker_status is not None:
228
+ model_names.update(worker_status["model_names"])
229
+ speed += worker_status["speed"]
230
+ queue_length += worker_status["queue_length"]
231
+
232
+ return {
233
+ "model_names": list(model_names),
234
+ "speed": speed,
235
+ "queue_length": queue_length,
236
+ }
237
+
238
+
239
+ app = FastAPI()
240
+
241
+
242
+ @app.post("/register_worker")
243
+ async def register_worker(request: Request):
244
+ data = await request.json()
245
+ controller.register_worker(
246
+ data["worker_name"], data["check_heart_beat"],
247
+ data.get("worker_status", None))
248
+
249
+
250
+ @app.post("/refresh_all_workers")
251
+ async def refresh_all_workers():
252
+ controller.refresh_all_workers()
253
+
254
+
255
+ @app.post("/list_models")
256
+ async def list_models():
257
+ models = controller.list_models()
258
+ return {"models": models}
259
+
260
+
261
+ @app.post("/get_worker_address")
262
+ async def get_worker_address(request: Request):
263
+ data = await request.json()
264
+ addr = controller.get_worker_address(data["model"])
265
+ return {"address": addr}
266
+
267
+
268
+ @app.post("/receive_heart_beat")
269
+ async def receive_heart_beat(request: Request):
270
+ data = await request.json()
271
+ exist = controller.receive_heart_beat(
272
+ data["worker_name"], data["queue_length"])
273
+ return {"exist": exist}
274
+
275
+
276
+ @app.post("/worker_generate_stream")
277
+ async def worker_api_generate_stream(request: Request):
278
+ params = await request.json()
279
+ generator = controller.worker_api_generate_stream(params)
280
+ return StreamingResponse(generator)
281
+
282
+
283
+ @app.post("/worker_get_status")
284
+ async def worker_api_get_status(request: Request):
285
+ return controller.worker_api_get_status()
286
+
287
+
288
+ if __name__ == "__main__":
289
+ parser = argparse.ArgumentParser()
290
+ parser.add_argument("--host", type=str, default="localhost")
291
+ parser.add_argument("--port", type=int, default=21001)
292
+ parser.add_argument("--dispatch-method", type=str, choices=[
293
+ "lottery", "shortest_queue"], default="shortest_queue")
294
+ args = parser.parse_args()
295
+ logger.info(f"args: {args}")
296
+
297
+ controller = Controller(args.dispatch_method)
298
+ uvicorn.run(app, host=args.host, port=args.port, log_level="info")
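
Reviewer note on `controller.py`: the two dispatch policies in `get_worker_address` can be illustrated in isolation. The sketch below is not the controller's implementation. It assumes a simplified `Worker` record, replaces `numpy.random.choice` with the standard library's `random.choices`, and skips workers whose speed is zero in the shortest-queue path (the code above divides by `speed` without that guard). The names `Worker`, `pick_lottery`, and `pick_shortest_queue` are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Worker:
    # Simplified stand-in for WorkerInfo (heartbeat fields omitted).
    model_names: List[str]
    speed: float
    queue_length: int


def pick_lottery(workers: Dict[str, Worker], model: str,
                 rng: Optional[random.Random] = None) -> str:
    """Pick a worker at random, weighted by its reported speed."""
    rng = rng or random
    names = [n for n, w in workers.items() if model in w.model_names]
    weights = [workers[n].speed for n in names]
    if sum(weights) < 1e-4:  # no candidates, or all speeds are zero
        return ""
    return rng.choices(names, weights=weights, k=1)[0]


def pick_shortest_queue(workers: Dict[str, Worker], model: str) -> str:
    """Pick the worker with the smallest queue length relative to speed."""
    candidates = {n: w for n, w in workers.items()
                  if model in w.model_names and w.speed > 0}
    if not candidates:
        return ""
    name = min(candidates,
               key=lambda n: candidates[n].queue_length / candidates[n].speed)
    # Count the new request right away so that back-to-back calls
    # made between heartbeats do not all land on the same worker.
    workers[name].queue_length += 1
    return name
```

Both functions return an empty string when no worker serves the requested model, matching the convention the web server checks for with `worker_addr == ""`.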
LLaVA/llava/serve/gradio_web_server.py ADDED
@@ -0,0 +1,479 @@
1
+ import argparse
2
+ import datetime
3
+ import json
4
+ import os
5
+ import time
6
+
7
+ import gradio as gr
8
+ import requests
9
+
10
+ from llava.conversation import (default_conversation, conv_templates,
11
+ SeparatorStyle)
12
+ from llava.constants import LOGDIR
13
+ from llava.utils import (build_logger, server_error_msg,
14
+ violates_moderation, moderation_msg)
15
+ import hashlib
16
+
17
+
18
+ logger = build_logger("gradio_web_server", "gradio_web_server.log")
19
+
20
+ headers = {"User-Agent": "LLaVA Client"}
21
+
22
+ no_change_btn = gr.Button()
23
+ enable_btn = gr.Button(interactive=True)
24
+ disable_btn = gr.Button(interactive=False)
25
+
26
+ priority = {
27
+ "vicuna-13b": "aaaaaaa",
28
+ "koala-13b": "aaaaaab",
29
+ }
30
+
31
+
32
+ def get_conv_log_filename():
33
+ t = datetime.datetime.now()
34
+ name = os.path.join(LOGDIR, f"{t.year}-{t.month:02d}-{t.day:02d}-conv.json")
35
+ return name
36
+
37
+
38
+ def get_model_list():
39
+ ret = requests.post(args.controller_url + "/refresh_all_workers")
40
+ assert ret.status_code == 200
41
+ ret = requests.post(args.controller_url + "/list_models")
42
+ models = ret.json()["models"]
43
+ models.sort(key=lambda x: priority.get(x, x))
44
+ logger.info(f"Models: {models}")
45
+ return models
46
+
47
+
48
+ get_window_url_params = """
49
+ function() {
50
+ const params = new URLSearchParams(window.location.search);
51
+ url_params = Object.fromEntries(params);
52
+ console.log(url_params);
53
+ return url_params;
54
+ }
55
+ """
56
+
57
+
58
+ def load_demo(url_params, request: gr.Request):
59
+ logger.info(f"load_demo. ip: {request.client.host}. params: {url_params}")
60
+
61
+ dropdown_update = gr.Dropdown(visible=True)
62
+ if "model" in url_params:
63
+ model = url_params["model"]
64
+ if model in models:
65
+ dropdown_update = gr.Dropdown(value=model, visible=True)
66
+
67
+ state = default_conversation.copy()
68
+ return state, dropdown_update
69
+
70
+
71
+ def load_demo_refresh_model_list(request: gr.Request):
72
+ logger.info(f"load_demo. ip: {request.client.host}")
73
+ models = get_model_list()
74
+ state = default_conversation.copy()
75
+ dropdown_update = gr.Dropdown(
76
+ choices=models,
77
+ value=models[0] if len(models) > 0 else ""
78
+ )
79
+ return state, dropdown_update
80
+
81
+
82
+ def vote_last_response(state, vote_type, model_selector, request: gr.Request):
83
+ with open(get_conv_log_filename(), "a") as fout:
84
+ data = {
85
+ "tstamp": round(time.time(), 4),
86
+ "type": vote_type,
87
+ "model": model_selector,
88
+ "state": state.dict(),
89
+ "ip": request.client.host,
90
+ }
91
+ fout.write(json.dumps(data) + "\n")
92
+
93
+
94
+ def upvote_last_response(state, model_selector, request: gr.Request):
95
+ logger.info(f"upvote. ip: {request.client.host}")
96
+ vote_last_response(state, "upvote", model_selector, request)
97
+ return ("",) + (disable_btn,) * 3
98
+
99
+
100
+ def downvote_last_response(state, model_selector, request: gr.Request):
101
+ logger.info(f"downvote. ip: {request.client.host}")
102
+ vote_last_response(state, "downvote", model_selector, request)
103
+ return ("",) + (disable_btn,) * 3
104
+
105
+
106
+ def flag_last_response(state, model_selector, request: gr.Request):
107
+ logger.info(f"flag. ip: {request.client.host}")
108
+ vote_last_response(state, "flag", model_selector, request)
109
+ return ("",) + (disable_btn,) * 3
110
+
111
+
112
+ def regenerate(state, image_process_mode, request: gr.Request):
113
+ logger.info(f"regenerate. ip: {request.client.host}")
114
+ state.messages[-1][-1] = None
115
+ prev_human_msg = state.messages[-2]
116
+ if type(prev_human_msg[1]) in (tuple, list):
117
+ prev_human_msg[1] = (*prev_human_msg[1][:2], image_process_mode)
118
+ state.skip_next = False
119
+ return (state, state.to_gradio_chatbot(), "", None) + (disable_btn,) * 5
120
+
121
+
122
+ def clear_history(request: gr.Request):
123
+ logger.info(f"clear_history. ip: {request.client.host}")
124
+ state = default_conversation.copy()
125
+ return (state, state.to_gradio_chatbot(), "", None) + (disable_btn,) * 5
126
+
127
+
128
+ def add_text(state, text, image, image_process_mode, request: gr.Request):
129
+ logger.info(f"add_text. ip: {request.client.host}. len: {len(text)}")
130
+ if len(text) <= 0 and image is None:
131
+ state.skip_next = True
132
+ return (state, state.to_gradio_chatbot(), "", None) + (no_change_btn,) * 5
133
+ if args.moderate:
134
+ flagged = violates_moderation(text)
135
+ if flagged:
136
+ state.skip_next = True
137
+ return (state, state.to_gradio_chatbot(), moderation_msg, None) + (
138
+ no_change_btn,) * 5
139
+
140
+ text = text[:1536] # Hard cut-off
141
+ if image is not None:
142
+ text = text[:1200] # Hard cut-off for images
143
+ if '<image>' not in text:
144
+ # text = '<Image><image></Image>' + text
145
+ text = text + '\n<image>'
146
+ text = (text, image, image_process_mode)
147
+ state = default_conversation.copy()
148
+ state.append_message(state.roles[0], text)
149
+ state.append_message(state.roles[1], None)
150
+ state.skip_next = False
151
+ return (state, state.to_gradio_chatbot(), "", None) + (disable_btn,) * 5
152
+
153
+
154
+ def http_bot(state, model_selector, temperature, top_p, max_new_tokens, request: gr.Request):
155
+ logger.info(f"http_bot. ip: {request.client.host}")
156
+ start_tstamp = time.time()
157
+ model_name = model_selector
158
+
159
+ if state.skip_next:
160
+ # This generate call is skipped due to invalid inputs
161
+ yield (state, state.to_gradio_chatbot()) + (no_change_btn,) * 5
162
+ return
163
+
164
+ if len(state.messages) == state.offset + 2:
165
+ # First round of conversation
166
+ if "llava" in model_name.lower():
167
+ if 'llama-2' in model_name.lower():
168
+ template_name = "llava_llama_2"
169
+ elif "mistral" in model_name.lower() or "mixtral" in model_name.lower():
170
+ if 'orca' in model_name.lower():
171
+ template_name = "mistral_orca"
172
+ elif 'hermes' in model_name.lower():
173
+ template_name = "chatml_direct"
174
+ else:
175
+ template_name = "mistral_instruct"
176
+ elif 'llava-v1.6-34b' in model_name.lower():
177
+ template_name = "chatml_direct"
178
+ elif "v1" in model_name.lower():
179
+ if 'mmtag' in model_name.lower():
180
+ template_name = "v1_mmtag"
181
+ elif 'plain' in model_name.lower() and 'finetune' not in model_name.lower():
182
+ template_name = "v1_mmtag"
183
+ else:
184
+ template_name = "llava_v1"
185
+ elif "mpt" in model_name.lower():
186
+ template_name = "mpt"
187
+ else:
188
+ if 'mmtag' in model_name.lower():
189
+ template_name = "v0_mmtag"
190
+ elif 'plain' in model_name.lower() and 'finetune' not in model_name.lower():
191
+ template_name = "v0_mmtag"
192
+ else:
193
+ template_name = "llava_v0"
194
+ elif "mpt" in model_name:
195
+ template_name = "mpt_text"
196
+ elif "llama-2" in model_name:
197
+ template_name = "llama_2"
198
+ else:
199
+ template_name = "vicuna_v1"
200
+ new_state = conv_templates[template_name].copy()
201
+ new_state.append_message(new_state.roles[0], state.messages[-2][1])
202
+ new_state.append_message(new_state.roles[1], None)
203
+ state = new_state
204
+
205
+ # Query worker address
206
+ controller_url = args.controller_url
207
+ ret = requests.post(controller_url + "/get_worker_address",
208
+ json={"model": model_name})
209
+ worker_addr = ret.json()["address"]
210
+ logger.info(f"model_name: {model_name}, worker_addr: {worker_addr}")
211
+
212
+ # No available worker
213
+ if worker_addr == "":
214
+ state.messages[-1][-1] = server_error_msg
215
+ yield (state, state.to_gradio_chatbot(), disable_btn, disable_btn, disable_btn, enable_btn, enable_btn)
216
+ return
217
+
218
+ # Construct prompt
219
+ prompt = state.get_prompt()
220
+
221
+ all_images = state.get_images(return_pil=True)
222
+ all_image_hash = [hashlib.md5(image.tobytes()).hexdigest() for image in all_images]
223
+ for image, image_hash in zip(all_images, all_image_hash):
224
+ t = datetime.datetime.now()
225
+ filename = os.path.join(LOGDIR, "serve_images", f"{t.year}-{t.month:02d}-{t.day:02d}", f"{image_hash}.jpg")
226
+ if not os.path.isfile(filename):
227
+ os.makedirs(os.path.dirname(filename), exist_ok=True)
228
+ image.save(filename)
229
+
230
+ # Make requests
231
+ pload = {
232
+ "model": model_name,
233
+ "prompt": prompt,
234
+ "temperature": float(temperature),
235
+ "top_p": float(top_p),
236
+ "max_new_tokens": min(int(max_new_tokens), 1536),
237
+ "stop": state.sep if state.sep_style in [SeparatorStyle.SINGLE, SeparatorStyle.MPT] else state.sep2,
238
+ "images": f'List of {len(state.get_images())} images: {all_image_hash}',
239
+ }
240
+ logger.info(f"==== request ====\n{pload}")
241
+
242
+ pload['images'] = state.get_images()
243
+
244
+ state.messages[-1][-1] = "▌"
245
+ yield (state, state.to_gradio_chatbot()) + (disable_btn,) * 5
246
+
247
+ try:
248
+ # Stream output
249
+ response = requests.post(worker_addr + "/worker_generate_stream",
250
+ headers=headers, json=pload, stream=True, timeout=10)
251
+ for chunk in response.iter_lines(decode_unicode=False, delimiter=b"\0"):
252
+ if chunk:
253
+ data = json.loads(chunk.decode())
254
+ if data["error_code"] == 0:
255
+ output = data["text"][len(prompt):].strip()
256
+ state.messages[-1][-1] = output + "▌"
257
+ yield (state, state.to_gradio_chatbot()) + (disable_btn,) * 5
258
+ else:
259
+ output = data["text"] + f" (error_code: {data['error_code']})"
260
+ state.messages[-1][-1] = output
261
+ yield (state, state.to_gradio_chatbot()) + (disable_btn, disable_btn, disable_btn, enable_btn, enable_btn)
262
+ return
263
+ time.sleep(0.03)
264
+ except requests.exceptions.RequestException as e:
265
+ state.messages[-1][-1] = server_error_msg
266
+ yield (state, state.to_gradio_chatbot()) + (disable_btn, disable_btn, disable_btn, enable_btn, enable_btn)
267
+ return
268
+
269
+ state.messages[-1][-1] = state.messages[-1][-1][:-1]
270
+ yield (state, state.to_gradio_chatbot()) + (enable_btn,) * 5
271
+
272
+ finish_tstamp = time.time()
273
+ logger.info(f"{state.messages[-1][-1]}")
274
+
275
+ with open(get_conv_log_filename(), "a") as fout:
276
+ data = {
277
+ "tstamp": round(finish_tstamp, 4),
278
+ "type": "chat",
279
+ "model": model_name,
280
+ "start": round(start_tstamp, 4),
281
+ "finish": round(finish_tstamp, 4),
282
+ "state": state.dict(),
283
+ "images": all_image_hash,
284
+ "ip": request.client.host,
285
+ }
286
+ fout.write(json.dumps(data) + "\n")
287
+
288
+ title_markdown = ("""
289
+ # 🌋 LLaVA: Large Language and Vision Assistant
290
+ [[Project Page](https://llava-vl.github.io)] [[Code](https://github.com/haotian-liu/LLaVA)] [[Model](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md)] | 📚 [[LLaVA](https://arxiv.org/abs/2304.08485)] [[LLaVA-v1.5](https://arxiv.org/abs/2310.03744)] [[LLaVA-v1.6](https://llava-vl.github.io/blog/2024-01-30-llava-1-6/)]
291
+ """)
292
+
293
+ tos_markdown = ("""
294
+ ### Terms of use
295
+ By using this service, users are required to agree to the following terms:
296
+ The service is a research preview intended for non-commercial use only. It only provides limited safety measures and may generate offensive content. It must not be used for any illegal, harmful, violent, racist, or sexual purposes. The service may collect user dialogue data for future research.
297
+ Please click the "Flag" button if you get any inappropriate answer! We will collect those to keep improving our moderator.
298
+ For an optimal experience, please use desktop computers for this demo, as mobile devices may compromise its quality.
299
+ """)
300
+
301
+
302
+ learn_more_markdown = ("""
303
+ ### License
304
+ The service is a research preview intended for non-commercial use only, subject to the model [License](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md) of LLaMA, [Terms of Use](https://openai.com/policies/terms-of-use) of the data generated by OpenAI, and [Privacy Practices](https://chrome.google.com/webstore/detail/sharegpt-share-your-chatg/daiacboceoaocpibfodeljbdfacokfjb) of ShareGPT. Please contact us if you find any potential violation.
305
+ """)
306
+
307
+ block_css = """
308
+
309
+ #buttons button {
310
+ min-width: min(120px,100%);
311
+ }
312
+
313
+ """
314
+
315
+ def build_demo(embed_mode, cur_dir=None, concurrency_count=10):
316
+ textbox = gr.Textbox(show_label=False, placeholder="Enter text and press ENTER", container=False)
317
+ with gr.Blocks(title="LLaVA", theme=gr.themes.Default(), css=block_css) as demo:
318
+ state = gr.State()
319
+
320
+ if not embed_mode:
321
+ gr.Markdown(title_markdown)
322
+
323
+ with gr.Row():
324
+ with gr.Column(scale=3):
325
+ with gr.Row(elem_id="model_selector_row"):
326
+ model_selector = gr.Dropdown(
327
+ choices=models,
328
+ value=models[0] if len(models) > 0 else "",
329
+ interactive=True,
330
+ show_label=False,
331
+ container=False)
332
+
333
+ imagebox = gr.Image(type="pil")
334
+ image_process_mode = gr.Radio(
335
+ ["Crop", "Resize", "Pad", "Default"],
336
+ value="Default",
337
+ label="Preprocess for non-square image", visible=False)
338
+
339
+ if cur_dir is None:
340
+ cur_dir = os.path.dirname(os.path.abspath(__file__))
341
+ gr.Examples(examples=[
342
+ [f"{cur_dir}/examples/extreme_ironing.jpg", "What is unusual about this image?"],
343
+ [f"{cur_dir}/examples/waterview.jpg", "What are the things I should be cautious about when I visit here?"],
344
+ ], inputs=[imagebox, textbox])
345
+
346
+ with gr.Accordion("Parameters", open=False) as parameter_row:
347
+ temperature = gr.Slider(minimum=0.0, maximum=1.0, value=0.2, step=0.1, interactive=True, label="Temperature",)
348
+ top_p = gr.Slider(minimum=0.0, maximum=1.0, value=0.7, step=0.1, interactive=True, label="Top P",)
349
+ max_output_tokens = gr.Slider(minimum=0, maximum=1024, value=512, step=64, interactive=True, label="Max output tokens",)
350
+
351
+ with gr.Column(scale=8):
352
+ chatbot = gr.Chatbot(
353
+ elem_id="chatbot",
354
+ label="LLaVA Chatbot",
355
+ height=650,
356
+ layout="panel",
357
+ )
358
+ with gr.Row():
359
+ with gr.Column(scale=8):
360
+ textbox.render()
361
+ with gr.Column(scale=1, min_width=50):
362
+ submit_btn = gr.Button(value="Send", variant="primary")
363
+ with gr.Row(elem_id="buttons") as button_row:
364
+ upvote_btn = gr.Button(value="👍 Upvote", interactive=False)
365
+ downvote_btn = gr.Button(value="👎 Downvote", interactive=False)
366
+ flag_btn = gr.Button(value="⚠️ Flag", interactive=False)
367
+ #stop_btn = gr.Button(value="⏹️ Stop Generation", interactive=False)
368
+ regenerate_btn = gr.Button(value="🔄 Regenerate", interactive=False)
369
+ clear_btn = gr.Button(value="🗑️ Clear", interactive=False)
370
+
371
+ if not embed_mode:
372
+ gr.Markdown(tos_markdown)
373
+ gr.Markdown(learn_more_markdown)
374
+ url_params = gr.JSON(visible=False)
375
+
376
+ # Register listeners
377
+ btn_list = [upvote_btn, downvote_btn, flag_btn, regenerate_btn, clear_btn]
378
+ upvote_btn.click(
379
+ upvote_last_response,
380
+ [state, model_selector],
381
+ [textbox, upvote_btn, downvote_btn, flag_btn]
382
+ )
383
+ downvote_btn.click(
384
+ downvote_last_response,
385
+ [state, model_selector],
386
+ [textbox, upvote_btn, downvote_btn, flag_btn]
387
+ )
388
+ flag_btn.click(
389
+ flag_last_response,
390
+ [state, model_selector],
391
+ [textbox, upvote_btn, downvote_btn, flag_btn]
392
+ )
393
+
394
+ regenerate_btn.click(
395
+ regenerate,
396
+ [state, image_process_mode],
397
+ [state, chatbot, textbox, imagebox] + btn_list
398
+ ).then(
399
+ http_bot,
400
+ [state, model_selector, temperature, top_p, max_output_tokens],
401
+ [state, chatbot] + btn_list,
402
+ concurrency_limit=concurrency_count
403
+ )
404
+
405
+ clear_btn.click(
406
+ clear_history,
407
+ None,
408
+ [state, chatbot, textbox, imagebox] + btn_list,
409
+ queue=False
410
+ )
411
+
412
+ textbox.submit(
413
+ add_text,
414
+ [state, textbox, imagebox, image_process_mode],
415
+ [state, chatbot, textbox, imagebox] + btn_list,
416
+ queue=False
417
+ ).then(
418
+ http_bot,
419
+ [state, model_selector, temperature, top_p, max_output_tokens],
420
+ [state, chatbot] + btn_list,
421
+ concurrency_limit=concurrency_count
422
+ )
423
+
424
+ submit_btn.click(
425
+ add_text,
426
+ [state, textbox, imagebox, image_process_mode],
427
+ [state, chatbot, textbox, imagebox] + btn_list
428
+ ).then(
429
+ http_bot,
430
+ [state, model_selector, temperature, top_p, max_output_tokens],
431
+ [state, chatbot] + btn_list,
432
+ concurrency_limit=concurrency_count
433
+ )
434
+
435
+ if args.model_list_mode == "once":
436
+ demo.load(
437
+ load_demo,
438
+ [url_params],
439
+ [state, model_selector],
440
+ js=get_window_url_params
441
+ )
442
+ elif args.model_list_mode == "reload":
443
+ demo.load(
444
+ load_demo_refresh_model_list,
445
+ None,
446
+ [state, model_selector],
447
+ queue=False
448
+ )
449
+ else:
450
+ raise ValueError(f"Unknown model list mode: {args.model_list_mode}")
451
+
452
+ return demo
453
+
454
+
455
+ if __name__ == "__main__":
456
+ parser = argparse.ArgumentParser()
457
+ parser.add_argument("--host", type=str, default="0.0.0.0")
458
+ parser.add_argument("--port", type=int)
459
+ parser.add_argument("--controller-url", type=str, default="http://localhost:21001")
460
+ parser.add_argument("--concurrency-count", type=int, default=16)
461
+ parser.add_argument("--model-list-mode", type=str, default="once",
462
+ choices=["once", "reload"])
463
+ parser.add_argument("--share", action="store_true")
464
+ parser.add_argument("--moderate", action="store_true")
465
+ parser.add_argument("--embed", action="store_true")
466
+ args = parser.parse_args()
467
+ logger.info(f"args: {args}")
468
+
469
+ models = get_model_list()
470
+
471
+ logger.info(args)
472
+ demo = build_demo(args.embed, concurrency_count=args.concurrency_count)
473
+ demo.queue(
474
+ api_open=False
475
+ ).launch(
476
+ server_name=args.host,
477
+ server_port=args.port,
478
+ share=args.share
479
+ )
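
Reviewer note on `gradio_web_server.py`: `http_bot` reads the worker's reply as JSON objects separated by a NUL byte (`b"\0"`) and relies on `requests`' `iter_lines(delimiter=b"\0")` for the framing. The sketch below shows the same framing done by hand over an arbitrary iterable of byte chunks. It is an illustration of the wire format only; the helper name `iter_stream_messages` is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_stream_messages(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Yield one decoded JSON object per NUL-terminated frame.

    Chunks may split a frame at any byte, so incomplete data is
    buffered until its terminating NUL byte arrives.
    """
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while b"\0" in buffer:
            frame, buffer = buffer.split(b"\0", 1)
            if frame:  # skip empty frames, as http_bot does
                yield json.loads(frame.decode())
```

Each message in the code above carries the full text generated so far rather than a delta, which is why `http_bot` slices off `len(prompt)` characters from every message instead of concatenating them.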
LLaVA/llava/serve/model_worker.py ADDED
@@ -0,0 +1,288 @@
1
+ """
2
+ A model worker executes the model.
3
+ """
4
+ import argparse
5
+ import asyncio
6
+ import json
7
+ import time
8
+ import threading
9
+ import uuid
10
+
11
+ from fastapi import FastAPI, Request, BackgroundTasks
12
+ from fastapi.responses import StreamingResponse
13
+ import requests
14
+ import torch
15
+ import uvicorn
16
+ from functools import partial
17
+
18
+ from llava.constants import WORKER_HEART_BEAT_INTERVAL
19
+ from llava.utils import (build_logger, server_error_msg,
20
+ pretty_print_semaphore)
21
+ from llava.model.builder import load_pretrained_model
22
+ from llava.mm_utils import process_images, load_image_from_base64, tokenizer_image_token
23
+ from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
24
+ from transformers import TextIteratorStreamer
25
+ from threading import Thread
26
+
27
+
28
+ GB = 1 << 30
29
+
30
+ worker_id = str(uuid.uuid4())[:6]
31
+ logger = build_logger("model_worker", f"model_worker_{worker_id}.log")
32
+ global_counter = 0
33
+
34
+ model_semaphore = None
35
+
36
+
37
+ def heart_beat_worker(controller):
38
+
39
+ while True:
40
+ time.sleep(WORKER_HEART_BEAT_INTERVAL)
41
+ controller.send_heart_beat()
42
+
43
+
44
+ class ModelWorker:
45
+ def __init__(self, controller_addr, worker_addr,
46
+ worker_id, no_register,
47
+ model_path, model_base, model_name,
48
+ load_8bit, load_4bit, device, use_flash_attn=False):
49
+ self.controller_addr = controller_addr
50
+ self.worker_addr = worker_addr
51
+ self.worker_id = worker_id
52
+ if model_path.endswith("/"):
53
+ model_path = model_path[:-1]
54
+ if model_name is None:
55
+ model_paths = model_path.split("/")
56
+ if model_paths[-1].startswith('checkpoint-'):
57
+ self.model_name = model_paths[-2] + "_" + model_paths[-1]
58
+ else:
59
+ self.model_name = model_paths[-1]
60
+ else:
61
+ self.model_name = model_name
62
+
63
+ self.device = device
64
+ logger.info(f"Loading the model {self.model_name} on worker {worker_id} ...")
65
+ self.tokenizer, self.model, self.image_processor, self.context_len = load_pretrained_model(
66
+ model_path, model_base, self.model_name, load_8bit, load_4bit, device=self.device, use_flash_attn=use_flash_attn)
67
+ self.is_multimodal = 'llava' in self.model_name.lower()
68
+
69
+ if not no_register:
70
+ self.register_to_controller()
71
+ self.heart_beat_thread = threading.Thread(
72
+ target=heart_beat_worker, args=(self,), daemon=True)
73
+ self.heart_beat_thread.start()
74
+
75
+ def register_to_controller(self):
76
+ logger.info("Register to controller")
77
+
78
+ url = self.controller_addr + "/register_worker"
79
+ data = {
80
+ "worker_name": self.worker_addr,
81
+ "check_heart_beat": True,
82
+ "worker_status": self.get_status()
83
+ }
84
+ r = requests.post(url, json=data)
85
+ assert r.status_code == 200
86
+
87
+ def send_heart_beat(self):
88
+ logger.info(f"Send heart beat. Models: {[self.model_name]}. "
89
+ f"Semaphore: {pretty_print_semaphore(model_semaphore)}. "
90
+ f"global_counter: {global_counter}")
91
+
92
+ url = self.controller_addr + "/receive_heart_beat"
93
+
94
+ while True:
95
+ try:
96
+ ret = requests.post(url, json={
97
+ "worker_name": self.worker_addr,
98
+ "queue_length": self.get_queue_length()}, timeout=5)
99
+ exist = ret.json()["exist"]
100
+ break
101
+ except requests.exceptions.RequestException as e:
102
+ logger.error(f"heart beat error: {e}")
103
+ time.sleep(5)
104
+
105
+ if not exist:
106
+ self.register_to_controller()
107
+
108
+ def get_queue_length(self):
109
+ if model_semaphore is None:
110
+ return 0
111
+ else:
112
+ return args.limit_model_concurrency - model_semaphore._value + (len(
113
+ model_semaphore._waiters) if model_semaphore._waiters is not None else 0)
114
+
115
+ def get_status(self):
116
+ return {
117
+ "model_names": [self.model_name],
118
+ "speed": 1,
119
+ "queue_length": self.get_queue_length(),
120
+ }
121
+
122
+ @torch.inference_mode()
123
+ def generate_stream(self, params):
124
+ tokenizer, model, image_processor = self.tokenizer, self.model, self.image_processor
125
+
126
+ prompt = params["prompt"]
127
+ ori_prompt = prompt
128
+ images = params.get("images", None)
129
+ num_image_tokens = 0
130
+ if images is not None and len(images) > 0 and self.is_multimodal:
131
+ if len(images) > 0:
132
+ if len(images) != prompt.count(DEFAULT_IMAGE_TOKEN):
133
+ raise ValueError("Number of images does not match number of <image> tokens in prompt")
134
+
135
+ images = [load_image_from_base64(image) for image in images]
136
+ image_sizes = [image.size for image in images]
137
+ images = process_images(images, image_processor, model.config)
138
+
139
+ if type(images) is list:
140
+ images = [image.to(self.model.device, dtype=torch.float16) for image in images]
141
+ else:
142
+ images = images.to(self.model.device, dtype=torch.float16)
143
+
144
+ replace_token = DEFAULT_IMAGE_TOKEN
145
+ if getattr(self.model.config, 'mm_use_im_start_end', False):
146
+ replace_token = DEFAULT_IM_START_TOKEN + replace_token + DEFAULT_IM_END_TOKEN
147
+ prompt = prompt.replace(DEFAULT_IMAGE_TOKEN, replace_token)
148
+
149
+ num_image_tokens = prompt.count(replace_token) * model.get_vision_tower().num_patches
150
+ else:
151
+ images = None
152
+ image_sizes = None
153
+ image_args = {"images": images, "image_sizes": image_sizes}
154
+ else:
155
+ images = None
156
+ image_args = {}
157
+
158
+ temperature = float(params.get("temperature", 1.0))
159
+ top_p = float(params.get("top_p", 1.0))
160
+ max_context_length = getattr(model.config, 'max_position_embeddings', 2048)
161
+ max_new_tokens = min(int(params.get("max_new_tokens", 256)), 1024)
162
+ stop_str = params.get("stop", None)
163
+ do_sample = temperature > 0.001
164
+
165
+ input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).to(self.device)
166
+ keywords = [stop_str]
167
+ # stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
168
+ streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True, timeout=15)
169
+
170
+ max_new_tokens = min(max_new_tokens, max_context_length - input_ids.shape[-1] - num_image_tokens)
171
+
172
+ if max_new_tokens < 1:
173
+ yield json.dumps({"text": ori_prompt + "Exceeds max token length. Please start a new conversation, thanks.", "error_code": 0}).encode() + b"\0"
174
+ return
175
+
176
+ thread = Thread(target=model.generate, kwargs=dict(
177
+ inputs=input_ids,
178
+ do_sample=do_sample,
179
+ temperature=temperature,
180
+ top_p=top_p,
181
+ max_new_tokens=max_new_tokens,
182
+ streamer=streamer,
183
+ use_cache=True,
184
+ **image_args
185
+ ))
186
+ thread.start()
187
+
188
+ generated_text = ori_prompt
189
+ for new_text in streamer:
190
+ generated_text += new_text
191
+ if stop_str and generated_text.endswith(stop_str):
192
+ generated_text = generated_text[:-len(stop_str)]
193
+ yield json.dumps({"text": generated_text, "error_code": 0}).encode() + b"\0"
194
+
195
+ def generate_stream_gate(self, params):
196
+ try:
197
+ for x in self.generate_stream(params):
198
+ yield x
199
+ except ValueError as e:
200
+ print("Caught ValueError:", e)
201
+ ret = {
202
+ "text": server_error_msg,
203
+ "error_code": 1,
204
+ }
205
+ yield json.dumps(ret).encode() + b"\0"
206
+ except torch.cuda.CudaError as e:
207
+ print("Caught torch.cuda.CudaError:", e)
208
+ ret = {
209
+ "text": server_error_msg,
210
+ "error_code": 1,
211
+ }
212
+ yield json.dumps(ret).encode() + b"\0"
213
+ except Exception as e:
214
+ print("Caught Unknown Error", e)
215
+ ret = {
216
+ "text": server_error_msg,
217
+ "error_code": 1,
218
+ }
219
+ yield json.dumps(ret).encode() + b"\0"
220
+
221
+
222
+ app = FastAPI()
223
+
224
+
225
+ def release_model_semaphore(fn=None):
226
+ model_semaphore.release()
227
+ if fn is not None:
228
+ fn()
229
+
230
+
231
+ @app.post("/worker_generate_stream")
232
+ async def generate_stream(request: Request):
233
+ global model_semaphore, global_counter
234
+ global_counter += 1
235
+ params = await request.json()
236
+
237
+ if model_semaphore is None:
238
+ model_semaphore = asyncio.Semaphore(args.limit_model_concurrency)
239
+ await model_semaphore.acquire()
240
+ worker.send_heart_beat()
241
+ generator = worker.generate_stream_gate(params)
242
+ background_tasks = BackgroundTasks()
243
+ background_tasks.add_task(partial(release_model_semaphore, fn=worker.send_heart_beat))
244
+ return StreamingResponse(generator, background=background_tasks)
245
+
246
+
247
+ @app.post("/worker_get_status")
248
+ async def get_status(request: Request):
249
+ return worker.get_status()
250
+
251
+
252
+ if __name__ == "__main__":
253
+ parser = argparse.ArgumentParser()
254
+ parser.add_argument("--host", type=str, default="localhost")
255
+ parser.add_argument("--port", type=int, default=21002)
256
+ parser.add_argument("--worker-address", type=str,
257
+ default="http://localhost:21002")
258
+ parser.add_argument("--controller-address", type=str,
259
+ default="http://localhost:21001")
260
+ parser.add_argument("--model-path", type=str, default="facebook/opt-350m")
261
+ parser.add_argument("--model-base", type=str, default=None)
262
+ parser.add_argument("--model-name", type=str)
263
+ parser.add_argument("--device", type=str, default="cuda")
264
+ parser.add_argument("--multi-modal", action="store_true", help="Multimodal mode is automatically detected with model name, please make sure `llava` is included in the model path.")
265
+ parser.add_argument("--limit-model-concurrency", type=int, default=5)
266
+ parser.add_argument("--stream-interval", type=int, default=1)
267
+ parser.add_argument("--no-register", action="store_true")
268
+ parser.add_argument("--load-8bit", action="store_true")
269
+ parser.add_argument("--load-4bit", action="store_true")
270
+ parser.add_argument("--use-flash-attn", action="store_true")
271
+ args = parser.parse_args()
272
+ logger.info(f"args: {args}")
273
+
274
+ if args.multi_modal:
275
+ logger.warning("Multimodal mode is automatically detected with model name, please make sure `llava` is included in the model path.")
276
+
277
+ worker = ModelWorker(args.controller_address,
278
+ args.worker_address,
279
+ worker_id,
280
+ args.no_register,
281
+ args.model_path,
282
+ args.model_base,
283
+ args.model_name,
284
+ args.load_8bit,
285
+ args.load_4bit,
286
+ args.device,
287
+ use_flash_attn=args.use_flash_attn)
288
+ uvicorn.run(app, host=args.host, port=args.port, log_level="info")
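A note on the wire format used by `/worker_generate_stream` in the file above: each yielded message is a JSON object followed by a single NUL byte (`b"\0"`), and the `text` field carries the cumulative output. The following is a minimal client-side sketch of how such a stream could be reassembled when network chunks do not line up with message boundaries. The helper names `iter_stream_messages` and `encode_message` are hypothetical and are not part of LLaVA.

```python
import json
from typing import Iterable, Iterator


def encode_message(text: str, error_code: int = 0) -> bytes:
    """Mirror the worker's framing: JSON payload followed by one NUL byte."""
    return json.dumps({"text": text, "error_code": error_code}).encode() + b"\0"


def iter_stream_messages(byte_chunks: Iterable[bytes]) -> Iterator[dict]:
    """Yield decoded JSON messages from a NUL-delimited byte stream.

    Hypothetical helper: chunks may split a message anywhere, so the
    incomplete tail is kept in a buffer until its NUL terminator arrives.
    """
    buffer = b""
    for chunk in byte_chunks:
        buffer += chunk
        # Everything before the last NUL is complete; the remainder is not.
        *complete, buffer = buffer.split(b"\0")
        for raw in complete:
            if raw:
                yield json.loads(raw.decode("utf-8"))


if __name__ == "__main__":
    stream = encode_message("Hello") + encode_message("Hello world")
    # Simulate arbitrary network chunking.
    for message in iter_stream_messages([stream[:7], stream[7:30], stream[30:]]):
        print(message["error_code"], message["text"])
```

Because `json.dumps` escapes control characters, a raw NUL byte should never occur inside a payload, which is what makes it usable as a delimiter here.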
LLaVA/llava/serve/register_worker.py ADDED
@@ -0,0 +1,26 @@
1
+ """
2
+ Manually register workers.
3
+
4
+ Usage:
5
+ python3 -m llava.serve.register_worker --controller-address http://localhost:21001 --worker-name http://localhost:21002
6
+ """
7
+
8
+ import argparse
9
+
10
+ import requests
11
+
12
+ if __name__ == "__main__":
13
+ parser = argparse.ArgumentParser()
14
+ parser.add_argument("--controller-address", type=str)
15
+ parser.add_argument("--worker-name", type=str)
16
+ parser.add_argument("--check-heart-beat", action="store_true")
17
+ args = parser.parse_args()
18
+
19
+ url = args.controller_address + "/register_worker"
20
+ data = {
21
+ "worker_name": args.worker_name,
22
+ "check_heart_beat": args.check_heart_beat,
23
+ "worker_status": None,
24
+ }
25
+ r = requests.post(url, json=data)
26
+ assert r.status_code == 200
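For reference, the request that `register_worker.py` sends can be described without any network access. The sketch below only builds the URL and JSON body from the same three fields used above. The function name `build_register_request` is hypothetical, and stripping a trailing slash from the controller address is an added assumption, not something the script does.

```python
from typing import Any, Dict, Optional, Tuple


def build_register_request(
    controller_address: str,
    worker_name: str,
    check_heart_beat: bool = False,
    worker_status: Optional[Dict[str, Any]] = None,
) -> Tuple[str, Dict[str, Any]]:
    """Return the (url, json_payload) pair for a manual worker registration.

    Hypothetical helper. The rstrip("/") is an assumption added here so that
    "http://host:21001/" and "http://host:21001" produce the same URL.
    """
    url = controller_address.rstrip("/") + "/register_worker"
    payload = {
        "worker_name": worker_name,
        "check_heart_beat": check_heart_beat,
        "worker_status": worker_status,  # None, as in the script above
    }
    return url, payload


if __name__ == "__main__":
    url, payload = build_register_request(
        "http://localhost:21001", "http://localhost:21002", check_heart_beat=True
    )
    print(url)
    print(payload)
```

The pair can then be passed to `requests.post(url, json=payload)`, as the script does.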
LLaVA/llava/serve/sglang_worker.py ADDED
@@ -0,0 +1,244 @@
1
+ """
2
+ A model worker executes the model.
3
+ """
4
+ import argparse
5
+ import asyncio
6
+ from concurrent.futures import ThreadPoolExecutor
7
+ import json
8
+ import time
9
+ import threading
10
+ import uuid
11
+
12
+ from fastapi import FastAPI, Request, BackgroundTasks
13
+ from fastapi.responses import StreamingResponse
14
+ import requests
15
+ import re
16
+ import uvicorn
17
+ from functools import partial
18
+
19
+ from llava.constants import WORKER_HEART_BEAT_INTERVAL
20
+ from llava.utils import (build_logger, server_error_msg,
21
+ pretty_print_semaphore)
22
+ from llava.mm_utils import process_images, load_image_from_base64, tokenizer_image_token, expand2square
23
+ from llava.constants import DEFAULT_IMAGE_TOKEN
24
+
25
+ import sglang as sgl
26
+ from sglang.backend.runtime_endpoint import RuntimeEndpoint
27
+
28
+
29
+ GB = 1 << 30
30
+
31
+ worker_id = str(uuid.uuid4())[:6]
32
+ logger = build_logger("model_worker", f"model_worker_{worker_id}.log")
33
+ global_counter = 0
34
+
35
+ model_semaphore = None
36
+
37
+
38
+ def heart_beat_worker(controller):
39
+ while True:
40
+ time.sleep(WORKER_HEART_BEAT_INTERVAL)
41
+ controller.send_heart_beat()
42
+
43
+
44
+ @sgl.function
45
+ def pipeline(s, prompt, max_tokens):
46
+ for p in prompt:
47
+ if type(p) is str:
48
+ s += p
49
+ else:
50
+ s += sgl.image(p)
51
+ s += sgl.gen("response", max_tokens=max_tokens)
52
+
53
+
54
+ class ModelWorker:
55
+ def __init__(self, controller_addr, worker_addr, sgl_endpoint,
56
+ worker_id, no_register, model_name):
57
+ self.controller_addr = controller_addr
58
+ self.worker_addr = worker_addr
59
+ self.worker_id = worker_id
60
+
61
+ # Select backend
62
+ backend = RuntimeEndpoint(sgl_endpoint)
63
+ sgl.set_default_backend(backend)
64
+ model_path = backend.model_info["model_path"]
65
+
66
+ if model_path.endswith("/"):
67
+ model_path = model_path[:-1]
68
+ if model_name is None:
69
+ model_paths = model_path.split("/")
70
+ if model_paths[-1].startswith('checkpoint-'):
71
+ self.model_name = model_paths[-2] + "_" + model_paths[-1]
72
+ else:
73
+ self.model_name = model_paths[-1]
74
+ else:
75
+ self.model_name = model_name
76
+
77
+ logger.info(f"Loading the SGLANG model {self.model_name} on worker {worker_id} ...")
78
+
79
+ if not no_register:
80
+ self.register_to_controller()
81
+ self.heart_beat_thread = threading.Thread(
82
+ target=heart_beat_worker, args=(self,), daemon=True)
83
+ self.heart_beat_thread.start()
84
+
85
+ def register_to_controller(self):
86
+ logger.info("Register to controller")
87
+
88
+ url = self.controller_addr + "/register_worker"
89
+ data = {
90
+ "worker_name": self.worker_addr,
91
+ "check_heart_beat": True,
92
+ "worker_status": self.get_status()
93
+ }
94
+ r = requests.post(url, json=data)
95
+ assert r.status_code == 200
96
+
97
+ def send_heart_beat(self):
98
+ logger.info(f"Send heart beat. Models: {[self.model_name]}. "
99
+ f"Semaphore: {pretty_print_semaphore(model_semaphore)}. "
100
+ f"global_counter: {global_counter}")
101
+
102
+ url = self.controller_addr + "/receive_heart_beat"
103
+
104
+ while True:
105
+ try:
106
+ ret = requests.post(url, json={
107
+ "worker_name": self.worker_addr,
108
+ "queue_length": self.get_queue_length()}, timeout=5)
109
+ exist = ret.json()["exist"]
110
+ break
111
+ except requests.exceptions.RequestException as e:
112
+ logger.error(f"heart beat error: {e}")
113
+ time.sleep(5)
114
+
115
+ if not exist:
116
+ self.register_to_controller()
117
+
118
+ def get_queue_length(self):
119
+ if model_semaphore is None:
120
+ return 0
121
+ else:
122
+ return args.limit_model_concurrency - model_semaphore._value + (len(
123
+ model_semaphore._waiters) if model_semaphore._waiters is not None else 0)
124
+
125
+ def get_status(self):
126
+ return {
127
+ "model_names": [self.model_name],
128
+ "speed": 1,
129
+ "queue_length": self.get_queue_length(),
130
+ }
131
+
132
+ async def generate_stream(self, params):
133
+ ori_prompt = prompt = params["prompt"]
134
+ images = params.get("images", None)
135
+ if images is not None and len(images) > 0:
136
+ if len(images) > 0:
137
+ if len(images) != prompt.count(DEFAULT_IMAGE_TOKEN):
138
+ raise ValueError("Number of images does not match number of <image> tokens in prompt")
139
+
140
+ images = [load_image_from_base64(image) for image in images]
141
+
142
+ # FIXME: for image-start/end token
143
+ # replace_token = DEFAULT_IMAGE_TOKEN
144
+ # if getattr(self.model.config, 'mm_use_im_start_end', False):
145
+ # replace_token = DEFAULT_IM_START_TOKEN + replace_token + DEFAULT_IM_END_TOKEN
146
+ # prompt = prompt.replace(DEFAULT_IMAGE_TOKEN, replace_token)
147
+ prompt = prompt.replace(' ' + DEFAULT_IMAGE_TOKEN + '\n', DEFAULT_IMAGE_TOKEN)
148
+ prompt_split = prompt.split(DEFAULT_IMAGE_TOKEN)
149
+ prompt = []
150
+ for i in range(len(prompt_split)):
151
+ prompt.append(prompt_split[i])
152
+ if i < len(images):
153
+ prompt.append(images[i])
154
+ else:
155
+ prompt = [prompt]
156
+
157
+ temperature = float(params.get("temperature", 1.0))
158
+ top_p = float(params.get("top_p", 1.0))
159
+ # max_context_length = getattr(model.config, 'max_position_embeddings', 2048)
160
+ max_new_tokens = min(int(params.get("max_new_tokens", 256)), 1024)
161
+ stop_str = params.get("stop", None)
162
+ stop_str = [stop_str] if stop_str is not None else None
163
+
164
+ print({'prompt': prompt, 'max_new_tokens': max_new_tokens, 'temperature': temperature, 'top_p': top_p})
165
+ state = pipeline.run(prompt, max_new_tokens, temperature=temperature, top_p=top_p, stream=True)
166
+
167
+ generated_text = ori_prompt
168
+ async for text_outputs in state.text_async_iter(var_name="response"):
169
+ generated_text += text_outputs
170
+ yield json.dumps({"text": generated_text, "error_code": 0}).encode() + b"\0"
171
+
172
+ async def generate_stream_gate(self, params):
173
+ try:
174
+ async for x in self.generate_stream(params):
175
+ yield x
176
+ except ValueError as e:
177
+ print("Caught ValueError:", e)
178
+ ret = {
179
+ "text": server_error_msg,
180
+ "error_code": 1,
181
+ }
182
+ yield json.dumps(ret).encode() + b"\0"
183
+ except Exception as e:
184
+ print("Caught Unknown Error", e)
185
+ ret = {
186
+ "text": server_error_msg,
187
+ "error_code": 1,
188
+ }
189
+ yield json.dumps(ret).encode() + b"\0"
190
+
191
+
192
+ app = FastAPI()
193
+
194
+
195
+ def release_model_semaphore(fn=None):
196
+ model_semaphore.release()
197
+ if fn is not None:
198
+ fn()
199
+
200
+
201
+ @app.post("/worker_generate_stream")
202
+ async def generate_stream(request: Request):
203
+ global model_semaphore, global_counter
204
+ global_counter += 1
205
+ params = await request.json()
206
+
207
+ if model_semaphore is None:
208
+ model_semaphore = asyncio.Semaphore(args.limit_model_concurrency)
209
+ await model_semaphore.acquire()
210
+ worker.send_heart_beat()
211
+ generator = worker.generate_stream_gate(params)
212
+ background_tasks = BackgroundTasks()
213
+ background_tasks.add_task(partial(release_model_semaphore, fn=worker.send_heart_beat))
214
+ return StreamingResponse(generator, background=background_tasks)
215
+
216
+
217
+ @app.post("/worker_get_status")
218
+ async def get_status(request: Request):
219
+ return worker.get_status()
220
+
221
+
222
+ if __name__ == "__main__":
223
+ parser = argparse.ArgumentParser()
224
+ parser.add_argument("--host", type=str, default="localhost")
225
+ parser.add_argument("--port", type=int, default=21002)
226
+ parser.add_argument("--worker-address", type=str,
227
+ default="http://localhost:21002")
228
+ parser.add_argument("--controller-address", type=str,
229
+ default="http://localhost:21001")
230
+ parser.add_argument("--model-name", type=str)
231
+ parser.add_argument("--sgl-endpoint", type=str)
232
+ parser.add_argument("--limit-model-concurrency", type=int, default=5)
233
+ parser.add_argument("--stream-interval", type=int, default=1)
234
+ parser.add_argument("--no-register", action="store_true")
235
+ args = parser.parse_args()
236
+ logger.info(f"args: {args}")
237
+
238
+ worker = ModelWorker(args.controller_address,
239
+ args.worker_address,
240
+ args.sgl_endpoint,
241
+ worker_id,
242
+ args.no_register,
243
+ args.model_name)
244
+ uvicorn.run(app, host=args.host, port=args.port, log_level="info")
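The prompt handling in `generate_stream` above turns one prompt string into an alternating list of text segments and images, which `pipeline` then appends to the SGLang state one by one. A standalone sketch of that splitting step follows. It assumes `DEFAULT_IMAGE_TOKEN` is the literal string `<image>` (consistent with the error message above, but not shown in this file), and the function name `interleave_prompt` is hypothetical.

```python
from typing import Any, List, Sequence

IMAGE_TOKEN = "<image>"  # assumed value of llava.constants.DEFAULT_IMAGE_TOKEN


def interleave_prompt(prompt: str, images: Sequence[Any]) -> List[Any]:
    """Split a prompt on image tokens and insert the images between the parts.

    Hypothetical helper that mirrors the loop in generate_stream: the i-th
    image is placed where the i-th image token occurred.
    """
    if len(images) != prompt.count(IMAGE_TOKEN):
        raise ValueError("Number of images does not match number of <image> tokens in prompt")
    # Drop the space before and the newline after each image token, as the worker does.
    prompt = prompt.replace(" " + IMAGE_TOKEN + "\n", IMAGE_TOKEN)
    segments: List[Any] = []
    for i, text in enumerate(prompt.split(IMAGE_TOKEN)):
        segments.append(text)
        if i < len(images):
            segments.append(images[i])
    return segments


if __name__ == "__main__":
    fake_image = object()  # stands in for a decoded PIL image
    print(interleave_prompt("USER: <image>\nWhat is shown? ASSISTANT:", [fake_image]))
```

With no images and no image tokens the result is a one-element list holding the original prompt, which matches the `prompt = [prompt]` branch above.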
LLaVA/llava/serve/test_message.py ADDED
@@ -0,0 +1,62 @@
1
+ import argparse
2
+ import json
3
+
4
+ import requests
5
+
6
+ from llava.conversation import default_conversation
7
+
8
+
9
+ def main():
10
+ if args.worker_address:
11
+ worker_addr = args.worker_address
12
+ else:
13
+ controller_addr = args.controller_address
14
+ ret = requests.post(controller_addr + "/refresh_all_workers")
15
+ ret = requests.post(controller_addr + "/list_models")
16
+ models = ret.json()["models"]
17
+ models.sort()
18
+ print(f"Models: {models}")
19
+
20
+ ret = requests.post(controller_addr + "/get_worker_address",
21
+ json={"model": args.model_name})
22
+ worker_addr = ret.json()["address"]
23
+ print(f"worker_addr: {worker_addr}")
24
+
25
+ if worker_addr == "":
26
+ return
27
+
28
+ conv = default_conversation.copy()
29
+ conv.append_message(conv.roles[0], args.message)
30
+ prompt = conv.get_prompt()
31
+
32
+ headers = {"User-Agent": "LLaVA Client"}
33
+ pload = {
34
+ "model": args.model_name,
35
+ "prompt": prompt,
36
+ "max_new_tokens": args.max_new_tokens,
37
+ "temperature": 0.7,
38
+ "stop": conv.sep,
39
+ }
40
+ response = requests.post(worker_addr + "/worker_generate_stream", headers=headers,
41
+ json=pload, stream=True)
42
+
43
+ print(prompt.replace(conv.sep, "\n"), end="")
44
+ for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False, delimiter=b"\0"):
45
+ if chunk:
46
+ data = json.loads(chunk.decode("utf-8"))
47
+ output = data["text"].split(conv.sep)[-1]
48
+ print(output, end="\r")
49
+ print("")
50
+
51
+
52
+ if __name__ == "__main__":
53
+ parser = argparse.ArgumentParser()
54
+ parser.add_argument("--controller-address", type=str, default="http://localhost:21001")
55
+ parser.add_argument("--worker-address", type=str)
56
+ parser.add_argument("--model-name", type=str, default="facebook/opt-350m")
57
+ parser.add_argument("--max-new-tokens", type=int, default=32)
58
+ parser.add_argument("--message", type=str, default=
59
+ "Tell me a story with more than 1000 words.")
60
+ args = parser.parse_args()
61
+
62
+ main()
LLaVA/llava/train/llama_flash_attn_monkey_patch.py ADDED
@@ -0,0 +1,115 @@
1
+ from typing import Optional, Tuple
2
+ import warnings
3
+
4
+ import torch
5
+
6
+ import transformers
7
+ from transformers.models.llama.modeling_llama import apply_rotary_pos_emb, repeat_kv
8
+
9
+ try:
10
+ from flash_attn.flash_attn_interface import flash_attn_unpadded_qkvpacked_func
11
+ except ImportError:
12
+ from flash_attn.flash_attn_interface import flash_attn_varlen_qkvpacked_func as flash_attn_unpadded_qkvpacked_func
13
+ from flash_attn.bert_padding import unpad_input, pad_input
14
+
15
+
16
+ def forward(
17
+ self,
18
+ hidden_states: torch.Tensor,
19
+ attention_mask: Optional[torch.Tensor] = None,
20
+ position_ids: Optional[torch.Tensor] = None,
21
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
22
+ output_attentions: bool = False,
23
+ use_cache: bool = False,
24
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
25
+ if output_attentions:
26
+ warnings.warn(
27
+ "Output attentions is not supported for patched `LlamaAttention`, returning `None` instead."
28
+ )
29
+
30
+ bsz, q_len, _ = hidden_states.size()
31
+
32
+ query_states = (
33
+ self.q_proj(hidden_states)
34
+ .view(bsz, q_len, self.num_heads, self.head_dim)
35
+ .transpose(1, 2)
36
+ )
37
+ key_states = (
38
+ self.k_proj(hidden_states)
39
+ .view(bsz, q_len, self.num_key_value_heads, self.head_dim)
40
+ .transpose(1, 2)
41
+ )
42
+ value_states = (
43
+ self.v_proj(hidden_states)
44
+ .view(bsz, q_len, self.num_key_value_heads, self.head_dim)
45
+ .transpose(1, 2)
46
+ ) # shape: (b, num_heads, s, head_dim)
47
+
48
+ kv_seq_len = key_states.shape[-2]
49
+ if past_key_value is not None:
50
+ kv_seq_len += past_key_value[0].shape[-2]
51
+
52
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
53
+ query_states, key_states = apply_rotary_pos_emb(
54
+ query_states, key_states, cos, sin, position_ids
55
+ )
56
+
57
+ if past_key_value is not None:
58
+ # reuse k, v
59
+ key_states = torch.cat([past_key_value[0], key_states], dim=2)
60
+ value_states = torch.cat([past_key_value[1], value_states], dim=2)
61
+
62
+ past_key_value = (key_states, value_states) if use_cache else None
63
+
64
+ # repeat k/v heads if n_kv_heads < n_heads
65
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
66
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
67
+
68
+ # Transform the data into the format required by flash attention
69
+ qkv = torch.stack([query_states, key_states, value_states], dim=2)
70
+ qkv = qkv.transpose(1, 3) # shape: [b, s, 3, num_heads, head_dim]
71
+ key_padding_mask = attention_mask
72
+
73
+ if key_padding_mask is None:
74
+ qkv = qkv.reshape(-1, 3, self.num_heads, self.head_dim)
75
+ cu_q_lens = torch.arange(
76
+ 0, (bsz + 1) * q_len, step=q_len, dtype=torch.int32, device=qkv.device
77
+ )
78
+ max_s = q_len
79
+ output = flash_attn_unpadded_qkvpacked_func(
80
+ qkv, cu_q_lens, max_s, 0.0, softmax_scale=None, causal=True
81
+ )
82
+ output = output.view(bsz, q_len, -1)
83
+ else:
84
+ qkv = qkv.reshape(bsz, q_len, -1)
85
+ qkv, indices, cu_q_lens, max_s = unpad_input(qkv, key_padding_mask)
86
+ qkv = qkv.view(-1, 3, self.num_heads, self.head_dim)
87
+ output_unpad = flash_attn_unpadded_qkvpacked_func(
88
+ qkv, cu_q_lens, max_s, 0.0, softmax_scale=None, causal=True
89
+ )
90
+ output_unpad = output_unpad.reshape(-1, self.num_heads * self.head_dim)
91
+ output = pad_input(output_unpad, indices, bsz, q_len)
92
+
93
+ return self.o_proj(output), None, past_key_value
94
+
95
+
96
+ # Disable the transformation of the attention mask in LlamaModel as the flash attention
97
+ # requires the attention mask to be the same as the key_padding_mask
98
+ def _prepare_decoder_attention_mask(
99
+ self, attention_mask, input_shape, inputs_embeds, past_key_values_length
100
+ ):
101
+ # [bsz, seq_len]
102
+ return attention_mask
103
+
104
+
105
+ def replace_llama_attn_with_flash_attn():
106
+ cuda_major, cuda_minor = torch.cuda.get_device_capability()
107
+ if cuda_major < 8:
108
+ warnings.warn(
109
+ "Flash attention is only supported on A100 or H100 GPU during training due to head dim > 64 backward. "
110
+ "ref: https://github.com/HazyResearch/flash-attention/issues/190#issuecomment-1523359593"
111
+ )
112
+ transformers.models.llama.modeling_llama.LlamaModel._prepare_decoder_attention_mask = (
113
+ _prepare_decoder_attention_mask
114
+ )
115
+ transformers.models.llama.modeling_llama.LlamaAttention.forward = forward
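The patched `forward` above passes `cu_q_lens` to the variable-length flash attention kernel. These are cumulative sequence offsets into the flattened token dimension. The sketch below reproduces the arithmetic in plain Python so it can be checked without a GPU. The first helper matches the `torch.arange` call in the mask-free branch. The second is an assumption about what the offsets returned by `unpad_input` contain for a 0/1 padding mask, not a copy of the library's code. Both function names are hypothetical.

```python
from itertools import accumulate
from typing import List, Sequence


def cu_seqlens_uniform(bsz: int, q_len: int) -> List[int]:
    """Offsets when every sequence has q_len tokens (the mask-free branch).

    Same values as torch.arange(0, (bsz + 1) * q_len, step=q_len).
    """
    return list(range(0, (bsz + 1) * q_len, q_len))


def cu_seqlens_from_mask(mask: Sequence[Sequence[int]]) -> List[int]:
    """Offsets for a 0/1 padding mask where 1 marks a real token.

    Assumed behaviour: one offset per sequence boundary, starting at 0.
    """
    lengths = [sum(row) for row in mask]
    return [0] + list(accumulate(lengths))


if __name__ == "__main__":
    print(cu_seqlens_uniform(bsz=3, q_len=4))
    print(cu_seqlens_from_mask([[1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1]]))
```

In both cases the list has `bsz + 1` entries, and sequence `i` occupies the flattened rows from entry `i` up to, but not including, entry `i + 1`.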
LLaVA/llava/train/llama_xformers_attn_monkey_patch.py ADDED
@@ -0,0 +1,129 @@
1
+ """
2
+ Code copied directly from https://raw.githubusercontent.com/oobabooga/text-generation-webui/main/modules/llama_attn_hijack.py, with some adjustments.
3
+ """
4
+
5
+ import logging
6
+ import math
7
+ from typing import Optional, Tuple
8
+
9
+ import torch
10
+ import transformers.models.llama.modeling_llama
11
+ from torch import nn
12
+
13
+ try:
14
+ import xformers.ops
15
+ except ImportError:
16
+ logging.error("xformers not found! Please install it before trying to use it.")
17
+
18
+
19
+ def replace_llama_attn_with_xformers_attn():
20
+ transformers.models.llama.modeling_llama.LlamaAttention.forward = xformers_forward
21
+
22
+
23
+ def xformers_forward(
24
+ self,
25
+ hidden_states: torch.Tensor,
26
+ attention_mask: Optional[torch.Tensor] = None,
27
+ position_ids: Optional[torch.LongTensor] = None,
28
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
29
+ output_attentions: bool = False,
30
+ use_cache: bool = False,
31
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
32
+ # pylint: disable=duplicate-code
33
+ bsz, q_len, _ = hidden_states.size()
34
+
35
+ query_states = (
36
+ self.q_proj(hidden_states)
37
+ .view(bsz, q_len, self.num_heads, self.head_dim)
38
+ .transpose(1, 2)
39
+ )
40
+ key_states = (
41
+ self.k_proj(hidden_states)
42
+ .view(bsz, q_len, self.num_heads, self.head_dim)
43
+ .transpose(1, 2)
44
+ )
45
+ value_states = (
46
+ self.v_proj(hidden_states)
47
+ .view(bsz, q_len, self.num_heads, self.head_dim)
48
+ .transpose(1, 2)
49
+ )
50
+
51
+ kv_seq_len = key_states.shape[-2]
52
+ if past_key_value is not None:
53
+ kv_seq_len += past_key_value[0].shape[-2]
54
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
55
+ (
56
+ query_states,
57
+ key_states,
58
+ ) = transformers.models.llama.modeling_llama.apply_rotary_pos_emb(
59
+ query_states, key_states, cos, sin, position_ids
60
+ )
61
+ # [bsz, nh, t, hd]
62
+
63
+ if past_key_value is not None:
64
+ # reuse k, v, self_attention
65
+ key_states = torch.cat([past_key_value[0], key_states], dim=2)
66
+ value_states = torch.cat([past_key_value[1], value_states], dim=2)
67
+
68
+ past_key_value = (key_states, value_states) if use_cache else None
69
+
70
+ # We only apply xformers optimizations if we don't need to output the whole attention matrix
71
+ if not output_attentions:
72
+ query_states = query_states.transpose(1, 2)
73
+ key_states = key_states.transpose(1, 2)
74
+ value_states = value_states.transpose(1, 2)
75
+
76
+ # This is a nasty hack. We know attention_mask in transformers is either LowerTriangular or all Zeros.
77
+ # We therefore check if one element in the upper triangular portion is zero. If it is, then the mask is all zeros.
78
+ if attention_mask is None or attention_mask[0, 0, 0, 1] == 0:
79
+ # input and output should be of form (bsz, q_len, num_heads, head_dim)
80
+ attn_output = xformers.ops.memory_efficient_attention(
81
+ query_states, key_states, value_states, attn_bias=None
82
+ )
83
+ else:
84
+ # input and output should be of form (bsz, q_len, num_heads, head_dim)
85
+ attn_output = xformers.ops.memory_efficient_attention(
86
+ query_states,
87
+ key_states,
88
+ value_states,
89
+ attn_bias=xformers.ops.LowerTriangularMask(),
90
+ )
91
+ attn_weights = None
92
+ else:
93
+ attn_weights = torch.matmul(
94
+ query_states, key_states.transpose(2, 3)
95
+ ) / math.sqrt(self.head_dim)
96
+
97
+ if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
98
+ raise ValueError(
99
+ f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
100
+ f" {attn_weights.size()}"
101
+ )
102
+
103
+ if attention_mask is not None:
104
+ if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
105
+ raise ValueError(
106
+ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
107
+ )
108
+ attn_weights = attn_weights + attention_mask
109
+ attn_weights = torch.max(
110
+ attn_weights, torch.tensor(torch.finfo(attn_weights.dtype).min)
111
+ )
112
+
113
+ # upcast attention to fp32
114
+ attn_weights = nn.functional.softmax(
115
+ attn_weights, dim=-1, dtype=torch.float32
116
+ ).to(query_states.dtype)
117
+ attn_output = torch.matmul(attn_weights, value_states)
118
+
119
+ if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
120
+ raise ValueError(
121
+ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
122
+ f" {attn_output.size()}"
123
+ )
124
+
125
+ attn_output = attn_output.transpose(1, 2)
126
+
127
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
128
+ attn_output = self.o_proj(attn_output)
129
+ return attn_output, attn_weights, past_key_value
LLaVA/llava/train/llava_trainer.py ADDED
@@ -0,0 +1,255 @@
1
+ import os
2
+ import torch
3
+ import torch.nn as nn
4
+
5
+ from torch.utils.data import Sampler
6
+
7
+ from transformers import Trainer
8
+ from transformers.trainer import (
9
+ is_sagemaker_mp_enabled,
10
+ get_parameter_names,
11
+ has_length,
12
+ ALL_LAYERNORM_LAYERS,
13
+ logger,
14
+ )
15
+ from typing import List, Optional
16
+
17
+
18
+ def maybe_zero_3(param, ignore_status=False, name=None):
19
+ from deepspeed import zero
20
+ from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus
21
+ if hasattr(param, "ds_id"):
22
+ if param.ds_status == ZeroParamStatus.NOT_AVAILABLE:
23
+ if not ignore_status:
24
+ print(name, 'no ignore status')
25
+ with zero.GatheredParameters([param]):
26
+ param = param.data.detach().cpu().clone()
27
+ else:
28
+ param = param.detach().cpu().clone()
29
+ return param
30
+
31
+
32
+ def get_mm_adapter_state_maybe_zero_3(named_params, keys_to_match):
33
+ to_return = {k: t for k, t in named_params if any(key_match in k for key_match in keys_to_match)}
34
+ to_return = {k: maybe_zero_3(v, ignore_status=True, name=k).cpu() for k, v in to_return.items()}
35
+ return to_return
36
+
37
+
38
+ def split_to_even_chunks(indices, lengths, num_chunks):
39
+ """
40
+ Split a list of indices into `num_chunks` chunks of roughly equal total length.
41
+ """
42
+
43
+ if len(indices) % num_chunks != 0:
44
+ return [indices[i::num_chunks] for i in range(num_chunks)]
45
+
46
+ num_indices_per_chunk = len(indices) // num_chunks
47
+
48
+ chunks = [[] for _ in range(num_chunks)]
49
+ chunks_lengths = [0 for _ in range(num_chunks)]
50
+ for index in indices:
51
+ shortest_chunk = chunks_lengths.index(min(chunks_lengths))
52
+ chunks[shortest_chunk].append(index)
53
+ chunks_lengths[shortest_chunk] += lengths[index]
54
+ if len(chunks[shortest_chunk]) == num_indices_per_chunk:
55
+ chunks_lengths[shortest_chunk] = float("inf")
56
+
57
+ return chunks
58
+
59
+
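The greedy balancing in `split_to_even_chunks` can be tried in isolation. The following is a minimal standalone sketch of the same logic; the sample lengths are hypothetical and chosen only to make the balancing visible.

```python
def split_to_even_chunks(indices, lengths, num_chunks):
    """Greedy split: always extend the chunk with the smallest total length."""
    if len(indices) % num_chunks != 0:
        # Same fallback as the function above: plain round-robin striding.
        return [indices[i::num_chunks] for i in range(num_chunks)]

    per_chunk = len(indices) // num_chunks
    chunks = [[] for _ in range(num_chunks)]
    totals = [0] * num_chunks
    for index in indices:
        shortest = totals.index(min(totals))
        chunks[shortest].append(index)
        totals[shortest] += lengths[index]
        if len(chunks[shortest]) == per_chunk:
            # The chunk is full, so it must never be selected again.
            totals[shortest] = float("inf")
    return chunks


lengths = [9, 7, 5, 3]   # hypothetical token counts per sample
indices = [0, 1, 2, 3]   # already sorted longest-first
chunks = split_to_even_chunks(indices, lengths, num_chunks=2)
print(chunks)
```

Feeding the indices longest-first, as `get_length_grouped_indices` does, is what lets the greedy pass even out the per-chunk totals.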
60
+ def get_modality_length_grouped_indices(lengths, batch_size, world_size, generator=None):
61
+ # We need to use torch for the random part as a distributed sampler will set the random seed for torch.
62
+ assert all(l != 0 for l in lengths), "Should not have zero length."
63
+ if all(l > 0 for l in lengths) or all(l < 0 for l in lengths):
64
+ # all samples are in the same modality
65
+ return get_length_grouped_indices(lengths, batch_size, world_size, generator=generator)
66
+ mm_indices, mm_lengths = zip(*[(i, l) for i, l in enumerate(lengths) if l > 0])
67
+ lang_indices, lang_lengths = zip(*[(i, -l) for i, l in enumerate(lengths) if l < 0])
68
+
69
+ mm_shuffle = [mm_indices[i] for i in get_length_grouped_indices(mm_lengths, batch_size, world_size, generator=None)]
70
+ lang_shuffle = [lang_indices[i] for i in get_length_grouped_indices(lang_lengths, batch_size, world_size, generator=None)]
71
+ megabatch_size = world_size * batch_size
72
+ mm_megabatches = [mm_shuffle[i : i + megabatch_size] for i in range(0, len(mm_shuffle), megabatch_size)]
73
+ lang_megabatches = [lang_shuffle[i : i + megabatch_size] for i in range(0, len(lang_shuffle), megabatch_size)]
74
+
75
+ last_mm = mm_megabatches[-1]
76
+ last_lang = lang_megabatches[-1]
77
+ additional_batch = last_mm + last_lang
78
+ megabatches = mm_megabatches[:-1] + lang_megabatches[:-1]
79
+ megabatch_indices = torch.randperm(len(megabatches), generator=generator)
80
+ megabatches = [megabatches[i] for i in megabatch_indices]
81
+
82
+ if len(additional_batch) > 0:
83
+ megabatches.append(sorted(additional_batch))
84
+
85
+ return [i for megabatch in megabatches for i in megabatch]
86
+
87
+
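`get_modality_length_grouped_indices` relies on a sign convention: positive lengths are routed to the `mm_` lists and negative lengths to the `lang_` lists, with the sign stripped. The sketch below isolates just that split; the helper name `split_by_modality` and the sample lengths are assumptions for illustration, not part of the file.

```python
def split_by_modality(lengths):
    """Separate (index, length) pairs by the sign of the length.

    Positive lengths go to the multimodal list, negative lengths to the
    language-only list (stored with the sign removed).
    """
    assert all(l != 0 for l in lengths), "Should not have zero length."
    mm = [(i, l) for i, l in enumerate(lengths) if l > 0]
    lang = [(i, -l) for i, l in enumerate(lengths) if l < 0]
    return mm, lang


# Hypothetical dataset: samples 0 and 2 carry images, 1 and 3 are text-only.
mm, lang = split_by_modality([120, -30, 80, -45])
print(mm, lang)
```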
88
+ def get_length_grouped_indices(lengths, batch_size, world_size, generator=None, merge=True):
89
+ # We need to use torch for the random part as a distributed sampler will set the random seed for torch.
90
+ indices = torch.randperm(len(lengths), generator=generator)
91
+ megabatch_size = world_size * batch_size
92
+ megabatches = [indices[i : i + megabatch_size].tolist() for i in range(0, len(lengths), megabatch_size)]
93
+ megabatches = [sorted(megabatch, key=lambda i: lengths[i], reverse=True) for megabatch in megabatches]
94
+ megabatches = [split_to_even_chunks(megabatch, lengths, world_size) for megabatch in megabatches]
95
+
96
+ return [i for megabatch in megabatches for batch in megabatch for i in batch]
97
+
98
+
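The megabatch idea in `get_length_grouped_indices` (shuffle, cut into groups of `world_size * batch_size`, then sort each group longest-first) can be sketched without torch. This version is an assumption-laden simplification: it uses `random.Random` as a stand-in for `torch.randperm`, takes a hypothetical `seed` parameter, and skips the per-rank `split_to_even_chunks` step.

```python
import random


def length_grouped_indices(lengths, batch_size, world_size, seed=0):
    """Torch-free sketch: shuffle, cut into megabatches, sort each by length."""
    rng = random.Random(seed)  # stand-in for torch.randperm(..., generator=...)
    indices = list(range(len(lengths)))
    rng.shuffle(indices)

    megabatch_size = world_size * batch_size
    megabatches = [
        indices[i:i + megabatch_size]
        for i in range(0, len(indices), megabatch_size)
    ]
    # Within each megabatch, the longest samples come first.
    megabatches = [
        sorted(mb, key=lambda i: lengths[i], reverse=True) for mb in megabatches
    ]
    return [i for mb in megabatches for i in mb]


lengths = [5, 50, 7, 48, 6, 52, 8, 49]  # hypothetical token counts
order = length_grouped_indices(lengths, batch_size=2, world_size=2, seed=0)
print(order)
```

Randomness is kept across megabatches, while sorting inside each one keeps samples of similar length close together and reduces padding.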
99
+ class LengthGroupedSampler(Sampler):
100
+ r"""
101
+ Sampler that samples indices in a way that groups together features of the dataset of roughly the same length while
102
+ keeping a bit of randomness.
103
+ """
104
+
105
+ def __init__(
106
+ self,
107
+ batch_size: int,
108
+ world_size: int,
109
+ lengths: Optional[List[int]] = None,
110
+ generator=None,
111
+ group_by_modality: bool = False,
112
+ ):
113
+ if lengths is None:
114
+ raise ValueError("Lengths must be provided.")
115
+
116
+ self.batch_size = batch_size
117
+ self.world_size = world_size
118
+ self.lengths = lengths
119
+ self.generator = generator
120
+ self.group_by_modality = group_by_modality
121
+
122
+ def __len__(self):
123
+ return len(self.lengths)
124
+
125
+ def __iter__(self):
126
+ if self.group_by_modality:
127
+ indices = get_modality_length_grouped_indices(self.lengths, self.batch_size, self.world_size, generator=self.generator)
128
+ else:
129
+ indices = get_length_grouped_indices(self.lengths, self.batch_size, self.world_size, generator=self.generator)
130
+ return iter(indices)
131
+
132
+
133
+ class LLaVATrainer(Trainer):
134
+
135
+ def _get_train_sampler(self) -> Optional[torch.utils.data.Sampler]:
136
+ if self.train_dataset is None or not has_length(self.train_dataset):
137
+ return None
138
+
139
+ if self.args.group_by_modality_length:
140
+ lengths = self.train_dataset.modality_lengths
141
+ return LengthGroupedSampler(
142
+ self.args.train_batch_size,
143
+ world_size=self.args.world_size * self.args.gradient_accumulation_steps,
144
+ lengths=lengths,
145
+ group_by_modality=True,
146
+ )
147
+ else:
148
+ return super()._get_train_sampler()
149
+
150
+ def create_optimizer(self):
151
+ """
152
+ Setup the optimizer.
153
+
154
+ We provide a reasonable default that works well. If you want to use something else, you can pass a tuple in the
155
+ Trainer's init through `optimizers`, or subclass and override this method in a subclass.
156
+ """
157
+ if is_sagemaker_mp_enabled():
158
+ return super().create_optimizer()
159
+
160
+ opt_model = self.model
161
+
162
+ if self.optimizer is None:
163
+ decay_parameters = get_parameter_names(opt_model, ALL_LAYERNORM_LAYERS)
164
+ decay_parameters = [name for name in decay_parameters if "bias" not in name]
165
+ if self.args.mm_projector_lr is not None:
166
+ projector_parameters = [name for name, _ in opt_model.named_parameters() if "mm_projector" in name]
167
+ optimizer_grouped_parameters = [
168
+ {
169
+ "params": [
170
+ p for n, p in opt_model.named_parameters() if (n in decay_parameters and n not in projector_parameters and p.requires_grad)
171
+ ],
172
+ "weight_decay": self.args.weight_decay,
173
+ },
174
+ {
175
+ "params": [
176
+ p for n, p in opt_model.named_parameters() if (n not in decay_parameters and n not in projector_parameters and p.requires_grad)
177
+ ],
178
+ "weight_decay": 0.0,
179
+ },
180
+ {
181
+ "params": [
182
+ p for n, p in opt_model.named_parameters() if (n in decay_parameters and n in projector_parameters and p.requires_grad)
183
+ ],
184
+ "weight_decay": self.args.weight_decay,
185
+ "lr": self.args.mm_projector_lr,
186
+ },
187
+ {
188
+ "params": [
189
+ p for n, p in opt_model.named_parameters() if (n not in decay_parameters and n in projector_parameters and p.requires_grad)
190
+ ],
191
+ "weight_decay": 0.0,
192
+ "lr": self.args.mm_projector_lr,
193
+ },
194
+ ]
195
+ else:
196
+ optimizer_grouped_parameters = [
197
+ {
198
+ "params": [
199
+ p for n, p in opt_model.named_parameters() if (n in decay_parameters and p.requires_grad)
200
+ ],
201
+ "weight_decay": self.args.weight_decay,
202
+ },
203
+ {
204
+ "params": [
205
+ p for n, p in opt_model.named_parameters() if (n not in decay_parameters and p.requires_grad)
206
+ ],
207
+ "weight_decay": 0.0,
208
+ },
209
+ ]
210
+
211
+ optimizer_cls, optimizer_kwargs = Trainer.get_optimizer_cls_and_kwargs(self.args)
212
+
213
+ self.optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)
214
+ if optimizer_cls.__name__ == "Adam8bit":
215
+ import bitsandbytes
216
+
217
+ manager = bitsandbytes.optim.GlobalOptimManager.get_instance()
218
+
219
+ skipped = 0
220
+ for module in opt_model.modules():
221
+ if isinstance(module, nn.Embedding):
222
+ skipped += sum({p.data_ptr(): p.numel() for p in module.parameters()}.values())
223
+ logger.info(f"skipped {module}: {skipped/2**20}M params")
224
+ manager.register_module_override(module, "weight", {"optim_bits": 32})
225
+ logger.debug(f"bitsandbytes: will optimize {module} in fp32")
226
+ logger.info(f"skipped: {skipped/2**20}M params")
227
+
228
+ return self.optimizer
229
+
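When `mm_projector_lr` is set, `create_optimizer` sorts parameters into four groups along two axes: weight decay or not, and projector or not. A minimal sketch of that bucketing, operating on names only, is shown below. The helper `group_parameter_names` and the parameter names are hypothetical, and the `requires_grad` filter from the real method is omitted.

```python
def group_parameter_names(names, decay_names, projector_key="mm_projector"):
    """Bucket parameter names by (projector or not) x (weight decay or not)."""
    groups = {
        "decay": [],
        "no_decay": [],
        "projector_decay": [],
        "projector_no_decay": [],
    }
    for name in names:
        prefix = "projector_" if projector_key in name else ""
        suffix = "decay" if name in decay_names else "no_decay"
        groups[prefix + suffix].append(name)
    return groups


# Hypothetical parameter names, for illustration only.
names = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.input_layernorm.weight",
    "model.mm_projector.0.weight",
    "model.mm_projector.0.bias",
]
decay_names = {names[0], names[2]}  # non-LayerNorm, non-bias weights
groups = group_parameter_names(names, decay_names)
print(groups)
```

In the real method, only the two projector groups receive the extra `"lr": self.args.mm_projector_lr` entry; the other two fall back to the optimizer's default learning rate.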
230
+ def _save_checkpoint(self, model, trial, metrics=None):
231
+ if getattr(self.args, 'tune_mm_mlp_adapter', False):
232
+ from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR
233
+ checkpoint_folder = f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}"
234
+
235
+ run_dir = self._get_output_dir(trial=trial)
236
+ output_dir = os.path.join(run_dir, checkpoint_folder)
237
+
238
+ # Only save Adapter
239
+ keys_to_match = ['mm_projector', 'vision_resampler']
240
+ if getattr(self.args, "use_im_start_end", False):
241
+ keys_to_match.extend(['embed_tokens', 'embed_in'])
242
+
243
+ weight_to_save = get_mm_adapter_state_maybe_zero_3(self.model.named_parameters(), keys_to_match)
244
+
245
+ if self.args.local_rank == 0 or self.args.local_rank == -1:
246
+ self.model.config.save_pretrained(output_dir)
247
+ torch.save(weight_to_save, os.path.join(output_dir, 'mm_projector.bin'))
248
+ else:
249
+ super(LLaVATrainer, self)._save_checkpoint(model, trial, metrics)
250
+
251
+ def _save(self, output_dir: Optional[str] = None, state_dict=None):
252
+ if getattr(self.args, 'tune_mm_mlp_adapter', False):
253
+ pass
254
+ else:
255
+ super(LLaVATrainer, self)._save(output_dir, state_dict)
LLaVA/llava/train/train.py ADDED
@@ -0,0 +1,991 @@
1
+ # Adopted from https://github.com/lm-sys/FastChat. Below is the original copyright:
2
+ # Adopted from tatsu-lab@stanford_alpaca. Below is the original copyright:
3
+ # Copyright 2023 Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+
17
+ import os
18
+ import copy
19
+ from dataclasses import dataclass, field
20
+ import json
21
+ import logging
22
+ import pathlib
23
+ from typing import Dict, Optional, Sequence, List
24
+
25
+ import torch
26
+
27
+ import transformers
28
+ import tokenizers
29
+
30
+ from llava.constants import IGNORE_INDEX, IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
31
+ from torch.utils.data import Dataset
32
+ from llava.train.llava_trainer import LLaVATrainer
33
+
34
+ from llava import conversation as conversation_lib
35
+ from llava.model import *
36
+ from llava.mm_utils import tokenizer_image_token
37
+
38
+ from PIL import Image
39
+
40
+
41
+ local_rank = None
42
+
43
+
44
+ def rank0_print(*args):
45
+ if local_rank == 0:
46
+ print(*args)
47
+
48
+
49
+ from packaging import version
50
+ IS_TOKENIZER_GREATER_THAN_0_14 = version.parse(tokenizers.__version__) >= version.parse('0.14')
51
+
52
+
53
+ @dataclass
54
+ class ModelArguments:
55
+ model_name_or_path: Optional[str] = field(default="facebook/opt-125m")
56
+ version: Optional[str] = field(default="v0")
57
+ freeze_backbone: bool = field(default=False)
58
+ tune_mm_mlp_adapter: bool = field(default=False)
59
+ vision_tower: Optional[str] = field(default=None)
60
+ mm_vision_select_layer: Optional[int] = field(default=-1) # default to the last layer
61
+ pretrain_mm_mlp_adapter: Optional[str] = field(default=None)
62
+ mm_projector_type: Optional[str] = field(default='linear')
63
+ mm_use_im_start_end: bool = field(default=False)
64
+ mm_use_im_patch_token: bool = field(default=True)
65
+ mm_patch_merge_type: Optional[str] = field(default='flat')
66
+ mm_vision_select_feature: Optional[str] = field(default="patch")
67
+
68
+
69
+ @dataclass
70
+ class DataArguments:
71
+ data_path: str = field(default=None,
72
+ metadata={"help": "Path to the training data."})
73
+ lazy_preprocess: bool = False
74
+ is_multimodal: bool = False
75
+ image_folder: Optional[str] = field(default=None)
76
+ image_aspect_ratio: str = 'square'
77
+
78
+
79
+ @dataclass
80
+ class TrainingArguments(transformers.TrainingArguments):
81
+ cache_dir: Optional[str] = field(default=None)
82
+ optim: str = field(default="adamw_torch")
83
+ remove_unused_columns: bool = field(default=False)
84
+ freeze_mm_mlp_adapter: bool = field(default=False)
85
+ mpt_attn_impl: Optional[str] = field(default="triton")
86
+ model_max_length: int = field(
87
+ default=512,
88
+ metadata={
89
+ "help":
90
+ "Maximum sequence length. Sequences will be right padded (and possibly truncated)."
91
+ },
92
+ )
93
+ double_quant: bool = field(
94
+ default=True,
95
+ metadata={"help": "Compress the quantization statistics through double quantization."}
96
+ )
97
+ quant_type: str = field(
98
+ default="nf4",
99
+ metadata={"help": "Quantization data type to use. Should be one of `fp4` or `nf4`."}
100
+ )
101
+ bits: int = field(
102
+ default=16,
103
+ metadata={"help": "How many bits to use."}
104
+ )
105
+ lora_enable: bool = False
106
+ lora_r: int = 64
107
+ lora_alpha: int = 16
108
+ lora_dropout: float = 0.05
109
+ lora_weight_path: str = ""
110
+ lora_bias: str = "none"
111
+ mm_projector_lr: Optional[float] = None
112
+ group_by_modality_length: bool = field(default=False)
113
+
114
+
115
+ def maybe_zero_3(param, ignore_status=False, name=None):
116
+ from deepspeed import zero
117
+ from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus
118
+ if hasattr(param, "ds_id"):
119
+ if param.ds_status == ZeroParamStatus.NOT_AVAILABLE:
120
+ if not ignore_status:
121
+ logging.warning(f"{name}: param.ds_status == ZeroParamStatus.NOT_AVAILABLE: {param.ds_status}")
122
+ with zero.GatheredParameters([param]):
123
+ param = param.data.detach().cpu().clone()
124
+ else:
125
+ param = param.detach().cpu().clone()
126
+ return param
127
+
128
+
129
+ # Borrowed from peft.utils.get_peft_model_state_dict
130
+ def get_peft_state_maybe_zero_3(named_params, bias):
131
+ if bias == "none":
132
+ to_return = {k: t for k, t in named_params if "lora_" in k}
133
+ elif bias == "all":
134
+ to_return = {k: t for k, t in named_params if "lora_" in k or "bias" in k}
135
+ elif bias == "lora_only":
136
+ to_return = {}
137
+ maybe_lora_bias = {}
138
+ lora_bias_names = set()
139
+ for k, t in named_params:
140
+ if "lora_" in k:
141
+ to_return[k] = t
142
+ bias_name = k.split("lora_")[0] + "bias"
143
+ lora_bias_names.add(bias_name)
144
+ elif "bias" in k:
145
+ maybe_lora_bias[k] = t
146
+ for k, t in maybe_lora_bias.items():
146
+ if k in lora_bias_names:
147
+ to_return[k] = t
149
+ else:
150
+ raise NotImplementedError
151
+ to_return = {k: maybe_zero_3(v, ignore_status=True) for k, v in to_return.items()}
152
+ return to_return
153
+
154
+
155
+ def get_peft_state_non_lora_maybe_zero_3(named_params, require_grad_only=True):
156
+ to_return = {k: t for k, t in named_params if "lora_" not in k}
157
+ if require_grad_only:
158
+ to_return = {k: t for k, t in to_return.items() if t.requires_grad}
159
+ to_return = {k: maybe_zero_3(v, ignore_status=True).cpu() for k, v in to_return.items()}
160
+ return to_return
161
+
162
+
163
+ def get_mm_adapter_state_maybe_zero_3(named_params, keys_to_match):
164
+ to_return = {k: t for k, t in named_params if any(key_match in k for key_match in keys_to_match)}
165
+ to_return = {k: maybe_zero_3(v, ignore_status=True).cpu() for k, v in to_return.items()}
166
+ return to_return
167
+
168
+
169
+ def find_all_linear_names(model):
170
+ cls = torch.nn.Linear
171
+ lora_module_names = set()
172
+ multimodal_keywords = ['mm_projector', 'vision_tower', 'vision_resampler']
173
+ for name, module in model.named_modules():
174
+ if any(mm_keyword in name for mm_keyword in multimodal_keywords):
175
+ continue
176
+ if isinstance(module, cls):
177
+ names = name.split('.')
178
+ lora_module_names.add(names[0] if len(names) == 1 else names[-1])
179
+
180
+ if 'lm_head' in lora_module_names: # needed for 16-bit
181
+ lora_module_names.remove('lm_head')
182
+ return list(lora_module_names)
183
+
184
+
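The selection rule in `find_all_linear_names` can be illustrated without loading a model. The sketch below works on `(name, is_linear)` pairs instead of real `torch.nn` modules; the helper name `find_lora_target_names` and the module names are assumptions for illustration.

```python
def find_lora_target_names(
    named_modules,
    skip_keywords=("mm_projector", "vision_tower", "vision_resampler"),
):
    """Collect the leaf names of linear layers outside the multimodal parts."""
    targets = set()
    for name, is_linear in named_modules:
        if any(keyword in name for keyword in skip_keywords):
            continue  # multimodal components are not LoRA targets
        if is_linear:
            targets.add(name.split(".")[-1])
    targets.discard("lm_head")  # excluded, as in the function above
    return sorted(targets)


# Hypothetical module listing: (qualified name, is it an nn.Linear?).
modules = [
    ("model.layers.0.self_attn.q_proj", True),
    ("model.layers.0.self_attn.v_proj", True),
    ("model.layers.0.input_layernorm", False),
    ("model.mm_projector.0", True),
    ("lm_head", True),
]
print(find_lora_target_names(modules))
```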
185
+ def safe_save_model_for_hf_trainer(trainer: transformers.Trainer,
186
+ output_dir: str):
187
+ """Collects the state dict and dump to disk."""
188
+
189
+ if getattr(trainer.args, "tune_mm_mlp_adapter", False):
190
+ # Only save Adapter
191
+ keys_to_match = ['mm_projector']
192
+ if getattr(trainer.args, "use_im_start_end", False):
193
+ keys_to_match.extend(['embed_tokens', 'embed_in'])
194
+
195
+ weight_to_save = get_mm_adapter_state_maybe_zero_3(trainer.model.named_parameters(), keys_to_match)
196
+ trainer.model.config.save_pretrained(output_dir)
197
+
198
+ current_folder = output_dir.split('/')[-1]
199
+ parent_folder = os.path.dirname(output_dir)
200
+ if trainer.args.local_rank == 0 or trainer.args.local_rank == -1:
201
+ if current_folder.startswith('checkpoint-'):
202
+ mm_projector_folder = os.path.join(parent_folder, "mm_projector")
203
+ os.makedirs(mm_projector_folder, exist_ok=True)
204
+ torch.save(weight_to_save, os.path.join(mm_projector_folder, f'{current_folder}.bin'))
205
+ else:
206
+ torch.save(weight_to_save, os.path.join(output_dir, 'mm_projector.bin'))
207
+ return
208
+
209
+ if trainer.deepspeed:
210
+ torch.cuda.synchronize()
211
+ trainer.save_model(output_dir)
212
+ return
213
+
214
+ state_dict = trainer.model.state_dict()
215
+ if trainer.args.should_save:
216
+ cpu_state_dict = {
217
+ key: value.cpu()
218
+ for key, value in state_dict.items()
219
+ }
220
+ del state_dict
221
+ trainer._save(output_dir, state_dict=cpu_state_dict) # noqa
222
+
223
+
224
+ def smart_tokenizer_and_embedding_resize(
225
+ special_tokens_dict: Dict,
226
+ tokenizer: transformers.PreTrainedTokenizer,
227
+ model: transformers.PreTrainedModel,
228
+ ):
229
+ """Resize tokenizer and embedding.
230
+
231
+ Note: This is the unoptimized version; the resulting embedding size may not be divisible by 64.
232
+ """
233
+ num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict)
234
+ model.resize_token_embeddings(len(tokenizer))
235
+
236
+ if num_new_tokens > 0:
237
+ input_embeddings = model.get_input_embeddings().weight.data
238
+ output_embeddings = model.get_output_embeddings().weight.data
239
+
240
+ input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(
241
+ dim=0, keepdim=True)
242
+ output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(
243
+ dim=0, keepdim=True)
244
+
245
+ input_embeddings[-num_new_tokens:] = input_embeddings_avg
246
+ output_embeddings[-num_new_tokens:] = output_embeddings_avg
247
+
248
+
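`smart_tokenizer_and_embedding_resize` initializes the rows of newly added tokens to the mean of the existing rows. The following is a plain-Python sketch of that step, using nested lists instead of tensors; the helper name `mean_init_new_rows` and the tiny table are hypothetical.

```python
def mean_init_new_rows(embeddings, num_new_tokens):
    """Overwrite the last `num_new_tokens` rows with the mean of the old rows."""
    if num_new_tokens <= 0:
        return embeddings
    old_rows = embeddings[:-num_new_tokens]
    dim = len(old_rows[0])
    avg = [sum(row[d] for row in old_rows) / len(old_rows) for d in range(dim)]
    for r in range(len(embeddings) - num_new_tokens, len(embeddings)):
        embeddings[r] = list(avg)
    return embeddings


# Two existing rows plus one freshly added (zero) row.
table = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mean_init_new_rows(table, num_new_tokens=1)
print(table[-1])
```

Starting new tokens at the mean keeps them inside the distribution of existing embeddings rather than at an arbitrary random point.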
249
+ def _tokenize_fn(strings: Sequence[str],
250
+ tokenizer: transformers.PreTrainedTokenizer) -> Dict:
251
+ """Tokenize a list of strings."""
252
+ tokenized_list = [
253
+ tokenizer(
254
+ text,
255
+ return_tensors="pt",
256
+ padding="longest",
257
+ max_length=tokenizer.model_max_length,
258
+ truncation=True,
259
+ ) for text in strings
260
+ ]
261
+ input_ids = labels = [
262
+ tokenized.input_ids[0] for tokenized in tokenized_list
263
+ ]
264
+ input_ids_lens = labels_lens = [
265
+ tokenized.input_ids.ne(tokenizer.pad_token_id).sum().item()
266
+ for tokenized in tokenized_list
267
+ ]
268
+ return dict(
269
+ input_ids=input_ids,
270
+ labels=labels,
271
+ input_ids_lens=input_ids_lens,
272
+ labels_lens=labels_lens,
273
+ )
274
+
275
+
276
+ def _mask_targets(target, tokenized_lens, speakers):
277
+ # cur_idx = 0
278
+ cur_idx = tokenized_lens[0]
279
+ tokenized_lens = tokenized_lens[1:]
280
+ target[:cur_idx] = IGNORE_INDEX
281
+ for tokenized_len, speaker in zip(tokenized_lens, speakers):
282
+ if speaker == "human":
283
+ target[cur_idx+2:cur_idx + tokenized_len] = IGNORE_INDEX
284
+ cur_idx += tokenized_len
285
+
286
+
287
+ def _add_speaker_and_signal(header, source, get_conversation=True):
288
+ """Add speaker and start/end signal on each round."""
289
+ BEGIN_SIGNAL = "### "
290
+ END_SIGNAL = "\n"
291
+ conversation = header
292
+ for sentence in source:
293
+ from_str = sentence["from"]
294
+ if from_str.lower() == "human":
295
+ from_str = conversation_lib.default_conversation.roles[0]
296
+ elif from_str.lower() == "gpt":
297
+ from_str = conversation_lib.default_conversation.roles[1]
298
+ else:
299
+ from_str = 'unknown'
300
+ sentence["value"] = (BEGIN_SIGNAL + from_str + ": " +
301
+ sentence["value"] + END_SIGNAL)
302
+ if get_conversation:
303
+ conversation += sentence["value"]
304
+ conversation += BEGIN_SIGNAL
305
+ return conversation
306
+
307
+
308
+ def preprocess_multimodal(
309
+ sources: Sequence[str],
310
+ data_args: DataArguments
311
+ ) -> Dict:
312
+ is_multimodal = data_args.is_multimodal
313
+ if not is_multimodal:
314
+ return sources
315
+
316
+ for source in sources:
317
+ for sentence in source:
318
+ if DEFAULT_IMAGE_TOKEN in sentence['value']:
319
+ sentence['value'] = sentence['value'].replace(DEFAULT_IMAGE_TOKEN, '').strip()
320
+ sentence['value'] = DEFAULT_IMAGE_TOKEN + '\n' + sentence['value']
321
+ sentence['value'] = sentence['value'].strip()
322
+ if "mmtag" in conversation_lib.default_conversation.version:
323
+ sentence['value'] = sentence['value'].replace(DEFAULT_IMAGE_TOKEN, '<Image>' + DEFAULT_IMAGE_TOKEN + '</Image>')
324
+ replace_token = DEFAULT_IMAGE_TOKEN
325
+ if data_args.mm_use_im_start_end:
326
+ replace_token = DEFAULT_IM_START_TOKEN + replace_token + DEFAULT_IM_END_TOKEN
327
+ sentence["value"] = sentence["value"].replace(DEFAULT_IMAGE_TOKEN, replace_token)
328
+
329
+ return sources
330
+
331
+
332
+ def preprocess_llama_2(
333
+ sources,
334
+ tokenizer: transformers.PreTrainedTokenizer,
335
+ has_image: bool = False
336
+ ) -> Dict:
337
+ conv = conversation_lib.default_conversation.copy()
338
+ roles = {"human": conv.roles[0], "gpt": conv.roles[1]}
339
+
340
+ # Apply prompt templates
341
+ conversations = []
342
+ for i, source in enumerate(sources):
343
+ if roles[source[0]["from"]] != conv.roles[0]:
344
+ # Skip the first one if it is not from human
345
+ source = source[1:]
346
+
347
+ conv.messages = []
348
+ for j, sentence in enumerate(source):
349
+ role = roles[sentence["from"]]
350
+ assert role == conv.roles[j % 2], f"{i}"
351
+ conv.append_message(role, sentence["value"])
352
+ conversations.append(conv.get_prompt())
353
+
354
+ # Tokenize conversations
355
+
356
+ if has_image:
357
+ input_ids = torch.stack([tokenizer_image_token(prompt, tokenizer, return_tensors='pt') for prompt in conversations], dim=0)
358
+ else:
359
+ input_ids = tokenizer(
360
+ conversations,
361
+ return_tensors="pt",
362
+ padding="longest",
363
+ max_length=tokenizer.model_max_length,
364
+ truncation=True,
365
+ ).input_ids
366
+
367
+ targets = input_ids.clone()
368
+
369
+ assert conv.sep_style == conversation_lib.SeparatorStyle.LLAMA_2
370
+
371
+ # Mask targets
372
+ sep = "[/INST] "
373
+ for conversation, target in zip(conversations, targets):
374
+ total_len = int(target.ne(tokenizer.pad_token_id).sum())
375
+
376
+ rounds = conversation.split(conv.sep2)
377
+ cur_len = 1
378
+ target[:cur_len] = IGNORE_INDEX
379
+ for i, rou in enumerate(rounds):
380
+ if rou == "":
381
+ break
382
+
383
+ parts = rou.split(sep)
384
+ if len(parts) != 2:
385
+ break
386
+ parts[0] += sep
387
+
388
+ if has_image:
389
+ round_len = len(tokenizer_image_token(rou, tokenizer))
390
+ instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 2
391
+ else:
392
+ round_len = len(tokenizer(rou).input_ids)
393
+ instruction_len = len(tokenizer(parts[0]).input_ids) - 2
394
+
395
+ target[cur_len : cur_len + instruction_len] = IGNORE_INDEX
396
+
397
+ cur_len += round_len
398
+ target[cur_len:] = IGNORE_INDEX
399
+
400
+ if cur_len < tokenizer.model_max_length:
401
+ if cur_len != total_len:
402
+ target[:] = IGNORE_INDEX
403
+ print(
404
+ f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}."
405
+ f" (ignored)"
406
+ )
407
+
408
+ return dict(
409
+ input_ids=input_ids,
410
+ labels=targets,
411
+ )
412
+
413
+
414
+ def preprocess_v1(
415
+ sources,
416
+ tokenizer: transformers.PreTrainedTokenizer,
417
+ has_image: bool = False
418
+ ) -> Dict:
419
+ conv = conversation_lib.default_conversation.copy()
420
+ roles = {"human": conv.roles[0], "gpt": conv.roles[1]}
421
+
422
+ # Apply prompt templates
423
+ conversations = []
424
+ for i, source in enumerate(sources):
425
+ if roles[source[0]["from"]] != conv.roles[0]:
426
+ # Skip the first one if it is not from human
427
+ source = source[1:]
428
+
429
+ conv.messages = []
430
+ for j, sentence in enumerate(source):
431
+ role = roles[sentence["from"]]
432
+ assert role == conv.roles[j % 2], f"{i}"
433
+ conv.append_message(role, sentence["value"])
434
+ conversations.append(conv.get_prompt())
435
+
436
+ # Tokenize conversations
437
+
438
+ if has_image:
439
+ input_ids = torch.stack([tokenizer_image_token(prompt, tokenizer, return_tensors='pt') for prompt in conversations], dim=0)
440
+ else:
441
+ input_ids = tokenizer(
442
+ conversations,
443
+ return_tensors="pt",
444
+ padding="longest",
445
+ max_length=tokenizer.model_max_length,
446
+ truncation=True,
447
+ ).input_ids
448
+
449
+ targets = input_ids.clone()
450
+
451
+ assert conv.sep_style == conversation_lib.SeparatorStyle.TWO
452
+
453
+ # Mask targets
454
+ sep = conv.sep + conv.roles[1] + ": "
455
+ for conversation, target in zip(conversations, targets):
456
+ total_len = int(target.ne(tokenizer.pad_token_id).sum())
457
+
458
+ rounds = conversation.split(conv.sep2)
459
+ cur_len = 1
460
+ target[:cur_len] = IGNORE_INDEX
461
+ for i, rou in enumerate(rounds):
462
+ if rou == "":
463
+ break
464
+
465
+ parts = rou.split(sep)
466
+ if len(parts) != 2:
467
+ break
468
+ parts[0] += sep
469
+
470
+ if has_image:
471
+ round_len = len(tokenizer_image_token(rou, tokenizer))
472
+ instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 2
473
+ else:
474
+ round_len = len(tokenizer(rou).input_ids)
475
+ instruction_len = len(tokenizer(parts[0]).input_ids) - 2
476
+
477
+ if i != 0 and not tokenizer.legacy and IS_TOKENIZER_GREATER_THAN_0_14:
478
+ round_len -= 1
479
+ instruction_len -= 1
480
+
481
+ target[cur_len : cur_len + instruction_len] = IGNORE_INDEX
482
+
483
+ cur_len += round_len
484
+ target[cur_len:] = IGNORE_INDEX
485
+
486
+ if cur_len < tokenizer.model_max_length:
487
+ if cur_len != total_len:
488
+ target[:] = IGNORE_INDEX
489
+ print(
490
+ f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}."
491
+ f" (ignored)"
492
+ )
493
+
494
+ return dict(
495
+ input_ids=input_ids,
496
+ labels=targets,
497
+ )
498
+
499
+
500
+ def preprocess_mpt(
501
+ sources,
502
+ tokenizer: transformers.PreTrainedTokenizer,
503
+ has_image: bool = False
504
+ ) -> Dict:
505
+ conv = conversation_lib.default_conversation.copy()
506
+ roles = {"human": conv.roles[0], "gpt": conv.roles[1]}
507
+
508
+ # Apply prompt templates
509
+ conversations = []
510
+ for i, source in enumerate(sources):
511
+ if roles[source[0]["from"]] != conv.roles[0]:
512
+ # Skip the first one if it is not from human
513
+ source = source[1:]
514
+
515
+ conv.messages = []
516
+ for j, sentence in enumerate(source):
517
+ role = roles[sentence["from"]]
518
+ assert role == conv.roles[j % 2], f"{i}"
519
+ conv.append_message(role, sentence["value"])
520
+ conversations.append(conv.get_prompt())
521
+
522
+ # Tokenize conversations
523
+
524
+ if has_image:
525
+ input_ids = torch.stack([tokenizer_image_token(prompt, tokenizer, return_tensors='pt') for prompt in conversations], dim=0)
526
+ else:
527
+ input_ids = tokenizer(
528
+ conversations,
529
+ return_tensors="pt",
530
+ padding="longest",
531
+ max_length=tokenizer.model_max_length,
532
+ truncation=True,
533
+ ).input_ids
534
+
535
+ targets = input_ids.clone()
536
+ assert conv.sep_style == conversation_lib.SeparatorStyle.MPT
537
+
538
+ # Mask targets
539
+ sep = conv.sep + conv.roles[1]
540
+ for conversation, target in zip(conversations, targets):
541
+ total_len = int(target.ne(tokenizer.pad_token_id).sum())
542
+
543
+ rounds = conversation.split(conv.sep)
544
+ re_rounds = [conv.sep.join(rounds[:3])] # system + user + gpt
545
+ for conv_idx in range(3, len(rounds), 2):
546
+ re_rounds.append(conv.sep.join(rounds[conv_idx:conv_idx+2])) # user + gpt
547
+ cur_len = 0
548
+ target[:cur_len] = IGNORE_INDEX
549
+ for i, rou in enumerate(re_rounds):
550
+ if rou == "":
551
+ break
552
+
553
+ parts = rou.split(sep)
554
+ if len(parts) != 2:
555
+ break
556
+ parts[0] += sep
557
+
558
+ if has_image:
559
+ round_len = len(tokenizer_image_token(rou, tokenizer))
560
+ instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 1
561
+ else:
562
+ round_len = len(tokenizer(rou).input_ids)
563
+ instruction_len = len(tokenizer(parts[0]).input_ids) - 1
564
+
565
+ if i != 0 and getattr(tokenizer, 'legacy', False) and IS_TOKENIZER_GREATER_THAN_0_14:
566
+ round_len += 1
567
+ instruction_len += 1
568
+
569
+ target[cur_len : cur_len + instruction_len] = IGNORE_INDEX
570
+
571
+ cur_len += round_len
572
+ target[cur_len:] = IGNORE_INDEX
573
+
574
+ if cur_len < tokenizer.model_max_length:
575
+ if cur_len != total_len:
576
+ target[:] = IGNORE_INDEX
577
+ print(
578
+ f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}."
579
+ f" (ignored)"
580
+ )
581
+
582
+ return dict(
583
+ input_ids=input_ids,
584
+ labels=targets,
585
+ )
586
+
587
+
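The masking loop shared by the `preprocess_*` functions above can be sketched without a tokenizer. This is a minimal illustration, not the repository's code: `mask_instruction_tokens` is a hypothetical helper, the per-round token counts are passed in directly instead of being measured, and `IGNORE_INDEX = -100` is an assumed value, since the constant is defined outside this section. The tokenization-mismatch check that masks the whole sample is left out.

```python
from typing import List, Tuple

# Assumed value: the real constant is imported from elsewhere in the repository.
IGNORE_INDEX = -100


def mask_instruction_tokens(
    labels: List[int],
    rounds: List[Tuple[int, int]],
) -> List[int]:
    """Return a copy of `labels` with every instruction span set to IGNORE_INDEX.

    `rounds` holds one (round_len, instruction_len) pair per conversation
    round, both measured in tokens.
    """
    masked = list(labels)
    cur_len = 0
    for round_len, instruction_len in rounds:
        # Hide the instruction part of this round from the loss.
        for k in range(cur_len, min(cur_len + instruction_len, len(masked))):
            masked[k] = IGNORE_INDEX
        cur_len += round_len
    # Everything after the last round (for example padding) is hidden as well.
    for k in range(cur_len, len(masked)):
        masked[k] = IGNORE_INDEX
    return masked


# Two rounds of four tokens each, followed by two padding positions.
example = mask_instruction_tokens(list(range(10, 20)), [(4, 3), (4, 2)])
```

Only the answer tokens of each round keep their original ids, so the loss is computed on the assistant's replies alone.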
588
+ def preprocess_plain(
589
+ sources: Sequence[str],
590
+ tokenizer: transformers.PreTrainedTokenizer,
591
+ ) -> Dict:
592
+ # add end signal and concatenate together
593
+ conversations = []
594
+ for source in sources:
595
+ assert len(source) == 2
596
+ assert DEFAULT_IMAGE_TOKEN in source[0]['value']
597
+ source[0]['value'] = DEFAULT_IMAGE_TOKEN
598
+ conversation = source[0]['value'] + source[1]['value'] + conversation_lib.default_conversation.sep
599
+ conversations.append(conversation)
600
+ # tokenize conversations
601
+ input_ids = [tokenizer_image_token(prompt, tokenizer, return_tensors='pt') for prompt in conversations]
602
+ targets = copy.deepcopy(input_ids)
603
+ for target, source in zip(targets, sources):
604
+ tokenized_len = len(tokenizer_image_token(source[0]['value'], tokenizer))
605
+ target[:tokenized_len] = IGNORE_INDEX
606
+
607
+ return dict(input_ids=input_ids, labels=targets)
608
+
609
+
610
+ def preprocess(
611
+ sources: Sequence[str],
612
+ tokenizer: transformers.PreTrainedTokenizer,
613
+ has_image: bool = False
614
+ ) -> Dict:
615
+ """
616
+ Given a list of sources, each is a conversation list. This transform:
617
+ 1. Add signal '### ' at the beginning of each sentence, with end signal '\n';
618
+ 2. Concatenate conversations together;
619
+ 3. Tokenize the concatenated conversation;
620
+ 4. Make a deepcopy as the target. Mask human words with IGNORE_INDEX.
621
+ """
622
+ if conversation_lib.default_conversation.sep_style == conversation_lib.SeparatorStyle.PLAIN:
623
+ return preprocess_plain(sources, tokenizer)
624
+ if conversation_lib.default_conversation.sep_style == conversation_lib.SeparatorStyle.LLAMA_2:
625
+ return preprocess_llama_2(sources, tokenizer, has_image=has_image)
626
+ if conversation_lib.default_conversation.version.startswith("v1"):
627
+ return preprocess_v1(sources, tokenizer, has_image=has_image)
628
+ if conversation_lib.default_conversation.version == "mpt":
629
+ return preprocess_mpt(sources, tokenizer, has_image=has_image)
630
+ # add end signal and concatenate together
631
+ conversations = []
632
+ for source in sources:
633
+ header = f"{conversation_lib.default_conversation.system}\n\n"
634
+ conversation = _add_speaker_and_signal(header, source)
635
+ conversations.append(conversation)
636
+ # tokenize conversations
637
+ def get_tokenize_len(prompts):
638
+ return [len(tokenizer_image_token(prompt, tokenizer)) for prompt in prompts]
639
+
640
+ if has_image:
641
+ input_ids = [tokenizer_image_token(prompt, tokenizer, return_tensors='pt') for prompt in conversations]
642
+ else:
643
+ conversations_tokenized = _tokenize_fn(conversations, tokenizer)
644
+ input_ids = conversations_tokenized["input_ids"]
645
+
646
+ targets = copy.deepcopy(input_ids)
647
+ for target, source in zip(targets, sources):
648
+ if has_image:
649
+ tokenized_lens = get_tokenize_len([header] + [s["value"] for s in source])
650
+ else:
651
+ tokenized_lens = _tokenize_fn([header] + [s["value"] for s in source], tokenizer)["input_ids_lens"]
652
+ speakers = [sentence["from"] for sentence in source]
653
+ _mask_targets(target, tokenized_lens, speakers)
654
+
655
+ return dict(input_ids=input_ids, labels=targets)
656
+
657
+
658
+ class LazySupervisedDataset(Dataset):
659
+ """Dataset for supervised fine-tuning."""
660
+
661
+ def __init__(self, data_path: str,
662
+ tokenizer: transformers.PreTrainedTokenizer,
663
+ data_args: DataArguments):
664
+ super(LazySupervisedDataset, self).__init__()
665
+ list_data_dict = json.loads(pathlib.Path(data_path).read_text())
666
+
667
+ rank0_print("Formatting inputs...Skip in lazy mode")
668
+ self.tokenizer = tokenizer
669
+ self.list_data_dict = list_data_dict
670
+ self.data_args = data_args
671
+
672
+ def __len__(self):
673
+ return len(self.list_data_dict)
674
+
675
+ @property
676
+ def lengths(self):
677
+ length_list = []
678
+ for sample in self.list_data_dict:
679
+ img_tokens = 128 if 'image' in sample else 0
680
+ length_list.append(sum(len(conv['value'].split()) for conv in sample['conversations']) + img_tokens)
681
+ return length_list
682
+
683
+ @property
684
+ def modality_lengths(self):
685
+ length_list = []
686
+ for sample in self.list_data_dict:
687
+ cur_len = sum(len(conv['value'].split()) for conv in sample['conversations'])
688
+ cur_len = cur_len if 'image' in sample else -cur_len
689
+ length_list.append(cur_len)
690
+ return length_list
691
+
692
+ def __getitem__(self, i) -> Dict[str, torch.Tensor]:
693
+ sources = self.list_data_dict[i]
694
+ if isinstance(i, int):
695
+ sources = [sources]
696
+ assert len(sources) == 1, "Don't know why it is wrapped to a list" # FIXME
697
+ if 'image' in sources[0]:
698
+ image_file = self.list_data_dict[i]['image']
699
+ image_folder = self.data_args.image_folder
700
+ processor = self.data_args.image_processor
701
+ image = Image.open(os.path.join(image_folder, image_file)).convert('RGB')
702
+ if self.data_args.image_aspect_ratio == 'pad':
703
+ def expand2square(pil_img, background_color):
704
+ width, height = pil_img.size
705
+ if width == height:
706
+ return pil_img
707
+ elif width > height:
708
+ result = Image.new(pil_img.mode, (width, width), background_color)
709
+ result.paste(pil_img, (0, (width - height) // 2))
710
+ return result
711
+ else:
712
+ result = Image.new(pil_img.mode, (height, height), background_color)
713
+ result.paste(pil_img, ((height - width) // 2, 0))
714
+ return result
715
+ image = expand2square(image, tuple(int(x*255) for x in processor.image_mean))
716
+ image = processor.preprocess(image, return_tensors='pt')['pixel_values'][0]
717
+ else:
718
+ image = processor.preprocess(image, return_tensors='pt')['pixel_values'][0]
719
+ sources = preprocess_multimodal(
720
+ copy.deepcopy([e["conversations"] for e in sources]),
721
+ self.data_args)
722
+ else:
723
+ sources = copy.deepcopy([e["conversations"] for e in sources])
724
+ data_dict = preprocess(
725
+ sources,
726
+ self.tokenizer,
727
+ has_image=('image' in self.list_data_dict[i]))
728
+ if isinstance(i, int):
729
+ data_dict = dict(input_ids=data_dict["input_ids"][0],
730
+ labels=data_dict["labels"][0])
731
+
732
+ # image exists in the data
733
+ if 'image' in self.list_data_dict[i]:
734
+ data_dict['image'] = image
735
+ elif self.data_args.is_multimodal:
736
+ # image does not exist in the data, but the model is multimodal
737
+ crop_size = self.data_args.image_processor.crop_size
738
+ data_dict['image'] = torch.zeros(3, crop_size['height'], crop_size['width'])
739
+ return data_dict
740
+
741
+
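The geometry behind the `'pad'` aspect-ratio branch in `__getitem__` can be checked without an image library. The sketch below is an illustration under assumptions: `square_canvas_and_offset` and `fill_color_from_mean` are hypothetical names, and the mean of `(0.5, 0.5, 0.5)` is a placeholder, not the value of any particular image processor.

```python
from typing import Sequence, Tuple


def square_canvas_and_offset(width: int, height: int) -> Tuple[int, Tuple[int, int]]:
    """Return the side of the square canvas and the (x, y) paste offset.

    The shorter dimension is centred on the canvas; a square image
    keeps its size and is pasted at the origin.
    """
    side = max(width, height)
    if width >= height:
        offset = (0, (width - height) // 2)
    else:
        offset = ((height - width) // 2, 0)
    return side, offset


def fill_color_from_mean(image_mean: Sequence[float]) -> Tuple[int, ...]:
    """Scale a per-channel mean in [0, 1] to an 8-bit RGB fill colour."""
    return tuple(int(x * 255) for x in image_mean)


landscape = square_canvas_and_offset(640, 480)   # wide image, padded top and bottom
portrait = square_canvas_and_offset(300, 500)    # tall image, padded left and right
background = fill_color_from_mean((0.5, 0.5, 0.5))  # placeholder mean
```

Filling the border with the processor's mean colour means the padded pixels sit close to zero after normalization.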
742
+ @dataclass
743
+ class DataCollatorForSupervisedDataset(object):
744
+ """Collate examples for supervised fine-tuning."""
745
+
746
+ tokenizer: transformers.PreTrainedTokenizer
747
+
748
+ def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
749
+ input_ids, labels = tuple([instance[key] for instance in instances]
750
+ for key in ("input_ids", "labels"))
751
+ input_ids = torch.nn.utils.rnn.pad_sequence(
752
+ input_ids,
753
+ batch_first=True,
754
+ padding_value=self.tokenizer.pad_token_id)
755
+ labels = torch.nn.utils.rnn.pad_sequence(labels,
756
+ batch_first=True,
757
+ padding_value=IGNORE_INDEX)
758
+ input_ids = input_ids[:, :self.tokenizer.model_max_length]
759
+ labels = labels[:, :self.tokenizer.model_max_length]
760
+ batch = dict(
761
+ input_ids=input_ids,
762
+ labels=labels,
763
+ attention_mask=input_ids.ne(self.tokenizer.pad_token_id),
764
+ )
765
+
766
+ if 'image' in instances[0]:
767
+ images = [instance['image'] for instance in instances]
768
+ if all(x is not None and x.shape == images[0].shape for x in images):
769
+ batch['images'] = torch.stack(images)
770
+ else:
771
+ batch['images'] = images
772
+
773
+ return batch
774
+
775
+
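What the collator does to a batch can be shown with plain lists. This is a hedged sketch: `pad_and_truncate` and `collate` are hypothetical stand-ins for the tensor operations above, `IGNORE_INDEX = -100` is an assumed value, and the image handling is omitted.

```python
from typing import Dict, List, Sequence

# Assumed value: the real constant is imported from elsewhere in the repository.
IGNORE_INDEX = -100


def pad_and_truncate(
    sequences: Sequence[Sequence[int]], padding_value: int, max_length: int
) -> List[List[int]]:
    """Right-pad every sequence to the longest one, then cut to max_length."""
    longest = max(len(seq) for seq in sequences)
    padded = [list(seq) + [padding_value] * (longest - len(seq)) for seq in sequences]
    return [row[:max_length] for row in padded]


def collate(instances: Sequence[Dict[str, List[int]]], pad_token_id: int, max_length: int) -> Dict:
    """Build input_ids, labels and attention_mask for one batch."""
    input_ids = pad_and_truncate([x["input_ids"] for x in instances], pad_token_id, max_length)
    labels = pad_and_truncate([x["labels"] for x in instances], IGNORE_INDEX, max_length)
    # The mask is derived from the ids, so any token equal to pad_token_id
    # is treated as padding, even if it occurred in the real input.
    attention_mask = [[tok != pad_token_id for tok in row] for row in input_ids]
    return dict(input_ids=input_ids, labels=labels, attention_mask=attention_mask)


batch = collate(
    [
        {"input_ids": [5, 6, 7], "labels": [-100, 6, 7]},
        {"input_ids": [8], "labels": [8]},
    ],
    pad_token_id=0,
    max_length=2,
)
```

Labels are padded with `IGNORE_INDEX` instead of the pad token id, so padded positions never contribute to the loss.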
776
+ def make_supervised_data_module(tokenizer: transformers.PreTrainedTokenizer,
777
+ data_args) -> Dict:
778
+ """Make dataset and collator for supervised fine-tuning."""
779
+ train_dataset = LazySupervisedDataset(tokenizer=tokenizer,
780
+ data_path=data_args.data_path,
781
+ data_args=data_args)
782
+ data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer)
783
+ return dict(train_dataset=train_dataset,
784
+ eval_dataset=None,
785
+ data_collator=data_collator)
786
+
787
+
788
+ def train(attn_implementation=None):
789
+ global local_rank
790
+
791
+ parser = transformers.HfArgumentParser(
792
+ (ModelArguments, DataArguments, TrainingArguments))
793
+ model_args, data_args, training_args = parser.parse_args_into_dataclasses()
794
+ local_rank = training_args.local_rank
795
+ compute_dtype = (torch.float16 if training_args.fp16 else (torch.bfloat16 if training_args.bf16 else torch.float32))
796
+
797
+ bnb_model_from_pretrained_args = {}
798
+ if training_args.bits in [4, 8]:
799
+ from transformers import BitsAndBytesConfig
800
+ bnb_model_from_pretrained_args.update(dict(
801
+ device_map={"": training_args.device},
802
+ load_in_4bit=training_args.bits == 4,
803
+ load_in_8bit=training_args.bits == 8,
804
+ quantization_config=BitsAndBytesConfig(
805
+ load_in_4bit=training_args.bits == 4,
806
+ load_in_8bit=training_args.bits == 8,
807
+ llm_int8_skip_modules=["mm_projector"],
808
+ llm_int8_threshold=6.0,
809
+ llm_int8_has_fp16_weight=False,
810
+ bnb_4bit_compute_dtype=compute_dtype,
811
+ bnb_4bit_use_double_quant=training_args.double_quant,
812
+ bnb_4bit_quant_type=training_args.quant_type # {'fp4', 'nf4'}
813
+ )
814
+ ))
815
+
816
+ if model_args.vision_tower is not None:
817
+ if 'mpt' in model_args.model_name_or_path:
818
+ config = transformers.AutoConfig.from_pretrained(model_args.model_name_or_path, trust_remote_code=True)
819
+ config.attn_config['attn_impl'] = training_args.mpt_attn_impl
820
+ model = LlavaMptForCausalLM.from_pretrained(
821
+ model_args.model_name_or_path,
822
+ config=config,
823
+ cache_dir=training_args.cache_dir,
824
+ **bnb_model_from_pretrained_args
825
+ )
826
+ else:
827
+ model = LlavaLlamaForCausalLM.from_pretrained(
828
+ model_args.model_name_or_path,
829
+ cache_dir=training_args.cache_dir,
830
+ attn_implementation=attn_implementation,
831
+ torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
832
+ **bnb_model_from_pretrained_args
833
+ )
834
+ else:
835
+ model = transformers.LlamaForCausalLM.from_pretrained(
836
+ model_args.model_name_or_path,
837
+ cache_dir=training_args.cache_dir,
838
+ attn_implementation=attn_implementation,
839
+ torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
840
+ **bnb_model_from_pretrained_args
841
+ )
842
+ model.config.use_cache = False
843
+
844
+ if model_args.freeze_backbone:
845
+ model.model.requires_grad_(False)
846
+
847
+ if training_args.bits in [4, 8]:
848
+ from peft import prepare_model_for_kbit_training
849
+ model.config.torch_dtype=(torch.float32 if training_args.fp16 else (torch.bfloat16 if training_args.bf16 else torch.float32))
850
+ model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=training_args.gradient_checkpointing)
851
+
852
+ if training_args.gradient_checkpointing:
853
+ if hasattr(model, "enable_input_require_grads"):
854
+ model.enable_input_require_grads()
855
+ else:
856
+ def make_inputs_require_grad(module, input, output):
857
+ output.requires_grad_(True)
858
+ model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)
859
+
860
+ if training_args.lora_enable:
861
+ from peft import LoraConfig, get_peft_model
862
+ lora_config = LoraConfig(
863
+ r=training_args.lora_r,
864
+ lora_alpha=training_args.lora_alpha,
865
+ target_modules=find_all_linear_names(model),
866
+ lora_dropout=training_args.lora_dropout,
867
+ bias=training_args.lora_bias,
868
+ task_type="CAUSAL_LM",
869
+ )
870
+ if training_args.bits == 16:
871
+ if training_args.bf16:
872
+ model.to(torch.bfloat16)
873
+ if training_args.fp16:
874
+ model.to(torch.float16)
875
+ rank0_print("Adding LoRA adapters...")
876
+ model = get_peft_model(model, lora_config)
877
+
878
+ if 'mpt' in model_args.model_name_or_path:
879
+ tokenizer = transformers.AutoTokenizer.from_pretrained(
880
+ model_args.model_name_or_path,
881
+ cache_dir=training_args.cache_dir,
882
+ model_max_length=training_args.model_max_length,
883
+ padding_side="right"
884
+ )
885
+ else:
886
+ tokenizer = transformers.AutoTokenizer.from_pretrained(
887
+ model_args.model_name_or_path,
888
+ cache_dir=training_args.cache_dir,
889
+ model_max_length=training_args.model_max_length,
890
+ padding_side="right",
891
+ use_fast=False,
892
+ )
893
+
894
+ if model_args.version == "v0":
895
+ if tokenizer.pad_token is None:
896
+ smart_tokenizer_and_embedding_resize(
897
+ special_tokens_dict=dict(pad_token="[PAD]"),
898
+ tokenizer=tokenizer,
899
+ model=model,
900
+ )
901
+ elif model_args.version == "v0.5":
902
+ tokenizer.pad_token = tokenizer.unk_token
903
+ else:
904
+ tokenizer.pad_token = tokenizer.unk_token
905
+ if model_args.version in conversation_lib.conv_templates:
906
+ conversation_lib.default_conversation = conversation_lib.conv_templates[model_args.version]
907
+ else:
908
+ conversation_lib.default_conversation = conversation_lib.conv_templates["vicuna_v1"]
909
+
910
+ if model_args.vision_tower is not None:
911
+ model.get_model().initialize_vision_modules(
912
+ model_args=model_args,
913
+ fsdp=training_args.fsdp
914
+ )
915
+
916
+ vision_tower = model.get_vision_tower()
917
+ vision_tower.to(dtype=torch.bfloat16 if training_args.bf16 else torch.float16, device=training_args.device)
918
+
919
+ data_args.image_processor = vision_tower.image_processor
920
+ data_args.is_multimodal = True
921
+
922
+ model.config.image_aspect_ratio = data_args.image_aspect_ratio
923
+ model.config.tokenizer_padding_side = tokenizer.padding_side
924
+ model.config.tokenizer_model_max_length = tokenizer.model_max_length
925
+
926
+ model.config.tune_mm_mlp_adapter = training_args.tune_mm_mlp_adapter = model_args.tune_mm_mlp_adapter
927
+ if model_args.tune_mm_mlp_adapter:
928
+ model.requires_grad_(False)
929
+ for p in model.get_model().mm_projector.parameters():
930
+ p.requires_grad = True
931
+
932
+ model.config.freeze_mm_mlp_adapter = training_args.freeze_mm_mlp_adapter
933
+ if training_args.freeze_mm_mlp_adapter:
934
+ for p in model.get_model().mm_projector.parameters():
935
+ p.requires_grad = False
936
+
937
+ if training_args.bits in [4, 8]:
938
+ model.get_model().mm_projector.to(dtype=compute_dtype, device=training_args.device)
939
+
940
+ model.config.mm_use_im_start_end = data_args.mm_use_im_start_end = model_args.mm_use_im_start_end
941
+ model.config.mm_projector_lr = training_args.mm_projector_lr
942
+ training_args.use_im_start_end = model_args.mm_use_im_start_end
943
+ model.config.mm_use_im_patch_token = model_args.mm_use_im_patch_token
944
+ model.initialize_vision_tokenizer(model_args, tokenizer=tokenizer)
945
+
946
+ if training_args.bits in [4, 8]:
947
+ from peft.tuners.lora import LoraLayer
948
+ for name, module in model.named_modules():
949
+ if isinstance(module, LoraLayer):
950
+ if training_args.bf16:
951
+ module = module.to(torch.bfloat16)
952
+ if 'norm' in name:
953
+ module = module.to(torch.float32)
954
+ if 'lm_head' in name or 'embed_tokens' in name:
955
+ if hasattr(module, 'weight'):
956
+ if training_args.bf16 and module.weight.dtype == torch.float32:
957
+ module = module.to(torch.bfloat16)
958
+
959
+ data_module = make_supervised_data_module(tokenizer=tokenizer,
960
+ data_args=data_args)
961
+ trainer = LLaVATrainer(model=model,
962
+ tokenizer=tokenizer,
963
+ args=training_args,
964
+ **data_module)
965
+
966
+ if list(pathlib.Path(training_args.output_dir).glob("checkpoint-*")):
967
+ trainer.train(resume_from_checkpoint=True)
968
+ else:
969
+ trainer.train()
970
+ trainer.save_state()
971
+
972
+ model.config.use_cache = True
973
+
974
+ if training_args.lora_enable:
975
+ state_dict = get_peft_state_maybe_zero_3(
976
+ model.named_parameters(), training_args.lora_bias
977
+ )
978
+ non_lora_state_dict = get_peft_state_non_lora_maybe_zero_3(
979
+ model.named_parameters()
980
+ )
981
+ if training_args.local_rank == 0 or training_args.local_rank == -1:
982
+ model.config.save_pretrained(training_args.output_dir)
983
+ model.save_pretrained(training_args.output_dir, state_dict=state_dict)
984
+ torch.save(non_lora_state_dict, os.path.join(training_args.output_dir, 'non_lora_trainables.bin'))
985
+ else:
986
+ safe_save_model_for_hf_trainer(trainer=trainer,
987
+ output_dir=training_args.output_dir)
988
+
989
+
990
+ if __name__ == "__main__":
991
+ train()
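The resume decision at the end of `train()` depends only on whether the output directory already contains an entry named `checkpoint-*`. A small standard-library sketch of that check follows; `has_checkpoint` and `demo_resume_check` are hypothetical names used for illustration.

```python
import pathlib
import tempfile
from typing import Tuple


def has_checkpoint(output_dir: str) -> bool:
    """True if output_dir contains at least one entry named checkpoint-*."""
    return any(pathlib.Path(output_dir).glob("checkpoint-*"))


def demo_resume_check() -> Tuple[bool, bool]:
    """Check an empty directory, then the same directory with a checkpoint."""
    with tempfile.TemporaryDirectory() as output_dir:
        before = has_checkpoint(output_dir)
        (pathlib.Path(output_dir) / "checkpoint-500").mkdir()
        after = has_checkpoint(output_dir)
    return before, after
```

The glob matches on the name alone, so a partially written or empty `checkpoint-*` directory would also trigger a resume attempt.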
LLaVA/llava/train/train_mem.py ADDED
@@ -0,0 +1,4 @@
1
+ from llava.train.train import train
2
+
3
+ if __name__ == "__main__":
4
+ train(attn_implementation="flash_attention_2")
LLaVA/llava/train/train_xformers.py ADDED
@@ -0,0 +1,13 @@
1
+ # Make it more memory efficient by monkey patching the LLaMA model with xformers attention.
2
+
3
+ # Need to call this before importing transformers.
4
+ from llava.train.llama_xformers_attn_monkey_patch import (
5
+ replace_llama_attn_with_xformers_attn,
6
+ )
7
+
8
+ replace_llama_attn_with_xformers_attn()
9
+
10
+ from llava.train.train import train
11
+
12
+ if __name__ == "__main__":
13
+ train()
LLaVA/playground/data/prompts/complex_reasoning/000_caps.txt ADDED
@@ -0,0 +1,18 @@
1
+ A man wearing multiple neck ties making a goofy face.
2
+ A man in a white shirt wearing very many ties.
3
+ a man with ties on poses for a picture
4
+ A man wearing multiple ties on his neck.
5
+ A young man smiles while wearing several ties.
6
+
7
+ tie: [0.574, 0.298, 0.752, 0.704]
8
+ tie: [0.464, 0.339, 0.639, 0.789]
9
+ tie: [0.349, 0.363, 0.563, 0.732]
10
+ tie: [0.259, 0.255, 0.668, 0.805]
11
+ person: [0.019, 0.065, 0.962, 0.988]
12
+ person: [0.0, 0.24, 0.214, 1.0]
13
+ tie: [0.316, 0.778, 0.443, 0.867]
14
+ tie: [0.386, 0.707, 0.496, 0.801]
15
+ tie: [0.251, 0.354, 0.402, 0.747]
16
+ tie: [0.44, 0.318, 0.63, 0.397]
17
+ tie: [0.566, 0.33, 0.655, 0.543]
18
+ tie: [0.25, 0.409, 0.359, 0.556]
LLaVA/playground/data/prompts/complex_reasoning/000_conv.txt ADDED
@@ -0,0 +1,5 @@
1
+ Question:
2
+ What is unusual about this photo?
3
+ ===
4
+ Answer:
5
+ In the photo, the man is wearing a total of ten ties around his neck. This is considered unusual because it is not common for someone to wear multiple ties at once. In a typical setting, a person would wear only one tie. The man's goofy facial expression also adds to the unconventional and humorous nature of the image.
LLaVA/playground/data/prompts/complex_reasoning/001_caps.txt ADDED
@@ -0,0 +1,18 @@
1
+ A group of people standing outside of a black vehicle with various luggage.
2
+ Luggage surrounds a vehicle in an underground parking area
3
+ People try to fit all of their luggage in an SUV.
4
+ The sport utility vehicle is parked in the public garage, being packed for a trip
5
+ Some people with luggage near a van that is transporting it.
6
+
7
+ person: [0.681, 0.242, 0.774, 0.694]
8
+ person: [0.63, 0.222, 0.686, 0.516]
9
+ person: [0.444, 0.233, 0.487, 0.34]
10
+ backpack: [0.384, 0.696, 0.485, 0.914]
11
+ backpack: [0.755, 0.413, 0.846, 0.692]
12
+ suitcase: [0.758, 0.413, 0.845, 0.69]
13
+ suitcase: [0.1, 0.497, 0.173, 0.579]
14
+ bicycle: [0.282, 0.363, 0.327, 0.442]
15
+ car: [0.786, 0.25, 0.848, 0.322]
16
+ car: [0.783, 0.27, 0.827, 0.335]
17
+ car: [0.86, 0.254, 0.891, 0.3]
18
+ car: [0.261, 0.101, 0.787, 0.626]
LLaVA/playground/data/prompts/complex_reasoning/001_conv.txt ADDED
@@ -0,0 +1,5 @@
1
+ Question:
2
+ What challenges do these people face?
3
+ ===
4
+ Answer:
5
+ In the image, a group of people is standing outside a black SUV in a parking area, surrounded by various pieces of luggage, including suitcases and backpacks. They are facing the challenge of fitting all their luggage into the black SUV. There are multiple suitcases and backpacks to be packed, which suggests that the group has a significant amount of belongings to accommodate. They might have to strategize and arrange the luggage efficiently to ensure that everything fits properly into the vehicle. Additionally, they need to consider the comfort of the passengers and visibility while driving, so the placement of the luggage must not obstruct the driver's view or make the passengers uncomfortable during the trip.
LLaVA/playground/data/prompts/complex_reasoning/002_caps.txt ADDED
@@ -0,0 +1,7 @@
1
+ There is a movie theater that displays the show times above the doors.
2
+ A red fire hydrant is deep in the snow.
3
+ The fire hydrant is in the snow near a recently plowed sidewalk.
4
+ This city has had a very hard winter with snow.
5
+ A hotel for dogs in the snow in winter.
6
+
7
+ fire hydrant: [0.326, 0.612, 0.426, 0.72]
LLaVA/playground/data/prompts/complex_reasoning/002_conv.txt ADDED
@@ -0,0 +1,5 @@
1
+ Question:
2
+ What challenges might this city face?
3
+ ===
4
+ Answer:
5
+ The city faces challenges due to the harsh winter conditions and heavy snowfall. In the image, a red fire hydrant is almost buried deep in the snow, which indicates the significant amount of snow the city has experienced. This can lead to various challenges such as difficulties in transportation, increased risk of accidents, and disruptions to daily life. For example, the recently plowed sidewalk near the fire hydrant shows that the city has to constantly clear snow from roads and sidewalks to maintain access and safety for pedestrians and vehicles. Moreover, emergency services, like firefighters, might face challenges accessing crucial equipment, such as fire hydrants, during emergencies due to the snow accumulation. This highlights the importance of effective snow management strategies and preparedness in such cities to minimize the impact of harsh winter conditions on residents and essential services.
LLaVA/playground/data/prompts/complex_reasoning/system_message.txt ADDED
@@ -0,0 +1,10 @@
1
+ You are an AI visual assistant that can analyze a single image. You receive five sentences, each describing the same image you are observing. In addition, specific object locations within the image are given, along with detailed coordinates. These coordinates are in the form of bounding boxes, represented as (x1, y1, x2, y2) with floating numbers ranging from 0 to 1. These values correspond to the top left x, top left y, bottom right x, and bottom right y.
2
+
3
+ The task is to use the provided caption and bounding box information, create a plausible question about the image, and provide the answer in detail.
4
+
5
+ Create complex questions beyond describing the scene.
6
+ To answer such questions, one should require first understanding the visual content, then based on the background knowledge or reasoning, either explain why the things are happening that way, or provide guides and help to user's request. Make the question challenging by not including the visual content details in the question so that the user needs to reason about that first.
7
+
8
+ Instead of directly mentioning the bounding box coordinates, utilize this data to explain the scene using natural language. Include details like object counts, position of the objects, relative position between the objects.
9
+
10
+ When using the information from the caption and coordinates, directly explain the scene, and do not mention that the information source is the caption or the bounding box. Always answer as if you are directly looking at the image.
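The bounding-box lines in the `*_caps.txt` files follow the `label: [x1, y1, x2, y2]` layout described in this system message, with coordinates normalized to the range 0 to 1. The sketch below parses one such line and scales it to pixels. `parse_box_line` and `to_pixels` are hypothetical helpers, and the image size in the example is an assumption made for illustration only.

```python
import ast
from typing import Tuple

Box = Tuple[float, float, float, float]


def parse_box_line(line: str) -> Tuple[str, Box]:
    """Split 'label: [x1, y1, x2, y2]' into the label and a float 4-tuple."""
    label, _, coords = line.partition(":")
    x1, y1, x2, y2 = (float(v) for v in ast.literal_eval(coords.strip()))
    return label.strip(), (x1, y1, x2, y2)


def to_pixels(box: Box, width: int, height: int) -> Tuple[int, int, int, int]:
    """Scale normalized (x1, y1, x2, y2) to pixel coordinates."""
    x1, y1, x2, y2 = box
    return (round(x1 * width), round(y1 * height), round(x2 * width), round(y2 * height))


label, box = parse_box_line("person: [0.0, 0.24, 0.214, 1.0]")
pixels = to_pixels(box, width=1000, height=500)  # assumed image size
```

Because x is scaled by the width and y by the height, the same normalized box describes the same region at any resolution.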
LLaVA/playground/data/prompts/conversation/001_caps.txt ADDED
@@ -0,0 +1,5 @@
1
+ A man is skiing in the open snow covered hills
2
+ A skier is making his way into the snow.
3
+ A skier on the foothills of a huge mountain range.
4
+ A skier looks at mountains while standing near a trail sign.
5
+ a single person skiing by an area with a lot of bushes
LLaVA/playground/data/prompts/conversation/001_conv.txt ADDED
@@ -0,0 +1,37 @@
1
+ Question:
2
+ What is the skier doing?
3
+ ===
4
+ Answer:
5
+ The skier is skiing in the snow-covered hills, making his way through the snow and enjoying the beautiful mountain scenery. They are also standing near a trail sign, which indicates their location on the mountain.
6
+ ===
7
+ Question:
8
+ Are there any other people or skiers visible in the image?
9
+ ===
10
+ Answer:
11
+ No, there is only a single person skiing in the image, and they are surrounded by an area with a lot of bushes.
12
+ ===
13
+ Question:
14
+ What kind of skiing is the person doing? Cross-country or downhill?
15
+ ===
16
+ Answer:
17
+ Based on what I can see in the photo, it is not possible to definitively determine whether the skier is doing cross-country or downhill skiing. However, given the presence of a trail sign and the fact that they are skiing in the foothills, it is more likely that the skier is engaged in cross-country skiing. Downhill skiing typically takes place on more steep slopes and involves ski lifts, whereas cross-country skiing often occurs on relatively flatter terrain with marked trails.
18
+ ===
19
+ Question:
20
+ Can you explain cross-country and downhill skiing a bit more?
21
+ ===
22
+ Answer:
23
+ Cross-country skiing and downhill skiing are two distinct types of skiing, each with its own techniques, equipment, and terrain preferences.
24
+
25
+ 1. Cross-Country Skiing:
26
+ Also known as Nordic skiing, cross-country skiing involves traversing relatively flat terrain or gentle slopes using skis and poles. It is a popular recreational and competitive sport that focuses on endurance, balance, and aerobic fitness. Cross-country skiers use their own locomotion to move across snow-covered terrain, rather than relying on ski lifts or other forms of assistance.
27
+ There are two main techniques in cross-country skiing:
28
+ - Classic: This technique involves skiing in parallel tracks, using a forward-and-backward gliding motion while keeping the skis parallel to each other. The skier uses their poles to propel themselves forward.
29
+ - Skate skiing: This technique is similar to ice skating, where the skier pushes off one ski at an angle and glides on the other. It requires a smooth, groomed surface and is typically faster than the classic technique.
30
+ Cross-country ski equipment consists of lightweight skis, boots, bindings, and poles. The boots are more flexible than those used in downhill skiing, allowing for better ankle movement and control.
31
+
32
+ 2. Downhill Skiing:
33
+ Also known as alpine skiing, downhill skiing involves descending slopes at high speeds using skis and poles for balance and control. This sport is more focused on speed, technique, and navigating challenging terrain, including steep slopes, moguls, and even jumps.
34
+ Downhill skiing can be further categorized into several disciplines, such as slalom, giant slalom, super-G, and downhill racing. Each discipline has its own set of rules, courses, and ski equipment.
35
+ Downhill ski equipment includes heavier and stiffer skis, boots, bindings, and poles than those used in cross-country skiing. The boots are more rigid to provide better support and control during high-speed descents and sharp turns.
36
+
37
+ In summary, cross-country skiing is an endurance-based sport that involves traveling across flat or gently sloping terrain, while downhill skiing is focused on speed and technique as skiers navigate steeper slopes and challenging terrain. Both sports require specialized equipment and techniques, but they offer different experiences and challenges to participants.
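The `*_conv.txt` files alternate `Question:` and `Answer:` blocks separated by lines containing only `===`. One way to read such a file into turns is sketched below. `parse_conv` is a hypothetical helper; how the repository itself loads these few-shot examples is not shown in this section.

```python
from typing import List, Tuple


def parse_conv(text: str) -> List[Tuple[str, str]]:
    """Split a Question/Answer file into (role, content) turns.

    Blocks are separated by a line holding only '==='. The first line of
    each block is the role header, and the rest is the content.
    """
    turns = []
    for block in text.strip().split("\n===\n"):
        header, _, body = block.partition("\n")
        role = header.rstrip(":").strip().lower()
        turns.append((role, body.strip()))
    return turns


sample = (
    "Question:\nWhat is the skier doing?\n===\n"
    "Answer:\nSkiing.\n===\n"
    "Question:\nAnyone else?\n===\n"
    "Answer:\nNo."
)
turns = parse_conv(sample)
```

Multi-paragraph answers are kept whole, because only the `===` separator lines split blocks.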
LLaVA/playground/data/prompts/detail_description/000_caps.txt ADDED
@@ -0,0 +1,18 @@
1
+ A harbor filled with lots of boats next to a building.
2
+ A bicycle parked in front of several boats at a dock.
3
+ A red bicycle in front of a line of docked white yachts
4
+ A bike sits before boats which sit before a long building.
5
+ A bicycle is a convenient means of land transportation when you live on a boat.
6
+
7
+ bicycle: [0.287, 0.641, 0.507, 0.874]
8
+ bicycle: [0.566, 0.667, 0.63, 0.731]
9
+ boat: [0.318, 0.579, 0.575, 0.724]
10
+ boat: [0.704, 0.607, 0.818, 0.727]
11
+ boat: [0.818, 0.601, 0.942, 0.744]
12
+ boat: [0.002, 0.53, 0.243, 0.71]
13
+ boat: [0.541, 0.611, 0.668, 0.731]
14
+ person: [0.778, 0.527, 0.797, 0.57]
15
+ cup: [0.708, 0.733, 0.724, 0.758]
16
+ boat: [0.236, 0.532, 0.404, 0.64]
17
+ boat: [0.81, 0.632, 0.836, 0.676]
18
+ boat: [0.957, 0.526, 1.0, 0.752]
LLaVA/playground/data/prompts/detail_description/000_conv.txt ADDED
@@ -0,0 +1,3 @@
1
+ It is a harbor filled with numerous boats of various sizes docked next to a long building. Among the boats, there are a few white yachts lined up, standing out from the rest. There is a red bicycle prominently parked in front of the line of docked boats, serving as a convenient means of land transportation for those living on the boats. Another bicycle can be seen further back in the scene, near the middle of the harbor.
2
+
3
+ A person is visible near the right side of the harbor, possibly enjoying the view or attending to their boat. Additionally, there is a cup placed on a surface near the middle of the scene.
LLaVA/playground/data/prompts/detail_description/001_caps.txt ADDED
@@ -0,0 +1,18 @@
1
+ A group of people standing outside of a black vehicle with various luggage.
2
+ Luggage surrounds a vehicle in an underground parking area
3
+ People try to fit all of their luggage in an SUV.
4
+ The sport utility vehicle is parked in the public garage, being packed for a trip
5
+ Some people with luggage near a van that is transporting it.
6
+
7
+ person: [0.681, 0.242, 0.774, 0.694]
8
+ person: [0.63, 0.222, 0.686, 0.516]
9
+ person: [0.444, 0.233, 0.487, 0.34]
10
+ backpack: [0.384, 0.696, 0.485, 0.914]
11
+ backpack: [0.755, 0.413, 0.846, 0.692]
12
+ suitcase: [0.758, 0.413, 0.845, 0.69]
13
+ suitcase: [0.1, 0.497, 0.173, 0.579]
14
+ bicycle: [0.282, 0.363, 0.327, 0.442]
15
+ car: [0.786, 0.25, 0.848, 0.322]
16
+ car: [0.783, 0.27, 0.827, 0.335]
17
+ car: [0.86, 0.254, 0.891, 0.3]
18
+ car: [0.261, 0.101, 0.787, 0.626]
LLaVA/playground/data/prompts/detail_description/001_conv.txt ADDED
@@ -0,0 +1,5 @@
+ The image is an underground parking area with a black sport utility vehicle (SUV) parked. There are three people in the scene, with one person standing closer to the left side of the vehicle, another person in the middle, and the third person on the right side. They are all working together to pack their luggage into the SUV for a trip.
+
+ In the parking area, various luggage items are scattered around the vehicle. There are two backpacks, one located near the left rear wheel and the other closer to the right side of the vehicle. Additionally, there are two suitcases, one on the right side of the car and another further away near the center of the parking area. A bicycle can also be seen on the left side of the vehicle.
+
+ Other cars are parked around the main SUV, with one car positioned behind it and slightly to the left, another behind and slightly to the right, and the third car further behind on the right side.
LLaVA/playground/data/prompts/detail_description/002_caps.txt ADDED
@@ -0,0 +1,15 @@
+ A man holds a Wii-mote above his head while another looks on.
+ A guy and his friend are playing Nintendo Wii.
+ A young man is holding a video game remote over his head.
+ two men standing in a room while one plays with a wii mote
+ Some guys standing and playing a video game.
+
+ couch: [0.697, 0.759, 0.995, 1.0]
+ dining table: [0.426, 0.755, 1.0, 0.987]
+ person: [0.082, 0.252, 0.342, 1.0]
+ person: [0.399, 0.085, 0.742, 0.982]
+ remote: [0.477, 0.135, 0.516, 0.187]
+ sink: [0.016, 0.501, 0.063, 0.52]
+ potted plant: [0.798, 0.384, 0.888, 0.645]
+ refrigerator: [0.305, 0.389, 0.414, 0.547]
+ chair: [0.72, 0.509, 0.858, 0.725]
LLaVA/playground/data/prompts/detail_description/002_conv.txt ADDED
@@ -0,0 +1,3 @@
+ The image shows two men standing in a room, engaged in playing a video game on a Nintendo Wii console. One of the men is holding a Wii remote above his head with enthusiasm, while the other man looks on, likely enjoying the friendly competition.
+
+ The room appears to be a living space with a couch located in the background and a dining table nearby. A potted plant can be seen placed close to the couch, and a chair is situated in the middle of the room. The room also features a kitchen area with a sink and a refrigerator visible in the background.
LLaVA/playground/data/prompts/detail_description/system_message.txt ADDED
@@ -0,0 +1,7 @@
+ You are an AI visual assistant that can analyze a single image. You receive five sentences, each describing the same image you are observing. In addition, specific object locations within the image are given, along with detailed coordinates. These coordinates are in the form of bounding boxes, represented as (x1, y1, x2, y2) with floating numbers ranging from 0 to 1. These values correspond to the top left x, top left y, bottom right x, and bottom right y.
+
+ Using the provided caption and bounding box information, describe the scene in a detailed manner.
+
+ Instead of directly mentioning the bounding box coordinates, utilize this data to explain the scene using natural language. Include details like object counts, position of the objects, relative position between the objects.
+
+ When using the information from the caption and coordinates, directly explain the scene, and do not mention that the information source is the caption or the bounding box. Always answer as if you are directly looking at the image.
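The files in `detail_description/` look like the pieces of a few-shot prompt: one system message plus paired `NNN_caps.txt` (input) and `NNN_conv.txt` (expected answer) examples. This diff does not show how the repository consumes them, so the following Python sketch rests on assumptions: the role/content dictionary format and the names `load_examples` and `build_messages` are hypothetical and are not taken from the LLaVA code.

```python
from pathlib import Path


def load_examples(prompt_dir):
    """Pair each NNN_caps.txt with its NNN_conv.txt, ordered by file name.

    Hypothetical helper; caps files without a matching conv file are skipped.
    """
    examples = []
    for caps_path in sorted(Path(prompt_dir).glob("*_caps.txt")):
        conv_name = caps_path.name.replace("_caps.txt", "_conv.txt")
        conv_path = caps_path.with_name(conv_name)
        if conv_path.exists():
            examples.append((
                caps_path.read_text(encoding="utf-8").strip(),
                conv_path.read_text(encoding="utf-8").strip(),
            ))
    return examples


def build_messages(system_message, examples, query):
    """Assemble a chat-style few-shot prompt (assumed message format)."""
    messages = [{"role": "system", "content": system_message.strip()}]
    for caps, conv in examples:
        # Each example becomes one user turn and one assistant turn.
        messages.append({"role": "user", "content": caps})
        messages.append({"role": "assistant", "content": conv})
    # The new captions and boxes to describe go last.
    messages.append({"role": "user", "content": query.strip()})
    return messages


demo = build_messages(
    "You are an AI visual assistant that can analyze a single image.",
    [("A guy and his friend are playing Nintendo Wii.\n\n"
      "remote: [0.477, 0.135, 0.516, 0.187]",
      "The image shows two men standing in a room.")],
    "People try to fit all of their luggage in an SUV.\n\n"
    "car: [0.261, 0.101, 0.787, 0.626]",
)
print([m["role"] for m in demo])
```

With real files, `load_examples("LLaVA/playground/data/prompts/detail_description")` would supply the `examples` argument, and the text of `system_message.txt` would supply `system_message`.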