---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
{}
---

# LLaVA Model Card

## Model Details

Model type: LLaVA is an open-source chatbot trained by fine-tuning an LLM on multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture.

Base LLM: Qwen/Qwen1.5-72B-Chat

### Model Description

**Repository:** https://github.com/EvolvingLMMs-Lab/LLaVA-NEXT

**Primary intended uses:** The primary use of LLaVA is research on large multimodal models and chatbots.

**Primary intended users:** The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

### License Notices

This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of those licenses, including but not limited to the OpenAI Terms of Use for the dataset and the licenses of the base language models used for checkpoints trained on the dataset (e.g., the Llama-1/2 community license for LLaMA-2 and Vicuna-v1.5, the Tongyi Qianwen RESEARCH LICENSE AGREEMENT, and the Llama-3 Research License). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the datasets and checkpoints complies with all applicable laws and regulations.

## How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]

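As a minimal sketch, the commands below assume this checkpoint is used with the LLaVA-NeXT codebase linked above and that the codebase exposes the same CLI entry point as upstream LLaVA (`llava.serve.cli`); the install steps, flags, and paths are illustrative assumptions rather than verified instructions for this checkpoint.

```shell
# Assumed workflow: clone and install the LLaVA-NeXT codebase, then chat with the
# checkpoint through the CLI inherited from upstream LLaVA. All paths are placeholders.
git clone https://github.com/EvolvingLMMs-Lab/LLaVA-NEXT
cd LLaVA-NEXT
pip install -e .

# Point --model-path at this checkpoint (a local directory or a Hub repo id).
python -m llava.serve.cli \
    --model-path /path/to/this/checkpoint \
    --image-file /path/to/example.jpg
```
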
## Training Details

### Training Procedure

We conducted training on the LLaVA-1.6 codebase, adding support for the Llama-3 and Qwen models.

### Training Hyperparameters

```shell
LLM_VERSION="Qwen/Qwen1.5-72B-Chat"
LLM_VERSION_CLEAN="${LLM_VERSION//\//_}"
VISION_MODEL_VERSION="openai/clip-vit-large-patch14-336"
VISION_MODEL_VERSION_CLEAN="${VISION_MODEL_VERSION//\//_}"

PROMPT_VERSION=plain
PRETRAIN_DATA_VERSION="blip558k"
############### Pretrain ################

BASE_RUN_NAME="llavanext-${LLM_VERSION_CLEAN}-${VISION_MODEL_VERSION_CLEAN}-pretrain_${PRETRAIN_DATA_VERSION}_plain"
echo "BASE_RUN_NAME: ${BASE_RUN_NAME}"

PROMPT_VERSION="qwen_1_5"
MID_RUN_NAME="llavanext-${LLM_VERSION_CLEAN}-${VISION_MODEL_VERSION_CLEAN}-pretrain_${PRETRAIN_DATA_VERSION}_plain-ft_la1_6mix_d32k"
echo "MID_RUN_NAME: ${MID_RUN_NAME}"

# Add the torchrun launch arguments for your cluster here (e.g. --nnodes, --nproc_per_node, rendezvous endpoint).
torchrun \
llava/train/train_mem.py \
--deepspeed scripts/zero3.json \
--model_name_or_path $LLM_VERSION \
--version $PROMPT_VERSION \
--data_path="/path/to/data/llava_instruct/llava1_6mix.json" \
--image_folder /path/to/data/llava_data \
--pretrain_mm_mlp_adapter="./checkpoints/projectors/${BASE_RUN_NAME}/mm_projector.bin" \
--mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \
--mm_vision_tower_lr=2e-6 \
--vision_tower ${VISION_MODEL_VERSION} \
--mm_projector_type mlp2x_gelu \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--group_by_modality_length True \
--image_aspect_ratio anyres \
--image_grid_pinpoints "[(336, 672), (672, 336), (672, 672), (1008, 336), (336, 1008)]" \
--mm_patch_merge_type spatial_unpad \
--bf16 True \
--run_name $MID_RUN_NAME \
--output_dir ./checkpoints/$MID_RUN_NAME \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 2 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 3000 \
--save_total_limit 1 \
--learning_rate 1e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 32768 \
--gradient_checkpointing True \
--dataloader_num_workers 8 \
--lazy_preprocess True \
--report_to wandb \
--torch_compile True \
--torch_compile_backend "inductor" \
--dataloader_drop_last True
```

### Training Data

- 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
- 158K GPT-generated multimodal instruction-following data.
- 500K academic-task-oriented VQA data mixture.
- 50K GPT-4V data mixture.
- 40K ShareGPT data.

#### Speeds, Sizes, Times

Training takes roughly 50-60 hours on 8 x 8 NVIDIA A100-SXM4-80GB GPUs (times may vary with hardware).

[More Information Needed]

## Evaluation

The evaluation is conducted with the support of [`lmms-eval`](https://github.com/EvolvingLMMs-Lab/lmms-eval).

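A typical invocation looks like the sketch below, based on the lmms-eval README; the task name, process count, and checkpoint path are illustrative placeholders rather than the exact settings used for this model.

```shell
# Illustrative lmms-eval run; see the lmms-eval repository for the full list of flags
# and supported tasks. The pretrained path is a placeholder for this checkpoint.
pip install lmms-eval  # or install from source

accelerate launch --num_processes=8 -m lmms_eval \
    --model llava \
    --model_args pretrained="/path/to/this/checkpoint" \
    --tasks mme \
    --batch_size 1 \
    --output_path ./logs/
```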