kimyoungjune committed
Commit 8e721c4 · verified · 1 Parent(s): 1b613ab

Update README.md

Files changed (1):
  1. README.md +8 -9
README.md CHANGED
@@ -18,7 +18,7 @@ library_name: transformers
 
 ## About the Model
 
-**VARCO-VISION-14B** is a powerful English-Korean Vision-Language Model (VLM) developed through four distinct training phases, culminating in a final preference optimization stage. Designed to excel in both multimodal and text-only tasks, VARCO-VISION-14B not only surpasses other models of similar size in performance but also achieves scores comparable to those of proprietary models. The model currently accepts a single image and accompanying text as input, generating text as output. It supports grounding—the ability to identify the locations of objects within an image—as well as OCR (Optical Character Recognition) to recognize text within images.
+**VARCO-VISION-14B** is a powerful English-Korean Vision-Language Model (VLM). The training pipeline of VARCO-VISION consists of four stages: Feature Alignment Pre-training, Basic Supervised Fine-tuning, Advanced Supervised Fine-tuning, and Preference Optimization. On both multimodal and text-only benchmarks, VARCO-VISION-14B not only surpasses other models of similar size but also achieves scores comparable to those of proprietary models. The model currently accepts a single image and a text prompt as input and generates text as output. It supports grounding and referring, as well as OCR (Optical Character Recognition).
 
 - **Developed by:** NC Research, Multimodal Generation Team
 - **Technical Report:** [VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models](https://arxiv.org/pdf/2411.19103)
@@ -73,7 +73,7 @@ vision_tower = model.get_vision_tower()
 image_processor = vision_tower.image_processor
 ```
 
-Prepare the image and text input by preprocessing the image and tokenizing the text. Pass the processed inputs to the model to generate predictions.
+Prepare an image and a text input: preprocess the image, tokenize the text, and pass the processed inputs to the model to generate predictions.
 
 ```python
 import requests
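
The sentence added in this hunk compresses three steps: image preprocessing, text tokenization, and generation. Below is a minimal sketch of that flow, reusing the `model`, `tokenizer`, and `image_processor` objects loaded earlier in the README. The image URL is a placeholder, and the `images=` keyword on `generate` follows the LLaVA-style convention this repository builds on, so the exact call may differ from the full snippet in the file:

```python
import requests
import torch
from PIL import Image

# Fetch a sample image (placeholder URL).
url = "https://example.com/sample.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Preprocess the image into pixel-value tensors with the vision tower's processor.
pixel_values = image_processor.preprocess(image, return_tensors="pt")["pixel_values"]
pixel_values = pixel_values.to(model.device, dtype=torch.float16)

# Tokenize the text prompt.
prompt = "Describe this image."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

# Generate; LLaVA-style models take the image tensor alongside the token ids.
with torch.inference_mode():
    output_ids = model.generate(input_ids, images=pixel_values, max_new_tokens=512)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```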
@@ -123,9 +123,9 @@ print(outputs)
 
 ### Specialized Features
 
-To receive questions or answers based on bounding boxes (e.g., grounding, referring, OCR tasks), include special tokens in the input text.
-
-The following special tokens are used to define specific tasks, inputs and outputs for the model:
+If a question is based on bounding boxes or requires bounding boxes as output, include the special tokens in the input text.
+
+The following special tokens are used to define specific tasks, inputs, and outputs for the model:
 
 - `<gro>`: Indicates that the model's response should include bounding box information.
 - `<ocr>`: Specifies OCR tasks for recognizing text within an image.
@@ -135,8 +135,7 @@ The following special tokens are used to define specific tasks, inputs and outpu
 - `<delim>`: Represents multiple location points for a single object or text.
 
 #### Grounding
-
-Grounding refers to the task where the model identifies specific locations within an image to provide an answer. To perform grounding, prepend the special token `<gro>` to the question.
+Grounding refers to a task where the model needs to identify specific locations within an image to provide an appropriate answer. To perform grounding, prepend the special token `<gro>` to the question.
 
 ```python
 conversation = [
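
As the reworded paragraph states, grounding only requires prepending `<gro>` to the question. A short sketch of such a conversation, assuming the chat format used in the README's other snippets (the question text is illustrative); the expected answer format, with `<obj>`/`<bbox>` spans, is visible in the next hunk's context line:

```python
# Prepend <gro> so the model's answer includes bounding box information.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "<gro>\nDescribe the image."},
            {"type": "image"},
        ],
    },
]
# Expected answer shape (sample from this README):
# "The image shows <obj>two cats</obj><bbox>0.521, 0.049, 0.997, 0.783<delim>0.016, ..."
```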
@@ -159,7 +158,7 @@ The image shows <obj>two cats</obj><bbox>0.521, 0.049, 0.997, 0.783<delim>0.016,
 
 #### Referring
 
-VARCO-VISION-14B can handle location-specific questions using bounding boxes. To perform referring tasks, structure the conversation by including the object of interest within `<obj>` and `</obj>` tags and specifying its location with `<bbox>` and `</bbox>` tags. This allows the model to understand the context and focus on the object at the specified location.
+VARCO-VISION-14B can handle location-specific questions using bounding boxes. To perform referring tasks, build the conversation so that the object of interest is wrapped in `<obj>` and `</obj>` tags and its location is specified with `<bbox>` and `</bbox>` tags. This allows the model to understand the context and focus on the object at the specified location. A bbox is represented in the form (x1, y1, x2, y2): the first two values give the top-left position of the box, and the latter two give the bottom-right position.
 
 ```python
 conversation = [
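
The coordinate convention added in this hunk can be made concrete with a small helper. The sample boxes in this README all lie between 0 and 1, so the sketch below assumes coordinates are expressed as fractions of the image width and height; the helper name and the question text are illustrative, not part of the model's API:

```python
def to_bbox_tag(x1, y1, x2, y2, img_w, img_h):
    """Convert pixel coordinates (top-left x1, y1; bottom-right x2, y2)
    to the normalized <bbox> string the model expects."""
    coords = (x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h)
    return "<bbox>" + ", ".join(f"{c:.3f}" for c in coords) + "</bbox>"

# A 200x300-pixel box with its top-left corner at (100, 50) in a 1000x1000 image:
question = (
    "How is <obj>this object</obj>"
    + to_bbox_tag(100, 50, 300, 350, 1000, 1000)
    + " used?"
)
# -> "How is <obj>this object</obj><bbox>0.100, 0.050, 0.300, 0.350</bbox> used?"

conversation = [
    {"role": "user", "content": [{"type": "text", "text": question}, {"type": "image"}]},
]
```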
@@ -225,7 +224,7 @@ conversation = [
 
 ## Citing the Model
 
-(*bibtex will be updated soon..*) If you use VARCO-VISION-14B in your research, please cite the following:
+If you use VARCO-VISION-14B in your research, please cite the following:
 
 ```bibtex
 @misc{ju2024varcovisionexpandingfrontierskorean,
 