Update README.md
|
|
## About the Model

**VARCO-VISION-14B** is a powerful English-Korean Vision-Language Model (VLM). The training pipeline of VARCO-VISION consists of four stages: Feature Alignment Pre-training, Basic Supervised Fine-tuning, Advanced Supervised Fine-tuning, and Preference Optimization. On both multimodal and text-only benchmarks, VARCO-VISION-14B not only surpasses other models of similar size but also achieves scores comparable to those of proprietary models. The model currently accepts a single image together with a text prompt as input and generates a text output. It supports grounding and referring as well as OCR (Optical Character Recognition).

- **Developed by:** NC Research, Multimodal Generation Team
- **Technical Report:** [VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models](https://arxiv.org/pdf/2411.19103)
```python
vision_tower = model.get_vision_tower()
image_processor = vision_tower.image_processor
```

Prepare an image and a text input: preprocess the image, tokenize the text, and pass the processed inputs to the model to generate a prediction.

```python
import requests
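# ---------------------------------------------------------------------------
# NOTE: everything below is a sketch, not verbatim model-card code (the
# original example is truncated in this excerpt). It assumes the LLaVA-NeXT
# style helpers implied by `model.get_vision_tower()` above, plus a `tokenizer`
# loaded alongside `model` and `image_processor`.
# ---------------------------------------------------------------------------
import torch
from PIL import Image
from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.mm_utils import process_images, tokenizer_image_token

# Download a sample image.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(url, stream=True).raw)

# Build a chat-formatted prompt containing the image placeholder token.
conversation = [{"role": "user", "content": f"{DEFAULT_IMAGE_TOKEN}\nDescribe this image."}]
prompt = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)

# Tokenize the text (the placeholder becomes IMAGE_TOKEN_INDEX) and preprocess
# the image for the vision tower.
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
input_ids = input_ids.unsqueeze(0).to(model.device)
image_tensors = process_images([raw_image], image_processor, model.config)
image_tensors = [image.to(dtype=model.dtype, device=model.device) for image in image_tensors]

# Generate and decode the prediction.
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensors,
        image_sizes=[raw_image.size],
        do_sample=False,
        max_new_tokens=512,
    )
outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
print(outputs)
```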
### Specialized Features

If a question is based on bounding boxes or requires bounding boxes as part of the output, include the corresponding special tokens in the input text; a brief sketch follows the token list below.

The following special tokens are used to define specific tasks, inputs, and outputs for the model:
- `<gro>`: Indicates that the model's response should include bounding box information.
- `<ocr>`: Specifies OCR tasks for recognizing text within an image.
- `<delim>`: Represents multiple location points for a single object or text.
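```python
# Hypothetical sketch (not taken from the model card): requesting OCR by
# prepending the <ocr> task token, reusing the DEFAULT_IMAGE_TOKEN prompt
# pattern and the generation call from the usage example above.
ocr_conversation = [{"role": "user", "content": f"{DEFAULT_IMAGE_TOKEN}\n<ocr>"}]
# Build the prompt with apply_chat_template and generate as shown earlier; the
# model then reads out the text it recognizes in the image.
```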
#### Grounding
Grounding refers to a task where the model needs to identify specific locations within an image to provide an appropriate answer. To perform grounding, prepend the special token `<gro>` to the question.
```python
conversation = [
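    # Sketch continuation (assumed) -- the original example is truncated here.
    # Prepending <gro> makes the response include grounded <obj> spans with
    # <bbox> coordinates; the question text below is illustrative.
    {
        "role": "user",
        "content": f"{DEFAULT_IMAGE_TOKEN}\n<gro>\nDescribe the image in detail.",
    },
]
# Build the prompt with apply_chat_template and generate as in the usage
# example above; the answer embeds <obj>/<bbox> (and <delim>) annotations.
```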
#### Referring
VARCO-VISION-14B can handle location-specific questions using bounding boxes. To perform referring tasks, construct a conversation that wraps the object of interest in `<obj>` and `</obj>` tags and specifies its location with `<bbox>` and `</bbox>` tags. This allows the model to understand the context and focus on the object at the specified location. A bbox is given as (x1, y1, x2, y2): the first two values are the top-left corner of the box and the latter two are the bottom-right corner.
```python
conversation = [
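    # Sketch continuation (assumed) -- the original example is truncated here.
    # Wrap the referred object in <obj> tags and give its box in <bbox> tags;
    # the object name and (x1, y1, x2, y2) values below are illustrative.
    {
        "role": "user",
        "content": (
            f"{DEFAULT_IMAGE_TOKEN}\n"
            "<obj>this object</obj><bbox>0.321, 0.421, 0.673, 0.814</bbox>\n"
            "What is this object used for?"
        ),
    },
]
# Build the prompt and generate exactly as in the usage example above.
```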
## Citing the Model
If you use VARCO-VISION-14B in your research, please cite the following:
```bibtex
@misc{ju2024varcovisionexpandingfrontierskorean,