Update README.md
|
|
## About the Model

**VARCO-VISION-14B** is a powerful English-Korean Vision-Language Model (VLM). The training pipeline of VARCO-VISION consists of four stages: Feature Alignment Pre-training, Basic Supervised Fine-tuning, Advanced Supervised Fine-tuning, and Preference Optimization. On both multimodal and text-only benchmarks, VARCO-VISION-14B not only surpasses other models of similar size but also achieves scores comparable to those of proprietary models. The model currently accepts a single image together with a text prompt as input and generates a text output. It supports grounding and referring as well as OCR (Optical Character Recognition).

- **Developed by:** NC Research, Multimodal Generation Team
- **Technical Report:** [VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models](https://arxiv.org/pdf/2411.19103)
```python
vision_tower = model.get_vision_tower()
image_processor = vision_tower.image_processor
```

Prepare an image and a text input: preprocess the image, tokenize the text, and pass the processed inputs to the model to generate a prediction.

```python
import requests
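# ---------------------------------------------------------------------------
# NOTE: everything below is a sketch, not verbatim model-card code (the
# original example is truncated in this excerpt). It assumes the LLaVA-NeXT
# style helpers implied by `model.get_vision_tower()` above, plus a `tokenizer`
# loaded alongside `model` and `image_processor`.
# ---------------------------------------------------------------------------
import torch
from PIL import Image
from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.mm_utils import process_images, tokenizer_image_token

# Download a sample image.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(url, stream=True).raw)

# Build a chat-formatted prompt containing the image placeholder token.
conversation = [{"role": "user", "content": f"{DEFAULT_IMAGE_TOKEN}\nDescribe this image."}]
prompt = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)

# Tokenize the text (the placeholder becomes IMAGE_TOKEN_INDEX) and preprocess
# the image for the vision tower.
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
input_ids = input_ids.unsqueeze(0).to(model.device)
image_tensors = process_images([raw_image], image_processor, model.config)
image_tensors = [image.to(dtype=model.dtype, device=model.device) for image in image_tensors]

# Generate and decode the prediction.
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensors,
        image_sizes=[raw_image.size],
        do_sample=False,
        max_new_tokens=512,
    )
outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
print(outputs)
```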
### Specialized Features

If a question is based on bounding boxes or requires bounding boxes as part of the output, include the corresponding special tokens in the input text; a brief sketch follows the token list below.

The following special tokens are used to define specific tasks, inputs, and outputs for the model:
- `<gro>`: Indicates that the model's response should include bounding box information.
- `<ocr>`: Specifies OCR tasks for recognizing text within an image.
- `<delim>`: Represents multiple location points for a single object or text.
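```python
# Hypothetical sketch (not taken from the model card): requesting OCR by
# prepending the <ocr> task token, reusing the DEFAULT_IMAGE_TOKEN prompt
# pattern and the generation call from the usage example above.
ocr_conversation = [{"role": "user", "content": f"{DEFAULT_IMAGE_TOKEN}\n<ocr>"}]
# Build the prompt with apply_chat_template and generate as shown earlier; the
# model then reads out the text it recognizes in the image.
```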
#### Grounding
Grounding refers to a task where the model needs to identify specific locations within an image to provide an appropriate answer. To perform grounding, prepend the special token `<gro>` to the question.
```python
conversation = [
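    # Sketch continuation (assumed) -- the original example is truncated here.
    # Prepending <gro> makes the response include grounded <obj> spans with
    # <bbox> coordinates; the question text below is illustrative.
    {
        "role": "user",
        "content": f"{DEFAULT_IMAGE_TOKEN}\n<gro>\nDescribe the image in detail.",
    },
]
# Build the prompt with apply_chat_template and generate as in the usage
# example above; the answer embeds <obj>/<bbox> (and <delim>) annotations.
```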
#### Referring
VARCO-VISION-14B can handle location-specific questions using bounding boxes. To perform referring tasks, construct a conversation that wraps the object of interest in `<obj>` and `</obj>` tags and specifies its location with `<bbox>` and `</bbox>` tags. This allows the model to understand the context and focus on the object at the specified location. A bbox is given as (x1, y1, x2, y2): the first two values are the top-left corner of the box and the latter two are the bottom-right corner.
```python
conversation = [
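    # Sketch continuation (assumed) -- the original example is truncated here.
    # Wrap the referred object in <obj> tags and give its box in <bbox> tags;
    # the object name and (x1, y1, x2, y2) values below are illustrative.
    {
        "role": "user",
        "content": (
            f"{DEFAULT_IMAGE_TOKEN}\n"
            "<obj>this object</obj><bbox>0.321, 0.421, 0.673, 0.814</bbox>\n"
            "What is this object used for?"
        ),
    },
]
# Build the prompt and generate exactly as in the usage example above.
```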
## Citing the Model
If you use VARCO-VISION-14B in your research, please cite the following:
```bibtex
@misc{ju2024varcovisionexpandingfrontierskorean,