NVIDIA Releases 3 Million Sample Dataset for OCR, Visual Question Answering, and Captioning Tasks
NVIDIA continues releasing permissive datasets in support of world-class models
We’re excited to release Llama Nemotron VLM Dataset V1, with 3 million samples of high-quality vision language model (VLM) training data focused on enterprise-specific use cases such as optical character recognition (OCR), visual question answering (VQA), and captioning. The dataset compiles the high-quality VLM post-training datasets that we used to train our recently released Llama 3.1 Nemotron Nano VL 8B V1, a model designed specifically for intelligent document processing and understanding. The model topped the OCRBench V2 benchmark, highlighting the quality of this training dataset.
The release represents a significant move forward in openness and transparency in model development and improvement. By releasing training data, in addition to the training tools and final model weights, NVIDIA supports continued improvement of open source models.
What’s in the Dataset and How We Built It
At a high level, the Llama Nemotron VLM Dataset V1 is composed of 67.0% VQA samples, 28.4% OCR samples, and 4.6% image captioning samples. Developers can use the dataset in its entirety or curate it further with NVIDIA NeMo Curator to produce high-quality data for training highly accurate VLMs.
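To make the composition concrete, the short sketch below converts the stated percentages into approximate sample counts. It assumes the rounded "3 million" total from this announcement, so the numbers are estimates rather than exact dataset statistics.

```python
# Approximate per-task sample counts implied by the stated composition.
# TOTAL_SAMPLES is the rounded "3 million" figure from the announcement
# (an assumption), so the counts below are estimates only.
TOTAL_SAMPLES = 3_000_000

composition = {
    "VQA": 0.670,
    "OCR": 0.284,
    "Image Captioning": 0.046,
}

# Sanity check: the three shares should cover the whole dataset.
assert abs(sum(composition.values()) - 1.0) < 1e-9

approx_counts = {
    task: round(share * TOTAL_SAMPLES) for task, share in composition.items()
}

for task, count in approx_counts.items():
    print(f"{task}: ~{count:,} samples")
```

Under that assumption, VQA accounts for roughly two million samples, OCR for roughly 850 thousand, and image captioning for roughly 140 thousand.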
Let’s dive deeper into how we constructed this dataset.
This dataset introduces high-quality annotations that support world-class vision-language understanding. One area of focus was re-annotating popular visual question answering datasets with open-source technologies, so that the data can be used permissively for training. To extract additional rich information from input images that is not fully present in the existing QA annotations, we used commercially permissive models and pipelines to generate fine-grained descriptions of the images.
Examples of augmentation include adding chain-of-thought explanations, rule-based QA generation (e.g., with templates), expanding short answers into longer responses, and proper reformatting. More details can be found in the paper Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models.
Optical character recognition (OCR) is particularly important for document understanding, which involves tables and figures from various sources and with diverse layouts. OCR is also important for use cases such as IT support and customer service. A VLM trained with this dataset gains a deeper comprehension of images containing text, tabular content, and icons, as well as images of documents with varied structure. Details may be found in Éclair - Extracting Content and Layout with Integrated Reading Order for Documents. As part of this dataset release, we share:
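As an illustration of two of these augmentations, the sketch below shows template-based QA generation combined with expanding a short answer into a full sentence. The templates, the function name `generate_qa_pairs`, and the input format are all hypothetical and chosen for illustration; they are not the pipeline used to build the dataset.

```python
from typing import Dict, List

# Hypothetical templates; the templates actually used for the dataset
# are not described in this post.
QUESTION_TEMPLATE = "What is the value of {label} in the chart?"
ANSWER_TEMPLATE = "The value of {label} in the chart is {value}."


def generate_qa_pairs(chart_values: Dict[str, str]) -> List[Dict[str, str]]:
    """Create one question/answer pair per labeled value using fixed templates."""
    pairs = []
    for label, value in chart_values.items():
        pairs.append(
            {
                "question": QUESTION_TEMPLATE.format(label=label),
                # Expand the short answer (the bare value) into a full sentence.
                "answer": ANSWER_TEMPLATE.format(label=label, value=value),
            }
        )
    return pairs


example = generate_qa_pairs({"Intel": "19.5%", "TSMC": "11.2%"})
for pair in example:
    print(pair["question"], "->", pair["answer"])
```

A rule-based generator like this is cheap and deterministic, which is why it pairs well with model-generated annotations such as chain-of-thought explanations.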
- Entirely or partially synthetic OCR datasets (annotation + images) for character-, word- and page-level English and Chinese OCR;
- A selection of curated annotations for publicly available table and document OCR datasets;
- An internally annotated table OCR dataset.
The Llama Nemotron VLM Dataset V1 was purpose-built to support enterprise use cases and production-grade AI development. It is ready for commercial use.
Data Example
Example 1: VQA_4, image: chartqa/train/png/multi_col_948.png. Shared under GPL-3.0 license.
Prompt: "What company ranked second in the microprocessor market in 2020? Please provide a detailed explanation for your answer."
Response:
Problem Restatement: Determine the company that ranked second in the microprocessor market in 2020 based on the provided data.
**Step-by-Step Process:**
1. Identify the Market Share Data for 2020:
- Intel: 19.5%
- TSMC: 11.2%
- Qualcomm: 10.2%
- SK Hynix: 7.7
- Broadcom: 7
- Samsung: 6%
- Nvidia: 4.6
- Sony: 4.6%
- Micron: 4.4%
2. Ranking the Companies Based on Market Share:
- Intel has the highest market share at 19.5%.
- TSMC follows with a market share of 11.2%
3. Conclusion:
- Since Intel has the highest market share, the company with the second-highest market share is TSMC.
**Final Answer:** TSMC
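The ranking step in the response above can be checked with a few lines of code. The sketch below copies the market share values listed in the response, treats every value as a percentage (an assumption, since some entries omit the percent sign), and sorts them in descending order.

```python
# Market share values as listed in the example response.
# Assumption: all values are percentages, including those printed without "%".
market_share_2020 = {
    "Intel": 19.5,
    "TSMC": 11.2,
    "Qualcomm": 10.2,
    "SK Hynix": 7.7,
    "Broadcom": 7.0,
    "Samsung": 6.0,
    "Nvidia": 4.6,
    "Sony": 4.6,
    "Micron": 4.4,
}

# Sort companies from highest to lowest share.
ranked = sorted(market_share_2020.items(), key=lambda item: item[1], reverse=True)

# Index 0 is the leader, index 1 is the runner-up.
second_company, second_share = ranked[1]
print(f"Second place: {second_company} ({second_share}%)")
```

The runner-up agrees with the final answer given in the response.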
Get Started
To summarize, NVIDIA has released a 3-million-sample dataset for OCR, visual question answering, and captioning tasks. Download the dataset from Hugging Face and use it for your projects. We’re excited to see what you build!