|
|
--- |
|
|
license: apache-2.0 |
|
|
pipeline_tag: image-to-text |
|
|
language: |
|
|
- en |
|
|
- fr |
|
|
- de |
|
|
- es |
|
|
- it |
|
|
- nl |
|
|
- pt |
|
|
- sv |
|
|
- da |
|
|
library_name: transformers |
|
|
tags: |
|
|
- ocr |
|
|
- document-understanding |
|
|
- vision-language |
|
|
- pdf |
|
|
- tables |
|
|
- forms |
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
<img src="lightonocr-banner.png" alt="LightOn OCR-1B Banner" width="400"/> |
|
|
</div> |
|
|
|
|
|
# LightOnOCR-1B-1025 |
|
|
|
|
|
Full BF16 version of the model. We recommend this variant for inference and further fine-tuning. |
|
|
|
|
|
**LightOnOCR-1B** is a compact, end-to-end vision–language model for Optical Character Recognition (OCR) and document understanding. It achieves state-of-the-art accuracy in its weight class while being several times faster and cheaper than larger general-purpose VLMs. |
|
|
|
|
|
[](https://colab.research.google.com/#fileId=https%3A//huggingface.co/lightonai/LightOnOCR-1B-1025/blob/main/notebook.ipynb) |
|
|
|
|
|
📝 **[Read the full blog post](https://huggingface.co/blog/lightonai/lightonocr/)** | 🚀 **[Try the demo](https://huggingface.co/spaces/lightonai/LightOnOCR-1B-Demo)** | 📓 **[Finetuning notebook](https://colab.research.google.com/drive/1WjbsFJZ4vOAAlKtcCauFLn_evo5UBRNa?usp=sharing)** |
|
|
|
|
|
**Highlights** |
|
|
|
|
|
* ⚡ **Speed:** 5× faster than dots.ocr, 2× faster than PaddleOCR-VL-0.9B, and 1.73× faster than DeepSeek-OCR |
|
|
* 💸 **Efficiency:** Processes 5.71 pages/s on a single H100 (~493k pages/day) for **<$0.01 per 1,000 pages** |
|
|
* 🧠 **End-to-End:** Fully differentiable, no external OCR pipeline |
|
|
* 🧾 **Versatile:** Handles tables, receipts, forms, multi-column layouts, and math notation |
|
|
* 🌍 **Compact variants:** 32k and 16k vocab options for European languages |
|
|
|
|
|
--- |
|
|
|
|
|
## Model Overview |
|
|
|
|
|
**LightOnOCR** pairs a Pixtral-based Vision Transformer encoder with a lightweight Qwen3-based text decoder, distilled from high-quality open VLMs. |
|
|
It is optimized for document parsing tasks, producing accurate, layout-aware text extraction from high-resolution pages. |
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
## Benchmarks |
|
|
|
|
|
| Model | ArXiv | Old Scans | Math | Tables | Multi-Column | Tiny Text | Base | Overall | |
|
|
| :----------------- | :---: | :-------: | :--: | :----: | :----------: | :-------: | :--: | :-----: | |
|
|
| [LightOnOCR-1B-1025](https://huggingface.co/lightonai/LightOnOCR-1B-1025) (151k vocab) | 81.4 | 71.6 | 76.4 | 35.2 | 80.0 | 88.7 | 99.5 | **76.1** | |
|
|
| [LightOnOCR-1B-32k](https://huggingface.co/lightonai/LightOnOCR-0.9B-32k-1025) (32k vocab) | 80.6 | 66.2 | 73.5 | 33.5 | 71.2 | 87.6 | 99.5 | **73.1** | |
|
|
| [LightOnOCR-1B-16k](https://huggingface.co/lightonai/LightOnOCR-0.9B-16k-1025) (16k vocab) | 82.3 | 72.9 | 75.3 | 33.5 | 78.6 | 85.1 | 99.8 | **75.4** | |
|
|
|
|
|
All benchmarks were evaluated with **vLLM** on olmOCR-Bench. |
|
|
|
|
|
--- |
|
|
|
|
|
## Installation |
|
|
|
|
|
```bash |
|
|
|
|
|
uv venv --python 3.12 --seed |
|
|
source .venv/bin/activate |
|
|
|
|
|
export VLLM_COMMIT=e88bdd60d9a25d985168c9f4a60ab10095236d7c |
|
|
uv pip install vllm \ |
|
|
'triton-kernels @ git+https://github.com/triton-lang/[email protected]#subdirectory=python/triton_kernels' \ |
|
|
--torch-backend=auto \ |
|
|
--extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT} \ |
|
|
--prerelease=allow |
|
|
|
|
|
uv pip install pypdfium2 pillow requests |
|
|
``` |
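
A quick way to confirm the pinned nightly wheel resolved correctly is to check the installed version; a minimal sanity check, nothing model-specific:

```python
# Verify the vLLM install; the exact version string depends on the pinned commit.
import vllm

print(vllm.__version__)
```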
|
|
|
|
|
## Start Server |
|
|
|
|
|
```bash |
|
|
vllm serve lightonai/LightOnOCR-1B-1025 \ |
|
|
--limit-mm-per-prompt '{"image": 1}' \ |
|
|
--async-scheduling |
|
|
``` |
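
Once the server is up, a request to the OpenAI-compatible `/v1/models` route (standard vLLM behavior) confirms the model is loaded; a minimal sketch:

```python
# Sanity check against the local vLLM server started above.
import requests

models = requests.get("http://localhost:8000/v1/models").json()
print([m["id"] for m in models["data"]])  # expect "lightonai/LightOnOCR-1B-1025"
```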
|
|
|
|
|
## PDF Inference |
|
|
|
|
|
```python |
|
|
import base64 |
|
|
import requests |
|
|
import pypdfium2 as pdfium |
|
|
import io |
|
|
|
|
|
ENDPOINT = "http://localhost:8000/v1/chat/completions" |
|
|
MODEL = "lightonai/LightOnOCR-1B-1025" |
|
|
|
|
|
# Download PDF from arXiv |
|
|
pdf_url = "https://arxiv.org/pdf/2412.13663" |
|
|
pdf_data = requests.get(pdf_url).content |
|
|
|
|
|
# Open PDF and convert first page to image |
|
|
pdf = pdfium.PdfDocument(pdf_data) |
|
|
page = pdf[0] |
|
|
# Render at 200 DPI (PDF points are 1/72 inch, so scale = 200/72 ≈ 2.78) |


pil_image = page.render(scale=200/72).to_pil() |
|
|
|
|
|
# Convert to base64 |
|
|
buffer = io.BytesIO() |
|
|
pil_image.save(buffer, format="PNG") |
|
|
image_base64 = base64.b64encode(buffer.getvalue()).decode('utf-8') |
|
|
|
|
|
# Make request |
|
|
payload = { |
|
|
"model": MODEL, |
|
|
"messages": [{ |
|
|
"role": "user", |
|
|
"content": [{ |
|
|
"type": "image_url", |
|
|
"image_url": {"url": f"data:image/png;base64,{image_base64}"} |
|
|
}] |
|
|
}], |
|
|
"max_tokens": 4096, |
|
|
"temperature": 0.2, |
|
|
"top_p": 0.9, |
|
|
} |
|
|
|
|
|
response = requests.post(ENDPOINT, json=payload) |
response.raise_for_status()  # surface HTTP errors early |


text = response.json()['choices'][0]['message']['content'] |


print(text) |
|
|
``` |
|
|
--- |
|
|
|
|
|
## Rendering and Preprocessing Tips |
|
|
|
|
|
* Render PDFs to **PNG** or **JPEG** at a target longest dimension of **1540px** |
|
|
* Maintain aspect ratio to preserve text geometry |
|
|
* Use one image per page; vLLM batches concurrent requests server-side (see the multi-page sketch below) |
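
The sketch below applies these tips to a whole document, assuming the vLLM server started earlier is running locally; `document.pdf` and the helper names are placeholders, not part of any LightOnOCR API:

```python
# A minimal multi-page sketch, assuming the vLLM server started above is
# running on localhost:8000. "document.pdf" and the helper names are
# placeholders, not part of any LightOnOCR API.
import base64
import io

import pypdfium2 as pdfium
import requests

ENDPOINT = "http://localhost:8000/v1/chat/completions"
MODEL = "lightonai/LightOnOCR-1B-1025"
TARGET_LONGEST_PX = 1540  # longest-side target from the tips above


def render_page(page, target_px=TARGET_LONGEST_PX):
    """Render a page so its longest side is ~target_px, preserving aspect ratio."""
    width_pts, height_pts = page.get_size()  # page size in PDF points (1/72 inch)
    scale = target_px / max(width_pts, height_pts)
    return page.render(scale=scale).to_pil()


def ocr_page(pil_image):
    """Send one rendered page to the server and return the extracted text."""
    buffer = io.BytesIO()
    pil_image.save(buffer, format="PNG")
    image_base64 = base64.b64encode(buffer.getvalue()).decode("utf-8")
    payload = {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [{
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{image_base64}"},
            }],
        }],
        "max_tokens": 4096,
        "temperature": 0.2,
        "top_p": 0.9,
    }
    response = requests.post(ENDPOINT, json=payload)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]


pdf = pdfium.PdfDocument("document.pdf")
print("\n\n".join(ocr_page(render_page(page)) for page in pdf))
```

Requests here are issued sequentially for clarity; sending them concurrently lets vLLM batch pages server-side.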
|
|
|
|
|
--- |
|
|
|
|
|
## Variants |
|
|
|
|
|
| Variant | Description | |
|
|
| :--------------------------------------------------------------------------------- | :-------------------------------------------- | |
|
|
| **[LightOnOCR-1B-1025](https://huggingface.co/lightonai/LightOnOCR-1B-1025)** | Full multilingual model (default) | |
|
|
| **[LightOnOCR-1B-32k](https://huggingface.co/lightonai/LightOnOCR-0.9B-32k-1025)** | Fastest pruned-vocabulary version (32k tokens) optimized for European languages | |
|
|
| **[LightOnOCR-1B-16k](https://huggingface.co/lightonai/LightOnOCR-0.9B-16k-1025)** | Most compact variant, with the smallest vocabulary (16k tokens) | |
|
|
|
|
|
--- |
|
|
|
|
|
## Fine-tuning |
|
|
|
|
|
**Transformers integration is coming soon for training and inference.** |
|
|
|
|
|
LightOnOCR is fully differentiable and supports: |
|
|
|
|
|
* LoRA fine-tuning (see the configuration sketch after this list) |
|
|
* Domain adaptation (receipts, scientific articles, forms, etc.) |
|
|
* Multilingual fine-tuning with task-specific corpora |
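
Since the Transformers integration is not yet released, exact module names below are assumptions; as a rough illustration, a LoRA configuration with PEFT might look like:

```python
# Hypothetical LoRA setup sketch; target module names are assumptions,
# pending the official Transformers integration.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,            # adapter rank
    lora_alpha=32,   # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed decoder projections
    task_type="CAUSAL_LM",
)
```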
|
|
|
|
|
📓 **[Finetuning notebook](https://colab.research.google.com/drive/1WjbsFJZ4vOAAlKtcCauFLn_evo5UBRNa?usp=sharing)** |
|
|
|
|
|
--- |
|
|
|
|
|
## Data |
|
|
|
|
|
Trained on a diverse large-scale PDF corpus covering: |
|
|
|
|
|
* Scientific papers, books, receipts, invoices, tables, forms, and handwritten text |
|
|
* Multiple languages (Latin alphabet dominant) |
|
|
* Real and synthetic document scans |
|
|
|
|
|
The dataset will be released under an open license. |
|
|
|
|
|
--- |
|
|
|
|
|
## License |
|
|
|
|
|
Apache License 2.0 |
|
|
|
|
|
--- |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@misc{lightonocr2025, |
|
|
title = {LightOnOCR-1B: End-to-End and Efficient Domain-Specific Vision-Language Models for OCR}, |
|
|
author = {Said Taghadouini and Baptiste Aubertin and Adrien Cavaillès}, |
|
|
year = {2025}, |
|
|
howpublished = {\url{https://huggingface.co/blog/lightonai/lightonocr}} |
|
|
} |
|
|
``` |