Abstract
Glyph renders long textual inputs as images and processes them with vision-language models, achieving substantial token compression while maintaining accuracy on long-context tasks.
Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In this work, we take a different perspective, visual context scaling, to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding. Our code and model are released at https://github.com/thu-coai/Glyph.
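For intuition, here is a minimal sketch of the core idea, not the paper's implementation: lay long text out on page images with Pillow and compare a rough text-token count against the visual-token count a VLM would consume for those pages. The page size, font, wrap width, patch size, and the 4-characters-per-token heuristic below are all illustrative assumptions; in Glyph, rendering knobs of this kind (dpi, font, spacing, layout) are what the LLM-driven genetic search tunes to balance accuracy against compression.

```python
# Hypothetical sketch: render long text onto page images and estimate the
# resulting token compression. Parameters are illustrative assumptions only.
from PIL import Image, ImageDraw, ImageFont
import textwrap

PAGE_W, PAGE_H = 1024, 1024   # assumed page resolution
LINE_HEIGHT = 16              # assumed line spacing in pixels
CHARS_PER_LINE = 160          # rough wrap width for the default bitmap font
PATCH = 28                    # assumed ViT patch size -> visual tokens per page

def render_pages(text: str) -> list:
    """Render plain text onto one or more white page images."""
    font = ImageFont.load_default()   # stand-in for a real document font
    lines = textwrap.wrap(text, width=CHARS_PER_LINE)
    lines_per_page = PAGE_H // LINE_HEIGHT
    pages = []
    for start in range(0, len(lines), lines_per_page):
        page = Image.new("RGB", (PAGE_W, PAGE_H), "white")
        draw = ImageDraw.Draw(page)
        y = 0
        for line in lines[start:start + lines_per_page]:
            draw.text((8, y), line, fill="black", font=font)
            y += LINE_HEIGHT
        pages.append(page)
    return pages

def estimated_compression(text: str) -> float:
    """Rough text-token count divided by visual-token count for the pages."""
    text_tokens = len(text) / 4   # ~4 characters per token heuristic
    visual_tokens_per_page = (PAGE_W // PATCH) * (PAGE_H // PATCH)
    visual_tokens = len(render_pages(text)) * visual_tokens_per_page
    return text_tokens / visual_tokens

if __name__ == "__main__":
    doc = "Long-context modeling matters for document understanding. " * 2000
    print(f"estimated compression: {estimated_compression(doc):.1f}x")
```

The exact ratio this prints depends entirely on the assumed rendering configuration; the 3-4x figure reported in the paper comes from configurations found by the genetic search, not from this toy setup.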
Community
🪶 Glyph: Scaling Context Windows via Visual-Text Compression
Glyph introduces a new paradigm for long-context LLMs through visual-text compression — rendering text as images and processing them with VLMs to boost information density.
🧩 Efficient compression:
Achieves 3–4× token reduction with minimal accuracy loss on long-context benchmarks.
⚡ Faster inference & training:
Up to 4× faster inference and 2× faster SFT training.
📏 Extreme compression capability:
Enables 128K-context models to handle 1M-token tasks through highly compact visual representations.
🌐 Open source:
Code and models are released at https://github.com/thu-coai/Glyph to foster vision-driven context scaling.
~3–4× token compression with comparable accuracy on long-context benchmarks; faster inference and training; scales to ~1M-token-level tasks via extreme compression.
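As a rough back-of-envelope check (my own arithmetic, not a figure from the paper): fitting ~1M text tokens into a 128K visual-token window implies an effective compression ratio of roughly 8×, above the typical 3–4× regime, which is why the paper frames this as extreme compression.

```python
# Back-of-envelope: compression needed to fit 1M text tokens into a
# 128K visual-token context window (illustrative arithmetic only).
text_tokens = 1_000_000
visual_context = 128_000
print(f"required compression: ~{text_tokens / visual_context:.1f}x")  # ~7.8x
```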
Primary domain: general document understanding and long-text processing rather than strictly OCR or converting arbitrary scanned documents into text; the image representation is a strategy for scaling long-context LLMs.
Similar ideas have appeared before, but this implementation is more practical for inference, training, and real-world tasks.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding (2025)
- CCF: A Context Compression Framework for Efficient Long-Sequence Language Modeling (2025)
- UniGist: Towards General and Hardware-aligned Sequence-level Long Context Compression (2025)
- ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution (2025)
- Seeing More, Saying More: Lightweight Language Experts are Dynamic Video Token Compressors (2025)
- Growing Visual Generative Capacity for Pre-Trained MLLMs (2025)
- ModernVBERT: Towards Smaller Visual Document Retrievers (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend