---
datasets:
- HuggingFaceTB/smollm-corpus
language:
- en
---
# Outlier-Safe Pre-Training
[📄 Paper](https://arxiv.org/abs/2506.19697) | [🤗 Model Collection](https://huggingface.co/collections/dmis-lab/outlier-safe-pre-training-osp-685bda10aa1e8a19fcb58ea8) | [💻 Code](https://github.com/dmis-lab/Outlier-Safe-Pre-Training)
## Introduction
Quantization plays a crucial role in deploying Large Language Models (LLMs) in resource-constrained environments. However, the presence of outlier features significantly hinders low-bit quantization. While many studies address this problem post hoc, working with models that have already been pre-trained, the importance of handling outliers during pre-training itself is often underestimated.
Our work, **Outlier-Safe Pre-Training (OSP)**, proposes a practical approach to training models that are robust to outliers from the start, without sacrificing performance or efficiency. Specifically, OSP focuses on the following goals:
1. 📈**Scaling to production-level training requirements**
Prior methods for quantization-friendly pre-training are often limited to small-scale experiments (e.g., models under 1B parameters or 100B tokens). In contrast, we train a 1.4B-parameter model on 1 trillion tokens, demonstrating that OSP is effective at production scale.
2. ⚡**Maintaining computational efficiency comparable to standard training**
A method that prevents outliers but significantly reduces efficiency is unlikely to gain adoption. OSP introduces only a ~2% slowdown while reducing GPU memory usage, making it appealing for those seeking to train quantization-friendly foundation models from scratch.
3. 🧩**Ensuring full compatibility with existing inference pipelines**
We prioritize compatibility with widely adopted inference frameworks such as vLLM and SGLang. Rather than introducing architectural changes that break compatibility, OSP preserves computational invariance, allowing models to be directly integrated into existing pipelines without additional effort.
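To illustrate this drop-in compatibility, below is a minimal sketch that serves the final OSP checkpoint through vLLM's standard `LLM` API. The model name comes from the checkpoints listed in the next section; the dtype and sampling settings are illustrative assumptions, not prescribed values.
```python
# Minimal sketch: serving the OSP checkpoint with an unmodified vLLM pipeline.
# Assumes the checkpoint uses a decoder-only architecture that vLLM already
# supports; adjust the dtype or sampling settings to match your setup.
from vllm import LLM, SamplingParams

llm = LLM(model="dmis-lab/OSP-1.4B-1T-Muon-SSNorm-EmbProj", dtype="bfloat16")
params = SamplingParams(temperature=0.8, max_tokens=128)

outputs = llm.generate(["Quantization matters because"], params)
print(outputs[0].outputs[0].text)
```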
## Model Checkpoints
### Final Models
The models were trained on 1 trillion tokens, following the pre-training recipe of [SmolLM](https://huggingface.co/blog/smollm). Specifically, training was conducted using the [smollm-corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus), a mixture of FineWeb-Edu, Cosmopedia, and Python-Edu.
- [🤗 OSP-1.4B-1T-Adam](https://huggingface.co/dmis-lab/OSP-1.4B-1T-Adam): Trained with the standard Adam optimizer, without any modifications.
- [🤗 OSP-1.4B-1T-Muon-SSNorm-EmbProj](https://huggingface.co/dmis-lab/OSP-1.4B-1T-Muon-SSNorm-EmbProj): Trained with the OSP framework. This is our final model.
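For quick experimentation, a minimal loading sketch with Hugging Face `transformers` is shown below; it assumes the checkpoint works with the stock `AutoModelForCausalLM` and `AutoTokenizer` classes (no custom modeling code), in line with the compatibility goal stated above.
```python
# Minimal sketch: loading the final OSP model with Hugging Face transformers.
# Assumes the checkpoint is compatible with the stock Auto classes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "dmis-lab/OSP-1.4B-1T-Muon-SSNorm-EmbProj"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("Outlier-safe pre-training helps because", return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```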
### Ablation Models
All ablation models were trained on 100B tokens. **Ex. Kurt.** is the excess kurtosis of hidden activations (higher values indicate heavier-tailed, outlier-prone distributions); **Had.** indicates whether a Hadamard rotation is applied at inference time; **4-4-4** reports average benchmark accuracy (Avg.) and perplexity (PPL) under 4-bit weight, activation, and KV-cache quantization.
| Model | Optimizer | SSNorm | EmbProj | Ex. Kurt. | Had. | 4-4-4 Avg. | 4-4-4 PPL |
|---|---|---|---|---|---|---|---|
| 🤗 OSP-1.4B-100B-Adam | Adam | ✗ | ✗ | 1818.56 | ✗ | 26.8 | 8e4 |
| | | | | | ✔ | 26.9 | 3e4 |
| 🤗 OSP-1.4B-100B-Muon-Only | Muon† (w/o Adam) | ✗ | ✗ | 361.35 | ✗ | 26.3 | 8e5 |
| | | | | | ✔ | 33.1 | 24.8 |
| 🤗 OSP-1.4B-100B-Muon | Muon | ✗ | ✗ | 1575.12 | ✗ | 29.0 | 1e4 |
| | | | | | ✔ | 38.4 | 15.8 |
| 🤗 OSP-1.4B-100B-Muon-SSNorm | Muon | ✔ | ✗ | 66.69 | ✗ | 36.4 | 44.2 |
| | | | | | ✔ | 38.3 | 34.1 |
| 🤗 OSP-1.4B-100B-Muon-EmbProj | Muon | ✗ | ✔ | 703.23 | ✗ | 30.4 | 114.6 |
| | | | | | ✔ | 36.2 | 22.3 |
| 🤗 OSP-1.4B-100B-Muon-SSNorm-EmbProj | Muon | ✔ | ✔ | 0.04 | ✗ | 37.5 | 19.6 |
| | | | | | ✔ | 38.9 | 13.5 |
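The *Ex. Kurt.* column can be reproduced in spirit with a short excess-kurtosis measurement over hidden activations. The sketch below is a generic illustration of the statistic (fourth standardized moment minus 3), not the exact measurement protocol from the paper; the probe text, layer choice, and pooling are assumptions.
```python
# Generic sketch: measuring excess kurtosis of hidden activations.
# Excess kurtosis = E[(x - mu)^4] / sigma^4 - 3; large positive values
# indicate heavy-tailed (outlier-prone) activation distributions.
# The probe sentence and per-layer flattening here are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def excess_kurtosis(x: torch.Tensor) -> float:
    x = x.float().flatten()
    mu, sigma = x.mean(), x.std(unbiased=False)
    return ((x - mu).pow(4).mean() / sigma.pow(4) - 3.0).item()

model_id = "dmis-lab/OSP-1.4B-1T-Muon-SSNorm-EmbProj"  # or the Adam baseline
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs, output_hidden_states=True).hidden_states

for layer_idx, h in enumerate(hidden_states):
    print(f"layer {layer_idx:2d}: excess kurtosis = {excess_kurtosis(h):8.2f}")
```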