Pico: A Lightweight Framework for Studying Language Model Learning Dynamics
Welcome to the pico-lm organization on Hugging Face! Pico is designed to demystify how language models learn by:
- Training a family of language models at different scales using a transparent, minimally opinionated codebase.
- Analyzing these models’ learning behaviors using checkpoints enriched with activations, gradients, and evaluation metrics.
For full documentation and code, visit our two main repositories:
- pico-train: Minimalist training framework for language models.
- pico-analyze: Tools for measuring and visualizing model learning dynamics across checkpoints.
This HuggingFace organization hosts our pre-trained models and datasets, while the GitHub repositories provide the code to train and analyze your own model suites from scratch.
All code and artifacts are licensed under a permissive Apache-2.0 license.
Pro Tip 🚀 : To learn more about these libraries and explore detailed tutorials, visit our official website picolm.io and get fully acquainted with the Pico ecosystem.
🤗 HuggingFace Resources (You Are Here)
1. Pre-trained Model Suite
Our complete suite of models trained with Pico, ranging from 11M to 570M parameters:
- pico-decoder-tiny (11M parameters)
- pico-decoder-small (65M parameters)
- pico-decoder-medium (181M parameters)
- pico-decoder-large (570M parameters)
🚧 Disclaimer: These models are still under construction. The models released here have been trained for 50,000 steps (corresponding to ~100B tokens); training completes at 200,000 steps.
🚧 Coming Soon: pico-decoder-xl (1B+ parameters). Watch this space or star our GitHub repositories for updates!
All models are trained on the pretokenized-dolma dataset. They all see the same training data at each training step, use the same optimization process, and share the same model architecture; the only difference between models is the size of their hidden dimension.
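For example, the HuggingFace-compatible weights can be loaded with the standard transformers Auto classes. The sketch below assumes the models live under the pico-lm namespace and work with `AutoModelForCausalLM`; check the individual model cards, and add `trust_remote_code=True` if a repository ships custom modeling code.

```python
# Minimal sketch: load a Pico decoder from the Hugging Face Hub.
# Assumption: models are hosted under the "pico-lm" namespace and
# work with the standard AutoModelForCausalLM interface.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pico-lm/pico-decoder-tiny"  # 11M-parameter model (assumed repo ID)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Language models learn by", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```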
In each model repository, we version control a checkpoint every 1,000 training steps. Each checkpoint contains:
- Weights and optimizer states (HuggingFace and Lightning Fabric-compatible versions)
- Model activations and gradients
- The batch of training data observed at the given training step
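Because these checkpoints are version controlled, an intermediate checkpoint can typically be pulled by pointing at the corresponding revision of the model repository. The sketch below is only illustrative: the branch naming scheme (e.g. `step-1000`) is an assumption, so list the repository's refs first to see what actually exists.

```python
# Sketch: fetch an intermediate checkpoint by Git revision.
# Assumption: each checkpoint lives on a branch named like "step-1000";
# the actual naming scheme may differ, so inspect the branch list first.
from huggingface_hub import list_repo_refs
from transformers import AutoModelForCausalLM

repo_id = "pico-lm/pico-decoder-tiny"  # assumed repo ID

# Inspect which checkpoint branches actually exist.
refs = list_repo_refs(repo_id)
print([branch.name for branch in refs.branches])

# Load the model weights as they were at a particular training step.
model = AutoModelForCausalLM.from_pretrained(repo_id, revision="step-1000")
```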
We visualize the learning process in our Wandb project.
Model Details:

| Aspect | Details |
|---|---|
| Architecture | Llama-style transformer (decoder-only)<br>RMSNorm normalization<br>RoPE (Rotary Positional Embeddings)<br>Multi-head attention with KV-cache<br>SwiGLU activation function |
| Sequence Length | 2048 |
| Batch Size | 1024 |
| Optimizer | AdamW |
| Learning Rate | 3e-4 (one-cycle warmup) |
| Gradient Clipping | 1.0 |
| Precision | Mixed precision training |
| Vocabulary Size | 50,280 |
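As a quick sanity check, the token counts quoted on this card follow directly from the table above (assuming the batch size of 1024 is counted in sequences, not tokens):

```python
# Back-of-the-envelope check of the token counts quoted above.
seq_len = 2048          # tokens per sequence
batch_size = 1024       # sequences per optimizer step (assumed unit)

tokens_per_step = seq_len * batch_size        # 2,097,152 (~2.1M tokens/step)
tokens_at_50k = tokens_per_step * 50_000      # ~105B -> the "~100B tokens" released so far
tokens_at_200k = tokens_per_step * 200_000    # ~420B -> matches the 420B-token pretokenized-dolma corpus

print(f"{tokens_per_step:,} tokens/step")
print(f"{tokens_at_50k / 1e9:.0f}B tokens after 50,000 steps")
print(f"{tokens_at_200k / 1e9:.0f}B tokens after 200,000 steps")
```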
2. Datasets
- pretokenized-dolma
  - 420B tokens of pre-processed, tokenized, and shuffled text extracted from the Dolma corpus
  - We use this dataset to train our model suite
- A smaller version of the pretokenized-dolma corpus, intended for quick experiments
- A tokenized and shuffled version of the Paloma evaluation corpus
  - The Paloma corpus was carefully curated to be disjoint from the Dolma corpus and spans a diverse set of text domains
  - We use this corpus to evaluate the perplexity of our models
- A sub-sampled version of the pretokenized-dolma corpus

All datasets are tokenized using the OLMo tokenizer.
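The sketch below shows one way to stream the pre-tokenized evaluation data and score a model on it. The repo IDs (`pico-lm/pico-decoder-tiny`, `pico-lm/pretokenized-paloma`), the split name, and the `input_ids` column name are all assumptions; check the dataset cards for the actual names.

```python
# Sketch: stream a pre-tokenized dataset and compute perplexity with a Pico model.
# Assumptions (verify against the actual dataset/model cards):
#   - datasets live under "pico-lm/..." and store token IDs in an "input_ids" column
#   - the suite's models load via AutoModelForCausalLM
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("pico-lm/pico-decoder-tiny")
model.eval()

# Stream the evaluation corpus so we don't download all of it.
paloma = load_dataset("pico-lm/pretokenized-paloma", split="train", streaming=True)

losses = []
for i, example in enumerate(paloma):
    if i >= 8:  # score a handful of sequences for illustration
        break
    input_ids = torch.tensor(example["input_ids"]).unsqueeze(0)
    with torch.no_grad():
        out = model(input_ids=input_ids, labels=input_ids)
    losses.append(out.loss.item())

# Exponentiate the mean cross-entropy loss to get perplexity.
perplexity = torch.exp(torch.tensor(losses).mean())
print(f"Perplexity over sampled sequences: {perplexity:.2f}")
```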
🔍 Citation
If you use Pico in academic or professional work, please cite it:
@software{pico2025,
author = {Diehl Martinez, Richard},
title = {Pico: A Lightweight Framework for Studying Language Model Learning Dynamics},
year = {2025},
url = {https://github.com/pico-lm}
}
Thanks for checking out Pico!
Star our GitHub repositories or join our community discussions to stay updated. If you find a bug or have questions, open an issue—contributions are welcome!