Pico Language Model


Pico: A Lightweight Framework for Studying Language Model Learning Dynamics

Welcome to the pico-lm organization on Hugging Face! Pico is designed to demystify how language models learn by:

  1. Training a family of language models at different scales using a transparent, minimally opinionated codebase.
  2. Analyzing these models’ learning behaviors using checkpoints enriched with activations, gradients, and evaluation metrics.

For full documentation and code, visit our two main repositories:

  • pico-train: Minimalist training framework for language models.
  • pico-analyze: Tools for measuring and visualizing model learning dynamics across checkpoints.

This Hugging Face organization hosts our pre-trained models and datasets, while the GitHub repositories provide the code to train and analyze your own model suites from scratch.
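
You can also browse everything hosted here programmatically. The short sketch below uses the huggingface_hub client and only assumes the organization name pico-lm:

    from huggingface_hub import HfApi

    api = HfApi()

    # List every model repository published under the pico-lm organization
    for model in api.list_models(author="pico-lm"):
        print("model:", model.id)

    # List every dataset repository published under the pico-lm organization
    for dataset in api.list_datasets(author="pico-lm"):
        print("dataset:", dataset.id)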

All code and artifacts are licensed under a permissive Apache-2.0 license.

Pro Tip 🚀 : To learn more about these libraries and explore detailed tutorials, visit our official website picolm.io and get fully acquainted with the Pico ecosystem.


🤗 HuggingFace Resources (You Are Here)

1. Pre-trained Model Suite

We train a complete suite of models with Pico, ranging from 11M to 570M parameters.

🚧 Disclaimer: These models are still under construction. The models released here have been trained for 50,000 steps (roughly 100B tokens); training will be complete at 200,000 steps.

🚧 Coming Soon: pico-decoder-xl (1B+ parameters). Watch this space or star our GitHub repository for updates!

All models are trained on the pretokenized-dolma dataset. They all see the same training data at each training step, use the same optimization process, and share the same model architecture; the only difference between models is the size of their hidden dimension.

In each model repository, we version control checkpoints every 1000 steps that contain:

  • Weights and optimizer states (HuggingFace and Lightning Fabric-compatible versions)
  • Model activations and gradients
  • The batch of training data observed at the given training step
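
As a quick illustration, here is a minimal sketch of loading one of these models with the transformers library. The repository name pico-lm/pico-decoder-tiny and the use of trust_remote_code are assumptions for illustration; check the individual model cards for the exact identifiers and loading instructions.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Hypothetical repository name; see the organization page for the released sizes.
    repo_id = "pico-lm/pico-decoder-tiny"

    tokenizer = AutoTokenizer.from_pretrained(repo_id)

    # Loads the latest checkpoint; pass `revision=...` to pin an earlier one
    # if the repository exposes intermediate checkpoints as branches or tags.
    model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

    prompt = "Language models learn by"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))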

We visualize the learning process on our Weights & Biases (WandB) dashboard.

Model Details:

  • Architecture: Llama-style transformer (decoder-only) with RMSNorm, RoPE (rotary positional embeddings), multi-head attention with KV-cache, and SwiGLU activation
  • Sequence Length: 2048
  • Batch Size: 1024
  • Optimizer: AdamW
  • Learning Rate: 3e-4 (one-cycle warmup)
  • Gradient Clipping: 1.0
  • Precision: Mixed precision training
  • Vocabulary Size: 50,280
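
The batch size and sequence length above make it easy to sanity-check the token counts quoted earlier; a minimal back-of-the-envelope sketch in Python, using only numbers from the table above:

    # Tokens processed per optimizer step
    batch_size = 1024        # sequences per step
    sequence_length = 2048   # tokens per sequence
    tokens_per_step = batch_size * sequence_length  # 2,097,152 (~2.1M)

    # ~105B tokens after the 50,000 released steps, ~419B after the full 200,000-step run
    print(f"{50_000 * tokens_per_step / 1e9:.1f}B tokens at 50k steps")    # 104.9B
    print(f"{200_000 * tokens_per_step / 1e9:.1f}B tokens at 200k steps")  # 419.4B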

2. Datasets

  1. pretokenized-dolma

    • 420B tokens of pre-processed, tokenized, and shuffled text extracted from the Dolma corpus
    • We use this dataset to train our model suite
  2. pretokenized-dolma-tinsy

    • A smaller version of the pretokenized-dolma corpus for quick experiments
  3. pretokenized-paloma

    • A tokenized and shuffled version of the Paloma evaluation corpus
    • The Paloma corpus was carefully curated to be disjoint from the Dolma corpus
    • We use this corpus to evaluate the perplexity of our models
  4. pretokenized-paloma-tinsy

    • A sub-sampled version of the pretokenized-paloma corpus

All datasets are tokenized using the OLMo tokenizer.
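
Because these corpora are large, streaming is usually the easiest way to take a look at them. Below is a minimal sketch with the datasets library; the column name input_ids is an assumption, so inspect the first record to confirm the actual schema:

    from datasets import load_dataset

    # Stream the pre-tokenized training corpus instead of downloading all 420B tokens
    stream = load_dataset("pico-lm/pretokenized-dolma", split="train", streaming=True)

    first = next(iter(stream))
    print(first.keys())               # inspect the schema of a single record
    # print(len(first["input_ids"]))  # assumed column of token IDs; uncomment once verified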


🔍 Citation

If you use Pico in academic or professional work, please cite it:

@software{pico2025,
    author = {Diehl Martinez, Richard},
    title = {Pico: A Lightweight Framework for Studying Language Model Learning Dynamics},
    year = {2025},
    url = {https://github.com/pico-lm}
}

Thanks for checking out Pico!
Star our GitHub repositories or join our community discussions to stay updated. If you find a bug or have questions, open an issue—contributions are welcome!