Pico Language Model


Pico: A Lightweight Framework for Studying Language Model Learning Dynamics

Welcome to the pico-lm organization on Hugging Face! Pico is designed to demystify how language models learn by:

  1. Training a family of language models at different scales using a transparent, minimally opinionated codebase.
  2. Analyzing these models’ learning behaviors using checkpoints enriched with activations, gradients, and evaluation metrics.

For full documentation and code, visit our two main repositories:

  • pico-train: Minimalist training framework for language models.
  • pico-analyze: Tools for measuring and visualizing model learning dynamics across checkpoints.

This Hugging Face organization hosts our pre-trained models and datasets, while the GitHub repositories provide the code to train and analyze your own model suites from scratch.
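
You can also browse everything hosted here programmatically. The short sketch below uses the huggingface_hub client and only assumes the organization name pico-lm:

    from huggingface_hub import HfApi

    api = HfApi()

    # List every model repository published under the pico-lm organization
    for model in api.list_models(author="pico-lm"):
        print("model:", model.id)

    # List every dataset repository published under the pico-lm organization
    for dataset in api.list_datasets(author="pico-lm"):
        print("dataset:", dataset.id)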

All code and artifacts are licensed under a permissive Apache-2.0 license.

Pro Tip 🚀 : To learn more about these libraries and explore detailed tutorials, visit our official website picolm.io and get fully acquainted with the Pico ecosystem.


🤗 HuggingFace Resources (You Are Here)

1. Pre-trained Model Suite

We train a complete suite of models with Pico, ranging from 11M to 570M parameters.

🚧 Disclaimer: These models are still under construction. The models released here have been trained for 50,000 steps (roughly 100B tokens); training will be complete at 200,000 steps.

🚧 Coming Soon: pico-decoder-xl (1B+ parameters). Watch this space or star our GitHub repository for updates!

All models are trained on the pretokenized-dolma dataset. They all see the same training data at each training step, use the same optimization process, and share the same model architecture; the only difference between models is the size of their hidden dimension.

In each model repository, we version control checkpoints every 1000 steps that contain:

  • Weights and optimizer states (HuggingFace and Lightning Fabric-compatible versions)
  • Model activations and gradients
  • The batch of training data observed at the given training step
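
As a quick illustration, here is a minimal sketch of loading one of these models with the transformers library. The repository name pico-lm/pico-decoder-tiny and the use of trust_remote_code are assumptions for illustration; check the individual model cards for the exact identifiers and loading instructions.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Hypothetical repository name; see the organization page for the released sizes.
    repo_id = "pico-lm/pico-decoder-tiny"

    tokenizer = AutoTokenizer.from_pretrained(repo_id)

    # Loads the latest checkpoint; pass `revision=...` to pin an earlier one
    # if the repository exposes intermediate checkpoints as branches or tags.
    model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

    prompt = "Language models learn by"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))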

We visualize the learning process on our Weights & Biases (WandB) dashboard.

Model Details:

  • Architecture: Llama-style transformer (decoder-only) with RMSNorm, RoPE (rotary positional embeddings), multi-head attention with KV-cache, and SwiGLU activation
  • Sequence Length: 2048
  • Batch Size: 1024
  • Optimizer: AdamW
  • Learning Rate: 3e-4 (one-cycle warmup)
  • Gradient Clipping: 1.0
  • Precision: Mixed precision training
  • Vocabulary Size: 50,280
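
The batch size and sequence length above make it easy to sanity-check the token counts quoted earlier; a minimal back-of-the-envelope sketch in Python, using only numbers from the table above:

    # Tokens processed per optimizer step
    batch_size = 1024        # sequences per step
    sequence_length = 2048   # tokens per sequence
    tokens_per_step = batch_size * sequence_length  # 2,097,152 (~2.1M)

    # ~105B tokens after the 50,000 released steps, ~419B after the full 200,000-step run
    print(f"{50_000 * tokens_per_step / 1e9:.1f}B tokens at 50k steps")    # 104.9B
    print(f"{200_000 * tokens_per_step / 1e9:.1f}B tokens at 200k steps")  # 419.4B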

2. Datasets

  1. pretokenized-dolma

    • 420B tokens of pre-processed, tokenized, and shuffled text extracted from the Dolma corpus
    • We use this dataset to train our model suite
  2. pretokenized-dolma-tinsy

    • A smaller version of the pretokenized-dolma corpus for quick experiments
  3. pretokenized-paloma

    • A tokenized and shuffled version of the Paloma evaluation corpus
    • The Paloma corpus was carefully curated to be disjoint from the Dolma corpus
    • We use this corpus to evaluate the perplexity of our models
  4. pretokenized-paloma-tinsy

    • A sub-sampled version of the pretokenized-paloma corpus

All datasets are tokenized using the OLMo tokenizer.
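
Because these corpora are large, streaming is usually the easiest way to take a look at them. Below is a minimal sketch with the datasets library; the column name input_ids is an assumption, so inspect the first record to confirm the actual schema:

    from datasets import load_dataset

    # Stream the pre-tokenized training corpus instead of downloading all 420B tokens
    stream = load_dataset("pico-lm/pretokenized-dolma", split="train", streaming=True)

    first = next(iter(stream))
    print(first.keys())               # inspect the schema of a single record
    # print(len(first["input_ids"]))  # assumed column of token IDs; uncomment once verified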


🔍 Citation

If you use Pico in academic or professional work, please cite it:

@software{pico2025,
    author = {Diehl Martinez, Richard},
    title = {Pico: A Lightweight Framework for Studying Language Model Learning Dynamics},
    year = {2025},
    url = {https://github.com/pico-lm}
}

Thanks for checking out Pico!
Star our GitHub repositories or join our community discussions to stay updated. If you find a bug or have questions, open an issue—contributions are welcome!