
Pico: A Modular Framework for Hypothesis-Driven Small Language Model Research

Welcome to Pico LM 👋, a research initiative dedicated to demystifying language model learning.

We create two complementary frameworks (pico-train and pico-analyze) for training and analyzing small to mid-scale language models (1M–1B parameters). Our mission is to provide a transparent, research-oriented workflow that illuminates how these models learn.

For full documentation and code, visit our two main repositories:

pico-train: Minimalist training framework for language models.
pico-analyze: Tools for measuring and visualizing model learning dynamics across checkpoints.

This HuggingFace organization hosts our pre-trained models and datasets, while the GitHub repositories provide the code to train and analyze your own model suites from scratch.

All code and artifacts are released under the permissive Apache-2.0 license.

Pro Tip 🚀 : To learn more about these libraries and explore detailed tutorials, visit our official website picolm.io and get fully acquainted with the Pico ecosystem.

🤗 HuggingFace Resources (You Are Here)

1. Pre-trained Model Suite

Our complete suite of models from 11M to 570M parameters trained with Pico:

pico-decoder-tiny (11M parameters)
pico-decoder-small (65M parameters)
pico-decoder-medium (181M parameters)
pico-decoder-large (570M parameters)

🚧 Disclaimer: These models are still under construction. The models released in this repository have been trained for 125,000 steps (corresponding to ~250B tokens). Training will conclude at 200,000 steps.

🚧 Coming Soon! pico-decoder-xl (1B+ parameters). Watch this space or star our GitHub repository for updates!

All models are trained on the pretokenized-dolma dataset. They all see the same training data at each training step, use the same optimization process, and share the same model architecture; the only difference between models is the size of their hidden dimension.

In each model repository, we version-control a checkpoint every 1,000 steps. Each checkpoint contains:

Weights and optimizer states (HuggingFace and Lightning Fabric-compatible versions)
Model activations and gradients
The batch of training data observed at the given training step
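As a rough illustration of working with this checkpoint schedule, the sketch below enumerates the steps at which a checkpoint is expected and shows one way a given step might be loaded. Several things here are assumptions rather than documented behavior: that the first checkpoint lands at step 1,000, that checkpoints are exposed as Git revisions, the hypothetical `step-{step}` revision naming, and the placeholder repository ID. Check the model repository for the actual revision names before relying on it.

```python
CHECKPOINT_INTERVAL = 1_000  # from the text: one checkpoint every 1,000 steps
STEPS_TRAINED = 125_000      # from the disclaimer: steps completed so far


def checkpoint_steps(last_step=STEPS_TRAINED, interval=CHECKPOINT_INTERVAL):
    """Return the steps at which a checkpoint is expected.

    Assumes the first checkpoint is saved at `interval` (not at step 0).
    """
    return list(range(interval, last_step + 1, interval))


def load_checkpoint(repo_id, step, revision_template="step-{step}"):
    """Load the model saved at `step` from a Hub repository.

    `revision_template` is a hypothetical naming scheme; replace it with
    the revision names actually used in the repository.
    """
    # Imported lazily so checkpoint_steps() works without transformers installed.
    from transformers import AutoModelForCausalLM

    if step not in checkpoint_steps():
        raise ValueError(f"No checkpoint expected at step {step}")
    revision = revision_template.format(step=step)
    # trust_remote_code may be needed if the repository ships custom model code.
    return AutoModelForCausalLM.from_pretrained(
        repo_id, revision=revision, trust_remote_code=True
    )


steps = checkpoint_steps()
print(len(steps), steps[0], steps[-1])  # 125 1000 125000
```

A call would then look like `load_checkpoint("<org>/pico-decoder-tiny", 50_000)`, where `<org>` stands in for the organization name on the Hub.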

We visualize the learning process on our Wandb dashboard.

Model Details:

Aspect	Details
Architecture	Llama-style decoder-only transformer; RMSNorm; RoPE (Rotary Positional Embeddings); multi-head attention with KV-cache; SwiGLU activation function
Sequence Length	2048
Batch Size	1024
Optimizer	AdamW
Learning Rate	3e-4 (one-cycle warmup)
Gradient Clipping	1.0
Precision	Mixed precision training
Vocabulary Size	50,280
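The batch size and sequence length above can be used to sanity-check the token counts quoted for the model suite. The sketch below assumes that the batch size counts sequences and that every sequence is packed to the full 2048 tokens; neither is stated explicitly in this card.

```python
BATCH_SIZE = 1024        # sequences per step (assumed unit)
SEQUENCE_LENGTH = 2048   # tokens per sequence (assumed fully packed)

TOKENS_PER_STEP = BATCH_SIZE * SEQUENCE_LENGTH


def tokens_seen(step):
    """Total tokens processed after `step` training steps."""
    return step * TOKENS_PER_STEP


print(TOKENS_PER_STEP)                # 2097152
print(tokens_seen(125_000) / 1e9)     # 262.144 (billions of tokens)
print(tokens_seen(200_000) / 1e9)     # 419.4304 (billions of tokens)
```

Under these assumptions, 125,000 steps correspond to roughly 262B tokens, in line with the ~250B figure quoted for the released models, and the planned 200,000 steps cover about 419B tokens, close to the 420B tokens in pretokenized-dolma.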

2. Datasets

pretokenized-dolma
- 420B tokens of pre-processed, tokenized, and shuffled text extracted from the Dolma corpus
- We use this dataset to train our model suite
pretokenized-dolma-tinsy
- A smaller version of the pretokenized-dolma corpus for quick experiments
pretokenized-paloma
- A tokenized and shuffled version of the Paloma evaluation corpus
- The Paloma corpus was carefully curated to be disjoint from the Dolma corpus
- We use this corpus to evaluate the perplexity of our models
pretokenized-paloma-tinsy
- A sub-sampled version of the pretokenized-paloma corpus

All datasets are tokenized using the OLMo Tokenizer.
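For readers unfamiliar with the metric, perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below shows that definition on hand-made numbers; it is a generic illustration, not the evaluation code used by Pico, and the function name and inputs are hypothetical.

```python
import math


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (in nats)."""
    if not token_nlls:
        raise ValueError("Need at least one token to compute perplexity")
    mean_nll = sum(token_nlls) / len(token_nlls)
    return math.exp(mean_nll)


# A model that spreads probability uniformly over 4 choices at every
# position assigns each token probability 1/4, i.e. an NLL of log(4).
uniform_nlls = [math.log(4)] * 8
print(perplexity(uniform_nlls))  # approximately 4
```

A lower value means the model assigns higher probability to the held-out text; a uniform guess over the full 50,280-token vocabulary would give a perplexity of about 50,280.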

🔍 Citation

If you use Pico in academic or professional work, please cite it:

@inproceedings{diehl-martinez-etal-2025-pico,
    title = "Pico: A Modular Framework for Hypothesis-Driven Small Language Model Research",
    author = "Diehl Martinez, Richard  and
      Africa, David Demitri  and
      Weiss, Yuval  and
      Salhan, Suchir  and
      Daniels, Ryan  and
      Buttery, Paula",
    editor = {Habernal, Ivan  and
      Schulam, Peter  and
      Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
}

Thanks for checking out Pico!
Star our GitHub repositories or join our community discussions to stay updated. If you find a bug or have questions, open an issue. Contributions are welcome!