OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens
Abstract
We present OLMoTrace, the first system that traces the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace finds and shows verbatim matches between segments of language model output and documents in the training text corpora. Powered by an extended version of infini-gram (Liu et al., 2024), our system returns tracing results within a few seconds. OLMoTrace can help users understand the behavior of language models through the lens of their training data. We showcase how it can be used to explore fact checking, hallucination, and the creativity of language models. OLMoTrace is publicly available and fully open-source.
Community
We release OLMoTrace, a tool that lets you trace the outputs of language models back to their full, multi-trillion-token training data in real time. We developed OLMoTrace to increase transparency and trust in LLMs.
On top of a standard chatbot experience, OLMoTrace highlights long pieces of LLM output that appear verbatim in the model’s training data, and shows the matching training documents. With OLMoTrace, you can see how LLMs may have learned to generate certain sequences of tokens. OLMoTrace is useful for fact checking ✅, understanding hallucinations 🎃, tracing LLM-generated “creative” expressions 🧑‍🎨, tracing reasoning capabilities 🧮, or just generally helping you understand why LLMs say certain things.
OLMoTrace is now available for the OLMo 2 and OLMoE families of models on the Ai2 Playground. We also open-source our code so that anyone can enable OLMoTrace with their own model’s training data.
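To make the core idea concrete, here is a minimal, illustrative sketch of verbatim span matching: finding maximal token spans of a model output that also occur contiguously in a training corpus. This is not the actual OLMoTrace or infini-gram implementation (which uses a suffix-array-style index over trillions of tokens); the function names, the greedy left-to-right matching strategy, and the `min_len` threshold are assumptions chosen for illustration.

```python
# Illustrative sketch only: find long verbatim overlaps between a model
# output and a small in-memory corpus. The real system replaces the
# brute-force `occurs` check with an infini-gram index over trillions
# of tokens; all names and parameters here are hypothetical.

def find_verbatim_spans(output_tokens, corpus_tokens, min_len=4):
    """Return (start, end) spans of output_tokens (end exclusive) that
    occur contiguously in corpus_tokens and are at least min_len tokens
    long, matched greedily left to right."""
    corpus = tuple(corpus_tokens)

    def occurs(span):
        # Brute-force containment check; a suffix array makes this fast.
        n = len(span)
        return any(corpus[i:i + n] == span for i in range(len(corpus) - n + 1))

    spans = []
    i = 0
    while i < len(output_tokens):
        # Extend the longest corpus match starting at position i.
        j = i
        while j < len(output_tokens) and occurs(tuple(output_tokens[i:j + 1])):
            j += 1
        if j - i >= min_len:
            spans.append((i, j))
            i = j  # skip past the matched span
        else:
            i += 1
    return spans
```

For example, matching the output "I saw the quick brown fox jumps today" against the corpus "the quick brown fox jumps over the lazy dog" highlights the five-token span "the quick brown fox jumps". The greedy strategy keeps the sketch short; it can miss overlapping matches that a full index-based search would report.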
Paper: https://allenai.org/papers/olmotrace
Blog: https://allenai.org/blog/olmotrace
Try OLMoTrace on Ai2 Playground: https://playground.allenai.org
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- SuperBPE: Space Travel for Language Models (2025)
- Llama-3-Nanda-10B-Chat: An Open Generative Large Language Model for Hindi (2025)
- REFIND at SemEval-2025 Task 3: Retrieval-Augmented Factuality Hallucination Detection in Large Language Models (2025)
- One ruler to measure them all: Benchmarking multilingual long-context language models (2025)
- MTLM: an Innovative Language Model Training Paradigm for ASR (2025)
- Language Models May Verbatim Complete Text They Were Not Explicitly Trained On (2025)
- PolyPrompt: Automating Knowledge Extraction from Multilingual Language Models with Dynamic Prompt Generation (2025)