
TokSuite – XGLM

Model Summary

TokSuite–XGLM is part of TokSuite, a controlled suite of language models designed to isolate and measure the impact of tokenizer choice on language model behavior.

This model is architecturally identical to the other 13 TokSuite models and differs only in its tokenizer. All TokSuite models share:

  • the same model architecture,
  • the same training data,
  • the same training budget,
  • and a shared initialization for overlapping vocabulary items.

As a result, any behavioral differences observed between TokSuite models can be attributed directly to tokenizer design, rather than confounding factors such as data scale, architecture, or optimization.


Tokenizer

  • Tokenizer: XGLM
  • Tokenization method: Unigram
  • Vocabulary size: 256,008
  • Out-of-vocabulary handling: Byte-fallback
  • Language coverage: Multilingual
  • Pretokenization source: SentencePiece

Processing details:

  • Numbers: Learned
  • Contractions: Learned
  • Unicode normalization: NFKC
  • Whitespace / boundary markers: Normalized
  • Zero-width characters: Normalized/removed
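
A minimal sketch of inspecting this tokenizer with Hugging Face `transformers`, using the upstream `facebook/xglm-564M` tokenizer as a stand-in for the one packaged with this checkpoint:

```python
# Sketch: inspect the XGLM tokenizer (SentencePiece Unigram, NFKC, byte-fallback).
# Assumes the upstream facebook/xglm-564M tokenizer; the TokSuite checkpoint may
# package it with different special tokens.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/xglm-564M")
print(len(tok))  # vocabulary size, roughly 256k entries

# Unigram segmentation of a multilingual sample (EN / TR / ZH)
for text in ["Tokenization drives behavior.", "Yarın görüşürüz.", "你好，世界"]:
    print(text, "->", tok.tokenize(text))

# If the SentencePiece model applies NFKC as documented above, visually
# different inputs (ligatures, fullwidth digits) collapse to the same tokens.
print(tok.tokenize("ﬁne １２３"))
```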

Why XGLM?

XGLM was included in TokSuite to represent large-vocabulary multilingual Unigram tokenization trained with SentencePiece. As described in the tokenizer selection rationale of the TokSuite paper, XGLM exemplifies a design point that combines a very large learned vocabulary with probabilistic subword segmentation.

Including XGLM enables TokSuite to study tokenizer behavior in settings where:

  • vocabulary capacity is maximized,
  • segmentation is determined by a Unigram language model,
  • and multilingual coverage is prioritized at the tokenizer level.

This makes XGLM a representative example of modern multilingual tokenizer design.


Model Architecture

  • Architecture: Decoder-only Transformer (Lingua's Llama-3.2-1B configuration)
  • Non-embedding parameters: ~1B
  • Context length: 4096 tokens
  • Framework: Meta Lingua
  • Initialization: Shared super-vocabulary initialization across TokSuite models

The architecture and training setup are identical across all TokSuite models; only the tokenizer differs.
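
A minimal loading sketch, assuming the released safetensors weights are usable through `transformers`' causal-LM auto classes (training used Meta Lingua, so a conversion step may be needed in practice); the hub id below is the one listed for this card:

```python
# Hedged usage sketch; not an officially documented loading path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "toksuite/supertoken_models-llama_facebook-xglm-564M"  # hub id for this model
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16)

inputs = tok("The capital of Italy is", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=16, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```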


Training Data

The model was trained on a multilingual corpus totaling approximately 100B tokens, consisting of:

  • English: 40B tokens from FineWeb-Edu
  • Multilingual: 60B tokens evenly distributed across:
    • Chinese (ZH)
    • Turkish (TR)
    • Italian (IT)
    • Farsi (FA)

You can find the pretraining dataset here: toksuite/toksuite_pretraining_data

All models are trained with a fixed token budget, so the amount of raw text (bytes/documents) seen during training depends on each tokenizer's compression efficiency: a tokenizer that emits more tokens per byte observes less raw text.
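
The effect of the fixed budget can be made concrete by measuring bytes per token. A rough sketch, using the upstream `facebook/xglm-564M` and `gpt2` tokenizers as stand-ins for two TokSuite tokenizers:

```python
# Rough sketch: at a fixed token budget, total raw text seen ≈ budget × bytes/token,
# so tokenizers that emit more tokens per byte cover less raw text.
from transformers import AutoTokenizer

sample = "Yarın sabah erken kalkıp İstanbul'a gideceğiz."  # Turkish example sentence

for name in ["facebook/xglm-564M", "gpt2"]:
    tok = AutoTokenizer.from_pretrained(name)
    n_tokens = len(tok(sample, add_special_tokens=False)["input_ids"])
    print(f"{name}: {n_tokens} tokens, "
          f"{len(sample.encode('utf-8')) / n_tokens:.2f} bytes/token")
```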


Training Procedure

  • Training steps: 100,000
  • Sequence length: 4096
  • Batch size: 256 sequences
  • Optimizer: AdamW
  • Peak learning rate: 1e-3
  • Schedule: Cosine decay with 2,000 warm-up steps
  • Weight decay: 0.1
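
For reference, an optimization recipe of this shape can be written in plain PyTorch as below; this is an illustrative sketch, not the Meta Lingua training code.

```python
# Illustrative sketch of the listed recipe: AdamW, peak LR 1e-3, weight decay 0.1,
# 2,000 linear warm-up steps, cosine decay over 100,000 steps.
import math
import torch

model = torch.nn.Linear(16, 16)  # placeholder standing in for the LM
total_steps, warmup_steps = 100_000, 2_000

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)

def lr_lambda(step: int) -> float:
    if step < warmup_steps:  # linear warm-up to the peak learning rate
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay toward 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# In the training loop: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```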

Evaluation

Canonical Benchmarks

The model was evaluated on standard base-language-model benchmarks:

  • HellaSwag
  • ARC
  • PIQA
  • XNLI


These evaluations verify that the model exhibits reasonable base language modeling behavior at its scale and training budget.
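
Comparable numbers can be obtained with, for example, EleutherAI's lm-evaluation-harness, assuming the checkpoint loads through `transformers`; the paper's exact harness version, task variants, and few-shot settings may differ.

```python
# Hedged sketch using the lm-evaluation-harness v0.4.x Python API;
# task names and settings here may differ from the paper's setup.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=toksuite/supertoken_models-llama_facebook-xglm-564M,dtype=bfloat16",
    tasks=["hellaswag", "arc_easy", "arc_challenge", "piqa", "xnli"],
)
print(results["results"])
```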

TokSuite Robustness Benchmark

TokSuite–XGLM is evaluated on the TokSuite multilingual robustness benchmark, which probes real-world perturbations including:

  • orthographic and spelling errors,
  • diacritics presence/absence,
  • keyboard and input-method noise,
  • Unicode styling and homoglyphs,
  • OCR and spacing artifacts,
  • LaTeX and STEM formatting.

Tokenization Robustness under Multilingual Text Perturbations
Values represent relative performance drop, computed as (Acc_clean − Acc_perturbed) / Acc_clean, where lower values indicate greater robustness.

Perturbation types include:

  • Input: non-native keyboard input and romanization
  • Diacr.: optional diacritics
  • Orth. & Gram.: orthographic and grammatical errors
  • Morph: morphological variations including derivations, inflections, and contractions
  • Noise: homoglyph substitutions, OCR artifacts, typos, and spacing errors
  • LaTeX: LaTeX-style mathematical formatting
  • STEM: scientific diagrams and notational conventions
  • Unic.: Unicode styling characters

NEN denotes non-English inputs and EN denotes English inputs. The Avg column reports the average relative performance drop across all perturbation categories.

| Model | Input (NEN) | Diacr. (NEN) | Orth. & Gram. (EN) | Orth. & Gram. (NEN) | Morph (EN) | Morph (NEN) | Noise (EN) | Noise (NEN) | LaTeX (EN) | STEM (EN) | Unic. (EN) | Avg ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TokenMonster | 0.23 | 0.33 | 0.08 | 0.01 | 0.23 | -0.07 | 0.10 | 0.18 | 0.21 | 0.10 | 0.51 | 0.17 |
| XGLM | 0.34 | 0.49 | 0.10 | 0.11 | 0.25 | 0.07 | 0.12 | 0.22 | 0.29 | 0.29 | 0.11 | 0.22 |
| BLOOM | 0.30 | 0.34 | 0.13 | 0.07 | 0.18 | 0.11 | 0.18 | 0.18 | 0.24 | 0.11 | 0.57 | 0.22 |
| ByT5 | 0.30 | 0.44 | 0.04 | 0.06 | 0.27 | 0.04 | 0.14 | 0.18 | 0.17 | 0.29 | 0.53 | 0.22 |
| Comma | 0.28 | 0.43 | 0.05 | 0.07 | 0.18 | 0.00 | 0.11 | 0.20 | 0.23 | 0.29 | 0.61 | 0.22 |
| mBERT | 0.33 | 0.44 | 0.11 | 0.11 | 0.23 | 0.06 | 0.18 | 0.22 | 0.14 | 0.22 | 0.61 | 0.24 |
| GPT-4o | 0.30 | 0.51 | 0.08 | 0.05 | 0.21 | 0.05 | 0.16 | 0.19 | 0.24 | 0.33 | 0.55 | 0.24 |
| GPT-2 | 0.34 | 0.46 | 0.07 | 0.10 | 0.25 | 0.06 | 0.14 | 0.21 | 0.24 | 0.35 | 0.53 | 0.25 |
| Phi-3 | 0.33 | 0.46 | 0.16 | 0.09 | 0.27 | 0.08 | 0.17 | 0.21 | 0.24 | 0.22 | 0.55 | 0.25 |
| Gemma-2 | 0.32 | 0.42 | 0.14 | 0.15 | 0.24 | 0.03 | 0.16 | 0.25 | 0.22 | 0.36 | 0.57 | 0.26 |
| Qwen-3 | 0.36 | 0.42 | 0.14 | 0.11 | 0.25 | 0.06 | 0.16 | 0.23 | 0.26 | 0.29 | 0.57 | 0.26 |
| Llama-3.2 | 0.33 | 0.55 | 0.11 | 0.10 | 0.25 | 0.08 | 0.15 | 0.24 | 0.17 | 0.30 | 0.59 | 0.26 |
| Aya | 0.31 | 0.46 | 0.14 | 0.10 | 0.22 | 0.03 | 0.19 | 0.25 | 0.21 | 0.38 | 0.58 | 0.26 |
| Tekken | 0.33 | 0.47 | 0.18 | 0.03 | 0.31 | 0.10 | 0.14 | 0.21 | 0.27 | 0.43 | 0.54 | 0.27 |
| Avg | 0.31 | 0.44 | 0.11 | 0.08 | 0.24 | 0.04 | 0.15 | 0.21 | 0.22 | 0.28 | 0.53 | 0.24 |
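
The relative-drop metric used in the table is simple to compute; a tiny helper, shown with hypothetical accuracy values (not numbers from the paper):

```python
def relative_drop(acc_clean: float, acc_perturbed: float) -> float:
    """(Acc_clean - Acc_perturbed) / Acc_clean; lower means more robust."""
    return (acc_clean - acc_perturbed) / acc_clean

# Hypothetical accuracies, for illustration only:
print(round(relative_drop(0.60, 0.53), 2))  # 0.12
```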

Tokenizer-Specific Findings

In TokSuite evaluations, the XGLM tokenizer exhibits:

  • Strong robustness to:
    • Unicode styling and formatting perturbations,
    • visually similar homoglyph characters,
    • script-level variations across languages.

These strengths stem largely from aggressive Unicode normalization, which reduces sensitivity to superficial formatting changes.

However, this design introduces notable weaknesses:

  • High fragility on:
    • LaTeX-style mathematical expressions,
    • STEM content requiring precise symbol placement,
    • structurally meaningful whitespace and formatting.
  • Lossy preprocessing, where normalization removes distinctions that are semantically critical in technical domains.

This highlights a key trade-off: normalization improves robustness to stylistic noise but can destroy structural information essential for technical reasoning.
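
This trade-off is easy to reproduce with Python's standard `unicodedata` module: NFKC folds styled and compatibility characters (which helps against stylistic noise) but also collapses superscripts, subscripts, and styled math letters that carry meaning in LaTeX/STEM text.

```python
# NFKC normalization: robust to styling, lossy for technical notation.
import unicodedata

def nfkc(s: str) -> str:
    return unicodedata.normalize("NFKC", s)

# Helpful: styled Unicode collapses to plain ASCII, so noisy variants look alike.
print(nfkc("𝗧𝗼𝗸𝗦𝘂𝗶𝘁𝗲"))        # -> "TokSuite" (styled math letters)
print(nfkc("ﬁnal ｓｃｏｒｅ"))    # -> "final score" (ligature + fullwidth)

# Harmful: distinctions that matter in STEM text are destroyed.
print(nfkc("x² + x₂"))            # -> "x2 + x2" (exponent vs. subscript collapse)
print(nfkc("ℝ² → ℝ"))             # -> "R2 → R" (double-struck set symbols flattened)
```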


Intended Use

This model is intended for:

  • research on tokenization and robustness,
  • multilingual NLP analysis,
  • controlled ablation studies,
  • benchmarking tokenizer behavior under noise.

It is not instruction-tuned, aligned, or optimized for deployment.


Limitations

  • Trained on a limited set of five languages.
  • Not optimized for instruction following or dialogue.
  • Fixed token budget constrains exposure to raw text depending on tokenization efficiency.
  • Intended strictly for research purposes.

Ethical Considerations

TokSuite models are released strictly for research purposes.
They inherit biases present in large-scale web data and should not be used in high-stakes or sensitive applications without additional alignment and evaluation.


Citation

If you use this model, please cite:

@article{toksuite2025,
  title={TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior},
  author={Altıntaş, Gul Sena and Ehghaghi, Malikeh and Lester, Brian and Liu, Fengyuan and Zhao, Wanru and Ciccone, Marco and Raffel, Colin},
  journal={arXiv preprint arXiv:2512.20757},
  year={2025},
  url={https://arxiv.org/abs/2512.20757}
}