kl3m-004-correction-001

kl3m-004-correction-001 is a small, ~500M parameter language model designed to correct common typing, spelling, OCR, and formatting errors in English text, especially in the financial and legal domains.

Notably, this model was trained with the alea-institute/kl3m-004-char-8k-cased tokenizer, a BPE tokenizer built with a 3-character maximum token length. This character-level tokenization enables the model to handle character-level corrections and transformations effectively.
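
For illustration, the character-level behavior can be inspected directly with the tokenizer (a minimal sketch using the standard transformers API; the exact token boundaries depend on the trained vocabulary):

from transformers import AutoTokenizer

# Load the character-level tokenizer used by this model
tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-004-char-8k-cased")

# Each token covers at most 3 characters, so even noisy strings
# are split into short, recoverable pieces
print(tokenizer.tokenize("Tne Uni+ed 5tates"))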

Model Details

  • Architecture: LlamaForCausalLM
  • Size: 478.2M parameters
  • Hidden Size: 1024
  • Layers: 32
  • Attention Heads: 16
  • Key-Value Heads: 16
  • Intermediate Size: 4096
  • Max Position Embeddings: 512
  • Context Window: 512 tokens (no RoPE)
  • Language(s): Primarily English
  • Tokenizer: kl3m-004-char-8k-cased BPE tokenizer (8,192 tokens, 1-3 characters each)
  • Training Objective: Next token prediction with special tokens for correction
  • Developed by: ALEA Institute
  • License: CC-BY 4.0
  • Hardware Requirements: Runs in real time in fp32 on CPU or consumer NVIDIA/AMD GPUs
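
As a quick check, these details can be read from the published model configuration (a minimal sketch using the standard transformers API; expected values are those listed above):

from transformers import AutoConfig

# Fetch the configuration without downloading the model weights
config = AutoConfig.from_pretrained("alea-institute/kl3m-004-correction-001")

print(config.model_type)               # expected: "llama"
print(config.hidden_size)              # expected: 1024
print(config.num_hidden_layers)        # expected: 32
print(config.num_attention_heads)      # expected: 16
print(config.max_position_embeddings)  # expected: 512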

Use Cases

kl3m-004-correction-001 is particularly effective for:

  • Correcting common typing or spelling errors
  • Correcting common OCR errors
  • Correcting common formatting errors
  • Text normalization
  • Character-level transformations

Key Features

  • Character-Level Processing: Utilizes a character-based tokenizer (1-3 chars per token) for fine-grained text manipulation
  • Clean Training Data: Built on the dataset originally referred to as the Kelvin Legal DataPack, ensuring all training data is ethically sourced and legally permissible
  • Low Toxicity: Empirically lower toxicity and bias
  • Enterprise Focus: Specifically designed for legal, regulatory, and financial workflows
  • Efficient Deployment: Optimized for real-time inference on consumer hardware

Usage

Prompt the model with the original text followed by the <|sep|> token, then generate until the stop token (<|end|>) is produced. The transformers pipeline API can handle this for you.
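
If you prefer not to use pipeline, the following is a minimal sketch of the same prompt format with the model and tokenizer loaded directly (greedy decoding; it assumes <|end|> is registered as a special token, as listed under Special Tokens below):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "alea-institute/kl3m-004-correction-001"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Original text followed by the <|sep|> separator
prompt = "Tne Uni+ed 5tates is nct responsib|e for 5uch pr0duction<|sep|>"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate until the <|end|> stop token is produced
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    eos_token_id=tokenizer.convert_tokens_to_ids("<|end|>"),
)

# Decode only the newly generated tokens (the corrected text)
correction = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(correction)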

Deterministic Correction

In many situations, deterministic correction (i.e., greedy decoding of the most probable token sequence) is sufficient.

from transformers import pipeline
p = pipeline('text-generation', 'alea-institute/kl3m-004-correction-001', device='cpu')

text = "Tne Uni+ed 5tates is nct responsib|e for 5uch pr0duction"

correction = p(text + "<|sep|>", max_new_tokens=512, return_full_text=False)[0]['generated_text']

# Output: The United States is not responsible for such production

Sampling with Frequency Weighting

In other situations, it can be useful to generate multiple corrections with a sampler and evaluate the distribution. For example:

  • using a string or token-based distance metric to score or rank corrections (see the re-ranking sketch after the example below)
  • showing multiple suggestions to a user with frequency-weighted order

from transformers import pipeline
from collections import Counter

p = pipeline('text-generation', 'alea-institute/kl3m-004-correction-001', device='cuda')

text = "Tne Uni+ed 5tates is nct responsib|e for 5uch pr0duction"

corrections = Counter(
  [
    g['generated_text']
    for g in p(
      text + "<|sep|>",
      max_new_tokens=512,
      return_full_text=False,
      temperature=0.5,
      # top_p, top_k, custom sampler, etc.
      do_sample=True,
      num_return_sequences=10
    )
  ]
).most_common(3)

# Output: [('The United States is not responsible for such production', 7), ('the United States is not responsible for such production', 3)]
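
Building on the first option above, here is a minimal sketch of re-ranking the sampled corrections by edit similarity to the input using only the standard library (difflib); the tie-breaking scheme is illustrative:

import difflib

# `text` and `corrections` come from the sampling example above

def similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; higher means fewer character-level edits
    return difflib.SequenceMatcher(None, a, b).ratio()

# Rank by sample frequency, breaking ties by closeness to the original input
ranked = sorted(
    corrections,
    key=lambda item: (item[1], similarity(text, item[0])),
    reverse=True,
)
print(ranked[0][0])  # best candidate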

Generation Parameters

The model supports various parameters to control the generation process:

  • temperature: Controls randomness (lower = more deterministic)
  • top_p: Nucleus sampling parameter (lower = more focused)
  • top_k: Limits vocabulary selection to top k tokens
  • max_new_tokens: Maximum number of tokens to generate
  • do_sample: Whether to use sampling vs. greedy decoding
  • num_return_sequences: Number of different sequences to generate

Training

This model was originally trained for 3 days on a single RTX 3090; a larger ~3B parameter MoE version is pending release.

Training Data

This model was trained on a dataset generated with the KL3M data collection and the alea-data-generator library, which can create realistic synthetic samples using traditional (non-generative) techniques.
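
For intuition, the following is a highly simplified sketch of how (noisy, clean) training pairs can be produced with traditional character-level corruption; it is illustrative only and does not reflect the actual alea-data-generator API:

import random

# Illustrative OCR/typing confusions; not the library's actual rules
CONFUSIONS = {"o": "0", "s": "5", "l": "|", "t": "+", "h": "n"}

def corrupt(text: str, rate: float = 0.1) -> str:
    # Randomly swap characters for common confusions at the given rate
    return "".join(
        CONFUSIONS.get(c.lower(), c) if random.random() < rate else c
        for c in text
    )

clean = "The United States is not responsible for such production"
noisy = corrupt(clean)
# A training sample then pairs the two: noisy + "<|sep|>" + clean + "<|end|>"
print(noisy)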

The source code to retrieve and process this dataset is available here: https://github.com/alea-institute/kl3m-data

Some pre-tokenized subsets of the KL3M data collection are available on Hugging Face: https://huggingface.co/datasets?sort=most_rows&search=kl3m-data

The complete raw data is currently available upon request via S3 under a Requester Pays model. We are actively working toward a zero-cost distribution model as soon as we can obtain additional support.

Special Tokens

kl3m-004-correction-001 uses the following special tokens from the kl3m-004-char-8k-cased tokenizer:

  • <|start|>: Token ID 0
  • <|end|>: Token ID 1
  • <|pad|>: Token ID 2
  • <|unk|>: Token ID 3
  • <|sep|>: Token ID 4 (used as a separator between the input text and corrected output)
  • <|cls|>: Token ID 5
  • <|mask|>: Token ID 6
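
These IDs can be confirmed directly from the tokenizer (a quick check using the standard transformers API):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-004-char-8k-cased")

# Map each special token to its ID; expected values are listed above
for token in ["<|start|>", "<|end|>", "<|pad|>", "<|unk|>", "<|sep|>", "<|cls|>", "<|mask|>"]:
    print(token, tokenizer.convert_tokens_to_ids(token))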

Intended Usage

This model is intended for use in:

  • Text preprocessing pipelines for legal and financial documents
  • OCR post-processing for document digitization
  • Autocorrection features in text editors
  • Data cleaning workflows
  • Quality assurance for document management systems

Limitations

  • Limited to a 512 token context window
  • Primarily focused on English language corrections
  • May not handle complex semantic errors or context-dependent corrections
  • Not designed for translation or content generation
  • Limited to character-level and simple syntax corrections

Source

https://github.com/alea-institute/kl3m-model-research

Citation

@misc{kl3m-004-correction-001,
  author = {ALEA Institute},
  title = {kl3m-004-correction-001: A Character-Level Text Correction Model},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/alea-institute/kl3m-004-correction-001}}
}

@article{bommarito2025kl3m,
  title={KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications},
  author={Bommarito, Michael J and Katz, Daniel Martin and Bommarito, Jillian},
  journal={arXiv preprint arXiv:2503.17247},
  year={2025}
}

License

Model weights are released under the CC-BY 4.0 License.

Contact

The KL3M model family is maintained by the ALEA Institute; please reach out for technical support, collaboration opportunities, or general inquiries.
