# CharBoundary Medium (default) Model
This is the medium (default) model for the CharBoundary library (v0.5.0), a fast character-based sentence and paragraph boundary detection system optimized for legal text.
## Model Details
- Size: medium (default)
- Model Size: 13.0 MB (SKOPS compressed)
- Memory Usage: 1,897 MB at runtime
- Training Data: ~500,000 samples of legal text from the KL3M dataset
- Model Type: Random Forest (64 trees, max depth 20)
- Format: scikit-learn model (serialized with skops)
- Task: Character-level boundary detection for text segmentation
- License: MIT
- Throughput: ~587K characters/second
## Usage
**Important:** When loading models from the Hugging Face Hub, you must set `trust_model=True` to allow loading custom class types.

**Security note:** The ONNX model variants are recommended for security-sensitive environments because they do not require bypassing skops security measures with `trust_model=True`. See the [ONNX version](https://huggingface.co/alea-institute/charboundary-medium-onnx) for a safer alternative.
```python
# pip install charboundary
from huggingface_hub import hf_hub_download
from charboundary import TextSegmenter

# Download the model
model_path = hf_hub_download(
    repo_id="alea-institute/charboundary-medium",
    filename="model.pkl",
)

# Load the model (trust_model=True is required when loading from external sources)
segmenter = TextSegmenter.load(model_path, trust_model=True)

# Segment text into sentences
text = "This is a test sentence. Here's another one!"
sentences = segmenter.segment_to_sentences(text)
print(sentences)
# Output: ['This is a test sentence.', " Here's another one!"]

# Segment into character spans
sentence_spans = segmenter.get_sentence_spans(text)
print(sentence_spans)
# Output: [(0, 24), (24, 44)]
```
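Because the spans are plain `(start, end)` character offsets into the original string, slicing the text with them reproduces the sentences exactly, with no characters lost. This snippet verifies that property using the example output above (plain Python, no model download required):

```python
# Spans are half-open (start, end) offsets into the original string.
text = "This is a test sentence. Here's another one!"
spans = [(0, 24), (24, 44)]  # as returned by get_sentence_spans above

# Slicing with each span recovers the sentences...
sentences = [text[start:end] for start, end in spans]
print(sentences)
# Output: ['This is a test sentence.', " Here's another one!"]

# ...and concatenating the slices reconstructs the input exactly.
assert "".join(text[start:end] for start, end in spans) == text
```

This offset-preserving behavior is what makes span output convenient for retrieval pipelines, where segment positions must map back to the source document.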
## Performance
The model uses a character-based random forest classifier with the following configuration:
- Window Size: 5 characters before, 3 characters after potential boundary
- Accuracy: 0.9980
- F1 Score: 0.7790
- Precision: 0.7570
- Recall: 0.9910
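The asymmetric window (more left context than right) can be pictured with a small sketch. This is illustrative only: `char_window` is a hypothetical helper, not part of the charboundary API, and the library's real feature set is internal and richer than raw characters.

```python
def char_window(text: str, i: int, before: int = 5, after: int = 3) -> str:
    """Return the characters around position i in an asymmetric window
    (5 before, 3 after by default), padding with '_' where the window
    runs past the edges of the text. Illustrative sketch only."""
    left = text[max(0, i - before):i].rjust(before, "_")
    right = text[i + 1:i + 1 + after].ljust(after, "_")
    return left + text[i] + right

# Candidate boundary at the period (index 3): 5 chars before, 3 after.
print(char_window("End. Next", 3))
# Output: __End. Ne
```

A classifier then sees each candidate boundary character together with this fixed-width context, which is what makes character-level detection fast: no tokenization pass is needed.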
### Dataset-specific Performance
| Dataset | Precision | F1 | Recall |
|---|---|---|---|
| ALEA SBD Benchmark | 0.631 | 0.722 | 0.842 |
| SCOTUS | 0.938 | 0.775 | 0.661 |
| Cyber Crime | 0.961 | 0.853 | 0.767 |
| BVA | 0.957 | 0.875 | 0.806 |
Intellectual Property | 0.948 | 0.889 | 0.837 |
## Available Models
CharBoundary comes in three sizes, balancing accuracy and efficiency:
| Model | Format | Size (MB) | Memory (MB) | Throughput (chars/sec) | F1 Score |
|---|---|---|---|---|---|
| Small | SKOPS / ONNX | 3.0 / 0.5 | 1,026 | ~748K | 0.773 |
| Medium | SKOPS / ONNX | 13.0 / 2.6 | 1,897 | ~587K | 0.779 |
| Large | SKOPS / ONNX | 60.0 / 13.0 | 5,734 | ~518K | 0.782 |
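As a rough guide to the throughput column, the back-of-envelope calculation below (plain Python, using the figures from the table above) estimates wall-clock time to segment a 10-million-character corpus with each size:

```python
# Approximate time to segment a 10M-character corpus at each model's
# reported throughput (characters per second, from the table above).
throughput = {"small": 748_000, "medium": 587_000, "large": 518_000}
doc_chars = 10_000_000

for name, cps in throughput.items():
    print(f"{name}: ~{doc_chars / cps:.1f} s")
# Output:
# small: ~13.4 s
# medium: ~17.0 s
# large: ~19.3 s
```

The small model is roughly 1.4x faster than the large one for about a one-point drop in F1, so the medium model is the default trade-off.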
## Paper and Citation
This model is part of the research presented in the following paper:
```bibtex
@article{bommarito2025precise,
  title={Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary},
  author={Bommarito, Michael J and Katz, Daniel Martin and Bommarito, Jillian},
  journal={arXiv preprint arXiv:2504.04131},
  year={2025}
}
```
For more details on the model architecture, training, and evaluation, please see:
- Paper: "Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary"
- CharBoundary GitHub repository
- Annotated training data
## Contact
This model is developed and maintained by the ALEA Institute.
For technical support, collaboration opportunities, or general inquiries:
- GitHub: https://github.com/alea-institute/kl3m-model-research
- Email: [email protected]
- Website: https://aleainstitute.ai
For any questions, please contact the ALEA Institute at [email protected] or open an issue on this model repository or on GitHub.