# CharBoundary Medium (default) Model
This is the medium (default) model for the CharBoundary library (v0.5.0), a fast character-based sentence and paragraph boundary detection system optimized for legal text.
## Model Details
- Size: medium (default)
- Model Size: 13.0 MB (SKOPS compressed)
- Memory Usage: 1,897 MB at runtime
- Training Data: ~500,000 samples of legal text from the KL3M dataset
- Model Type: Random Forest (64 trees, max depth 20)
- Format: scikit-learn model (serialized with skops)
- Task: Character-level boundary detection for text segmentation
- License: MIT
- Throughput: ~587K characters/second
## Usage
**Important:** When loading models from the Hugging Face Hub, you must set `trust_model=True` to allow loading custom class types.

**Security note:** The ONNX model variants are recommended for security-sensitive environments because they do not require bypassing skops security measures with `trust_model=True`. See the [ONNX version](https://huggingface.co/alea-institute/charboundary-medium-onnx) for a safer alternative.
```python
# pip install charboundary
from huggingface_hub import hf_hub_download
from charboundary import TextSegmenter

# Download the model
model_path = hf_hub_download(
    repo_id="alea-institute/charboundary-medium",
    filename="model.pkl",
)

# Load the model (trust_model=True is required when loading from external sources)
segmenter = TextSegmenter.load(model_path, trust_model=True)

# Segment text into sentences
text = "This is a test sentence. Here's another one!"
sentences = segmenter.segment_to_sentences(text)
print(sentences)
# Output: ['This is a test sentence.', " Here's another one!"]

# Segment into character spans
sentence_spans = segmenter.get_sentence_spans(text)
print(sentence_spans)
# Output: [(0, 24), (24, 44)]
```
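Because the spans are plain `(start, end)` character offsets into the original string, slicing the text with them reproduces the sentences exactly, with no characters lost. This snippet verifies that property using the example output above (plain Python, no model download required):

```python
# Spans are half-open (start, end) offsets into the original string.
text = "This is a test sentence. Here's another one!"
spans = [(0, 24), (24, 44)]  # as returned by get_sentence_spans above

# Slicing with each span recovers the sentences...
sentences = [text[start:end] for start, end in spans]
print(sentences)
# Output: ['This is a test sentence.', " Here's another one!"]

# ...and concatenating the slices reconstructs the input exactly.
assert "".join(text[start:end] for start, end in spans) == text
```

This offset-preserving behavior is what makes span output convenient for retrieval pipelines, where segment positions must map back to the source document.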
## Performance
The model uses a character-based random forest classifier with the following configuration:
- Window Size: 5 characters before, 3 characters after potential boundary
- Accuracy: 0.9980
- F1 Score: 0.7790
- Precision: 0.7570
- Recall: 0.9910
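The asymmetric window (more left context than right) can be pictured with a small sketch. This is illustrative only: `char_window` is a hypothetical helper, not part of the charboundary API, and the library's real feature set is internal and richer than raw characters.

```python
def char_window(text: str, i: int, before: int = 5, after: int = 3) -> str:
    """Return the characters around position i in an asymmetric window
    (5 before, 3 after by default), padding with '_' where the window
    runs past the edges of the text. Illustrative sketch only."""
    left = text[max(0, i - before):i].rjust(before, "_")
    right = text[i + 1:i + 1 + after].ljust(after, "_")
    return left + text[i] + right

# Candidate boundary at the period (index 3): 5 chars before, 3 after.
print(char_window("End. Next", 3))
# Output: __End. Ne
```

A classifier then sees each candidate boundary character together with this fixed-width context, which is what makes character-level detection fast: no tokenization pass is needed.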
### Dataset-specific Performance
| Dataset | Precision | F1 | Recall |
|---|---|---|---|
| ALEA SBD Benchmark | 0.631 | 0.722 | 0.842 |
| SCOTUS | 0.938 | 0.775 | 0.661 |
| Cyber Crime | 0.961 | 0.853 | 0.767 |
| BVA | 0.957 | 0.875 | 0.806 |
Intellectual Property | 0.948 | 0.889 | 0.837 |
## Available Models
CharBoundary comes in three sizes, balancing accuracy and efficiency:
| Model | Format | Size (MB) | Memory (MB) | Throughput (chars/sec) | F1 Score |
|---|---|---|---|---|---|
| Small | SKOPS / ONNX | 3.0 / 0.5 | 1,026 | ~748K | 0.773 |
| Medium | SKOPS / ONNX | 13.0 / 2.6 | 1,897 | ~587K | 0.779 |
| Large | SKOPS / ONNX | 60.0 / 13.0 | 5,734 | ~518K | 0.782 |
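As a rough guide to the throughput column, the back-of-envelope calculation below (plain Python, using the figures from the table above) estimates wall-clock time to segment a 10-million-character corpus with each size:

```python
# Approximate time to segment a 10M-character corpus at each model's
# reported throughput (characters per second, from the table above).
throughput = {"small": 748_000, "medium": 587_000, "large": 518_000}
doc_chars = 10_000_000

for name, cps in throughput.items():
    print(f"{name}: ~{doc_chars / cps:.1f} s")
# Output:
# small: ~13.4 s
# medium: ~17.0 s
# large: ~19.3 s
```

The small model is roughly 1.4x faster than the large one for about a one-point drop in F1, so the medium model is the default trade-off.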
## Paper and Citation
This model is part of the research presented in the following paper:
```bibtex
@article{bommarito2025precise,
  title={Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary},
  author={Bommarito, Michael J and Katz, Daniel Martin and Bommarito, Jillian},
  journal={arXiv preprint arXiv:2504.04131},
  year={2025}
}
```
For more details on the model architecture, training, and evaluation, please see:
- Paper: "Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary"
- CharBoundary GitHub repository
- Annotated training data
## Contact
This model is developed and maintained by the ALEA Institute.
For technical support, collaboration opportunities, or general inquiries:
- GitHub: https://github.com/alea-institute/kl3m-model-research
- Email: [email protected]
- Website: https://aleainstitute.ai
For any questions, please contact the ALEA Institute at [email protected] or open an issue on this model repository or on GitHub.