+---
+language:
+  - en
+tags:
+  - charboundary
+  - sentence-boundary-detection
+  - paragraph-detection
+  - legal-text
+  - legal-nlp
+  - text-segmentation
+  - onnx
+  - cpu
+  - document-processing
+  - rag
+  - optimized-inference
+license: mit
+library_name: charboundary
+pipeline_tag: text-classification
+datasets:
+  - alea-institute/alea-legal-benchmark-sentence-paragraph-boundaries
+  - alea-institute/kl3m-data-snapshot-20250324
+metrics:
+  - accuracy
+  - f1
+  - precision
+  - recall
+  - throughput
+papers:
+  - https://arxiv.org/abs/2504.04131
+---
+# CharBoundary small ONNX Model
+This is the small ONNX model for the [CharBoundary](https://github.com/alea-institute/charboundary) library (v0.5.0),
+a fast character-based sentence and paragraph boundary detection system optimized for legal text.
+## Model Details
+- **Size**: small
+- **Model Size**: 0.6 MB (ONNX compressed)
+- **Memory Usage**: 1026 MB at runtime (non-ONNX version)
+- **Training Data**: ~50,000 samples of legal text from the [KL3M dataset](https://huggingface.co/datasets/alea-institute/kl3m-data-snapshot-20250324)
+- **Model Type**: Random Forest (32 trees, max depth 16) converted to ONNX
+- **Format**: ONNX optimized for inference
+- **Task**: Character-level boundary detection for text segmentation
+- **License**: MIT
+- **Throughput**: ~748K characters/second (base model; ONNX is typically 2-4x faster)
+## Usage
+> **Security Advantage:** ONNX models load without the `trust_model=True` flag, so no security check has to be bypassed as it does when loading SKOPS models. ONNX is the recommended format for security-sensitive environments.
+```python
+from huggingface_hub import hf_hub_download
+from charboundary import TextSegmenter
+from charboundary.onnx_support import enable_onnx
+# Enable ONNX support
+enable_onnx()
+# Download the model
+model_path = hf_hub_download(repo_id="alea-institute/charboundary-small-onnx",
+                            filename="model.onnx")
+# Load the model (ONNX models don't require trust_model parameter)
+segmenter = TextSegmenter.load(model_path)
+# Use the model
+text = "This is a test sentence. Here's another one!"
+sentences = segmenter.segment_to_sentences(text)
+print(sentences)
+# Segment to paragraphs
+paragraphs = segmenter.segment_to_paragraphs(text)
+print(paragraphs)
+# Get character-level spans
+sentence_spans = segmenter.segment_to_sentence_spans(text)
+print(sentence_spans)  # list of (start, end) character offsets, one per sentence
+```
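The spans returned by `segment_to_sentence_spans` are character offsets into the input string. The following is a minimal sketch of how to recover the sentence text from them, assuming each span is a half-open `(start, end)` pair. The helper `slice_spans` and the example spans are hypothetical and written by hand for illustration; they are not output from the model.

```python
def slice_spans(text, spans):
    """Return the substring for each (start, end) span, treating end as exclusive."""
    return [text[start:end] for start, end in spans]


text = "This is a test sentence. Here's another one!"

# Hand-written spans for illustration only. In practice they would come from
# segmenter.segment_to_sentence_spans(text), and the exact offsets may differ
# (for example, in how whitespace between sentences is handled).
example_spans = [(0, 24), (25, 44)]

sentences = slice_spans(text, example_spans)
print(sentences)
```

Keeping offsets rather than copied strings lets downstream code (for example, a RAG chunker) map each sentence back to its position in the source document.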
+## Performance
+ONNX models typically run inference 2-4x faster than the standard scikit-learn models
+while producing the same accuracy metrics. The tables below report accuracy for the base model and compare the three model sizes.
+### Base Model Performance
+| Dataset | Precision | F1 | Recall |
+|---------|-----------|-------|--------|
+| ALEA SBD Benchmark | 0.624 | 0.718 | 0.845 |
+| SCOTUS | 0.926 | 0.773 | 0.664 |
+| Cyber Crime | 0.939 | 0.837 | 0.755 |
+| BVA | 0.937 | 0.870 | 0.812 |
+| Intellectual Property | 0.927 | 0.883 | 0.843 |
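The F1 column in the table above is the harmonic mean of precision and recall. As a quick consistency check, the sketch below recomputes F1 from the reported precision and recall values; the helper name `f1_score` is chosen for illustration and is not part of the CharBoundary API.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above
rows = {
    "ALEA SBD Benchmark": (0.624, 0.845, 0.718),
    "SCOTUS": (0.926, 0.664, 0.773),
    "Cyber Crime": (0.939, 0.755, 0.837),
    "BVA": (0.937, 0.812, 0.870),
    "Intellectual Property": (0.927, 0.843, 0.883),
}

for name, (precision, recall, reported) in rows.items():
    computed = f1_score(precision, recall)
    print(f"{name}: computed F1 = {computed:.3f}, reported F1 = {reported:.3f}")
```

Because the inputs are themselves rounded to three decimals, small differences in the last digit would be expected.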
+### Size and Speed Comparison
+| Model | Format | Size (MB, SKOPS / ONNX) | Memory Usage | Throughput (chars/sec) | F1 Score |
+|-------|--------|-----------|--------------|------------------------|----------|
+| Small | [SKOPS](https://huggingface.co/alea-institute/charboundary-small) / [ONNX](https://huggingface.co/alea-institute/charboundary-small-onnx) | 3.0 / 0.5 | 1,026 MB | ~748K | 0.773 |
+| Medium | [SKOPS](https://huggingface.co/alea-institute/charboundary-medium) / [ONNX](https://huggingface.co/alea-institute/charboundary-medium-onnx) | 13.0 / 2.6 | 1,897 MB | ~587K | 0.779 |
+| Large | [SKOPS](https://huggingface.co/alea-institute/charboundary-large) / [ONNX](https://huggingface.co/alea-institute/charboundary-large-onnx) | 60.0 / 13.0 | 5,734 MB | ~518K | 0.782 |
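To make the trade-offs in the table above concrete, the sketch below derives two figures from it: how much smaller each ONNX file is than its SKOPS counterpart, and roughly how long each base model would take to process a given number of characters. The dictionary layout and the helper names are assumptions made for illustration. The throughput values are the approximate table values, so the timings are rough estimates only.

```python
# Values copied from the table above: file sizes in MB, approximate throughput in chars/sec.
models = {
    "small": {"skops_mb": 3.0, "onnx_mb": 0.5, "chars_per_sec": 748_000},
    "medium": {"skops_mb": 13.0, "onnx_mb": 2.6, "chars_per_sec": 587_000},
    "large": {"skops_mb": 60.0, "onnx_mb": 13.0, "chars_per_sec": 518_000},
}


def size_ratio(name):
    """How many times larger the SKOPS file is than the ONNX file."""
    entry = models[name]
    return entry["skops_mb"] / entry["onnx_mb"]


def estimated_seconds(name, num_chars):
    """Rough processing time at the table's approximate throughput."""
    return num_chars / models[name]["chars_per_sec"]


for name in models:
    print(
        f"{name}: SKOPS/ONNX size ratio = {size_ratio(name):.1f}x, "
        f"~{estimated_seconds(name, 1_000_000):.2f} s per 1M characters"
    )
```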
+## Paper and Citation
+This model is part of the research presented in the following paper:
+```bibtex
+@article{bommarito2025precise,
+  title={Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary},
+  author={Bommarito, Michael J and Katz, Daniel Martin and Bommarito, Jillian},
+  journal={arXiv preprint arXiv:2504.04131},
+  year={2025}
+}
+```
+For more details on the model architecture, training, and evaluation, please see:
+- [Paper: "Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary"](https://arxiv.org/abs/2504.04131)
+- [CharBoundary GitHub repository](https://github.com/alea-institute/charboundary)
+- [Annotated training data](https://huggingface.co/datasets/alea-institute/alea-legal-benchmark-sentence-paragraph-boundaries)