Update README for small ONNX model
README.md
ADDED
|
@@ -0,0 +1,121 @@
|
| 1 |
+
---
|
| 2 |
+
language:
|
| 3 |
+
- en
|
| 4 |
+
tags:
|
| 5 |
+
- charboundary
|
| 6 |
+
- sentence-boundary-detection
|
| 7 |
+
- paragraph-detection
|
| 8 |
+
- legal-text
|
| 9 |
+
- legal-nlp
|
| 10 |
+
- text-segmentation
|
| 11 |
+
- onnx
|
| 12 |
+
- cpu
|
| 13 |
+
- document-processing
|
| 14 |
+
- rag
|
| 15 |
+
- optimized-inference
|
| 16 |
+
license: mit
|
| 17 |
+
library_name: charboundary
|
| 18 |
+
pipeline_tag: text-classification
|
| 19 |
+
datasets:
|
| 20 |
+
- alea-institute/alea-legal-benchmark-sentence-paragraph-boundaries
|
| 21 |
+
- alea-institute/kl3m-data-snapshot-20250324
|
| 22 |
+
metrics:
|
| 23 |
+
- accuracy
|
| 24 |
+
- f1
|
| 25 |
+
- precision
|
| 26 |
+
- recall
|
| 27 |
+
- throughput
|
| 28 |
+
papers:
|
| 29 |
+
- https://arxiv.org/abs/2504.04131
|
| 30 |
+
---
|
| 31 |
+
|
| 32 |
+
# CharBoundary small ONNX Model
|
| 33 |
+
|
| 34 |
+
This is the small ONNX model for the [CharBoundary](https://github.com/alea-institute/charboundary) library (v0.5.0),
|
| 35 |
+
a fast character-based sentence and paragraph boundary detection system optimized for legal text.
|
| 36 |
+
|
| 37 |
+
## Model Details
|
| 38 |
+
|
| 39 |
+
- **Size**: small
|
| 40 |
+
- **Model Size**: 0.6 MB (ONNX compressed)
|
| 41 |
+
- **Memory Usage**: 1,026 MB at runtime (non-ONNX version)
|
| 42 |
+
- **Training Data**: Legal text with ~50,000 samples from the [KL3M dataset](https://huggingface.co/datasets/alea-institute/kl3m-data-snapshot-20250324)
|
| 43 |
+
- **Model Type**: Random Forest (32 trees, max depth 16) converted to ONNX
|
| 44 |
+
- **Format**: ONNX optimized for inference
|
| 45 |
+
- **Task**: Character-level boundary detection for text segmentation
|
| 46 |
+
- **License**: MIT
|
| 47 |
+
- **Throughput**: ~748K characters/second (base model; ONNX is typically 2-4x faster)
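
As a rough guide to what the throughput figure means in practice, the sketch below estimates wall-clock time for a corpus of a given size. The ~748K characters/second figure and the 2-4x ONNX range come from the list above; the function name `estimate_seconds`, the linear-scaling assumption, and the one-billion-character corpus are illustrative assumptions, and real throughput will vary with hardware and input text.

```python
# Reported base-model throughput for the small model (characters per second).
BASE_CHARS_PER_SEC = 748_000


def estimate_seconds(num_chars: int, speedup: float = 1.0) -> float:
    """Estimate processing time, assuming throughput scales linearly with `speedup`."""
    if num_chars < 0 or speedup <= 0:
        raise ValueError("num_chars must be >= 0 and speedup must be > 0")
    return num_chars / (BASE_CHARS_PER_SEC * speedup)


corpus_chars = 1_000_000_000  # hypothetical corpus size, for illustration only
for label, speedup in [("base", 1.0), ("ONNX, 2x", 2.0), ("ONNX, 4x", 4.0)]:
    minutes = estimate_seconds(corpus_chars, speedup) / 60
    print(f"{label}: about {minutes:.1f} min")
```

Treat the result as an order-of-magnitude estimate; benchmark on your own documents before sizing a pipeline.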
|
| 48 |
+
|
| 49 |
+
## Usage
|
| 50 |
+
|
| 51 |
+
> **Security Advantage:** Unlike the SKOPS models, this ONNX model loads without `trust_model=True`, so no security checks need to be bypassed. ONNX models are the recommended option for security-sensitive environments.
|
| 52 |
+
|
| 53 |
+
```python
|
| 54 |
+
from huggingface_hub import hf_hub_download
|
| 55 |
+
from charboundary import TextSegmenter
|
| 56 |
+
from charboundary.onnx_support import enable_onnx
|
| 57 |
+
|
| 58 |
+
# Enable ONNX support
|
| 59 |
+
enable_onnx()
|
| 60 |
+
|
| 61 |
+
# Download the model
|
| 62 |
+
model_path = hf_hub_download(repo_id="alea-institute/charboundary-small-onnx",
|
| 63 |
+
filename="model.onnx")
|
| 64 |
+
|
| 65 |
+
# Load the model (ONNX models don't require the trust_model parameter)
|
| 66 |
+
segmenter = TextSegmenter.load(model_path)
|
| 67 |
+
|
| 68 |
+
# Use the model
|
| 69 |
+
text = "This is a test sentence. Here's another one!"
|
| 70 |
+
sentences = segmenter.segment_to_sentences(text)
|
| 71 |
+
print(sentences)
|
| 72 |
+
|
| 73 |
+
# Segment to paragraphs
|
| 74 |
+
paragraphs = segmenter.segment_to_paragraphs(text)
|
| 75 |
+
print(paragraphs)
|
| 76 |
+
|
| 77 |
+
# Get character-level spans
|
| 78 |
+
sentence_spans = segmenter.segment_to_sentence_spans(text)
|
| 79 |
+
print(sentence_spans)  # list of (start, end) character offsets, one per sentence
|
| 80 |
+
```
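
Character spans are useful because they can be mapped back onto the original text without copying it. The sketch below shows one way to do that with plain slicing. It assumes spans are `(start, end)` tuples with an exclusive `end`, which you should confirm against the library's actual output; the helper `slice_spans` and the span values are hand-written for illustration and are not model output.

```python
from typing import List, Tuple


def slice_spans(text: str, spans: List[Tuple[int, int]]) -> List[str]:
    """Return the substring for each (start, end) span, treating `end` as exclusive."""
    return [text[start:end] for start, end in spans]


text = "This is a test sentence. Here's another one!"
# Hand-written spans for illustration only; not actual model output.
example_spans = [(0, 24), (25, 44)]
for piece in slice_spans(text, example_spans):
    print(repr(piece))
```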
|
| 81 |
+
|
| 82 |
+
## Performance
|
| 83 |
+
|
| 84 |
+
ONNX models provide faster inference (typically 2-4x) than the standard scikit-learn models
|
| 85 |
+
while maintaining the same accuracy metrics. The tables below report base-model accuracy by dataset and compare size, memory usage, and throughput across model sizes.
|
| 86 |
+
|
| 87 |
+
### Base Model Performance
|
| 88 |
+
|
| 89 |
+
| Dataset | Precision | F1 | Recall |
|
| 90 |
+
|---------|-----------|-------|--------|
|
| 91 |
+
| ALEA SBD Benchmark | 0.624 | 0.718 | 0.845 |
|
| 92 |
+
| SCOTUS | 0.926 | 0.773 | 0.664 |
|
| 93 |
+
| Cyber Crime | 0.939 | 0.837 | 0.755 |
|
| 94 |
+
| BVA | 0.937 | 0.870 | 0.812 |
|
| 95 |
+
| Intellectual Property | 0.927 | 0.883 | 0.843 |
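
The F1 column is the harmonic mean of precision and recall, so it can be recomputed from the other two columns. The short check below does that for each row; the numbers are copied from the table above, and the function name `f1_score` is just a local helper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
rows = {
    "ALEA SBD Benchmark": (0.624, 0.845, 0.718),
    "SCOTUS": (0.926, 0.664, 0.773),
    "Cyber Crime": (0.939, 0.755, 0.837),
    "BVA": (0.937, 0.812, 0.870),
    "Intellectual Property": (0.927, 0.843, 0.883),
}

for name, (precision, recall, reported) in rows.items():
    computed = f1_score(precision, recall)
    print(f"{name}: computed {computed:.3f}, reported {reported:.3f}")
```

Reading the rows this way also makes the trade-off visible: the ALEA SBD Benchmark row is recall-heavy, while the other datasets favor precision.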
|
| 96 |
+
|
| 97 |
+
### Size and Speed Comparison
|
| 98 |
+
|
| 99 |
+
| Model | Format | Size (MB) | Memory Usage | Throughput (chars/sec) | F1 Score |
|
| 100 |
+
|-------|--------|-----------|--------------|------------------------|----------|
|
| 101 |
+
| Small | [SKOPS](https://huggingface.co/alea-institute/charboundary-small) / [ONNX](https://huggingface.co/alea-institute/charboundary-small-onnx) | 3.0 / 0.5 | 1,026 MB | ~748K | 0.773 |
|
| 102 |
+
| Medium | [SKOPS](https://huggingface.co/alea-institute/charboundary-medium) / [ONNX](https://huggingface.co/alea-institute/charboundary-medium-onnx) | 13.0 / 2.6 | 1,897 MB | ~587K | 0.779 |
|
| 103 |
+
| Large | [SKOPS](https://huggingface.co/alea-institute/charboundary-large) / [ONNX](https://huggingface.co/alea-institute/charboundary-large-onnx) | 60.0 / 13.0 | 5,734 MB | ~518K | 0.782 |
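
To make the file-size difference between formats concrete, the sketch below computes how many times smaller each ONNX file is than its SKOPS counterpart. The sizes are copied from the table above; `reduction_factor` is a hypothetical helper name used only for this illustration.

```python
# (SKOPS size in MB, ONNX size in MB) copied from the table above.
sizes_mb = {
    "Small": (3.0, 0.5),
    "Medium": (13.0, 2.6),
    "Large": (60.0, 13.0),
}


def reduction_factor(skops_mb: float, onnx_mb: float) -> float:
    """Return how many times smaller the ONNX file is than the SKOPS file."""
    if onnx_mb <= 0:
        raise ValueError("onnx_mb must be positive")
    return skops_mb / onnx_mb


for name, (skops_mb, onnx_mb) in sizes_mb.items():
    factor = reduction_factor(skops_mb, onnx_mb)
    print(f"{name}: ONNX file is about {factor:.1f}x smaller")
```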
|
| 104 |
+
|
| 105 |
+
## Paper and Citation
|
| 106 |
+
|
| 107 |
+
This model is part of the research presented in the following paper:
|
| 108 |
+
|
| 109 |
+
```bibtex
|
| 110 |
+
@article{bommarito2025precise,
|
| 111 |
+
title={Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary},
|
| 112 |
+
author={Bommarito, Michael J and Katz, Daniel Martin and Bommarito, Jillian},
|
| 113 |
+
journal={arXiv preprint arXiv:2504.04131},
|
| 114 |
+
year={2025}
|
| 115 |
+
}
|
| 116 |
+
```
|
| 117 |
+
|
| 118 |
+
For more details on the model architecture, training, and evaluation, please see:
|
| 119 |
+
- [Paper: "Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary"](https://arxiv.org/abs/2504.04131)
|
| 120 |
+
- [CharBoundary GitHub repository](https://github.com/alea-institute/charboundary)
|
| 121 |
+
- [Annotated training data](https://huggingface.co/datasets/alea-institute/alea-legal-benchmark-sentence-paragraph-boundaries)
|