alea-institute committed d2239f3 · verified · 1 Parent(s): 8fdd5bc

Update README for small ONNX model

Files changed (1)
  1. README.md +121 -0
README.md ADDED
@@ -0,0 +1,121 @@
---
language:
- en
tags:
- charboundary
- sentence-boundary-detection
- paragraph-detection
- legal-text
- legal-nlp
- text-segmentation
- onnx
- cpu
- document-processing
- rag
- optimized-inference
license: mit
library_name: charboundary
pipeline_tag: text-classification
datasets:
- alea-institute/alea-legal-benchmark-sentence-paragraph-boundaries
- alea-institute/kl3m-data-snapshot-20250324
metrics:
- accuracy
- f1
- precision
- recall
- throughput
papers:
- https://arxiv.org/abs/2504.04131
---

# CharBoundary small ONNX Model

This is the small ONNX model for the [CharBoundary](https://github.com/alea-institute/charboundary) library (v0.5.0),
a fast character-based sentence and paragraph boundary detection system optimized for legal text.

## Model Details

- **Size**: small
- **Model Size**: 0.6 MB (ONNX compressed)
- **Memory Usage**: 1026 MB at runtime (non-ONNX version)
- **Training Data**: ~50,000 samples of legal text from the [KL3M dataset](https://huggingface.co/datasets/alea-institute/kl3m-data-snapshot-20250324)
- **Model Type**: Random Forest (32 trees, max depth 16) converted to ONNX
- **Format**: ONNX optimized for inference
- **Task**: Character-level boundary detection for text segmentation
- **License**: MIT
- **Throughput**: ~748K characters/second (base model; ONNX is typically 2-4x faster)

## Usage

> **Security Advantage:** The ONNX format is safer to load than the SKOPS models because it does not require bypassing security checks with `trust_model=True`. ONNX models are the recommended option for security-sensitive environments.

```python
from huggingface_hub import hf_hub_download
from charboundary import TextSegmenter
from charboundary.onnx_support import enable_onnx

# Enable ONNX support
enable_onnx()

# Download the model
model_path = hf_hub_download(repo_id="alea-institute/charboundary-small-onnx",
                             filename="model.onnx")

# Load the model (ONNX models don't require the trust_model parameter)
segmenter = TextSegmenter.load(model_path)

# Segment to sentences
text = "This is a test sentence. Here's another one!"
sentences = segmenter.segment_to_sentences(text)
print(sentences)

# Segment to paragraphs
paragraphs = segmenter.segment_to_paragraphs(text)
print(paragraphs)

# Get character-level spans as (start, end) offsets into the text
sentence_spans = segmenter.segment_to_sentence_spans(text)
print(sentence_spans)
```

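The `(start, end)` spans returned by `segment_to_sentence_spans` can be used to slice the original text directly, for example to pack consecutive sentences into size-limited chunks for retrieval. The following is a minimal sketch and not part of CharBoundary: `pack_spans` and `max_chars` are hypothetical names, and the spans are hard-coded stand-ins (assuming end-exclusive offsets) so the block runs without downloading the model.

```python
from typing import List, Tuple

Span = Tuple[int, int]


def pack_spans(spans: List[Span], max_chars: int) -> List[Span]:
    """Greedily merge consecutive sentence spans into chunks of at most max_chars.

    A single sentence longer than max_chars is kept as its own chunk.
    """
    chunks: List[Span] = []
    for start, end in spans:
        if chunks and end - chunks[-1][0] <= max_chars:
            # Extend the current chunk so it also covers this sentence.
            chunks[-1] = (chunks[-1][0], end)
        else:
            # Start a new chunk at this sentence.
            chunks.append((start, end))
    return chunks


text = "First sentence. Second one. A third, somewhat longer sentence."
# Hypothetical end-exclusive spans; in practice these would come from
# segmenter.segment_to_sentence_spans(text).
spans = [(0, 15), (16, 27), (28, 62)]

chunks = pack_spans(spans, max_chars=30)
chunk_texts = [text[start:end] for start, end in chunks]
print(chunk_texts)
```

Working with offsets rather than sentence strings keeps each chunk traceable to its position in the source document, which is useful when a retrieval result must be cited back to the original text.
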
## Performance

ONNX models typically run inference 2-4x faster than the standard scikit-learn models
while producing the same accuracy metrics. The tables below report accuracy for the base model and compare the three model sizes.

### Base Model Performance

| Dataset | Precision | F1 | Recall |
|---------|-----------|-------|--------|
| ALEA SBD Benchmark | 0.624 | 0.718 | 0.845 |
| SCOTUS | 0.926 | 0.773 | 0.664 |
| Cyber Crime | 0.939 | 0.837 | 0.755 |
| BVA | 0.937 | 0.870 | 0.812 |
| Intellectual Property | 0.927 | 0.883 | 0.843 |

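As a quick consistency check on the table above, F1 is the harmonic mean of precision and recall, so it can be recomputed from the other two columns. The sketch below is illustrative only; `f1_score` is a local helper defined here, not a CharBoundary function, and the values are copied from the table.

```python
# (precision, reported F1, recall) for each dataset, copied from the table.
reported = {
    "ALEA SBD Benchmark": (0.624, 0.718, 0.845),
    "SCOTUS": (0.926, 0.773, 0.664),
    "Cyber Crime": (0.939, 0.837, 0.755),
    "BVA": (0.937, 0.870, 0.812),
    "Intellectual Property": (0.927, 0.883, 0.843),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for name, (precision, f1, recall) in reported.items():
    computed = f1_score(precision, recall)
    print(f"{name}: reported F1 {f1:.3f}, recomputed {computed:.3f}")
```

Each recomputed value agrees with the reported F1 to within the rounding of the published precision and recall.
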
### Size and Speed Comparison

| Model | Format | Size (MB, SKOPS / ONNX) | Memory Usage | Throughput (chars/sec) | F1 Score |
|-------|--------|-----------|--------------|------------------------|----------|
| Small | [SKOPS](https://huggingface.co/alea-institute/charboundary-small) / [ONNX](https://huggingface.co/alea-institute/charboundary-small-onnx) | 3.0 / 0.5 | 1,026 MB | ~748K | 0.773 |
| Medium | [SKOPS](https://huggingface.co/alea-institute/charboundary-medium) / [ONNX](https://huggingface.co/alea-institute/charboundary-medium-onnx) | 13.0 / 2.6 | 1,897 MB | ~587K | 0.779 |
| Large | [SKOPS](https://huggingface.co/alea-institute/charboundary-large) / [ONNX](https://huggingface.co/alea-institute/charboundary-large-onnx) | 60.0 / 13.0 | 5,734 MB | ~518K | 0.782 |

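To put the throughput column in practical terms, the figures can be turned into a rough processing-time estimate. This is a back-of-the-envelope sketch: the one-billion-character corpus is a hypothetical workload, `estimated_seconds` is a helper defined here, and the rates are the approximate values from the table, so real timings will depend on hardware and on whether the ONNX or SKOPS model is used.

```python
# Approximate throughput in characters/second, copied from the table above.
throughput = {"small": 748_000, "medium": 587_000, "large": 518_000}


def estimated_seconds(num_chars: int, chars_per_second: float) -> float:
    """Estimated wall-clock time to segment num_chars at a constant rate."""
    return num_chars / chars_per_second


corpus_chars = 1_000_000_000  # hypothetical corpus size (assumption)
for name, chars_per_second in throughput.items():
    minutes = estimated_seconds(corpus_chars, chars_per_second) / 60
    print(f"{name}: ~{minutes:.0f} minutes")
```

At the quoted rates the estimate ranges from roughly 22 minutes for the small model to roughly 32 minutes for the large one, which shows the trade-off against the modest F1 gain in the last column.
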
## Paper and Citation

This model is part of the research presented in the following paper:

```
@article{bommarito2025precise,
  title={Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary},
  author={Bommarito, Michael J and Katz, Daniel Martin and Bommarito, Jillian},
  journal={arXiv preprint arXiv:2504.04131},
  year={2025}
}
```

For more details on the model architecture, training, and evaluation, please see:
- [Paper: "Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary"](https://arxiv.org/abs/2504.04131)
- [CharBoundary GitHub repository](https://github.com/alea-institute/charboundary)
- [Annotated training data](https://huggingface.co/datasets/alea-institute/alea-legal-benchmark-sentence-paragraph-boundaries)