SetFit with nomic-ai/nomic-embed-text-v1.5

This is a SetFit model for text classification. It uses nomic-ai/nomic-embed-text-v1.5 as the Sentence Transformer embedding model and a LogisticRegression instance as the classification head.

The model has been trained using an efficient few-shot learning technique that involves:

1. Fine-tuning a Sentence Transformer with contrastive learning.
2. Training a classification head with features from the fine-tuned Sentence Transformer.
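The contrastive step can be illustrated with a minimal, standard-library sketch. It assumes a simplified pair sampler (one same-label and one different-label partner per sample per iteration) and a cosine-similarity loss computed as the mean squared error between the cosine similarity of two embeddings and a 1.0/0.0 target. The function names `make_pairs` and `cosine_similarity_loss` are hypothetical and are not SetFit's own API; SetFit's actual sampler may differ in detail.

```python
import math
import random


def make_pairs(texts, labels, num_iterations, seed=42):
    """Build (text_a, text_b, target) triples for contrastive fine-tuning.

    Simplified sampler (an assumption, not SetFit's exact logic): in each
    iteration every sample gets one same-label partner (target 1.0) and one
    different-label partner (target 0.0).
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_iterations):
        for i, (text, label) in enumerate(zip(texts, labels)):
            same = [t for j, (t, l) in enumerate(zip(texts, labels)) if l == label and j != i]
            diff = [t for t, l in zip(texts, labels) if l != label]
            if same:
                pairs.append((text, rng.choice(same), 1.0))
            if diff:
                pairs.append((text, rng.choice(diff), 0.0))
    return pairs


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cosine_similarity_loss(embedded_pairs):
    """Mean squared error between cosine similarity and the pair target."""
    errors = [(cosine(u, v) - target) ** 2 for u, v, target in embedded_pairs]
    return sum(errors) / len(errors)


# Tiny demo using queries from the label table in this card.
texts = [
    "definition of ex parte",
    'How do courts in Illinois define "constructive eviction"?',
    "Have you seen any good movies lately?",
    "Have you recently attended any weddings or special celebrations?",
]
labels = ["Term of Art", "Term of Art", "Out of Scope", "Out of Scope"]
demo_pairs = make_pairs(texts, labels, num_iterations=3)
```

After this stage, the second step fits the classification head on embeddings produced by the fine-tuned body; the pairs themselves are only used to shape the embedding space.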

Model Details

Model Description

Model Type: SetFit
Sentence Transformer body: nomic-ai/nomic-embed-text-v1.5
Classification head: a LogisticRegression instance
Maximum Sequence Length: 8192 tokens
Number of Classes: 7

Model Sources

Repository: SetFit on GitHub
Paper: Efficient Few-Shot Learning Without Prompts
Blogpost: SetFit: Efficient Few-Shot Learning Without Prompts

Model Labels

Label	Examples
Term of Art Interpretations & Application	'How do courts in Illinois define "constructive eviction"?' 'How do Pennsylvania courts define "reasonable suspicion" in DUI cases?' 'definition of ex parte'
Out of Scope	'Has Capt. Ashley Heiberger ever testified as an expert witness?' 'Have you recently attended any weddings or special celebrations?' 'Have you seen any good movies lately?'
SDR	'Gonzalez et al. v. Mexico' '2021 U.S. Dist. LEXIS 14890' 'Elizabeth Holmes Theranos ORDER DENYING MOTION FOR RELEASE PENDING APPEAL'
Identify Current Law	'Does Michigan have a statute of repose?' 'Mississippi law concerning challenges to changes made in updated HOA regulations' 'cases on nurse liability for making medication dosage mistake in kentucky'
Agent decision	'Search for USPTO Patent Decisions: BPAI and PTAB discussing the integration of a judicial exception into practical applications' 'Are there any EPA Environmental Appeals Board Decisions regarding the guidelines for establishing a "critical habitat" for wildlife?' 'Find Merit Systems Protection Board decisions regarding when the plain language of a statute must be treated as controlling'
Q&A - Complex	'Are bloodhounds considered reliable for establishing probable cause in Idaho?' 'What are the requirements to file a class action lawsuit in Florida?' 'Can a corporation be held liable for damages caused by an employee driving under the influence of alcohol in New York?'
Practical Guidance	'What does an "Election of Remedy" clause involve in an indemnity agreement? T' 'Where is Private Company Corporate Governance Board Resolutions Resource Kit T' 'If I start a law firm in Michigan, what types of employee leave do I need to provide compared to my current firm in Ohio? T'

Uses

Direct Use for Inference

First install the SetFit library:

pip install setfit

Then you can load this model and run inference.

from setfit import SetFitModel

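# Download from the 🤗 Hub
# Note: nomic-ai/nomic-embed-text-v1.5 ships custom modeling code. If loading
# fails with a remote-code error, passing trust_remote_code=True to
# from_pretrained may be required, depending on your SetFit version.
model = SetFitModel.from_pretrained("tonyshaw/setfit_pg_70h_nomic-v1.5")
# Run inference on a single query; the result is the predicted label
preds = model("Ohio aggravated arson cases")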
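For batches of queries, it can be convenient to bucket inputs by predicted label. The sketch below assumes only that the model object exposes a `predict` method that accepts a list of strings and returns one label per input, as a loaded SetFitModel does. `group_by_label` and `KeywordStub` are hypothetical names introduced for illustration; the stub stands in for the real model so the example runs without downloading anything.

```python
from collections import defaultdict


def group_by_label(model, queries):
    """Classify a batch of queries and bucket them by predicted label."""
    predictions = model.predict(queries)
    grouped = defaultdict(list)
    for query, label in zip(queries, predictions):
        grouped[str(label)].append(query)
    return dict(grouped)


class KeywordStub:
    """Hypothetical stand-in for a loaded SetFitModel (illustration only)."""

    def predict(self, queries):
        # Crude keyword rule, NOT how the real model decides.
        return [
            "Term of Art Interpretations & Application" if "defin" in q.lower() else "Out of Scope"
            for q in queries
        ]


grouped = group_by_label(
    KeywordStub(),
    ["definition of ex parte", "Have you seen any good movies lately?"],
)
```

With the real model, replace `KeywordStub()` with the `model` object loaded above.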

Training Details

Training Set Metrics

Training set	Min	Median	Max
Word count	1	11.2193	98

Label	Training Sample Count
Agent decision	130
Identify Current Law	500
Out of Scope	100
Practical Guidance	41
Q&A - Complex	500
SDR	500
Term of Art Interpretations & Application	500
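The label counts above are noticeably imbalanced. The following short sketch, using only the numbers in the table, computes the total training set size and each label's share; the variable names are illustrative.

```python
# Training sample counts copied from the table above.
counts = {
    "Agent decision": 130,
    "Identify Current Law": 500,
    "Out of Scope": 100,
    "Practical Guidance": 41,
    "Q&A - Complex": 500,
    "SDR": 500,
    "Term of Art Interpretations & Application": 500,
}

total = sum(counts.values())  # 2271 training samples in total

# Percentage of the training set per label, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

# Ratio between the largest and smallest class.
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{label}: {counts[label]} samples ({share}%)")
print(f"largest/smallest class ratio: {imbalance_ratio:.1f}")
```

The smallest class (Practical Guidance) makes up under 2% of the training data, which is worth keeping in mind when interpreting per-label performance.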

Training Hyperparameters

batch_size: (16, 16)
num_epochs: (1, 1)
max_steps: -1
sampling_strategy: oversampling
num_iterations: 10
body_learning_rate: (2e-05, 2e-05)
head_learning_rate: 2e-05
loss: CosineSimilarityLoss
distance_metric: cosine_distance
margin: 0.25
end_to_end: False
use_amp: False
warmup_proportion: 0.1
l2_weight: 0.01
seed: 42
eval_max_steps: -1
load_best_model_at_end: False
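To reproduce a comparable run, the values above could be passed to SetFit's `TrainingArguments`. This is a sketch rather than the author's training script: `loss` and `distance_metric` are left at the library defaults on the assumption that they match the CosineSimilarityLoss and cosine_distance values listed, and `build_training_arguments` is a hypothetical helper. With a LogisticRegression head (end_to_end: False), head-specific settings such as head_learning_rate and l2_weight are likely unused.

```python
# Hyperparameters copied from the list above.
HYPERPARAMETERS = {
    "batch_size": (16, 16),
    "num_epochs": (1, 1),
    "max_steps": -1,
    "sampling_strategy": "oversampling",
    "num_iterations": 10,
    "body_learning_rate": (2e-05, 2e-05),
    "head_learning_rate": 2e-05,
    "margin": 0.25,
    "end_to_end": False,
    "use_amp": False,
    "warmup_proportion": 0.1,
    "l2_weight": 0.01,
    "seed": 42,
    "eval_max_steps": -1,
    "load_best_model_at_end": False,
}


def build_training_arguments():
    """Hypothetical helper: build SetFit TrainingArguments from the dict.

    SetFit is imported lazily so the dictionary can be inspected even
    when the library is not installed.
    """
    from setfit import TrainingArguments

    return TrainingArguments(**HYPERPARAMETERS)
```

The resulting object would then be passed as `args` to a SetFit `Trainer` together with the model and training dataset.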

Training Results

Epoch	Step	Training Loss	Validation Loss
0.0004	1	0.2703	-
0.0176	50	0.2289	-
0.0352	100	0.2032	-
0.0528	150	0.0951	-
0.0704	200	0.0434	-
0.0881	250	0.026	-
0.1057	300	0.0299	-
0.1233	350	0.02	-
0.1409	400	0.0136	-
0.1585	450	0.013	-
0.1761	500	0.0147	-
0.1937	550	0.0144	-
0.2113	600	0.0052	-
0.2290	650	0.0067	-
0.2466	700	0.0021	-
0.2642	750	0.0038	-
0.2818	800	0.006	-
0.2994	850	0.0039	-
0.3170	900	0.0007	-
0.3346	950	0.0003	-
0.3522	1000	0.0002	-
0.3698	1050	0.0026	-
0.3875	1100	0.0027	-
0.4051	1150	0.0003	-
0.4227	1200	0.0012	-
0.4403	1250	0.0022	-
0.4579	1300	0.0027	-
0.4755	1350	0.0014	-
0.4931	1400	0.0008	-
0.5107	1450	0.0001	-
0.5284	1500	0.0013	-
0.5460	1550	0.0001	-
0.5636	1600	0.0011	-
0.5812	1650	0.0	-
0.5988	1700	0.001	-
0.6164	1750	0.0001	-
0.6340	1800	0.0002	-
0.6516	1850	0.0	-
0.6692	1900	0.0	-
0.6869	1950	0.0	-
0.7045	2000	0.0	-
0.7221	2050	0.0	-
0.7397	2100	0.0	-
0.7573	2150	0.0	-
0.7749	2200	0.0	-
0.7925	2250	0.001	-
0.8101	2300	0.0	-
0.8278	2350	0.0	-
0.8454	2400	0.0013	-
0.8630	2450	0.0	-
0.8806	2500	0.0001	-
0.8982	2550	0.0004	-
0.9158	2600	0.0	-
0.9334	2650	0.0001	-
0.9510	2700	0.0	-
0.9687	2750	0.0	-
0.9863	2800	0.0	-
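The epoch fractions in the table can be cross-checked against the training set size and hyperparameters. The sketch below assumes that each training sample contributes 2 × num_iterations sentence pairs (one positive and one negative per iteration); this is an assumption about the pair generation, but it is consistent with the logged values.

```python
import math

num_samples = 130 + 500 + 100 + 41 + 500 + 500 + 500  # 2271, from the label counts
num_iterations = 10  # from the training hyperparameters
batch_size = 16      # embedding batch size from the training hyperparameters

# Assumption: one positive and one negative pair per sample per iteration.
pairs_per_sample = 2 * num_iterations
total_pairs = num_samples * pairs_per_sample
total_steps = math.ceil(total_pairs / batch_size)


def epoch_at(step):
    """Fraction of the single epoch completed at a given step."""
    return round(step / total_steps, 4)


print(total_pairs, total_steps)
print(epoch_at(50), epoch_at(1000), epoch_at(2800))
```

Under this assumption the single epoch spans 2839 steps, so the last logged row (step 2800) sits just short of the end of training.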

Framework Versions

Python: 3.11.11
SetFit: 1.1.1
Sentence Transformers: 3.4.1
Transformers: 4.48.3
PyTorch: 2.6.0+cu124
Datasets: 3.4.1
Tokenizers: 0.21.1

Citation

BibTeX

@article{https://doi.org/10.48550/arxiv.2209.11055,
    doi = {10.48550/ARXIV.2209.11055},
    url = {https://arxiv.org/abs/2209.11055},
    author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
    keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences},
    title = {Efficient Few-Shot Learning Without Prompts},
    publisher = {arXiv},
    year = {2022},
    copyright = {Creative Commons Attribution 4.0 International}
}
